Tuesday, October 20, 2009

large scale ganglia

Ganglia is currently the end-all be-all of open source host metric aggregation and reporting. There are a couple other solutions which are slowly emerging to replace it, but nothing as well entrenched. Hate on RRDs as much as you want but they're basically the defacto standard for storing and reporting numeric metrics. It can be hairy trying to figure out how to configure it all though, so here's an overview of getting it set up on your network.

First install rrdtool and all of ganglia tools on every host. Keep the web interface ('web' in the ganglia tarball) off to the side for now. The first thing you should consider is configuring IGMP on your switch to take advantage of multicast groups. If you don't you run the risk of causing broadcast storms which could potentially wreak havoc on your network equipment, causing ignored packets and other anomalies depending on your traffic.

In order to properly manage a large-scale installation you'll need to juggle some configuration management software. You need to configure gmond on each cluster node so that the cluster multicasts on a unique port for that lan (in fact, keep all of your clusters on unique ports for simplicity's sake). If you have a monitoring or admin host on each lan you'll configure gmond to listen for multicast [or unicast] metrics from one or more clusters on that lan. Then you'll need to configure your gmetad's to collect stats from either each individual cluster node or the "collector" gmond nodes on the monitoring/admin hosts. Our network topology is as follows:
Monitor box ->
DC1 ->
DC1.LAN1 ->
Cluster 1
Cluster 2
DC1.LAN2 ->
Cluster 3
DC1.LAN3 ->
Cluster 4
DC2 ->
DC2.LAN1 ->
Cluster 1
Cluster 2
Cluster 3
Cluster 4
DC2.LAN2 ->
Cluster 1

The monitor box runs gmetad and only has two data_source entires: DC1 and DC2.
DC1 and DC2 run gmetad and each has a data_source for each LAN.
Each LAN has its own monitor host running gmond which collects metrics for all clusters on their respective LAN.
The clusters are simply multicasting gmond's configured with a specific cluster name and multicast port running on cluster nodes.
The main monitor box, the DC boxes and the LAN boxes all run apache2+php5 with the same docroot (the ganglia web interface). The configs are set to load gmetad data from localhost on one port.
Each gmetad has its "authority" pointed at its own web server URL.

(Tip: in theory you could run all of that off a single host by making sure all the gmetad's use unique ports and modifying the web interface code to load the config settings based on the requesting URL, to change the gmetad port as necessary)

In the end what you get is a main page which only shows the different DCs as grids. As you click one it loads a new page which shows that DC's LANs as grids. Clicking those will show summaries of that LAN's clusters. This allows you to lay out your clusters across your infrastructure in a well-balanced topology and gives you the benefit of some additional redundancy if one LAN or DC's WAN link goes down.

We used to use a single gmetad host and web interface for all clusters. This makes it extremely easy to query any given piece of data at once from a single gmetad instance and see all the clusters on one web page. The problem was we had too much data. Gmetad could not keep up with the load and the box was crushed by disk IO. We lessened this by moving to a tmpfs mount, archiving RRDs and pruning any older than 60 days. Spreading out to multiple hosts also lessened the need for additional RAM and lowered network use and latency across long-distance links.

If you think you won't care about your lost historical data, think again. Always archive your RRDs to keep fine-grained details for later planning. Also keep in mind that as your clusters change your RRD directory can fill up with clutter. Hosts which used to be in one cluster and are now in another are not cleaned up by ganglia; they will sit on the disk unused. Only take recently-modified files into account when determining the total size of your RRDs. Also, for big clusters make your poll time a little longer to prevent load from ramping up too often.

I'll add config examples later. There are many guides out there that show how to set up the fine details. As far as I can tell this is the simplest way to lay out multiple grids and lessen the load on an overtaxed gmetad host.

2 comments:

  1. hi,
    Thanks for a good technical blog.
    I am a graduate student and am trying to document possible configurations for scaling a monitoring sub-system of a large datacenter. I was wondering if you could tell me how big are your data centers (in terms number of monitored nodes); number of gmetads. Also if you can remember what was the number of nodes when your previous configuration with single gmetad started to crash.

    thanks,
    -upendra

    ReplyDelete
  2. I don't remember how many hosts caused gmetad to start crashing, but it was probably in the single-digit thousands. For each gmetad process, having enough RAM available both for the gmetad process and to store RRDs in tmpfs was critical. Each datacenter had a couple thousand hosts.

    ReplyDelete