Everyone is familiar with Nagios, which is often considered the de-facto standard for monitoring. The other tools in that general category are OpenNMS, Zenoss, Groundworks, HyperIQ and others. I am only talking here about tools that would qualify in the NMS category: something that really tracks different systems and devices across the entire infrastructure.

A couple of years ago, I was so tired of Nagios that I was ready to try something new. A couple of tools didn’t make the list, simply because of the “fremium” model. The basics are there, but anything more typically carries a hefty price tag.

I decided to try Zabbix and I have pretty much been a fan ever since. One caveat here, is that I am talking about version 1.8.x. Version 2.0 just came out and offers a few notable improvements, which I haven’t tried out yet. A couple of things that look very promising are: Direct JXM support, multi-homed hosts, and mounted filesystem discovery. Full list of changes is here

As an overview, Zabbix offers the following:

  • Relatively quick & simple install on a variety of platforms
  • Agent-based, but available agentless options.
  • A fairly vibrant community
  • A large amount of templates covering most popular software
  • Integrated graphs
  • Escalation management
More specifically:

 

Graphs

 

There are a lot of graphic front ends for Nagios. In general, they are bolt-ons of varying quality. On the other hand, graphs are probably one of the stronger features of Zabbix. Typically, templates will have a few graphs predefined, but more can be added fairly easily. Any item that’s being collected can also be graphed on-demand. The one small drawback is the inability to save pics on the fly, which is sometimes useful for distribution. A workaround for that is described in this thread.

 

Graphing performance is decent if not spectacular. That will largely depend of data volume, your hardware and range of time.

 

What I found especially valuable is something zabbix refers to as “screens“.  Generally, the entire point of graphing or visualizing something is to be able to easily identify trends and correlations. “Screens” allow you to group disparate items together. For example, if you wanted to see the correlation between your requests per second, queries per second, response time, network traffic and read/write percentage, it’s fairly trivial to put it together. Besides that, I’ve tended to use screens almost as targeted dashboards. Something like putting all the MySQL relevant information on the same screen (disk IO, queries per second, replication lag, cpu/mem, cache hits, etc) can let you know the health of your MySQL infrastructure almost immediately. Same can be done on the web side and other areas.

 

Performance

 

Performance will vary quite a bit. I’ve ran Zabbix on a large instance at EC2, backed by a 4-volume EBS RAID set and was able to receive 600-800 values/second without much of a problem. However, with that setup, the screens (particularly the ones with with a lot of metrics) would load in 2-5 seconds and the lag was noticeable. One key tweak that is absolutely necessary is the polling frequency. Most of the default (and 3rd party) templates will have the polling frequency too high. You generally don’t need to poll for free space every 5 seconds and there are plenty of examples like this. The data retention period also needs to be adjusted in a lot of cases. Reducing those intervals to something more reasonable is going to give a significant performance boost. It will behave better because you’ll reduce the volume of incoming values, but it will also reduce the amount of data you store and query against in the database. You likely don’t need precise-to-the-second numbers for every metric you collect going back a year. Historical data is still available, though in a somewhat less detailed form, which is generally sufficient for trend information.
If the data volume gets too large, the clean up process might start failing. I’ve noticed that around 150GB of data it would start having trouble. At that point there aren’t very many good options and they tend to be quite hairy. It’s best to avoid getting into the situation in the fist place.

 

There are also a couple of options for distributed monitoring, if the performance requirements exceed the capability of a single node. There is a lot of documentation about it on their site, but it generally boils down to a choice between proxy or a node. I tend to prefer a proxy because of easier setup and maintenance. In a more specific example, I’d use proxies in an AWS environment which was spread across different regions. Another good use case in AWS is if you have a mix of a VPC and regular EC2 and you’d place your proxy in the VPC. This method can allow for significant scaling capabilities, though you would still need a very capable central master. The one significant benefit to a node approach is that they can be queried independently and support a hierarchical approach. However, in an environment with 1000s of devices that support different applications, nodes are likely a better approach.

 

Monitoring

 

It’s a fairly standard feature set that is generally similar across other NMS systems. A couple of things worth noting:
  • Web Monitoring – it has a built in web transaction monitoring. It’s decent if not spectacular and doesn’t really compare against sophisticated transaction monitoring systems that are out there. It does support multiple steps and it’s based on curl, though it doesn’t expose all of curl’s functionality. That will present a problem if you need to do extensive cookie manipulation and/or variables. It’s also useless for heavily AJAXed pages and the ones that use flash. Still, it’s decent for basic monitoring and more then most other systems offer.
  • IMPI support is worth noting, but I’ve personally never used it.
  • Log Monitoring – this isn’t going to work well for high traffic web logs, but it does a pretty solid job at picking up exceptions and errors in various files. It does support a full regex engine for pattern matching.  I’ve had it monitoring files that received ~500 lines per second and it had no issues with that.
  • Templates – this is the core approach to monitoring in Zabbix. All your monitoring definitions are ideally grouped in templates. When a new server/instance shows up, you simply apply the template to it or add it to a group to which this template is assigned. There are a few templates that come out of the box of varying quality and there are a lot of user-generated templates for a variety of applications. A lot of them will have a script (PHP/Perl/Python) that polls the application and sends the data back. Typically you’ll have to make a few tweaks that are specific to your environment. Some of the ones that I found useful and better then others are:
    • This is the “default” MySQL template for Zabbix and it’s based on a PHP script.  The description says it wasn’t tested on 5.1, but I didn’t seem to notice any issues.  There are range of values that have to be tuned in order to avoid false alerts.
    • If you’re used to the Cacti templates for MySQL and the data those provide, this is a port to Zabbix. If I remember correctly, this template required a few tweaks to the PHP script, in order to get it working.
    • This is another decent template for MySQL, but you don’t get InnoDB information out of the box. It is good for monitoring multiple MySQL instances on the same box though. The other templates would require modifications in their polling scripts.
    • For Haproxy, I’ve used this template. It’s better than others, since it allows you to look and compare statistics of individual servers behind Haproxy. The downside is that it won’t automatically discover changes. That can be scripted, but it might get a little hairy.
    • For Nginx, this is more than sufficient for most needs.
    • Another one that is useful for Nginx, though the site is in Russian. Google translate does a pretty good job there.  There are a few other templates on that site, but I’ve never tried them.
Misc
  • It does have an API for automation. I think it was improved in 2.0, but in 1.8 it was already solid. There is a decent CLI tool written in Ruby that will interface with the API, called zabcon
  • There isn’t a great way to control alert floods. You can control trigger dependencies, but if something really goes haywire you might be manually clearing SQL tables after that.
  • Alert escalations are a little wonky, but they work reasonably well.
  • It is pretty trivial to port existing Nagios plugins or other scripts into Zabbix.
  • JMX monitoring was done via zapcat. It wasn’t great, but for the lack of better options this was the only thing to work with. Version 2.0 does it natively and if they did it right, that’s probably one of the biggest improvements.
In summary, from what I’ve seen, Zabbix is easily one of the top NMS systems out there, though it’s probably somewhat less popular than others. If you’re fed up with Nagios or doing a brand new deployment, taking a serious look at Zabbix will be worth your while.