Interesting to imagine that in the world of HFTA, latency relevance is measured in numbers that are orders of magnitude different from when “latency” means in the world of web.
Everyone is familiar with Nagios, which is often considered the de-facto standard for monitoring. The other tools in that general category are OpenNMS, Zenoss, Groundworks, HyperIQ and others. I am only talking here about tools that would qualify in the NMS category: something that really tracks different systems and devices across the entire infrastructure.
A couple of years ago, I was so tired of Nagios that I was ready to try something new. A couple of tools didn’t make the list, simply because of the “fremium” model. The basics are there, but anything more typically carries a hefty price tag.
I decided to try Zabbix and I have pretty much been a fan ever since. One caveat here, is that I am talking about version 1.8.x. Version 2.0 just came out and offers a few notable improvements, which I haven’t tried out yet. A couple of things that look very promising are: Direct JXM support, multi-homed hosts, and mounted filesystem discovery. Full list of changes is here
As an overview, Zabbix offers the following:
- Relatively quick & simple install on a variety of platforms
- Agent-based, but available agentless options.
- A fairly vibrant community
- A large amount of templates covering most popular software
- Integrated graphs
- Escalation management
- Web Monitoring – it has a built in web transaction monitoring. It’s decent if not spectacular and doesn’t really compare against sophisticated transaction monitoring systems that are out there. It does support multiple steps and it’s based on curl, though it doesn’t expose all of curl’s functionality. That will present a problem if you need to do extensive cookie manipulation and/or variables. It’s also useless for heavily AJAXed pages and the ones that use flash. Still, it’s decent for basic monitoring and more then most other systems offer.
- IMPI support is worth noting, but I’ve personally never used it.
- Log Monitoring – this isn’t going to work well for high traffic web logs, but it does a pretty solid job at picking up exceptions and errors in various files. It does support a full regex engine for pattern matching. I’ve had it monitoring files that received ~500 lines per second and it had no issues with that.
- Templates – this is the core approach to monitoring in Zabbix. All your monitoring definitions are ideally grouped in templates. When a new server/instance shows up, you simply apply the template to it or add it to a group to which this template is assigned. There are a few templates that come out of the box of varying quality and there are a lot of user-generated templates for a variety of applications. A lot of them will have a script (PHP/Perl/Python) that polls the application and sends the data back. Typically you’ll have to make a few tweaks that are specific to your environment. Some of the ones that I found useful and better then others are:
- This is the “default” MySQL template for Zabbix and it’s based on a PHP script. The description says it wasn’t tested on 5.1, but I didn’t seem to notice any issues. There are range of values that have to be tuned in order to avoid false alerts.
- If you’re used to the Cacti templates for MySQL and the data those provide, this is a port to Zabbix. If I remember correctly, this template required a few tweaks to the PHP script, in order to get it working.
- This is another decent template for MySQL, but you don’t get InnoDB information out of the box. It is good for monitoring multiple MySQL instances on the same box though. The other templates would require modifications in their polling scripts.
- For Haproxy, I’ve used this template. It’s better than others, since it allows you to look and compare statistics of individual servers behind Haproxy. The downside is that it won’t automatically discover changes. That can be scripted, but it might get a little hairy.
- For Nginx, this is more than sufficient for most needs.
- Another one that is useful for Nginx, though the site is in Russian. Google translate does a pretty good job there. There are a few other templates on that site, but I’ve never tried them.
- It does have an API for automation. I think it was improved in 2.0, but in 1.8 it was already solid. There is a decent CLI tool written in Ruby that will interface with the API, called zabcon
- There isn’t a great way to control alert floods. You can control trigger dependencies, but if something really goes haywire you might be manually clearing SQL tables after that.
- Alert escalations are a little wonky, but they work reasonably well.
- It is pretty trivial to port existing Nagios plugins or other scripts into Zabbix.
- JMX monitoring was done via zapcat. It wasn’t great, but for the lack of better options this was the only thing to work with. Version 2.0 does it natively and if they did it right, that’s probably one of the biggest improvements.