
Barbarians at the gate

Interesting that in the world of HFT, relevant latencies are measured in numbers that are orders of magnitude smaller than what “latency” means in the world of the web.

http://queue.acm.org/detail.cfm?id=2536492

Rubber Bands

The world is a dynamic mess of jiggling things.

That pretty much explains most things in life.

Automation

This post about automation drew my attention. It’s well written and tries to address some of the problems with automation and the general “automate all the things” attitude. However, I don’t think the problem is with automation itself. This goes back to the root problem of complex systems that develop emergent properties, resilience engineering, and “black swan” events. The author himself has a great post on this topic.

When automating a repetitive task, the chance for error and, more importantly, the chance of a disproportionately significant impact is very low. When you’re using automation to walk through a complex logic tree, the impact of an error increases considerably. The problems with automating for rare events that involve multiple components are:

  1. Especially as it applies to complex systems, it is very difficult to predict every variation. Inevitably something will be missed.
  2. When your automation doesn’t work as expected, the best-case scenario is that you didn’t handle a particular condition. The worst-case scenario is that you’ve introduced another (significant) problem into the environment which exacerbates the original. The result is often a cascading failure or a domino effect. There are hundreds of examples, with the GitHub outage and the EC2 outage from last year being just two of them. In my personal experience, I’ve seen dozens of cases like this.
  3. I would argue that with time the problem often gets worse. As automation logic evolves and gets more complex, you believe that it’s getting better: you start accounting for edge cases, you learn from experience, and so on. Unfortunately, as your timeline moves forward, the chance of a “black swan” event gets higher and higher. And when it does happen, the impact will be proportionally magnified.

So I think this is the wrong way to frame the problem. Automation is a secondary factor which amplifies existing problems with system complexity. Here are some guidelines for designing around it:

  1. KISS. Can’t say that often enough. Too frequently, architecture discussions start too far down the complexity chain. The desire to do something off the charts on the “wickedly awesome” scale leads down the same path. If your architecture and processes look like this, then you’re going in the wrong direction.
  2. Hire people who understand systemic thinking.
  3. Compartmentalize your application into self-sustaining tiers. If something fails, try to have enough resiliency to continue operating at reduced capacity/functionality.


A couple of relevant articles that are really talking about the same thing:

1. An example from aviation, which has been dealing with complexity and resilience for a long time. The title is very fitting: “Want to build resilience? Kill the Complexity”. Equally applicable in almost every field.

2. Architecture of Robust, Evolvable Networks. That’s an abstract, and the actual paper is here. He talks about the internet as a whole, but smaller networks are often a microcosm of the very same thing.


Securing users

I was reading an edition of PenTest Magazine (attached here for convenience). They’ve had a few decent articles in there, but one was talking specifically about securing your users. That’s an interesting topic. An attack against your company is very likely to come through the “meatware” vector. It’s often much easier than trying to find the latest 0-day or buffer overflow. Of course you have your security policies and user training, but even security pros fall for a well-crafted phishing attack. Temper your expectations of how far you’ll be able to harden and train your user base; you need to be prepared for a breach to come through that vector.

A lot of defenses should be focused on isolating the user population from critical systems, so that when a breach does occur, the impact is limited. Of course, users do need some access in order to perform their jobs, and that’s where it’s critical to focus on granular access controls, specifically RBAC. You also need the capacity to detect and respond to anomalies in user behavior. That’s what will ultimately allow you to contain the threat and limit its impact.


Zabbix Review

Everyone is familiar with Nagios, which is often considered the de facto standard for monitoring. The other tools in that general category are OpenNMS, Zenoss, Groundworks, HyperIQ and others. I am only talking here about tools that qualify in the NMS category: something that really tracks different systems and devices across the entire infrastructure.

A couple of years ago, I was so tired of Nagios that I was ready to try something new. A couple of tools didn’t make the list, simply because of the “freemium” model: the basics are there, but anything more typically carries a hefty price tag.

I decided to try Zabbix and have pretty much been a fan ever since. One caveat: I am talking about version 1.8.x. Version 2.0 just came out and offers a few notable improvements, which I haven’t tried out yet. A couple of things that look very promising are direct JMX support, multi-homed hosts, and mounted filesystem discovery. The full list of changes is here.

As an overview, Zabbix offers the following:

  • Relatively quick & simple install on a variety of platforms
  • Agent-based, with agentless options available
  • A fairly vibrant community
  • A large amount of templates covering most popular software
  • Integrated graphs
  • Escalation management
More specifically:


Graphs


There are a lot of graphing front ends for Nagios. In general, they are bolt-ons of varying quality. Graphs, on the other hand, are probably one of the stronger features of Zabbix. Typically, templates will have a few graphs predefined, but more can be added fairly easily. Any item that’s being collected can also be graphed on demand. The one small drawback is the inability to save graph images on the fly, which is sometimes useful for distribution. A workaround for that is described in this thread.


Graphing performance is decent, if not spectacular. It will largely depend on data volume, your hardware, and the time range.


What I found especially valuable is something Zabbix refers to as “screens”. Generally, the entire point of graphing or visualizing something is to be able to easily identify trends and correlations. Screens allow you to group disparate items together. For example, if you wanted to see the correlation between your requests per second, queries per second, response time, network traffic, and read/write percentage, it’s fairly trivial to put together. Beyond that, I’ve tended to use screens almost as targeted dashboards. Putting all the MySQL-relevant information on the same screen (disk I/O, queries per second, replication lag, CPU/memory, cache hits, etc.) can tell you the health of your MySQL infrastructure almost immediately. The same can be done on the web side and in other areas.


Performance


Performance will vary quite a bit. I’ve run Zabbix on a large instance at EC2, backed by a 4-volume EBS RAID set, and was able to receive 600–800 values/second without much of a problem. However, with that setup, the screens (particularly the ones with a lot of metrics) would load in 2–5 seconds and the lag was noticeable. One key tweak that is absolutely necessary is the polling frequency. Most of the default (and 3rd-party) templates poll far too often. You generally don’t need to check free space every 5 seconds, and there are plenty of examples like this. The data retention period also needs to be adjusted in a lot of cases. Raising those polling intervals to something more reasonable will give a significant performance boost: you reduce the volume of incoming values, and you also reduce the amount of data you store and query against in the database. You likely don’t need precise-to-the-second numbers for every metric you collect going back a year. Historical data is still available, though in a somewhat less detailed form, which is generally sufficient for trend information.

If the data volume gets too large, the cleanup process might start failing. I’ve noticed that at around 150 GB of data it would start having trouble. At that point there aren’t very many good options, and they tend to be quite hairy. It’s best to avoid getting into that situation in the first place.
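To see why interval and retention tuning pays off, here’s a rough back-of-the-envelope sketch (the item counts and intervals are made-up illustrations, and it ignores trends tables and per-row storage overhead):

```python
def history_rows(items: int, interval_s: int, retention_days: int) -> int:
    """Approximate number of history rows a set of items generates."""
    return items * (86400 // interval_s) * retention_days

# 1000 items polled every 5 seconds, history kept for 90 days:
aggressive = history_rows(1000, 5, 90)
# The same items polled every 60 seconds, kept for 30 days:
relaxed = history_rows(1000, 60, 30)

print(aggressive, relaxed, aggressive // relaxed)
# → 1555200000 43200000 36
```

A 36× difference in stored rows, for settings that are usually still fine for trend analysis.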


There are also a couple of options for distributed monitoring, if the performance requirements exceed the capability of a single node. There is a lot of documentation about it on their site, but it generally boils down to a choice between a proxy and a node. I tend to prefer a proxy because of the easier setup and maintenance. As a more specific example, I’d use proxies in an AWS environment spread across different regions. Another good use case in AWS is when you have a mix of a VPC and regular EC2: you’d place your proxy in the VPC. This method allows for significant scaling, though you would still need a very capable central master. The significant benefits of the node approach are that nodes can be queried independently and support a hierarchy, so in an environment with 1000s of devices supporting different applications, nodes are likely the better approach.
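For reference, a proxy needs very little configuration; a minimal sketch of `zabbix_proxy.conf` (hostnames and database name are placeholders) looks something like:

```ini
# Minimal zabbix_proxy.conf sketch -- values below are placeholders
Server=zabbix-master.example.com   # central server the proxy reports to
Hostname=proxy-us-east-1           # must match the proxy name defined on the master
DBName=zabbix_proxy                # proxy buffers collected data in its own local DB
```

The local database is what makes proxies attractive for cross-region setups: if the link to the master drops, the proxy keeps collecting and forwards the backlog when connectivity returns.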


Monitoring


It’s a fairly standard feature set that is generally similar across other NMS systems. A couple of things worth noting:
  • Web Monitoring – it has built-in web transaction monitoring. It’s decent, if not spectacular, and doesn’t really compare against the sophisticated transaction monitoring systems out there. It does support multiple steps, and it’s based on curl, though it doesn’t expose all of curl’s functionality. That will present a problem if you need to do extensive cookie manipulation and/or variables. It’s also useless for heavily AJAXed pages and ones that use Flash. Still, it’s decent for basic monitoring and more than most other systems offer.
  • IPMI support is worth noting, but I’ve personally never used it.
  • Log Monitoring – this isn’t going to work well for high-traffic web logs, but it does a pretty solid job of picking up exceptions and errors in various files. It supports a full regex engine for pattern matching. I’ve had it monitoring files that received ~500 lines per second and it had no issues with that.
  • Templates – this is the core approach to monitoring in Zabbix. All your monitoring definitions are ideally grouped in templates. When a new server/instance shows up, you simply apply the template to it or add it to a group to which the template is assigned. A few templates of varying quality come out of the box, and there are a lot of user-generated templates for a variety of applications. Many of them come with a script (PHP/Perl/Python) that polls the application and sends the data back. Typically you’ll have to make a few tweaks specific to your environment. Some of the ones that I found useful and better than others are:
    • This is the “default” MySQL template for Zabbix, and it’s based on a PHP script. The description says it wasn’t tested on 5.1, but I didn’t notice any issues. There is a range of values that have to be tuned in order to avoid false alerts.
    • If you’re used to the Cacti templates for MySQL and the data they provide, this is a port to Zabbix. If I remember correctly, this template required a few tweaks to the PHP script in order to get it working.
    • This is another decent template for MySQL, but you don’t get InnoDB information out of the box. It is good for monitoring multiple MySQL instances on the same box, though; the other templates would require modifications to their polling scripts.
    • For Haproxy, I’ve used this template. It’s better than others, since it allows you to view and compare statistics of individual servers behind Haproxy. The downside is that it won’t automatically discover changes. That can be scripted, but it might get a little hairy.
    • For Nginx, this is more than sufficient for most needs.
    • Another one that is useful for Nginx, though the site is in Russian. Google Translate does a pretty good job there. There are a few other templates on that site, but I’ve never tried them.
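As an aside, the log monitoring mentioned above is driven by an item key on the agent side; a minimal sketch (path and pattern are placeholders) looks like:

```ini
# Hypothetical item key for a log item -- file path and regex are examples
log[/var/log/myapp/app.log,ERROR|FATAL|Exception]
```

Note that log items have to be configured as active checks: the agent tails the file itself and pushes matching lines to the server, rather than the server polling for them.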
Misc
  • It has an API for automation. I think it was improved in 2.0, but in 1.8 it was already solid. There is a decent CLI tool written in Ruby that interfaces with the API, called zabcon.
  • There isn’t a great way to control alert floods. You can control trigger dependencies, but if something really goes haywire you might be manually clearing SQL tables after that.
  • Alert escalations are a little wonky, but they work reasonably well.
  • It is pretty trivial to port existing Nagios plugins or other scripts into Zabbix.
  • JMX monitoring was done via zapcat. It wasn’t great, but for lack of better options it was the only thing to work with. Version 2.0 does it natively, and if they did it right, that’s probably one of the biggest improvements.
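The API speaks JSON-RPC 2.0 over HTTP, so it’s easy to script against without zabcon. A minimal sketch of building the request bodies (credentials and the auth token are placeholders; actually sending them requires a running frontend at `/api_jsonrpc.php`):

```python
import json

def zbx_request(method, params, auth=None, req_id=1):
    """Build the JSON-RPC 2.0 body the Zabbix API expects. Requests are
    POSTed to http://<frontend>/api_jsonrpc.php as application/json-rpc."""
    body = {"jsonrpc": "2.0", "method": method, "params": params, "id": req_id}
    if auth is not None:
        body["auth"] = auth  # session token returned by user.login
    return json.dumps(body)

# Log in first, then reuse the returned token for subsequent calls.
login = zbx_request("user.login", {"user": "apiuser", "password": "secret"})
hosts = zbx_request("host.get", {"output": "extend"},
                    auth="0424bd59b807674191e7d77572075f33", req_id=2)
```

From there, wiring it to `urllib` or curl is a couple more lines; the token from `user.login` goes into the `auth` field of every later request.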
In summary, from what I’ve seen, Zabbix is easily one of the top NMS systems out there, though it’s probably somewhat less popular than others. If you’re fed up with Nagios or doing a brand new deployment, taking a serious look at Zabbix will be worth your while.

© 2017 Mind End
