Tag: high availability

AWS outage

AWS had another outage last week and they posted their analysis of the event. There were a few things that really jumped out at me. Granted, it’s easy to be a Monday morning quarterback, not to mention hindsight bias or the fact that there are far more complexities in that system than can be understood from outside. Still….

  1. The original issue was caused by a memory leak bug – that type of thing is very difficult to catch. I can see how the bug went unnoticed, but it’s the things that followed that were the real problem.
  2. Failed DNS update – assuming that DNS is the “rug that ties the room together” (as it should be), updates and propagation are absolutely key to the stability of the system. That can and should be monitored. It’s not an emergent property of the system.
  3. Memory exhaustion – In retrospect, it’s easy to say that memory should have been monitored per-process, though that’s not usually done as a standard and tends not to be a baseline monitor that people configure. However, given that they said EBS servers will dynamically consume all available memory, they effectively had no memory monitoring at all, since all the servers would generally report close to maximum consumption at all times. Still, that’s probably something that’s easier to see in hindsight.
  4. The system was not able to find enough healthy servers for failover – to me, that’s one of the biggest failures during this outage, and one that shouldn’t have been that difficult to predict. This is what Kitchen Soap was talking about in their post on automation and something that I’ve posted about as well. What’s especially troubling is that this was essentially the cause of the outage last year. The failover rate should have been throttled automatically, based on the availability of healthy volumes, with some break points. That’s how the thundering herd or cascading effect could have been avoided. They already have throttling in effect for the API calls, so they are clearly aware of the potential problems of this behavior.
  5. API Throttling – that was actually the right idea, with slightly the wrong policy. They were too aggressive with throttling, but it’s a fine line to tread and to get right.
  6. Multi-AZ RDS problem – another very difficult to predict bug in a complicated solution. This will happen in a complex system, but it’s yet another example that you should verify everything and not make assumptions about availability or reliability. If your system is critical, it should not rely on the magic of Multi-AZ RDS. Or at least it needs a contingency plan.
  7. ELB based on EBS – that was news to me. Perhaps it was a fair assumption beforehand, but I don’t recall Amazon explicitly stating that’s the case. Yet another reason to look at HAProxy instead.
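
To make point 4 more concrete, here is a minimal sketch of what "throttle the failover rate based on healthy capacity, with break points" could look like. This is not how EBS works internally; the class name, the thresholds and the notion of a "round" are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailoverThrottle:
    """Caps how many failovers may start per round (hypothetical policy)."""

    # Break point: below this healthy fraction, stop automatic failover entirely.
    min_healthy_fraction: float = 0.5
    # Use at most this share of the healthy targets in a single round.
    max_share_per_round: float = 0.25

    def allowed_failovers(self, healthy_targets: int, total_targets: int,
                          pending: int) -> int:
        """Return how many of the pending failovers may start this round."""
        if total_targets <= 0 or pending <= 0:
            return 0
        if healthy_targets / total_targets < self.min_healthy_fraction:
            # Too few healthy targets left: failing over more volumes would
            # only pile load onto the survivors, so hand off to a human.
            return 0
        budget = int(healthy_targets * self.max_share_per_round)
        return min(pending, budget)
```

With the default (assumed) thresholds, 50 pending failovers against 80 healthy servers out of 100 would be admitted 20 at a time, and automatic failover would stop altogether once fewer than half the servers are healthy. The exact numbers matter less than having a ceiling and a stop condition at all.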

Cloud Load Balancing, Part II

I wrote a post yesterday somewhat criticizing the statement by Radware. There was a very legitimate comment by the author asking about details, which I didn’t provide. This response is in a separate post, because it will get long. With a caveat that my knowledge of their product is minimal, here are the details:

  • “key differences between a shared, cloud load balancer instance – offered by virtually all cloud providers (i.e. Amazon ELB, Rackspace CLB)” – that statement is misleading at best. AFAIK, Amazon didn’t disclose the architecture of their ELBs. It’s quite possible that they are shared, but that can’t be claimed for certain. In any case, it’s irrelevant. What matters is performance and features. ELB features are bare bones, whereas the performance is debatable and any argument in that regard should be test/data driven.
  • “When a load balancer fails, a new one with an identical configuration takes over” – that really depends on the distinction between DR and high availability. Amazon doesn’t provide SLAs for their ELBs, but you could run a single ELB in multiple availability zones, multiple ELBs or ELBs in different regions. In these cases failover would typically be handled by DNS, either by distributing multiple A records initially or by updating DNS based on failure (albeit that depends on your RTO tolerance). There are other more obscure DNS methods as well. If you’re going with an HAProxy approach, then your failover method likely includes a monitoring daemon (for logs, service state, etc.) and kicking off an API call that at a minimum includes DisassociateAddress and AssociateAddress.
  • “a failure induced by any tenant can cause a broader failure impacting multiple tenants, (i.e Amazon ELB failure June-29th, 2012)” – In theory this might be true, but in practice it’s false. In almost any major cloud you’re in a shared environment. Typically another tenant can affect your workload, but if someone’s failure can impact another tenant, then it’s a security breach of gigantic proportions. More importantly: Amazon’s ELB failure on June 29th had nothing to do with shared tenants whatsoever. At least as far as ELB is concerned it’s a false statement. It was a bug in AWS and that can happen with anyone’s offering.
  • “the need to redesign the application due to lack of advanced…functionality” – compared to ELB, that’s true (if you need these features). However, as with most AWS functions, they don’t claim more than what they do. If you use HAProxy, nginx or another load balancer you’ll get all of this functionality and more. If you’re willing to pay the price you could even run a Netscaler.
  • “lack of control over the load balancer performance and capacity” – needs proof with tests/data. Again, with something like HAProxy you have full control, though your performance may be affected.
  • “inability to define custom health monitoring” – with ELB the functionality is limited, though you could hit an HTTP page that executes a custom-written, more sophisticated health check. That does require more work. Again, I might sound like a broken record, but HAProxy and others will load balance whatever you want with very complex checks.
  • “inability to load balance and optimize application delivery across multiple data centers” – this goes back to an old debate about GSLB and all the issues associated with it. However, that is largely true. Balancing across regions can get complicated, but in my experience it’s driven more by the application data model than by load balancing itself.
  • “The ADC’s enterprise features alleviate all the shortcomings of cloud based load balancer” – honestly, I am not even sure what to say here. FUD * Marketing-talk.
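
For the HAProxy failover mentioned above (a monitoring daemon that calls DisassociateAddress and AssociateAddress), a rough sketch might look like the following. It assumes `ec2` behaves like a boto3 EC2 client and that the Elastic IP is addressed by allocation ID; the function names, the `probe` callable and the three-strikes rule are my own placeholders, not anything AWS prescribes.

```python
def move_elastic_ip(ec2, allocation_id, standby_instance_id):
    """Point an Elastic IP at the standby instance; return the association ID.

    `ec2` is assumed to expose describe_addresses, disassociate_address and
    associate_address the way a boto3 EC2 client does.
    """
    addresses = ec2.describe_addresses(AllocationIds=[allocation_id])["Addresses"]
    current = addresses[0] if addresses else {}
    if current.get("InstanceId") == standby_instance_id:
        return current["AssociationId"]  # already on the standby, nothing to do
    if "AssociationId" in current:
        ec2.disassociate_address(AssociationId=current["AssociationId"])
    response = ec2.associate_address(
        AllocationId=allocation_id, InstanceId=standby_instance_id
    )
    return response["AssociationId"]


def check_and_failover(ec2, probe, allocation_id, standby_instance_id, attempts=3):
    """Fail over only if `probe()` fails `attempts` times in a row.

    `probe` is a hypothetical callable returning True when the primary load
    balancer is healthy. A real daemon would sleep between attempts.
    """
    for _ in range(attempts):
        if probe():
            return False  # primary answered, leave the address alone
    move_elastic_ip(ec2, allocation_id, standby_instance_id)
    return True
```

Requiring several consecutive failures before moving the address is a cheap guard against the daemon itself causing an outage on a single dropped probe.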
To be perfectly fair here, I don’t want to come off as a staunch defender of AWS. It has a number of significant shortcomings, some of which I’ve written about before. By no means is it perfect for everyone, and anyone considering deploying a significant presence there (or with any other cloud provider) should do their own research. Having said that, there are enough legitimate criticisms and no need to resort to FUD. I haven’t touched a Radware product in close to a decade, but I do recall that I had good impressions. You could take an Alteon VA and run it in a colo or your datacenter or on a virtualized platform and load balance your cloud presence. That’s a valid approach and may work for a lot of people. However, my guess is that most customers would probably be better off with some detailed technical analysis, performance tests & data and some well-thought-out technical whitepapers and diagrams.
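
As for the custom health monitoring point, the "HTTP page that runs a more sophisticated check" workaround can be quite small. The sketch below uses only the Python standard library; the check names and the port are placeholders I invented, and it relies on the load balancer treating a non-200 response as unhealthy.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def evaluate(checks):
    """Run each named check; return (HTTP status code, JSON body)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a check that blows up counts as failed
    status = 200 if all(results.values()) else 503
    return status, json.dumps(results, sort_keys=True)


# Placeholder checks; real ones would query the database, queue depth, etc.
CHECKS = {
    "database": lambda: True,
    "disk_space": lambda: True,
}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = evaluate(CHECKS)
        payload = body.encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Start the health endpoint (blocks forever); the port is an assumption."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Pointing the load balancer's HTTP health check at this endpoint gives it a single pass/fail signal, while the JSON body tells a human which check actually failed.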

 

Automation

This post about automation drew my attention. It’s well written and tries to address some of the problems with automation and the general attitude of “automate all things”. However, I don’t think the problem is with automation itself. This goes back to the root problem of complex systems that develop emergent properties, resilience engineering and “black swan” events. The author himself has a great post on this topic.

When automating a repetitive task, the chance for error and, more importantly, the chance for a disproportionately significant impact is very low. When you’re using automation to walk through a complex logic tree, the impact of an error increases considerably. The problems with automating for rare events that involve multiple components are:

  1. Especially as it applies to complex systems, it is very difficult to predict every variation. Inevitably something will be missed.
  2. When your automation didn’t work as expected, the best case scenario is that you didn’t handle a particular condition. The worst case scenario is that you’ve introduced another (significant) problem into the environment which exacerbates the original. The result is often a cascading failure or a domino effect. There are hundreds of examples, with the Github outage and EC2 outage from last year being just a few of them. In my personal experience, I’ve seen dozens of cases like this.
  3. I would argue that with time the problem often gets worse. As automation logic evolves and gets more complex, you believe that it’s getting better. You start accounting for edge cases, you learn from experience and so on. Unfortunately, as your timeline moves forward, the chance of a “black swan” event is getting higher and higher. And when it does happen, the impact will be proportionally magnified.

So, I think it’s the wrong way to talk about the problem. Automation is a secondary factor which amplifies existing problems with system complexity. These are some of the guidelines to follow to design around it:

  1. KISS. Can’t say that often enough. Too frequently the architecture discussions start too far down the complexity chain. Desire to do something off the charts on the “wickedly awesome” scale leads down the same path.  If your architecture and processes look like this, then you’re going in the wrong direction.
  2. Hire people who understand systemic thinking.
  3. Compartmentalize your application into self-sustaining tiers. If something fails, try to have enough resiliency to continue operating at reduced capacity/functionality.
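
As a small illustration of the third guideline, here is one way "continue operating at reduced functionality" can look in code. The helper and the page example are hypothetical; the point is only that a failure in a non-critical tier is contained and flagged rather than propagated.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> Tuple[T, bool]:
    """Run `primary`; if it raises, return `fallback()` and flag the result as degraded."""
    try:
        return primary(), False
    except Exception:
        return fallback(), True


def render_product_page(fetch_recommendations: Callable[[], list]) -> dict:
    """Build a page whose core content survives a failing recommendations tier."""
    recommendations, degraded = with_fallback(fetch_recommendations, lambda: [])
    return {
        "product": "core product data",  # assumed to come from the critical tier
        "recommendations": recommendations,
        "degraded": degraded,
    }
```

If the recommendations tier times out, the page still renders with an empty list and `degraded` set, which is also a convenient signal to monitor.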

 

A couple of relevant articles that are really talking about the same thing:

1. An example from aviation, which has been dealing with complexity and resilience for a long time. The title is very fitting: “Want to build resilience? Kill the Complexity”. Equally applicable in almost every field.

2. Architecture of Robust, Evolvable Networks. That’s an abstract and the actual paper is here. He talks about the internet as a whole, but smaller networks are often a microcosm of the very same thing.

 

 

How Complex Systems Fail

A very good article. The author provides 18 points about complex systems and fault tolerance. He talks about complex systems in general, but it translates very well to IT systems. In particular, point #7, that “Post-accident attribution accident to a ‘root cause’ is fundamentally wrong”, is very much true. I’ve engaged in this process more times than I care to remember and nearly every time it leads to fighting yesterday’s war. Complex systems also generate their own emergent properties that are hard, if not impossible, to see, which is a huge contributing factor to massive failures.

http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf