CERN & Puppet

A presentation from CERN during the PuppetConf. Some very interesting items in there:

  • I was somewhat surprised at how much diversity they have. I thought they ran what effectively is a grid compute network with identical nodes.
  • They operate at huge scale. It requires a completely different way of thinking about power and data and resource management.
  • “Evaluate solutions, identify functional gaps and challenge them” – a very succinct way to describe a core IT function.
  • I like the analogy of thinking of your machines as pets and cattle. You care for your pets, but you shoot your cattle if something is wrong. Your infrastructure should be made out of “cattle”.
  • Their tool chain (puppet/foreman/openstack/mcollective/bamboo/git) is accessible to anyone and they understand the value of active community.

The overview is here. If you want to skip the CERN background, the technical part of the talk starts at ~11:00 minutes. This follow up talk gets into more technical detail of their puppet use.


TripAdvisor’s architecture

A long and a very informative post about what TripAdvisor found out when they tested out AWS for their infrastructure. There are a lot of interesting tidbits in there, some of which are hard to analyze without seeing precise numbers. What I find interesting is that they essentially ported their existing datacenter setup to AWS. Granted, their stated goals were to really look at a cost/performance and not change the operational model. However, in my experience with AWS, simply reusing your datacenter architecture isn’t sufficient and will likely lead to a lot of disappointments. There are couple of things that stood out that would have likely improved their experience:

  • “Cloudwatch/monitoring was sufficient” – that was said with a caveat that it was enough for scaling decisions and detailed monitoring would be more helpful. I would disagree there. Even in their results they didn’t have enough visibility to figure out what was wrong with GC, so they couldn’t see inside the JVM. As far as scaling decisions go, it depends on the complexity of the application and underlying architecture. If you can make the decision simply based on the CPU load of a given instance, then CloudWatch is great. However, in a lot of cases you need far more detail to understand which tier to scale and if that’s even going to help. Also, depending on the availability tolerance of your application, 5 minute intervals might not be good enough.
  • Log collection – that seems to be done in pretty antiquated way and clearly it’s not real time and heavily dependent on local instance storage. Something like Graylog2/Logstack or Flume/Hadoop is far better.
  • Configuration management – they use a custom in-house solution with a naming database. That is usually very difficult to change for historical reasons, but something along the lines of puppet/chef/salt will give better results. The process is somewhat reversed too with an instance responsible for figuring out what it needs to be, though it’s arguable which is the better approach.
  •  Use of ELB – ELB is relatively cheap and pretty fast. Using something like HAProxy would give them far more granularity, visibility and better balancing overall.

In any case, it’s a worthy read if you’re considering AWS.



This post about automation drew my attention. It’s well written and tries to address some of the problems with automation and the general attitude with “automate all things”. However, I don’t think the problem is with automation itself. This goes back to the root problem of complex systems that develop emergent properties, resilience engineering and “black swan” events.  The author himself has a great post on the this topic.

When automating a repetitive task, the chance for error and more imporantly the chance for a disproportionately significant impact is very low. When you’re using automation to walk through a complex tree logic, the impact of an error increases considerably. The problem with automating for rare events that include multiple components are:

  1. Especially as it applies to complex systems, it is very difficult to predict every variation. Inevitably something will be missed.
  2. When your automation didn’t work as expected, the best case scenario is that you didn’t handle a particular condition. Worst case senario is that you’ve introduced another (significant) problem into the environment which exascerbates the original. The result is often a cascading failure or a domino effect. There are hundreds of examples with the Github outage and EC2 outage from last year being just a few of them. In my personal experience, I’ve seen dozens of cases like this.
  3. I would argue that with time the problem often gets worse. As automation logic evolves and gets more complex, you believe that it’s getting better. You start accounting for edge cases, you learn from experience and so on.  Unfortunately, as your timeline moves forward, the chance of a “black swan” event is getting higher and higher. And when it does happen, the imact will be proportionally magnified.

So, I think it’s the wrong way to talk about the problem. Automation is a secondary factor which amplifies existing problems with system complexity. These are some of the guidelines to follow to design around it:

  1. KISS. Can’t say that often enough. Too frequently the architecture discussions start too far down the complexity chain. Desire to do something off the charts on the “wickedly awesome” scale leads down the same path.  If your architecture and processes look like this, then you’re going in the wrong direction.
  2. Hire people who understand systemic thinking.
  3. Compartmenalize your application into self-sustaining tiers. If something fails, try to have enough resiliency to continue operating at reduced capacity/functionality.


A couple of relevant articles that are really talking about the same thing:

1. An example from aviation, which has been dealing with complexity and resilience for a long time. The title is very fitting: “Want to build resilience? Kill the Complexity”. Equally applicable in almost every field.

2. Architecture of Robust, Evolvable Networks. That’s an abstract and the actual paper is here. He talks about internet as a whole, but smaller networks are often a microcosm of the very same thing.



Cloud & Architectures

This is a great post about how the world of IT is changing. I would somewhat disagree in one area though. I don’t think it’s so black and white between datacenter and “cloud” approaches. Even if you’re running your own metal, it’s still moving in the direction of the cloud. Most organizations, even with their own DCs will still run a hypervisor of some sort and manage their IT infrastructure dynamically via APIs. The question is really going to be whether you’re going to run on a private or a public cloud or a hybrid version.



© 2017 Mind End

Theme by Anders NorenUp ↑