Back to blogging….

Looks like it has been almost 4 years since my last post. Lots of things have happened since then… Probably time to start writing again.

Hopefully I’ve percolated enough interesting content in that time… more to follow.

Automation

This post about automation drew my attention. It’s well written and tries to address some of the problems with automation and the general “automate all the things” attitude. However, I don’t think the problem is with automation itself. This goes back to the root problem of complex systems that develop emergent properties, resilience engineering and “black swan” events. The author himself has a great post on this topic.

When automating a repetitive task, the chance of error and, more importantly, the chance of a disproportionately significant impact are very low. When you’re using automation to walk through a complex logic tree, the impact of an error increases considerably. The problems with automating for rare events that involve multiple components are:

  1. Especially as it applies to complex systems, it is very difficult to predict every variation. Inevitably something will be missed.
  2. When your automation doesn’t work as expected, the best-case scenario is that you didn’t handle a particular condition. The worst-case scenario is that you’ve introduced another (significant) problem into the environment which exacerbates the original one. The result is often a cascading failure or a domino effect. There are hundreds of examples, with the GitHub outage and the EC2 outage from last year being just two of them. In my personal experience, I’ve seen dozens of cases like this.
  3. I would argue that with time the problem often gets worse. As automation logic evolves and gets more complex, you believe that it’s getting better. You start accounting for edge cases, you learn from experience and so on. Unfortunately, as your timeline moves forward, the chance of a “black swan” event gets higher and higher. And when it does happen, the impact will be proportionally magnified.

So, I think blaming automation is the wrong way to talk about the problem. Automation is a secondary factor which amplifies existing problems with system complexity. These are some of the guidelines to follow to design around it:

  1. KISS. Can’t say that often enough. Too frequently the architecture discussions start too far down the complexity chain. The desire to do something off the charts on the “wickedly awesome” scale leads down the same path. If your architecture and processes look like this, then you’re going in the wrong direction.
  2. Hire people who understand systemic thinking.
  3. Compartmentalize your application into self-sustaining tiers. If something fails, try to have enough resiliency to continue operating at reduced capacity/functionality.
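
To make the third guideline a bit more concrete, here is a minimal sketch in Python of what “continue operating at reduced functionality” could look like. Everything in it is made up for illustration: the recommendation tier, the function names and the `POPULAR_ITEMS` fallback are assumptions, not something taken from a real system.

```python
from typing import Callable, List

# Hypothetical static fallback used when the recommendation tier is down.
POPULAR_ITEMS = ["item-1", "item-2"]


def fetch_with_fallback(primary: Callable[[], List[str]],
                        fallback: List[str]) -> dict:
    """Call one tier; if it fails, return a reduced result instead of
    letting the failure propagate to the rest of the application."""
    try:
        return {"items": primary(), "degraded": False}
    except Exception:
        # The failing tier is contained here. The caller still gets a
        # usable (if less interesting) answer and can see it is degraded.
        return {"items": list(fallback), "degraded": True}


def healthy_recommendations() -> List[str]:
    # Stand-in for a working downstream tier (hypothetical).
    return ["item-42", "item-7"]


def broken_recommendations() -> List[str]:
    # Stand-in for a downstream tier that is currently failing.
    raise ConnectionError("recommendation tier unavailable")


if __name__ == "__main__":
    print(fetch_with_fallback(healthy_recommendations, POPULAR_ITEMS))
    print(fetch_with_fallback(broken_recommendations, POPULAR_ITEMS))
```

Note that the fallback is deliberately dumb: a static list, no retries, no clever recovery logic. The more the failure path tries to do, the more it becomes another piece of complex automation that can fail in its own way.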

A couple of relevant articles that are really talking about the same thing:

1. An example from aviation, which has been dealing with complexity and resilience for a long time. The title is very fitting: “Want to build resilience? Kill the Complexity”. Equally applicable in almost every field.

2. Architecture of Robust, Evolvable Networks. That’s an abstract and the actual paper is here. The author talks about the internet as a whole, but smaller networks are often a microcosm of the very same thing.

Securing users

I was reading an edition of PenTest Magazine (attached here for convenience). They’ve had a few decent articles in there, but one was talking specifically about securing your users. That’s an interesting topic. An attack against your company is very likely to come through the “meatware” vector. It’s often much easier than trying to find the latest 0-day or buffer overflow. Of course you have your security policies and user training, but even the security pros fall for a well-crafted phishing attack. Keep your expectations limited about how far you’ll be able to harden and train your user base. You need to be prepared for a breach to come from that direction.

A lot of defenses should be focused on isolating the user population from critical systems, so that when a breach does occur, the impact is limited. Of course users do need some access in order to perform their jobs, and that’s where it’s critical to focus on granular access controls, specifically RBAC. You also need to have the capacity to detect and respond to any anomalies in user behavior. That’s what will ultimately allow you to contain the threat and limit its impact.
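
As a rough illustration of those two ideas together, here is a small Python sketch of a deny-by-default RBAC check that also counts denied requests per user. The roles, users, permission names and the threshold of 3 are all invented for the example, and counting denials is only a crude stand-in for real anomaly detection.

```python
from collections import Counter

# Hypothetical role-to-permission mapping; anything not listed is denied.
ROLE_PERMISSIONS = {
    "support": {"tickets:read", "tickets:write"},
    "finance": {"invoices:read"},
    "dba": {"db:read", "db:write"},
}

# Hypothetical user-to-role assignments.
USER_ROLES = {
    "alice": {"support"},
    "bob": {"finance"},
}

# Denied requests per user, kept as a very simple anomaly signal.
denied_attempts = Counter()


def is_allowed(user: str, permission: str) -> bool:
    """Return True only if one of the user's roles grants the permission."""
    roles = USER_ROLES.get(user, set())
    allowed = any(permission in ROLE_PERMISSIONS.get(role, set())
                  for role in roles)
    if not allowed:
        denied_attempts[user] += 1
    return allowed


def suspicious_users(threshold: int = 3) -> list:
    """Users with at least `threshold` denied requests so far."""
    return sorted(u for u, n in denied_attempts.items() if n >= threshold)


if __name__ == "__main__":
    print(is_allowed("alice", "tickets:read"))  # within her role
    for _ in range(3):
        is_allowed("alice", "db:write")         # outside her role, denied
    print(suspicious_users())
```

A compromised support account that suddenly starts probing the database shows up here as a burst of denials, which is the kind of signal you want to act on before the attacker finds something that is allowed.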