Category: IT

Skills gap in IT security

The article tries to explain why companies have trouble hiring security pros. Some good items in there, but I think it misses the larger point. Too many companies simply don’t understand what they need and treat security as a check box that they mark off on some form. They believe that “security” consists of creation of myriads of policies, procedures and documents for every eventuality. Doubtless, that’s a part of it, but it has to start with evaluating risks, threats and having a proper mindset.

This reminds me of a security position that I once interviewed for. One interviewer really wanted to know the specific number of Active Directory Organizational Units (OUs) I have worked with. That is akin to asking a prospective sysadmin how many files he has worked with. The number is arbitrary and absolutely irrelevant to underlying complexity, nesting, policies, etc.  At the time, they told me that they’ve been trying to fill the position for more than 6 months…..Somehow that wasn’t surprising to me…

Technology divide

A couple of articles surfaced in the last few days, which describe the tech operations of Obama and Romney campaigns. It’s hard to put a number on the advantage that Obama had on the tech side, but considering the relatively small margin of victory, it was probably not insignificant.

Narwal (Obama) –

Orca (Romney) –


I think this is a great example of the advantage that tech can bring to any business, politics or anything else. It’s often hard to quantify in hard numbers and can be especially difficult for people in positions that are further removed from technology, but it works. And not only they had a “devops guy”, I have no doubt they ran their IT aligned with devops principles. If it was good enough for Obama’s campaign, then it’s probably good enough for others as well.




AWS outage

AWS had another outage last week and they posted their analysis of the event. There were a few things that really jumped out at me. Granted, it’s easy to be a Monday morning quarterback, not to mention hindsight bias or the fact that there are far more complexities in that system that can be understood from outside. Still….

  1. The original issue was caused by a memory leak bug – that type of a thing is very difficult to catch. I can see how the bug went unnoticed, but it’s the things that followed that were the real problem.
  2. Failed DNS update – assuming that DNS is the “rug that ties the room together” (as it should be), updates and propagation is absolutely key to the stability of the system. That can and should be monitored. It’s not an emergent property of the system.
  3. Memory exhaustion – In retrospect, it’s easy to say that memory should have been monitored per-process. Though that’s not usually done as a standard and generally tends to not be a baseline monitor that people configure. However, given the fact that they said that EBS servers will dynamically consume all available memory, then effectively they had no memory monitoring at all, since all the servers would generally report close to maximum consumption at all time. Still, that’s probably something that’s easier to see in hindsight.
  4. The system was not able to find enough healthy servers for failover – to me, that’s one of the biggest failures during this outage, which should haven’t been that difficult to predict. This is what Kitchen Soap was talking about in their post on automation and something that I’ve posted about as well. What’s especially troubling, is that this was essentially the cause of the outage last year. The failover rate should have been throttled automatically, based on the availability of healthy volumes; with some break points. That’s how the thundering herd or cascading effect could have been avoided. They already have throttling in effect for the API calls, so they are clearly aware of the potential problems of this behavior.
  5. API Throttling – that was actually the right idea, with slightly the wrong policy. They were too aggressive with throttling, but it’s a fine line to tread and to get correctly.
  6. Multi-AZ RDS problem – another very difficult to predict bug in a complicated solution. This will happen in a complex system, but it’s yet another example that you should verify everything and not make assumptions about availability or reliability. If your system is critical, it should not rely on the magic of Multi-AZ RDS. Or at least it needs a contingency plan.
  7. ELB based on EBS – that was news to me. Perhaps it was a fair assumption beforehand, but I don’t recall Amazon explicitly stating that’s the case. Yet another reason to look at HAProxy instead.

On Being a Senior Engineer

A very apt take on the profession, particularly the cultural component. Below is a great quote as well. The “why” piece is absolutely critical to success, regardless of how menial the task might seem.

” These engineers spend the time to make sure that more junior or new engineers unfamiliar with the tech or processes we have not only understand what they are doing, but also why they are doing it