A very good article. The author provides 18 points about complex systems and fault tolerance. He talks about complex systems in general, but it translates very well to IT systems. Point #7 in particular, “Post-accident attribution to a ‘root cause’ is fundamentally wrong,” is very much true. I’ve engaged in this process more times than I care to remember, and nearly every time it leads to fighting yesterday’s war. Complex systems also generate their own emergent properties that are hard, if not impossible, to see, which is a huge contributing factor in massive failures.

http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf