After my June 1st Tuesday Reading “Drilling for Certainty” – which made the point that our world has become extremely technologically complex and that the possibility for catastrophe is embedded in the fabric of day-to-day life – I received several emails making the same or similar points.
One, today’s reading, “The Real Cause of BP’s Oil Spill,” makes a very important new observation: “The company’s systems were designed to ‘mostly work’ instead of ‘never fail’.” Kenneth Brill, founder of the Uptime Institute, begins this article by saying: “As an engineer, I was trained to design things to work. Working implies tolerating occasional failures.” And this is essentially how everything we build -- including much of our IT environment -- is built. Designs often don’t take into account rare events. Parts do wear out unexpectedly; components fail; testing believed to be complete isn’t, not necessarily because of incompetence but rather because of the underlying complexity of the system. And the list goes on.
Yet we as individuals, organizations, and governments have surely made the transition from understanding and accepting “rarely failing” to expecting “never failing.” Never failing in a very complex system, especially when humans are an integral part of the system, is extremely challenging, and in many situations simply not possible. Further, in point of fact, as Brill makes clear, very few issues ever deserve such a “never failing” approach.
Brill concludes that those issues where “never fail” is rightfully expected should follow five reliability principles:
1. Multiple things must line up before failure can occur. I.e., there must be no single points of failure.
2. Decision making responsibilities are not assigned to individuals without appropriate experience. (Inexperience, which he calls "junior executive error," is the most frequent root cause.)
3. Configuration changes are controlled very carefully.
4. Accept the fact that complex systems do often interact in unexpected, unintended ways.
5. Be extremely careful as the end of a long project nears. In the joy of a successful finish, the guard is often lowered.
And I would add that while we don’t typically have projects and systems where the consequences of failure would approach those of this oil spill, we do have many systems that our clients expect to “never fail.”
So, during the coming week, think about your clients’ expectations of the systems you are responsible for, and begin to reset those that are not reasonable.
. . . . . jim