Managing complexity

A few weeks ago Joel warned you that there would be occasional guest posts; I am the first volunteer. The brief bio on beta.tech.lds.org should provide you with some understanding of my experience and biases. In this post, I leverage those experiences and biases to offer some observations about complexity.

One of the attributes I failed to develop during my academic training was a proper appreciation for the perils of complexity. I recall creating code that was fast, efficient, and utterly un-maintainable. In the academic context this seemed fine, since I was typically the only one maintaining the code, it was rarely used beyond the end of the class, and system failure affected only my grade. However, over the past 15 years my professional experience has changed my perspective and caused me to value simple, understandable, and maintainable solutions over those which lack the foregoing traits yet are fast, efficient, and theoretically “robust.” I offer the following observations relative to the impact of complexity on the reliability, maintainability, and scalability of the systems we create.

  • Humans cause failures. In my experience, human failure is a more likely cause of downtime than the failure of a physical component. However, we often try to increase system availability by adding redundancy to mitigate component failure. This redundancy has the collateral impact of adding complexity, which unfortunately increases the likelihood of human failure. I’m not sure we always end up net positive on the availability scale.

  • Control planes, which manage highly redundant environments, are often themselves single points of failure. The likelihood of control plane failure generally increases as the component redundancy becomes more complex.

  • Failure modes are difficult to predict and detect. As a result, sometimes secondary components go unused during primary component failure.

  • With redundant systems, we assume that the joint probability of multiple independent failures is small. Unfortunately, the assumption of independence is often incorrect.

  • Complex systems are difficult to scale. RFC 3439 quotes Mike O’Dell, the former Chief Architect of UUNET, as saying, “Complexity is the primary mechanism which impedes efficient scaling, and as a result is the primary driver of increases in both capital expenditures and operational expenditures.”

  • And finally, an observation by Willinger and Doyle: “…we point out a very typical, but in the long term potentially quite dangerous engineering approach to dealing with network-internal and -external changes, namely responding to demands for improved performance, better throughput, or more robustness to unexpectedly emerging fragilities with increasingly complex designs or untested short-term solutions. While any increase in complexity has a natural tendency to create further and potentially more disastrous sensitivities, this observation is especially relevant in the Internet context, where the likelihood for ‘unforeseen feature interactions’ in the ensuing highly engineered large-scale structure drastically increases as the network continues to evolve. The result is a complexity/robustness spiral [i.e., robust design → complexity → new fragility → make design more robust → …] that—without reliance on a solid and visionary architecture—can easily and quickly get out of control.”
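
The independence assumption above is worth making concrete. Here is a minimal, hypothetical back-of-the-envelope sketch (the failure rates and the shared-cause fraction are invented for illustration): a single correlated cause, such as a bad config push hitting both replicas, can erase most of the benefit the redundancy math promised.

```python
# Hypothetical numbers: two redundant components, each unavailable 1% of the time.
p_fail = 0.01

# If failures were truly independent, both being down at once would be rare:
independent_joint = p_fail * p_fail          # one chance in ten thousand

# Now suppose a shared cause (common power feed, bad config push) accounts
# for half of each component's failures, taking both down together:
shared_fraction = 0.5
correlated_joint = shared_fraction * p_fail  # fifty times worse

print(f"independent joint failure: {independent_joint:.4f}")
print(f"correlated joint failure:  {correlated_joint:.4f}")
```

The point is not the particular numbers but the shape of the error: multiplying small probabilities is only valid when the failures truly have nothing in common.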


What can be done to help manage complexity or mitigate its impact? I doubt there is a silver bullet, but the following concepts have been helpful to me.

  • Use a crutch to force yourself to remember what is important. My crutch was a note hanging on the side of my monitor to remind me that supportability, maintainability, and reliability were more important than performance and efficiency. Not that the last two were unimportant; they were just not the most important.

  • Document what you are doing as you are doing it. If your solution is simple it should be easy to describe. Consider the documentation process a litmus test for simplicity.

  • Avoid tight coupling and interdependence. Focus on isolation, separation, and modularization.

  • You are more likely to be successful tailoring your system to the capabilities of your operators than tailoring the capabilities of your operators to your system.

  • Use automation, but continue to be vigilant about managing down the underlying complexity that the automation is abstracting.

  • This one is going to be controversial: sometimes you have to make hard tradeoffs in which you abandon some amount of functionality (and possibly redundancy) to maintain simplicity. This involves understanding the difference between what you can do and what you should do.

  • "Make everything as simple as possible, but not simpler."  --  Albert Einstein.
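
The point above about isolation and modularization can be sketched in a few lines. This is a hypothetical example (the class and method names are invented for illustration): the notifier depends only on a one-method seam rather than on any concrete delivery mechanism, so the two sides can change, be tested, or fail independently.

```python
class Sender:
    """Narrow seam between modules: anything that can send a message."""
    def send(self, recipient: str, body: str) -> None:
        raise NotImplementedError

class ConsoleSender(Sender):
    """One concrete implementation; others can be swapped in freely."""
    def send(self, recipient: str, body: str) -> None:
        print(f"to {recipient}: {body}")

class Notifier:
    """Depends on the Sender interface, not on any concrete sender."""
    def __init__(self, sender: Sender) -> None:
        self.sender = sender

    def alert(self, recipient: str) -> None:
        self.sender.send(recipient, "system alert")

Notifier(ConsoleSender()).alert("ops@example.com")
```

The design choice is the narrowness of the seam: the fewer things two modules know about each other, the fewer ways a change in one can surprise the other.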


How has complexity manifested itself in the environments in which you work? What are you doing to manage it? Is it hypocritical for a complex post to extol the virtues of simplicity?

6 comments:

  1. Pete,

    Your reference to "abandon(ing) some amount of functionality to maintain simplicity" got me thinking about the twin problems of accurately defining "requirements" and the tendency to focus on "tools" rather than "processes." Too often "requirements" are defined in terms of features rather than outputs - which in turn should be defined by the process. Lean principles tell us to eliminate process steps that do not add value. In automating a process, we may expand the principle and decide not to automate certain process steps, because doing so does not add sufficient value to justify the added costs of increased complexity. This only works if you have a clear understanding of what functionality is in fact *required* as opposed to "desired."

  2. I am not a trained computer scientist, and because of that I have always leaned toward simpler solutions. The more I learn and the closer I come to being a trained computer scientist, the more I realize how much of an advantage that is.

    I am working on a project right now where I have to denormalize an extremely normalized (fourth normal form and then some) database because of performance issues. I did not design the schema; in fact, I was against the complexity from the beginning, not because I thought it was too normalized, but because of the amount of code and layers that needed to be written to make it work. Unfortunately, it took replacing the designer to get our system to work as it should have from the beginning: fewer tables, less code, and less time. I saw all of this coming, but because my only argument was that it was too complex, nothing was done to stop the problems from going forward. It's really sad.

    Thanks for the post.

  3. This was a great post. I have actually forwarded it along to our CEO, President, and COO. I have tried in the past to communicate the need for simplicity in our complex environment. It is only recently that I have seen this change taking place. I think seeing someone else's input on the same topic really helps.

    In an environment where we work on several outsourced projects at once, the need for simplifying complexity (oxymoron?) to the least common denominator is so very important.

    Thanks for the insight.

  4. Very perceptive analysis of complexity. As we have moved toward "enterprise-wide solutions," the complexity of systems has grown to the point that we now spend a disproportionate amount of time just managing the complexity. That is coupled with the fact that decisions (regarding solutions) get made as far as possible from the problem, by people who do not have (and, experience shows, often cannot have) a full understanding of it.

  5. I couldn't agree more, particularly with regard to code complexity. When code manifests defects, the impediment to resolving them is rarely a lack of efficiency or elegance in the code; it's understanding the code and grasping the unintended consequences, both in the cause of the problem and in its potential solutions. A few years ago I made the commitment to always choose the obvious way over the clever way, even at the expense of abstraction and reuse. The results are sometimes unattractive, but I feel like the resulting code is better fortified to survive out in the world when I'm no longer around to care for it.

  6. Great article.

    "Simplicity is prerequisite for reliability."
    -- Edsger W. Dijkstra
