Yesterday I had an epiphany (a small one). It happened when I installed a Nagios server and started to set up hosts and checks. Nagios is a network monitoring tool built around a Nagios server that runs checks on a number of other hosts and collects information about them. As an admin you can get alerts and notifications about events or just view the green/yellow/red status of these checks.

The elegance of Nagios is that checks are very simple and isolated. They are just shell commands that return a status (green/yellow/red) and a message. That's it. There are a number of predefined checks and you can easily add new ones.
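To make that concrete, here is a minimal sketch of what such a check could look like in Python. It relies on the Nagios plugin convention that exit code 0 means OK, 1 means WARNING and 2 means CRITICAL. The function names and the 80%/90% thresholds are my own assumptions for illustration, not anything Nagios ships with.

```python
import shutil

# Exit codes used by Nagios plugins: 0 = OK, 1 = WARNING, 2 = CRITICAL.
OK, WARNING, CRITICAL = 0, 1, 2


def evaluate_disk(used_percent, warn=80.0, crit=90.0):
    """Map a disk usage percentage to a (status, message) pair.

    The thresholds are illustrative defaults, not Nagios settings.
    """
    if used_percent >= crit:
        return CRITICAL, f"DISK CRITICAL - {used_percent:.1f}% used"
    if used_percent >= warn:
        return WARNING, f"DISK WARNING - {used_percent:.1f}% used"
    return OK, f"DISK OK - {used_percent:.1f}% used"


def check_disk(path="/"):
    """Measure real disk usage for `path` and evaluate it."""
    usage = shutil.disk_usage(path)
    return evaluate_disk(100.0 * usage.used / usage.total)
```

To use something like this as an actual check, the script would print the message and exit with the status, for example `status, message = check_disk()`, then `print(message)` and `sys.exit(status)`. The check knows nothing about any other check, which is the whole point.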

Now, when you start monitoring an existing network, you don't have checks for every service there is. But you can add them, one after another. Every check you add adds value to the monitoring system. And eventually you will have checks for all the vital parts of the network.

It's like Unit Testing for admins.

When you have a legacy code base that is not unit tested and you start adding tests, you start at 0% code coverage. But every test you add adds value to the test suite.
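As a rough sketch of what that first test could look like, here is a Python example using the standard `unittest` module. The function `parse_price` is a made-up stand-in for some piece of legacy code; only the shape of the test matters.

```python
import unittest


def parse_price(text):
    """Hypothetical legacy function: turn '12,50 EUR' into cents."""
    amount = text.replace("EUR", "").strip().replace(",", ".")
    return round(float(amount) * 100)


class ParsePriceTest(unittest.TestCase):
    """The first test in the suite; it already pins down one behavior."""

    def test_comma_as_decimal_separator(self):
        self.assertEqual(parse_price("12,50 EUR"), 1250)
```

Saved to a file, this can be run with `python -m unittest`. One test is not much, but from now on nobody can break that behavior unnoticed.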

I think there is a more general lesson to be learned from these two examples. In any existing system, there are two strategies for improvement:

  1. Revolution
  2. Evolution

Revolution means making a big investment and creating something new to replace the old. Think of an old house that has become too costly to repair. You have to tear it down and build a new one. That's a big investment, and it comes with the risk that the house will not stay on budget and will make you go bankrupt. That's why revolution is not feasible most of the time.

Evolution means small, repeated investments to replace or enhance the old with something new. Speaking in terms of house-building, it would mean replacing parts of the house, maybe room by room or wall by wall. Of course that's hard if your architecture does not support incremental change (buildings don't, most of the time). Supporting incremental change is key to improvement in situations where the revolution approach is infeasible. To support incremental change means:

  1. The returns of added increments must be immediate.
  2. The costs of added increments must stay constant.
  3. Returns must not diminish.

All three conditions hold true for Unit Testing and Nagios. I'll explain a bit more about the three rules:

1. Instant Benefits

You have to start with something small, and even that has to be of value to you. Building a new house is the opposite: It is of no use until it is completely done. When introducing a Unit Testing framework to an existing codebase, you have low fixed costs (adding the framework) and get immediate results from your first test on.

2. Constant Costs

According to the second law of thermodynamics, the entropy of an isolated system never decreases. Something similar seems to be true for complex systems, too. When increments are added, most of the time a bit of coherence and order is lost. Adding the next increment then becomes more costly.

In software development you can observe that directly: It all starts out with a clean design. Then someone has to add something unexpected and violates the design a little bit. The next person who wants to add something will have a harder time. Costs didn't stay constant but grew with the number of increments.

In Unit Testing, the architecture of isolated tests does its best to not let this happen. It tries to keep the entropy of the system low. Nagios checks work the same way. They are isolated and consistently organized. Adding another check is not influenced by existing checks.

3. No Diminishing Returns

Diminishing returns is a term from economics. It means that each additional unit of investment yields less return than the one before, so returns grow less than linearly with investment. This must not be the case when you want a system to work. Adding an increment must result in an outcome that is proportional, or even more than proportional, to the investment.
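A tiny sketch in Python shows the difference. The numbers are made up purely for illustration, and the helper names are my own: each list holds the cumulative return after 0, 1, 2, ... increments.

```python
def marginal_returns(total_returns):
    """Return the gain contributed by each successive increment."""
    return [b - a for a, b in zip(total_returns, total_returns[1:])]


def is_diminishing(total_returns):
    """True if any increment yields less than the one before it."""
    gains = marginal_returns(total_returns)
    return any(later < earlier for earlier, later in zip(gains, gains[1:]))


# Made-up cumulative returns after 0, 1, 2, 3, 4 increments.
linear = [0, 10, 20, 30, 40]      # every increment adds 10
flattening = [0, 10, 17, 21, 23]  # each increment adds less
```

Here `marginal_returns(linear)` gives `[10, 10, 10, 10]`, while `marginal_returns(flattening)` gives `[10, 7, 4, 2]`. The second curve is the one to watch out for.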

Take Unit Testing. Every test you add results in code coverage (the return) which is more or less proportional to the investment (there might be a point where adding more tests doesn't result in more code coverage, but that's when you have a fully tested system and just don't need to add more tests). Again, Nagios checks behave the same way: Every check you add results in a better ability to react to defects in the network (linear returns).

When returns start to diminish, you probably need a new architecture to replace the old one. It might be a completely different one, it might need to be more flexible, it might be designed for completely different goals. Diminishing returns mean that you have pushed the system to its limits. Don't go any further.

Conclusion

If you want to build systems that induce constant improvement, build them with these three rules in mind.

The examples here might be overly simplistic. There are always boundary conditions where the mechanics stop working. That's probably when you need a new system (revolution). But when you see something that doesn't have these mechanics in place, you should either change it or leave it.