I’m obviously playing catch-up with a number of posts I’ve been meaning to write. This one comes from something I read back in November by Tom Limoncelli, but it was a topic I had planned on writing about anyway. The post, titled “Run, run, run, dead”, draws a nice analogy between how things break in the analog and digital eras, and points out that as system administrators, we should be using the analog method of monitoring.
An analog radio (one with an old-fashion vacuum tube) sounds great at first, but you hear more static when the tube starts to wear out. Then the tube dies and you hear nothing. If you change the tube when it starts to degrade, you’ll never have a dead radio. (Assume, of course, you change the tube when your favorite radio show isn’t on.)
A transistor radio, on the other hand, is digital. It plays and plays and plays and then stops. Now, during your favorite song, you have to repair it.
This is one of those great analogies that works well for the situation. Watching for the host to go down is reactive; by the time you notice, it’s already too late (the transistor radio). Watching for things changing, and adapting your system(s) to them, is proactive (the tube radio). Take, for example, the image to the right. This is Nagios monitoring our VPN server’s connection counts. The server was in a small subnet with a restricted number of connections (30). Nagios warns of excessive connections at 25 and alerts critical at 27. This allows us to take a look and bump duplicate or long-running connections. This was a temporary solution because, like any good admin, we didn’t want to sit messing with people’s connections all the time, so we moved the server so that we could give it a bigger block. In this scenario we responded to the changing demand on the VPN server as we watched trends (the year graph shows a trend upwards from 10-15 to the current 25-28).
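To make the thresholds above concrete, here is a minimal sketch of how a check like this could be written as a Nagios-style plugin in Python. It is not our actual check: the function name `check_connections` is made up for illustration, and I am assuming the thresholds are inclusive (25 or more warns, 27 or more is critical). The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the performance data after the pipe follow the standard Nagios plugin conventions.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_connections(count, warn=25, crit=27, limit=30):
    """Classify a VPN connection count against warning/critical thresholds.

    Returns (exit_code, status_line). The defaults mirror the numbers in
    the post; treating the thresholds as inclusive is an assumption.
    """
    if count < 0:
        return UNKNOWN, f"UNKNOWN - invalid connection count {count}"

    # Performance data (label=value;warn;crit;min;max) lets graphing
    # add-ons plot the value over time, which is where trends show up.
    perfdata = f"connections={count};{warn};{crit};0;{limit}"

    if count >= crit:
        return CRITICAL, f"CRITICAL - {count}/{limit} connections | {perfdata}"
    if count >= warn:
        return WARNING, f"WARNING - {count}/{limit} connections | {perfdata}"
    return OK, f"OK - {count}/{limit} connections | {perfdata}"
```

A real plugin would get the count from the VPN server somehow (that part is deliberately left out here), print the status line, and exit with the returned code. The point is that the warning fires while there is still headroom, so you get the static before the tube dies.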
This is just one example of a proactive monitor. We watch all kinds of metrics, from memory to disk space to IO to webserver connections, all the way through to bandwidth utilization on our connections and VPN tunnels.
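Watching a trend, like the upward creep on the VPN year graph, boils down to asking when a metric will hit its limit if nothing changes. The following is a rough sketch of that idea using a plain least-squares line. The function names `linear_fit` and `days_until_limit` are hypothetical, and a straight line is an assumption that real usage rarely honours, so treat the result as a hint rather than a forecast.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) samples of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_limit(days, values, limit):
    """Estimate days from the last sample until the fitted line hits limit.

    Returns None when the trend is flat or falling, and 0.0 when the
    fitted line has already crossed the limit.
    """
    slope, intercept = linear_fit(days, values)
    if slope <= 0:
        return None
    crossing_day = (limit - intercept) / slope
    return max(0.0, crossing_day - days[-1])
```

Feed it the day numbers and the values sampled on those days for any metric with a hard ceiling (connections, disk space, bandwidth), plus the ceiling itself, and it gives you a rough idea of how long you have before the warning threshold stops being enough.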
Reactive monitoring is fine for up/down status, but it is no good way to plan for your future needs or to follow your current performance trends. If you’ve not read either of Tom’s books and you’re a system administrator, you should.