The Nubby Admin has an interesting post, titled The Wisdom of Specificity in Monitoring and Alerting, with a lesson learnt on the importance of monitoring. After his service provider made some DNS changes because of disk usage issues, Wesley found himself with a broken site, but his monitoring didn’t report it as such.
In his post, he goes on to explain that his current monitoring only watches ICMP, but he came to the realization that a simple HTTP check wouldn’t have sufficed either. Why? Well, his service provider probably bumped his URL over to a new site that basically reported his site had been deactivated. As that was still a valid website, and didn’t report any errors, there was no reason for the monitoring to raise an alarm.
Monitoring Example
This is where the difference between Up and Alive comes in, or as Wesley put it, “specificity in monitoring”. Knowing HTTP is running is good; knowing HTTP is serving the website you told it to is much better. This is relatively easy in Nagios1 with the check_http plugin. We’ll take a simple example of making sure Google is serving up the search page.
$ ./check_http -H www.google.com
HTTP OK: HTTP/1.1 200 OK - 6924 bytes in 0.119 second response time |time=0.119169s;;;0.000000 size=6924B;;;0
So this first example shows that HTTP is responding, but it doesn’t tell us if it is really the Google page. There are several things on the Google search page we can look for that make it unique. First is the Gmail link at the top, then there is also the “I’m Feeling Lucky” button. Look around, there are quite a few, but for this test, I’m going to use the lucky button…
$ ./check_http -H www.google.com -s "Feeling Lucky"
HTTP OK: HTTP/1.1 200 OK - 6912 bytes in 0.073 second response time |time=0.072889s;;;0.000000 size=6912B;;;0
As you can see, I’ve thrown an argument in there which tells the plugin to look for the string “Feeling Lucky” in the response. In this case, all came back OK. Let’s see an example where it might fail…
$ ./check_http -H www.google.com -s "Foo Check"
HTTP CRITICAL: HTTP/1.1 200 OK - string not found - 6924 bytes in 0.079 second response time |time=0.079160s;;;0.000000 size=6924B;;;0
This shows that “Foo Check” doesn’t exist in the page, so the check goes CRITICAL even though the server answered with a 200. This simple check captures the difference between Up and Alive. However, you can go much further…
Doug Luxem left a comment that sums up well how we should think about monitoring…
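For anyone curious what a content check like this boils down to, here is a minimal Python sketch. It is not how check_http is implemented; the function names `evaluate` and `check_url` are my own, and the only things borrowed from Nagios are the plugin exit code conventions (0 for OK, 2 for CRITICAL).

```python
import urllib.error
import urllib.request

OK, CRITICAL = 0, 2  # Nagios plugin exit code conventions


def evaluate(status, body, expected):
    """Return (exit_code, message) for an HTTP status and a page body."""
    if status != 200:
        return CRITICAL, f"HTTP CRITICAL: status {status}"
    if expected not in body:
        # The server is Up, but it is not serving the page we expect.
        return CRITICAL, "HTTP CRITICAL: string not found"
    return OK, "HTTP OK: string found"


def check_url(url, expected, timeout=10):
    """Fetch url and evaluate the response; any fetch error is CRITICAL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return evaluate(resp.status, body, expected)
    except urllib.error.HTTPError as exc:
        return CRITICAL, f"HTTP CRITICAL: status {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        return CRITICAL, f"HTTP CRITICAL: {exc}"
```

Calling `check_url("http://www.google.com/", "Feeling Lucky")` would mirror the second check_http example above. A "site deactivated" placeholder page would still return a 200, but it would fail the string test, which is exactly the case a plain HTTP check misses.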
The key when monitoring is to look at it from a perspective of the service offered to end user, not the individual components — Doug Luxem
Whilst I agree with the first part, I don’t agree with the second. Each of us considers a service differently. An end user considers a service to be something they can use to do something else. For example, Twitter is a service, Hotmail is a service, Google is a service. For a system administrator, however, a service is a component of those user-defined services, i.e. database services (MySQL, MSSQL, PostgreSQL), web services (Apache, lighttpd), and such. A good set of monitors will monitor for the user2, and from the position of the system administrator3.
How I learnt the hard way…
Wesley asked for examples of how we might have been caught by similar issues, and I have the perfect example that hit me a few years back. The company was moving a core DNS server to another data center. Really all they were doing was starting a new one and, once it was stable, decommissioning the previous one. A sound plan, right? During this time, our domain had all three hosts (NS1, NS2, and NS3) in the DNS records. Nobody noticed any issues.
All was well, until we went to remove NS2 from the list. Within 24 hours of removing it, we started receiving complaints from customers that they couldn’t access the site, and all they kept getting was connection issues. We initially put the issue down to them, as we’d not received any other reports, and the complaints were all from a single group of customers.
Then we got more complaints, and after some digging around, we discovered that some of the people were getting a different IP address for the name they were requesting. Aha, I thought. Somebody made a typo, so we revalidated the configurations, everything was good, and querying both DNS servers reported the same correct address.
A little more poking around, and I queried the DNS servers again. This time the NS3 server returned a different IP address. It was only different in the third octet (a.b.e.d instead of a.b.c.d), and the DNS server happened to be on the same network block as that different octet. Something had been transforming the addresses in transit, even though the server itself responded correctly.
More digging around, and we discovered that DNS fixup was enabled on the ASA firewalls protecting the network the DNS server was located on. With all the other configurations in place, it was simply changing the first 3 octets of the IP address to match the network it was announced from.
This alerted me to an issue I hadn’t considered happening in our environment, so I now watch for it. Nagios once again has my behind covered…
$ ./check_dns -H netdork.net -s ns1.netdork.net -a 71.6.153.227
DNS OK: 0.019 seconds response time. netdork.net returns 71.6.153.227|time=0.019156s;;;0.000000
The above is only an example, not what I’m really monitoring for the site that had issues. I’m just not disclosing that information.
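The same idea can be extended to ask every nameserver in turn and flag any that disagree, which is what would have caught the NS3 problem. Below is a rough Python sketch under a few assumptions: the `dig` command is installed, and the helper names `query_with_dig` and `find_mismatches` are hypothetical ones of my own. It is an illustration, not what check_dns does internally.

```python
import subprocess


def query_with_dig(name, server, timeout=10):
    """Ask one nameserver for the A records of name, using the dig CLI."""
    result = subprocess.run(
        ["dig", "+short", f"@{server}", name, "A"],
        capture_output=True, text=True, timeout=timeout, check=False,
    )
    # One answer per line; an empty set means the server gave us nothing.
    return {line.strip() for line in result.stdout.splitlines() if line.strip()}


def find_mismatches(answers, expected):
    """Return {server: addresses} for servers not answering exactly expected."""
    return {
        server: addrs
        for server, addrs in answers.items()
        if addrs != {expected}
    }
```

With placeholder names, usage would look something like building `answers = {ns: query_with_dig("www.example.net", ns) for ns in ("ns1.example.net", "ns3.example.net")}` and then alerting if `find_mismatches(answers, "192.0.2.10")` is not empty. The important part is running the check from outside the firewall, since that is where the rewritten address is visible.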
Conclusion…
Any administrator will know the importance of monitoring. If they don’t, then they haven’t yet been caught out and had to suffer the wrath of upset customers and bosses when something goes terribly wrong.
- Monitoring is something that should evolve; it should never be static4.
- Monitoring should watch for both the user (page logins, etc.) and the administrator (service is running).
In the wise words of Tom Limoncelli…
It isn’t a service if it isn’t monitored. If there is no monitoring then you’re just running software.