On Sunday afternoon, we received a call from building security reporting that there had been a power outage lasting 1.5 hours, and that power had now been restored. This means hell for me. 1.5 hours is quite a long time. Our UPSes are able to sustain the servers for approximately 30 minutes, and longer if we reduce the servers to critical load only.
So why didn’t I get a notification earlier that power was out?
The server that monitors our internal infrastructure runs on a different UPS. Unfortunately, this UPS has apparently given up any hope of sustaining anything but the LED on the front and the small speaker that tells you there are issues. This caused the monitoring server to go down within seconds of the power outage beginning. To round things off, it looks like it fried either the PSU or the motherboard in the server. So now we have no internal monitoring.
That’s just the start. About a month ago, I wrote about another Bad Day in which a seriously crippled SQL server decided to cause me some major headaches. I must say, I’m glad I documented what happened, because the power outage also took that server offline, causing the master db to corrupt once again. Instead of taking several hours to recover this time, I was back up and running within an hour and 15 minutes. I could have fixed that yesterday, had I known the SQL service was dead; however, our internal monitoring server was broken, so I wasn’t notified… argh!
So I wonder how the rest of my week is going to go… Stay tuned.