I think you realize that your week is going to be bad when on Sunday you spend more time with a server than you do with your son.
Sunday morning, at approximately 9:30 am, my phone started beeping away. Well, more technically, “boinging” away. It was our monitoring server triggering an alert to report that one of our test web servers was unreachable. I check the state of the server, and I cannot reach it… chaos… I wait a few minutes to see if it might be a network issue… No luck… Time to head into the office.
When I get into the office, I find the server room is roasting, probably sitting at 100 degrees at least. Apparently the building shuts off the A/C units on weekends, and our own A/C unit in the server room was struggling to keep up. I glance at the rack and see the offending server blinking away, so I switch the KVM over to that server, and nothing… the screen doesn’t come up… not a good start.
I hit reset on the server and the screen comes back to life. My hopes are high: maybe it was just a virtual memory issue, with a runaway process that stopped the box from responding. 30 seconds later, my hopes are dashed by the error message:
1 Logical Drive: Found
1 Logical Drive: Handled by BIOS
1 Logical Drive: Error
Uh oh. It prompts me to go into the RAID card manager, so, at the computer’s suggestion, I head in there. It reports that both drives in the RAID 1 configuration are offline, which is a bad thing. If one drive goes offline, you can continue working and replace the one dead drive. With both offline, it usually means something serious has happened. In this case, it was something serious. I attempt to force one drive online to see if I can get anything up and running. Fortunately we had a full backup run on the Saturday before.
Crossing my fingers, I let the box reboot. And hope… and hope… and grin as I see the classic “Starting Windows…” screen… then the Windows login box… and all is happy again. That was after 4 hours of fighting with the drives, so my hopes are half-hearted in a way. It’s up, it’s running, it seems stable… I head home…
Well, I did say it was a bad weekend if I spent more time with a server than with my son, so you can probably guess what is about to happen. About an hour after I got home, boing boing boing… my phone starts again… crap.
Back into the office, and it’s dead again, this time with no chance of recovery, so I look at the possibility of replacing the server’s drives. Pulling them out, I feel my heart sink when I see they’re two IBM Ultra160 SCSI drives. This is quite bad. I’ve not seen an Ultra160 drive advertised anywhere for a long while. I even called the local Fry’s stores to find out if they had any (Fry’s has everything), but no such luck.
The fun continues when I’m asked to sort out a temporary solution for them until we can replace the server… Looking at the pile of desktops on the floor, I grab what used to be the desktop of one of the company’s graphics specialists, and run home with it after spending another 45 minutes in the office. With my Windows 2000 Advanced Server CD in hand, I start a long night.
By about midday today I finally had the server up and running, with all the services re-installed. Or at least all the services I knew of. With the development department not being entirely clear on what they have and don’t have on their servers, it makes my life a little more difficult.
At 3:00 pm today, I turned the server over to the development manager and let him know all is working… here I am crossing my fingers in the hope that they can survive on that for a few days.
In the meantime, I’m in a frantic hunt for a new server. I’ve been told we’re probably going to keep going with Dell servers, which means if they want to replace this for a reasonable price and still get an okay box, they’re looking at roughly $1600… we shall see how that goes through the finance department.