You can spend a lot of time planning for the worst possible outcome in the event of an outage, but it’s usually the silly little things that end up getting you…
This week was not fun… Monday was a day of chaos induced by the rather unusual weather. Dallas, like most of the country, is being nicely chilled, which should make the Packers feel right at home for the Super Bowl, with snow on Friday and temperatures not getting above freezing until Saturday.
To help reduce load on their circuits, the local power provider, Oncor, mandated 15-minute rolling blackouts across the area. These were supposed to affect residential circuits only, but apparently somebody didn’t tell them about the office block they shut down. Unfortunately, when they turned the circuit back on, the sudden surge on the line apparently tripped something, resulting in a 2 hour 45 minute power outage. Our UPS was not happy. Neither was I, having to venture into the office at 07:30, when it was still about 9°F outside.
Once power came back on, I set about making sure everything was coming up properly. We have one server that needs hands-on attention; everything else should come up on its own.
1 – The domain controllers did not boot first
They’re virtual machines, so this was almost expected to happen; for some reason the boot order keeps getting messed up. The problem with the DCs not coming up first is that there was no DHCP, which made it hard for me to get to anything. Fortunately I have a relatively good memory for most of the servers in the office, at least the important ones needed to get things going.
2 – One NAS network port didn’t come up
For some reason, one of the NICs on one of the NetApp filers didn’t come up. This port happened to be the primary port in a team for the virtual machines’ NFS storage. Because it wasn’t a filer failure, and the other 2 ports were up, the filer didn’t fail over to the second filer. This resulted in most of the VMs not coming up. After I unplugged and replugged the cable, the port came up and VMware was happy again.
3 – mount failed on one server
Because one physical machine booted faster than another, the mount of a share failed because the server providing it wasn’t ready. Easily remedied by telling it to mount again.
4 – Previously decommissioned server jumped to life
A server that had previously been decommissioned and shut down decided to jump back to life. This was because of a BIOS setting that tells it to power on when it senses a restoration of power. This caused its replacement server to fail (IP conflicts, etc.).
5 – NAS failed NIC port came up in half-duplex
If you’ve ever done any networking work and had a port running in half-duplex, you’ll understand the impact. To say it made the virtual machines run slowly is an understatement. It took me nearly all day to spot this. Some might ask why I didn’t hard code the port speeds. I would have, but for some reason NetApp didn’t make 1000 Mbps/full duplex an option, although our Cisco switches did.
6 – Windows services without service dependencies will fail
Our proxy server depends on a Windows server to do virus scanning of incoming content. For some reason, the services on Trend’s IWSS didn’t have dependencies created on install (not our fault). This meant that when SQL wasn’t running, IWSS didn’t start, causing the proxy to throw errors because it couldn’t talk to the IWSS service. This took me several hours to figure out, because starting the IWSS service wouldn’t throw any errors anywhere; it just stopped again.
7 – DHCP pool exhaustion on VPN network
Microsoft’s RRAS service requests IP addresses for clients that connect. It pulls the IPs from our domain controllers. The problem here is that our DCs remembered they’d already allocated IPs to the RRAS service, but RRAS forgot about those allocations. When RRAS started up again and requested another block of IP addresses, it quickly exhausted the VPN network’s DHCP scope, blocking people from getting back on.
Domain Controller Boot Order
This one is relatively obvious: we need to make sure the domain controllers come up first. The problem is that they’re currently virtual, which means I have to wait not only for the DC to start, but for its VMware host to start first, all while the rest of the hosts are coming up. To guarantee the order, we’d have to set all the other hosts not to power up on power restore, which is not really a wise recommendation. We do have a server due to be decommissioned, which we plan to turn into a domain controller.
NAS port failure and duplexing
Not entirely sure on this one yet. Port monitoring is on the list, though; with it, I would have spotted this failure much earlier.
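As a rough idea of what such a port check might look like, here is a minimal Python sketch that compares a port's observed link state against what it should be and reports any difference. The names LinkState and check_link, and the port name e0a, are made up for illustration; how the observed values get collected from the filer or the switch (SNMP, CLI scraping, or something else) is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkState:
    """State of one network port (hypothetical structure for this sketch)."""
    up: bool
    speed_mbps: int
    duplex: str  # "full" or "half"


def check_link(name, observed, expected):
    """Return a list of problems for one port; an empty list means healthy."""
    if not observed.up:
        # Speed and duplex are meaningless while the link is down.
        return [f"{name}: link is down"]
    problems = []
    if observed.speed_mbps != expected.speed_mbps:
        problems.append(
            f"{name}: speed is {observed.speed_mbps} Mbps, "
            f"expected {expected.speed_mbps} Mbps"
        )
    if observed.duplex != expected.duplex:
        problems.append(
            f"{name}: duplex is {observed.duplex}, expected {expected.duplex}"
        )
    return problems


# Example: the port negotiated half-duplex after the outage.
wanted = LinkState(up=True, speed_mbps=1000, duplex="full")
seen = LinkState(up=True, speed_mbps=1000, duplex="half")
for problem in check_link("e0a", seen, wanted):
    print(problem)
```

A check like this would have flagged both the port that stayed down and the later half-duplex negotiation, provided the monitoring system feeds it fresh values on a schedule.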
Mount Failure
This one could probably be fixed with two things: monitoring the mount point to make sure it’s working, and a script that sleeps on boot and attempts to remount after a set period of time. I also need to figure out why auto-mount isn’t working on that particular mount point.
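The sleep-and-retry idea could be sketched roughly as follows in Python. This assumes a Unix-like host where the share already has an /etc/fstab entry, so that running mount with just the mount point is enough. The function name mount_with_retry, the path /mnt/share, and the retry counts are all illustrative assumptions, not our actual setup.

```python
import os
import subprocess
import time


def mount_with_retry(mount_point, attempts=5, delay_seconds=30,
                     run=subprocess.run, sleep=time.sleep,
                     is_mounted=os.path.ismount):
    """Try to mount mount_point, waiting between failed attempts.

    Returns True once the share is mounted, False if every attempt fails.
    run, sleep and is_mounted are parameters only so the function can be
    exercised without touching a real system.
    """
    for attempt in range(1, attempts + 1):
        if is_mounted(mount_point):
            return True  # already mounted, nothing to do
        # Relies on an fstab entry for mount_point (assumption).
        result = run(["mount", mount_point])
        if result.returncode == 0:
            return True
        if attempt < attempts:
            # Give the file server more time to finish booting.
            sleep(delay_seconds)
    return False
```

Run from a boot-time job as something like mount_with_retry("/mnt/share"), this would give the slower server a couple of minutes to come up. A False return is the point at which monitoring should raise an alert rather than retrying forever.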
Zombie Decommissioned Server
We fixed this as soon as we figured out what happened. Power and network have been removed.
Windows Service Dependencies
Now that I have a better understanding of the IWSS services involved, I’m going to go back, tweak the settings on the server, and create all the dependencies. More monitoring will be introduced here too.
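On Windows, one way to create a service dependency is the built-in sc config command with its depend= option. The Python sketch below only builds that command line so it can be reviewed before being run. The service name ExampleIwssService is a placeholder, since the real IWSS service names are not listed here, and MSSQLSERVER assumes a default SQL Server instance; the helper name sc_depend_command is likewise made up.

```python
def sc_depend_command(service, dependencies):
    """Build the argument list for 'sc config <service> depend= a/b/c'.

    sc expects the value as a separate argument after 'depend=', with
    multiple dependencies separated by forward slashes. Note that this
    replaces the service's existing dependency list rather than adding to it.
    """
    if not dependencies:
        raise ValueError("at least one dependency is required")
    return ["sc", "config", service, "depend=", "/".join(dependencies)]


# Placeholder names (assumptions): substitute the real service names.
command = sc_depend_command("ExampleIwssService", ["MSSQLSERVER"])
print(" ".join(command))
```

The resulting list can be passed to subprocess.run from an elevated prompt on the server. Because depend= overwrites the current list, any dependencies the service already has need to be included again.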
VPN DHCP Pool
We’re phasing the Windows RRAS service out in favor of a Cisco VPN service. That being said, I could use the pool options on RRAS to fix this issue for the future. I’ve got to do some thinking on this.
While we always try planning for the worst, sometimes the simple stuff catches us, and we learn some more. I’m bashing together more monitoring rules that need to be in place, and making sure all the discoveries are documented as part of the start-up procedures.