Over the weekend, we had a major maintenance project, which involved taking down our entire storage infrastructure for some upgrades to the CPUs of the storage controllers.
This maintenance was required for a big project rolling out next year. It meant shutting down storage and unmounting every service that used it: a number of SQL clusters, an image server, and our virtualization platform. After deploying a maintenance page on the site and bringing all the necessary servers and services down, the hard work by the EMC tech began.
During the EMC updates, we took advantage of the extended downtime to apply firmware, driver, and OS patches. After the drivers and firmware were updated, we tried to move on to the OS patches. This is where I should have known it was going to be a problematic maintenance window. When I told Windows to check for updates, it kept throwing errors, with the error code pointing to not being able to connect to the internet. I did the usual tests, pinging the gateway and a few other addresses, and all was working; I just couldn’t get Windows Update to work. Then it dawned on me: the domain controllers were offline because of the maintenance. These servers supplied DNS to the network, and no DNS means no way to turn a name into an IP address, so the servers couldn’t find the Windows Update servers.
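The symptom here, where pings to an IP address succeed but anything that needs a name fails, can be separated out with two independent checks. Below is a minimal Python sketch of that idea, not what we actually ran that night. The function names are made up for illustration, and it uses a TCP connect in place of an ICMP ping, since raw ICMP usually needs elevated privileges.

```python
import socket


def can_resolve(hostname: str) -> bool:
    """Return True if the system resolver can turn hostname into an address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def can_connect(ip: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to ip:port succeeds within timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def diagnose(ip_reachable: bool, name_resolves: bool) -> str:
    """Classify the failure from the two independent checks."""
    if ip_reachable and name_resolves:
        return "network and DNS both look fine"
    if ip_reachable:
        return "network is up but DNS is failing"
    if name_resolves:
        return "DNS works but the target is unreachable"
    return "no connectivity and no DNS"
```

Feeding it the results of `can_connect` against a known-good IP address and `can_resolve` against a hostname you care about would have reported "network is up but DNS is failing" in our situation, which is a lot quicker than waiting for it to dawn on you.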
After the maintenance was completed, we were told to bring our services back online. We started with our ESX servers so we could bring the domain controllers up, and hit our second issue of the night. Our vCenter server could not connect to any of the ESX hosts. There was a long pause whilst it tried, followed by an error about not being able to connect, then a box asking to confirm the details. After using the Dell Remote Access Controller (DRAC) cards to confirm the ESX servers were running, I was a little baffled as to what was wrong. So I went back to basics… Can I ping the machines from the vCenter server? No! Well, that’s odd, and why does it say it cannot resolve the hostname? Then I realized why. The DNS servers were part of the Active Directory infrastructure, which is all virtual, and the virtual servers had all been offline for nearly 5 hours. This was more than enough time for the vCenter server to forget the IP addresses it had previously looked up.
Obviously, without being able to connect to the ESX servers, I couldn’t bring up the DNS servers to resolve the issue, so I had to go back to the classic method of fixing DNS issues: editing the hosts file (located in c:\windows\system32\drivers\etc). After adding an entry for each of the ESX boxes, I was able to reconnect to them.
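For anyone who hasn't had to do this in a while, each hosts file line is just an IP address followed by the name it should resolve to. Here is a small Python sketch that formats such lines. The helper name, the hostnames, and the 192.0.2.x addresses are all placeholders, not our real ESX boxes.

```python
from typing import Dict, List


def hosts_lines(entries: Dict[str, str]) -> List[str]:
    """Format {hostname: ip} pairs as hosts-file lines (IP first, then name)."""
    return [f"{ip}\t{name}" for name, ip in sorted(entries.items())]


# Hypothetical ESX hosts; the names and addresses are placeholders.
esx_hosts = {
    "esx01.example.local": "192.0.2.11",
    "esx02.example.local": "192.0.2.12",
}

for line in hosts_lines(esx_hosts):
    print(line)
```

It's worth removing these entries once DNS is back, as a stale hosts entry will quietly override DNS if the address ever changes.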
After some twiddling of thumbs, ESX dropped another error, this time about not being able to bring up HA support. Trying to start any of the virtual guests also failed with a similar error about unsatisfied dependencies for high availability. It appears this might have been caused by DNS too: the ESX boxes were unable to resolve each other’s names, so they couldn’t talk to each other. After quickly disabling HA on the cluster and starting up the domain controllers, I was able to turn the HA options back on.
After some more tinkering, everything was back up and running again; however, it made me pause to think. DNS was a single point of failure again. Whilst we built everything with a certain amount of redundancy and high availability, the unavailability of a single service crippled everything. This is the redundancy we had in place:
- We had set up multiple virtual hosts, in case we have a hardware failure on one.
- Storage is attached to each virtual host with multiple paths, so in the event of a storage switch failure, secondary paths exist.
- Network adapters run in pairs for management, with two quad-port cards for the virtual machine networks, so network redundancy across switches is covered too.
- There were multiple Active Directory servers in the event one crashed or was rebooted.
- Multiple switches and firewalls were configured in a high availability setup.
So whilst we’d built a whole bunch of redundancy, the unavailability of DNS was the root cause of most of the evening’s problems. So what am I doing about it? I will be replicating the DNS services to a physical box that I know depends on far less of the rest of the infrastructure.
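The underlying problem was a loop: DNS lived on virtual machines, and starting those virtual machines needed DNS. As a rough illustration, here is a Python sketch that looks for that kind of loop in a dependency map. The service names and the map itself are my own simplified model of the outage, not an inventory of the real environment.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of services, or None if there is none."""
    visiting: List[str] = []  # services on the current depth-first path
    done = set()              # services fully explored with no cycle found

    def visit(service: str) -> Optional[List[str]]:
        if service in done:
            return None
        if service in visiting:
            # Loop found: return the path from the first occurrence back to itself.
            return visiting[visiting.index(service):] + [service]
        visiting.append(service)
        for dep in deps.get(service, []):
            cycle = visit(dep)
            if cycle:
                return cycle
        visiting.pop()
        done.add(service)
        return None

    for service in deps:
        cycle = visit(service)
        if cycle:
            return cycle
    return None


# Simplified, assumed model: each service lists what it needs in order to start.
virtual_dns = {
    "dns": ["domain-controller-vms"],
    "domain-controller-vms": ["vcenter-esx-connection"],
    "vcenter-esx-connection": ["dns"],
}

# Same model, but with DNS also served from a physical box that needs no VMs.
physical_dns = dict(virtual_dns, dns=[])

print(find_cycle(virtual_dns))
print(find_cycle(physical_dns))
```

With DNS on a physical box, the loop disappears from the model, which is the whole point of moving it.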
Lesson relearned: DNS is nearly always the issue.