A common habit in a lot of IT shops is computer cloning. This saves time and cuts down on mistakes. If you have a good image of one computer and you use it to build another, you know the other machine is going to be good. You get to apply all your corporate policies, security software, anti-virus, group policies and so on in one single step. With this in mind, this is what we did with some SQL clusters we built 6 months ago.
The cloning went well: the servers were successfully built, SQL was deployed, and the clusters were set up. All was good… until recently. At first I didn’t notice a problem. Then we had a hiccup with something, so I connected to the server using Remote Desktop Protocol (RDP), started poking around, and found nothing unusual. A little baffled, I poked around some more, and some things didn’t seem right to me. Then it dawned on me: I was looking at the wrong server. I disconnected and tried again, and ended up back on the same box. For some reason, DNS was pointing to the wrong server. I quickly dropped the DNS record and added the correct server back, and all was good.
A few days later, the DNS issue popped up again. Doing some more poking around, I discovered that the host name on the server had stayed the same as that of the machine it was cloned from. For some reason, sysprep hadn’t properly changed the name. We decided to schedule some time to change the name later, but for now both boxes were stable.
A few days ago, the second box, the one with the incorrect name, crashed out for no apparent reason. The cluster service just reported issues without giving a specific cause. As the services had started up correctly on the other node, we left it for the night and decided to deal with it in the morning. Whilst poking around on it the following morning and trying to fail the SQL services over, we were bombarded by errors in the event log: blocks of 3, for every second it attempted to move the services over.
Log Name: Application
Source: MSSQLSERVER
Date: 9/16/2010 9:04:46 AM
Event ID: 19019
Task Category: Failover
Level: Error
Keywords: Classic
User: N/A
Computer: Server.domain
Description: [sqsrvres] ODBC sqldriverconnect failed

Log Name: Application
Source: MSSQLSERVER
Date: 9/16/2010 9:04:46 AM
Event ID: 19019
Task Category: Failover
Level: Error
Keywords: Classic
User: N/A
Computer: Server.domain
Description: [sqsrvres] checkODBCConnectError: sqlstate = 28000; native error = 4818; message = [Microsoft][SQL Server Native Client 10.0][SQL Server]Login failed for user 'NT AUTHORITY\ANONYMOUS LOGON'.

Log Name: Application
Source: MSSQLSERVER
Date: 9/16/2010 9:04:48 AM
Event ID: 19019
Task Category: Failover
Level: Error
Keywords: Classic
User: N/A
Computer: Server.domain
Description: [sqsrvres] OnlineThread: did not connect after 10 attempts resource failed
None of these look too healthy, but the last one was a bit of a hint as to what was going on. From the looks of it, once the SQL service has started and is up and running, the cluster service attempts to log in to validate that it is accepting connections. However, as you can see from the second entry, that login was failing.
Doing some searching, I stumbled across a post whose author seemed to be having a very similar issue. It hinted at a problem with services rejecting logins made from the server to the server itself when certain accounts are used. Follow the rabbit down the hole and you get to the Microsoft knowledge base article KB896861. This describes a security feature introduced in Windows XP Service Pack 2 and Windows 2003 Service Pack 1 (and inherited by Windows 2008). From the KB article…
Windows XP SP2 and Windows Server 2003 SP1 include a loopback check security feature that is designed to help prevent reflection attacks on your computer. Therefore, authentication fails if the FQDN or the custom host header that you use does not match the local computer name.
This basically says that when computer server.yourdomain.local attempts to connect to itself, authentication is rejected if the name used in the request does not match the computer name. This struck a chord… the computer name was wrong, and had been corrected. Something, somewhere, still held the incorrect computer name and was trying to log in with it.
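To make the idea concrete, here is a toy model of that name comparison in Python. It is only a sketch of the behaviour as the KB article describes it, not Windows' actual implementation; the function names, the parameters and the SQLNODE1/SQLNODE2 host names are all made up for illustration.

```python
def normalize(name: str) -> str:
    """Lower-case a host name and drop surrounding spaces and any trailing dot."""
    return name.strip().rstrip(".").lower()


def loopback_check_passes(requested, computer_name, dns_suffix="", extra_names=()):
    """Toy model of the loopback check: a connection from the machine to
    itself only authenticates when the requested name is one the machine
    recognises as its own.

    requested     -- the name used in the connection request
    computer_name -- the local computer name (hypothetical example values below)
    dns_suffix    -- primary DNS suffix, used to build the FQDN
    extra_names   -- any additional names explicitly allowed (an assumption
                     of this model, not something taken from the post)
    """
    allowed = {normalize(computer_name)}
    if dns_suffix:
        allowed.add(normalize(f"{computer_name}.{dns_suffix}"))
    allowed.update(normalize(n) for n in extra_names)
    return normalize(requested) in allowed


if __name__ == "__main__":
    # The node believes it is SQLNODE1, but something still asks for SQLNODE2.
    print(loopback_check_passes("sqlnode2.yourdomain.local", "SQLNODE1", "yourdomain.local"))
    # The same request using the name the node actually has.
    print(loopback_check_passes("sqlnode1.yourdomain.local", "SQLNODE1", "yourdomain.local"))
```

In this model the first call fails and the second succeeds, which is the same shape as our problem: a stale name left over from the clone makes the server refuse to talk to itself.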
Rather than try and dig through all the possible locations, we went with Microsoft’s recommendations for now, and disabled loopback checks with a simple registry key change.
1. Set the DisableStrictNameChecking registry entry to 1.
2. Click Start, click Run, type regedit, and then click OK.
3. In Registry Editor, locate and then click the following registry key: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa
4. Right-click Lsa, point to New, and then click DWORD Value.
5. Type DisableLoopbackCheck, and then press ENTER.
6. Right-click DisableLoopbackCheck, and then click Modify.
7. In the Value data box, type 1, and then click OK.
8. Quit Registry Editor, and then restart your computer.
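If you have more than one node to change, clicking through regedit gets old quickly. As a rough sketch, the DisableLoopbackCheck part of the steps above could be written out as a .reg file and imported instead. The Python below only builds the file text; the function name build_reg_file and the output file name are my own inventions, the key and value come from the KB steps, and it does not cover the DisableStrictNameChecking entry from step 1. Try it on a non-production box first.

```python
REG_HEADER = "Windows Registry Editor Version 5.00"
LSA_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa"


def build_reg_file(key: str, value_name: str, dword: int) -> str:
    """Return the text of a .reg file that sets one DWORD value under `key`."""
    if not 0 <= dword <= 0xFFFFFFFF:
        raise ValueError("a DWORD must fit in 32 unsigned bits")
    lines = [
        REG_HEADER,
        "",
        f"[{key}]",
        # DWORD values are written as eight hexadecimal digits.
        f'"{value_name}"=dword:{dword:08x}',
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    text = build_reg_file(LSA_KEY, "DisableLoopbackCheck", 1)
    # Hypothetical output file name; newline="" keeps the \r\n line endings as written.
    with open("disable_loopback_check.reg", "w", newline="") as handle:
        handle.write(text)
    print(text)
```

The resulting file can then be imported on the server by an administrator, which gives the same end result as steps 2 to 7.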
After doing this, the services failed back over without any issues, and there were no more authentication problems. The server didn’t even need to be rebooted. Now, the correct way to fix this would be to find out where the machine name is still incorrect. That has been added to my to-do list for later, but for now, all is working again.