While working on a SQL Server cluster install, after a failed cluster setup, we started getting the error:
The SQL Server failover cluster instance name 'SQLCluster1' already exists as a server on the network
This error prevented us from installing an instance under a previously used (but since removed) instance name. Read on to see how we resolved it.
If you’re only interested in the fix, skip ahead to the Solution section.
Background
A little background on this issue is probably warranted. Our Active Directory configuration is set up so that only certain people, who are not domain administrators, can create user and computer accounts. This is a security feature, and many large organizations use it to delegate services to smaller teams without giving up the keys to the kingdom, so to speak. I am one of the lucky folks in my office with the right to create computer accounts.
Why does this matter? During a cluster install, the cluster service on the computer attempts to create a computer account, set up all the necessary security settings, and hand the account over to the services to use (simply put, anyway). The problem is that the cluster service doesn’t have the rights to do this, which usually ends in a tragic error and things don’t work. We know this going into these installs, so we create all the computer accounts ahead of time, then assign the permissions on those accounts that the cluster’s computer accounts, and the cluster computers, need in order to modify them.
This usually works. Unfortunately, in a case of badly timed Active Directory replication, the permissions for the SQL instance name hadn’t replicated to the domain controllers that the cluster servers used. When the install reached the very end, where it attempts to register the computer account, it found that the account existed but, instead of simply being allowed to modify it, was told it didn’t have permission to do so.
This failed the setup, which bailed out, ungracefully I might add. That left bits and pieces of a failed cluster install scattered all over the place, from AD entries to oodles of registry entries on the server itself.
After going through the steps of cleaning up a failed install, we were still presented with the same error when we reran the install. Annoyingly, searching for the error turned up a single page, which pointed to a Microsoft document that merely listed the error messages from SQL setup.
Solution
The solution is so silly, you’d be surprised it wasn’t thought of earlier. The install does a gethostname check and finds the remnant DNS entries from the failed install (and uninstall) still active. The fix is simply to delete those entries and force DNS replication to the other DNS servers. It should be noted that Microsoft’s recommendation for this error is to use a different host name, and I’d usually recommend the same, but our host names are rather unique, and we knew the host name wasn’t being used by anything else.
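Before rerunning setup, it can help to confirm that the old name really has stopped resolving from the cluster node. The sketch below is only an illustration of that check, not what SQL setup does internally: it asks the operating system's resolver for the name using Python's standard `socket.getaddrinfo`. The function names `resolved_addresses` and `describe` are made up for this example, and the `resolver` parameter exists only so the lookup can be swapped out. Keep in mind the assumption that the local resolver reflects what the DNS servers hold; a cached answer or a hosts file entry can still return the name after the record is gone.

```python
import socket


def resolved_addresses(hostname, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated addresses a name resolves to, or [] if none."""
    try:
        results = resolver(hostname, None)
    except socket.gaierror:
        # The resolver could not find the name, so no leftover record
        # is visible from this machine.
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the address as a string.
    return sorted({info[4][0] for info in results})


def describe(hostname, resolver=socket.getaddrinfo):
    """Build a one-line, human-readable summary of the lookup result."""
    addresses = resolved_addresses(hostname, resolver)
    if addresses:
        return f"{hostname} still resolves to: {', '.join(addresses)}"
    return f"{hostname} does not resolve from this machine"
```

Running `print(describe("SQLCluster1"))` on each cluster node after deleting the entries and forcing replication gives a quick indication of whether a stale record is still being served to that node.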
This is one of those “why didn’t I think of that earlier” issues, along with an “it’s always a DNS problem”.