**Update – another cause of this problem at the bottom of this article**
Here’s the problem:
Windows 2008 Failover cluster, running SQL 2008, both Enterprise, x64. I have two separate clusters that were having exactly the same problem.
Sometimes, intermittently, when moving SQL Server app group to another node, the SQL Server(Instance) resource wouldn’t come up. I figured issue not to be related to permissions, for multiple reasons. Sometimes the problem would occur only on one node, then only on the other, and sometimes there would be no issues. In all cases, I was able to start SQL Server service from services.msc, but in that case windows authentication fails and SA is the only account I’d be able to log in with. Furthermore, since I have multiple instances on the same cluster, not all instances were failing, but all instances failed at one point in time or another.
There really is no consistency in what fails when and where, so it was quite a bitch to troubleshoot. However, I did get some [not so] helpful event log entries.
In System logs:
Event 1069 from FailoverClustering: Cluster resource ‘SQL Server (MOSS)’ in clustered service or application ‘SQL Server (MOSS)’ failed.
Event 1205 from FailoverClustering: The Cluster service failed to bring clustered service or application ‘SQL Server (MOSS)’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application.
All errors had the same EventID, 19019 from Failover, but different error text:
[sqsrvres] ODBC sqldriverconnect failed
[sqsrvres] checkODBCConnectError: sqlstate = 28000; native error = 4818; message = [Microsoft][SQL Server Native Client 10.0][SQL Server]Login failed for user ‘NT AUTHORITY\ANONYMOUS LOGON’.
[sqsrvres] OnlineThread: Error connecting to SQL Server.
[sqsrvres] CheckServiceAlive: Service is dead
Most times it was also accompanied by an “Information” log entry, 18456 from Logon: Login failed for user ‘NT AUTHORITY\ANONYMOUS LOGON’. Reason: Token-based server access validation failed with an infrastructure error. Check for previous errors. [CLIENT: 192.168.16.76]
I spent numerous hours reading documentation, forums, and technet articles. Thankfully, we have Microsoft support subscription, and after an hour or two of talking to different agents, I finally got to talk to a tech. The guy seemed to know quite a bit, but by the sounds of it, he’s never touched SQL 2008. It’s got quite a few differences from 2005, so I was a bit worried. He pulled up an article from their internal knowledge base, which I requested to see but was denied, which suggested making a simple registry change on both nodes.
In HKLM\SYSTEM\CurrentControlSet\Control\Lsa\ create REG-DWORD entry called DisableLoopbackCheck and give it value of 1
He couldn’t really tell me what this registry entry does, so I did some hunting on my own. This site has a pretty good explanation, even though it doesn’t directly relate to my problem.
It’s been two weeks now since I did it on our Dev cluster, and the problem hasn’t re-occurred since. Today I’m doing the change the production cluster and see if the problem goes away there too.
Update: It’s been a month now since i fixed our prod clusters, and still no sign of the issue
**Update 2** January 2011
Ran into this issue again, googled the error code and found my own article. This time the problem was slightly different, as the instance wouldn’t start up even with failover to another node, but event log errors were all the same. SQL service could be started manually, but not by the cluster.
Turned out, one of the developers, and they wouldn’t say who as they’re denying they did it, deleted few crucial logins from SQL Security. I started the service manually and added the following logins:
NT SERVICE\MSSQL$SQLSERVER (Where SQLSERVER is the name of the instance)
DisableLoopbackCheck can be avoided by not building cluster nodes from the same image. Essentially, just build both nodes by installing Windows from scratch