The Exchange Team blog has published an article recommending all DAG members running Windows 2008 R2 to have a few hotfixes installed. Microsoft has encountered scenarios in which a transient network failure occurs (a failure of network communications for about 60 seconds) and as a result, the entire cluster is deadlocked and all databases are within the DAG are dismounted. Since it is not very easy to determine which cluster node is actually deadlocked, if a failover cluster deadlocks as a result of the reconnect logic race, the only available course of action is to restart all members within the entire cluster to resolve the deadlock condition.
The following KBs list the recommended hotfixes .
- KB2550886 – A transient communication failure causes a Windows Server 2008 R2 failover cluster to stop working
- http://support.microsoft.com/kb/2549472 – Cluster node cannot rejoin the cluster after the node is restarted or removed from the cluster in Windows Server 2008 R2
- http://support.microsoft.com/kb/2549448 – Cluster service still uses the default time-out value after you configure the regroup time-out setting in Windows Server 2008 R2
- http://support.microsoft.com/kb/2552040 – A Windows Server 2008 R2 failover cluster loses quorum when an asymmetric communication fail.
Read full article @ source