Solaris Cluster Upgrade to 3.3 5/11 update1 may cause the node/server/system to panic with 'Failfast: timeout - unit "failfast_now"' and rgmd to dump core (Doc ID 1498039.1)

Last updated on JANUARY 25, 2017

Applies to:

Solaris Cluster Geographic Edition - Version 3.2U2 2/09 to 3.3U1 5/11 [Release 3.2 to 3.3]
Solaris Cluster - Version 3.2U2 2/09 to 3.3U1 5/11 [Release 3.2 to 3.3]
Oracle Solaris on SPARC (64-bit)
Oracle Solaris on x86-64 (64-bit)

Symptoms

The general problem is that one node cannot join the running node(s) of a Solaris Cluster after the upgrade to 3.3 5/11 update1.
The joining node always panics with 'Failfast: timeout - unit "failfast_now"'.
The node which fails to join the cluster can, however, still be booted in non-cluster mode, as shown below.
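
A minimal sketch of booting the failing node in non-cluster mode (standard Solaris Cluster procedure; the exact GRUB menu entry varies per system):

On SPARC, from the OpenBoot PROM prompt:

ok boot -x

On x86, interrupt GRUB at boot, edit the menu entry, and append -x to the kernel line, for example:

kernel /platform/i86pc/multiboot -x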
Another observed symptom is that cluster SMF services (svc:/system/cluster/cl-svc-cluster-milestone:default and svc:/system/cluster/cznetd:default) went into maintenance after the upgrade.
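
The maintenance state can be inspected with the standard SMF commands; a minimal sketch, using the FMRIs named above (clear the services only after the underlying fault has been addressed):

# svcs -x svc:/system/cluster/cl-svc-cluster-milestone:default
# svcs -x svc:/system/cluster/cznetd:default

# svcadm clear svc:/system/cluster/cl-svc-cluster-milestone:default
# svcadm clear svc:/system/cluster/cznetd:default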


The issue can occur in different scenarios:
A) Using Live Upgrade to upgrade from Solaris Cluster 3.1 to Solaris Cluster 3.3 5/11 update1.
B) The Solaris Cluster was upgraded multiple times and has missing entries in the Cluster Configuration Repository (CCR).
In one such case the Solaris Cluster was upgraded from 3.1 to 3.2u1, then from 3.2u1 to 3.2u3, and finally from 3.2u3 to 3.3 5/11 update1; the installed release on each node can be verified as shown below.
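
A minimal sketch for confirming the installed Solaris Cluster release and patch level after each upgrade step (scinstall -pv is the standard reporting option; run it on every node):

# /usr/cluster/bin/scinstall -pv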

In both scenarios the panic will look like this:
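
An illustrative reconstruction of the console message, assuming the standard Solaris panic format (the cpu and thread values below are placeholders; only the panic string itself is taken from this article):

panic[cpu0]/thread=2a100047ca0: Failfast: timeout - unit "failfast_now"

The panic is followed by a stack trace and, if dumps are configured, a crash dump.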



Cause
