SuperCluster - GI and DB homes linked with UDP instead of RDS lead to CSSD reporting "has a disk HB, but no network HB" and "CSSD aborting from thread GMClientListener" (Doc ID 1916992.1)

Last updated on JULY 06, 2016

Applies to:

Oracle Database - Enterprise Edition - Version 11.2.0.3 and later
Solaris SPARC Operating System - Version 11.1 to 11.2 [Release 11.0]
Oracle SuperCluster M6-32 Hardware - Version All Versions and later
Oracle SuperCluster T5-8 Full Rack - Version All Versions and later
SPARC SuperCluster T4-4 - Version All Versions and later
Oracle Solaris on SPARC (64-bit)
Oracle SuperCluster, any version, where the Grid Infrastructure and/or Database Homes were installed without using Java OneCommand (JOC)

Symptoms

RAC CRS services on one or more nodes shut down intermittently and cannot be restarted.

OCSSD Log 

[    CSSD][28]clssnmvDHBValidateNcopy: node 1, rmb-zpr-db-fin1, has a disk HB, but no network HB, DHB has rcfg 300902083, wrtcnt, 4912592, LATS 1807588635, lastSeqNo 4912589, uniqueness 1405754628, timestamp 1405999042/1899489711
[    CSSD][37]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
[    CSSD][28]clssnmvDHBValidateNcopy: node 1, rmb-zpr-db-fin1, has a disk HB, but no network HB, DHB has rcfg 300902083, wrtcnt, 4912595, LATS 1807589637, lastSeqNo 4912592, uniqueness 1405754628, timestamp 1405999043/1899490711
[    CSSD][37]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
[    CSSD][28]clssnmvDHBValidateNcopy: node 1, rmb-zpr-db-fin1, has a disk HB, but no network HB, DHB has rcfg 300902083, wrtcnt, 4912598, LATS 1807590638, lastSeqNo 4912595, uniqueness 1405754628, timestamp 1405999044/1899491712
[    CSSD][5]clssgmExecuteClientRequest: MAINT recvd from proc 2 (100e55210)

[    CSSD][5]clssgmShutDown: Received abortive shutdown request from client.
[    CSSD][5]###################################
[    CSSD][5]clssscExit: CSSD aborting from thread GMClientListener
[    CSSD][5]###################################
[    CSSD][5](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
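As a rough aid for spotting this pattern in a long ocssd.log, the following minimal sketch counts the "has a disk HB, but no network HB" messages per node and flags the GMClientListener abort. The function name `summarize_ocssd` is hypothetical, and the matching relies only on the message text shown in the excerpt above; other CSSD log formats may need a different pattern.

```python
import re
from collections import Counter

# Matches the CSSD message text shown in the log excerpt above.
MISSING_NET_HB = re.compile(
    r"clssnmvDHBValidateNcopy: node (\d+), ([^,]+), "
    r"has a disk HB, but no network HB"
)
ABORT_MARKER = "clssscExit: CSSD aborting from thread GMClientListener"


def summarize_ocssd(lines):
    """Return (per-node count of missed network heartbeats, abort seen?).

    `lines` is any iterable of log lines, for example an open file.
    Keys of the returned Counter are (node_number, node_name) tuples.
    """
    missed = Counter()
    aborted = False
    for line in lines:
        match = MISSING_NET_HB.search(line)
        if match:
            missed[(int(match.group(1)), match.group(2).strip())] += 1
        elif ABORT_MARKER in line:
            aborted = True
    return missed, aborted
```

Passing an open log file to `summarize_ocssd` returns the counts without loading the whole file into memory. A steadily growing count for one node followed by the abort marker matches the symptom described here.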

 

One node will usually remain up, typically the master node. On that node, the following command will start to show a rapidly growing number of Idle connections on the private interconnect. Typically the other RAC nodes start to be evicted when the command below reaches around 2200 idle connections.

Please note that if the GI/DB in question runs in Oracle Solaris Zones, you must run the netstat command below from within the local zone (non-global zone). If the GI/DB in question runs at the LDom level, run it from the global zone.
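To track the idle-connection count over time, a minimal sketch such as the following could be applied to saved netstat output. It assumes the connection state is the last whitespace-separated column and reads "Idle" for the sockets of interest. The names `count_idle` and `near_eviction`, and the optional address-prefix filter, are hypothetical. The 2200 figure is the approximate value quoted above, not a hard limit.

```python
# Approximate idle-connection count at which evictions were observed (from the text above).
EVICTION_THRESHOLD = 2200


def count_idle(netstat_lines, address_prefix=""):
    """Count lines whose last column is 'Idle'.

    If `address_prefix` is given, only lines whose first column (the local
    address) starts with that prefix are counted, e.g. the private
    interconnect subnet.
    """
    count = 0
    for line in netstat_lines:
        fields = line.split()
        if (
            len(fields) >= 2
            and fields[-1] == "Idle"
            and fields[0].startswith(address_prefix)
        ):
            count += 1
    return count


def near_eviction(idle_count, threshold=EVICTION_THRESHOLD):
    """Return True once the idle count reaches the assumed threshold."""
    return idle_count >= threshold
```

Sampling the count every few seconds and comparing successive values shows whether the number of idle connections is accumulating rapidly, which is the behaviour described above.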

 

If you see multiple entries for skgxp functions, that is indicative of the problem. A few every now and again is not bad, but more than 10 or so in the matter of a few seconds is bad. In this case I returned over 100 matching calls in 5 seconds.

43317  oracle
              libc.so.1`_so_bind+0x4
              libskgxp11.so`sskgxp_createport+0x2fc
              libskgxp11.so`_$o1cexiH0.skgxpicini+0x770
              libskgxp11.so`skgxpcini_with_stats+0x174
              oracle`ksxposdcini+0x32e0
              oracle`ksxppluginosd+0x1308
              oracle`ksxp_open+0x58c
              oracle`ksucrp+0x9f0
              oracle`opiino+0x5b4
              oracle`opiodr+0x48c
              oracle`opidrv+0x408
              oracle`sou2o+0x58
              oracle`opimai_real+0x1f8
              oracle`ssthrdmain+0x13c
              oracle`main+0x13c
              oracle`_start+0x17c
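Counting such stacks by hand is tedious when there are a hundred of them. The following minimal sketch counts stack records that contain at least one skgxp frame. It assumes each record starts with a header line holding a numeric process ID and a process name, followed by frame lines of the form module`function+offset, as in the sample above. The function names are hypothetical, and the threshold of 10 is the rough figure given above.

```python
import re

# Header line of a stack record: "<pid>  <process name>", as in the sample above.
HEADER = re.compile(r"^\s*(\d+)\s+(\S+)\s*$")


def count_skgxp_stacks(lines):
    """Count stack records containing at least one frame that mentions skgxp."""
    hits = 0
    in_record = False
    matched = False
    for line in lines:
        if HEADER.match(line):
            # A new record begins; reset the per-record flag.
            in_record = True
            matched = False
        elif in_record and not matched and "`" in line and "skgxp" in line:
            # Count each record only once, however many skgxp frames it has.
            matched = True
            hits += 1
    return hits


def looks_problematic(hits, threshold=10):
    """Return True when more than `threshold` matching stacks were captured."""
    return hits > threshold
```

The threshold only makes sense when the capture covers a few seconds, so the capture window should be kept short and consistent between runs.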

 

Changes

 The environment has Grid Infrastructure and Database Homes that

Cause
