Node Evicted and not Re-Joining Although OS ping/traceroute Works (Doc ID 2029731.1)

Last updated on JANUARY 07, 2017

Applies to:

Oracle Database - Enterprise Edition - Version 12.1.0.2 and later
Information in this document applies to any platform.

Symptoms

In a 12.1.0.2 Grid Infrastructure cluster, node2 was evicted and cannot rejoin the cluster.


GI alert.log on node2:

2015-06-17 09:28:46.357 [OCSSD(6454)]CRS-1612: Network communication with node n24-37 (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.504 seconds
2015-06-17 09:28:53.365 [OCSSD(6454)]CRS-1611: Network communication with node n24-37 (1) missing for 75% of timeout interval. Removal of this node from cluster in 7.495 seconds
2015-06-17 09:28:58.371 [OCSSD(6454)]CRS-1610: Network communication with node n24-37 (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.489 seconds
2015-06-17 09:29:00.862 [OCSSD(6454)]CRS-1609: This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /var/opt/oracle/diag/crs/o24-37/crs/trace/ocssd.trc.
2015-06-17 09:29:00.862 [OCSSD(6454)]CRS-1656: The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /var/opt/oracle/diag/crs/o24-37/crs/trace/ocssd.trc
2015-06-17 09:29:00.942 [OCSSD(6454)]CRS-1652: Starting clean up of CRSD resources.
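
Each countdown warning above carries enough information to project the eviction time: the log timestamp plus the remaining seconds. The following Python sketch is not part of the original note. It parses the three warnings and adds the two values together. The function name `projected_evictions` and the regular expression are illustrative assumptions derived only from the message format shown in this excerpt.

```python
import re
from datetime import datetime, timedelta

# Assumed pattern for the CSS countdown warnings (CRS-1612/1611/1610),
# based solely on the alert.log lines quoted in this note.
WARNING_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[OCSSD\(\d+\)\](?P<code>CRS-161[012]): .*?"
    r"missing for (?P<pct>\d+)% of timeout interval\.\s+"
    r"Removal of this node from cluster in (?P<left>\d+\.\d+) seconds"
)

# The three warnings from the node2 alert.log excerpt, copied verbatim.
SAMPLE_ALERT_LINES = [
    "2015-06-17 09:28:46.357 [OCSSD(6454)]CRS-1612: Network communication with node n24-37 (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.504 seconds",
    "2015-06-17 09:28:53.365 [OCSSD(6454)]CRS-1611: Network communication with node n24-37 (1) missing for 75% of timeout interval. Removal of this node from cluster in 7.495 seconds",
    "2015-06-17 09:28:58.371 [OCSSD(6454)]CRS-1610: Network communication with node n24-37 (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.489 seconds",
]


def projected_evictions(lines):
    """Return (message code, percent elapsed, projected eviction time) per warning."""
    results = []
    for line in lines:
        m = WARNING_RE.search(line)
        if m is None:
            continue  # not a countdown warning
        logged = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S.%f")
        projected = logged + timedelta(seconds=float(m["left"]))
        results.append((m["code"], int(m["pct"]), projected))
    return results


if __name__ == "__main__":
    for code, pct, when in projected_evictions(SAMPLE_ALERT_LINES):
        print(code, f"{pct}%", when.isoformat(sep=" ", timespec="milliseconds"))
```

For the excerpt above, all three warnings project an eviction within a millisecond or two of 09:29:00.86, which agrees with the CRS-1609 message logged at 09:29:00.862.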


After that, CSS repeatedly tries to restart but is unable to join the cluster:

2015-06-17 09:29:45.603 [OCSSD(6028)]CRS-1707: Lease acquisition for node o24-37 number 2 completed
2015-06-17 09:29:46.744 [OCSSD(6028)]CRS-1605: CSSD voting file is online: /dev/rdsk/c0t600144F0F2BE16290000531F5DFD0001d0s0; details in /var/opt/oracle/diag/crs/o24-37/crs/trace/ocssd.trc.
2015-06-17 09:29:46.753 [OCSSD(6028)]CRS-1605: CSSD voting file is online: /dev/rdsk/c0t600144F0F305823C0000531F414B0001d0s0; details in /var/opt/oracle/diag/crs/o24-37/crs/trace/ocssd.trc.
2015-06-17 09:29:46.764 [OCSSD(6028)]CRS-1672: The number of voting files currently available 2 has fallen to the minimum number of voting files required 2.
....
2015-06-17 09:39:25.567 [CSSDAGENT(6026)]CRS-5818: Aborted command 'start' for resource 'ora.cssd'. Details at (:CRSAGF00113:) {0:13:182} in /var/opt/oracle/diag/crs/o24-37/crs/trace/ohasd_cssdagent_root.trc.
2015-06-17 09:39:25.568 [OCSSD(6028)]CRS-1656: The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /var/opt/oracle/diag/crs/o24-37/crs/trace/ocssd.trc



ocssd.trc on node2 shows a disk heartbeat but no network heartbeat from node1:

....
2015-06-18 13:08:26.817464 :GIPCHALO:18: gipchaLowerSendEstablish: sending establish message for node '1024f6990 { host 'n24-37', haName '8936-98e3-f5fe-08a9', srcLuid 6b2661f2-fa18104b, dstLuid 00000000-00000000 numInf 1, sentRegister 1, localMonitor 0, baseStream 105433890 type gipchaNodeType12001 (20), nodeIncarnation 00000000-259bd8e6 incarnation 2 flags 0x102804}'
2015-06-18 13:08:26.845459 : CSSD:36: clssnmvDHBValidateNCopy: node 1, n24-37, has a disk HB, but no network HB, DHB has rcfg 329887389, wrtcnt, 491874, LATS 1081916648, lastSeqNo 491872, uniqueness 1434421286, timestamp 1434632906/842617522
2015-06-18 13:08:27.069792 : CSSD:39: clssnmSendingThread: Connection pending for node n24-37, number 1, flags 0x00000002
2015-06-18 13:08:27.569341 : CSSD:11: clssscWaitOnEventValue: after CmInfo State val 3, eval 1 waited 1001 with cvtimewait status 4294967234
2015-06-18 13:08:27.573076 : CSSD:37: clssnmvDiskCheck: No voting file found for guid 9e986478-3fe84f02-bfb1638f-0bd045ce
2015-06-18 13:08:27.734610 : CSSD:33: clssnmvDHBValidateNCopy: node 1, n24-37, has a disk HB, but no network HB, DHB has rcfg 329887389, wrtcnt, 491875, LATS 1081917537, lastSeqNo 491873, uniqueness 1434421286, timestamp 1434632907/842618109
2015-06-18 13:08:27.735430 : CSSD:6: clsssc_CLSFAInit_CB: System not ready for CLSFA initialization
2015-06-18 13:08:27.735441 : CSSD:6: clsssc_CLSFAInit_CB: clsfa fencing not ready yet
2015-06-18 13:08:27.788029 : CSSD:4: clssscagProcAgReq: shutdown abort requested by the agent
2015-06-18 13:08:27.788045 : CSSD:4: clssnmRemoveNodeInTerm: node 2, o24-37 terminated. Removing from its own member and connected bitmaps
2015-06-18 13:08:27.788128 : CSSD:4: ###################################
2015-06-18 13:08:27.788133 : CSSD:4: clssscExit: CSSD aborting from thread clssscAgListener



The ocssd log on node1 shows that the CSS gipchaWorkerWork thread is stuck after the initial eviction (its heartbeat messages stop while gipchaDaemonWork continues to report):

2015-06-17 09:27:38.034696 :GIPCHTHR:18: gipchaWorkerWork: workerThread heart beat, time interval since last heartBeat 30045loopCount 137
2015-06-17 09:28:08.076065 :GIPCHTHR:19: gipchaDaemonWork: DaemonThread heart beat, time interval since last heartBeat 30044loopCount 42
2015-06-17 09:28:08.076240 :GIPCHTHR:18: gipchaWorkerWork: workerThread heart beat, time interval since last heartBeat 30044loopCount 157
2015-06-17 09:28:38.115560 :GIPCHTHR:19: gipchaDaemonWork: DaemonThread heart beat, time interval since last heartBeat 30043loopCount 38
....
....
2015-06-17 09:29:38.260191 :GIPCHTHR:19: gipchaDaemonWork: DaemonThread heart beat, time interval since last heartBeat 30114loopCount 37
2015-06-17 09:30:08.293073 :GIPCHTHR:19: gipchaDaemonWork: DaemonThread heart beat, time interval since last heartBeat 30035loopCount 30
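
A stalled thread can be spotted in this kind of trace by comparing the last heartbeat of each GIPCHTHR thread with the newest heartbeat seen from any thread. The Python sketch below is an illustration only and is not part of the original note. The names `last_heartbeats` and `stalled_threads`, the regular expression, and the 60-second threshold (twice the roughly 30-second interval visible in the excerpt) are all assumptions. Because the excerpt contains elided lines (`....`), the result describes only the lines shown.

```python
import re
from datetime import datetime

# Assumed pattern for GIPCHTHR heartbeat lines, based on the excerpt above.
HEARTBEAT_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) "
    r":GIPCHTHR:(?P<tid>\d+): (?P<func>\w+): \w+ heart beat"
)

# Heartbeat lines from the node1 excerpt, copied verbatim (elided lines omitted).
SAMPLE_TRACE_LINES = [
    "2015-06-17 09:27:38.034696 :GIPCHTHR:18: gipchaWorkerWork: workerThread heart beat, time interval since last heartBeat 30045loopCount 137",
    "2015-06-17 09:28:08.076065 :GIPCHTHR:19: gipchaDaemonWork: DaemonThread heart beat, time interval since last heartBeat 30044loopCount 42",
    "2015-06-17 09:28:08.076240 :GIPCHTHR:18: gipchaWorkerWork: workerThread heart beat, time interval since last heartBeat 30044loopCount 157",
    "2015-06-17 09:28:38.115560 :GIPCHTHR:19: gipchaDaemonWork: DaemonThread heart beat, time interval since last heartBeat 30043loopCount 38",
    "2015-06-17 09:29:38.260191 :GIPCHTHR:19: gipchaDaemonWork: DaemonThread heart beat, time interval since last heartBeat 30114loopCount 37",
    "2015-06-17 09:30:08.293073 :GIPCHTHR:19: gipchaDaemonWork: DaemonThread heart beat, time interval since last heartBeat 30035loopCount 30",
]


def last_heartbeats(lines):
    """Map each reporting function name to the timestamp of its last heartbeat."""
    last = {}
    for line in lines:
        m = HEARTBEAT_RE.search(line)
        if m is None:
            continue
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S.%f")
        if m["func"] not in last or ts > last[m["func"]]:
            last[m["func"]] = ts
    return last


def stalled_threads(lines, threshold_seconds=60.0):
    """Return {function: seconds behind newest heartbeat} for lagging threads.

    threshold_seconds is a hypothetical cutoff, not an Oracle-defined value.
    """
    last = last_heartbeats(lines)
    if not last:
        return {}
    newest = max(last.values())
    lag = {name: (newest - ts).total_seconds() for name, ts in last.items()}
    return {name: secs for name, secs in lag.items() if secs > threshold_seconds}


if __name__ == "__main__":
    for name, secs in stalled_threads(SAMPLE_TRACE_LINES).items():
        print(f"{name}: no heartbeat for {secs:.1f} s")
```

Run against the lines above, only gipchaWorkerWork is reported: its last heartbeat is at 09:28:08, while gipchaDaemonWork is still reporting at 09:30:08.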


pstack of ocssd.bin on node1 shows the worker thread stuck in the OS send() call:

----------------- lwp# 18 / thread# 18 --------------------
ffffffff7e9e9068 __so_send (64, ffffffff798f9367, 1, ffffffff7ac1b640, ffffffff7ac1b640, ff0000) + c
ffffffff7ef0dc30 send (64, ffff7fff, 1, 0, ffff7c00, fffc00) + 1c
ffffffff3590046c sgipcwEpollPost (ffffffff798f95d8, 1021289a8, 10232c8d0, 1, 0, ff0000) + 40c
ffffffff358fb3ec sgipcwPost (ffffffff798f95d8, 1021289a8, 10232c8d0, 1, 0, fffc00) + 94
ffffffff357070b8 gipcEndpointTriggerReadyF (ffffffff798fa158, 102b142d0, ffffffff35f6fe09, ffffffff35fd0168, 732, 727) + 5c0
ffffffff3573669c gipcWaitProcessEndpoint (ffffffff798fa158, 102b142d0, 1, ffffffff798fa778, 1, ffffffff798fa774) + 784
ffffffff35737fd4 gipcInternalWaitEpoll (ffffffff798fa158, 102b142d0, ffffffff798fa778, 1, ffffffff798fa774, ffffffff) + f4c
ffffffff35731420 gipcInternalWait (ffffffff, 102b142d0, ffffffff35f6fd71, ffffffff35fd0168, 3df, ffffffff) + 1c40
ffffffff356cb238 gipcWaitF (ffffffff, ffffffff35fd0168, ffffffff35f6fd71, ffffffff35fd0168, 3df, 1) + b58
ffffffff35724500 gipcInternalSendSync (ffffffff798fa774, ffffffff, ffffffff35fd5270, ffffffff, 0, 1112f5210) + 240
ffffffff356c90f4 gipcSendSyncF (1112f5210, 11427f068, 114, ffffffff798fb110, ffffffff, 11427f068) + c74
ffffffff357c33dc gipchaLowerInternalSend (11427f068, ffffffff798fb110, 10964bcb0, 1, 0, fffc00) + 9ec
ffffffff357d3e04 gipchaLowerProcessReadyQ (ffffffff798fbac8, 100939010, 1024f6050, 1, 0, ff0000) + dc
ffffffff357dcaf8 gipchaLowerProcessHaStream (ffffffff798fbac8, 100939010, 1024f6050, 1, 0, fffc00) + 88
ffffffff357dd380 gipchaLowerProcessStreams (ffffffff798fbac8, 100939010, ffffffff7eb86d40, 1, 0, fffc00) + 840
ffffffff357deab0 gipchaLowerDoWork (ffffffff798fbac8, 100939010, ffffffff7eb86d40, 1, 0, fffc00) + fb8
ffffffff35cfd228 gipchaWorkerLowerLayer (ffffffff798fbac8, 100939878, 0, 0, ffffffff36041588, ffffffff798fb780) + 20
ffffffff35cfdf94 gipchaWorkerWork (9d, 100939878, 1, ffffffff7ac1b640, 0, fffc00) + 574
ffffffff35d00fa4 gipchaWorkerThread (ffffffff798fbac8, 100939010, 10, ffffffff36040cc0, ffffffff7ac1b640, 2944) + adc
ffffffff35ce1f9c gipchaWorkerThreadEntry (100939010, 0, 0, ffffffff35ce1b10, 0, 1) + 48c
ffffffff7e9e4ab8 _lwp_start (0, 0, 0, 0, 0, 0)
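
When a pstack dump contains many threads, it can help to reduce each one to its list of function names and look at the innermost frame. The Python sketch below is illustrative only and is not part of the original note. The name `stacks_by_lwp` and both regular expressions are assumptions based on the Solaris pstack layout shown above, and the sample uses a subset of the frames from that stack.

```python
import re

# Assumed patterns for the pstack layout quoted in this note.
LWP_RE = re.compile(r"-+\s+lwp#\s+(?P<lwp>\d+)\s+/\s+thread#\s+\d+\s+-+")
FRAME_RE = re.compile(r"^\s*[0-9a-f]+\s+(?P<func>\w+)\s*\(")

# Header and selected frames from the stack above, copied verbatim.
SAMPLE_PSTACK_LINES = [
    "----------------- lwp# 18 / thread# 18 --------------------",
    "ffffffff7e9e9068 __so_send (64, ffffffff798f9367, 1, ffffffff7ac1b640, ffffffff7ac1b640, ff0000) + c",
    "ffffffff7ef0dc30 send (64, ffff7fff, 1, 0, ffff7c00, fffc00) + 1c",
    "ffffffff3590046c sgipcwEpollPost (ffffffff798f95d8, 1021289a8, 10232c8d0, 1, 0, ff0000) + 40c",
    "ffffffff35cfdf94 gipchaWorkerWork (9d, 100939878, 1, ffffffff7ac1b640, 0, fffc00) + 574",
    "ffffffff7e9e4ab8 _lwp_start (0, 0, 0, 0, 0, 0)",
]


def stacks_by_lwp(lines):
    """Map each lwp number to its function names, innermost frame first."""
    stacks = {}
    current = None
    for line in lines:
        header = LWP_RE.search(line)
        if header:
            current = int(header["lwp"])
            stacks[current] = []
            continue
        frame = FRAME_RE.match(line)
        if frame and current is not None:
            stacks[current].append(frame["func"])
    return stacks


if __name__ == "__main__":
    for lwp, funcs in stacks_by_lwp(SAMPLE_PSTACK_LINES).items():
        print(f"lwp {lwp}: innermost frame {funcs[0]}, {len(funcs)} frames parsed")
```

For the sample, lwp 18 has `__so_send` as its innermost frame and `gipchaWorkerWork` further up, matching the observation that the worker thread is blocked in send.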

 



Cause
