
Solaris 11 ACFS Issues (Doc ID 2469952.1)

Last updated on JULY 20, 2024

Applies to:

Oracle Database - Enterprise Edition - Version 12.2.0.1 and later
Information in this document applies to any platform.

Symptoms

Configuring ACFS on a Solaris SuperCluster reboots the node. The following symptoms are seen:

 

1. ASM reports that the VDBG process is unresponsive and requests that OCSSD fence it:

 

...
2018-10-11T22:48:53.487128-04:00
NOTE: client +APX1:+APX:cluster-crs mounted group 3 (ACFS_TEST)
2018-10-11T22:52:27.903527-04:00
WARNING: client [+APX1:+APX:cluster-crs] not responsive for 212s; state=0x1. killing pid 44537
NOTE: umbilicus traces dumped to /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_gen0_8131.trc
..
2018-10-11T22:57:34.970358-04:00
NOTE: timeout (300s) expired for orphan ownerid 0x100a8 for client +APX1:+APX:cluster-crs; 303s elapsed
WARNING: giving up on client id 0x100a8 [+APX1:+APX:cluster-crs] which has not reconnected for 303 seconds (originally from ASM inst +ASM1, reg:0) [timeout]
NOTE: CSS requested to fence client +APX1:+APX:cluster-crs id 0x100a8 fh:0xffffffff7712cf40
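
When scanning a long ASM alert log for this symptom, a small filter can help. The following is a minimal sketch in Python, not an Oracle-supplied tool: the function name `find_unresponsive_clients` is hypothetical, and the regular expression assumes the warning always has the exact shape shown in the excerpt above.

```python
import re

# Pattern inferred only from the single warning line in the excerpt above;
# other releases may format this message differently (assumption).
UNRESPONSIVE = re.compile(
    r"WARNING: client \[(?P<client>[^\]]+)\] not responsive for (?P<secs>\d+)s; "
    r"state=(?P<state>0x[0-9a-fA-F]+)\. killing pid (?P<pid>\d+)"
)

def find_unresponsive_clients(lines):
    """Return (client, unresponsive seconds, pid) for each matching warning line."""
    hits = []
    for line in lines:
        m = UNRESPONSIVE.search(line)
        if m:
            hits.append((m.group("client"), int(m.group("secs")), int(m.group("pid"))))
    return hits

# Sample input copied from the alert log excerpt above.
sample = [
    "2018-10-11T22:52:27.903527-04:00",
    "WARNING: client [+APX1:+APX:cluster-crs] not responsive for 212s; state=0x1. killing pid 44537",
]
print(find_unresponsive_clients(sample))
```

In practice the list of lines would come from reading the alert log file rather than from an inline sample.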

 

2. OCSSD reports it's not able to fence the process and eventually evicts the node:

2018-10-11 22:57:40.750 : CSSD:23: GM Diagnostics completed for mbrnum/grockname: 15/ASM_CLIENT
2018-10-11 22:57:40.750 : CSSD:23: clssgmmkLocalSendKD: Killing 2 processes
2018-10-11 22:57:40.750 : CSSD:23: clssgmmkLocalSendKD: Copy pid 44474
2018-10-11 22:57:40.750 : CSSD:23: clssgmmkLocalSendKD: Copy pid 44705
2018-10-11 22:57:40.750 : CSSD:23: clssgmcpInitiateFence: local fence finished with status -18

 

...

2018-10-11 23:00:38.892 : CSSD:6: clssgmGetReqByID: found fenceReq 10687f000 reqid 117 node 1
2018-10-11 23:00:38.892 : CSSD:6: clssgmCheckFenceCompleted: found fence req with reqid 117 for node 1
2018-10-11 23:00:38.892 : CSSD:6: clssgmHandleFenceTimeout: killing node 1 at incarnation 422888444
2018-10-11 23:00:38.893 : CSSD:14: clssgmpcFenceEscalate: Processing node kill escalation request from clientID 3:187:28 to remove member number 15
2018-10-11 23:00:38.893 : CSSD:14: clssgmpcFenceEsc: Killing node 1 at incarnation 422888444
2018-10-11 23:00:38.893 : CSSD:14: clssnmMarkNodeForRemoval: node 1, node1 marked for removal
2018-10-11 23:00:38.893 : CSSD:14: clssnmChangeState: oldstate 3 newstate 5 clssnm1.c 2955
2018-10-11 23:00:38.893 : CSSD:14: clssnmKillNode: node 1 (node1) kill initiated
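
To see how long OCSSD waited between the failed local fence and the node kill, the two timestamps can be subtracted. This is an illustrative Python sketch only: the helper name `css_timestamp` is hypothetical, and it assumes every ocssd trace line starts with a 23-character `YYYY-MM-DD HH:MM:SS.mmm` timestamp, as in the excerpt above.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

def css_timestamp(line):
    """Parse the leading 23-character timestamp of an ocssd trace line."""
    return datetime.strptime(line[:23], FMT)

# Both lines are copied from the ocssd trace excerpt above.
fence_failed = ("2018-10-11 22:57:40.750 : CSSD:23: clssgmcpInitiateFence: "
                "local fence finished with status -18")
node_kill = ("2018-10-11 23:00:38.892 : CSSD:6: clssgmHandleFenceTimeout: "
             "killing node 1 at incarnation 422888444")

elapsed = (css_timestamp(node_kill) - css_timestamp(fence_failed)).total_seconds()
print(f"{elapsed:.3f} s between failed local fence and node kill")
```

For this excerpt the gap is roughly 178 seconds, i.e. just under three minutes between the fence attempt returning status -18 and the escalation to a node kill.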

 

3. Around the same time, the ACFS hang manager reports that several processes seem to be stuck:

 

U 8023063.040/181011225100 sched[0/39990] OKSK-00033: WARNING. Possible hung thread tid:0x30407a758b740 name:oracle pid:44474 tsd:0x30404bb72edd8 Volume Number:-1 record sequence count:4
U 8023063.040/181011225100 sched[0/39990] OKSK-00034: WARNING. Possible hung thread tid:0x30407a758b740 total elapsed secs:124 hang elapsed secs:124 record cnt:4 lock cnt:2 function cnt:2 last record removed:None

..

U 8023083.040/181011225120 sched[0/39990] OKSK-00033: WARNING. Possible hung thread tid:0x3040778597580 name:orarootagent.bi pid:5145 tsd:0x304074c5609c8 Volume Number:-1 record sequence count:2
U 8023083.040/181011225120 sched[0/39990] OKSK-00034: WARNING. Possible hung thread tid:0x3040778597580 total elapsed secs:124 hang elapsed secs:124 record cnt:2 lock cnt:1 function cnt:1 last record removed:None

..

U 8023083.040/181011225120 sched[0/39990] OKSK-00033: WARNING. Possible hung thread tid:0x3040668beca80 name:osysmond.bin pid:15062 tsd:0x3040685831a00 Volume Number:-1 record sequence count:1
U 8023083.040/181011225120 sched[0/39990] OKSK-00034: WARNING. Possible hung thread tid:0x3040668beca80 total elapsed secs:140 hang elapsed secs:140 record cnt:1 lock cnt:1 function cnt:0 last record removed:None
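
The OKSK-00033 and OKSK-00034 messages come in pairs that share a thread ID, so they can be joined to get one summary per hung thread. The sketch below is a hedged Python example: the function name `summarize_hangs` is hypothetical, and both regular expressions assume the field order seen in the lines above.

```python
import re

# Field layout inferred from the excerpt above only (assumption).
HUNG_ID = re.compile(
    r"OKSK-00033: .*?tid:(?P<tid>0x[0-9a-f]+) name:(?P<name>\S+) pid:(?P<pid>\d+)"
)
HUNG_TIME = re.compile(
    r"OKSK-00034: .*?tid:(?P<tid>0x[0-9a-f]+) "
    r"total elapsed secs:(?P<total>\d+) hang elapsed secs:(?P<hang>\d+)"
)

def summarize_hangs(lines):
    """Map each thread ID to its process name, pid, and hang duration in seconds."""
    threads = {}
    for line in lines:
        m = HUNG_ID.search(line)
        if m:
            entry = threads.setdefault(m.group("tid"), {})
            entry["name"] = m.group("name")
            entry["pid"] = int(m.group("pid"))
            continue
        m = HUNG_TIME.search(line)
        if m:
            threads.setdefault(m.group("tid"), {})["hang_secs"] = int(m.group("hang"))
    return threads

# Sample input copied from the first pair of messages above.
sample = [
    "U 8023063.040/181011225100 sched[0/39990] OKSK-00033: WARNING. Possible hung thread "
    "tid:0x30407a758b740 name:oracle pid:44474 tsd:0x30404bb72edd8 Volume Number:-1 "
    "record sequence count:4",
    "U 8023063.040/181011225100 sched[0/39990] OKSK-00034: WARNING. Possible hung thread "
    "tid:0x30407a758b740 total elapsed secs:124 hang elapsed secs:124 record cnt:4 "
    "lock cnt:2 function cnt:2 last record removed:None",
]
print(summarize_hangs(sample))
```

Joining on pid is useful here because pid 44474, reported as a hung `oracle` thread, is also one of the two processes OCSSD tried to kill in symptom 2.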

 

4. The Clusterware alert log shows the initial eviction and how the cssdagent found ocssd unresponsive:

 

2018-10-11 23:00:38.897 [OCSSD(5692)]CRS-1607: Node node1 is being evicted in cluster incarnation 422888457; details at (:CSSNM00007:) in /u01/app/grid/diag/crs/node1/crs/trace/ocssd.trc.
2018-10-11 23:01:32.297 [OHASD(3015)]CRS-8011: reboot advisory message from host: node1, component: cssagent, with time stamp: L-2018-10-11-23:01:32.296
2018-10-11 23:01:32.297 [OHASD(3015)]CRS-8013: reboot advisory message text: oracssdagent is rebooting this node due to network timeout (no network activity for 57978 milliseconds).

 

 

Changes

This is a new ACFS implementation on a Database Domain in a SuperCluster.

Cause




