Solaris 11 ACFS Issues
(Doc ID 2469952.1)
Last updated on JULY 20, 2024
Applies to:
Oracle Database - Enterprise Edition - Version 12.2.0.1 and later
Information in this document applies to any platform.
Symptoms
Configuring ACFS on a Solaris SuperCluster reboots the node, with the following symptoms:
1. ASM reports that the VDBG process is unresponsive and requests that OCSSD fence it:
2018-10-11T22:48:53.487128-04:00
NOTE: client +APX1:+APX:cluster-crs mounted group 3 (ACFS_TEST)
2018-10-11T22:52:27.903527-04:00
WARNING: client [+APX1:+APX:cluster-crs] not responsive for 212s; state=0x1. killing pid 44537
NOTE: umbilicus traces dumped to /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_gen0_8131.trc
..
2018-10-11T22:57:34.970358-04:00
NOTE: timeout (300s) expired for orphan ownerid 0x100a8 for client +APX1:+APX:cluster-crs; 303s elapsed
WARNING: giving up on client id 0x100a8 [+APX1:+APX:cluster-crs] which has not reconnected for 303 seconds (originally from ASM inst +ASM1, reg:0) [timeout]
NOTE: CSS requested to fence client +APX1:+APX:cluster-crs id 0x100a8 fh:0xffffffff7712cf40
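To check whether another ASM alert log contains the same signature, the WARNING line can be matched with a regular expression. The following is a minimal Python sketch, not an Oracle-supplied tool: the helper name `find_unresponsive_clients` is hypothetical, and the pattern is inferred only from the single WARNING line in the excerpt above, so other releases may word the message differently.

```python
import re

# Pattern inferred from the one WARNING line in the excerpt above (assumption:
# the message wording is stable across releases).
UNRESPONSIVE = re.compile(
    r"WARNING: client \[(?P<client>[^\]]+)\] not responsive for "
    r"(?P<secs>\d+)s; state=(?P<state>0x[0-9a-fA-F]+)\. killing pid (?P<pid>\d+)"
)


def find_unresponsive_clients(lines):
    """Return (client, seconds, pid) for each matching warning (hypothetical helper)."""
    hits = []
    for line in lines:
        m = UNRESPONSIVE.search(line)
        if m:
            hits.append((m.group("client"), int(m.group("secs")), int(m.group("pid"))))
    return hits


# Two lines copied from the excerpt; only the second is a match.
sample = [
    "NOTE: client +APX1:+APX:cluster-crs mounted group 3 (ACFS_TEST)",
    "WARNING: client [+APX1:+APX:cluster-crs] not responsive for 212s; state=0x1. killing pid 44537",
]
print(find_unresponsive_clients(sample))
```

In practice the input would be the lines of the ASM alert log rather than the hard-coded sample.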
2. OCSSD reports it's not able to fence the process and eventually evicts the node:
2018-10-11 22:57:40.750 : CSSD:23: GM Diagnostics completed for mbrnum/grockname: 15/ASM_CLIENT
2018-10-11 22:57:40.750 : CSSD:23: clssgmmkLocalSendKD: Killing 2 processes
2018-10-11 22:57:40.750 : CSSD:23: clssgmmkLocalSendKD: Copy pid 44474
2018-10-11 22:57:40.750 : CSSD:23: clssgmmkLocalSendKD: Copy pid 44705
2018-10-11 22:57:40.750 : CSSD:23: clssgmcpInitiateFence: local fence finished with status -18
...
2018-10-11 23:00:38.892 : CSSD:6: clssgmGetReqByID: found fenceReq 10687f000 reqid 117 node 1
2018-10-11 23:00:38.892 : CSSD:6: clssgmCheckFenceCompleted: found fence req with reqid 117 for node 1
2018-10-11 23:00:38.892 : CSSD:6: clssgmHandleFenceTimeout: killing node 1 at incarnation 422888444
2018-10-11 23:00:38.893 : CSSD:14: clssgmpcFenceEscalate: Processing node kill escalation request from clientID 3:187:28 to remove member number 15
2018-10-11 23:00:38.893 : CSSD:14: clssgmpcFenceEsc: Killing node 1 at incarnation 422888444
2018-10-11 23:00:38.893 : CSSD:14: clssnmMarkNodeForRemoval: node 1, node1 marked for removal
2018-10-11 23:00:38.893 : CSSD:14: clssnmChangeState: oldstate 3 newstate 5 clssnm1.c 2955
2018-10-11 23:00:38.893 : CSSD:14: clssnmKillNode: node 1 (node1) kill initiated
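The gap between the failed local fence and the node kill can be read off the trace timestamps. The sketch below is an illustration only: the helper names `trace_time` and `seconds_between` are hypothetical, and the timestamp layout is assumed from the excerpt above. It computes the elapsed time between two marker strings in the OCSSD trace.

```python
from datetime import datetime

# Timestamp layout as it appears in the excerpt (assumption: 23 leading characters).
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def trace_time(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.mmm' timestamp of a trace line."""
    return datetime.strptime(line[:23], TS_FORMAT)


def seconds_between(lines, first_marker, second_marker):
    """Seconds from the first line containing first_marker to the first line
    containing second_marker. Raises StopIteration if a marker is absent."""
    start = next(trace_time(l) for l in lines if first_marker in l)
    end = next(trace_time(l) for l in lines if second_marker in l)
    return (end - start).total_seconds()


# Two lines copied from the excerpt above.
trace = [
    "2018-10-11 22:57:40.750 : CSSD:23: clssgmcpInitiateFence: local fence finished with status -18",
    "2018-10-11 23:00:38.892 : CSSD:6: clssgmHandleFenceTimeout: killing node 1 at incarnation 422888444",
]
print(seconds_between(trace, "clssgmcpInitiateFence", "clssgmHandleFenceTimeout"))
```

For the two lines shown, the result is about 178 seconds between the failed fence and the fence timeout.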
3. Around the same time, the ACFS hang manager reports that several processes seem to be stuck:
U 8023063.040/181011225100 sched[0/39990] OKSK-00033: WARNING. Possible hung thread tid:0x30407a758b740 name:oracle pid:44474 tsd:0x30404bb72edd8 Volume Number:-1 record sequence count:4
U 8023063.040/181011225100 sched[0/39990] OKSK-00034: WARNING. Possible hung thread tid:0x30407a758b740 total elapsed secs:124 hang elapsed secs:124 record cnt:4 lock cnt:2 function cnt:2 last record removed:None
..
U 8023083.040/181011225120 sched[0/39990] OKSK-00033: WARNING. Possible hung thread tid:0x3040778597580 name:orarootagent.bi pid:5145 tsd:0x304074c5609c8 Volume Number:-1 record sequence count:2
U 8023083.040/181011225120 sched[0/39990] OKSK-00034: WARNING. Possible hung thread tid:0x3040778597580 total elapsed secs:124 hang elapsed secs:124 record cnt:2 lock cnt:1 function cnt:1 last record removed:None
..
U 8023083.040/181011225120 sched[0/39990] OKSK-00033: WARNING. Possible hung thread tid:0x3040668beca80 name:osysmond.bin pid:15062 tsd:0x3040685831a00 Volume Number:-1 record sequence count:1
U 8023083.040/181011225120 sched[0/39990] OKSK-00034: WARNING. Possible hung thread tid:0x3040668beca80 total elapsed secs:140 hang elapsed secs:140 record cnt:1 lock cnt:1 function cnt:0 last record removed:None
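Each hung thread is reported as a pair of messages: OKSK-00033 carries the process name and pid, and OKSK-00034 carries the elapsed times for the same tid. A minimal Python sketch for joining the two is shown below. The helper name `summarize_hung_threads` is hypothetical and the field layout is assumed from the excerpt above only.

```python
import re

# Field layout inferred from the excerpt above (assumption).
HUNG = re.compile(
    r"OKSK-00033: .*?tid:(?P<tid>0x[0-9a-f]+) name:(?P<name>\S+) pid:(?P<pid>\d+)"
)
ELAPSED = re.compile(
    r"OKSK-00034: .*?tid:(?P<tid>0x[0-9a-f]+) .*?hang elapsed secs:(?P<secs>\d+)"
)


def summarize_hung_threads(lines):
    """Return (name, pid, hang_secs) per tid, in order of first appearance.

    hang_secs is None when no OKSK-00034 line was seen for that tid.
    """
    threads = {}  # tid -> [name, pid, hang_secs]
    for line in lines:
        m = HUNG.search(line)
        if m:
            threads.setdefault(m.group("tid"), [m.group("name"), int(m.group("pid")), None])
            continue
        m = ELAPSED.search(line)
        if m and m.group("tid") in threads:
            threads[m.group("tid")][2] = int(m.group("secs"))
    return [tuple(v) for v in threads.values()]


# One message pair copied from the excerpt above.
sample = [
    "U 8023083.040/181011225120 sched[0/39990] OKSK-00033: WARNING. Possible hung thread tid:0x3040668beca80 name:osysmond.bin pid:15062 tsd:0x3040685831a00 Volume Number:-1 record sequence count:1",
    "U 8023083.040/181011225120 sched[0/39990] OKSK-00034: WARNING. Possible hung thread tid:0x3040668beca80 total elapsed secs:140 hang elapsed secs:140 record cnt:1 lock cnt:1 function cnt:0 last record removed:None",
]
print(summarize_hung_threads(sample))
```

Comparing the reported pids with those listed by OCSSD in `clssgmmkLocalSendKD` shows whether the processes that could not be fenced are the same ones flagged as hung (pid 44474 appears in both excerpts).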
4. The Clusterware alert log shows the initial eviction and how cssdagent found ocssd unresponsive:
2018-10-11 23:00:38.897 [OCSSD(5692)]CRS-1607: Node node1 is being evicted in cluster incarnation 422888457; details at (:CSSNM00007:) in /u01/app/grid/diag/crs/node1/crs/trace/ocssd.trc.
2018-10-11 23:01:32.297 [OHASD(3015)]CRS-8011: reboot advisory message from host: node1, component: cssagent, with time stamp: L-2018-10-11-23:01:32.296
2018-10-11 23:01:32.297 [OHASD(3015)]CRS-8013: reboot advisory message text: oracssdagent is rebooting this node due to network timeout (no network activity for 57978 milliseconds).
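The alert log entries follow a regular layout (timestamp, reporting process and pid, CRS message code, text), so the eviction and reboot advisory can be picked out programmatically. The following Python sketch is an illustration under that assumption. The helper names `parse_alert` and `reboot_timeout_ms` are hypothetical, and the patterns are derived from the three lines above only.

```python
import re

# Entry layout inferred from the excerpt above (assumption).
ALERT = re.compile(
    r"^(?P<ts>\S+ \S+) \[(?P<proc>\w+)\((?P<pid>\d+)\)\](?P<code>CRS-\d+): (?P<text>.*)$"
)
TIMEOUT = re.compile(r"no network activity for (?P<ms>\d+) milliseconds")


def parse_alert(lines):
    """Return a list of dicts with ts, proc, pid, code and text for matching lines."""
    return [m.groupdict() for m in map(ALERT.match, lines) if m]


def reboot_timeout_ms(entries):
    """Return the timeout in milliseconds from the first CRS-8013 entry, or None."""
    for e in entries:
        if e["code"] == "CRS-8013":
            m = TIMEOUT.search(e["text"])
            if m:
                return int(m.group("ms"))
    return None


# Lines copied from the excerpt above.
alert = [
    "2018-10-11 23:00:38.897 [OCSSD(5692)]CRS-1607: Node node1 is being evicted in cluster incarnation 422888457; details at (:CSSNM00007:) in /u01/app/grid/diag/crs/node1/crs/trace/ocssd.trc.",
    "2018-10-11 23:01:32.297 [OHASD(3015)]CRS-8013: reboot advisory message text: oracssdagent is rebooting this node due to network timeout (no network activity for 57978 milliseconds).",
]
entries = parse_alert(alert)
print([e["code"] for e in entries], reboot_timeout_ms(entries))
```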
Changes
This is a new ACFS implementation on a Database Domain in a SuperCluster.
Cause