Solaris: node rebooted after membership kill escalation as some processes can not be killed due to OS bug (Doc ID 1931118.1)

Last updated on OCTOBER 13, 2016

Applies to:

Oracle Database - Enterprise Edition - Version 11.2.0.1 to 11.2.0.4 [Release 11.2]
Oracle Solaris on x86-64 (64-bit)
Oracle Solaris on SPARC (64-bit)

Symptoms

In an 11gR2 3-node cluster, the instance on node 2 is unhealthy. The instance eviction is unsuccessful, and the membership kill escalates to a node kill because a process cannot be killed:

Instance 1 alert.log

Wed Aug 27 06:46:17 2014
IPC Send timeout detected. Sender: ospid 22346 [oracle@pssva142 (LMS3)] <<<<
Receiver: inst 2 binc 381839072 ospid 21829
IPC Send timeout to 2.4 inc 10 for msg type 73 from opid 16
Wed Aug 27 06:46:19 2014
Communications reconfiguration: instance_number 2
Wed Aug 27 06:46:30 2014
IPC Send timeout detected. Sender: ospid 22380 [oracle@pssva142 (MMON)]
Receiver: inst 2 binc 381839066 ospid 21825
minact-scn: got error during useg scan e:12751 usn:15
minact-scn: useg scan erroring out with error e:12751
IPC Send timeout detected. Sender: ospid 22380 [oracle@pssva142 (MMON)]
Receiver: inst 2 binc 381839066 ospid 21825
IPC Send timeout to 2.3 inc 10 for msg type 32 from opid 28
IPC Send timeout: Terminating pid 28 osid 22380
Wed Aug 27 06:46:35 2014
IPC Send timeout detected. Sender: ospid 24323 [oracle@pssva142]
Receiver: inst 2 binc 381839072 ospid 21829
IPC Send timeout to 2.4 inc 10 for msg type 36 from opid 279
Wed Aug 27 06:46:51 2014
Restarting dead background process MMON
Wed Aug 27 06:46:51 2014
MMON started with pid=28, OS id=11917
Wed Aug 27 06:47:08 2014
IPC Send timeout detected. Sender: ospid 16811 [oracle@pssva142]
Receiver: inst 2 binc 10 ospid 9112
Wed Aug 27 06:47:18 2014
Evicting instance 2 from cluster
Waiting for instances to leave: 2
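
In the Instance 1 alert.log above, each "IPC Send timeout detected" line is followed by a "Receiver:" line. The following is a minimal Python sketch for tallying those pairs, not an Oracle-supplied tool. The names parse_ipc_timeouts and SAMPLE are hypothetical, and the regular expressions assume the exact line layout shown in this excerpt.

```python
import re
from collections import Counter

# Sample lines copied from the Instance 1 alert.log excerpt in this note.
SAMPLE = """\
IPC Send timeout detected. Sender: ospid 22346 [oracle@pssva142 (LMS3)] <<<<
Receiver: inst 2 binc 381839072 ospid 21829
IPC Send timeout detected. Sender: ospid 22380 [oracle@pssva142 (MMON)]
Receiver: inst 2 binc 381839066 ospid 21825
IPC Send timeout detected. Sender: ospid 24323 [oracle@pssva142]
Receiver: inst 2 binc 381839072 ospid 21829
"""

# Assumed layout: the process name in parentheses is optional.
SENDER = re.compile(
    r"IPC Send timeout detected\. Sender: ospid (\d+) "
    r"\[([^\s\]]+)(?: \((\w+)\))?\]"
)
RECEIVER = re.compile(r"Receiver: inst (\d+) binc (\d+) ospid (\d+)")


def parse_ipc_timeouts(text):
    """Pair each sender line with the receiver line that follows it."""
    events = []
    pending = None
    for line in text.splitlines():
        sender = SENDER.search(line)
        if sender:
            pending = {
                "sender_ospid": int(sender.group(1)),
                "sender_host": sender.group(2),
                "sender_name": sender.group(3),  # None for foreground processes
            }
            continue
        receiver = RECEIVER.search(line)
        if receiver and pending is not None:
            pending["receiver_inst"] = int(receiver.group(1))
            pending["receiver_ospid"] = int(receiver.group(3))
            events.append(pending)
            pending = None
    return events


if __name__ == "__main__":
    events = parse_ipc_timeouts(SAMPLE)
    per_receiver = Counter(e["receiver_ospid"] for e in events)
    for ospid, count in per_receiver.most_common():
        print(f"receiver ospid {ospid}: {count} timeout(s)")
```

Grouping by receiver ospid shows which processes on the remote instance are not responding. In the sample, every timeout targets instance 2.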

Instance 2 alert.log

Wed Aug 27 06:46:17 2014
IPC Send timeout detected. Receiver ospid 21829 [oracle@pssva143 (LMS3)]
Wed Aug 27 06:47:18 2014
Wed Aug 27 06:47:18 2014
Received an instance abort message from instance 1Received an instance abort message from instance 1
..
Please check instance 1 alert and LMON trace files for detail.
Please check instance 1 alert and LMON trace files for detail.
LMS0 (ospid: 21817): terminating the instance due to error 481
Wed Aug 27 06:47:18 2014
opiodr aborting process unknown ospid (1067) as a result of ORA-1092
Wed Aug 27 06:47:18 2014
IPC Send timeout detected. Receiver ospid 21825 [oracle@pssva143 (LMS2)]
System state dump requested by (instance=2, osid=21817 (LMS0)), summary=[abnormal instance termination].
Wed Aug 27 06:47:18 2014
opiodr aborting process unknown ospid (22043) as a result of ORA-1092
Wed Aug 27 07:27:25 2014
Starting ORACLE instance (normal)

ps output for pid 22033 (LCK0), the process that the CSSD log reports as undead

0 S oracle 22033 1 0 40 20 ? 4512847 ? 06:34:43 ? 1:44 ora_lck0_vciprd62

CSSD log

2014-08-27 06:47:37.562: [ CSSD][5]clssgmFenceClient: fencing client (29f2f10), member 2 in group IGVCIPRD60SYS$BACKGROUND, no share, death fence 1, SAGE fence 0
2014-08-27 06:47:37.562: [ CSSD][5]clssgmUnreferenceMember: global grock IGVCIPRD60SYS$BACKGROUND member 2 refcount is 1
2014-08-27 06:47:37.562: [ CSSD][37]clssgmFenceCompletion: (5bc8b50) process death fence completed for process 21843, object type 2
2014-08-27 06:47:37.562: [ CSSD][5]clssgmFenceProcessDeath: client (29f2f10) pid 22033 undead <<<<<

// CSSD tries to kill the process, but pid 22033 is undead (it cannot be killed). //
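
The "undead" marker in the clssgmFenceProcessDeath line above is the key symptom. As a minimal sketch, the following Python snippet searches CSSD log text for such lines and returns the affected pids. The function name find_undead_pids is hypothetical, and the pattern assumes the message layout shown in this excerpt only.

```python
import re

# Line copied from the CSSD log excerpt in this note.
SAMPLE_CSSD = (
    "2014-08-27 06:47:37.562: [ CSSD][5]clssgmFenceProcessDeath: "
    "client (29f2f10) pid 22033 undead <<<<<\n"
)

# Assumed layout: "client (<hex id>) pid <number> undead".
UNDEAD = re.compile(
    r"clssgmFenceProcessDeath: client \((\w+)\) pid (\d+) undead"
)


def find_undead_pids(text):
    """Return the pids that CSSD reports as undead, in order of appearance."""
    return [int(match.group(2)) for match in UNDEAD.finditer(text)]


if __name__ == "__main__":
    for pid in find_undead_pids(SAMPLE_CSSD):
        print(f"pid {pid} could not be killed; compare with ps output")
```

The pid returned for the sample line is 22033, which matches the ora_lck0 process in the ps output.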

2014-08-27 06:47:37.549: [ CSSD][38]clssgmpProcessRequestMsg: Member kill request from remote node
2014-08-27 06:47:37.549: [ CSSD][44]clssgmmkLocalKillThread: Local kill requested: id 2 mbr map 0x00000002 Group name DBVCIPRD60 flags 0x00000000 st time -389377906 end time -389347406 time out 30500 req node 2
2014-08-27 06:47:37.549: [ CSSD][44]clssgmmkLocalKillThread: Kill requested for member 1 group (2303750/DBVCIPRD60)
..
2014-08-27 06:48:08.053: [ CSSD][44]clssgmmkLocalKillThread: Time up. Timeout 30500 Start time -389377906 End time -389347406 Current time-389347403
2014-08-27 06:48:08.053: [ CSSD][44]clssgmmkLocalKillResults: Replying to kill request from remote node 2 kill id 2 Success map 0x00000000 Fail map 0x00000000
2014-08-27 06:48:08.054: [ CSSD][42]clssnmeventhndlr: Disconnecting endp 7f7 ninf fa8b10
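
The numbers in the "Time up" line above are internally consistent: End time minus Start time equals the Timeout of 30500, and the log timestamps of the kill request and of the "Time up" message are about 30.5 seconds apart. The Python sketch below works through that arithmetic. Treating the counters as milliseconds is an assumption inferred from the timestamps, and the helper names kill_window and wall_clock_seconds are hypothetical.

```python
import re
from datetime import datetime

# Values copied from the CSSD log excerpt in this note.
TIME_UP_LINE = (
    "2014-08-27 06:48:08.053: [ CSSD][44]clssgmmkLocalKillThread: Time up. "
    "Timeout 30500 Start time -389377906 End time -389347406 "
    "Current time-389347403"
)
KILL_REQUESTED_AT = "2014-08-27 06:47:37.549"
TIME_UP_AT = "2014-08-27 06:48:08.053"

# The excerpt prints "Current time" with no space before the value.
WINDOW = re.compile(
    r"Timeout (\d+) Start time (-?\d+) End time (-?\d+) "
    r"Current time\s*(-?\d+)"
)


def kill_window(line):
    """Extract the timeout counters and derive the window and overrun."""
    match = WINDOW.search(line)
    if match is None:
        raise ValueError("not a 'Time up' line in the assumed layout")
    timeout, start, end, current = (int(g) for g in match.groups())
    return {
        "timeout": timeout,
        "window": end - start,      # should equal the timeout
        "overrun": current - end,   # how far past the deadline
    }


def wall_clock_seconds(start_stamp, end_stamp):
    """Elapsed seconds between two CSSD log timestamps."""
    fmt = "%Y-%m-%d %H:%M:%S.%f"
    delta = datetime.strptime(end_stamp, fmt) - datetime.strptime(start_stamp, fmt)
    return delta.total_seconds()


if __name__ == "__main__":
    print(kill_window(TIME_UP_LINE))
    print(wall_clock_seconds(KILL_REQUESTED_AT, TIME_UP_AT))
```

The local kill thread therefore waited for the full timeout without the member exiting, and the reply to the remote node carries an empty Success map.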

Cause
