Process ocssd.bin Consumes High CPU Even With Patch 25211209 Installed

(Doc ID 2304727.1)

Last updated on NOVEMBER 01, 2017

Applies to:

Oracle Database - Enterprise Edition - Version 12.1.0.2 and later
Information in this document applies to any platform.

Symptoms

1. ocssd.bin consumes high CPU (CPU spinning) both before and after the patch for bug 25211209 is installed. The patch for bug 25211209 may have been installed on its own or as part of a merge with the patch for bug 20302006.

2. ocssd.bin is still reported by CHM or OSW Top as the top CPU consumer:

zzz ***Fri Sep 1 07:28:52 CDT 2017
top - 07:28:54 up 7 days, 6:59, 0 users, load average: 3.87, 2.91, 2.44
Tasks: 2728 total, 1 running, 2727 sleeping, 0 stopped, 0 zombie
Cpu(s): 3.3%us, 0.8%sy, 0.0%ni, 95.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 264459252k total, 191177440k used, 73281812k free, 1867104k buffers
Swap: 16777212k total, 0k used, 16777212k free, 17264068k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
22657 ora9dba RT 0 2409m 282m 111m S 100 0.1 264:41.55 ocssd.bin
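To follow ocssd.bin CPU usage across many snapshots, the OSW Top output can be filtered with a short script. The following is a minimal sketch, not an Oracle-supplied tool: the function name ocssd_cpu_from_oswtop is hypothetical, and it assumes the layout shown above, that is, each snapshot starts with a "zzz ***" timestamp line and %CPU is the ninth column of the process line.

```shell
# Print "<snapshot timestamp> <%CPU>" for each ocssd.bin line in an OSW Top file.
# Assumption: snapshots start with "zzz ***<timestamp>" and %CPU is field 9.
# Usage: ocssd_cpu_from_oswtop <oswtop-file>
ocssd_cpu_from_oswtop() {
    awk '
        # Remember the timestamp of the current snapshot
        /^zzz / { sub(/^zzz \*\*\*/, ""); ts = $0; next }
        # Report the %CPU column for the ocssd.bin process line
        $NF == "ocssd.bin" { print ts, $9 }
    ' "$1"
}
```

A sustained run of values near 100 across consecutive snapshots would be consistent with the spinning described in this note.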

 

3. At the same time, OCSSD updates all or several of its trace files:

-rw-rw-r--. 1 grid oinstall 157294147 Aug 6 11:30 ocssd_10343.trc
-rw-rw-r--. 1 grid oinstall 157311429 Aug 6 11:31 ocssd_10344.trc
-rw-rw-r--. 1 grid oinstall 157293490 Aug 6 11:31 ocssd_10345.trc
-rw-rw-r--. 1 grid oinstall 157293743 Aug 6 11:31 ocssd_10346.trc
-rw-rw-r--. 1 grid oinstall 157303809 Aug 6 11:33 ocssd_10347.trc
-rw-rw-r--. 1 grid oinstall 157296103 Aug 6 11:33 ocssd_10348.trc
-rw-rw-r--. 1 grid oinstall 157299218 Aug 6 11:33 ocssd_10349.trc
-rw-rw-r--. 1 grid oinstall 157289910 Aug 6 11:33 ocssd_10350.trc
-rw-rw-r--. 1 grid oinstall 157294730 Aug 6 11:33 ocssd_10351.trc
-rw-rw-r--. 1 grid oinstall 68603603 Aug 7 00:19 ocssd.trc
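One way to check whether several OCSSD trace files were written within a short window is to list those modified in the last few minutes. This is a sketch under assumptions: the function name recent_ocssd_traces is hypothetical, the trace directory must be supplied by the caller, and the -mmin test is a GNU/BSD find extension that may not exist on every platform.

```shell
# List ocssd*.trc files modified within the last N minutes (default 10).
# Assumption: find supports -mmin (GNU and BSD find do; it is not POSIX).
# Usage: recent_ocssd_traces <trace-directory> [minutes]
recent_ocssd_traces() {
    find "$1" -type f -name 'ocssd*.trc' -mmin "-${2:-10}" | sort
}
```

Many large files in the result, as in the listing above, suggest that the trace is rotating unusually fast.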

4. A review of those trace files shows repeated trace entries. In the output below, the first column is the number of times the same line was found in the file:

$ sort ocssd_10343.trc | uniq -c | grep -v '^ *1 ' | head
5
2 2017-08-06 09:44:23.662934 : CSSD:319117056: clssgmcpDataUpdtCmpl: Status 0 mbr data updt memberID 10:2:2 from clientID 2:34:4
2 2017-08-06 09:44:23.828302 :GIPCHTHR:315963136: gipchaWorkerWork: workerThread heart beat, time interval since last heartBeat 30010loopCount 139
33 2017-08-06 09:44:24.827509 : CSSD:293922560: clssnmSendingThread: sending status msg to all nodes
33 2017-08-06 09:44:24.827707 : CSSD:293922560: clssnmSendingThread: sent 5 status msgs to all nodes
33 2017-08-06 09:44:25.713471 : CSSD:322270976: clssgmMbrDataUpdt: Processing member data change type 1, size 4 for group HB+ASM, memberID 10:2:2
34 2017-08-06 09:44:25.713496 : CSSD:322270976: clssgmMbrDataUpdt: Sending member data change to GMP for group HB+ASM, memberID 10:2:2
74 2017-08-06 09:44:25.714168 : CSSD:324507392: clssgmpcMemberDataUpdt: grockName HB+ASM memberID 10:2:2, datatype 1 datasize 4
81 2017-08-06 09:44:25.715099 : CSSD:319117056: clssgmcpDataUpdtCmpl: Status 0 mbr data updt memberID 10:2:2 from clientID 2:34:4
81 2017-08-06 09:44:26.709128 : CSSD:322270976: clssgmMbrDataUpdt: Processing member data change type 1, size 401 for group GR+DB_+ASM, memberID 26:2:1
...

  In the output shown above, some messages are repeated 81 times. This is abnormal: trace lines are normally unique, because each one carries its own timestamp and message text.
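The same check can be condensed into a one-line summary per trace file. This is a minimal sketch; the function name trace_dup_summary and the output format are illustrative assumptions, not part of any Oracle utility.

```shell
# Summarise how often identical lines repeat in a trace file.
# Usage: trace_dup_summary <trace-file>
trace_dup_summary() {
    # LC_ALL=C makes sort compare bytes, which is faster on large files
    LC_ALL=C sort "$1" | uniq -c | awk '
        { total += $1; distinct++; if ($1 > max) max = $1 }
        $1 > 1 { repeated++ }
        END {
            printf "total=%d distinct=%d repeated=%d max_repeat=%d\n",
                   total, distinct, repeated, max
        }'
}
```

Here total is the number of lines in the file, distinct is the number of different lines, repeated is how many of those occur more than once, and max_repeat is the highest repetition count. In a healthy trace, max_repeat would be expected to stay close to 1.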

5. ocssd.bin can potentially spin for several hours. In some situations, other symptoms are also seen, in which ocssd.bin aborts:

a. The Clusterware alert log shows the errors DIA-48001 and CRS-8503 just before OCSSD aborts:

DIA-48001: internal error code, arguments: [dbgtrRecValidate], [], [], [], [], [], [], []
Incident details in: <path incident trace directory>/incdir_1/ocssd_i1.trc
DIA-48001: internal error code, arguments: [dbgtrRecValidate], [], [], [], [], [], [], []
Sweep [inc][1]: completed
Errors in file <path trace directory>/ocssd.trc (incident=2):
DIA-48001: internal error code, arguments: [kgepop: no error frame to pop to], [], [], [], [], [], [], []
DIA-48001: internal error code, arguments: [dbgtrRecValidate], [], [], [], [], [], [], []
Incident details in:<path incident trace directory>/incdir_2/ocssd_i2.trc
DIA-48001: internal error code, arguments: [kgepop: no error frame to pop to], [], [], [], [], [], [], []
DIA-48001: internal error code, arguments: [dbgtrRecValidate], [], [], [], [], [], [], []
Sweep [inc][2]: completed
Errors in file <path trace directory>/ocssd.trc (incident=3):
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: <path incident trace directory>/incdir_3/ocssd_i3.trc
DIA-48001: internal error code, arguments: [kgepop: no error frame to pop to], [], [], [], [], [], [], []
DIA-48001: internal error code, arguments: [dbgtrRecValidate], [], [], [], [], [], [], []
...
CRS-1661: The CSS daemon is not responding. Reboot will occur in 13860 milliseconds; Details at (:CLSN00111:) in <path trace directory>/ohasd_cssdmonitor_root.trc
CRS-1661: The CSS daemon is not responding. Reboot will occur in 13290 milliseconds; Details at (:CLSN00111:) in <path trace directory>/ohasd_cssdagent_root.trc

b. A reboot advisory message is reported, showing that ocssd.bin was terminated:

CRS-8011: reboot advisory message from host: node2, component: cssmonit, with time stamp: L-2017-08-06-11:34:02.364
CRS-8013: reboot advisory message text: oracssdmonitor is rebooting this node due to network timeout (no network activity for 27860 milliseconds).
CRS-8011: reboot advisory message from host: node2, component: cssagent, with time stamp: L-2017-08-06-11:34:02.332
CRS-8013: reboot advisory message text: oracssdagent is rebooting this node due to network timeout (no network activity for 27860 milliseconds).

c. In some cases the error DIA-48001 is not reported in the alert log file, but the OCSSD trace file shows:

kgepop: no error frame to pop to for error 48127
DDE: Flood control is not active
Incident 1 created, dump file: <path incident trace directory>/incdir_1/ocssd_i1.trc
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []
Trace file <path trace directory>/ocssd.trc

- In this case the trace file is not created, but the stack from a core dump may show:

#0 0x00007fc3374c5885 in raise
#1 0x00007fc3374c6e61 in abort
#2 0x00007fc33ba770ca in skgdbgcra
#3 0x00007fc33ba6f795 in skgesigCrash
#4 0x00007fc33ba6f9f9 in skgesig_sigactionHandler
#5 <signal handler called>
#6 0x00007fc33bc82fba in dbgc_cra
#7 0x00007fc33a49a611 in kgepop
#8 0x00007fc33a49a16f in kgersel
#9 0x00007fc33a66fd25 in dbgtbBucketIterNextGet
#10 0x00007fc33d71732a in clsdadr_process_bucket
#11 0x00007fc33d716eeb in clsdadr_bucketThread
#12 0x00007fc338eab806 in start_thread
..
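To check quickly whether an alert log contains the message codes quoted in the symptoms above, the occurrences of each code can be counted. This is a sketch only: the function name css_abort_markers is hypothetical, the list of codes is taken from the excerpts in this note, and the alert log path must be supplied by the caller.

```shell
# Count occurrences of the message codes quoted in this note in an alert log.
# Usage: css_abort_markers <alert-log-file>
css_abort_markers() {
    for code in DIA-48001 CRS-8503 CRS-1661 CRS-8011 CRS-8013; do
        # grep -c prints 0 when the code is absent
        printf '%s %s\n' "$code" "$(grep -c -- "$code" "$1")"
    done
}
```

Non-zero counts only show that the messages are present; the surrounding log entries and timestamps still need to be compared with the patterns shown in the symptoms.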

 

Cause
