
Database instance crash due to MR Key depletion on compute node (Doc ID 2694588.1)

Last updated on JANUARY 13, 2022

Applies to:

Oracle Database - Enterprise Edition - Version 11.2.0.4 to 20.0.0.0.0 Beta [Release 11.2 to 20.0]
Information in this document applies to any platform.

Symptoms

With an increase in large I/O workloads and many parallel query slaves running, the compute node runs low on MR (memory region) keys. The resulting I/O failures against the storage cells crash the database instance(s).

This is applicable to 11gR2 and later database instances.

For 12.2 and later database instances:

Database instance alert log file:

2020-NN-MMT06:00:29.786709-05:00
TT02 (PID:331657): SRL selected for T-4.S-91673 for LAD:4
2020-NN-MMT06:01:01.294238-05:00
ARC5 (PID:328260): Archived Log entry 475395 added for T-4.S-91672 ID 0x355a753c LAD:1
2020-NN-MMT06:10:59.572602-05:00
LGWR (ospid: 301275) waits for event 'LGWR any worker group' for 71 secs.
2020-NN-MMT06:10:59.572672-05:00
LGWR (ospid: 301275) is hung in an acceptable location (inwait 0x1.ffff).
2020-NN-MMT06:11:03.256323-05:00
LG00 (ospid: 301279) waits for event 'log file parallel write' for 77 secs.
2020-NN-MMT06:11:05.592530-05:00
LG04 (ospid: 301305) waits for event 'log file parallel write' for 79 secs.
2020-NN-MMT06:11:08.098856-05:00
LGWR (ospid: 301275) waits for event 'LGWR any worker group' for 82 secs.
2020-NN-MMT06:11:16.254969-05:00
LGWR (ospid: 301275) waits for event 'LGWR any worker group' for 90 secs.
2020-NN-MMT06:11:22.649260-05:00
LG00 (ospid: 301279) waits for event 'log file parallel write' for 96 secs.
2020-NN-MMT06:11:24.346752-05:00
LG04 (ospid: 301305) waits for event 'log file parallel write' for 98 secs.
2020-NN-MMT06:11:26.251350-05:00
NOTE: process _ctwr_xpclrf4 (331659) initiating optional offline of disk 236.3306169807 (DATADG_CD_02_nnnnn) with mask 0x7e in group 2 (DATADG) with client assisting
2020-NN-MMT06:11:26.321736-05:00
LGWR (ospid: 301275) waits for event 'LGWR any worker group' for 100 secs.

2020-NN-MMT06:17:02.465271-05:00
LG00 (ospid: 301279) waits for event 'log file parallel write' for 436 secs.
2020-NN-MMT06:17:04.463883-05:00
LG04 (ospid: 301305) waits for event 'log file parallel write' for 438 secs.
2020-NN-MMT06:17:22.418435-05:00
LG00 (ospid: 301279) waits for event 'log file parallel write' for 456 secs.
2020-NN-MMT06:17:24.417712-05:00
LG04 (ospid: 301305) waits for event 'log file parallel write' for 458 secs.
2020-NN-MMT06:22:32.001673-05:00
Pin Cleanup: Cannot cleanup temp blocks in the middle of modification
File # 4021 block # 2 (0.3/4c00002)

LMHB trace file:

** 2020-NN-MMT06:HH:58.180283-05:00
: heartbeat check status 2 (acceptable) (threshold 70 sec)
: heartbeat poke time 0x5f10357a req 0x282b0 ack 0x282af freq 10
: heartbeat state 0x1.ffff (inwait) pso-flag 0x0
: waiting for event 'LGWR any worker group' for 4 secs with wait_id 330582335.
===[ Wait Chain ]===
LGWR (ospid: 301275) waits for event 'LGWR any worker group'.
LG00 (ospid: 301279) waits for event 'log file parallel write'.
==============================
Dumping PROCESS LGWR (ospid: 301275) States
==============================
===[ Callstack ]===

*** 2020-NN-MMT06:10:06.222594-05:00
Process diagnostic dump for oracle@nnnnnNN (LG00), OS id=301279,
pid: 51, proc_ser: 1, sid: 3520, sess_ser: 22743
-------------------------------------------------------------------------------
os thread scheduling delay history: (sampling every 1.000000 secs)
0.000000 secs at [ 06:10:05 ]
NOTE: scheduling delay has not been sampled for 0.383535 secs
0.000000 secs from [ 06:10:01 - 06:10:06 ], 5 sec avg
0.000000 secs from [ 06:HH:06 - 06:10:06 ], 1 min avg
0.000000 secs from [ 06:05:06 - 06:10:06 ], 5 min avg

*** 2020-NN-MMT06:10:08.489097-05:00
loadavg : 7.26 7.03 5.41
System user time: 0.02 sys time: 0.01 context switch: 92147
Memory (Avail / Total) = 45504.94M / 773469.56M
Swap (Avail / Total) = 18558.04M / 24576.00M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 S orprod 301279 1 1 80 0 - 104064044 poll_s Jun27 ? 07:35:53 ora_lg00_XPCLRF4
Name: ora_lg00_xpclrf

CapBnd: 0000003fffffffff
Seccomp: 0
Speculation_Store_Bypass: thread vulnerable
Cpus_allowed: ffffff,ffffffff,ffffffff
Cpus_allowed_list: 0-87
Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Mems_allowed_list: 0-1
voluntary_ctxt_switches: 337652312
nonvoluntary_ctxt_switches: 8869192
Short stack dump:
ksedsts()+426<-ksdxfstk()+58<-ksdxcb()+872<-sspuser()+200<-__sighandler()<-__poll()+16
<-sskgxp_selectex()+246<-skgxpiwait()+1069<-skgxpwaiti()+389<-skgxpwait()+167<-ossnet_wait_all()+260
<-ossnet_unified_wait()+346<-ossnet_wait_internal()+537<-ossnet_wait()+186<-oss_wait_ext()+1325
<-oss_wait()+129<-kfk_reap_oss_int()+142<-kfk_reap_oss_async_io()+372<-kfk_reap_ios_from_subsys()+362
<-kfk_reap_ios()+302<-kfk_io1()+439<-kfk_transitIO()+323<-kfioReapIO()+339<-kfioRequestPriv()+245
<-kfioRequest()+332<-ksfdafRequest()+342<-ksfdafWait()+382<-ksfdwtio()+721<-ksfdwat_internal()+206
<-ksfdblock()+266<-kcrfw_do_write()+6504<-kcrfw_redo_write()+760<-kcrfw_slave_group_main()+924
<-ksvrdp_int()+1941<-opirip()+583<-opidrv()+581<-sou2o()+165<-opimai_real()+173<-ssthrdmain()+417
<-main()+256<-__libc_start_main()+245
Kernel stack:

-------------------------------------------------------------------------------
Process diagnostic dump actual duration=2.275000 sec
(max dump time=5.000000 sec)

*** 2020-NN-MMT06:10:08.497207-05:00

From the I/O-capable process (LG00) trace file:

2020-NN-MM AA:BB:CC.909*: osswait failed: context 0x194a0d50 childctx 0x194a0d50 timeout 0 errorcode 38 net_request (nil)
2020-NN-MM AA:BB:CC.909*: osswait failed: context 0x194a0d50 childctx 0x194a0d50 timeout 0 errorcode 38 net_request (nil)
2020-NN-MM AA:BB:CC.932*: osswait failed: context 0x194a0d50 childctx 0x194a0d50 timeout 0 errorcode 38 net_request (nil)

2020-NN-DD HH:MM:CC.932* : OSSIPC:SKGXP:[db9a680.8021]{E}: (160755 -> 22391)SKGXPRPCBIDS: FAILED to prepare write rdma bid
2020-NN-DD HH:MM:CC.932*: OSSIPC:SKGXP:[db9a680.8023]{E}: (160755 -> 22391)SKGXPRPCBIDS: FAILED to prepare write rdma bid

From the rdsinfo output:

zzz <MM/DD/YYYY HH:MM:35> subcount: 6
ib_rdma_mr_1m_used 16657083633
ib_rdma_mr_1m_pool_depleted 708533595
zzz <MM/DD/YYYY HH:MM:35> subcount: 7
ib_rdma_mr_1m_used 16657111635
ib_rdma_mr_1m_pool_depleted 708533595 
zzz <MM/DD/YYYY HH:MM:40> subcount: 8
ib_rdma_mr_1m_used 16657202408
ib_rdma_mr_1m_pool_depleted 709081086 <<--- increase of 547,491
zzz <MM/DD/YYYY HH:MM:46> subcount: 9
ib_rdma_mr_1m_used 16657202408
ib_rdma_mr_1m_pool_depleted 710854784
zzz <MM/DD/YYYY HH:MM:51> subcount: 10
ib_rdma_mr_1m_used 16657202408
ib_rdma_mr_1m_pool_depleted 712744199
zzz <MM/DD/YYYY HH:MM:57> subcount: 11
ib_rdma_mr_1m_used 16657202408
ib_rdma_mr_1m_pool_depleted 714715253
zzz <MM/DD/YYYY HH:MM:08> subcount: 12
ib_rdma_mr_1m_used 16657202408
ib_rdma_mr_1m_pool_depleted 716628545
zzz <MM/DD/YYYY HH:MM:08> subcount: 13
ib_rdma_mr_1m_used 16657202408
ib_rdma_mr_1m_pool_depleted 718635876
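
A growing `ib_rdma_mr_1m_pool_depleted` counter between consecutive samples (as in the excerpt above) indicates the pool is actively depleting, not merely that it was exhausted at some point in the past. A minimal sketch of that check, run against OSWatcher-style rdsinfo output (the sample text and the `depletion_deltas` helper are illustrative, not an Oracle-supplied tool):

```python
import re

def depletion_deltas(rdsinfo_text):
    """Per-sample increases of the ib_rdma_mr_1m_pool_depleted counter."""
    values = [int(m.group(1)) for m in
              re.finditer(r'ib_rdma_mr_1m_pool_depleted\s+(\d+)', rdsinfo_text)]
    return [later - earlier for earlier, later in zip(values, values[1:])]

# Synthetic sample in the same format as the excerpt above
sample = """\
zzz <MM/DD/YYYY HH:MM:35> subcount: 7
ib_rdma_mr_1m_used 16657111635
ib_rdma_mr_1m_pool_depleted 708533595
zzz <MM/DD/YYYY HH:MM:40> subcount: 8
ib_rdma_mr_1m_used 16657202408
ib_rdma_mr_1m_pool_depleted 709081086
zzz <MM/DD/YYYY HH:MM:46> subcount: 9
ib_rdma_mr_1m_used 16657202408
ib_rdma_mr_1m_pool_depleted 710854784
"""

deltas = depletion_deltas(sample)
print(deltas)  # [547491, 1773698]
if any(d > 0 for d in deltas):
    print("MR key pool is actively depleting")
```

A counter that stays flat across samples (all deltas zero) would instead point to a historical depletion event rather than an ongoing one.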

For 11.2.0.4 database instances:

RDBMS alert log file:

DAY MM DD HH:18:29 2020
Process RSM0, PID = 213898, will be killed
DAY MM DD HH:18:31 2020
RSM0 started with pid=295, OS id=233516
DAY MM DD HH:21:58 2020
LMS2 (ospid: 160847) received an instance eviction notification from instance 1 [2]
DAY MM DD HH:21:59 2020
Errors in file /u01/app/oracle/diag/rdbms/abcd_abcd10/abcd10/trace/abcd10_ora_371269.trc:
ORA-27603: Cell storage I/O error, I/O failed on disk o/xxx.yyy.zz.98;xxx.yyy.zz.99/N1_NNN_MM_05_mmmmmmm05 at offset 3284023246848 for data length 16384
ORA-27626: Exadata error: 221 (I/O request fenced)
WARNING: Read Failed. group:4 disk:59 AU:782972 offset:655360 size:16384
path:o/xxx.yyy.zz.98;xxx.yyy.zz.99/N1_R10_CD_05_mmmmmmm05
incarnation:0xe9afa4e4 synchronous result:'I/O error'
subsys:OSS krq:0x7f617db58000 bufp:0x673668000 osderr1:0xdd osderr2:0x0
ERROR: unrecoverable error ORA-15311 raised in ASM I/O path; terminating process 371269
DAY MM DD HH:21:59 2020
Errors in file /u01/app/oracle/diag/rdbms/abcd_abcd10/abcd10/trace/abcd10_ora_233429.trc:
ORA-27603: Cell storage I/O error, I/O failed on disk o/xxx.yyy.zz.61;xxx.yyy.zz.62/D1_NNN_MM_02_nnnnnnn01 at offset 174579531776 for data length 16384
ORA-27626: Exadata error: 221 (I/O request fenced)
WARNING: Read Failed. group:1 disk:201 AU:41623 offset:16384 size:16384
path:o/xxx.yyy.zz.61;xxx.yyy.zz.62/D1_NNN_MM_02_abcd11celadm01
incarnation:0xe9afa171 synchronous result:'I/O error'
subsys:OSS krq:0x7f734b0ae000 bufp:0x7f734adb3e00 osderr1:0xdd osderr2:0x0
ERROR: unrecoverable error ORA-15311 raised in ASM I/O path; terminating process 233429
DAY MM DD HH:21:59 2020
Received an instance abort message from instance 2
DAY MM DD HH:21:59 2020
Received an instance abort message from instance 2
Please check instance 2 alert and LMON trace files for detail.
Please check instance 2 alert and LMON trace files for detail.
LMD0 (ospid: 160770): terminating the instance due to error 481
DAY MM DD HH:22:00 2020
System state dump requested by (instance=10, osid=160770 (LMD0)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/abcd_abcd10/abcd10/trace/abcd10_diag_160638_20200715092200.trc
DAY MM DD HH:22:00 2020
License high water mark = 545
Instance terminated by LMD0, pid = 160770

LMHB trace file:

*** 2020-NN-15 HH:21:20.158
==============================
LMON (ospid: 160755) has not moved for 261 sec (1594819279.1594819018)
kjfmGCR_HBCheckAll: LMON (ospid: 160755) has status 2
: waiting for event 'control file sequential read' for 257 secs with wait_id 9060614.
===[ Wait Chain ]===
Wait chain is empty.
kjgcr_Main: KJGCR_ACTION - id 5

*** 2020-NN-15 HH:21:40.158
==============================
LMON (ospid: 160755) has not moved for 281 sec (1594819299.1594819018)
kjfmGCR_HBCheckAll: LMON (ospid: 160755) has status 2
: waiting for event 'control file sequential read' for 277 secs with wait_id 9060614.
===[ Wait Chain ]===
Wait chain is empty.
kjgcr_Main: KJGCR_ACTION - id 5

From the hang manager trace file:

HM: Short Stack of root session ID 3926, OSPID 161068 of hang ID 283
Short stack dump:
ksedsts()+465<-ksdxfstk()+32<-ksdxcb()+1927<-sspuser()+112<-__sighandler()
<-__poll()+24<-sskgxp_selectex()+392<-skgxpiwait()+4318<-skgxpwaiti()+1916
<-skgxpwait()+162<-ossnet_wait_all()+248<-ossnet_wait()+212<-oss_wait()+991
<-kfk_reap_oss_async_io()+521<-kfk_reap_ios_from_subsys()+317<-kfk_reap_ios()+2048
<-kfk_io1()+1108<-kfkRequest()+14<-kfk_transitIO()+1727<-kfioReapIO()+448
<-kfioRequestPriv()+392<-kfioRequest()+691<-ksfd_kfioRequest()+649
<-ksfd_osmwat()+942<-ksfdwtio()+1240<-ksfdwat_internal()+649
<-ksfdwat()+250<-kcfwatwr()+145<-kcbbdrv()+2154<-ksbabs()+775<-ksbrdp()+1045
<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+250<-ssthrdmain()+265
<-main()+201<-__libc_start_main()+253

From the I/O-capable process (LMON) trace file:

ossnet_wait_all: WAITED TOO LONG for network request completion: 40001. init_timeout: 4294967295 remaining_timeout: 4294927294
Dumping State information

*** 2020-NN-DD HH:MM:43.131

BEGIN dump active requests

2020-NN-DD HH:MM:41.187892 : OSSIPC:SKGXP:[db9a680.8021]{E}: (160755 -> 22391)SKGXPRPCBIDS: FAILED to prepare write rdma bid
2020-NN-DD HH:MM:41.198107 : OSSIPC:SKGXP:[db9a680.8023]{E}: (160755 -> 22391)SKGXPRPCBIDS: FAILED to prepare write rdma bid
2020-NN-DD HH:MM:41.208237 : OSSIPC:SKGXP:[db9a680.8025]{E}: (160755 -> 22391)SKGXPRPCBIDS: FAILED to prepare write rdma bid
2020-NN-DD HH:MM:41.218320 : OSSIPC:SKGXP:[db9a680.8027]{E}: (160755 -> 22391)SKGXPRPCBIDS: FAILED to prepare write rdma bid
2020-NN-DD HH:MM:41.228466 : OSSIPC:SKGXP:[db9a680.8029]{E}: (160755 -> 22391)SKGXPRPCBIDS: FAILED to prepare write rdma bid

From the rdsinfo output:

ib_rdma_mr_1m_pool_depleted 6213959
zzz <MM/DD/YYYY HH:16:31> subcount: 4
ib_rdma_mr_1m_used 52017863435
ib_rdma_mr_1m_pool_depleted 6213959
zzz <MM/DD/YYYY HH:18:50> subcount: 5
ib_rdma_mr_1m_used 52018364421
ib_rdma_mr_1m_pool_depleted 7205739
zzz <MM/DD/YYYY HH:21:30> subcount: 6
ib_rdma_mr_1m_used 52018364421
ib_rdma_mr_1m_pool_depleted 8567162
zzz <MM/DD/YYYY HH:23:31> subcount: 7
ib_rdma_mr_1m_used 52018367606
ib_rdma_mr_1m_pool_depleted 9068852
zzz <MM/DD/YYYY HH:25:48> subcount: 8
ib_rdma_mr_1m_used 52018426784
ib_rdma_mr_1m_pool_depleted 9068852

Cause
