
Database level performance issue after one cell disk in one cell went into BAD state (Doc ID 2696778.1)

Last updated on MARCH 17, 2021

Applies to:

Oracle Exadata Storage Server Software - Version 12.2.1.1.0 and later
Information in this document applies to any platform.

Symptoms

Database-level performance issue caused by a hardware fault on one of the cell disks in a storage cell.

12_1 2020-06-30T01:25:25+00:00 warning "Data hard disk was removed. Status : NOT PRESENT Manufacturer : HGST Model Number : H7210A520SUN010T Size : 10TB Serial Number : 1814R8NRVN Firmware : A680 Slot Number : 3 Cell Disk : CD_03_xxxxxxcel02 Grid Disk : DATA_CD_03_xxxxxxcel02, RECO_CD_03_xxxxxxcel02"
12_2 2020-06-30T01:29:04+00:00 clear "Data hard disk was replaced. Status : NORMAL Manufacturer : HGST Model Number : H7210A520SUN010T Size : 10TB Serial Number : 1917RBWRKD Firmware : A680 Slot Number : 3 Cell Disk : CD_03_xxxxxxcel02 Grid Disk : DATA_CD_03_xxxxxxcel02, RECO_CD_03_xxxxxxcel02"
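The two alerthistory entries above can be cross-checked mechanically: the serial number changes between the "removed" and "replaced" events, confirming the slot-3 disk was physically swapped. A minimal sketch (the `/tmp` path is illustrative; the excerpt is pasted from the alert lines above):

```shell
# Save the two alerthistory lines quoted above to a scratch file.
cat > /tmp/alerthistory.txt <<'EOF'
12_1 2020-06-30T01:25:25+00:00 warning "Data hard disk was removed. Status : NOT PRESENT Manufacturer : HGST Model Number : H7210A520SUN010T Size : 10TB Serial Number : 1814R8NRVN Firmware : A680 Slot Number : 3 Cell Disk : CD_03_xxxxxxcel02 Grid Disk : DATA_CD_03_xxxxxxcel02, RECO_CD_03_xxxxxxcel02"
12_2 2020-06-30T01:29:04+00:00 clear "Data hard disk was replaced. Status : NORMAL Manufacturer : HGST Model Number : H7210A520SUN010T Size : 10TB Serial Number : 1917RBWRKD Firmware : A680 Slot Number : 3 Cell Disk : CD_03_xxxxxxcel02 Grid Disk : DATA_CD_03_xxxxxxcel02, RECO_CD_03_xxxxxxcel02"
EOF

# Pull out the serial numbers: a different serial in the "clear" event
# means a new physical disk went into the same slot/cell disk.
grep -o 'Serial Number : [0-9A-Z]*' /tmp/alerthistory.txt
```

The same information would normally be read on the cell itself via CellCLI (`list alerthistory`, `list physicaldisk detail`).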

*** 2020-06-28T18:34:02.867236-07:00 (CDB$ROOT(1))
loadavg : 14.16 12.60 10.96
System user time: 0.12 sys time: 0.02 context switch: 209522
Memory (Avail / Total) = 63382.12M / 772301.10M
Swap (Avail / Total) = 24576.00M / 24576.00M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 S oracle 145030 1 0 80 0 - 6481619 poll_s Jun26 ? 00:06:14 ora_ckpt_******111
Name: ora_ckpt_******
Umask: 0022
State: S (sleeping)
Tgid: 145030
Ngid: 0
Pid: 145030
PPid: 1
TracerPid: 0
Uid: 25265 25265 25265 25265
Gid: 112 112 112 112
FDSize: 128
Groups: 112
NStgid: 145030
NSpid: 145030
NSpgid: 145030
NSsid: 145030
VmPeak: 25926476 kB
VmSize: 25926476 kB
VmLck: 0 kB
VmPin: 300 kB
VmHWM: 121644 kB
VmRSS: 121644 kB
RssAnon: 13404 kB
RssFile: 106560 kB
RssShmem: 1680 kB
VmData: 15940 kB
VmStk: 608 kB
VmExe: 347776 kB
VmLib: 35500 kB
VmPTE: 860 kB
VmPMD: 112 kB
VmSwap: 0 kB
HugetlbPages: 452608 kB
Threads: 1
SigQ: 1/3088187
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000004
SigIgn: 0000000016400207
SigCgt: 00000003c9887cf8
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
NoNewPrivs: 0
Seccomp: 0
Speculation_Store_Bypass: thread vulnerable
Cpus_allowed: ffffffff,ffffffff,ffffffff
Cpus_allowed_list: 0-95
Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Mems_allowed_list: 0-1
voluntary_ctxt_switches: 5185148
nonvoluntary_ctxt_switches: 58327
Short stack dump:
ksedsts()+346<-ksdxfstk()+71<-ksdxcb()+912<-sspuser()+217<-__sighandler()<-__poll()+16
<-sskgxp_selectex()+309<-skgxpiwait()+6689<-skgxpwaiti()+2348<-skgxpwait()+1108<-ossnet_wait_all()+282
<-ossnet_wait()+356<-oss_wait_ext()+327<-oss_wait()+108<-kfk_reap_oss_async_io()+534<-kfk_reap_ios_from_subsys()+164
<-kfk_reap_ios()+399<-kfk_io1()+1461<-kfk_transitIO()+451<-kfioReapIO()+631<-kfioRequestPriv()+216<-kfioRequest()+337
<-ksfdafRequest()+742<-ksfdafWait()+370<-ksfdwtio()+773<-ksfdwat_internal()+234<-ksfdblock()+336<-ksfdbio()+2910
<-ksfdbio()+3035<-kccwbp()+1035<-kccrec_wdf()+457<-kccfhb()+136<-kcccmt()+244<-kccecx()+80<-kctmttrest()+303
<-kctrcp()+1942<-kcvcca()+189<-ksbcti()+247<-ksbabs()+11524<-ksbrdp()+1079<-opirip()+609<-opidrv()+602
<-sou2o()+145<-opimai_real()+202<-ssthrdmain()+417<-main()+262<-__libc_start_main()+245

-------------------------------------------------------------------------------
Process diagnostic dump actual duration=2.385000 sec
(max dump time=5.000000 sec)

*** 2020-06-28T18:35:08.809880-07:00 (CDB$ROOT(1))
==============================
CKPT (ospid: 145030) has not moved for 90 sec (1593394508.1593394418)
: wait in cfio is not acceptable as it fails the global check.
: heartbeat check status 6 (no-heartbeat) (threshold 70 sec)
==================================================
=== CKPT (ospid: 145030) Heartbeat Report
==================================================
CKPT (ospid: 145030) has no heartbeats for 90 sec. (threshold 70)
: heartbeat state 0x11.00 (cfio) pso-flag 0x100
: waiting for event 'control file parallel write' for 87 secs with wait_id 8389479.
===[ Wait Chain ]===
Wait chain is empty.
==============================
Dumping PROCESS CKPT (ospid: 145030) States
==============================
===[ System Load State ]===
CPU Total 8 Raw 8 Core 48 Socket 2
Load normal: Cur 2163 Highmark 10240 (8.44 40.00)
===[ Latch State ]===
Not in Latch Get
===[ Session State Object ]===
----------------------------------------
SO: 0xa4acb808, type: 4, owner: 0xa4011430, flag: INIT/-/-/-/0x00 if: 0x3 c: 0x3
proc=0xa4011430, name=session, file=ksu.h LINE:15737, pg=0, conuid=1
SGA version=(1,0)
(session) sid: 577 ser: 52795 trans: 0x9a5936a0, creator: 0xa4011430
flags: (0x51) USR/- flags2: (0x40409) -/-/INC
flags_idl: (0x1) status: BSY/-/-/- kill: -/-/-/-
DID: 0001-0022-0000000D0001-0022-0000000E, short-term DID:
txn branch: (nil)
con_id/con_uid/con_name: 1/1/CDB$ROOT
con_logonuid: 1 con_logonid: 1
con_scuid: 1 con_scid: 1
edition#: 0 user#/name: 0/SYS
oct: 0, prv: 0, sql: (nil), psql: 0x596b58ff0
stats: 0x8bffc798, PX stats: 0x114e4944
service name: SYS$BACKGROUND
Current Wait Stack:
0: waiting for 'control file parallel write'
files=0x2, block#=0x49, requests=0x2
wait_id=8389479 seq_num=59457 snap_id=1
wait times: snap=1 min 27 sec, exc=1 min 27 sec, total=1 min 27 sec
wait times: max=infinite, heur=1 min 27 sec
wait counts: calls=0 os=0
in_wait=1 iflags=0x5a0
There are 15 sessions blocked by this session.
^^^^^^^^^^^^^^^^^^^^^^^^
Dumping one waiter:
inst: 2, sid: 86, ser: 57981
wait event: 'reliable message'
p1: 'channel context'=0x5c23315c0
p2: 'channel handle'=0x9d244960
p3: 'broadcast message'=0x5be6cf580
row_wait_obj#: 224439, block#: 9046371, row#: 0, file# 30
min_blocked_time: 73 secs, waiter_cache_ver: 13535

The history is displayed in reverse chronological order.

sample interval: 1 sec, max history 120 sec
---------------------------------------------------
[85 samples, 18:33:41 - 18:35:07]
waited for 'control file parallel write', seq_num: 59457
p1: 'files'=0x2
p2: 'block#'=0x49
p3: 'requests'=0x2
time_waited: >= 86 sec (still in wait)
[14 samples, 18:33:27 - 18:33:40]
idle wait at each sample
[1 sample, 18:33:26]
not in wait at each sample
[21 samples, 18:33:08 - 18:33:25]
idle wait at each sample
---------------------------------------------------
Sampled Session History Summary:
longest_non_idle_wait: 'control file parallel write'
[85 samples, 18:33:41 - 18:35:07]
time_waited: >= 86 sec (still in wait)
---------------------------------------------------
----------------------------------------

File_name:: ******111_lmhb_145005.trc
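The LMHB figure above can be reproduced arithmetically. The pair `(1593394508.1593394418)` appears to be two epoch timestamps, the time of the check and the last time CKPT made progress; their difference is the reported 90-second stall, past the 70-second heartbeat threshold. A sketch in shell arithmetic:

```shell
# Epoch values from "(1593394508.1593394418)" in the lmhb trace above.
now=1593394508         # time of the heartbeat check
last_moved=1593394418  # last time CKPT moved
threshold=70           # heartbeat threshold reported in the trace

gap=$(( now - last_moved ))
echo "CKPT stalled for ${gap}s (threshold ${threshold}s)"
if [ "$gap" -gt "$threshold" ]; then
  # This is the condition that triggers the no-heartbeat process dump.
  echo "no-heartbeat"
fi
```

This matches the trace: 90 seconds stalled, with the session stuck in 'control file parallel write' for 87 of them and 15 sessions blocked behind CKPT.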

2020-06-29T01:28:36.273605+00:00
NO IO COMPLETION ON DISK /dev/sdl FOR 10000 MILLISECONDS: CD - CD_11_xxxxxxcel02 TIME - Mon Jun 29 01:28:36 2020 137 msec.
LAST IO SUBMITTED ON DISK /dev/sdl AT: Mon Jun 29 01:28:36 2020 272 msec
LAST IO COMPLETED ON DISK /dev/sdl AT: Mon Jun 29 01:28:25 2020 825 msec
FIRST IO SUBMITTED SINCE LAST COMPLETION ON DISK /dev/sdl AT: Mon Jun 29 01:28:26 2020 35 msec
NO IO COMPLETION ON DISK /dev/sdh FOR 10000 MILLISECONDS: CD - CD_07_xxxxxxcel02 TIME - Mon Jun 29 01:28:36 2020 137 msec.
LAST IO SUBMITTED ON DISK /dev/sdh AT: Mon Jun 29 01:28:36 2020 272 msec
LAST IO COMPLETED ON DISK /dev/sdh AT: Mon Jun 29 01:28:25 2020 825 msec
FIRST IO SUBMITTED SINCE LAST COMPLETION ON DISK /dev/sdh AT: Mon Jun 29 01:28:25 2020 915 msec
NO IO COMPLETION ON DISK /dev/sdf FOR 10000 MILLISECONDS: CD - CD_05_xxxxxxcel02 TIME - Mon Jun 29 01:28:36 2020 137 msec.
LAST IO SUBMITTED ON DISK /dev/sdf AT: Mon Jun 29 01:28:36 2020 272 msec
LAST IO COMPLETED ON DISK /dev/sdf AT: Mon Jun 29 01:28:25 2020 825 msec
FIRST IO SUBMITTED SINCE LAST COMPLETION ON DISK /dev/sdf AT: Mon Jun 29 01:28:25 2020 890 msec
NO IO COMPLETION ON DISK /dev/sde FOR 10000 MILLISECONDS: CD - CD_04_xxxxxxcel02 TIME - Mon Jun 29 01:28:36 2020 137 msec.
LAST IO SUBMITTED ON DISK /dev/sde AT: Mon Jun 29 01:28:36 2020 267 msec

++ Multiple disks were showing slow performance, so I/O was stuck on the disks.
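That the stall was not confined to a single drive can be shown directly from the cell disk log excerpt above: four distinct devices report "NO IO COMPLETION" in the same second. A sketch (the `/tmp` path is illustrative; the lines are pasted from the excerpt):

```shell
# Save the "NO IO COMPLETION" lines quoted above to a scratch file.
cat > /tmp/celldisk_io.log <<'EOF'
NO IO COMPLETION ON DISK /dev/sdl FOR 10000 MILLISECONDS: CD - CD_11_xxxxxxcel02 TIME - Mon Jun 29 01:28:36 2020 137 msec.
NO IO COMPLETION ON DISK /dev/sdh FOR 10000 MILLISECONDS: CD - CD_07_xxxxxxcel02 TIME - Mon Jun 29 01:28:36 2020 137 msec.
NO IO COMPLETION ON DISK /dev/sdf FOR 10000 MILLISECONDS: CD - CD_05_xxxxxxcel02 TIME - Mon Jun 29 01:28:36 2020 137 msec.
NO IO COMPLETION ON DISK /dev/sde FOR 10000 MILLISECONDS: CD - CD_04_xxxxxxcel02 TIME - Mon Jun 29 01:28:36 2020 137 msec.
EOF

# List the distinct devices that had no I/O completion for 10 s.
grep '^NO IO COMPLETION' /tmp/celldisk_io.log | grep -o '/dev/sd[a-z]*' | sort -u
```

Four unaffiliated devices stalling simultaneously points below the individual disks, at the shared I/O path.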

Jun 29 01:25:11 xxxxxxcel02 kernel: [7898327.915199] megaraid_sas 0000:5e:00.0: 3114 (646709110s/0x0002/FATAL) - Unrecoverable medium error during recovery on PD 04(e0xfc/s3) at 89084000

Jun 29 01:25:44 xxxxxxcel02 kernel: [7898361.716225] sd 0:2:3:0: [sdd] tag#10 megasas: target reset FAILED!!
Jun 29 01:25:44 xxxxxxcel02 kernel: [7898361.716231] megaraid_sas 0000:5e:00.0: IO/DCMD timeout is detected, forcibly FAULT Firmware
Jun 29 01:25:47 xxxxxxcel02 kernel: [7898364.452819] megaraid_sas 0000:5e:00.0: Number of host crash buffers allocated: 512
Jun 29 01:25:49 xxxxxxcel02 kernel: [7898366.612491] megaraid_sas 0000:5e:00.0: Crash Dump is available,number of copied buffers: 98
Jun 29 01:25:49 xxxxxxcel02 kernel: [7898366.612508] megaraid_sas 0000:5e:00.0: Found FW in FAULT state, will reset adapter scsi0.
Jun 29 01:25:49 xxxxxxcel02 kernel: [7898366.612511] megaraid_sas 0000:5e:00.0: resetting fusion adapter scsi0.
Jun 29 01:26:02 xxxxxxcel02 kernel: [7898379.511354] megaraid_sas 0000:5e:00.0: Waiting for FW to come to ready state
Jun 29 01:26:13 xxxxxxcel02 kernel: [7898390.815250] megaraid_sas 0000:5e:00.0: FW now in Ready state
Jun 29 01:26:13 xxxxxxcel02 kernel: [7898390.815349] megaraid_sas 0000:5e:00.0: Current firmware supports maximum commands: 928 LDIO threshold: 0
Jun 29 01:26:13 xxxxxxcel02 kernel: [7898390.815351] megaraid_sas 0000:5e:00.0: FW supports sync cache : Yes
Jun 29 01:26:14 xxxxxxcel02 kernel: [7898391.651241] megaraid_sas 0000:5e:00.0: FW provided supportMaxExtLDs: 1 max_lds: 64
Jun 29 01:26:14 xxxxxxcel02 kernel: [7898391.651244] megaraid_sas 0000:5e:00.0: controller type : MR(2048MB)
Jun 29 01:26:14 xxxxxxcel02 kernel: [7898391.651245] megaraid_sas 0000:5e:00.0: Online Controller Reset(OCR) : Enabled
Jun 29 01:26:14 xxxxxxcel02 kernel: [7898391.651246] megaraid_sas 0000:5e:00.0: Secure JBOD support : Yes
Jun 29 01:26:14 xxxxxxcel02 kernel: [7898391.651247] megaraid_sas 0000:5e:00.0: NVMe passthru support : No
Jun 29 01:26:14 xxxxxxcel02 kernel: [7898391.651248] megaraid_sas 0000:5e:00.0: FW provided TM TaskAbort/Reset timeout : 0 secs/0 secs
Jun 29 01:26:14 xxxxxxcel02 kernel: [7898391.717244] megaraid_sas 0000:5e:00.0: Interrupts are enabled and controller is OPERATIONAL for scsi:0
Jun 29 01:26:14 xxxxxxcel02 kernel: [7898391.719341] megaraid_sas 0000:5e:00.0: Reset successful for scsi0.
Jun 29 01:26:14 xxxxxxcel02 kernel: [7898391.742033] megaraid_sas 0000:5e:00.0: 3167 (646709116s/0x0001/FATAL) - Uncorrectable medium error logged for VD 03/3 at 89084000 (on PD 04(e0xfc/s3) at 89084000)
Jun 29 01:26:14 xxxxxxcel02 kernel: [7898391.742184] megaraid_sas 0000:5e:00.0: 3168 (646709116s/0x0002/FATAL) - Unrecoverable medium error during recovery on PD 04(e0xfc/s3) at 89084ce4
Jun 29 01:26:14 xxxxxxcel02 kernel: [7898391.756698] Modules linked in: ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 ip6table_filter ip6_tables iptable_filter ipmi_poweroff ipmi_ssif ib_umad mlx4_en mlx4_ib vfat fat skx_edac intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul iTCO_wdt ghash_clmulni_intel iTCO_vendor_support pcbc aesni_intel crypto_simd glue_helper cryptd mlx4_core sg ioatdma i2c_i801 lpc_ich shpchp wmi ipmi_si ipmi_devintf ipmi_msghandler pcc_cpufreq rds_rdma resilient_rdmaip ib_ipoib binfmt_misc rds rdma_ucm rdma_cm iw_cm ib_cm ib_uverbs ib_core ip_tables ext4 mbcache jbd2 fscrypto raid1 ahci libahci igb raid0 sd_mod crc32c_intel ptp nvme libata pps_core megaraid_sas i2c_algo_bit nvme_core dca
Jun 29 01:26:14 xxxxxxcel02 kernel: [7898391.756789] megasas_free_host_crash_buffer+0x32/0x80 [megaraid_sas]
Jun 29 01:26:14 xxxxxxcel02 kernel: [7898391.756792] megasas_fw_crash_state_store+0xad/0x130 [megaraid_sas]
Jun 29 01:26:14 xxxxxxcel02 kernel: [7898391.764401] megaraid_sas 0000:5e:00.0: 3255 (646709142s/0x0001/FATAL) - Uncorrectable medium error logged for VD 03/3 at 89084ce4 (on PD 04(e0xfc/s3) at 89084ce4)
Jun 29 01:26:14 xxxxxxcel02 kernel: [7898391.764586] megaraid_sas 0000:5e:00.0: 3256 (646709142s/0x0002/FATAL) - Unrecoverable medium error during recovery on PD 04(e0xfc/s3) at 89084800
Jun 29 01:26:14 xxxxxxcel02 kernel: [7898391.764730] megaraid_sas 0000:5e:00.0: 3257 (646709143s/0x0001/FATAL) - Uncorrectable medium error logged for VD 03/3 at 89084800 (on PD 04(e0xfc/s3) at 89084800)

Jun 29 01:26:14 xxxxxxcel02 kernel: [7898391.767402] megaraid_sas 0000:5e:00.0: 3270 (646709144s/0x0020/DEAD) - Fatal firmware error: Line 1233 in ../../raid/1078dma.c
Jun 29 01:26:14 xxxxxxcel02 kernel: [7898391.767588] megaraid_sas 0000:5e:00.0: 3273 (646709146s/0x0020/CRIT) - Controller encountered a fatal error and was reset <<------
Jun 29 01:26:14 xxxxxxcel02 kernel: [7898391.771977] megaraid_sas 0000:5e:00.0: Firmware crash dump copied successfully

Jun 29 01:32:40 xxxxxxcel02 kernel: [7898777.595852] megaraid_sas 0000:5e:00.0: 3513 (646709510s/0x0001/FATAL) - Uncorrectable medium error logged for VD 03/3 at 327037000 (on PD 04(e0xfc/s3) at 327037000)
Jun 29 01:32:40 xxxxxxcel02 kernel: [7898777.596375] megaraid_sas 0000:5e:00.0: 3515 (646709513s/0x0002/FATAL) - Unrecoverable medium error during recovery on PD 04(e0xfc/s3) at 155622000
Jun 29 01:32:40 xxxxxxcel02 kernel: [7898777.602762] megaraid_sas 0000:5e:00.0: 3544 (646709517s/0x0001/FATAL) - Uncorrectable medium error logged for VD 03/3 at 155622000 (on PD 04(e0xfc/s3) at 155622000)
Jun 29 01:32:40 xxxxxxcel02 kernel: [7898777.602907] megaraid_sas 0000:5e:00.0: 3545 (646709517s/0x0002/FATAL) - Unrecoverable medium error during recovery on PD 04(e0xfc/s3) at 155622800
Jun 29 01:32:40 xxxxxxcel02 kernel: [7898777.606040] sd 0:2:3:0: [sdd] tag#15 megasas: target reset FAILED!!
Jun 29 01:32:40 xxxxxxcel02 kernel: [7898777.606047] megaraid_sas 0000:5e:00.0: IO/DCMD timeout is detected, forcibly FAULT Firmware
Jun 29 01:32:50 xxxxxxcel02 kernel: [7898787.110532] megaraid_sas 0000:5e:00.0: Number of host crash buffers allocated: 512

Jun 29 01:33:16 xxxxxxcel02 kernel: [7898813.723159] megaraid_sas 0000:5e:00.0: OCR done for IO timeout case
Jun 29 01:33:38 xxxxxxcel02 kernel: [7898835.333057] megaraid_sas 0000:5e:00.0: 3713 (646709587s/0x0002/FATAL) - Unrecoverable medium error during recovery on PD 04(e0xfc/s3) at 327037800
Jun 29 01:34:21 xxxxxxcel02 kernel: [7898878.752286] sd 0:2:3:0: [sdd] tag#48 megasas: target reset FAILED!!
Jun 29 01:34:21 xxxxxxcel02 kernel: [7898878.752297] sd 0:2:9:0: [sdj] tag#23 megasas: target reset FAILED!!
Jun 29 01:34:21 xxxxxxcel02 kernel: [7898878.752308] sd 0:2:4:0: [sde] tag#456 megasas: target reset FAILED!!
Jun 29 01:34:21 xxxxxxcel02 kernel: [7898878.752318] sd 0:2:0:0: [sda] tag#459 megasas: target reset FAILED!!
Jun 29 01:34:21 xxxxxxcel02 kernel: [7898878.752329] sd 0:2:1:0: [sdb] tag#457 megasas: target reset FAILED!!
Jun 29 01:34:21 xxxxxxcel02 kernel: [7898878.752337] sd 0:2:7:0: [sdh] tag#453 megasas: target reset FAILED!!
Jun 29 01:34:21 xxxxxxcel02 kernel: [7898878.752341] sd 0:2:11:0: [sdl] tag#368 megasas: target reset FAILED!!
Jun 29 01:34:21 xxxxxxcel02 kernel: [7898878.752346] sd 0:2:2:0: [sdc] tag#425 megasas: target reset FAILED!!
Jun 29 01:34:21 xxxxxxcel02 kernel: [7898878.752350] sd 0:2:5:0: [sdf] tag#448 megasas: target reset FAILED!!
Jun 29 01:34:21 xxxxxxcel02 kernel: [7898878.752354] sd 0:2:10:0: [sdk] tag#602 megasas: target reset FAILED!!
Jun 29 01:34:21 xxxxxxcel02 kernel: [7898878.752359] sd 0:2:8:0: [sdi] tag#461 megasas: target reset FAILED!!
Jun 29 01:34:21 xxxxxxcel02 kernel: [7898878.752364] sd 0:2:6:0: [sdg] tag#553 megasas: target reset FAILED!!
Jun 29 01:34:21 xxxxxxcel02 kernel: [7898878.752394] megaraid_sas 0000:5e:00.0: IO/DCMD timeout is detected, forcibly FAULT Firmware
Jun 29 01:34:22 xxxxxxcel02 kernel: [7898879.263771] megaraid_sas 0000:5e:00.0: Number of host crash buffers allocated: 512
Jun 29 01:34:24 xxxxxxcel02 kernel: [7898881.234835] megaraid_sas 0000:5e:00.0: waiting for controller reset to finish

File_name:: driver_log
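The driver_log burst at 01:34:21 makes the same point at the OS layer: every disk behind the controller fails target reset in the same second, which is consistent with the preceding "Fatal firmware error" / controller reset events rather than with a fault on the slot-3 disk alone. A sketch counting the affected devices (the `/tmp` path is illustrative; the lines are pasted from the excerpt):

```shell
# Save the 01:34:21 "target reset FAILED" lines quoted above.
cat > /tmp/driver_reset.log <<'EOF'
Jun 29 01:34:21 xxxxxxcel02 kernel: [7898878.752286] sd 0:2:3:0: [sdd] tag#48 megasas: target reset FAILED!!
Jun 29 01:34:21 xxxxxxcel02 kernel: [7898878.752297] sd 0:2:9:0: [sdj] tag#23 megasas: target reset FAILED!!
Jun 29 01:34:21 xxxxxxcel02 kernel: [7898878.752308] sd 0:2:4:0: [sde] tag#456 megasas: target reset FAILED!!
Jun 29 01:34:21 xxxxxxcel02 kernel: [7898878.752318] sd 0:2:0:0: [sda] tag#459 megasas: target reset FAILED!!
Jun 29 01:34:21 xxxxxxcel02 kernel: [7898878.752329] sd 0:2:1:0: [sdb] tag#457 megasas: target reset FAILED!!
Jun 29 01:34:21 xxxxxxcel02 kernel: [7898878.752337] sd 0:2:7:0: [sdh] tag#453 megasas: target reset FAILED!!
Jun 29 01:34:21 xxxxxxcel02 kernel: [7898878.752341] sd 0:2:11:0: [sdl] tag#368 megasas: target reset FAILED!!
Jun 29 01:34:21 xxxxxxcel02 kernel: [7898878.752346] sd 0:2:2:0: [sdc] tag#425 megasas: target reset FAILED!!
Jun 29 01:34:21 xxxxxxcel02 kernel: [7898878.752350] sd 0:2:5:0: [sdf] tag#448 megasas: target reset FAILED!!
Jun 29 01:34:21 xxxxxxcel02 kernel: [7898878.752354] sd 0:2:10:0: [sdk] tag#602 megasas: target reset FAILED!!
Jun 29 01:34:21 xxxxxxcel02 kernel: [7898878.752359] sd 0:2:8:0: [sdi] tag#461 megasas: target reset FAILED!!
Jun 29 01:34:21 xxxxxxcel02 kernel: [7898878.752364] sd 0:2:6:0: [sdg] tag#553 megasas: target reset FAILED!!
EOF

# Count the distinct devices that failed target reset in that second.
grep -o '\[sd[a-z]*\]' /tmp/driver_reset.log | sort -u | wc -l
```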

Changes


Cause
