My Oracle Support Banner

Database hang :: Due to Flash disk failure database level hang detected sometime resulting node eviction (Doc ID 2680404.1)

Last updated on JULY 30, 2020

Applies to:

Oracle Database - Enterprise Edition - Version 12.1.0.2 and later
Information in this document applies to any platform.

Symptoms

Due to flash disk failure, database level hang detected .

Without 19.2.7 Jan re-release version cell image ,database instance detected.And sometimes lead to node eviction.

With 19.2.7 Jan re-release version, we have fix 30903965 included which results in more verbose
debugging and after hitting certain condition restart cellsrv with ORA-600 to clear out the IO
stalled situation.

Sun Jun 07 15:37:43 2020
WARNING: Read Failed. group:1 disk:68 AU:47432 offset:8192 size:8192
path:o/xxx.xxx.xx.xxnn;nnn.nn.nn/DATA1_CD_08_xxxxxxxxxxNN
incarnation:0x981662e7 synchronous result:'I/O error'
subsys:OSS krq:0x7f9249378ef8 bufp:0x7f924919ee00 osderr1:0xdf osderr2:0x0
Exadata error:'Block needs to be resilvered'
IO elapsed time: 0 usec Time waited on I/O: 0 usec

1_1 2020-06-07T15:38:31-05:00 critical "Flash disk failed. Status : FAILED Manufacturer : Oracle Model Number : Flash Accelerator F320 PCIe Card Size : 2981GB Serial Number : S2T7NA0J311194 Firmware : KPYAIR3Q Slot Number : PCI Slot: 2; FDOM: 1 Cell Disk : FD_01_xxxxxxxxxxNN Grid Disk : Not configured"
1_2 2020-06-07T22:53:53-05:00 warning "Flash disk was removed. Status : FAILED Manufacturer : Oracle Model Number : Flash Accelerator F320 PCIe Card Size : 2981GB Serial Number : S2T7NA0J311194 Firmware : KPYAIR3Q Slot Number : PCI Slot: 2; FDOM: 1 Cell Disk : FD_01_xxxxxxxxxxNN Grid Disk : Not configured Flash Cache : Present Flash Log : Present"
1_3 2020-06-07T22:57:19-05:00 clear "Flash disk was replaced. Status : NORMAL Manufacturer : Oracle Model Number : Flash Accelerator F320 PCIe Card Size : 2981GB Serial Number : S2T7NCAH802707 Firmware : KPYAIR3Q Slot Number : PCI Slot: 2; FDOM: 1 Cell Disk : FD_01_xxxxxxxxxxNN Grid Disk : Not configured Flash Cache : Present Flash Log : Present"
2 2020-06-07T15:39:00-05:00 critical "ORA-00600: internal error code, arguments: [io stuck], [], [], [], [], [], [], [], [], [], [], []"

File_name:: alerthistory.out

2020-06-07T15:37:08.207195-05:00
Smart FLASH Caching assigned store FC-FD_00_xxxxxxxxxxNN guid=12a7a322-d68e-4364-b908-f4edbd4b6695 (1783671252) to store FC-FD_01_xxxxxxxxxxNN guid=9c7c765a-c6f8-4e44-bb71-d159fc47162e (3687805468) as a logger
2020-06-07T15:37:08.207204-05:00
FlashLog FL-FD_01_xxxxxxxxxxNN guid=aa766a27-06f6-423a-a3c2-4cbe7bbea421 (1300604316) cdisk=FD_01_xxxxxxxxxxNN is inactive due to write error
2020-06-07T15:37:11.711356-05:00
NO IO COMPLETION ON DISK /dev/nvme2n1 FOR 5000 MILLISECONDS: CD - FD_01_xxxxxxxxxxNN TIME - Sun Jun 7 15:37:11 2020 502 msec.
LAST IO SUBMITTED ON DISK /dev/nvme2n1 AT: Sun Jun 7 15:37:07 2020 353 msec
LAST IO COMPLETED ON DISK /dev/nvme2n1 AT: Sun Jun 7 15:37:06 2020 226 msec
FIRST IO SUBMITTED SINCE LAST COMPLETION ON DISK /dev/nvme2n1 AT: Sun Jun 7 15:37:06 2020 491 msec
NO IO COMPLETION ON DISK /dev/nvme2n1 FOR 5500 MILLISECONDS: CD - FD_01_xxxxxxxxxxNN TIME - Sun Jun 7 15:37:12 2020 3 msec.
LAST IO SUBMITTED ON DISK /dev/nvme2n1 AT: Sun Jun 7 15:37:07 2020 353 msec
LAST IO COMPLETED ON DISK /dev/nvme2n1 AT: Sun Jun 7 15:37:06 2020 226 msec
FIRST IO SUBMITTED SINCE LAST COMPLETION ON DISK /dev/nvme2n1 AT: Sun Jun 7 15:37:06 2020 491 msec
2020-06-07T15:37:12.713142-05:00
NO IO COMPLETION ON DISK /dev/nvme2n1 FOR 6000 MILLISECONDS: CD - FD_01_xxxxxxxxxxNN TIME - Sun Jun 7 15:37:12 2020 505 msec.
LAST IO SUBMITTED ON DISK /dev/nvme2n1 AT: Sun Jun 7 15:37:07 2020 353 msec
LAST IO COMPLETED ON DISK /dev/nvme2n1 AT: Sun Jun 7 15:37:06 2020 226 msec
FIRST IO SUBMITTED SINCE LAST COMPLETION ON DISK /dev/nvme2n1 AT: Sun Jun 7 15:37:06 2020 491 msec
NO IO COMPLETION ON DISK /dev/nvme2n1 FOR 6500 MILLISECONDS: CD - FD_01_xxxxxxxxxxNN TIME - Sun Jun 7 15:37:13 2020 1 msec.
LAST IO SUBMITTED ON DISK /dev/nvme2n1 AT: Sun Jun 7 15:37:07 2020 353 msec
LAST IO COMPLETED ON DISK /dev/nvme2n1 AT: Sun Jun 7 15:37:06 2020 226 msec
FIRST IO SUBMITTED SINCE LAST COMPLETION ON DISK /dev/nvme2n1 AT: Sun Jun 7 15:37:06 2020 491 msec
2020-06-07T15:37:13.715034-05:00

2020-06-07T15:37:37.862098-05:00
CDHS: Do cd health state change FD_01_xxxxxxxxxxNN from HEALTH_GOOD to newState HEALTH_BAD_OFFLINE
updateHealth: For CellDisk FD_01_xxxxxxxxxxNN target health BAD current health GOOD confinement state INACTIVE health state HEALTH_BAD_OFFLINE. Reason posted job
Warning: turned off caching for FlashCache Part 9c7c765a-c6f8-4e44-bb71-d159fc47162e (3687805468) located on cdisk FD_01_xxxxxxxxxxNN due to IO errors or slow/hung device isGridDiskActive 0
2020-06-07T15:37:37.893181-05:00
updateHealth: For CellDisk CD_08_xxxxxxxxxxNN target health BAD current health GOOD confinement state NONE health state HEALTH_GOOD. Reason resilvering
updateHealth: For CellDisk CD_04_xxxxxxxxxxNN target health BAD current health GOOD confinement state NONE health state HEALTH_GOOD. Reason resilvering
updateHealth: For CellDisk CD_10_xxxxxxxxxxNN target health BAD current health GOOD confinement state NONE health state HEALTH_GOOD. Reason resilvering
updateHealth: For CellDisk CD_04_xxxxxxxxxxNN target health BAD current health BAD confinement state NONE health state HEALTH_GOOD. Reason resilvering
updateHealth: For CellDisk CD_10_xxxxxxxxxxNN target health BAD current health BAD confinement state NONE health state HEALTH_GOOD. Reason resilvering
updateHealth: For CellDisk CD_08_xxxxxxxxxxNN target health BAD current health BAD confinement state NONE health state HEALTH_GOOD. Reason resilvering
2020-06-07T15:37:38.007881-05:00
WARNING: Cellsrv UserThread: LWPID: 1101 userId: 10 kernelId: 17 pthreadID: 139939935135488 is stuck for more than 30656588 micro sec in io_submit().
Cellsrv process id=610 will crash as soon as resilvering table update has completed.
WARNING: Cellsrv UserThread: LWPID: 1096 userId: 12 kernelId: 14 pthreadID: 139940058867456 is stuck for more than 30789546 micro sec in io_submit().
Cellsrv process id=610 will crash as soon as resilvering table update has completed.
WARNING: Cellsrv UserThread: LWPID: 1097 userId: 15 kernelId: 15 pthreadID: 139940048377600 is stuck for more than 30789914 micro sec in io_submit().
Cellsrv process id=610 will crash as soon as resilvering table update has completed.
WARNING: Cellsrv UserThread: LWPID: 1092 userId: 18 kernelId: 11 pthreadID: 139940090337024 is stuck for more than 30689674 micro sec in io_submit().

GridDisk DATA1_CD_10_xxxxxxxxxxNN is no longer cached by FlashCache
INFO: Griddisk DATA1_CD_10_xxxxxxxxxxNN (id: 1852541204) no longer cached by flash ID: 3687805468. Resilvering table updated. 10357443 cachelines need resilvering.
updateHealth: For CellDisk CD_10_xxxxxxxxxxNN target health BAD current health BAD confinement state NONE health state HEALTH_GOOD. Reason resilvering
GridDisk DBFS1_CD_04_xxxxxxxxxxNN is no longer cached by FlashCache
GridDisk DBFS1_CD_10_xxxxxxxxxxNN is no longer cached by FlashCache
INFO: 4 resilvering tables were updated because flash disk (ID: 3687805468, guid: 9c7c765a-c6f8-4e44-bb71-d159fc47162e) failed
2020-06-07T15:38:30.255949-05:00
CDHS: Received cd health state change with newState HEALTH_FAIL guid 0b118703-13ad-4bad-ad2d-d51268cf7e36
2020-06-07T15:38:31.203983-05:00
CDHS: Received cd health state change with newState HEALTH_FAIL guid 0b118703-13ad-4bad-ad2d-d51268cf7e36
2020-06-07T15:39:00.721782-05:00
[RS] Monitoring process /opt/oracle/cell/cellsrv/bin/cellrsomt (pid: 491) returned with error: 128
[RS] Service CELLSRV will be restarted.
2020-06-07T15:39:00.732613-05:00
Errors in file /opt/oracle/cell/log/diag/asm/cell/xxxxxxxxxxNN/trace/svtrc_610_28.trc (incident=17):
ORA-00600: internal error code, arguments: [io stuck], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /opt/oracle/cell/log/diag/asm/cell/xxxxxxxxxxNN/incident/incdir_17/svtrc_610_28_i17.trc
2020-06-07T15:39:01.802984-05:00
LinuxBlockIO: io_getevents err = -4, errno 11
2020-06-07T15:39:02.017765-05:00
System state dump completed (with 1 failures)
CELLSRV error - ORA-600 internal error
Sun Jun 7 15:39:02 2020 188 msec State dump completed for CELLSRV<610> after ORA-600 occurred
2020-06-07T15:39:03.015976-05:00

File_name:: alert.log

Jun 7 15:37:07 xxxxxxxxxxNN kernel: [2347199.927778] nvme 0000:84:00.0: Failed status: 0x3, reset controller.
Jun 7 15:37:07 xxxxxxxxxxNN kernel: [2347199.928517] nvme 0000:84:00.0: detected Samsung PM172x controller, limit IO queue depth to 64 and admin queue depth to 63.
Jun 7 15:37:49 xxxxxxxxxxNN kernel: [2347242.426151] nvme nvme2: Device not ready; aborting reset
Jun 7 15:37:49 xxxxxxxxxxNN kernel: [2347242.432299] nvme 0000:84:00.0: Removing after probe failure
Jun 7 15:37:49 xxxxxxxxxxNN kernel: [2347242.432310] nvme2n1: detected capacity change from 3200631791616 to 0 <<-----
Jun 7 15:37:49 xxxxxxxxxxNN kernel: [2347242.432554] blk_update_request: I/O error, dev nvme2n1, sector 89968
Jun 7 15:37:49 xxxxxxxxxxNN kernel: [2347242.439877] blk_update_request: I/O error, dev nvme2n1, sector 88976
Jun 7 15:37:49 xxxxxxxxxxNN kernel: [2347242.447192] blk_update_request: I/O error, dev nvme2n1, sector 1989827888
Jun 7 15:37:49 xxxxxxxxxxNN kernel: [2347242.454983] blk_update_request: I/O error, dev nvme2n1, sector 3519611264
Jun 7 15:37:49 xxxxxxxxxxNN kernel: [2347242.462779] blk_update_request: I/O error, dev nvme2n1, sector 4445051648
Jun 7 15:37:49 xxxxxxxxxxNN kernel: [2347242.470585] blk_update_request: I/O error, dev nvme2n1, sector 76328
Jun 7 15:37:49 xxxxxxxxxxNN kernel: [2347242.477897] blk_update_request: I/O error, dev nvme2n1, sector 75904
Jun 7 15:37:49 xxxxxxxxxxNN kernel: [2347242.485216] blk_update_request: I/O error, dev nvme2n1, sector 75864
Jun 7 15:37:49 xxxxxxxxxxNN kernel: [2347242.492527] blk_update_request: I/O error, dev nvme2n1, sector 91088
Jun 7 15:38:12 xxxxxxxxxxNN systemd[1]: Created slice User Slice of cellmonitor.
Jun 7 15:38:12 xxxxxxxxxxNN systemd-logind[1787]: New session 70672 of user cellmonitor.

File_name:: messages

06/07/2020 03:37:14 PM
avg-cpu: %user %nice %system %iowait %steal %idle
7.30 0.00 1.65 2.40 0.00 88.65

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.60 16.60 2.40 17.60 234.60 4185.60 442.02 0.01 0.30 2.00 0.07 0.30 0.60
sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

sdm 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdm1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdm2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nvme0n1 0.00 0.00 2332.00 362.20 144603.20 13843.20 117.62 0.94 0.34 0.39 0.03 0.03 8.76
nvme1n1 0.00 0.00 2262.20 423.80 140436.80 16347.20 116.74 1.28 0.48 0.56 0.05 0.03 7.20
nvme2n1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 525.00 0.00 0.00 0.00 0.00 100.00
nvme3n1 0.00 0.00 2609.80 1428.60 155489.60 23012.00 88.40 1.13 0.28 0.42 0.02 0.03 11.88

06/07/2020 03:37:44 PM
avg-cpu: %user %nice %system %iowait %steal %idle
3.98 0.00 0.84 4.99 0.00 90.19

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 2.40 19.80 2.80 82.00 813.80 4752.20 131.27 0.02 0.25 5.21 0.08 0.15 1.26
sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

sdm1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdm2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nvme0n1 0.00 0.00 98.60 349.80 1569.60 11867.20 59.93 0.03 0.07 0.17 0.04 0.05 2.12
nvme1n1 0.00 0.00 11.60 515.80 158.40 20216.80 77.27 0.03 0.05 0.12 0.05 0.03 1.48
nvme2n1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 525.00 0.00 0.00 0.00 0.00 100.00
nvme3n1 0.00 0.00 226.80 417.80 3848.00 13872.00 54.98 0.04 0.07 0.12 0.04 0.06 3.56

File_name:: 2020_06_07_15_07_44_IostatExaWatcher_xxxxxxxxxxNN.xxxxx.com.dat

1 D root 30119 2 0 8 19 0 - 0 0 msleep 15:00 ? 00:00:02 [kworker/u80:3]
4 D root 4922 4908 0 37 19 0 - 1832 6839 blk_execute_rq 15:37 ? 00:00:00 nvmecli --identify --device=/dev/nvme2n1 --detail
1 D root 30119 2 0 5 19 0 - 0 0 msleep 15:00 ? 00:00:02 [kworker/u80:3]
4 D root 4922 4908 0 37 19 0 - 1832 6839 blk_execute_rq 15:37 ? 00:00:00 nvmecli --identify --device=/dev/nvme2n1 --detail
1 D root 30119 2 0 1 19 0 - 0 0 msleep 15:00 ? 00:00:02 [kworker/u80:3]
4 D root 4922 4908 0 37 19 0 - 1832 6839 blk_execute_rq 15:37 ? 00:00:00 nvmecli --identify --device=/dev/nvme2n1 --detail
1 D root 30119 2 0 5 19 0 - 0 0 msleep 15:00 ? 00:00:02 [kworker/u80:3]
4 D root 4922 4908 0 37 19 0 - 1832 6839 blk_execute_rq 15:37 ? 00:00:00 nvmecli --identify --device=/dev/nvme2n1 --detail
1 D root 30119 2 0 2 19 0 - 0 0 msleep 15:00 ? 00:00:02 [kworker/u80:3]

File_name:: 2020_06_07_15_16_56_PsExaWatcher_xxxxxxxxxxNN.xxxxx.com.dat

 

Cause

To view full details, sign in with your My Oracle Support account.

Don't have a My Oracle Support account? Click to get started!


In this Document
Symptoms
Cause
Solution


My Oracle Support provides customers with access to over a million knowledge articles and a vibrant support community of peers and Oracle experts.