Exadata Compute Nodes Crashed with many IO errors in CELL alert logs (Doc ID 1531673.1)

Last updated on MARCH 25, 2015

Applies to:

Oracle Database - Enterprise Edition - Version 11.2.0.3 to 11.2.0.3 [Release 11.2]
Oracle Exadata Hardware - Version 11.2.0.3 to 11.2.0.3 [Release 11.2]
Oracle Exadata Storage Server Software - Version 11.2.3.2.0 to 11.2.3.2.0 [Release 11.2]
Information in this document applies to any platform.

Symptoms

The database instance crashed while running under high load.

The CELL alert logs contain many IO errors on both flash cell disks (FD_*) and hard disk cell disks (CD_*):

IO Error: dev=/dev/sdz cdisk=FD_04_XXXXX3c02 op=RD 32768 bytes at sector 124214016 failed with errno=-2
IO Error: dev=/dev/sdy cdisk=FD_03_XXXXX3c02 op=RD 65536 bytes at sector 120966784 failed with errno=-2
IO Error: dev=/dev/sdz cdisk=FD_04_XXXXX3c02 op=RD 65536 bytes at sector 75332736 failed with errno=-2
IO Error: dev=/dev/sdq cdisk=FD_15_XXXXX3c02 op=RD 65536 bytes at sector 122211584 failed with errno=-2
IO Error: dev=/dev/sdaa cdisk=FD_05_XXXXX3c02 op=RD 65536 bytes at sector 26423168 failed with errno=-2
IO Error: dev=/dev/sdab cdisk=FD_06_XXXXX3c02 op=RD 24576 bytes at sector 123508608 failed with errno=-2
IO Error: dev=/dev/sdv cdisk=FD_00_XXXXX3c02 op=RD 65536 bytes at sector 123316352 failed with errno=-2
IO Error: dev=/dev/sdw cdisk=FD_01_XXXXX3c02 op=RD 16384 bytes at sector 124288432 failed with errno=-2
IO Error: dev=/dev/sdb3 cdisk=CD_01_XXXXX3c02 op=RD 90112 bytes at sector 184342352 failed with errno=-2
IO Error: dev=/dev/sdaa cdisk=FD_05_XXXXX3c02 op=RD 65536 bytes at sector 74984576 failed with errno=-2
IO Error: dev=/dev/sdx cdisk=FD_02_XXXXX3c02 op=RD 65536 bytes at sector 72410112 failed with errno=-2
IO Error: dev=/dev/sdaa cdisk=FD_05_XXXXX3c02 op=RD 65536 bytes at sector 75525376 failed with errno=-2
IO Error: dev=/dev/sdaa cdisk=FD_05_XXXXX3c02 op=RD 65536 bytes at sector 75536512 failed with errno=-2
IO Error: dev=/dev/sdy cdisk=FD_03_XXXXX3c02 op=RD 65536 bytes at sector 24994304 failed with errno=-2
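
To see how the errors are distributed, the IO Error lines can be tallied per cell disk and per disk type. The following is a minimal sketch, assuming the lines follow exactly the format shown above. The helper name `tally_io_errors` is hypothetical, and the script only counts lines; it does not interpret the errno value.

```python
import re
from collections import Counter

# Matches the "IO Error:" lines in the format shown above (assumed format).
IO_ERROR_RE = re.compile(
    r"IO Error: dev=(?P<dev>\S+) cdisk=(?P<cdisk>\S+) op=(?P<op>\w+) "
    r"(?P<nbytes>\d+) bytes at sector (?P<sector>\d+) failed with errno=(?P<errno>-?\d+)"
)


def tally_io_errors(log_text):
    """Count IO errors per cell disk and per disk type (FD = flash, CD = hard disk)."""
    per_disk = Counter()
    per_type = Counter()
    for line in log_text.splitlines():
        m = IO_ERROR_RE.search(line)
        if m is None:
            continue  # skip lines that are not IO errors
        cdisk = m.group("cdisk")
        per_disk[cdisk] += 1
        # Cell disk names start with FD_ (flash) or CD_ (hard disk).
        per_type[cdisk[:2]] += 1
    return per_disk, per_type


sample = """\
IO Error: dev=/dev/sdz cdisk=FD_04_XXXXX3c02 op=RD 32768 bytes at sector 124214016 failed with errno=-2
IO Error: dev=/dev/sdb3 cdisk=CD_01_XXXXX3c02 op=RD 90112 bytes at sector 184342352 failed with errno=-2
IO Error: dev=/dev/sdz cdisk=FD_04_XXXXX3c02 op=RD 65536 bytes at sector 75332736 failed with errno=-2
"""

per_disk, per_type = tally_io_errors(sample)
print(per_disk.most_common())
print(dict(per_type))
```

Applied to all fourteen lines above, the same tally gives 13 errors on flash cell disks and 1 on a hard disk cell disk, with FD_05_XXXXX3c02 appearing most often (4 times).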

The ASM alert log shows disk offline messages, but the ASM metadata output shows no disk as offline. The same warning is repeated many times, yet nothing in the output shows that a disk actually went offline:


Thu Jan 24 17:20:16 2013
NOTE: process _user44167_+asm1 (44167) initiating offline of disk 14.3915937104 (DATA_XXXXX3_CD_02_XXXXX3C02) with mask 0x7e[0x7f] in group 1
WARNING: Disk 14 (DATA_XXXXX3_CD_02_XXXXX3C02) in group 1 in mode 0x7f is now being taken offline on ASM inst 1
NOTE: initiating PST update: grp = 1, dsk = 14/0xe9687550, mask = 0x6a, op = clear
Thu Jan 24 17:20:16 2013
GMON updating disk modes for group 1 at 10 for pid 54, osid 44167
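
Because the same warning repeats many times, counting the warnings per disk makes it easier to compare the alert log with the ASM metadata. This is a minimal sketch, assuming the WARNING line format shown above; `count_offline_warnings` is a hypothetical helper, not an Oracle tool.

```python
import re
from collections import Counter

# Matches the WARNING line shown above (assumed format), capturing the
# disk number, disk name and disk group number.
OFFLINE_WARNING_RE = re.compile(
    r"WARNING: Disk (?P<num>\d+) \((?P<name>[^)]+)\) in group (?P<group>\d+) "
    r"in mode 0x[0-9a-fA-F]+ is now being taken offline"
)


def count_offline_warnings(alert_log_text):
    """Return how many times each (group, disk name) was reported as being taken offline."""
    counts = Counter()
    for line in alert_log_text.splitlines():
        m = OFFLINE_WARNING_RE.search(line)
        if m:
            counts[(int(m.group("group")), m.group("name"))] += 1
    return counts


sample = """\
Thu Jan 24 17:20:16 2013
NOTE: process _user44167_+asm1 (44167) initiating offline of disk 14.3915937104 (DATA_XXXXX3_CD_02_XXXXX3C02) with mask 0x7e[0x7f] in group 1
WARNING: Disk 14 (DATA_XXXXX3_CD_02_XXXXX3C02) in group 1 in mode 0x7f is now being taken offline on ASM inst 1
NOTE: initiating PST update: grp = 1, dsk = 14/0xe9687550, mask = 0x6a, op = clear
"""

for (group, name), n in count_offline_warnings(sample).items():
    print(f"group {group} disk {name}: {n} offline warning(s)")
```

The resulting counts can then be checked against the disk mode reported in the ASM metadata (for example, the MODE_STATUS column of V$ASM_DISK) to confirm whether any disk really went offline.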

The cell top output shows the cellsrv process with high CPU usage: its %CPU is 304%.

top - 17:21:41 up 49 days,  6:00,  0 users,  load average: 0.89, 0.85, 0.57
Tasks: 425 total,   1 running, 424 sleeping,   0 stopped,   0 zombie
Cpu(s):  7.7%us,  4.1%sy,  0.0%ni, 86.3%id,  0.7%wa,  0.0%hi,  1.1%si,  0.0%st
Mem:  65963336k total, 25384020k used, 40579316k free,    55848k buffers
Swap:  2097080k total,        0k used,  2097080k free,  4223448k cached

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                  
29640 root      20   0 22.6g 7.9g  12m S 304.4 12.5   3294:25 /opt/oracle/cell11.2.3.2.0_LINUX.X64_120713/cellsrv/bin/cellsrv 100 5000 9 5042
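
A %CPU above 100% is expected for a multi-threaded process: in top's default Irix mode the value is relative to a single CPU, so 304% means roughly three CPUs' worth of work. The sketch below assumes the column order shown above and uses a hypothetical helper, `parse_top_process_row`, to extract the fields from the process row and convert %CPU to an approximate CPU count.

```python
def parse_top_process_row(row):
    """Split one process row of top output into named fields.

    Assumes the default column order shown above:
    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    """
    fields = row.split(None, 11)  # COMMAND (last column) may contain spaces
    return {
        "pid": int(fields[0]),
        "user": fields[1],
        "cpu_pct": float(fields[8]),
        "mem_pct": float(fields[9]),
        "command": fields[11],
    }


row = ("29640 root      20   0 22.6g 7.9g  12m S 304.4 12.5   3294:25 "
       "/opt/oracle/cell11.2.3.2.0_LINUX.X64_120713/cellsrv/bin/cellsrv 100 5000 9 5042")
proc = parse_top_process_row(row)
# In top's default Irix mode, %CPU is relative to one CPU,
# so dividing by 100 gives the approximate number of busy CPUs.
print(proc["pid"], round(proc["cpu_pct"] / 100, 2))
```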

The cell iostat output shows high utilization: %util is 100% on every disk, and r/s is consistently above 350 (383.80 to 445.60 reads per second per disk), which is very high. High Capacity disks are used in this case.

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util

sdc              26.60     0.20 406.00  0.20 97923.20    12.80   241.10   521.24 1341.08   2.46 100.00
sdd              25.20     0.00 414.00  0.00 101968.00     0.00   246.30   349.02 1155.30   2.42 100.00
sde              26.20     0.00 428.80  0.40 98128.00    12.80   228.66   892.45 2771.19   2.33 100.00
sdf              35.60     0.00 412.40  0.20 104924.80     6.40   254.32   142.74  362.83   2.42 100.00
sdg              33.60     0.00 408.80  0.00 99046.40     0.00   242.29   496.82 1450.82   2.45 100.00
sdh              21.20     0.00 399.00  0.00 97641.60     0.00   244.72   388.33 1262.35   2.51 100.00
sdi              22.20     0.40 420.20  0.60 96688.00    14.40   229.81   603.19 1833.91   2.38 100.00
sdj              34.80     0.00 404.20  0.00 98083.20     0.00   242.66   386.81 1028.80   2.47 100.00
sdk              33.40     0.00 383.80  0.00 98142.40     0.00   255.71   144.67  456.54   2.61 100.00
sdl              31.40     0.00 445.60  0.00 102550.40     0.00   230.14   775.37 2251.88   2.24 100.00
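
The iostat rows can be summarized to show per-disk read throughput alongside a saturation flag. This is a minimal sketch, assuming `iostat -x` output in the column layout shown above and 512-byte sectors (the unit iostat uses for rsec/s). The helpers `parse_iostat_rows` and `summarize` and the 99% threshold are illustrative choices, not part of any Oracle tool.

```python
# Column names of the iostat -x extended device report shown above.
COLUMNS = ["device", "rrqm_s", "wrqm_s", "r_s", "w_s", "rsec_s", "wsec_s",
           "avgrq_sz", "avgqu_sz", "await", "svctm", "util"]
SECTOR_BYTES = 512  # iostat reports sectors of 512 bytes


def parse_iostat_rows(text):
    """Parse device rows of iostat -x output into dicts; other lines are skipped."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != len(COLUMNS) or parts[0] == "Device:":
            continue  # skip the header, blank lines and anything malformed
        row = {"device": parts[0]}
        row.update({name: float(v) for name, v in zip(COLUMNS[1:], parts[1:])})
        rows.append(row)
    return rows


def summarize(rows, util_threshold=99.0):
    """Return (device, r/s, read MB/s, saturated) tuples for each disk."""
    report = []
    for r in rows:
        read_mb_s = r["rsec_s"] * SECTOR_BYTES / 1e6
        report.append((r["device"], r["r_s"], round(read_mb_s, 1), r["util"] >= util_threshold))
    return report


sample = """\
Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdc              26.60     0.20 406.00  0.20 97923.20    12.80   241.10   521.24 1341.08   2.46 100.00
sdf              35.60     0.00 412.40  0.20 104924.80     6.40   254.32   142.74  362.83   2.42 100.00
"""

for device, reads_per_s, read_mb_s, saturated in summarize(parse_iostat_rows(sample)):
    print(f"{device}: {reads_per_s:.0f} r/s, {read_mb_s} MB/s read, saturated={saturated}")
```

For sdc this works out to about 50 MB/s of reads. The 100% figure is also consistent with the service time: 2.46 ms (svctm) × 406.2 IOs per second is about 999 ms of busy time in every second, leaving the disk no idle time.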

Changes

There was no change in the setup. The instance was facing a highly IO-intensive load.

Cause
