ODA Outage including diskgroup offlined: ASM alert.log WARNING: Waited 15 secs for write IO to PST disk [0,1...23] in group [ 1 | 2 | 3 ]
(Doc ID 1940986.1)
Last updated on DECEMBER 07, 2022
Applies to:
Oracle Database Appliance Software - Version 2.1.0.1 to 12.1.2.9 [Release 2.1 to 12.1]
Oracle Database Appliance X3-2 - Version All Versions to All Versions [Release All Releases]
Oracle Database Appliance X4-2 - Version All Versions to All Versions [Release All Releases]
Oracle Database Appliance - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.
ODA, Node Outage, Crash, ASM, Poor Performance, Diskgroup offline
Symptoms
This problem has several distinctive symptoms; the most severe is a node crash:
- Diskgroup outage
- Very Slow IO Performance*
- Possible very high CPU
- Timeouts for IO
- Communication failures with ASM, CRS, or CSS
Additional symptoms:
- Only one node is active; the other hangs while starting ASM.**
- After an outage the node restarts, but IO waits are very high
- Overall very slow performance on one node, with no load or other evidence of why the IO wait statistics are so high
* Note:
Confirm whether your issue is very slow performance at the DISKGROUP level.
Then investigate further to confirm whether the problem appears to be related to a single substandard disk (a sketch for this check follows these notes).
If so, the enhancement included in UEK4, shipped with ODA 12.1.2.11.0, may resolve your problem.
** If your symptom is seen ONLY on one node, continue with this note for further information.
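To check whether the slow IO clusters on a single substandard disk, the slow PST write and read-failure warnings in the ASM alert.log (shown in the excerpts later in this note) can be tallied per disk. The following is a minimal sketch only, assuming a classic text-format alert.log; the path used here is an assumption and must be adjusted to the ASM trace directory on your node.

#!/usr/bin/env python
# Minimal sketch: tally slow PST writes and read failures per (group, disk)
# from the ASM alert.log to see whether they cluster on one substandard disk.
import re
from collections import Counter

# Assumed path - adjust to the ASM alert.log location on your ODA node.
ALERT_LOG = "/u01/app/grid/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log"

pst_wait  = re.compile(r"Waited \d+ secs for write IO to PST disk (\d+) in group (\d+)")
read_fail = re.compile(r"Read Failed\. group:(\d+) disk:(\d+)")

counts = Counter()
with open(ALERT_LOG) as log:
    for line in log:
        m = pst_wait.search(line)
        if m:
            counts[("slow PST write", int(m.group(2)), int(m.group(1)))] += 1
        m = read_fail.search(line)
        if m:
            counts[("Read Failed", int(m.group(1)), int(m.group(2)))] += 1

for (kind, group, disk), n in counts.most_common():
    print("group %d disk %2d : %4d x %s" % (group, disk, n, kind))

A count heavily skewed toward one disk points at that slot rather than at the node as a whole.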
Changes
None known to the users or the DBA.
A review of disk performance in the alert.log or other sources will usually reveal substandard performance on at least one disk.
Many delayed ASM PST heartbeats are seen on ASM disks in a normal or high redundancy diskgroup.
This results in the ASM instance dismounting the diskgroup.
ASM ALERT.LOG
The disk number can range from 0 up to 23, and the diskgroup can be 1 (DATA), 2 (RECO), or 3 (REDO).
...
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 2
...
...
NOTE: process _b000_+asm1 (12580) initiating offline of disk 0.3915926799 (HDD_E0_S00_576669152P1) with mask 0x7e in group 1 << Can be for any ODA Diskgroup: 1 (DATA), 2 (RECO) or 3 (REDO)
NOTE: process _b000_+asm1 (12580) initiating offline of disk 1.3915926797 (HDD_E0_S01_576659536P1) with mask 0x7e in group 1 << Can be any disk including HDD [0-19] or SSD [20-23]
NOTE: process _b000_+asm1 (12580) initiating offline of disk 2.3915926788 (HDD_E0_S02_576440136P1) with mask 0x7e in group 1
NOTE: checking PST: grp = 1
GMON checking disk modes for group 1 at 14 for pid 50, osid 12580
...
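As a quick way to see which disks ASM began to offline and in which diskgroup, the "initiating offline of disk" lines above can be extracted. This is a minimal sketch only; the log path and the regular expression are assumptions based on the message format shown in the excerpt, and the group-to-name mapping is the default ODA layout described above (1 = DATA, 2 = RECO, 3 = REDO).

#!/usr/bin/env python
# Minimal sketch: list the disks ASM started to offline, with the default
# ODA diskgroup names from this note (1=DATA, 2=RECO, 3=REDO).
import re

# Assumed path - adjust for your environment.
ALERT_LOG = "/u01/app/grid/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log"
GROUP_NAMES = {1: "DATA", 2: "RECO", 3: "REDO"}

offline = re.compile(
    r"initiating offline of disk (\d+)\.\d+ \((\S+)\) with mask \S+ in group (\d+)")

with open(ALERT_LOG) as log:
    for line in log:
        m = offline.search(line)
        if m:
            disk, name, group = int(m.group(1)), m.group(2), int(m.group(3))
            print("offline initiated: disk %2d (%s) in group %d [%s]"
                  % (disk, name, group, GROUP_NAMES.get(group, "?")))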
Symptom - GMON
...
GMON dismounting group 2 at 151 for pid 24, osid 13912 << Can be for any ODA Diskgroup: 1 (DATA), 2 (RECO) or 3 (REDO)
NOTE: Disk SSD_E0_S20_805853057P1 in mode 0x7f marked for de-assignment << Can be any disk including HDD [0-19] or SSD [20-23]
NOTE: Disk SSD_E0_S21_805849551P1 in mode 0x7f marked for de-assignment
NOTE: Disk SSD_E0_S22_805853058P1 in mode 0x7f marked for de-assignment
NOTE: Disk SSD_E0_S23_805852406P1 in mode 0x7f marked for de-assignment
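The GMON messages above can be reduced in the same way to show which diskgroup was dismounted and which member disks were marked for de-assignment. Again this is only a sketch, with an assumed log path and patterns taken from the excerpt.

#!/usr/bin/env python
# Minimal sketch: extract GMON dismount and disk de-assignment messages.
import re

# Assumed path - adjust for your environment.
ALERT_LOG = "/u01/app/grid/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log"

dismount = re.compile(r"GMON dismounting group (\d+)")
deassign = re.compile(r"Disk (\S+) in mode \S+ marked for de-assignment")

with open(ALERT_LOG) as log:
    for line in log:
        m = dismount.search(line)
        if m:
            print("diskgroup %s dismounted by GMON" % m.group(1))
        m = deassign.search(line)
        if m:
            print("  disk %s marked for de-assignment" % m.group(1))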
Another very common accompanying error is read failures found in the ASM ALERT.LOG. The "NOTE: repairing group", "WARNING: Read Failed", and SUCCESS messages become more and more frequent (a sketch for tracking this frequency follows the excerpt below).
ASM ALERT.LOG
...
...
NOTE: repairing group 1 file 459 extent 34 ------- The read failures and repairing warnings can appear several hours to days before the outage!
...
WARNING: Read Failed. group:1 disk:2 AU:54900 offset:3145728 size:1048576
...
SUCCESS: extent 34 of file 459 group 1 repaired by relocating to a different AU on the same disk or the disk is offline -- Followed by SUCCESS - relocation messages
...
NOTE: repairing group 1 file 459 extent 54 ------- More Repairing Group messages!
NOTE: repairing group 1 file 459 extent 54
...
WARNING: Read Failed. group:1 disk:2 AU:54904 offset:2097152 size:1048576 -------- More Read Failed messages...
NOTE: repairing group 1 file 459 extent 54
WARNING: Read Failed. group:1 disk:2 AU:54904 offset:2097152 size:1048576
...
---- This pattern continues, with the "WARNING: Read Failed", "NOTE: repairing group", and SUCCESS relocation messages recurring in an ever tighter time loop
...
...
SUCCESS: extent 54 of file 459 group 1 repaired by relocating to a different AU on the same disk or the disk is offline
SUCCESS: extent 54 of file 459 group 1 repaired by relocating to a different AU on the same disk or the disk is offline
SUCCESS: extent 54 of file 459 group 1 repaired by relocating to a different AU on the same disk or the disk is offline
...
NOTE: repairing group 1 file 459 extent 54
SUCCESS: extent 54 of file 459 group 1 repaired - all online mirror sides found readable, no repair required
SUCCESS: extent 54 of file 459 group 1 repaired - all online mirror sides found readable, no repair required
...
--- UNTIL FAILURE with any of several assorted messages:
...
Received dirty detach msg from inst 1 for dom 2
Fri Oct 31 02:02:36...
List of instances:
1 2
Dirty detach reconfiguration started (new ddet inc 1, cluster inc 8)
Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 2 invalid = TRUE
The ASM Alert log may show some or all of the following errors:
ORA-29701
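Because the read-failure and repair messages typically tighten over hours to days before the failure, bucketing them per hour makes the escalation easy to see. The sketch below assumes the classic text-format alert.log, where message blocks are preceded by a date line such as "Fri Oct 31 02:02:36 2014"; the path is again an assumption.

#!/usr/bin/env python
# Minimal sketch: count "Read Failed" and "repairing group" messages per hour
# to see how the repair/read-failure loop tightens before the outage.
from collections import Counter
from datetime import datetime

# Assumed path - adjust for your environment.
ALERT_LOG = "/u01/app/grid/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log"

per_hour = Counter()
current_hour = None

with open(ALERT_LOG) as log:
    for line in log:
        line = line.strip()
        try:
            # Date lines in the text alert.log look like "Fri Oct 31 02:02:36 2014".
            ts = datetime.strptime(line, "%a %b %d %H:%M:%S %Y")
            current_hour = ts.strftime("%Y-%m-%d %H:00")
            continue
        except ValueError:
            pass
        if current_hour and ("Read Failed" in line or "repairing group" in line):
            per_hour[current_hour] += 1

for hour in sorted(per_hour):
    print("%s  %5d read-failure/repair messages" % (hour, per_hour[hour]))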
Cause