ODA Outage including diskgroup offlined: ASM alert.log WARNING: Waited 15 secs for write IO to PST disk [0,1...23] in group [ 1 | 2 | 3 ] (Doc ID 1940986.1)

Last updated on JULY 19, 2017

Applies to:

Oracle Database Appliance X3-2 - Version All Versions to All Versions [Release All Releases]
Oracle Database Appliance X4-2 - Version All Versions to All Versions [Release All Releases]
Oracle Database Appliance - Version All Versions to All Versions [Release All Releases]
Oracle Database Appliance Software - Version 2.1.0.1 to 12.1.2.9 [Release 2.1 to 12.1]
Information in this document applies to any platform.
ODA, Node Outage, Crash, ASM, Poor Performance, Diskgroup offline

Symptoms

This problem has several distinctive symptoms; the most severe is a node crash:

  • Diskgroup outage
  • Very Slow IO Performance*
  • Possible very high CPU
  • Timeouts for IO
  • Communications to ASM, CRS or CSS failures

More:

 * Note:
          Confirm whether your issue is very slow performance at the DISKGROUP level.
          If so, investigate further to confirm whether the problem is related to a single substandard disk.
          If it is, the enhancement in UEK4, included in ODA 12.1.2.11.0, may resolve your problem.
      ** If your symptom is seen on ONLY one node, continue with this note for further information.
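A quick first triage for this pattern is to count the PST write-wait warnings per diskgroup in the ASM alert log. The sketch below is illustrative only; the alert-log path is an assumption and varies with your Grid home and diag destination.

```shell
#!/bin/sh
# Sketch (path is an assumption -- adjust to your diag destination):
# count "Waited ... secs for write IO to PST disk" warnings per diskgroup.
ALERT_LOG=${1:-/u01/app/grid/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log}

if [ -f "$ALERT_LOG" ]; then
  grep 'secs for write IO to PST disk' "$ALERT_LOG" |
    sed 's/.*in group \([0-9][0-9]*\).*/group \1/' |
    sort | uniq -c
fi
```

A count concentrated in one diskgroup narrows the investigation to the disks backing that group.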

Changes

None known to the users or DBAs.
A review of disk performance in the alert.log or other sources will usually reveal substandard performance on at least one disk.

Many delayed ASM PST heartbeats are seen on ASM disks in a normal or high redundancy diskgroup.
This results in the ASM instance dismounting the diskgroup.

ASM ALERT.LOG

           The disk number can range from 0 up to 23, and the diskgroup can be 1 (DATA), 2 (RECO) or 3 (REDO)
...
 WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
 WARNING: Waited 15 secs for write IO to PST disk 2 in group 2.
 WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
 WARNING: Waited 15 secs for write IO to PST disk 2 in group 2

    ...
    ...
 NOTE: process _b000_+asm1 (12580) initiating offline of disk 0.3915926799 (HDD_E0_S00_576669152P1) with mask 0x7e in group 1    << Can be for any ODA diskgroup: 1 (DATA), 2 (RECO) or 3 (REDO)
 NOTE: process _b000_+asm1 (12580) initiating offline of disk 1.3915926797 (HDD_E0_S01_576659536P1) with mask 0x7e in group 1    << Can be any disk, including HDD [0-19] or SSD [20-23]
 NOTE: process _b000_+asm1 (12580) initiating offline of disk 2.3915926788 (HDD_E0_S02_576440136P1) with mask 0x7e in group 1
 NOTE: checking PST: grp = 1
GMON checking disk modes for group 1 at 14 for pid 50, osid 12580
  
   ...
                   Symptom - GMON
   ...
 GMON dismounting group 2 at 151 for pid 24, osid 13912                                   << Can be for any ODA diskgroup: 1 (DATA), 2 (RECO) or 3 (REDO)
 NOTE: Disk SSD_E0_S20_805853057P1 in mode 0x7f marked for de-assignment                  << Can be any disk, including HDD [0-19] or SSD [20-23]
 NOTE: Disk SSD_E0_S21_805849551P1 in mode 0x7f marked for de-assignment
 NOTE: Disk SSD_E0_S22_805853058P1 in mode 0x7f marked for de-assignment
 NOTE: Disk SSD_E0_S23_805852406P1 in mode 0x7f marked for de-assignment
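The offline and de-assignment messages above can be summarised to see which disks are involved and how often. The following is a sketch only; the alert-log path is an assumption, and the `sed` patterns match the message formats shown in this note.

```shell
#!/bin/sh
# Sketch (path is an assumption): list which ASM disks were offlined or
# marked for de-assignment, with occurrence counts, most frequent first.
ALERT_LOG=${1:-alert_+ASM1.log}

if [ -f "$ALERT_LOG" ]; then
  sed -n \
    -e 's/.*initiating offline of disk [0-9.]* (\([A-Z0-9_]*\)).*/\1/p' \
    -e 's/.*Disk  *\([A-Z0-9_]*\) in mode .* marked for de-assignment.*/\1/p' \
    "$ALERT_LOG" | sort | uniq -c | sort -rn
fi
```

One disk name dominating the output points at the single-substandard-disk scenario described in the Symptoms section.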

 

Another very common accompanying error is a Read Failure in the ASM ALERT.LOG. The "repairing group" messages, read failures, and SUCCESS messages become more and more frequent.

ASM ALERT.LOG
...                                                                              
...
NOTE: repairing group 1 file 459 extent 34                                       ------- The read failures and repairing warnings can appear several hours to days before the outage!
...
WARNING: Read Failed. group:1 disk:2 AU:54900 offset:3145728 size:1048576
...
SUCCESS: extent 34 of file 459 group 1 repaired by relocating to a different AU on the same disk or the disk is offline   -- Followed by SUCCESS - relocation messages
...
NOTE: repairing group 1 file 459 extent 54                                       ------- More repairing-group messages!
NOTE: repairing group 1 file 459 extent 54
 ...
WARNING: Read Failed. group:1 disk:2 AU:54904 offset:2097152 size:1048576        ------- More Read Failed messages...
NOTE: repairing group 1 file 459 extent 54
WARNING: Read Failed. group:1 disk:2 AU:54904 offset:2097152 size:1048576
...
     ----    This pattern continues, with the WARNING: Read Failed, NOTE: repairing group, and SUCCESS: relocating messages recurring in an ever tighter time loop
...
...
SUCCESS: extent 54 of file 459 group 1 repaired by relocating to a different AU on the same disk or the disk is offline
SUCCESS: extent 54 of file 459 group 1 repaired by relocating to a different AU on the same disk or the disk is offline
SUCCESS: extent 54 of file 459 group 1 repaired by relocating to a different AU on the same disk or the disk is offline
...
NOTE: repairing group 1 file 459 extent 54
SUCCESS: extent 54 of file 459 group 1 repaired - all online mirror sides found readable, no repair required
SUCCESS: extent 54 of file 459 group 1 repaired - all online mirror sides found readable, no repair required
...

       ---   UNTIL FAILURE, with any of several assorted messages:
...
 Received dirty detach msg from inst 1 for dom 2
Fri Oct 31 02:02:36...
List of instances:
 1 2
Dirty detach reconfiguration started (new ddet inc 1, cluster inc 8)
 Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 2 invalid = TRUE
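To gauge how the read-failure loop tightens before the outage, the Read Failed warnings can be tallied per diskgroup and disk. Again, this is a sketch only; the alert-log path is an assumption.

```shell
#!/bin/sh
# Sketch (path is an assumption): count "WARNING: Read Failed" messages
# per diskgroup and disk in the ASM alert log, most frequent first.
ALERT_LOG=${1:-alert_+ASM1.log}

if [ -f "$ALERT_LOG" ]; then
  grep 'WARNING: Read Failed' "$ALERT_LOG" |
    sed 's/.*group:\([0-9]*\) disk:\([0-9]*\).*/group \1 disk \2/' |
    sort | uniq -c | sort -rn
fi
```

A rapidly growing count against a single group/disk pair is the pattern described above, leading up to the dirty detach and diskgroup dismount.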

 
The ASM Alert log may show some or all of the following errors:

ORA-29701

 

Cause
