How to change the DISK_REPAIR_TIME timer after disk goes offline from failgroup
(Doc ID 1404123.1)
Last updated on APRIL 12, 2022
Applies to: Oracle Exadata Storage Server Software - Version 184.108.40.206.0 and later
Oracle Database - Enterprise Edition - Version 220.127.116.11 and later
Oracle Exadata Hardware - Version 18.104.22.168 and later
Oracle Database Cloud Schema Service - Version N/A and later
Oracle Database Exadata Cloud Machine - Version N/A and later
Information in this document applies to any platform.
The DISK_REPAIR_TIME attribute of the disk group controls the maximum acceptable outage duration. Once one or more disks become unavailable to ASM, it will wait for up to the interval specified by DISK_REPAIR_TIME for the disk(s) to come back online. If the disk(s) come back online within this interval, a resync operation occurs, in which only the extents modified while the disks were offline are written to them. If the disk(s) do not come back within this interval, ASM initiates a forced drop of the disk(s), which triggers a rebalance in order to restore redundancy using the surviving disks. Once the disk(s) are back online, they are added to the diskgroup, all existing extents on them are discarded, and another rebalance begins. In other words, the DISK_REPAIR_TIME value is the maximum duration you have to fix the failure; it is also the countdown timer after which ASM drops the disk(s) that have been taken offline. The default setting for DISK_REPAIR_TIME is 3.6 hours.
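As a sketch of the standard syntax (the diskgroup name DATA is a placeholder), the attribute can be inspected via V$ASM_ATTRIBUTE and changed with ALTER DISKGROUP. Note that changing the attribute sets the default for disks offlined in the future; it does not by itself alter the countdown of disks that are already offline:

```sql
-- View the current disk_repair_time for each disk group
SELECT g.name AS diskgroup, a.value
  FROM v$asm_attribute a, v$asm_diskgroup g
 WHERE a.group_number = g.group_number
   AND a.name = 'disk_repair_time';

-- Raise the default outage tolerance to 8.5 hours (DATA is a placeholder)
ALTER DISKGROUP DATA SET ATTRIBUTE 'disk_repair_time' = '8.5h';
```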
Once one or more disks become unavailable to ASM for any reason, there is a lack of (or decreased) redundancy for the extents that were stored on the affected disks. The number of surviving copies of the extents depends on the chosen redundancy level of the diskgroup. With HIGH redundancy, there will be two surviving mirrored copies of the affected extents. With NORMAL redundancy, there will be a single surviving copy.
While disks are offline and the DISK_REPAIR_TIME is counting down, the diskgroups are in a more vulnerable state. Because each disk has 8 partner disks that store the primary or secondary mirror of its extents, if one of the partner disks of the offline disks also fails or otherwise becomes unavailable, the diskgroup may be forcefully dismounted and data loss may occur. If the disks do not come back online before DISK_REPAIR_TIME expires, redundancy is restored only after the successful completion of the rebalance.
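While the countdown is running, the remaining time can be checked from the ASM instance. A minimal query against V$ASM_DISK (REPAIR_TIMER reports the remaining seconds before an offline disk is force-dropped):

```sql
-- List offline disks and the seconds remaining on their repair timer
SELECT name, failgroup, mode_status, repair_timer
  FROM v$asm_disk
 WHERE mode_status = 'OFFLINE';
```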
The advantage of not immediately dropping a disk as soon as it becomes unavailable is to reduce rebalances. A rebalance operation is generally lengthier and more I/O intensive than a resync. Furthermore, a rebalance may not complete if there is insufficient free space in the diskgroup to compensate for the lost disks and restore redundancy across the remaining disks.
If a maintenance activity is ongoing and there is a rough ETA for its completion, but this ETA falls beyond the expiry of DISK_REPAIR_TIME, it may be better to increase the DISK_REPAIR_TIME rather than allow it to expire and the disks to be dropped. Increasing the DISK_REPAIR_TIME means that you expect the maintenance activity to complete within the given time frame and are willing to tolerate the increased risk due to the lack of redundancy for certain extents.
For instance, let's assume a DISK_REPAIR_TIME of 3.6 hours. A storage cell has been down for 3 hours and the maintenance activity is expected to extend for another 3 hours. If we allow the disks belonging to this cell to be dropped after 0.6 hours, the ensuing rebalance with an entire cell missing may take much more than 3 hours to complete, depending on how full the diskgroups are, the size of the rack, the rebalance power, database I/O utilization, etc. As the rebalance would therefore likely not complete before the maintenance activity is over and the disks are back online, allowing the drop to occur would not significantly decrease the period during which ASM is exposed to a greater risk from any additional disk failures. It may make more sense to extend the DISK_REPAIR_TIME so the maintenance activity can conclude. Extending DISK_REPAIR_TIME means that you expect the maintenance activities to complete within a reasonable given time frame and are willing to tolerate the risks of an extended period without redundancy.
This article explains how to change the timer value for disks that have already been offlined by ASM (i.e., disks whose DISK_REPAIR_TIME countdown has already been started by ASM) and how to immediately drop disks before their countdown timer expires.
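For reference, the general ALTER DISKGROUP forms for these two actions are sketched below (RECO and CELL01 are placeholder diskgroup and failgroup names; substitute your own):

```sql
-- Restart the countdown for already-offline disks with a new value,
-- e.g. extending the timer to 20 hours for one failgroup
ALTER DISKGROUP RECO OFFLINE DISKS IN FAILGROUP CELL01 DROP AFTER 20h;

-- Drop the offline disks immediately, without waiting for the timer
ALTER DISKGROUP RECO DROP DISKS IN FAILGROUP CELL01 FORCE;
```

Re-issuing the OFFLINE statement with a DROP AFTER clause resets the countdown for those disks, while the FORCE drop starts the rebalance right away.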