Solaris Volume Manager (SVM) Additional Recovery/Troubleshooting Options when Mirror re-syncing Fails due to Bad Blocks
(Doc ID 2107282.1)
Last updated on AUGUST 14, 2020
Applies to:
Sun Solaris Volume Manager (SVM) - Version 11.9.0 to 11.11 [Release 11.0]
Oracle Solaris on SPARC (64-bit)
Oracle Solaris on SPARC (32-bit)
Oracle Solaris on x86-64 (64-bit)
Oracle Solaris on x86 (32-bit)
This document assists the reader with recovery options when bad blocks are encountered during mirror resynchronization.
Solaris Volume Manager (SVM) allows RAID-1 mirrors with up to four submirrors. Typically only two are used, most commonly for the OS. At some point one of these disks will show some form of failure. Sometimes it manifests as a complete hard failure, where the drive fails to respond to any commands, but often enough the drive will start to suffer I/O failures to specific sectors (blocks) on the disk.
These are shown as media errors, typically with ASC 0x11 reported in the message log. Sometimes these I/O failures are retryable; other times they become fatal. A fatal I/O to an SVM submirror puts that submirror into the 'Needs maintenance' state. At that point, further I/O to that submirror is terminated unless it is the last running submirror. Writes to the first submirror placed into maintenance cease, causing that side to become stale.
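The submirror states described above can be inspected with metastat(1M). A minimal sketch follows, with the pattern match factored into a function so it can also be run against saved output; the function name is an illustrative assumption, not part of this document:

```shell
# Hedged sketch: flag SVM components reporting trouble.
# check_mirror_state reads metastat(1M) output on stdin and prints any
# "State:" lines showing 'Needs maintenance' or 'Last Erred'.
check_mirror_state() {
    grep -E 'State: (Needs maintenance|Last Erred)'
}

# Typical use on a live system:
#   metastat | check_mirror_state
```

A component reported as 'Okay' produces no output, so the function's exit status alone indicates whether attention is needed.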
Knowing when the device was put into maintenance is very important.
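One way to establish that timing is to search the system logs for the earliest medium-error report. A minimal sketch, assuming the common sd(7D) message format and the default /var/adm/messages location; the function name is illustrative:

```shell
# Hedged sketch: print the earliest logged ASC 0x11 (medium error) line.
# Reads syslog-format text on stdin so it can be pointed at any log file.
first_media_error() {
    grep -i 'ASC: 0x11' | head -1
}

# Typical use, oldest rotated log first:
#   cat /var/adm/messages.* /var/adm/messages 2>/dev/null | first_media_error
```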
Unless the System Administrator (SA) has set up some form of monitoring, such as a shell script run out of cron, periodic checks of the cluster manager GUI, SNMP traps (more on this later), or a third-party monitoring application, the original failure can go undetected: the user community still has access to the filesystem mountpoint and the subdirectories it contains, and applications continue to run. This is, after all, why the mirrors were configured in the first place. It is often only when the second side of the mirror starts having trouble that the SA is informed there is a problem with the system. This can be weeks, months, or longer after the original failure. When the second side of a two-way mirror is placed into maintenance, its component is put into the 'Last Erred' state. That state allows I/Os to continue to be retried indefinitely on that side. Depending on many factors, the effects can be limited (if the bad blocks are infrequently accessed) or substantial, leading to very poor performance and possibly OS or application hangs. It can even render a system unbootable.
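A cron-driven check along these lines can be sketched as follows; the mail address, script path, and schedule are illustrative assumptions, not part of this document:

```shell
# Hedged sketch of a cron-based SVM health check.
# svm_needs_attention reads metastat(1M) output on stdin and exits 0
# (success) when any component needs attention, so a cron entry can
# chain a notification command onto it.
svm_needs_attention() {
    grep -E 'Needs maintenance|Last Erred' >/dev/null
}

# Illustrative crontab entry, checking every 15 minutes (assumes mailx
# is configured; address and script path are examples only):
#   0,15,30,45 * * * * /usr/local/bin/svm_check.sh
# where svm_check.sh runs:
#   metastat | svm_needs_attention && \
#       metastat | mailx -s "SVM fault on `hostname`" admin@example.com
```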
Reference: <Document 1004417.1> Solaris Volume Manager (SVM): Understanding the "Needs Maintenance" and "Last Erred" States