Last updated on SEPTEMBER 15, 2017
Applies to: Linux OS - Version Oracle Linux 6.0 with Unbreakable Enterprise Kernel [2.6.32] to Oracle Linux 7.1 [Release OL6 to OL7U1]
When IO timeouts occur, the Linux kernel SCSI error handler works through a sequence of recovery methods, attempting to recover the failing devices or transports while causing as little disruption as possible to other IO on the system. The standard recovery levels are executed in order, escalating to the next level whenever a recovery attempt fails or a subsequent SCSI Test Unit Ready (TUR) command fails.
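The escalation described above can be pictured with a small model. This is only an illustrative sketch, not kernel code: the level names are descriptive labels for the usual sequence (command abort, device reset, target reset, bus reset, host/HBA reset), and `recover`, `attempt` and `tur_ok` are hypothetical names introduced for this example.

```python
from typing import Callable, Optional, Sequence

# Recovery levels from least to most disruptive.
# The names are illustrative labels, not kernel identifiers.
LEVELS = ("abort_command", "device_reset", "target_reset", "bus_reset", "host_reset")


def recover(attempt: Callable[[str], bool],
            tur_ok: Callable[[], bool],
            levels: Sequence[str] = LEVELS) -> Optional[str]:
    """Return the level that recovered the device, or None if every level failed.

    attempt(level) models one recovery attempt at that level.
    tur_ok() models the follow-up Test Unit Ready (TUR) check.
    """
    for level in levels:
        # Escalate when the attempt itself fails OR the follow-up TUR fails.
        if attempt(level) and tur_ok():
            return level
    return None


if __name__ == "__main__":
    # Hypothetical fault: nothing below a target reset helps.
    print(recover(lambda level: level == "target_reset", lambda: True))
```

In this model a failed TUR is treated exactly like a failed attempt, so recovery keeps escalating until one level both succeeds and leaves the device responding.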
When all operations on the external storage time out (for example, because a failed SAN fabric component neither passes traffic nor reports an error condition), this logic can take a very long time to fail the IO on systems with large numbers of devices or targets, because each reset level is repeated for each outstanding command, device, target and so on.
By setting an overall limit on the time spent attempting these operations, and proceeding immediately to the HBA reset if that time expires, the features discussed in this solution provide more consistent and predictable system behavior when faults of this nature occur.
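The effect of such a limit can be illustrated with some rough arithmetic. The sketch below is a simplified model under stated assumptions: every lower-level attempt is assumed to run to its full timeout, the counts and durations in the usage example are made-up values rather than measurements, and `lower_level_time` is a hypothetical helper, not a kernel or Oracle interface.

```python
from typing import Optional, Sequence, Tuple


def lower_level_time(levels: Sequence[Tuple[int, float]],
                     deadline_s: Optional[float] = None) -> float:
    """Seconds spent in lower-level recovery before the HBA reset starts.

    levels: (number_of_attempts, seconds_per_attempt) for each recovery level
            below the HBA reset, e.g. one attempt per device or per target.
    deadline_s: overall time limit; None models the behavior without a limit.
    """
    total = sum(count * seconds for count, seconds in levels)
    if deadline_s is None:
        return total
    # With a limit, recovery skips ahead to the HBA reset once it expires.
    return min(total, deadline_s)


if __name__ == "__main__":
    # Assumed example: 200 devices and 8 targets, 10 s per failed attempt.
    example = [(200, 10.0), (8, 10.0)]
    print(lower_level_time(example))                   # no overall limit
    print(lower_level_time(example, deadline_s=30.0))  # 30 s overall limit
```

Without a limit the time grows with the number of devices and targets; with a limit it is bounded by the configured value regardless of how many devices are affected.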