SCSI Command Timeout in the SAN causes SCSI Commands to Hang in FCP Driver
(Doc ID 2350556.1)
Last updated on DECEMBER 08, 2021
Applies to: Solaris Operating System - Version 10 1/13 U11 and later
Solaris x64/x86 Operating System - Version 10 1/13 U11 and later
Information in this document applies to any platform.
An I/O hang was encountered and commands were found to be queued in the driver.
The symptoms are:
1. queued fcp_pkts from ssd driver (fcp_pkt_t)
2. incomplete internal fcp packets (fcp_ipkt_t)
Root cause for symptom 1:
The problem lies in either the Fibre Channel switch or the storage device. The name service on the Fibre Channel switch reports that a storage device port exists, but PLOGI to that port fails for a long time. As a result, Leadville cannot complete device enumeration in time, and a SCSI packet coming from the ssd driver is queued in the fcp driver. sd_ddi_suspend() tolerates only a 30-second delay, so the packet does not complete in time and sd_ddi_suspend() fails.
Suggested fix for symptom 1:
If the fcp_port state is FCP_STATE_ONLINING and device enumeration is in progress, then as long as some logical units are ready for I/O, fcp_scsi_start does not queue any scsi_pkts from the target driver. To make sure a logical unit is not merely carrying an old online state while enumeration is going on, a new field, fcp_port_t->port_last_onlining_time, records the last time FCP_STATE_ONLINING was set, and another new field, fcp_lun_t->lun_last_online_time, records the last time the logical unit came online. If lun_last_online_time is greater than port_last_onlining_time, the logical unit is considered to have come online during the last device enumeration. The existing fcp_lun state and flags are not accurate enough to determine whether the logical unit really came online during the last round of device enumeration.
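The timestamp comparison described above can be sketched as follows. This is only an illustration under stated assumptions, not the driver's actual code: the sk_port and sk_lun structures, the SK_STATE_ONLINING bit, and both helper functions are hypothetical simplified stand-ins for fcp_port_t, fcp_lun_t, and FCP_STATE_ONLINING. Locking and the rest of the fcp_scsi_start logic are omitted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the FCP_STATE_ONLINING port state bit. */
#define SK_STATE_ONLINING 0x1u

/* Hypothetical, simplified stand-in for fcp_port_t. */
struct sk_port {
    uint32_t state;              /* port state bits */
    int64_t  last_onlining_time; /* models port_last_onlining_time */
};

/* Hypothetical, simplified stand-in for fcp_lun_t. */
struct sk_lun {
    int64_t last_online_time;    /* models lun_last_online_time */
};

/*
 * A logical unit counts as online in the current enumeration round only
 * if it came online after the port last entered the onlining state.
 */
static bool
sk_lun_online_this_round(const struct sk_port *port, const struct sk_lun *lun)
{
    return lun->last_online_time > port->last_onlining_time;
}

/*
 * Decide whether a packet for this logical unit would be held back
 * because of enumeration. In this sketch a packet is queued only while
 * the port is onlining and the logical unit has not yet come online in
 * the current round; all other cases are outside the sketch.
 */
static bool
sk_should_queue(const struct sk_port *port, const struct sk_lun *lun)
{
    if ((port->state & SK_STATE_ONLINING) == 0)
        return false;
    return !sk_lun_online_this_round(port, lun);
}
```

With a port that entered the onlining state at time 100, a logical unit whose last online time is 150 is passed through, while one whose last online time is 40 is treated as stale and its packets are queued.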
Root cause for symptom 2:
Internal fcp packets include PRLI and PLOGI, which time out easily when the Fibre Channel switch or storage device is unstable. In addition, internal commands that had been transported out were not previously recorded anywhere, so memory is leaked if the callback function is never invoked.
Suggested fix for symptom 2:
fcp_port_t->port_active_ipkt_list is a newly added field that stores every internal command allocated by fcp_icmd_alloc. When fcp_handle_port_detach() is called and its 120-second timeout expires, any outstanding internal packets are treated as incomplete and ignored, so the detach proceeds instead of returning FC_FAILURE. Any icmds still on port_active_ipkt_list are freed afterwards. The list also helps when debugging the fcp driver in the future; without it there is no way to track which fcp_ipkt structures have not finished yet.
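The bookkeeping described above can be illustrated with a small user-space sketch. It assumes a doubly linked list and uses hypothetical names throughout (sk_iport, sk_ipkt, sk_icmd_alloc, sk_icmd_free, sk_detach_drain); the real fcp_icmd_alloc and fcp_handle_port_detach() have different signatures and also take locks and wait for the timeout, all of which is left out here.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for an internal packet (fcp_ipkt_t). */
struct sk_ipkt {
    struct sk_ipkt *next;
    struct sk_ipkt *prev;
    int opcode;                  /* e.g. a value standing for PLOGI or PRLI */
};

/* Hypothetical, simplified stand-in for fcp_port_t. */
struct sk_iport {
    struct sk_ipkt *active_list; /* models port_active_ipkt_list */
    size_t active_count;
};

/* Allocate an internal packet and record it on the port's active list. */
static struct sk_ipkt *
sk_icmd_alloc(struct sk_iport *port, int opcode)
{
    struct sk_ipkt *ipkt = calloc(1, sizeof (*ipkt));

    if (ipkt == NULL)
        return NULL;
    ipkt->opcode = opcode;
    ipkt->next = port->active_list;
    if (port->active_list != NULL)
        port->active_list->prev = ipkt;
    port->active_list = ipkt;
    port->active_count++;
    return ipkt;
}

/* Normal completion path: unlink the packet from the list and free it. */
static void
sk_icmd_free(struct sk_iport *port, struct sk_ipkt *ipkt)
{
    if (ipkt->prev != NULL)
        ipkt->prev->next = ipkt->next;
    else
        port->active_list = ipkt->next;
    if (ipkt->next != NULL)
        ipkt->next->prev = ipkt->prev;
    port->active_count--;
    free(ipkt);
}

/*
 * Detach path after the timeout has expired: free every packet whose
 * callback never ran. Returns the number of packets reclaimed.
 */
static size_t
sk_detach_drain(struct sk_iport *port)
{
    size_t freed = 0;

    while (port->active_list != NULL) {
        sk_icmd_free(port, port->active_list);
        freed++;
    }
    return freed;
}
```

Because every allocation is linked onto the list, walking the list at any point shows which internal packets are still outstanding, which is the debugging benefit the note mentions.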
This bug fix tolerates long delays from the switch or storage device to some extent: it shortens the window between the time the Fibre Channel local port comes online and the time all logical units come online. It alleviates this kind of problem, but the problem can still occur when a SCSI packet is transported to a logical unit that is not yet online because of a long delay.
The issue is triggered by a command timeout problem in the SAN.