SPARC M7 HOST may panic upon forced SP failover (e.g., reset by Standby)

(Doc ID 2342857.1)

Last updated on APRIL 20, 2018

Applies to:

SPARC M7-8 - Version All Versions to All Versions [Release All Releases]
SPARC M7-16 - Version All Versions to All Versions [Release All Releases]
Oracle SuperCluster M7 Hardware - Version All Versions to All Versions [Release All Releases]
Oracle Solaris on SPARC (64-bit)

Symptoms

A HOST may drop unexpectedly to the OBP debugger upon a forced SP failover, triggered either by explicit user action or by a catastrophic SP reset.

The HOST console shows the drop to the OBP debugger. Upon selecting 's' to sync, the domain panics. For example:
Dec 14 12:15:44 hostname     sshd[59354]: fatal: Read from socket failed: Connection reset by peer

c)ontinue, s)ync, r)eset? s <------------host dropped to debugger
c)ontinue, s)ync, r)eset? s

panic[cpu37]/thread=2a10a3b9b80: sync initiated
sched: trap type = 0x0
pid=0, pc=0x0, sp=0x0, tstate=0x0, context=0x0
o0-o7: 0, 0, 0, 0, 0, 0, 0, 0
g1-g7: 0, 0, 0, 0, 0, 0, 0

 

Domain messages immediately preceding the drop to the debugger resemble those below. The key messages to recognize are "Link retraining detected" and "Surprise removal of mga0 detected".

Dec 14 12:38:09 hostname pcie: [ID 297812 kern.info] NOTICE: Live Suspend: port pci.0,0: child dev mga#0(400417ccab8) and descendants
Dec 14 12:38:09 hostname pcie: [ID 286789 kern.info] NOTICE: Live Suspend: mga0 suspended successfully
Dec 14 12:38:09 hostname pcie: [ID 486281 kern.info] NOTICE: IOR dev:////pci@304/pci@1/pci@0/pci@4/display@0, Reason: device has been surprise removed, Action: Hotplug LSR Suspend, Result: success, Current state: suspended
Dec 14 12:38:09 hostname pcie: [ID 833280 kern.notice] NOTICE: Suspend of mga0 succeeded.
Dec 14 12:38:09 hostname pcie: [ID 958946 kern.warning] WARNING: Link retraining detected in port pcieb7
Dec 14 12:38:09 hostname pcie: [ID 965590 kern.warning] WARNING: Surprise removal of mga0 detected
Dec 14 12:38:09 hostname mac: [ID 486395 kern.info] NOTICE: usbecm2 link down
Dec 14 12:38:15 hostname genunix: [ID 408114 kern.info] /pci@304/pci@1/pci@0/pci@2/usb@0/communications@1 (usbecm2) online
Dec 14 12:38:19 hostname fmd: [ID 377184 daemon.error] SUNW-MSG-ID: PCIEX-8000-3S, TYPE: Fault, VER: 1, SEVERITY: Critical
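A saved copy of the domain messages can be searched for the two key warnings. The following is a minimal Python sketch, not an Oracle-provided tool. The function name find_key_messages is hypothetical, and the patterns assume the warning text appears exactly as in the excerpt above; the port and device names (pcieb7, mga0) are matched loosely in case they differ on another system.

```python
import re

# Patterns copied from the warning text quoted in this note.
# The port name and device name are captured rather than hard-coded.
KEY_PATTERNS = {
    "link_retraining": re.compile(r"WARNING: Link retraining detected in port (\S+)"),
    "surprise_removal": re.compile(r"WARNING: Surprise removal of (\S+) detected"),
}


def find_key_messages(lines):
    """Return (kind, name, line) for every line containing a key warning.

    `lines` is any iterable of strings, for example an open file handle
    on a saved copy of the domain's messages file.
    """
    hits = []
    for line in lines:
        for kind, pattern in KEY_PATTERNS.items():
            match = pattern.search(line)
            if match:
                hits.append((kind, match.group(1), line.rstrip("\n")))
    return hits
```

On Solaris the system messages are normally written to /var/adm/messages, so passing an open handle on a copy of that file to find_key_messages would be one way to use the sketch.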

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Various issues may cause a forced SP failover. One condition known to trigger the failover is SP memory exhaustion. This condition can only be diagnosed by an Oracle engineer when an SP snapshot is provided.

The SP failover will be evident in the event log.  Use the following command to display the event log and look for messages such as those appearing below.

 -> show -script  /SP/logs/event/list

 281 Thu Dec 14 12:47:57 2017 System Log minor
Host ID 0: Solaris running
280 Thu Dec 14 12:43:39 2017 Reset Log minor
/Servers/PDomains/PDomain_0 is now managed by PDomain SPP /SYS/SP1/SPM0.
279 Thu Dec 14 12:43:39 2017 Reset Log minor
/System/DCUs/DCU_0 is now managed by /SYS/SP1/SPM0.
278 Thu Dec 14 12:43:38 2017 Reset Log minor
Failover completed. Active SP is /SYS/SP1/SPM0.
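If the event log output has been captured to a text file, the failover entries can be picked out programmatically. The sketch below is illustrative only: the function name find_failovers is hypothetical, and it assumes the two-line layout shown above (a header line with the event ID and timestamp, followed by the message line).

```python
import re

# Header line, e.g. "278 Thu Dec 14 12:43:38 2017 Reset Log minor"
HEADER = re.compile(
    r"^\s*(\d+)\s+(\w{3} \w{3} +\d+ \d{2}:\d{2}:\d{2} \d{4})\s+(.+?)\s*$"
)
# Message line, e.g. "Failover completed. Active SP is /SYS/SP1/SPM0."
FAILOVER = re.compile(r"Failover completed\. Active SP is (\S+?)\.?$")


def find_failovers(text):
    """Return (event_id, timestamp, active_sp) for each failover entry.

    `text` is the captured output of the event log listing, assumed to
    follow the header-line / message-line layout shown in this note.
    """
    results = []
    current = None  # (event_id, timestamp) of the most recent header line
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = HEADER.match(line)
        if header:
            current = (int(header.group(1)), header.group(2))
            continue
        match = FAILOVER.search(line)
        if match and current is not None:
            results.append((current[0], current[1], match.group(1)))
    return results
```

The timestamp of a reported failover can then be compared with the time of the "Link retraining detected" and "Surprise removal" warnings in the domain messages.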

 

 

 ----------------------------------------------------------------------------------------------------------------------------------------------

The issue impacts all System Firmware releases earlier than 9.8.0.d. The SysFW version can be displayed with this command:

-> show /System/Firmware system_fw_version

 /System/Firmware
    Properties:
        system_fw_version = Sun System Firmware 9.7.4 2016/12/08 07:51
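The version check can be sketched in Python as follows. This is a rough illustration under stated assumptions: the names parse_version and is_impacted are hypothetical, the version is assumed to have the form major.minor.micro with an optional single-letter suffix (as in 9.7.4 and 9.8.0.d), and a release without a letter is assumed to precede the same release with a letter. Confirm the actual ordering against the firmware release notes.

```python
import re

# Threshold taken from this note: releases earlier than this are impacted.
FIXED_IN = "9.8.0.d"

# Assumed version shape: three dotted numbers plus an optional letter suffix.
VERSION = re.compile(r"(\d+)\.(\d+)\.(\d+)(?:\.([a-z]))?")


def parse_version(text):
    """Extract the first firmware version in `text` as a comparable tuple."""
    match = VERSION.search(text)
    if match is None:
        raise ValueError(f"no firmware version found in: {text!r}")
    major, minor, micro, letter = match.groups()
    # A missing letter becomes "", which sorts before any letter (assumption).
    return (int(major), int(minor), int(micro), letter or "")


def is_impacted(system_fw_version):
    """Return True if the reported version is earlier than FIXED_IN."""
    return parse_version(system_fw_version) < parse_version(FIXED_IN)
```

Called with the property value shown above, "Sun System Firmware 9.7.4 2016/12/08 07:51", is_impacted parses the version as 9.7.4, which compares as earlier than 9.8.0.d.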

Changes

 

Cause
