Fatal error has occured in: PCIe fabric when Solaris 11.1 upgraded from SRU19.2 to 19.6 (Doc ID 1930105.1)

Last updated on NOVEMBER 11, 2014

Applies to:

Solaris SPARC Operating System - Version 11.1 to 11.1 [Release 11.0]
Information in this document applies to any platform.

Symptoms

Solaris 11.1 SRU 19.6 encounters a PCIe fatal error panic after updating from SRU 19.2. The panic string and panic stack trace are as follows:

panic string:   Fatal error has occured in: PCIe fabric.(0x2)(0x47)

panic on CPU 250
panic string:   Fatal error has occured in: PCIe fabric.(0x2)(0x47)
==== panic interrupt thread: 0x2a10c083c60  PID: 0  on CPU: 250  affinity CPU: 250 (last_swtch: 0.225983550 seconds earlier)  PIL: 9 ====
cmd: sched(unix:thread_create_intr)
t_procp: 0x1038f9c0 (proc_sched)
   p_as: 0x10396a48 (kas)
   p_zone: 0x106a4258 (global)
t_stk: 0x2a10c083a50  sp: 0x1054ba01  t_stkbase: 0x2a10c07c000
t_pri: 108 (SYS)  pctcpu: 0.000049
t_transience: 10 (TRANSIENT)  t_wkld_flags: 0
t_cpupart: 0x1054c6e8(0)  last CPU: 250
idle: 225983550 hrticks (0.225983550s)
start: Sun Jul 13 20:27:32 2014
age: 2267 seconds (37 minutes 47 seconds)
interrupted (pinned) thread: 0x2a10c02dc60
t_state:     TS_ONPROC
t_flag:      0x10809 (T_INTR_THREAD|T_TALLOCSTK|T_PANIC|T_PUSHPAGE)
t_proc_flag: 0 (none set)
t_schedflag: 0x13 (TS_LOAD|TS_DONT_SWAP|TS_SIGNALLED)
p_flag:      1 (SSYS)

pc:      unix:panicsys+0x48:   call     unix:setjmp

void unix:panicsys+0x48((const char *)0x7bfbaa18, (va_list)0x2a10c0838d8, (struct regs *)0x1054c3c0, (int)1, 0x4480001601, , , , , , , , 0x7bfbaa18, 0x2a10c0838d8)
unix:vpanic_common+0x78(0x7bfbaa18, 0x2a10c0838d8, 0, 0x40020f24908, 0, 3)
void genunix:fm_panic+0x30((const char *)0x7bfbaa18, (void *)0x2a10c083900, 2, 0x47, 0x81010100, 0xff0000, ...)
void px:px_err_panic+0x1c4((int)2, (int)1, (int), (boolean_t)0)
uint_t px:px_err_intr+0x19c((px_fault_t *), (pcie_rc_err_t *))
unix:intr_thread+0x258(0x16, 0xffffffffffffffff, 0xc402a2dca000, 2, 0xc402a2dd1d00, 0xfa)
unix:ktl0+0x64()
-- interrupt data  rp: 0x2a10c02d850
pc:  0x10483b4 unix:cpu_halt+0x158:   call      unix:disable_vec_intr
npc: 0x10483b8 unix:cpu_halt+0x15c:   nop
  global:                       %g1      0xa84f320d2e3
        %g2             0x1098  %g3                  1
        %g4     0xc40285fb6fc8  %g5      0x300001fe000
        %g6                  0  %g7      0x2a10c02dc60
  out:  %o0               0x16  %o1 0xffffffffffffffff
        %o2     0xc402a2dca000  %o3                  2
        %o4     0xc402a2dd1d00  %o5               0xfa
        %sp      0x2a10c02d0f1  %o7          0x10483ac
  loc:  %l0                  1  %l1     0xc402cfe390fc
        %l2               0x16  %l3             0x108c
        %l4               0x7f  %l5               0x79
        %l6         0x100dbc00  %l7                  0
  in:   %i0             0x1000  %i1                  1
        %i2         0x1054c820  %i3         0x1054c6e8
        %i4      0x300001fe000  %i5               0xfa
        %fp      0x2a10c02d1a1  %i7          0x1075b94
<intr trap>void unix:cpu_halt+0x158()
void unix:idle+0x120()
unix:thread_start+4()
-- end of CPU250 idle thread stack --

The console log shows many InfiniBand-related error messages:


        -4m59.50s| WARNING: mcxnex0: EQE local access violation
        -4m59.27s| WARNING: mcxnex0: EQE local access violation
        -4m59.06s| WARNING: mcxnex0: EQE local access violation
        -4m58.85s| WARNING: mcxnex0: EQE local access violation
        -4m58.72s| WARNING: mcxnex0: EQE local access violation
        -4m58.51s| WARNING: mcxnex0: EQE local access violation
        -4m58.39s| WARNING: mcxnex0: EQE local access violation
         -2m7.37s| WARNING: mcxnex0: EQE local access violation
         -2m7.25s| WARNING: mcxnex0: EQE local access violation
         -1m6.43s| WARNING: /scsi_vhci (scsi_vhci0):
                 |      /scsi_vhci/ssd@g600144f0eed9c2d4000053c2f317004d (ssd0):
 Command Timeout on path iscsi0/ssd@0000iqn.1986-03.com.sun:02:0eb8e466-0aeb-eeb6-d10a-d365afea8a310002,0
         -1m6.43s| NOTICE: iscsi session(9) unable to enumerate logical unit - inquiry failed lun 0
         -1m6.33s| WARNING: mcxnex0: EQE local access violation
         -1m6.08s| created version 34 pool murex-uat using 34
           -0.93s| WARNING: mcxnex0: EQE local access violation
           -0.78s| WARNING: mcxnex0: EQE local access violation
           -0.64s| WARNING: mcxnex0: EQE local access violation
           -0.50s| WARNING: mcxnex0: EQE local access violation
           -0.37s| WARNING: mcxnex0: EQE local access violation
           -0.13s| WARNING: mcxnex0: CQE ERR: cqe 40080692b60 QPN 1c004d indx 5b status 0x4  vendor syndrome 57
           -0.13s| WARNING: mcxnex0: CQE local protection error
           -0.13s| WARNING: mcxnex0: ERRCQE is at 40080692b60

The problem occurs when the following conditions are satisfied:

1) ZFS is used with intensive I/O activity

2) The InfiniBand driver is used for the interconnect to the disk array, with the iSER and mcxnex drivers enabled

3) ZFS is configured to use iSCSI logical units on the initiator and performs intensive I/O

4) COMSTAR (the SCSI target mode framework) or a ZFSSA is used as the iSCSI storage target
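
The conditions above can be checked from the shell. This is a sketch, not part of the original note: pkg(1), modinfo(1M), iscsiadm(1M) and stmfadm(1M) are standard Solaris 11 utilities, and the guard makes the script a harmless no-op on other systems.

```shell
#!/bin/sh
# Sketch: check whether a host matches the conditions above.

check_panic_conditions() {
    if [ "$(uname -s)" != "SunOS" ]; then
        echo "not a Solaris system; skipping checks"
        return 0
    fi
    # 1) Current SRU level: the 'entire' package version encodes the SRU.
    pkg info entire | grep -i version
    # 2) InfiniBand/iSER stack: are the mcxnex and iser modules loaded?
    modinfo | egrep 'mcxnex|iser'
    # 3) Initiator side: iSCSI targets visible to this host.
    iscsiadm list target
    # 4) COMSTAR target side: any logical units exported?
    stmfadm list-lu
}

check_panic_conditions
```

If the SRU level reports 19.6, the mcxnex/iser modules are loaded, and iSCSI LUs are in use on either side, the host matches the vulnerable configuration.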

Among the configurations that will trigger the panic are:

a) A Solaris box running 11.1 SRU 19.6 with a 40 Gbit QDR InfiniBand HCA (acts as the iSCSI/iSER initiator)

b) A Solaris box running 11.1 SRU 19.6 with COMSTAR and a 40 Gbit QDR InfiniBand HCA (acts as the iSCSI/iSER target)


Changes

The problem only occurs when updating Solaris 11.1 SRU 19.2 to SRU 19.6 under intense ZFS I/O activity.

Cause
