Solaris 11 or Solaris 10 can suffer a deadlock in the vldc driver (Doc ID 2433567.1)

Last updated on AUGUST 09, 2018

Applies to:

Solaris Operating System - Version 10 3/05 to 11.1 [Release 10.0 to 11.0]
Oracle Solaris on SPARC (64-bit)

Symptoms

A forced kernel dump will show a PIL 6 interrupt thread blocked on a mutex, similar to the thread shown here:

==== interrupt thread: 0x2a100f1dc80 PID: 0 affinity CPU: 1 PIL: 6 ====
cmd: sched(genunix:thread_create_intr)
t_wchan: 0x6001e038030 sobj: mutex owner: 0x3006679e060
top owner (0x3006679e060) is waiting for a semaphore @ 0x30059428e80
t_procp: 0x1911b40 (proc_sched)
p_as: 0x1914810 (kas)
p_zone: 0x1a23340 (global)
t_stk: 0x2a100f1da90 sp: 0x2a100f1ce51 t_stkbase: 0x2a100f18000
t_pri: 165 (SYS) pctcpu: 0.000000
t_transience: 0 t_wkld_flags: 2 WLKD_CPU_INTENSIVE
t_cpupart: 0x198fd70(0) last CPU: 1
idle: 15297859 ticks (1d18h29m38.59s)
start: Wed Jul 4 17:08:57 2018
age: 2826713 seconds (32 days 17 hours 11 minutes 53 seconds)
t_state: TS_SLEEP
t_flag: 0x4009 (T_INTR_THREAD|T_TALLOCSTK|T_PUSHPAGE)
t_proc_flag: 0 (none set)
t_schedflag: 3 (TS_LOAD|TS_DONT_SWAP)
t_acflag: 4 (TA_NO_ACCOUNTING)
p_flag: 1 (SSYS)

pc: genunix:turnstile_block+0x5fc: call unix:swtch

genunix:turnstile_block+0x5fc(0, 0, 0x6001e038030, 0x1867b48, 0, 0)
unix:mutex_vector_enter+0x428()
unix:mutex_enter(0x6001e038030) - frame recycled
vldc:i_vldc_cb+4(8, 0x6001e000030)
ldc:i_ldc_invoke_cb+0x24(0x60012ccb200, 8, 0x2a100f1d9c8)
ldc:i_ldc_rx_hdlr+0x8c(0x60012ccb200, 0, 0x60012ccb200, 0)
cnex:cnex_intr_wrapper+0x3c(, 0, , , 1)
unix:intr_thread+0x25c(0x16, 1, 0, 0x400, 0x100010000010101, 0x12)
unix:ktl0+0x64()
-- interrupt data rp: 0x2a100ead890
pc: 0x1046f14 unix:cpu_halt+0x118: call unix:disable_vec_intr
npc: 0x1046f18 unix:cpu_halt+0x11c: nop
global: %g1 1
%g2 0x30000b17e40 %g3 8
%g4 0 %g5 0x3000356c000
%g6 0 %g7 0x2a100eadc80
out: %o0 0x16 %o1 1
%o2 0 %o3 0x400
%o4 0x101000101000101 %o5 0x12
%sp 0x2a100ead131 %o7 0x1046f0c
loc: %l0 0x60011b0c57c %l1 0
%l2 0x16 %l3 1
%l4 0 %l5 0
%l6 0 %l7 0x1b
in: %i0 0x3000356c000 %i1 1
%i2 0x198fea0 %i3 0x198fd70
%i4 0x3000356c000 %i5 1
%fp 0x2a100ead1e1 %i7 0x1073858
<intr trap>unix:cpu_halt+0x118()
unix:idle+0x128(0, 0)
unix:thread_start+4()
-- end of CPU1 idle thread stack --

The thread that owns the mutex will be waiting for an I/O request to complete, similar to the thread shown here:

==== user (LWP_SYS) thread: 0x3006679e060 PID: 688 ====
cmd: /usr/lib/fm/fmd/fmd
t_wchan: 0x30059428e80 sobj: semaphore (from genunix:biowait+0x6c)
t_procp: 0x300652a6d20
p_as: 0x30065fa66a8 size: 58114048 RSS: 4792320
a_hat: 0x300141b4080
cnum: CPU0:2798/2876
cpusran: 0,1,2,3,4,5,6,7
p_zone: 0x1a23340 (global)
t_stk: 0x2a101e9fad0 sp: 0x2a101e9df21 t_stkbase: 0x2a101e9a000
t_pri: 60 (TS) t_epri: 165 pctcpu: 0.000000
t_transience: 10 (TRANSIENT) t_wkld_flags: 0
t_lwp: 0x30068411b58 t_tid: 14
machpcb: 0x2a101e9fad0
lwp_ap: 0x2a101e9fbc0
t_mstate: LMS_SLEEP ms_prev: LMS_KFAULT
ms_state_start: 1 days 18 hours 27 minutes 49.550311788 seconds earlier
ms_start: 32 days 17 hours 10 minutes 52.503588283 seconds earlier
t_cpupart: 0x198fd70(0) last CPU: 2
idle: 15286955 ticks (1d18h27m49.55s)
start: Wed Jul 4 17:10:27 2018
age: 2826623 seconds (32 days 17 hours 10 minutes 23 seconds)
t_state: TS_SLEEP
t_flag: 0x1000 (T_LWPREUSE)
t_proc_flag: 0x104 (TP_TWAIT|TP_MSACCT)
t_schedflag: 0x13 (TS_LOAD|TS_DONT_SWAP|TS_SIGNALLED)
t_acflag: 3 (TA_NO_PROCESS_LOCK|TA_BATCH_TICKS)
p_flag: 0x42000000 (SMSACCT|SMSFORK)

pc: genunix:sema_p+0x138: call unix:swtch

genunix:sema_p+0x138(0x30059428e80)
genunix:biowait+0x6c(0x30059428dc0, , , , 0x30059428dc0)
specfs:spec_pageio+0x40(0x60012f01000, 0x700023f2180, 0x4018e000, 0x2000, 0x40)
genunix:fop_pageio+0x20(0x60012f01000, 0x700023f2180, 0x4018e000, 0x2000, 0x40, 0x30066d6dc80)
genunix:swap_getapage+0x1d0(0x30065953040, 0xc019b98000, 0x2000, 0x2a101e9f0c8, 0x2a101e9eee8, 0x2000, 0x30066d67008, 0x514000, 2, 0x30066d6dc80)
genunix:swap_getpage+0x4c(0x30065953040, 0xc019b98000, 0x2000, 0x2a101e9f0c8, 0x2a101e9eee8, , , 0x514000, 2, 0x30066d6dc80)
genunix:fop_getpage+0x44(0x30065953040, 0xc019b98000, 0x2000, 0x2a101e9f0c8, 0x2a101e9eee8, 0x2000, 0x30066d67008, , , 0x30066d6dc80)
genunix:anon_getpage+0x158(0x2a101e9eef8, 0x2a101e9f0c8, 0x2a101e9eee8, 0x2000, 0x30066d67008, 0x514000, 2, 0x30066d6dc80)
genunix:anon_map_getpages+0xdc(0x30066d4a2b8, 0x25a, 0, 0x30066d67008, 0x514000, 0xf, , 0x60012a80d88, , 0, , , , 0)
genunix:segvn_fault_anonpages+0x32c(0x300141b4080, 0x30066d67008, 0x510000, 0x516000, 0?, 2, 0x514000, , 1)
genunix:segvn_fault+0x530(0x300141b4080, 0x30066d67008, 0x514000, 0x2000, 0, 2)
genunix:as_fault+0x4c8(0x300141b4080, 0x30065fa66a8, 0x514000, 1, 0, 2)
unix:pagefault+0x68(0x514000, 0, 2, 0)
unix:trap+0x950(, 0x514000)
unix:sfmmu_tsbmiss_exception(0x2a101e9f690) - frame recycled
unix:ktl0+0x64()
-- trap data type: 0x31 (data access MMU miss) rp: 0x2a101e9f690 LEAF --
pc: 0x12bf48c SUNW,UltraSPARC-T2+:copyout+0xb8: stxa %o4, [%o1 - 0x8] %asi
npc: 0x12bf478 SUNW,UltraSPARC-T2+:copyout+0xa4: ldx [%o0], %o4
global: %g1 0x30066d28000
%g2 0x3005d120000 %g3 0x514000
%g4 0x18 %g5 0x12c06e8
%g6 0 %g7 0x3006679e060
out: %o0 0x3005d120008 %o1 0x514008
%o2 9 %o3 0
%o4 0xadb8a5a003040000 %o5 0
%sp 0x2a101e9ef31 %o7 0x1174604
loc: %l0 0x2a101e9fa10 %l1 0x2a101e9fad0
%l2 2 %l3 0x5b6480df
%l4 0x300805ef7e8 %l5 0x2a101e9fac0
%l6 3 %l7 3
in: %i0 0x3005d120000 %i1 0x18
%i2 0 %i3 0x2a101e9fa00
%i4 0x60012ccb208 %i5 0x18
%fp 0x2a101e9efe1 %i7 0x7b395aa8
<leaf trap>SUNW,UltraSPARC-T2+:copyout+0xb8()
SUNW,UltraSPARC-T2+:xcopyout(0x3005d120000, 0x514000, 0x18) - frame recycled
genunix:uiomove+0x90(0x3005d120000, 0x18, 0, 0x2a101e9fa00)
vldc:vldc_read+0x12c(, 0x2a101e9fa00?)
specfs:spec_read(0x30060c82e80, 0x2a101e9fa00, 0, 0x60010823a48, 0) - frame recycled
genunix:fop_read+0x20(0x30060c82e80, 0x2a101e9fa00, 0, , 0)
genunix:read+0x274(0xd)
unix:syscall_trap32+0xcc()
-- switch to user thread's user stack --

The I/O request will have its DONE (B_DONE) flag set as shown here:

CAT(vmcore.0/10V)> buf 0x30059428dc0
buf @ 0x30059428dc0
b_edev: 85(md)set0,101 /dev/md/dsk/d101(swap) b_blkno: 0x200c70
b_flags: 0x80053 (BUSY|DONE|PAGEIO|READ|NOCACHE)
b_addr: NULL
b_bcount: 8192 b_bufsize: 8192
b_vp: 0x60012f01000
b_dip: 0x6001090d2d0 md#0 /SUNW,Sun-Blade-T6340/pseudo/md@0
b_pages: 0x700023f2180 (struct page *)
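
The b_flags value above can be decoded by hand to confirm that the DONE bit is set. The following is a minimal Python sketch, not part of the CAT tool. The bit values are assumptions taken from the Solaris sys/buf.h header and cover only the flags that appear in this dump; the helper name decode_b_flags is hypothetical.

```python
# Assumed bit values from Solaris <sys/buf.h>; only the flags seen in
# the dump above are listed here.
B_FLAGS = {
    0x00001: "BUSY",
    0x00002: "DONE",
    0x00010: "PAGEIO",
    0x00040: "READ",
    0x80000: "NOCACHE",
}


def decode_b_flags(value):
    """Return the names of the known flag bits set in value.

    Any bits that are not in B_FLAGS are reported as one trailing
    hexadecimal string so that nothing is silently dropped.
    """
    names = [name for bit, name in sorted(B_FLAGS.items()) if value & bit]
    known = sum(bit for bit in B_FLAGS if value & bit)
    leftover = value & ~known
    if leftover:
        names.append(hex(leftover))
    return names


flags = decode_b_flags(0x80053)
print("|".join(flags))              # BUSY|DONE|PAGEIO|READ|NOCACHE
print("DONE set:", "DONE" in flags)
```

The decoded names match the string that CAT prints next to b_flags, which is a quick consistency check on the assumed bit values.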

However, the thread is still waiting for an interrupt to arrive from the HBA. That interrupt is queued on the
same CPU on which the PIL 6 interrupt thread last ran. Because the interrupt thread is blocked, it holds that
CPU at base_spl 6, so the CPU cannot process any interrupt at or below PIL 6, and that includes the pending
HBA interrupt.

CAT(vmcore.0/10V)> cpu -l 1 | grep base_spl
in_prom: 1 cpu_base_spl: 6 cpu_intr_actv: 0x40 cpu_thread_lock: 0

CAT(vmcore.0/10V)> intr -q

(...)


CPU 1
ivnum addr PM PIL handler
===== ============= == === =======
0 0x7008d180 PS 1 unix:cbe_level1(NULL)
0 0x7008d500 PS 1 unix:softlevel1(NULL)
53 0x7008c680 S 3 cnex:cnex_intr_wrapper(ldc:i_ldc_rx_hdlr(0x60011b14480, NULL))
69 0x7008c780 S 3 cnex:cnex_intr_wrapper(ldc:i_ldc_rx_hdlr(0x60011b14180, NULL))
33 0x7008cd00 S 4 cnex:cnex_intr_wrapper(ldc:i_ldc_rx_hdlr(0x60011c3ec40, NULL))
1304 0x7008c4c0 S 5 px:px_msiq_intr(mpt:mpt_intr(*mpt(data):mpt_head, NULL))   <<<--- pending HBA interrupt
1056 0x7008e1c0 S 6 px:px_msiq_intr(nxge:nxge_intr(0x60012cfadd0, 0x60012d2c000))
25 0x7008e880 S 6 cnex:cnex_intr_wrapper(ldc:i_ldc_rx_hdlr(0x60012b41540, NULL))
0 0x7008d140 PS 10 unix:cbe_level10(NULL)
0 0x700c4070 M 14 unix:cbe_level14(NULL)
0 0x700ca118 M 15 genunix:kcpc_hw_overflow_intr(NULL)

(...)
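
The masking rule can be checked against the CPU 1 queue shown above. This is a minimal Python sketch under the rule stated in this document (a CPU at base_spl N does not process interrupts at or below PIL N). The entries are copied from the intr -q listing; the names PENDING and is_blocked are hypothetical.

```python
BASE_SPL = 6  # cpu_base_spl reported for CPU 1

# (ivnum, PIL, handler) entries copied from the CPU 1 queue above.
PENDING = [
    (0, 1, "unix:cbe_level1"),
    (0, 1, "unix:softlevel1"),
    (53, 3, "ldc:i_ldc_rx_hdlr"),
    (69, 3, "ldc:i_ldc_rx_hdlr"),
    (33, 4, "ldc:i_ldc_rx_hdlr"),
    (1304, 5, "mpt:mpt_intr"),
    (1056, 6, "nxge:nxge_intr"),
    (25, 6, "ldc:i_ldc_rx_hdlr"),
    (0, 10, "unix:cbe_level10"),
    (0, 14, "unix:cbe_level14"),
    (0, 15, "genunix:kcpc_hw_overflow_intr"),
]


def is_blocked(pil, base_spl):
    """An interrupt is held off while its PIL is at or below base_spl."""
    return pil <= base_spl


blocked = [entry for entry in PENDING if is_blocked(entry[1], BASE_SPL)]
deliverable = [entry for entry in PENDING if not is_blocked(entry[1], BASE_SPL)]

for ivnum, pil, handler in blocked:
    print(f"blocked      ivnum={ivnum:<5} PIL={pil:<2} {handler}")
for ivnum, pil, handler in deliverable:
    print(f"deliverable  ivnum={ivnum:<5} PIL={pil:<2} {handler}")
```

The mpt entry at PIL 5 falls in the blocked group, so the I/O completion that the fmd thread is waiting for is not delivered while base_spl stays at 6.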

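This is a deadlock condition: the PIL 6 interrupt thread is waiting for the mutex held by the fmd thread, the fmd
thread is waiting for its I/O request, and the HBA interrupt for that request cannot be processed while the
blocked interrupt thread holds the CPU at base_spl 6.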
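
The circular wait can be written down as a wait-for chain and checked mechanically. This is an illustrative Python sketch only: the node labels are built from the addresses in the dump above, and the names WAITS_FOR and find_cycle are hypothetical, not part of any Solaris tool.

```python
# Each key waits for its value. Labels reuse addresses from the dump.
WAITS_FOR = {
    "PIL 6 interrupt thread 0x2a100f1dc80": "mutex 0x6001e038030",
    "mutex 0x6001e038030": "fmd thread 0x3006679e060",
    "fmd thread 0x3006679e060": "semaphore 0x30059428e80 (biowait)",
    "semaphore 0x30059428e80 (biowait)": "HBA interrupt (mpt, PIL 5)",
    "HBA interrupt (mpt, PIL 5)": "CPU 1 base_spl below 5",
    "CPU 1 base_spl below 5": "PIL 6 interrupt thread 0x2a100f1dc80",
}


def find_cycle(graph, start):
    """Follow single wait-for edges from start.

    Returns the list of nodes forming a cycle, or None if the chain
    ends at a node that waits for nothing.
    """
    position = {}  # node -> index in path at first visit
    path = []
    node = start
    while node is not None:
        if node in position:
            return path[position[node]:]
        position[node] = len(path)
        path.append(node)
        node = graph.get(node)
    return None


cycle = find_cycle(WAITS_FOR, "PIL 6 interrupt thread 0x2a100f1dc80")
if cycle:
    print("deadlock cycle:")
    for node in cycle:
        print("  ", node, "->", WAITS_FOR[node])
```

Every node in the chain waits for the next one, and the last waits for the first, so none of them can make progress.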

Cause
