Kernel Panic - not Syncing: Hard LOCKUP due to IB Switch Errors (Doc ID 2656132.1)

Last updated on APRIL 07, 2020

Applies to:

Linux OS - Version Oracle Linux 6.9 and later
Linux x86-64

Symptoms

The system panics with "Kernel Panic - not Syncing: Hard LOCKUP" after errors on the InfiniBand (IB) switch.

The following ib0 transmit timeout errors appear in dmesg. If NFS is running over the IPoIB layer, "nfs: server ... not responding" messages appear as well.

Mar 11 04:06:23 Hostname kernel: ib0: timing out; 176 sends not completed still waiting..
Mar 11 04:06:26 Hostname kernel: nfs: server 192.168.X.X not responding, still trying
Mar 11 04:06:26 Hostname kernel: nfs: server 192.168.X.X not responding, still trying
Mar 11 04:06:26 Hostname kernel: nfs: server 192.168.X.X not responding, still trying
Mar 11 04:06:26 Hostname kernel: nfs: server 192.168.X.X not responding, still trying
Mar 11 04:06:26 Hostname kernel: nfs: server 192.168.X.X not responding, still trying
Mar 11 04:06:26 Hostname kernel: nfs: server 192.168.X.X not responding, still trying
Mar 11 04:06:26 Hostname kernel: nfs: server 192.168.X.X not responding, still trying
Mar 11 04:06:28 Hostname kernel: ib0: timing out; 176 sends not completed still waiting..
Mar 11 04:06:33 Hostname kernel: ib0: timing out; 176 sends not completed still waiting..
Mar 11 04:06:34 Hostname kernel: nfs: server 192.168.X.X not responding, still trying
Mar 11 04:54:00 Hostname kernel: ib0: timing out; 384 sends not completed still waiting..
Mar 11 04:54:05 Hostname kernel: ib0: timing out; 384 sends not completed still waiting..
Mar 11 04:54:10 Hostname kernel: ib0: timing out; 384 sends not completed still waiting..
Mar 11 04:54:15 Hostname kernel: ib0: timing out; 384 sends not completed still waiting..
Mar 11 04:54:15 Hostname kernel: ib0: ipoib_cm_tx_destroy: 384 not completed force cleanup.
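
To check whether a host shows the same pattern, the messages above can be counted in a saved log. The following is a minimal sketch, not an Oracle-provided tool: the regular expressions are derived only from the lines quoted in this note, the names PATTERNS and count_symptoms are hypothetical, and other kernel versions may word the messages differently.

```python
import re
from collections import Counter

# Patterns derived from the log excerpts in this note (assumption: the
# wording is the same on the kernel being checked).
PATTERNS = {
    "ipoib_tx_timeout": re.compile(
        r"kernel: ib\d+: timing out; \d+ sends not completed"),
    "ipoib_force_cleanup": re.compile(
        r"kernel: ib\d+: ipoib_cm_tx_destroy: \d+ not completed force cleanup"),
    "nfs_not_responding": re.compile(
        r"kernel: nfs: server \S+ not responding"),
}


def count_symptoms(lines):
    """Count how many lines match each symptom pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Four lines copied from the excerpt above, used as sample input.
SAMPLE = """\
Mar 11 04:06:23 Hostname kernel: ib0: timing out; 176 sends not completed still waiting..
Mar 11 04:06:26 Hostname kernel: nfs: server 192.168.X.X not responding, still trying
Mar 11 04:54:15 Hostname kernel: ib0: timing out; 384 sends not completed still waiting..
Mar 11 04:54:15 Hostname kernel: ib0: ipoib_cm_tx_destroy: 384 not completed force cleanup.
"""

if __name__ == "__main__":
    for name, count in sorted(count_symptoms(SAMPLE.splitlines()).items()):
        print(f"{name}: {count}")
```

To scan a real log, pass an open file object to count_symptoms instead of the sample lines. The path of the log file depends on the system's syslog configuration.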

Occasionally, a NETDEV WATCHDOG event is also raised for the IB interface.

Mar 11 05:35:25 Hostname kernel: NETDEV WATCHDOG: ib0 (mlx4_core): transmit queue 15 timed out
Mar 11 05:35:25 Hostname kernel: Call Trace:
Mar 11 05:35:25 Hostname kernel: <IRQ> [<ffffffff8107e771>] ? warn_slowpath_common+0x91/0xe0
Mar 11 05:35:25 Hostname kernel: [<ffffffff8107e876>] ? warn_slowpath_fmt+0x46/0x60
Mar 11 05:35:25 Hostname kernel: [<ffffffff814a5351>] ? dev_watchdog+0x271/0x280
Mar 11 05:35:25 Hostname kernel: [<ffffffff810a0065>] ? sys_getpriority+0x85/0x210
Mar 11 05:35:25 Hostname kernel: [<ffffffff8106a884>] ? scheduler_tick+0x124/0x270
Mar 11 05:35:25 Hostname kernel: [<ffffffff814a50e0>] ? dev_watchdog+0x0/0x280
Mar 11 05:35:25 Hostname kernel: [<ffffffff81091a39>] ? run_timer_softirq+0x199/0x350
Mar 11 05:35:25 Hostname kernel: [<ffffffff81091286>] ? update_process_times+0x76/0x90
Mar 11 05:35:25 Hostname kernel: [<ffffffff8103d9f2>] ? native_apic_msr_write+0x32/0x40
Mar 11 05:35:25 Hostname kernel: [<ffffffff810874ea>] ? __do_softirq+0xea/0x240
Mar 11 05:35:25 Hostname kernel: [<ffffffff8155f8cc>] ? call_softirq+0x1c/0x30
Mar 11 05:35:25 Hostname kernel: [<ffffffff8100e555>] ? do_softirq+0x65/0xa0
Mar 11 05:35:25 Hostname kernel: [<ffffffff8108717d>] ? irq_exit+0x8d/0xa0
Mar 11 05:35:25 Hostname kernel: [<ffffffff815606ee>] ? smp_apic_timer_interrupt+0x4e/0x60
Mar 11 05:35:25 Hostname kernel: [<ffffffff8155f143>] ? apic_timer_interrupt+0x13/0x20
Mar 11 05:35:25 Hostname kernel: <EOI> [<ffffffff813028fd>] ? intel_idle+0x12d/0x250
Mar 11 05:35:25 Hostname kernel: [<ffffffff813028e0>] ? intel_idle+0x110/0x250
Mar 11 05:35:25 Hostname kernel: [<ffffffff81013509>] ? sched_clock+0x9/0x10
Mar 11 05:35:25 Hostname kernel: [<ffffffff810b000d>] ? sched_clock_cpu+0xcd/0x110
Mar 11 05:35:25 Hostname kernel: [<ffffffff8144a9ae>] ? cpuidle_idle_call+0x8e/0xf0
Mar 11 05:35:25 Hostname kernel: [<ffffffff81009fc9>] ? cpu_idle+0xb9/0x110
Mar 11 05:35:25 Hostname kernel: [<ffffffff8154c27b>] ? start_secondary+0x30f/0x365
Mar 11 05:35:25 Hostname kernel: ib0: transmit timeout: latency -355921930 msecs
Mar 11 05:35:25 Hostname kernel: ib0: queue (0) stopped=0, tx_head 0, tx_tail 0 tx_outstanding 0 ipoib_sendq_size: 512 QP number: 0x221
Mar 11 05:35:25 Hostname kernel: ib0: queue (1) stopped=0, tx_head 0, tx_tail 0 tx_outstanding 0 ipoib_sendq_size: 512 QP number: 0x222
Mar 11 05:35:25 Hostname kernel: ib0: queue (2) stopped=0, tx_head 0, tx_tail 0 tx_outstanding 0 ipoib_sendq_size: 512 QP number: 0x223
Mar 11 05:35:25 Hostname kernel: ib0: queue (3) stopped=0, tx_head 0, tx_tail 0 tx_outstanding 0 ipoib_sendq_size: 512 QP number: 0x224
Mar 11 05:35:25 Hostname kernel: ib0: queue (4) stopped=0, tx_head 21952, tx_tail 21952 tx_outstanding 0 ipoib_sendq_size: 512 QP number: 0x225
Mar 11 05:35:25 Hostname kernel: ib0: queue (5) stopped=0, tx_head 0, tx_tail 0 tx_outstanding 0 ipoib_sendq_size: 512 QP number: 0x226
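
The per-queue lines printed after the watchdog trace are easier to compare once they are parsed into fields. The sketch below is illustrative only: the regular expression follows the line format shown in this note, and the names QUEUE_RE, parse_queue_states and queues_with_traffic are hypothetical. It makes no judgement about which values indicate a fault.

```python
import re

# Field layout taken from the "ib0: queue (N) ..." lines in this note.
QUEUE_RE = re.compile(
    r"(?P<dev>ib\d+): queue \((?P<queue>\d+)\) stopped=(?P<stopped>\d+), "
    r"tx_head (?P<tx_head>\d+), tx_tail (?P<tx_tail>\d+) "
    r"tx_outstanding (?P<tx_outstanding>\d+) "
    r"ipoib_sendq_size: (?P<sendq_size>\d+) "
    r"QP number: (?P<qp>0x[0-9a-fA-F]+)"
)


def parse_queue_states(lines):
    """Return one dict per queue line; lines that do not match are skipped."""
    states = []
    for line in lines:
        match = QUEUE_RE.search(line)
        if not match:
            continue
        fields = match.groupdict()
        state = {"dev": fields.pop("dev"), "qp": int(fields.pop("qp"), 16)}
        # Remaining fields are all decimal counters.
        state.update({key: int(value) for key, value in fields.items()})
        states.append(state)
    return states


def queues_with_traffic(states):
    """Queues whose tx_head is non-zero, i.e. that have sent something."""
    return [s["queue"] for s in states if s["tx_head"] != 0]


# Two lines copied from the excerpt above, used as sample input.
SAMPLE = """\
Mar 11 05:35:25 Hostname kernel: ib0: queue (3) stopped=0, tx_head 0, tx_tail 0 tx_outstanding 0 ipoib_sendq_size: 512 QP number: 0x224
Mar 11 05:35:25 Hostname kernel: ib0: queue (4) stopped=0, tx_head 21952, tx_tail 21952 tx_outstanding 0 ipoib_sendq_size: 512 QP number: 0x225
"""

if __name__ == "__main__":
    states = parse_queue_states(SAMPLE.splitlines())
    print(queues_with_traffic(states))
```

In the excerpt above, queue 4 is the only queue with a non-zero tx_head, and its tx_head and tx_tail are equal.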

Analysis of the vmcore shows the following threads. The login_wq thread is blocked in rtnl_lock, waiting for a mutex:

PID: 17130 TASK: ffff897f1a618040 CPU: 14 COMMAND: "login_wq"
#0 [ffff897f1da07be0] schedule at ffffffff81552dba
#1 [ffff897f1da07cc8] __mutex_lock_slowpath at ffffffff81554b86
#2 [ffff897f1da07d38] mutex_lock at ffffffff8155440b
#3 [ffff897f1da07d58] rtnl_lock at ffffffff814942c5
#4 [ffff897f1da07d68] unregister_netdev at ffffffff814849a6
#5 [ffff897f1da07d88] vnic_login_destroy_wq_stopped at ffffffffa097bb90 [mlx4_vnic]
#6 [ffff897f1da07e08] fip_vnic_login_destroy at ffffffffa0994555 [mlx4_vnic]
#7 [ffff897f1da07e38] worker_thread at ffffffff810a1d32
#8 [ffff897f1da07ee8] kthread at ffffffff810a8620
#9 [ffff897f1da07f48] kernel_thread at ffffffff8155f7ba

The mutex is held by the linkwatch thread. It detected that the link carrier was down and is trying to deactivate the device, but it is waiting for outstanding IB transmits to complete and stays stuck there for a long time. In this case the thread was stuck for close to 1 minute, after which the NMI watchdog detected the hang and crashed the system.

PID: 1265 TASK: ffff88bf1c304040 CPU: 1 COMMAND: "linkwatch"
#0 [ffff88bf1c313ad0] schedule at ffffffff81552dba
#1 [ffff88bf1c313bb8] schedule_timeout at ffffffff81553e55
#2 [ffff88bf1c313c68] wait_for_common at ffffffff81553ab3
#3 [ffff88bf1c313cf8] wait_for_completion at ffffffff81553bed
#4 [ffff88bf1c313d08] synchronize_sched at ffffffff810a42d8
#5 [ffff88bf1c313d58] dev_deactivate_many at ffffffff814a5dd6
#6 [ffff88bf1c313db8] dev_deactivate at ffffffff814a5ed8
#7 [ffff88bf1c313de8] __linkwatch_run_queue at ffffffff814979f8
#8 [ffff88bf1c313e28] linkwatch_event at ffffffff81497a55
#9 [ffff88bf1c313e38] worker_thread at ffffffff810a1d32
#10 [ffff88bf1c313ee8] kthread at ffffffff810a8620
#11 [ffff88bf1c313f48] kernel_thread at ffffffff8155f7ba
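
When a vmcore contains many tasks, the two roles described above can be picked out of saved backtrace text. This is a rough sketch under stated assumptions: the input is text in the same layout as the backtraces quoted in this note, the function names parse_backtraces and classify are hypothetical, and the classification simply looks for the frame names seen here. It does not prove which task owns the lock.

```python
import re

HEADER_RE = re.compile(
    r'PID:\s*(\d+)\s+TASK:\s*(\S+)\s+CPU:\s*(\d+)\s+COMMAND:\s*"([^"]*)"')
FRAME_RE = re.compile(r"#\d+\s+\[[0-9a-f]+\]\s+(\S+)\s+at\s+[0-9a-f]+")


def parse_backtraces(text):
    """Split backtrace text into tasks, each with its list of frame names."""
    tasks = []
    for line in text.splitlines():
        header = HEADER_RE.search(line)
        if header:
            tasks.append({"pid": int(header.group(1)),
                          "command": header.group(4),
                          "frames": []})
            continue
        frame = FRAME_RE.search(line)
        if frame and tasks:
            tasks[-1]["frames"].append(frame.group(1))
    return tasks


def classify(task):
    """Label a task by the frames seen in this note (heuristic only)."""
    frames = task["frames"]
    if "rtnl_lock" in frames:
        return "blocked in rtnl_lock"
    if "dev_deactivate" in frames and "synchronize_sched" in frames:
        return "deactivating device, waiting in synchronize_sched"
    return "other"


# Abbreviated copies of the two backtraces above, used as sample input.
SAMPLE = """\
PID: 17130 TASK: ffff897f1a618040 CPU: 14 COMMAND: "login_wq"
#0 [ffff897f1da07be0] schedule at ffffffff81552dba
#3 [ffff897f1da07d58] rtnl_lock at ffffffff814942c5
#4 [ffff897f1da07d68] unregister_netdev at ffffffff814849a6
PID: 1265 TASK: ffff88bf1c304040 CPU: 1 COMMAND: "linkwatch"
#4 [ffff88bf1c313d08] synchronize_sched at ffffffff810a42d8
#6 [ffff88bf1c313db8] dev_deactivate at ffffffff814a5ed8
#8 [ffff88bf1c313e28] linkwatch_event at ffffffff81497a55
"""

if __name__ == "__main__":
    for task in parse_backtraces(SAMPLE):
        print(task["pid"], task["command"], "->", classify(task))
```

Any number of tasks can be blocked in rtnl_lock at the same time, so the first label only identifies waiters. The second label matches the linkwatch pattern described in this note.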

Cause
