Kernel Panic - not Syncing: Hard LOCKUP due to IB Switch Errors
(Doc ID 2656132.1)
Last updated on APRIL 07, 2020
Applies to:
Linux OS - Version Oracle Linux 6.9 and later
Linux x86-64
Symptoms
Kernel Panic - not Syncing: Hard LOCKUP due to IB Switch Errors
The following ib0 transmit timeout errors are shown in dmesg. If NFS traffic is running over the IPoIB layer, "nfs: server not responding" messages are shown as well.
Mar 11 04:06:23 Hostname kernel: ib0: timing out; 176 sends not completed still waiting..
Mar 11 04:06:26 Hostname kernel: nfs: server 192.168.X.X not responding, still trying
Mar 11 04:06:26 Hostname kernel: nfs: server 192.168.X.X not responding, still trying
Mar 11 04:06:26 Hostname kernel: nfs: server 192.168.X.X not responding, still trying
Mar 11 04:06:26 Hostname kernel: nfs: server 192.168.X.X not responding, still trying
Mar 11 04:06:26 Hostname kernel: nfs: server 192.168.X.X not responding, still trying
Mar 11 04:06:26 Hostname kernel: nfs: server 192.168.X.X not responding, still trying
Mar 11 04:06:26 Hostname kernel: nfs: server 192.168.X.X not responding, still trying
Mar 11 04:06:28 Hostname kernel: ib0: timing out; 176 sends not completed still waiting..
Mar 11 04:06:33 Hostname kernel: ib0: timing out; 176 sends not completed still waiting..
Mar 11 04:06:34 Hostname kernel: nfs: server 192.168.X.X not responding, still trying

Mar 11 04:54:00 Hostname kernel: ib0: timing out; 384 sends not completed still waiting..
Mar 11 04:54:05 Hostname kernel: ib0: timing out; 384 sends not completed still waiting..
Mar 11 04:54:10 Hostname kernel: ib0: timing out; 384 sends not completed still waiting..
Mar 11 04:54:15 Hostname kernel: ib0: timing out; 384 sends not completed still waiting..
Mar 11 04:54:15 Hostname kernel: ib0: ipoib_cm_tx_destroy: 384 not completed force cleanup.
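A quick way to confirm these symptoms is to scan the kernel log for the two message patterns. The sketch below is illustrative: the sample file is built from the messages shown above, and on a live system you would pipe dmesg (or read /var/log/messages) instead.

```shell
# Illustrative sample built from the messages shown above; on a live
# system, replace the sample file with `dmesg` output.
cat > /tmp/sample_dmesg.log <<'EOF'
Mar 11 04:06:23 Hostname kernel: ib0: timing out; 176 sends not completed still waiting..
Mar 11 04:06:26 Hostname kernel: nfs: server 192.168.X.X not responding, still trying
Mar 11 04:54:15 Hostname kernel: ib0: ipoib_cm_tx_destroy: 384 not completed force cleanup.
EOF

# Count IPoIB transmit-timeout and NFS "not responding" lines.
grep -cE 'ib0: timing out|nfs: server .* not responding' /tmp/sample_dmesg.log
```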
Occasionally, a NETDEV WATCHDOG event is fired by the IB driver.
Mar 11 05:35:25 Hostname kernel: NETDEV WATCHDOG: ib0 (mlx4_core): transmit queue 15 timed out
Mar 11 05:35:25 Hostname kernel: Call Trace:
Mar 11 05:35:25 Hostname kernel: <IRQ> [<ffffffff8107e771>] ? warn_slowpath_common+0x91/0xe0
Mar 11 05:35:25 Hostname kernel: [<ffffffff8107e876>] ? warn_slowpath_fmt+0x46/0x60
Mar 11 05:35:25 Hostname kernel: [<ffffffff814a5351>] ? dev_watchdog+0x271/0x280
Mar 11 05:35:25 Hostname kernel: [<ffffffff810a0065>] ? sys_getpriority+0x85/0x210
Mar 11 05:35:25 Hostname kernel: [<ffffffff8106a884>] ? scheduler_tick+0x124/0x270
Mar 11 05:35:25 Hostname kernel: [<ffffffff814a50e0>] ? dev_watchdog+0x0/0x280
Mar 11 05:35:25 Hostname kernel: [<ffffffff81091a39>] ? run_timer_softirq+0x199/0x350
Mar 11 05:35:25 Hostname kernel: [<ffffffff81091286>] ? update_process_times+0x76/0x90
Mar 11 05:35:25 Hostname kernel: [<ffffffff8103d9f2>] ? native_apic_msr_write+0x32/0x40
Mar 11 05:35:25 Hostname kernel: [<ffffffff810874ea>] ? __do_softirq+0xea/0x240
Mar 11 05:35:25 Hostname kernel: [<ffffffff8155f8cc>] ? call_softirq+0x1c/0x30
Mar 11 05:35:25 Hostname kernel: [<ffffffff8100e555>] ? do_softirq+0x65/0xa0
Mar 11 05:35:25 Hostname kernel: [<ffffffff8108717d>] ? irq_exit+0x8d/0xa0
Mar 11 05:35:25 Hostname kernel: [<ffffffff815606ee>] ? smp_apic_timer_interrupt+0x4e/0x60
Mar 11 05:35:25 Hostname kernel: [<ffffffff8155f143>] ? apic_timer_interrupt+0x13/0x20
Mar 11 05:35:25 Hostname kernel: <EOI> [<ffffffff813028fd>] ? intel_idle+0x12d/0x250
Mar 11 05:35:25 Hostname kernel: [<ffffffff813028e0>] ? intel_idle+0x110/0x250
Mar 11 05:35:25 Hostname kernel: [<ffffffff81013509>] ? sched_clock+0x9/0x10
Mar 11 05:35:25 Hostname kernel: [<ffffffff810b000d>] ? sched_clock_cpu+0xcd/0x110
Mar 11 05:35:25 Hostname kernel: [<ffffffff8144a9ae>] ? cpuidle_idle_call+0x8e/0xf0
Mar 11 05:35:25 Hostname kernel: [<ffffffff81009fc9>] ? cpu_idle+0xb9/0x110
Mar 11 05:35:25 Hostname kernel: [<ffffffff8154c27b>] ? start_secondary+0x30f/0x365

Mar 11 05:35:25 Hostname kernel: ib0: transmit timeout: latency -355921930 msecs
Mar 11 05:35:25 Hostname kernel: ib0: queue (0) stopped=0, tx_head 0, tx_tail 0 tx_outstanding 0 ipoib_sendq_size: 512 QP number: 0x221
Mar 11 05:35:25 Hostname kernel: ib0: queue (1) stopped=0, tx_head 0, tx_tail 0 tx_outstanding 0 ipoib_sendq_size: 512 QP number: 0x222
Mar 11 05:35:25 Hostname kernel: ib0: queue (2) stopped=0, tx_head 0, tx_tail 0 tx_outstanding 0 ipoib_sendq_size: 512 QP number: 0x223
Mar 11 05:35:25 Hostname kernel: ib0: queue (3) stopped=0, tx_head 0, tx_tail 0 tx_outstanding 0 ipoib_sendq_size: 512 QP number: 0x224
Mar 11 05:35:25 Hostname kernel: ib0: queue (4) stopped=0, tx_head 21952, tx_tail 21952 tx_outstanding 0 ipoib_sendq_size: 512 QP number: 0x225
Mar 11 05:35:25 Hostname kernel: ib0: queue (5) stopped=0, tx_head 0, tx_tail 0 tx_outstanding 0 ipoib_sendq_size: 512 QP number: 0x226
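The per-queue dump can be checked mechanically for queues that still have sends in flight. The sketch below is illustrative: a hypothetical stuck queue 4 (tx_outstanding 192) is included in the sample to show a hit, whereas in the real dump above all queues had already drained (tx_outstanding 0).

```shell
# Hypothetical sample: queue 4 is shown with outstanding sends to
# illustrate a hit; this is NOT taken verbatim from the dump above.
cat > /tmp/queue_dump.log <<'EOF'
ib0: queue (3) stopped=0, tx_head 0, tx_tail 0 tx_outstanding 0 ipoib_sendq_size: 512 QP number: 0x224
ib0: queue (4) stopped=1, tx_head 21952, tx_tail 21760 tx_outstanding 192 ipoib_sendq_size: 512 QP number: 0x225
EOF

# Print only the queue lines whose tx_outstanding value is nonzero.
awk '/tx_outstanding/ {
    for (i = 1; i <= NF; i++)
        if ($i == "tx_outstanding" && $(i+1) + 0 > 0) print
}' /tmp/queue_dump.log
```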
From the vmcore analysis, the following threads are seen:
PID: 17130  TASK: ffff897f1a618040  CPU: 14  COMMAND: "login_wq"
 #0 [ffff897f1da07be0] schedule at ffffffff81552dba
 #1 [ffff897f1da07cc8] __mutex_lock_slowpath at ffffffff81554b86
 #2 [ffff897f1da07d38] mutex_lock at ffffffff8155440b
 #3 [ffff897f1da07d58] rtnl_lock at ffffffff814942c5
 #4 [ffff897f1da07d68] unregister_netdev at ffffffff814849a6
 #5 [ffff897f1da07d88] vnic_login_destroy_wq_stopped at ffffffffa097bb90 [mlx4_vnic]
 #6 [ffff897f1da07e08] fip_vnic_login_destroy at ffffffffa0994555 [mlx4_vnic]
 #7 [ffff897f1da07e38] worker_thread at ffffffff810a1d32
 #8 [ffff897f1da07ee8] kthread at ffffffff810a8620
 #9 [ffff897f1da07f48] kernel_thread at ffffffff8155f7ba
The owner of the mutex (the rtnl lock) is the linkwatch event thread, which detected a bad link carrier and attempted to deactivate the device. While doing so, it waited on outstanding IB device transmit completions and remained stuck for an extended time, in this case close to one minute, until the NMI hard-lockup detector caught the hang and panicked the system.
PID: 1265  TASK: ffff88bf1c304040  CPU: 1  COMMAND: "linkwatch"
 #0 [ffff88bf1c313ad0] schedule at ffffffff81552dba
 #1 [ffff88bf1c313bb8] schedule_timeout at ffffffff81553e55
 #2 [ffff88bf1c313c68] wait_for_common at ffffffff81553ab3
 #3 [ffff88bf1c313cf8] wait_for_completion at ffffffff81553bed
 #4 [ffff88bf1c313d08] synchronize_sched at ffffffff810a42d8
 #5 [ffff88bf1c313d58] dev_deactivate_many at ffffffff814a5dd6
 #6 [ffff88bf1c313db8] dev_deactivate at ffffffff814a5ed8
 #7 [ffff88bf1c313de8] __linkwatch_run_queue at ffffffff814979f8
 #8 [ffff88bf1c313e28] linkwatch_event at ffffffff81497a55
 #9 [ffff88bf1c313e38] worker_thread at ffffffff810a1d32
#10 [ffff88bf1c313ee8] kthread at ffffffff810a8620
#11 [ffff88bf1c313f48] kernel_thread at ffffffff8155f7ba
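Once backtraces have been dumped from the vmcore (for example, with the crash utility's "foreach bt" command redirected to a text file), the two threads of interest can be located by their signature frames: rtnl_lock for the waiter and dev_deactivate_many for the lock owner. The file name and sample below are illustrative, built from the traces shown above.

```shell
# Illustrative sample of crash "foreach bt" output, trimmed to the
# signature frames from the traces above.
cat > /tmp/foreach_bt.txt <<'EOF'
PID: 17130  TASK: ffff897f1a618040  CPU: 14  COMMAND: "login_wq"
 #3 [ffff897f1da07d58] rtnl_lock at ffffffff814942c5
PID: 1265  TASK: ffff88bf1c304040  CPU: 1  COMMAND: "linkwatch"
 #5 [ffff88bf1c313d58] dev_deactivate_many at ffffffff814a5dd6
EOF

# Show the thread headers plus the frames that identify the rtnl-lock
# waiter and the deactivating linkwatch thread.
grep -E 'COMMAND:|rtnl_lock|dev_deactivate_many' /tmp/foreach_bt.txt
```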
Cause