OCFS2 Does Not Reboot the Node at Network Failure (Doc ID 1172943.1)

Last updated on MARCH 30, 2015

Applies to:

Linux OS - Version 1.4 and later
Linux x86
Linux x86-64

Symptoms

On an OCFS2 1.4 cluster, when the OCFS2 interconnect is disabled on one node, that node does not reboot as expected. Instead, it keeps reporting the following errors on the console:

host kernel: (26,0):dlm_wait_for_node_death:370 1DC3CDAE88EF4D8CBBDE7C5CAC1E42FE: waiting 5000ms for notification of death of node 3
host kernel: (9466,5):dlm_send_remote_convert_request:395 ERROR: status = -107
host kernel: (9466,5):dlm_wait_for_node_death:370 1DC3CDAE88EF4D8CBBDE7C5CAC1E42FE: waiting 5000ms for notification of death of node 3
host kernel: (26,0):dlm_send_remote_convert_request:395 ERROR: status = -107
host kernel: (26,0):dlm_wait_for_node_death:370 1DC3CDAE88EF4D8CBBDE7C5CAC1E42FE: waiting 5000ms for notification of death of node 3
host kernel: (9466,5):dlm_send_remote_convert_request:395 ERROR: status = -107
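
The messages above repeat in a fixed pattern, so a long console log can be condensed before it is attached to a service request. The following is a minimal sketch, not an Oracle-supplied tool: the function name `summarize` and the regular expressions are assumptions based only on the message layout shown in this note. Status -107 corresponds to errno 107, which on Linux is ENOTCONN ("Transport endpoint is not connected"); the mapping is hard-coded so the script gives the same result on any platform.

```python
import re
from collections import Counter

# Linux errno value seen in the messages above (assumption: Linux numbering).
LINUX_ERRNO = {107: "ENOTCONN"}

# Patterns derived from the message layout in this note (hypothetical helpers).
WAIT_RE = re.compile(
    r"dlm_wait_for_node_death:\d+ (\w+): waiting (\d+)ms "
    r"for notification of death of node (\d+)"
)
ERR_RE = re.compile(r"(\w+):\d+ ERROR: status = (-?\d+)")


def summarize(lines):
    """Count 'waiting for node death' and 'ERROR: status' messages.

    Returns two Counters:
      waits  keyed by (dlm_domain, node_number)
      errors keyed by (function_name, errno_name_or_raw_status)
    """
    waits = Counter()
    errors = Counter()
    for line in lines:
        m = WAIT_RE.search(line)
        if m:
            domain, _timeout_ms, node = m.groups()
            waits[(domain, int(node))] += 1
            continue
        m = ERR_RE.search(line)
        if m:
            func, status = m.group(1), int(m.group(2))
            # Kernel status codes are negative errno values.
            name = LINUX_ERRNO.get(-status, str(status))
            errors[(func, name)] += 1
    return waits, errors
```

Calling `summarize(open("/var/log/messages"))` (the log path is an assumption and varies by distribution) returns the two counters, which show at a glance which node number the DLM is still waiting on and how often the convert request failed.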


Dumping the task stacks with 'echo t > /proc/sysrq-trigger' writes the following stack to the system log:

host kernel: events/0      D 00000806  1688    26      1            27    25 (L-TLB)
host kernel:        f635ed34 00000246 4a9b1e44 00000806 00000000 0000000a f632e000 4aa16374
host kernel:        00000806 00064530 f632e10c c18151e0 f1246200 c181569c c07cb000 c04283ba
host kernel:        c07cb000 f635ed3c c07cb000 c04284cb 00000000 00000000 00208b1d 00208b1d
host kernel: Call Trace:
host kernel:  [<c04283ba>] lock_timer_base+0x15/0x2f
host kernel:  [<c04284cb>] __mod_timer+0x99/0xa3
host kernel:  [<c0617ae0>] schedule_timeout+0x71/0x8c
host kernel:  [<c0427c3d>] process_timeout+0x0/0x5
host kernel:  [<f89dfa2a>] dlm_wait_for_node_death+0x107/0x201 [ocfs2_dlm]
host kernel:  [<c04312f3>] autoremove_wake_function+0x0/0x2d
host kernel:  [<f89cac71>] dlmconvert_remote+0x534/0x71c [ocfs2_dlm]
host kernel:  [<f89dd4e6>] dlm_wait_for_recovery+0x81/0xe6 [ocfs2_dlm]
host kernel:  [<f89d133d>] dlmlock+0x534/0x11d7 [ocfs2_dlm]
host kernel:  [<c0419cd2>] enqueue_task+0x29/0x39
host kernel:  [<c0419d2c>] __activate_task+0x4a/0x59
host kernel:  [<c041a510>] try_to_wake_up+0x32b/0x335
host kernel:  [<f8a93896>] ocfs2_cluster_lock+0x337/0x880 [ocfs2]
host kernel:  [<f8a9514e>] ocfs2_locking_ast+0x0/0x40a [ocfs2]
host kernel:  [<f8a9835c>] ocfs2_blocking_ast+0x0/0x259 [ocfs2]
host kernel:  [<f8a97e72>] ocfs2_orphan_scan_lock+0x47/0x73 [ocfs2]
host kernel:  [<f8aa5770>] ocfs2_queue_orphan_scan+0x2e/0x141 [ocfs2]
host kernel:  [<f8aa9afb>] ocfs2_orphan_scan_work+0x16/0x6f [ocfs2]
host kernel:  [<c042e32e>] run_workqueue+0x78/0xb5
host kernel:  [<f8aa9ae5>] ocfs2_orphan_scan_work+0x0/0x6f [ocfs2]
host kernel:  [<c042ec48>] worker_thread+0xd9/0x10d
host kernel:  [<c041a51a>] default_wake_function+0x0/0xc
host kernel:  [<c042eb6f>] worker_thread+0x0/0x10d
host kernel:  [<c0431231>] kthread+0xc0/0xeb
host kernel:  [<c0431171>] kthread+0x0/0xeb
host kernel:  [<c0403005>] kernel_thread_helper+0x5/0xb
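
To see which part of the trace belongs to OCFS2, the frames can be split by kernel module. The sketch below is illustrative only: `parse_frames` and `module_frames` are hypothetical helper names, and the pattern assumes the `[<address>] symbol+0xoff/0xsize [module]` layout shown in this note.

```python
import re

# One "Call Trace" frame: [<addr>] symbol+0xoff/0xsize, optionally followed by [module].
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?",
    re.IGNORECASE,
)


def parse_frames(lines):
    """Return (symbol, module) pairs; frames without a module tag get 'kernel'."""
    frames = []
    for line in lines:
        m = FRAME_RE.search(line)
        if m:
            frames.append((m.group(1), m.group(2) or "kernel"))
    return frames


def module_frames(frames, prefix="ocfs2"):
    """Keep only symbols whose module name starts with the given prefix."""
    return [sym for sym, mod in frames if mod.startswith(prefix)]
```

Applied to the trace above, `module_frames(parse_frames(lines))` keeps only the `ocfs2` and `ocfs2_dlm` symbols, which shows that the orphan scan work item is blocked in `dlm_wait_for_node_death`.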


On the other nodes, any process that accesses the OCFS2 filesystem hangs and cannot be killed with Ctrl+C or kill -9.
The only workaround is to reboot all nodes.

Cause
