Linux: Node Reboots After "kernel: INFO: task ocssd.bin:<pid> blocked for more than 120 seconds." in /var/log/messages
(Doc ID 1423693.1)
Last updated on JUNE 18, 2024
Applies to:
Oracle Database - Enterprise Edition - Version 10.2.0.1 and laterOracle Database Cloud Schema Service - Version N/A and later
Oracle Database Exadata Cloud Machine - Version N/A and later
Oracle Cloud Infrastructure - Database Service - Version N/A and later
Oracle Database Exadata Express Cloud Service - Version N/A and later
Generic Linux
Symptoms
Oracle Clusterware on Linux, node hangs and reboots after the following OS messages:
- /var/log/messages
Feb 13 20:21:45 racnode2 kernel: INFO: task pdflush:1793 blocked for more than 120 seconds.
Feb 13 20:21:45 racnode2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 13 20:21:45 racnode2 kernel: pdflush D ffffffff801538ac 0 1793 1163 1794 1792 (L-TLB)
..
Feb 13 20:21:45 racnode2 kernel: Call Trace:
Feb 13 20:21:45 racnode2 kernel: [<ffffffff8849d002>] :jbd:start_this_handle+0x2e5/0x36c
Feb 13 20:21:45 racnode2 kernel: [<ffffffff800a28ec>] autoremove_wake_function+0x0/0x2e
Feb 13 20:21:45 racnode2 kernel: [<ffffffff8849d152>] :jbd:journal_start+0xc9/0x100
Feb 13 20:21:45 racnode2 kernel: [<ffffffff884bb511>] :ext3:ext3_ordered_writepage+0x7d/0x198
Feb 13 20:21:45 racnode2 kernel: [<ffffffff8001d18f>] mpage_writepages+0x1bf/0x37d
Feb 13 20:21:45 racnode2 kernel: [<ffffffff884bb494>] :ext3:ext3_ordered_writepage+0x0/0x198
Feb 13 20:21:45 racnode2 kernel: [<ffffffff8815ed5c>] :dm_mod:dm_any_congested+0x38/0x3f
Feb 13 20:21:45 racnode2 kernel: [<ffffffff88160acf>] :dm_mod:dm_table_any_congested+0x53/0xa3
Feb 13 20:21:45 racnode2 kernel: [<ffffffff8005aebc>] do_writepages+0x29/0x2f
Feb 13 20:21:45 racnode2 kernel: [<ffffffff8002fe38>] __writeback_single_inode+0x19e/0x318
Feb 13 20:21:45 racnode2 kernel: [<ffffffff80021104>] sync_sb_inodes+0x1b5/0x26f
Feb 13 20:21:45 racnode2 kernel: [<ffffffff800a26d4>] keventd_create_kthread+0x0/0xc4
Feb 13 20:21:45 racnode2 kernel: [<ffffffff80051332>] writeback_inodes+0x82/0xd8
Feb 13 20:21:45 racnode2 kernel: [<ffffffff800cbbce>] wb_kupdate+0xd4/0x14e
Feb 13 20:21:45 racnode2 kernel: [<ffffffff800568c8>] pdflush+0x0/0x1fb
..
## >> note: the task name can be any running processes
Feb 13 20:21:45 racnode2 kernel:
Feb 13 20:21:45 racnode2 kernel: INFO: task kjournald:22190 blocked for more than 120 seconds.
Feb 13 20:21:45 racnode2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 13 20:21:45 racnode2 kernel: kjournald D ffffffff801538ac 0 22190 1163 22197 22185 (L-TLB)
..
Feb 13 20:21:45 racnode2 kernel: Call Trace:
Feb 13 20:21:45 racnode2 kernel: [<ffffffff8849e5cf>] :jbd:journal_commit_transaction+0x16d/0x1066
Feb 13 20:21:45 racnode2 kernel: [<ffffffff800a28ec>] autoremove_wake_function+0x0/0x2e
Feb 13 20:21:45 racnode2 kernel: [<ffffffff8004b425>] try_to_del_timer_sync+0x7f/0x88
Feb 13 20:21:45 racnode2 kernel: [<ffffffff884a25d3>] :jbd:kjournald+0xc1/0x213
Feb 13 20:21:45 racnode2 kernel: [<ffffffff800a28ec>] autoremove_wake_function+0x0/0x2e
Feb 13 20:21:45 racnode2 kernel: [<ffffffff884a2512>] :jbd:kjournald+0x0/0x213
Feb 13 20:21:45 racnode2 kernel: [<ffffffff800a26d4>] keventd_create_kthread+0x0/0xc4
Feb 13 20:21:45 racnode2 kernel: [<ffffffff80032b26>] kthread+0xfe/0x132
Feb 13 20:21:45 racnode2 kernel: [<ffffffff800a26d4>] keventd_create_kthread+0x0/0xc4
Feb 13 20:21:45 racnode2 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Feb 13 20:21:45 racnode2 kernel: [<ffffffff800a26d4>] keventd_create_kthread+0x0/0xc4
Feb 13 20:21:45 racnode2 kernel: [<ffffffff80032a28>] kthread+0x0/0x132
Feb 13 20:21:45 racnode2 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
..
Feb 13 20:21:45 racnode2 kernel: INFO: task multipathd:24551 blocked for more than 120 seconds.
..
Feb 13 20:21:45 racnode2 kernel: INFO: task ohasd.bin:9040 blocked for more than 120 seconds.
..
Feb 13 20:21:45 racnode2 kernel: INFO: task oracle:8977 blocked for more than 120 seconds.
..
Feb 13 20:21:45 racnode2 kernel: INFO: task tnslsnr:6129 blocked for more than 120 seconds.
..
Feb 13 20:21:45 racnode2 kernel: INFO: task ocssd.bin:7644 blocked for more than 120 seconds.
Feb 13 20:21:45 racnode2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 13 20:21:45 racnode2 kernel: ocssd.bin D ffffffff801538ac 0 7644 1 7645 5800 (NOTLB)
..
Feb 13 20:21:45 racnode2 kernel: [<ffffffff8001c41a>] generic_make_request+0x211/0x228
Feb 13 20:21:45 racnode2 kernel: [<ffffffff8006ec4e>] do_gettimeofday+0x40/0x90
Feb 13 20:21:45 racnode2 kernel: [<ffffffff800637ca>] io_schedule+0x3f/0x67
Feb 13 20:21:45 racnode2 kernel: [<ffffffff800f74c6>] __blockdev_direct_IO+0x8da/0xa82
Feb 13 20:21:45 racnode2 kernel: [<ffffffff8004fdcf>] __filemap_fdatawrite_range+0x50/0x5b
Feb 13 20:21:45 racnode2 kernel: [<ffffffff800e7fbc>] blkdev_direct_IO+0x32/0x37
Feb 13 20:21:45 racnode2 kernel: [<ffffffff800e7ef4>] blkdev_get_blocks+0x0/0x96
Feb 13 20:21:45 racnode2 kernel: [<ffffffff800c878a>] generic_file_direct_IO+0x108/0x122
Feb 13 20:21:45 racnode2 kernel: [<ffffffff8001f0ff>] generic_file_direct_write+0x60/0xf2
Feb 13 20:21:45 racnode2 kernel: [<ffffffff800166b8>] __generic_file_aio_write_nolock+0x2b8/0x3b6
Feb 13 20:21:45 racnode2 kernel: [<ffffffff800c890f>] generic_file_aio_write_nolock+0x20/0x6c
Feb 13 20:21:45 racnode2 kernel: [<ffffffff800c8ce6>] generic_file_write_nolock+0x8f/0xa8
Feb 13 20:21:45 racnode2 kernel: [<ffffffff800a28ec>] autoremove_wake_function+0x0/0x2e
Feb 13 20:21:45 racnode2 kernel: [<ffffffff800e6d70>] blkdev_file_write+0x1a/0x1f
Feb 13 20:21:45 racnode2 kernel: [<ffffffff80016b71>] vfs_write+0xce/0x174
Feb 13 20:21:45 racnode2 kernel: [<ffffffff8004419e>] sys_pwrite64+0x50/0x70
Feb 13 20:21:45 racnode2 kernel: [<ffffffff8005d229>] tracesys+0x71/0xe0
Feb 13 20:21:45 racnode2 kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
## >> note: node rebooted
Feb 13 20:35:33 racnode2 syslogd 1.4.1: restart.
Feb 13 20:35:33 racnode2 kernel: klogd 1.4.1, log source = /proc/kmsg started
- <grid-home>/log/<nodename>/cssd/ocssd.log
2012-02-13 20:18:13.243: [ CSSD][1116629312]clssnmSendingThread: sending status msg to all nodes
2012-02-13 20:18:13.243: [ CSSD][1116629312]clssnmSendingThread: sent 4 status msgs to all nodes
## >> note: node rebooted
2012-02-13 20:42:13.392: [ CSSD][1825956736]clsu_load_ENV_levels: Module = CSSD, LogLevel = 2, TraceLevel = 0
2012-02-13 20:42:13.392: [ CSSD][1825956736]clsu_load_ENV_levels: Module = GIPCNM, LogLevel = 2, TraceLevel = 0
- Last OSW snapshot was done a few minutes earlier
zzz ***Mon Feb 13 20:17:41 CST 2012
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
5 0 0 29418300 1508316 26174916 0 0 130 335 1 1 2 1 97 1 0
1 0 0 29412920 1508316 26174956 0 0 520 358 1499 20298 2 2 96 0 0
0 0 0 29414320 1508316 26174976 0 0 432 253 1464 19237 1 1 98 0 0
zzz ***Mon Feb 13 20:18:01 CST 2012
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
6 0 0 29420676 1508316 26175428 0 0 130 335 1 1 2 1 97 1 0
0 1 0 29418056 1508316 26175652 0 0 272 2566 1837 21831 1 2 94 3 0
0 1 0 29418956 1508316 26175476 0 0 352 2906 2067 21219 0 1 96 3 0
## >> note: no more record after this, node got rebooted a few minutes later
- <grid-home>/log/<nodename>/alert<nodename>.log - surviving node
2012-02-13 20:26:32.825.
[cssd(13407)]CRS-1607:Node racnode2 is being evicted in cluster incarnation 206713158; details at (:CSSNM00007:) in /ocw/grid/log/racnode1/cssd/ocssd.log.
2012-02-13 20:27:16.332
[ohasd(2158)]CRS-8011:reboot advisory message from host: racnode2, component: cssmonit, with time stamp: L-2012-02-13-20:27:16.331
[ohasd(2158)]CRS-8013:reboot advisory message text: Rebooting after limit 28000 exceeded; disk timeout 28000, network timeout 12630, last heartbeat from CSSD at epoch seconds 1329186408.287, 28040 milliseconds ago based on invariant clock value of 1229052344
2012-02-13 20:27:18.980
[cssd(13407)]CRS-1601:CSSD Reconfiguration complete. Active nodes are racnode1 .
2012-02-13 20:27:19.073
[crsd(23378)]CRS-5504:Node down event reported for node 'racnode2'
Cause
To view full details, sign in with your My Oracle Support account. |
|
Don't have a My Oracle Support account? Click to get started! |
In this Document
Symptoms |
Cause |
Solution |
References |