Exadata: RDS Related Database Node Crashes (Doc ID 1337506.1)

Last updated on JANUARY 16, 2015

Applies to:

Linux OS - Version Oracle Linux 5.5 to Oracle Linux 5.5 [Release OL5U5]
Linux x86-64

Symptoms

If you are running an Exadata configuration with images earlier than 11.2.2.3.5 (and InfiniBand OFED - OFA packages earlier than 1.5.1-4.0.50), you may encounter unconditional database node reboots initiated by Linux OS kernel crashes.

The problems described here happen only under specific circumstances with very specific database workloads; therefore, your system might not be affected at all.
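The version thresholds quoted above can be compared mechanically. The following is a minimal sketch, not an Oracle-supplied tool: the function names `version_key` and `may_be_exposed` are hypothetical, the version strings are assumed to have been collected from the node by other means, and the sketch reads the "and" in the text literally (both the image and the OFA package predate the fixed levels).

```python
import re

# Thresholds quoted in the Symptoms text above.
FIXED_IMAGE = "11.2.2.3.5"
FIXED_OFA = "1.5.1-4.0.50"


def version_key(version: str) -> tuple:
    """Split a version such as '1.5.1-4.0.50' into a tuple of integers.

    Assumption: every field is numeric and fields are separated by '.' or '-'.
    """
    return tuple(int(part) for part in re.findall(r"\d+", version))


def may_be_exposed(image_version: str, ofa_version: str) -> bool:
    """Return True when both versions are earlier than the fixed levels.

    This only mirrors the version condition in the text; it says nothing
    about whether the workload actually triggers the crashes.
    """
    return (version_key(image_version) < version_key(FIXED_IMAGE)
            and version_key(ofa_version) < version_key(FIXED_OFA))


if __name__ == "__main__":
    # Illustrative inputs only, not taken from a real system.
    print(may_be_exposed("11.2.2.3.2", "1.5.1-4.0.47"))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as "4.0.9" sorting after "4.0.50".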


There are various manifestations of the problem. Below are the most common call stacks that might be dumped in the system log or extracted from a vmcore / crashcore dumped by the database node.

"rds_info" Induced Crash with RIP: rds_info_copy+24

PID: 16705  TASK: ffff811108fe6820  CPU: 9   COMMAND: "rds-info"
#0 [ffff810778e4fbc0] crash_kexec at ffffffff800adadc
#1 [ffff810778e4fc80] __die at ffffffff80065173
#2 [ffff810778e4fcc0] do_page_fault at ffffffff80066df3
#3 [ffff810778e4fdb0] error_exit at ffffffff8005dde9
[exception RIP: rds_info_copy+24]
....
#4 [ffff810778e4fe60] rds_sock_info at ffffffff8850b2df
#5 [ffff810778e4fec0] rds_info_getsockopt at ffffffff8850d202
#6 [ffff810778e4ff40] sys_getsockopt at ffffffff80225d65
#7 [ffff810778e4ff80] tracesys at ffffffff8005d28d (via system_call)

Soft Lockup in flush_tlb_others()

kernel: BUG: soft lockup - CPU#1 stuck for 10s!
...
kernel: RIP: 0010:[<ffffffff80016169>] __bitmap_empty+0xf/0x62
...
kernel: Call Trace:
kernel:  [<ffffffff80023005>] flush_tlb_others+0x97/0xba
kernel:  [<ffffffff8002b341>] flush_tlb_page+0xcd/0xda
kernel:  [<ffffffff8001111a>] do_wp_page+0x3fd/0x902
kernel:  [<ffffffff80037bf0>] do_sock_write+0xc6/0x102
kernel:  [<ffffffff80009648>] __handle_mm_fault+0xee5/0xfaa
kernel:  [<ffffffff80030d67>] release_sock+0x13/0xaa
kernel:  [<ffffffff80054449>] tcp_ioctl+0x10f/0x11b
kernel:  [<ffffffff80228caa>] sock_getsockopt+0x32b/0x34d
kernel:  [<ffffffff80066b71>] do_page_fault+0x4cb/0x874

kernel: RIP: 0010:[<ffffffff8001615a>] __bitmap_empty+0x0/0x62
...
kernel: Call Trace:
kernel:  [<ffffffff80023005>] flush_tlb_others+0x97/0xba
kernel:  [<ffffffff800769ba>] flush_tlb_mm+0xca/0xd5
kernel:  [<ffffffff800cd259>] zap_page_range+0xcb/0xf5
kernel:  [<ffffffff800b68cf>] audit_filter_inodes+0xbe/0xf9
kernel:  [<ffffffff800cce54>] sys_madvise+0x3a7/0x4c1
kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
kernel:
kernel: Kernel panic - not syncing: softlockup: hung tasks

NMI Lockup with Various Stacks around kmem_cache_alloc()

NMI Watchdog detected LOCKUP on CPU 6
RIP: 0010:[<ffffffff8005c36b>]  cache_alloc_refill+0xf1/0x186
...
Call Trace:
 [<ffffffff8000abf4>] kmem_cache_alloc+0x6c/0x76
 [<ffffffff8004410c>] sk_alloc+0x2e/0xf3
 [<ffffffff8851e126>] :rds:rds_create+0x39/0x122
 [<ffffffff8004d175>] __sock_create+0x170/0x27c
 [<ffffffff80226357>] sys_socket+0xf/0x36
 [<ffffffff8005d28d>] tracesys+0xd5/0xe

OR

[exception RIP: list_del+14]
--- <NMI exception stack> ---
 #6 [ffff81122d7d1e48] list_del at ffffffff80154090
 #7 [ffff81122d7d1e50] cache_alloc_refill at ffffffff8005c36b
 #8 [ffff81122d7d1e90] kmem_cache_alloc at ffffffff8000abf4
 #9 [ffff81122d7d1eb0] sk_alloc at ffffffff8004410c
#10 [ffff81122d7d1ef0] rds_create at ffffffff884f3126
#11 [ffff81122d7d1f10] __sock_create at ffffffff8004d175
#12 [ffff81122d7d1f60] sys_socket at ffffffff80226357

OR

[exception RIP: __list_add+72]
 #5 [ffff8108f58a7ef0] rds_create at ffffffff884f81f3
 #6 [ffff8108f58a7f10] __sock_create at ffffffff8004d175
 #7 [ffff8108f58a7f60] sys_socket at ffffffff80226357
 #8 [ffff8108f58a7f80] tracesys at ffffffff8005d28d (via system_call)

OR

[exception RIP: .text.lock.spinlock+2]
--- <NMI exception stack> ---
 #6 [ffff810ebca63e50] .text.lock.spinlock at ffffffff80064bfc (via _spin_lock)
 #7 [ffff810ebca63e50] cache_alloc_refill at ffffffff8005c2e8
 #8 [ffff810ebca63e90] kmem_cache_alloc at ffffffff8000abf4
 #9 [ffff810ebca63eb0] sk_alloc at ffffffff8004410c
#10 [ffff810ebca63ef0] rds_create at ffffffff8859b126
#11 [ffff810ebca63f10] __sock_create at ffffffff8004d175
#12 [ffff810ebca63f60] sys_socket at ffffffff80226357
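To check a saved system log or backtrace against the three manifestations listed above, a simple substring scan is often enough. This is a minimal sketch under stated assumptions: the signature names and the `match_signatures` helper are hypothetical, and the marker strings are simply function names and messages that appear in the stacks in this note. A match is only a hint that the stack resembles one of these patterns, not a diagnosis.

```python
# Marker strings taken from the call stacks listed above. A signature
# matches only when all of its markers appear somewhere in the text.
SIGNATURES = {
    "rds-info crash in rds_info_copy": ("rds_info_copy", "rds_info_getsockopt"),
    "soft lockup in flush_tlb_others": ("soft lockup", "flush_tlb_others"),
    "NMI lockup under rds_create": ("rds_create", "__sock_create"),
}


def match_signatures(log_text: str) -> list:
    """Return the names of all signatures whose markers all occur in log_text."""
    return [name for name, markers in SIGNATURES.items()
            if all(marker in log_text for marker in markers)]


if __name__ == "__main__":
    sample = (
        "kernel: BUG: soft lockup - CPU#1 stuck for 10s!\n"
        "kernel:  [<ffffffff80023005>] flush_tlb_others+0x97/0xba\n"
    )
    for name in match_signatures(sample):
        print(name)
```

Because the scan looks at the whole text at once, it is best applied to a single extracted backtrace rather than to an entire log that may contain unrelated stacks.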

Older RDS Crashes

There is an older problem, which is fixed in OFA 1.5.1-4.0.47 and later (Exadata image version 11.2.2.3.1). The relevant call stacks are shown below:

PID: 20331  TASK: ffff8117fd348080  CPU: 14  COMMAND: "rds-info"
#0 [ffff8103d7619bb0] crash_kexec at ffffffff800adadc
#1 [ffff8103d7619c70] __die at ffffffff80065173
#2 [ffff8103d7619cb0] do_page_fault at ffffffff80066df3
#3 [ffff8103d7619da0] error_exit at ffffffff8005dde9
[exception RIP: sock_i_ino+30]
   ...
#4 [ffff8103d7619e50] sock_i_ino at ffffffff80227f46
#5 [ffff8103d7619e60] rds_sock_info at ffffffff8859b302
#6 [ffff8103d7619ec0] rds_info_getsockopt at ffffffff8859d13a
#7 [ffff8103d7619f40] sys_getsockopt at ffffffff80225d65


PID: 16222  TASK: ffff811828bdb820  CPU: 2   COMMAND: "diskmon.bin"
#0 [ffff810155d8cdc0] crash_kexec at ffffffff800adadc
#1 [ffff810155d8ce80] die_nmi at ffffffff800652c1
#2 [ffff810155d8cea0] nmi_watchdog_tick at ffffffff80065a27
#3 [ffff810155d8cef0] default_do_nmi at ffffffff80065645
#4 [ffff810155d8cf40] do_nmi at ffffffff800658b2
#5 [ffff810155d8cf50] nmi at ffffffff80064eef
[exception RIP: cache_alloc_refill+212]
   ...  --- <NMI exception stack> ---
#6 [ffff811827777e58] cache_alloc_refill at ffffffff8005c34e
#7 [ffff811827777e90] kmem_cache_alloc at ffffffff8000abf4
#8 [ffff811827777eb0] sk_alloc at ffffffff8004410c
#9 [ffff811827777ef0] rds_create at ffffffff88599126 [rds]
#10 [ffff811827777f10] __sock_create at ffffffff8004d175
#11 [ffff811827777f60] sys_socket at ffffffff80226357
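The crash-style backtraces above share a regular frame layout (frame number, stack address in brackets, function name, "at", code address), so they can be split into fields for comparison. The following is a minimal sketch assuming that layout; `FRAME_RE`, `parse_frames`, and `rds_functions` are hypothetical names, and lines that do not fit the layout (for example the "[exception RIP: ...]" lines) are skipped.

```python
import re

# One frame line, e.g. "#5 [ffff8103d7619e60] rds_sock_info at ffffffff8859b302"
FRAME_RE = re.compile(
    r"^\s*#(\d+)\s+\[([0-9a-f]+)\]\s+(\S+)\s+at\s+([0-9a-f]+)"
)


def parse_frames(backtrace: str) -> list:
    """Return (frame_number, function_name, code_address) for each frame line."""
    frames = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append((int(match.group(1)), match.group(3), match.group(4)))
    return frames


def rds_functions(backtrace: str) -> list:
    """Return the names of frames whose function name starts with 'rds_'."""
    return [name for _, name, _ in parse_frames(backtrace)
            if name.startswith("rds_")]


if __name__ == "__main__":
    sample = (
        "[exception RIP: sock_i_ino+30]\n"
        "#4 [ffff8103d7619e50] sock_i_ino at ffffffff80227f46\n"
        "#5 [ffff8103d7619e60] rds_sock_info at ffffffff8859b302\n"
        "#6 [ffff8103d7619ec0] rds_info_getsockopt at ffffffff8859d13a\n"
    )
    print(rds_functions(sample))
```

Filtering on the "rds_" prefix is a naming heuristic only; it does not confirm that a frame belongs to the rds kernel module.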

Changes

The problem might manifest after (but not be limited to) the following changes:

Cause
