
Oracle Linux: Linux node running on Exadata node reboots with sdp stack (Doc ID 2494773.1)

Last updated on JUNE 23, 2020

Applies to:

Linux OS - Version Oracle Linux 6.1 to Oracle Linux 6.9 with Unbreakable Enterprise Kernel [4.1.12] [Release OL6U1 to OL6U9]
Linux x86-64
Linux x86

Symptoms

An Exadata X4 system with two database nodes, running Oracle Linux, reboots frequently. The problem occurred on both nodes.

In /var/log/messages, the following stack trace appears just before the reboot:

Sep 28 10:33:53 server kernel: [34304.417861] BUG: Bad page state in process oracle pfn:117e5da
Sep 28 10:33:53 server kernel: [34304.423922] page:ffffea0045f97680 count:-1 mapcount:0 mapping: (null) index:0x0
Sep 28 10:33:53 server kernel: [34304.432434] flags: 0x2fffff80000000()
Sep 28 10:33:53 server kernel: [34304.436340] page dumped because: nonzero _count
Sep 28 10:33:53 server kernel: [34304.441092] Modules linked in: mpt3sas scsi_transport_sas raid_class ib_sdp oracleacfs(PO) oracleadvm(PO) oracleoks(PO) ipmi_poweroff ipmi_devintf nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs fscache lockd sunrpc grace bonding rds_rdma rds ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_sa ib_mad ib_core ib_addr mlx4_core dm_multipath bnx2i cnic uio cxgb4i libcxgbi cxgb4 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipv6 fuse iTCO_wdt iTCO_vendor_support ipmi_ssif ipmi_si ipmi_msghandler acpi_cpufreq sb_edac edac_core i2c_i801 i2c_core lpc_ich mfd_core shpchp sg ioatdma ixgbe dca ptp pps_core vxlan udp_tunnel ip6_udp_tunnel mdio ext3 jbd mbcache sd_mod ahci libahci megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
Sep 28 10:33:53 server kernel: [34304.441146] CPU: 30 PID: 127805 Comm: oracle Tainted: P B O 4.1.12-61.57.2.el6uek.x86_64 #2
Sep 28 10:33:53 server kernel: [34304.441148] Hardware name: Oracle Corporation SUN SERVER X4-2 /ASSY,MB,X4-2, 1U , BIOS 25050100 09/19/2017
Sep 28 10:33:53 server kernel: [34304.441149] 0000000000000000 ffff881012b3f8b8 ffffffff816993a0 ffffea0045f97680
Sep 28 10:33:53 server kernel: [34304.441152] ffffffff819614ce ffff881012b3f8e8 ffffffff8118e27d ffff881012b3f928
Sep 28 10:33:53 server kernel: [34304.441153] ffffea0045f97680 0000000000000000 00000000000200d2 ffff881012b3f938
Sep 28 10:33:53 server kernel: [34304.441155] Call Trace:
Sep 28 10:33:53 server kernel: [34304.441163] [] dump_stack+0x63/0x83
Sep 28 10:33:53 server kernel: [34304.441167] [] bad_page+0xed/0x140
Sep 28 10:33:53 server kernel: [34304.441169] [] prep_new_page+0x1b5/0x1d0
Sep 28 10:33:53 server kernel: [34304.441171] [] get_page_from_freelist+0x2d7/0x740
Sep 28 10:33:53 server kernel: [34304.441175] [] ? hrtimer_try_to_cancel+0x45/0x100
Sep 28 10:33:53 server kernel: [34304.441178] [] __alloc_pages_nodemask+0x19b/0x2d0
Sep 28 10:33:53 server kernel: [34304.441181] [] ? remove_wait_queue+0x3c/0x50
Sep 28 10:33:53 server kernel: [34304.441185] [] alloc_pages_current+0xaf/0x170
Sep 28 10:33:53 server kernel: [34304.441193] [] rds_page_remainder_alloc+0x1c3/0x2b0 [rds]
Sep 28 10:33:53 server kernel: [34304.441197] [] rds_message_copy_from_user+0x76/0x130 [rds]
Sep 28 10:33:53 server kernel: [34304.441201] [] rds_sendmsg+0x49b/0xa00 [rds]
Sep 28 10:33:53 server kernel: [34304.441205] [] ? rw_copy_check_uvector+0xa0/0x130
Sep 28 10:33:53 server kernel: [34304.441210] [] sock_sendmsg+0x4d/0x60
Sep 28 10:33:53 server kernel: [34304.441212] [] ___sys_sendmsg+0x30a/0x330
Sep 28 10:33:53 server kernel: [34304.441217] [] ? rds_ib_get_mr+0xd6/0x1a0 [rds_rdma]
Sep 28 10:33:53 server kernel: [34304.441221] [] ? __rds_rdma_map+0x16a/0x340 [rds]
Sep 28 10:33:53 server kernel: [34304.441224] [] __sys_sendmsg+0x49/0x90
Sep 28 10:33:53 server kernel: [34304.441227] [] ? syscall_trace_leave+0xf1/0x160
Sep 28 10:33:53 server kernel: [34304.441229] [] SyS_sendmsg+0x19/0x20
Sep 28 10:33:53 server kernel: [34304.441232] [] system_call_fastpath+0x12/0x71
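
Reports like the one above can be pulled out of the system log for review before engaging support. A minimal sketch, assuming the default /var/log/messages location:

```shell
# Search the system log for the page-state corruption report
# (log path assumed; adjust for your environment).
grep -E 'Bad page state|nonzero _count' /var/log/messages

# Confirm whether the RDS and SDP modules named in the trace are loaded.
lsmod | grep -E '^(rds|rds_rdma|ib_sdp)'
```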


A kernel patch for the above stack was applied through Ksplice, but the node rebooted again after running for some time and generated a vmcore. The vmcore shows the system rebooted with the stack below, which this time points to the SDP protocol (ib_sdp).

[540143.878231] Call Trace:
[540143.878233] [] dump_stack+0x63/0x83
[540143.878235] [] warn_slowpath_common+0x95/0xe0
[540143.878237] [] warn_slowpath_fmt+0x46/0x70
[540143.878239] [] __list_del_entry+0x63/0xd0
[540143.878240] [] list_del+0x11/0x40
[540143.878243] [] get_page_from_freelist+0x256/0x740
[540143.878244] [] ? __switch_to+0x212/0x5e0
[540143.878246] [] __alloc_pages_nodemask+0x19b/0x2d0
[540143.878249] [] alloc_pages_current+0xaf/0x170
[540143.878251] [] sdp_post_recv+0x1a1/0x9f0 [ib_sdp]
[540143.878254] [] sdp_do_posts+0x1ec/0x2a0 [ib_sdp]
[540143.878257] [] sdp_recvmsg+0x348/0xe30 [ib_sdp]

 
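The applied Ksplice updates and the generated vmcore can be inspected before opening a service request. A sketch only, assuming the Ksplice Uptrack client and the crash utility are installed; the debuginfo vmlinux and dump paths are examples:

```shell
# List the Ksplice (Uptrack) updates currently applied to the running kernel.
uptrack-show

# Open the vmcore with the crash utility against the matching debuginfo
# vmlinux (kernel version and dump path are examples; adjust to your system).
crash /usr/lib/debug/lib/modules/4.1.12-61.57.2.el6uek.x86_64/vmlinux /var/crash/vmcore
# At the crash> prompt, 'bt' prints the panic backtrace and 'mod' the modules.
```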

Cause



