Snapshot-based Backup via NFS Over Infiniband Network Will Freeze The Node (Doc ID 1632242.1)

Last updated on SEPTEMBER 24, 2021

Applies to:

Exadata Database Machine V2 - Version All Versions and later
Information in this document applies to any platform.

Symptoms

The backup server hangs during a snapshot-based backup from an Exadata compute node via NFS over the InfiniBand network. The same backup procedure works when it runs over a standard 10Gb Ethernet card with default settings.

The messages file on the node shows the following entries while the backup runs.

Aug 28 10:27:43 exahostdb01 lvm[79582]: Monitoring snapshot VGExaDb-u01_snap <<<
..
Aug 28 10:31:14 exahostdb01 kernel: nfs: server exaNFSbackup not responding, still trying
Aug 28 10:39:45 exahostdb01 kernel: nfs: server exaNFSbackup OK
Aug 28 10:55:00 exahostdb01 kernel: nfs: server exaNFSbackup not responding, still trying
...
Aug 28 11:01:10 exahostdb01 kernel: nfs: server exaNFSbackup OK
Aug 28 11:13:00 exahostdb01 kernel: nfs: server exaNFSbackup not responding, still trying
Aug 28 11:13:53 exahostdb01 kernel: nfs: server exaNFSbackup OK
Aug 28 12:41:20 exahostdb01 kernel: RDS/IB: re-connect to 169.XXX.XXX.XXX is stalling for more than 1 min...(drops=12 err=0)
Aug 28 12:41:20 exahostdb01 kernel: RDS/IB: re-connect to 169.XXX.XXX.XXX is stalling for more than 1 min...(drops=12 err=0)
Aug 28 12:41:58 exahostdb01 kernel: RDS/IB: re-connect to 10.XXX.XXX.XXX is stalling for more than 1 min...(drops=1 err=0)
Aug 28 14:13:49 exahostdb01 kernel: RDS/IB: connected to 10.XXX.XXX.XXX version 3.1
Aug 28 14:16:47 exahostdb01 kernel: RDS/IB: connected to 169.XXX.XXX.XXX version 3.1
Aug 28 14:16:47 exahostdb01 kernel: RDS/IB: connected to 169.XXX.XXX.XXX version 3.1
....
Sep  5 12:16:12 exahostdb01 kernel: INFO: task lsof:65691 blocked for more than 120 seconds.
Sep  5 12:16:12 exahostdb01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep  5 12:16:12 exahostdb01 kernel: lsof          D 0000000000000000     0 65691  65680 0x00000080
Sep  5 12:16:12 exahostdb01 kernel:  ffff88116ec7bc08 0000000000000082 0000000000000000 ffffffffadf60c48
Sep  5 12:16:12 exahostdb01 kernel:  ffff88355e6ea080 ffffffff81aae4c0 ffff88355e6ea450 0000000176b6fa52
Sep  5 12:16:12 exahostdb01 kernel:  000000006ec7bc98 0000000000000000 0000000000000000 ffff88355e6ea080
Sep  5 12:16:12 exahostdb01 kernel: Call Trace:
Sep  5 12:16:12 exahostdb01 kernel:  [<ffffffff814569cc>] io_schedule+0x42/0x5c
Sep  5 12:16:12 exahostdb01 kernel:  [<ffffffffa0614b02>] nfs_wait_bit_uninterruptible+0xe/0x12 [nfs]
Sep  5 12:16:12 exahostdb01 kernel:  [<ffffffff81456efb>] __wait_on_bit+0x4a/0x7c
Sep  5 12:16:12 exahostdb01 kernel:  [<ffffffffa0614af4>] ? nfs_wait_bit_uninterruptible+0x0/0x12 [nfs]
Sep  5 12:16:12 exahostdb01 kernel:  [<ffffffffa0614af4>] ? nfs_wait_bit_uninterruptible+0x0/0x12 [nfs]
Sep  5 12:16:12 exahostdb01 kernel:  [<ffffffff81456fa0>] out_of_line_wait_on_bit+0x73/0x80
Sep  5 12:16:12 exahostdb01 kernel:  [<ffffffff8107706d>] ? wake_bit_function+0x0/0x2f
Sep  5 12:16:12 exahostdb01 kernel:  [<ffffffffa0614af2>] nfs_wait_on_request+0x2b/0x2d [nfs]
Sep  5 12:16:12 exahostdb01 kernel:  [<ffffffffa0618a6c>] nfs_sync_mapping_wait+0xec/0x1fa [nfs]
Sep  5 12:16:12 exahostdb01 kernel:  [<ffffffffa0619073>] nfs_write_mapping+0x77/0x9e [nfs]
Sep  5 12:16:12 exahostdb01 kernel:  [<ffffffff810432d6>] ? should_resched+0xe/0x2f
Sep  5 12:16:12 exahostdb01 kernel:  [<ffffffffa06190b4>] nfs_wb_nocommit+0x1a/0x1c [nfs]
Sep  5 12:16:12 exahostdb01 kernel:  [<ffffffffa060e184>] nfs_getattr+0x61/0xef [nfs]
Sep  5 12:16:12 exahostdb01 kernel:  [<ffffffff8111ea7b>] vfs_getattr+0x4c/0x69
Sep  5 12:16:12 exahostdb01 kernel:  [<ffffffff8111eae8>] vfs_fstatat+0x50/0x67
Sep  5 12:16:12 exahostdb01 kernel:  [<ffffffff8111ebe5>] vfs_stat+0x1b/0x1d
Sep  5 12:16:12 exahostdb01 kernel:  [<ffffffff8111ec06>] sys_newstat+0x1f/0x39
Sep  5 12:16:12 exahostdb01 kernel:  [<ffffffff810a9d23>] ? audit_syscall_entry+0x103/0x12f
Sep  5 12:16:12 exahostdb01 kernel:  [<ffffffff81011db2>] system_call_fastpath+0x16/0x1b
...
Sep 10 16:00:39 exahostdb01 kernel: ixgbe 0000:20:00.0: eth0: NIC Link is Up 1 Gbps, Flow Control: RX/TX
Sep 10 16:00:39 exahostdb01 kernel: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Sep 10 16:00:41 exahostdb01 kernel: ib0: packet len 2398 (> 2048) too long to send, dropping
Sep 10 16:00:41 exahostdb01 last message repeated 2 times
Sep 10 16:11:34 exahostdb01 kernel: ixgbe 0000:30:00.1: eth5: NIC Link is Down <<<<
Interface configuration for bondib0:

bondib0   Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  (ib0 + ib1)
          inet addr:10.x.x.x  Bcast:10.x.x.255  Mask:255.255.255.0
          inet6 addr: fe80::221:2800:1fc:b3ed/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:65520  Metric:1
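
To check whether another node shows the same symptoms, the log signatures above can be searched for programmatically. The following is a minimal Python sketch, not an Oracle-provided tool. The category names, the function names, and the `year` parameter are all assumptions made for illustration (syslog timestamps carry no year, so a placeholder is supplied). It counts the symptom lines and measures how long each NFS "not responding" episode lasted before the matching "OK" message.

```python
import re
from collections import Counter
from datetime import datetime

# Patterns copied from the log excerpt in this note; the category names are
# illustrative labels, not Oracle terminology.
SIGNATURES = {
    "nfs_not_responding": re.compile(r"nfs: server \S+ not responding"),
    "nfs_ok": re.compile(r"nfs: server \S+ OK"),
    "rds_reconnect_stall": re.compile(r"RDS/IB: re-connect to \S+ is stalling"),
    "hung_task": re.compile(r"INFO: task \S+ blocked for more than \d+ seconds"),
    "ib_packet_dropped": re.compile(r"ib\d+: packet len \d+ \(> \d+\) too long to send"),
}


def classify(line):
    """Return the name of the first matching signature, or None."""
    for name, pattern in SIGNATURES.items():
        if pattern.search(line):
            return name
    return None


def summarize(lines):
    """Count how many lines match each symptom signature."""
    return Counter(name for name in map(classify, lines) if name)


def parse_timestamp(line, year):
    """Parse the 15-character syslog prefix, e.g. 'Aug 28 10:31:14'."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def nfs_outages(lines, year=2000):
    """Pair each 'not responding' line with the next 'OK' line.

    Returns the outage durations in seconds. `year` is a placeholder
    (assumption) because the log lines do not record one.
    """
    outages = []
    started = None
    for line in lines:
        kind = classify(line)
        if kind == "nfs_not_responding" and started is None:
            started = parse_timestamp(line, year)
        elif kind == "nfs_ok" and started is not None:
            outages.append((parse_timestamp(line, year) - started).total_seconds())
            started = None
    return outages


# Lines taken verbatim from the excerpt above.
SAMPLE = [
    "Aug 28 10:31:14 exahostdb01 kernel: nfs: server exaNFSbackup not responding, still trying",
    "Aug 28 10:39:45 exahostdb01 kernel: nfs: server exaNFSbackup OK",
    "Aug 28 10:55:00 exahostdb01 kernel: nfs: server exaNFSbackup not responding, still trying",
    "Aug 28 11:01:10 exahostdb01 kernel: nfs: server exaNFSbackup OK",
    "Aug 28 11:13:00 exahostdb01 kernel: nfs: server exaNFSbackup not responding, still trying",
    "Aug 28 11:13:53 exahostdb01 kernel: nfs: server exaNFSbackup OK",
    "Sep  5 12:16:12 exahostdb01 kernel: INFO: task lsof:65691 blocked for more than 120 seconds.",
    "Sep 10 16:00:41 exahostdb01 kernel: ib0: packet len 2398 (> 2048) too long to send, dropping",
]

if __name__ == "__main__":
    print(dict(summarize(SAMPLE)))
    print(nfs_outages(SAMPLE))  # [511.0, 370.0, 53.0]
```

For the sample lines, the three NFS outages last 511, 370, and 53 seconds. To scan a real log, pass an open file object for the node's messages file to `summarize` and `nfs_outages` in place of `SAMPLE`.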

Cause

To view full details, sign in with your My Oracle Support account.
