
Oracle 12c RAC/ASM Instance will Crash if one Site is Down in a Stretch (Extended) Cluster (Doc ID 2093559.1)

Last updated on AUGUST 04, 2018

Applies to:

Linux OS - Version Oracle Linux 6.0 and later
Linux x86-64

Symptoms

12c RAC/ASM is configured across multiple sites in a stretch (extended) cluster.

In this setup, the database uses ASM disk groups that have access to a 2-path local disk device and a 2-path remote disk device.

When both paths to the remote device are down, I/Os (inputs/outputs) to the remote site hang: they neither fail nor complete for at least 2 minutes. During that time, DB processes such as LMON are in D state (i.e. waiting for I/O completion), as shown below:

loadavg : 9.67 4.11 2.41
System user time: 0.01 sys time: 0.00 context switch: 43694
Memory (Avail / Total) = 231891.14M / 257569.06M
Swap (Avail / Total) = 16896.00M / 16896.00M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 D o_edsdp 19527 1 0 80 0 - 7795510 dio_aw 14:27 ? 00:00:13 ora_lmon_C1D12C18 
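
To spot processes in this state more quickly, a listing like the one above can be filtered on the S (state) column. The following is a minimal sketch in Python, assuming the input lines use the same 15-column layout as the header shown above; the function names are illustrative and not part of any Oracle or Linux tool.

```python
# Column names taken from the header line of the listing above.
PS_COLUMNS = ["F", "S", "UID", "PID", "PPID", "C", "PRI", "NI",
              "ADDR", "SZ", "WCHAN", "STIME", "TTY", "TIME", "CMD"]


def parse_ps_line(line):
    """Split one process line into a dict keyed by column name.

    Returns None when the line does not have the expected column count.
    """
    # CMD is the last column and may contain spaces, so cap the split.
    fields = line.strip().split(None, len(PS_COLUMNS) - 1)
    if len(fields) != len(PS_COLUMNS):
        return None
    return dict(zip(PS_COLUMNS, fields))


def uninterruptible(lines):
    """Return the parsed rows whose state column (S) is 'D'."""
    rows = (parse_ps_line(line) for line in lines)
    return [row for row in rows if row and row["S"] == "D"]


# Sample input copied from the listing above (header plus the LMON line).
sample = [
    "F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD",
    "0 D o_edsdp 19527 1 0 80 0 - 7795510 dio_aw 14:27 ? 00:00:13 ora_lmon_C1D12C18",
]

for row in uninterruptible(sample):
    print(row["PID"], row["WCHAN"], row["CMD"])
```

The header line is skipped naturally because its state field is the literal "S", not "D". A WCHAN value of dio_aw matches the dio_await_completion frame in the kernel call stack.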

The kernel call stack of the blocked process is as shown below:

Oct 13 11:11:45 xxxxx kernel: [<ffffffff81599d09>] schedule+0x29/0x70
Oct 13 11:11:45 xxxxx kernel: [<ffffffff81599ddc>] io_schedule+0x8c/0xd0
Oct 13 11:11:45 xxxxx kernel: [<ffffffff811ce9e4>] dio_await_completion+0x54/0xd0
Oct 13 11:11:45 xxxxx kernel: [<ffffffff811d0c3a>] do_blockdev_direct_IO+0x5fa/0x860
Oct 13 11:11:45 xxxxx kernel: [<ffffffff811cc0f0>] ? I_BDEV+0x10/0x10
Oct 13 11:11:45 xxxxx kernel: [<ffffffff811d0eec>] __blockdev_direct_IO+0x4c/0x50
Oct 13 11:11:45 xxxxx kernel: [<ffffffff811cc0f0>] ? I_BDEV+0x10/0x10
Oct 13 11:11:45 xxxxx kernel: [<ffffffff811cd0cf>] blkdev_direct_IO+0x4f/0x60
Oct 13 11:11:45 xxxxx kernel: [<ffffffff811cc0f0>] ? I_BDEV+0x10/0x10
Oct 13 11:11:45 xxxxx kernel: [<ffffffff81132642>] generic_file_read_iter+0x122/0x130
Oct 13 11:11:45 xxxxx kernel: [<ffffffff811cc32b>] blkdev_read_iter+0x4b/0x70
Oct 13 11:11:45 xxxxx kernel: [<ffffffff8119395c>] do_aio_read+0xbc/0xd0
Oct 13 11:11:45 xxxxx kernel: [<ffffffff81193a17>] do_sync_read+0xa7/0xe0
Oct 13 11:11:45 xxxxx kernel: [<ffffffff81193d21>] vfs_read+0xb1/0x130
Oct 13 11:11:45 xxxxx kernel: [<ffffffff811946aa>] sys_pread64+0x9a/0xb0
Oct 13 11:11:45 xxxxx kernel: [<ffffffff815a3859>] system_call_fastpath+0x16/0x1b 

The following messages are observed in the DB alert log:

LGWR (ospid: 19615) waits for event 'log file parallel write' for 75 secs.
CKPT (ospid: 19619) waits for event 'control file sequential read' for 77 secs.
Mon Oct 12 15:19:30 2015
LMON (ospid: 19527) waits for event 'control file sequential read' for 94 secs.
Mon Oct 12 15:19:40 2015
Dumping diagnostic data in directory=[cdmp_20151012151940], requested by (instance=10, osid=31297 (LMHB)), summary=[abnormal instance termination].
Errors in file /edsd/apps/o_edsdp/diag/rdbms/.../.../trace/C1D12C18_lmhb_19565.trc (incident=888953):
ORA-29770: global enqueue process LMON (OSID 19527) is hung for more than 70 seconds
Incident details in: /edsd/apps/o_edsdp/diag/rdbms/.../.../incident/incdir_888953/C1D12C18_lmhb_19565_i888953.trc
Mon Oct 12 15:19:47 2015
Sweep [inc][888953]: completed
Sweep [inc2][888953]: completed
Errors in file /edsd/apps/o_edsdp/diag/rdbms/.../.../trace/C1D12C18_lmhb_19565.trc (incident=888954):
ORA-29770: global enqueue process LMON (OSID 19527) is hung for more than 70 seconds
Incident details in: /edsd/apps/o_edsdp/diag/rdbms/.../.../incident/incdir_888954/C1D12C18_lmhb_19565_i888954.trc
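
The "waits for event" lines in the alert log excerpt above show how long each background process has been blocked. As a rough illustration, the sketch below extracts the process name, OS pid, wait event and wait time from lines of that shape. It assumes the message format is exactly as in the excerpt; the regular expression and function names are illustrative only.

```python
import re

# Matches lines such as:
#   LMON (ospid: 19527) waits for event 'control file sequential read' for 94 secs.
WAIT_RE = re.compile(
    r"^(?P<proc>\w+) \(ospid: (?P<ospid>\d+)\) waits for event "
    r"'(?P<event>[^']+)' for (?P<secs>\d+) secs\.$"
)


def parse_waits(lines):
    """Return (process, ospid, event, seconds) for every matching line."""
    waits = []
    for line in lines:
        match = WAIT_RE.match(line.strip())
        if match:
            waits.append((match["proc"], int(match["ospid"]),
                          match["event"], int(match["secs"])))
    return waits


def longest_wait(lines):
    """Return the entry with the largest wait time, or None if none match."""
    waits = parse_waits(lines)
    return max(waits, key=lambda w: w[3]) if waits else None


# Sample lines copied from the alert log excerpt above.
alert_log = [
    "LGWR (ospid: 19615) waits for event 'log file parallel write' for 75 secs.",
    "CKPT (ospid: 19619) waits for event 'control file sequential read' for 77 secs.",
    "Mon Oct 12 15:19:30 2015",
    "LMON (ospid: 19527) waits for event 'control file sequential read' for 94 secs.",
]

print(longest_wait(alert_log))
```

In the excerpt, LMON has the longest wait (94 seconds), and it is the process named in the ORA-29770 error that follows.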

The issue occurs with the following default multipathing settings for HP 3PARdata storage:

no_path_retry 18
polling_interval 5
/sys/module/scsi_transport_fc/parameters/dev_loss_tmo = 60
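
As a rough worked example of why these values matter, the sketch below estimates how long dm-multipath keeps queuing I/O once every path to a device has failed. It assumes the usual dm-multipath behaviour, in which a numeric no_path_retry means queuing continues for that many polling intervals before the I/O is failed. The 70-second figure is taken from the ORA-29770 message in the alert log; dev_loss_tmo is not modelled here, and the function and constant names are illustrative.

```python
def queue_window_secs(no_path_retry, polling_interval):
    """Approximate time (seconds) that I/O stays queued after all paths fail.

    Assumption: queuing lasts no_path_retry polling intervals.
    """
    return no_path_retry * polling_interval


# Threshold quoted in the ORA-29770 message ("hung for more than 70 seconds").
LMHB_HANG_THRESHOLD_SECS = 70

# Values from the multipathing settings listed above.
window = queue_window_secs(no_path_retry=18, polling_interval=5)

print("queue window:", window, "seconds")
print("exceeds LMON hang threshold:", window > LMHB_HANG_THRESHOLD_SECS)
```

Under this assumption the queuing window is 18 x 5 = 90 seconds, which is longer than the 70 seconds after which LMHB reports LMON as hung.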


