
Extended cluster instance crashes if one interconnect network fails (Doc ID 2236102.1)

Last updated on SEPTEMBER 16, 2024

Applies to:

Oracle Database - Enterprise Edition - Version 12.1.0.2 to 12.2.0.1 [Release 12.1 to 12.2]
Oracle Database Cloud Schema Service - Version N/A and later
Oracle Database Exadata Express Cloud Service - Version N/A and later
Gen 1 Exadata Cloud at Customer (Oracle Exadata Database Cloud Machine) - Version N/A and later
Oracle Cloud Infrastructure - Database Service - Version N/A and later
Information in this document applies to any platform.

Symptoms

o Extended RAC configuration with two sites (Site A and Site B) and two nodes on each site.

o Two network interfaces are registered for the private communication; 'oifcfg getif' shows:

$ oifcfg getif
<interface1> <subnet1> global public
<interface2> <subnet2> global cluster_interconnect
<interface3> <subnet3> global cluster_interconnect
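For reference, a second private interface is normally registered with 'oifcfg setif'; the interface and subnet names below are placeholders and must match the actual environment:

$ oifcfg setif -global <interface3>/<subnet3>:cluster_interconnect
$ oifcfg getif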

o A communication issue occurs on one of the private networks.

o Several communication errors are reported by the database and ASM instances when the network communication is affected, for example:

- At the ASM instances:

ERROR: Network OS Ping failed to inst <instance number> on IP (169.254.*.*)

- At database instances:

IPC Send timeout detected. Sender: ospid <PID> ...
Receiver: inst 3 binc 636266236 ospid <PID>
IPC Send timeout detected. Sender: ospid <PID>...
Receiver: inst 4 binc 636568242 ospid <PID>
...
LMON (ospid: <PID>) detects hung instances during IMR reconfiguration
LMON (ospid: <PID>) tries to kill the instance 3 in 10 seconds.
Please check instance 3's alert log and LMON trace file for more details.
LMON (ospid: 137826) aborts 1 previously scheduled instance kills
...
Evicting instance 3 from cluster
Evicting instance 4 from cluster
Waiting for instances to leave: 3 4
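These messages can be located in the ASM and database alert logs; a minimal search, assuming a standard ADR layout (database name, SIDs and paths are placeholders), is:

$ grep -i "IPC Send timeout" $ORACLE_BASE/diag/rdbms/<db_name>/<SID>/trace/alert_<SID>.log
$ grep -i "Network OS Ping failed" $ORACLE_BASE/diag/asm/+asm/<ASM_SID>/trace/alert_<ASM_SID>.log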

o When the issue happens, the GIPC RANK remains at 99 even though the number of messages decreases:

inf[ 0] <interface 2> - rank 99, avgms 0.327869 [ 196 / 61 / 61 ]

- Note that the values decreased from 196 earlier to around 61.
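Rank lines of this form are typically written to the GIPC daemon trace (an assumption; the exact trace file can vary by version). One way to follow them, assuming a 12c Grid Infrastructure ADR trace layout with the hostname as a placeholder, is:

$ grep -i "rank" $ORACLE_BASE/diag/crs/<hostname>/crs/trace/gipcd.trc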

o If one node is stopped to make it a 3-node cluster, the issue does not happen.

o In the 3-node configuration, the GIPC RANK drops to 0 on the affected node when the issue happens, causing HAIP to move the 169.254.*.* address to the other interface (see the query sketch below).
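To check which physical interface each HAIP address (169.254.*.*) is currently bound to on every instance, a query against GV$CLUSTER_INTERCONNECTS can be used (column output may vary by version):

SQL> SELECT inst_id, name, ip_address, source FROM gv$cluster_interconnects ORDER BY inst_id;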

 

Changes

This is an Extended RAC implementation.

Cause



In this Document
Symptoms
Changes
Cause
Solution
References

