Node reboot or eviction: How to check if your private interconnect CRS can transmit network heartbeats
(Doc ID 1445075.1)
Last updated on JUNE 14, 2024
Applies to:
Oracle Database - Enterprise Edition - Version 10.1.0.2 to 11.2.0.3 [Release 10.1 to 11.2]Oracle Database Cloud Schema Service - Version N/A and later
Oracle Database Exadata Express Cloud Service - Version N/A and later
Oracle Database Exadata Cloud Machine - Version N/A and later
Oracle Cloud Infrastructure - Database Service - Version N/A and later
Information in this document applies to any platform.
Goal
Frequently, in the case of node reboots, the log of the CSS daemon processes (ocssd.log) indicates that the network heartbeat from one or more remote nodes was not received (for example, the message "CRS-1610:Network communication with node xxxxxx (3) missing for 90% of timeout interval. Removal of this node from cluster in 2.656 seconds" appears in the ocssd.log), and that the node subsequently was rebooted (to avoid a split brain or because it was evicted by another node).
The script in here performs the network connectivity check using ssh. This check complements ping or traceroute since ssh uses TCP protocol while ping uses ICMP and traceroute in Linux/Unix uses UDP (traceroute on Windows use ICMP).
The network communication involves both the actual physical connection and the OS layer such as IP, UDP, and TCP.
CRS (10g and 11.1) uses TCP to communicate, so using ssh to test the connection as well as TCP and IP layer is a better test than ping or traceroute.
Because CRS on 11.2 uses UDP to communicate, using ssh to test TCP is not the optimal test, but this test will complement the traceroute test.
The script tests the private interconnect once every 5 seconds, so this script will put an insignificant load on the server.
Solution
To view full details, sign in with your My Oracle Support account. |
|
Don't have a My Oracle Support account? Click to get started! |
In this Document
Goal |
Solution |