Reconfigure OC2B Cluster Heartbeat Timeout to Default Value on Exalogic Virtual
(Doc ID 2435272.1)
Last updated on MARCH 16, 2021
Applies to:Oracle Exalogic Elastic Cloud Software - Version 188.8.131.52.170418 to 184.108.40.206.170418
Oracle Exalogic Elastic Cloud Software - Version 220.127.116.11.170718 and later
Oracle Virtual Server x86-64
★★★★ O2CB Timeout Settings in this Note DOES NOT APPLY to 18.104.22.168.X Versions ★★★★
NOTE: The procedure in this document may be performed as required if the rack is running Apr 2017 PSU or a higher 22.214.171.124.x version and is currently configured with an O2CB cluster heartbeat timeout of 24 hours, implemented as described in the following MOS note:
Increase O2CB Cluster Heartbeat Timeout on Exalogic Virtual (Doc ID 1995593.1)
Customers who have implemented custom timeout values with approval from the Exalogic engineering team may also perform this reconfiguration to decrease the cluster heartbeat timeout to the original default value of 5 minutes.
This reconfiguration is not integrated into an Exalogic PSU at this time.
A detailed step-by-step procedure was provided in the following Document to increase the O2CB cluster heartbeat timeout from 5 min to a very large value of 24 hr:
The increased timeout effectively prevented a catastrophic reboot of all compute nodes on an Exalogic rack in a virtual configuration due to fencing, in the event of storage loss for an extended period of time, such as during an ZFS SW upgrade or a ZFS HA failover event.
The UEK2 kernel bundled in April 2017 PSU (and higher) delivers an enhancement fix with the following bug:
BUG 24305598 - Backport fix for bug 21862940 to UEK2 codeline
(UNPUBLISHED) BASE BUG:21862940
BUG 21862940 - OCFS2: ALL NODES WILL FENCE SELF WHEN STORAGE DOWN
The backport bug 24305598 is fixed in UEK2 v2.6.39-400.288.1. The Exalogic April 2017 PSU bundles UEK2 kernel v2.6.39-400.294.1 which includes this fix.
The fix implemented provides a key enhancement relative to compute node behavior upon storage loss:
- If ALL nodes observe ZFS outage, they renegotiate with each other and wait for an additional period of time without all getting rebooted due to fencing (this happens when the takeover time exceeds 4-5 min resulting in storage loss across ALL nodes)
- If only a subset (one or more, but not all) observes ZFS outage, then this subset will fence. So if one node is unable to access the ZFS while others are able to, this node will fence
IMPORTANTThis enhancement prevents such rack-wide compute node reboots due to fencing upon extended loss of storage. Consequently, we recommend that the O2CB cluster heartbeat timeout be reconfigured to its original default value of 5 minutes.
This document provides a step-by-step procedure to reconfigure the O2CB cluster heartbeat timeout to the default value of 5 minutes.
To view full details, sign in with your My Oracle Support account.
Don't have a My Oracle Support account? Click to get started!
In this Document
|DCLI configuration and node list|
|Stop the Exalogic Control Stack|
|Reconfigure the O2CB cluster heartbeat timeout in the pool|
|Restore VM locks|
|Start the Exalogic Control Stack|
|Verify and update Cluster timeout value to 300 in OVMM (STEP ONLY APPLIES TO OCTOBER 2020 PSU VIRTUAL RACKS)|