Coherence Cluster Data Loss due to Temporary Network Outage in the Cluster Environment with Cluster Quorum in use.
(Doc ID 2699511.1)
Last updated on AUGUST 17, 2020
Applies to: Oracle Coherence - Version 188.8.131.52.0 to 184.108.40.206.0 [Release 12c]
Information in this document applies to any platform.
The cluster Timeout Survivor Quorum policy is designed to ensure that a minimum number of members is maintained in the cluster at all times. It can also be used to ensure that, during a temporary network outage, a minimum cluster size is held together so that, once the network is restored, that cluster is treated as the "winning" cluster for split-brain resolution. Other members that discover after the network is restored that they were removed from the cluster restart all Coherence services and rejoin the running cluster.
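The survivor rule described above can be illustrated with a small toy model. This is only a sketch of the idea, not Coherence's implementation or API: the function name `eviction_allowed` and its parameters are hypothetical, and it assumes the policy simply refuses to evict suspect members when doing so would leave fewer members than the configured quorum.

```python
def eviction_allowed(cluster_size: int, suspects: int, survivor_quorum: int) -> bool:
    """Toy model (hypothetical): allow evicting `suspects` members only if
    the members that remain still meet the survivor quorum."""
    if suspects < 0 or suspects > cluster_size:
        raise ValueError("suspects must be between 0 and cluster_size")
    survivors = cluster_size - suspects
    return survivors >= survivor_quorum


# With 10 members and a quorum of 5, evicting 3 suspects leaves 7 survivors.
small_outage = eviction_allowed(cluster_size=10, suspects=3, survivor_quorum=5)

# Evicting 6 suspects would leave only 4 survivors, below the quorum.
large_outage = eviction_allowed(cluster_size=10, suspects=6, survivor_quorum=5)
```

In this model the quorum limits how far the cluster can shrink, but it says nothing about which members are evicted or which partitions they owned.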
The Timeout Survivor Quorum policy does not, however, guarantee that no data is lost during an outage, especially a complete network outage such as the one that occurred in this case. Enough members were evicted that, for some partitions, both the primary and backup owners were lost. This appears in the log file as an orphaned partition. The senior member must then assign a new primary and backup owner for the partition, and the partition starts out with no data in it.
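The way evictions lead to orphaned partitions can be sketched with a simplified ownership table. This is an illustrative model only, assuming one primary and one backup owner per partition; the member names, the `owners` table, and the `orphaned_partitions` helper are hypothetical and are not part of Coherence.

```python
from typing import Dict, List, Set, Tuple


def orphaned_partitions(
    owners: Dict[int, Tuple[str, str]], evicted: Set[str]
) -> List[int]:
    """Return the ids of partitions whose primary AND backup owners
    were both evicted, so no copy of their data survives."""
    return sorted(
        pid
        for pid, (primary, backup) in owners.items()
        if primary in evicted and backup in evicted
    )


# Hypothetical ownership table: partition id -> (primary member, backup member).
owners = {
    0: ("m1", "m2"),
    1: ("m2", "m3"),
    2: ("m3", "m4"),
    3: ("m4", "m1"),
}

# Losing a single member never orphans a partition: a backup always survives.
single_loss = orphaned_partitions(owners, {"m1"})

# Losing m1 and m2 together removes both copies of partition 0.
double_loss = orphaned_partitions(owners, {"m1", "m2"})
```

In this model a survivor quorum bounds how many members remain, but it does not prevent the evicted set from containing both owners of the same partition.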
What can be done to avoid data loss when one or more machines fall out of sync with the cluster because of a network outage?