ECE Coherence Cluster is in ENDANGERED State during Rolling Upgrade and Does Not Recover
(Doc ID 2536501.1)
Last updated on MAY 05, 2021
Applies to:
Oracle Coherence - Version 18.104.22.168.0 and later
Oracle Communications BRM - Elastic Charging Engine - Version 22.214.171.124.0 and later
Information in this document applies to any platform.
A customer reported the following behavior in an ECE environment. After a node/server, ecsN, was restarted, it stayed in the 'Partition Transferring...' state for a long time (20+ minutes), although the HA status was still MACHINE_SAFE and the cache counts still matched expectations. Because the HA status was MACHINE_SAFE, the user restarted another ecs node/server, ecsN+1, on a different machine. The logs then showed messages such as 'transferring already in progress...', the HA status changed to ENDANGERED, and the cache counts showed a loss of about Nk entries. After the user stopped ecsN, the node that had been in the 'Partition transferring...' state for a long time, the HA status returned to MACHINE_SAFE, but the cache counts still showed a loss of about (N-n)k entries.
During the rolling restart, the following message is logged:
Current partition distribution has been pending for over 2133 seconds
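The scenario above suggests that a MACHINE_SAFE status alone was not enough to justify restarting the next node while a partition transfer was still pending. The following is a minimal sketch of a pre-restart check along those lines, not an Oracle-provided tool. The function names, the policy of accepting only a machine-safe status, and the idea of feeding it recent log lines are all assumptions made for illustration. Only the message text is taken from this note.

```python
import re
from typing import Iterable, Optional

# Message text as quoted in this note; the surrounding log layout is not assumed.
PENDING_RE = re.compile(
    r"Current partition distribution has been pending for over (\d+) seconds"
)


def pending_distribution_seconds(log_lines: Iterable[str]) -> Optional[int]:
    """Return the largest pending time reported in log_lines, or None if absent."""
    longest = None
    for line in log_lines:
        match = PENDING_RE.search(line)
        if match:
            seconds = int(match.group(1))
            if longest is None or seconds > longest:
                longest = seconds
    return longest


def safe_to_restart_next(status_ha: str, recent_log_lines: Iterable[str]) -> bool:
    """Hypothetical gate: require machine-safe status AND no pending distribution."""
    # Accept both the MACHINE_SAFE spelling used in this note and MACHINE-SAFE.
    normalized = status_ha.strip().upper().replace("_", "-")
    if normalized != "MACHINE-SAFE":
        return False
    # A pending-distribution message means a transfer may still be in flight.
    return pending_distribution_seconds(recent_log_lines) is None
```

A rolling-restart script could call safe_to_restart_next with the current HA status and the log lines written since the previous node came back, and wait instead of proceeding when it returns False. How the status and log lines are collected is left out here because it depends on the deployment.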