Coherence Servers Configured For Different Machines Are Being Restarted When NM Is Restarted With CrashRecovery Enabled. (Doc ID 1486500.1)

Last updated on FEBRUARY 24, 2017

Applies to:

Oracle Coherence - Version 3.6.1.3 to 3.7.1 [Release AS10g]
Information in this document applies to any platform.

Symptoms

This article applies to a WebLogic Server (WLS) high-availability (HA) environment. The WLS domain contains WLS servers and Coherence Servers spread across two nodes (machines). The domain directory is shared between both nodes over an NFS mount. See <Note 1299088.1> for details on configuring a shared-directory domain.

The architecture can be summarized as follows:

Node 1 : MS1, MS2, COH1, COH2, managed with Node Manager (NM1)
Node 2 : MS3, MS4, COH3, COH4, managed with Node Manager (NM2)

MS<x> are WLS Managed Servers and COH<x> are Coherence Servers, all of them managed through the WLS Node Manager. On both nodes, the Node Manager's nodemanager.properties file sets CrashRecoveryEnabled=true and DomainsDirRemoteSharingEnabled=true.
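For reference, the two settings named above would appear in each node's nodemanager.properties roughly as follows. This is a fragment only: a real file contains other properties that are not shown here, and the comments are explanatory additions rather than part of the original article.

```properties
# Fragment of nodemanager.properties, identical on Node 1 (NM1) and Node 2 (NM2).
# Only the two settings discussed in this article are shown.

# Enables crash recovery in Node Manager (used here for HA and failover)
CrashRecoveryEnabled=true

# Set because the domain directory is shared between both nodes over NFS
DomainsDirRemoteSharingEnabled=true
```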

The problem occurs when NM1 is killed and then restarted, or when the Node 1 machine crashes and is restarted. On restart, NM1 tries to start COH3 and COH4 on Node 1, even though COH3 and COH4 are already running on Node 2. The problem does not happen for MS1 and MS2; it affects only the Coherence Servers.

If CrashRecoveryEnabled is set to false in nodemanager.properties, the behaviour does not reproduce. However, CrashRecoveryEnabled is set to true deliberately, for HA and failover purposes.

Cause
