
ECE Socket Timeouts at Production Environment (Doc ID 2631689.1)

Last updated on FEBRUARY 01, 2020

Applies to:

Oracle Communications BRM - Elastic Charging Engine - Version 11.3.0.6.0 and later
Information in this document applies to any platform.

Goal

The user has a topology with 5 servers. ECE01 and ECE02 were down due to a hardware failure. On ECE03, ECE04, and ECE05 the processes appeared to be running, but they were logging a large number of socket timeout errors and the Elastic Charging Engine (ECE) system was not responding. Coherence cluster communication went into a loop after the first timeout error.

The charging-coherence-override-prod.xml configuration file uses the cluster quorum policy to define the minimum number of nodes required to keep the cluster in a healthy state:
    <cluster-quorum-policy>
         <timeout-survivor-quorum role="Server">40</timeout-survivor-quorum>
    </cluster-quorum-policy>

A quorum of 40 nodes is configured, which looks reasonable for 5 servers with 10 nodes each: if one server goes down, the remaining 40 nodes should continue to handle the traffic.
Note: This configuration implies that only one server/VM can fail at a time, but in this case 2 servers (one physical server and one VM) failed within a 4-hour window.
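To illustrate the arithmetic only (this is not the recommended value from the Solution): 5 servers with 10 nodes each give 50 members with the "Server" role, so a survivor quorum of 40 only tolerates the loss of 10 members, i.e. one server. A hypothetical setting sized to tolerate the loss of two servers could look like the following, where the value 30 is an assumed example rather than a recommendation:

    <cluster-quorum-policy>
         <!-- 5 servers x 10 nodes = 50 members with role "Server".
              A survivor quorum of 30 would tolerate the loss of up to
              20 members (two full servers). Value shown for illustration only. -->
         <timeout-survivor-quorum role="Server">30</timeout-survivor-quorum>
    </cluster-quorum-policy>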

What are the right values for this parameter, based on the current or a new configuration?

Additionally, the ECE/Coherence cluster contains other kinds of members, such as Diameter Gateways, EM Gateways, and BRM Gateways, each with its own roleName. How should the quorum behavior be controlled for those roleNames? Is it just a matter of adding more lines (see the sketch below), or do these processes behave differently because they do not belong to the "Server" role?
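
As a sketch of what "adding more lines" could look like, assuming the Coherence release in use accepts multiple role-scoped timeout-survivor-quorum elements, entries such as the following could be added to charging-coherence-override-prod.xml. The role names and thresholds shown are assumptions for illustration and would have to match the roleName that each ECE process actually uses when joining the cluster:

    <cluster-quorum-policy>
         <!-- One entry per roleName; the names and thresholds below are
              assumed examples, not verified values for this topology. -->
         <timeout-survivor-quorum role="Server">40</timeout-survivor-quorum>
         <timeout-survivor-quorum role="DiameterGateway">1</timeout-survivor-quorum>
         <timeout-survivor-quorum role="EmGateway">1</timeout-survivor-quorum>
         <timeout-survivor-quorum role="BrmGateway">1</timeout-survivor-quorum>
    </cluster-quorum-policy>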
 

Solution
