Solaris Cluster - Resource Group Manager Daemon (RGMD) invoked node aborts. What you need to know. (Doc ID 1681697.1)

Last updated on FEBRUARY 27, 2017

Applies to:

Solaris Cluster - Version 3.2 to 4.3 [Release 3.2 to 4.3]
Solaris Cluster Geographic Edition - Version 3.2 to 4.3 [Release 3.2 to 4.3]
Oracle Solaris on SPARC (64-bit)
Oracle Solaris on x86-64 (64-bit)
Background:

At a base level, a Resource Group is a container for various Resources which are created from specified Resource
Type(s). A Resource Type (RT) includes a special Registration (RTR) File which contains registration information
and parameters (properties) as part of an Agent which is installed and registered with the cluster framework to
make an application highly available. For example, HA-Oracle contains at least two of these special files:

/opt/SUNWscor/[oracle_server,oracle_listener]/etc/[SUNW.oracle_server,SUNW.oracle_listener].

Many others are located in /opt, while some are located in /usr/cluster/lib/rgm/rtreg. For a better understanding,
review these files and, where available, the man pages of the same name.

Each RT provides various callback methods with related properties which are provided by the agent and invoked as
needed by RGMD. These are generally defined in the RTR file. Such methods will, for example, start, stop, and probe
a service provided by its associated resource. These methods are invoked with a specific timeout (hatimerun(1M)).

If the timeout expires, the method's process group is sent a SIGTERM signal. If the process tree does not exit within
ten seconds, a SIGKILL is sent. Sometimes even SIGKILL does not work because the process is stuck in the kernel and
has become unkillable. This is of course a very bad condition and is considered a critical fault.
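
The escalation described above can be illustrated with a minimal Python sketch. It is not RGMD or rpc.fed code: the function name run_with_timeout, the outcome labels, and the second wait after SIGKILL are assumptions made for illustration. Only the ten-second grace period between SIGTERM and SIGKILL comes from the text above.

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, timeout, grace=10.0):
    """Run cmd in its own process group and escalate if it overruns.

    Returns (return_code, outcome). The outcome labels are hypothetical
    names used only in this sketch.
    """
    # start_new_session=True makes the child a session and process group
    # leader, so its process group ID equals its PID.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout), "completed"
    except subprocess.TimeoutExpired:
        pass

    # Timeout expired: signal the whole process group, not just the child.
    os.killpg(proc.pid, signal.SIGTERM)
    try:
        return proc.wait(timeout=grace), "terminated"
    except subprocess.TimeoutExpired:
        pass

    # Still alive after the grace period: escalate to SIGKILL.
    os.killpg(proc.pid, signal.SIGKILL)
    try:
        return proc.wait(timeout=grace), "killed"
    except subprocess.TimeoutExpired:
        # A process stuck in the kernel may not die even on SIGKILL.
        return None, "unkillable"
```

For example, run_with_timeout(["sleep", "30"], timeout=0.5) ends the sleep with SIGTERM and reports the outcome "terminated". The "unkillable" outcome corresponds to the condition that RGMD treats as a critical fault.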

Example when method is unkillable:

SC[,SUNW.gds:6,,,gds_svc_stop]: [ID 606362 daemon.error] The stop command failed to stop the application. Will now use SIGKILL to stop the application.
Cluster.RGM.global.rgmd: [ID 764140 daemon.error] Method on resource , resource group , node : Timeout. <---<< STOP_TIMEOUT
Cluster.RGM.fed: [ID 922870 daemon.error] tag ...1: unable to kill process with SIGKILL
Cluster.RGM.global.rgmd: [ID 904914 daemon.error] fatal: Aborting this node because method on resource for node is unkillable <---<< Stuck


Another problem you may encounter is when one of the callback methods fails to execute.

Example when method fails to execute:

Sep 17 08:27:12 Node2 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 3509 (oracle)
Sep 17 08:27:15 Node2 last message repeated 2 times
Sep 17 08:27:16 Node2 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 4262 (perfd)
Sep 17 08:27:17 Node2 ufs: [ID 845546 kern.notice] NOTICE: alloc:/myfilesystem: file system full
Sep 17 08:27:18 Node2 tmpfs: [ID 518458 kern.warning] WARNING:/etc/svc/volatile: File system full, swap space limit exceeded

There were many of these 'swap' messages.

Sep 17 08:27:26 Node2 cl_eventlogd[4685]: [ID 349011 daemon.error] cl_plugins_dispatch failed for plugin /usr/cluster/lib/sc/events/default_plugin.so [0]
Sep 17 08:27:26 Node2 cl_eventlogd[4685]: [ID 693537 daemon.error] cl_plugins_dispatch failed [1]
Sep 17 08:27:26 Node2 sshd[21906]: [ID 800047 auth.error] error: fork: Error 0
Sep 17 08:27:28 Node2 last message repeated 1 time
Sep 17 08:27:30 Node2 Cluster.RGM.global.rgmd: [ID 652764 daemon.notice] libsecurity, door_call: Interrupted system call; will retry
Sep 17 08:27:32 Node2 Cluster.RGM.global.rgmd: [ID 656416 daemon.error] libsecurity, door_call: Fatal, the server is not available.
Sep 17 08:27:32 Node2 Cluster.RGM.global.rgmd: [ID 147501 daemon.error] Unable to make door call.
Sep 17 08:27:32 Node2 Cluster.RGM.global.rgmd: [ID 116577 daemon.error] Error making door call
Sep 17 08:27:32 Node2 Cluster.RGM.global.rgmd: [ID 444001 daemon.error] FE_RUN: Call failed, return code=1
Sep 17 08:27:32 Node2 Cluster.RGM.global.rgmd: [ID 722984 daemon.error] call to rpc.fed failed for resource , resource group , method
Sep 17 08:27:32 Node2 sshd[21906]: [ID 800047 auth.error] error: fork: Error 0
Sep 17 08:27:32 Node2 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] launching method for resource rs>, resource group , node , timeout <300> seconds
Sep 17 08:27:32 Node2 Cluster.RGM.global.rgmd: [ID 656416 daemon.error] libsecurity, door_call: Fatal, the server is not available.
Sep 17 08:27:32 Node2 Cluster.RGM.global.rgmd: [ID 147501 daemon.error] Unable to make door call.
Sep 17 08:27:32 Node2 Cluster.RGM.global.rgmd: [ID 116577 daemon.error] Error making door call
Sep 17 08:27:32 Node2 Cluster.RGM.global.rgmd: [ID 444001 daemon.error] FE_RUN: Call failed, return code=1
Sep 17 08:27:32 Node2 Cluster.RGM.global.rgmd: [ID 722984 daemon.error] call to rpc.fed failed for resource , resource group , method
Sep 17 08:27:33 Node2 Cluster.RGM.global.rgmd: [ID 341539 daemon.error] fatal: Aborting node Node2 because method failed on resource and Failover_mode is set to HARD

In this case the method could not be launched at all. This is different from a method that ran and returned a failure code.
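
When triaging logs like the two examples above, it can help to extract only the RGMD "fatal: Aborting" lines and note which of the two conditions each one reports. The following Python sketch does this by matching the message text shown in the examples. The names FATAL_RE, classify, and find_rgmd_aborts are hypothetical, and the pattern is an assumption based only on the sample lines in this document, so other message variants may not match.

```python
import re

# Pattern derived from the sample syslog lines in this document (assumption:
# other releases may word these messages differently).
FATAL_RE = re.compile(
    r"Cluster\.RGM\.\S*rgmd: \[ID (?P<msgid>\d+) daemon\.error\] "
    r"fatal: Aborting (?P<detail>.*)"
)


def classify(detail):
    """Map the message text to one of the two conditions described above."""
    if "unkillable" in detail:
        return "method unkillable"
    if "Failover_mode is set to HARD" in detail:
        return "method failed with Failover_mode=HARD"
    return "other"


def find_rgmd_aborts(lines):
    """Return (message_id, condition) for every RGMD fatal abort line."""
    hits = []
    for line in lines:
        match = FATAL_RE.search(line)
        if match:
            hits.append((match.group("msgid"), classify(match.group("detail"))))
    return hits
```

The function accepts any iterable of lines, so it can be given an open file handle for the system log (typically /var/adm/messages on Solaris) or a list of lines pasted from a service request.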

Every resource has a value for the Failover_mode property. See the r_properties(5) man page for your version of Solaris Cluster for details.

Goal

Understand how RGMD (Resource Group Manager Daemon) handles these critical conditions across Solaris Cluster (OSC) versions and patch levels, and know how to change RGMD's default behavior if necessary.
 

Solution
