
Solaris Cluster Aborting Node/System/Server due to an 'unkillable process or method execution failure' - What Data to Collect? (Doc ID 1310528.1)

Last updated on APRIL 13, 2017

Applies to:

Solaris Cluster Geographic Edition - Version 3.2 12/06 to OSC 4.1 [Release 3.2 to 4.1]
Solaris Cluster - Version 3.2 12/06 to OSC 4.1 [Release 3.2 to 4.1]
Oracle Solaris on SPARC (64-bit)
Oracle Solaris on SPARC (32-bit)
Oracle Solaris on x86-64 (64-bit)
Oracle Solaris on x86 (32-bit)

Goal

To define what data needs to be collected when a Solaris Cluster node is aborted because a resource method is unkillable or fails to execute.

Example A) when the method is unkillable:

SC[,SUNW.gds:6,<resourcegroup_name>,<resource_name>,gds_svc_stop]: [ID 606362 daemon.error] The stop command <stop_command_name> failed to stop the application. Will now use SIGKILL to stop the application.
Cluster.RGM.global.rgmd: [ID 764140 daemon.error] Method <gds_svc_stop> on resource <resource_name>, resource group <resourcegroup_name>, node <node_name>: Timeout.
Cluster.RGM.fed: [ID 922870 daemon.error] tag <node_name>.<resourcegroup_name>.<resource_name>.1: unable to kill process with SIGKILL
Cluster.RGM.global.rgmd: [ID 904914 daemon.error] fatal: Aborting this node because method <gds_svc_stop> on resource <resource_name> for node <node_name> is unkillable
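
If the node has not yet aborted, capturing the state of the stuck method process before the failfast fires helps root-cause analysis considerably. A minimal sketch using standard Solaris tools; the PID 12345 is hypothetical and must be replaced with the PID found via ps:

# Identify the hung method process (rpc.fed tags it <node>.<rg>.<resource>.<seq>)
ps -ef | grep gds_svc_stop

# User-level thread stacks and open files of the hung process
pstack 12345
pfiles 12345

# Kernel thread stacks for the same process, via the kernel debugger
echo "0t12345::pid2proc | ::walk thread | ::findstack -v" | mdb -k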


Example B) when the method fails to execute:

Sep 17 08:27:12 Node2 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 3509 (oracle)
Sep 17 08:27:15 Node2 last message repeated 2 times
Sep 17 08:27:16 Node2 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 4262 (perfd)
Sep 17 08:27:17 Node2 ufs: [ID 845546 kern.notice] NOTICE: alloc: /myfilesystem: file system full
Sep 17 08:27:18 Node2 tmpfs: [ID 518458 kern.warning] WARNING: /etc/svc/volatile: File system full, swap space limit exceeded

!! There were many of these 'swap' messages.

Sep 17 08:27:26 Node2 cl_eventlogd[4685]: [ID 349011 daemon.error] cl_plugins_dispatch failed for plugin /usr/cluster/lib/sc/events/default_plugin.so [0]
Sep 17 08:27:26 Node2 cl_eventlogd[4685]: [ID 693537 daemon.error] cl_plugins_dispatch failed [1]
Sep 17 08:27:26 Node2 sshd[21906]: [ID 800047 auth.error] error: fork: Error 0
Sep 17 08:27:28 Node2 last message repeated 1 time
Sep 17 08:27:30 Node2 Cluster.RGM.global.rgmd: [ID 652764 daemon.notice] libsecurity, door_call: Interrupted system call; will retry
Sep 17 08:27:32 Node2 Cluster.RGM.global.rgmd: [ID 656416 daemon.error] libsecurity, door_call: Fatal, the server is not available.
Sep 17 08:27:32 Node2 Cluster.RGM.global.rgmd: [ID 147501 daemon.error] Unable to make door call.
Sep 17 08:27:32 Node2 Cluster.RGM.global.rgmd: [ID 116577 daemon.error] Error making door call
Sep 17 08:27:32 Node2 Cluster.RGM.global.rgmd: [ID 444001 daemon.error] FE_RUN: Call failed, return code=1
Sep 17 08:27:32 Node2 Cluster.RGM.global.rgmd: [ID 722984 daemon.error] call to rpc.fed failed for resource <lhost-rs>, resource group <MY-RG>, method <hafoip_prenet_start>
Sep 17 08:27:32 Node2 sshd[21906]: [ID 800047 auth.error] error: fork: Error 0
Sep 17 08:27:32 Node2 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] launching method <hafoip_stop> for resource <lhost-rs>, resource group <MY-RG>, node <Node2>, timeout <300> seconds
Sep 17 08:27:32 Node2 Cluster.RGM.global.rgmd: [ID 656416 daemon.error] libsecurity, door_call: Fatal, the server is not available.
Sep 17 08:27:32 Node2 Cluster.RGM.global.rgmd: [ID 147501 daemon.error] Unable to make door call.
Sep 17 08:27:32 Node2 Cluster.RGM.global.rgmd: [ID 116577 daemon.error] Error making door call
Sep 17 08:27:32 Node2 Cluster.RGM.global.rgmd: [ID 444001 daemon.error] FE_RUN: Call failed, return code=1
Sep 17 08:27:32 Node2 Cluster.RGM.global.rgmd: [ID 722984 daemon.error] call to rpc.fed failed for resource <lhost-rs>, resource group <MY-RG>, method <hafoip_stop>   <---<<<
Sep 17 08:27:33 Node2 Cluster.RGM.global.rgmd: [ID 341539 daemon.error] fatal: Aborting node Node2 because method <hafoip_stop> failed on resource <lhost-rs> and Failover_mode is set to HARD <---<<<
Sep 17 08:27:34 Node2 sshd[21906]: [ID 800047 auth.error] error: fork: Error 0
Sep 17 08:27:35 Node2 last message repeated 1 time
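
In Example B the trigger is resource exhaustion: with no swap left, fork(2) fails and rgmd can no longer reach rpc.fed through its door, so the <hafoip_stop> method cannot be launched. When similar messages appear, the remaining swap can be checked with standard Solaris utilities (a sketch; column layouts vary slightly between releases):

swap -s     # summary of allocated, reserved and available swap
swap -l     # per-device swap space and free blocks
vmstat 5    # watch the 'swap' and 'free' columns over time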

This is the normal behavior of rgmd, and it is working as designed: the node is aborted to recover from these critical conditions, and the abort cleanly reboots the node.
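
The abort in Example B occurred because Failover_mode on the resource was set to HARD, as the last rgmd message shows. The current setting can be displayed with clresource(1CL); <resource_name> is a placeholder for the actual resource:

clresource show -p Failover_mode <resource_name>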

However, modifying the behavior of rgmd so that it collects a system core file allows Oracle Solaris Cluster technical support to perform the additional analysis required to root-cause the issue, should the problem recur during a reasonable post-event monitoring period (1-2 weeks). If the problem does recur after that period, a new SR can be opened referencing the original.
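
If rgmd is modified to panic the node so that a system core file is produced (the approach referenced above), the node must also be configured to save the resulting crash dump at boot, or the core will be lost. A quick check with the standard dumpadm(1M) utility; the paths shown are the Solaris defaults:

dumpadm                                # display the current dump configuration
dumpadm -y -s /var/crash/`hostname`    # ensure savecore runs at boot and set the save directory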

Solution
