How to Diagnose Healthcheck Timeouts in OpenSAFfire Enabled Applications

(Doc ID 1496387.1)

Last updated on SEPTEMBER 07, 2017

Applies to:

Oracle Communications OpenSAFfire - Version 6.2.0 and later
Information in this document applies to any platform.
***Checked for relevance on 14-May-2014***
***Checked for relevance on 20-Oct-2015***

Goal

In OpenSAFfire, the Availability Management Framework (AMF) is responsible for ensuring the health of applications, which are represented as Components.

AMF does this by periodically invoking a healthcheck callback function in the application.  The component is expected to respond to each callback within a configured time in order to be deemed "healthy".

Unfortunately, it is possible for applications to get stuck or hang and fail to respond to AMF healthchecks in a reasonable timeframe.  This results in AMF performing the configured component recovery action.
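To make the contract concrete, the following is a minimal sketch of such a callback using the standard SA Forum AMF C API (saAmf.h).  The function and variable names are illustrative, and error handling is omitted for brevity:

#include <saAis.h>
#include <saAmf.h>

static SaAmfHandleT amf_handle;   /* obtained from saAmfInitialize() */

/* AMF invokes this callback once per healthcheck period. */
static void my_healthcheck_callback(SaInvocationT invocation,
                                    const SaNameT *comp_name,
                                    SaAmfHealthcheckKeyT *healthcheck_key)
{
        (void)comp_name;
        (void)healthcheck_key;

        /* Application-specific health verification would go here. */

        /* Respond before the callback response timer expires; otherwise
         * AMF records healthCheckcallbackTimeout and runs the configured
         * recovery action, as seen in the logs below. */
        saAmfResponse(amf_handle, invocation, SA_AIS_OK);
}

static void register_healthcheck_callback(void)
{
        SaVersionT version = { 'B', 1, 1 };
        SaAmfCallbacksT callbacks = { 0 };

        callbacks.saAmfHealthcheckCallback = my_healthcheck_callback;
        saAmfInitialize(&amf_handle, &callbacks, &version);
}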

When analysing the logs, the following sequence of entries typically indicates that the component is not responding to AMF in time:

1. AMF initiates the callback to the component:

Jun 17 20:23:40.761938 osafamfnd [4034:avnd_comp.c:2131] >> avnd_comp_cbk_send: 'safComp=Comp1,safSu=Su1,safSg=Sg1,safApp=App1' 1
Jun 17 20:23:40.761951 osafamfnd [4034:avnd_cbq.c:0661] >> avnd_comp_cbq_send
Jun 17 20:23:40.761964 osafamfnd [4034:avnd_cbq.c:0730] >> avnd_comp_cbq_rec_send
...
Jun 17 20:23:40.762132 osafamfnd [4034:avnd_tmr.c:0097] TR callback response timer started

2. The component does not respond in time, so AMF escalates the error:

Jun 17 20:23:46.865599 osafamfnd [4034:avnd_err.c:0298] >> avnd_err_process: Comp:'safComp=Comp1,safSu=Su1,safSg=Sg1,safApp=App1' esc_rcvr:'4'
Jun 17 20:23:46.865668 osafamfnd [4034:avnd_err.c:0345] NO 'safComp=Comp1,safSu=Su1,safSg=Sg1,safApp=App1' faulted due to 'healthCheckcallbackTimeout(6)' : Recovery is 'nodeSwitchover(4)'
Jun 17 20:23:46.865684 osafamfnd [4034:avnd_err.c:0437] >> avnd_err_recover: SU:safSu=Su1,safSg=Sg1,safApp=App1 Comp:safComp=Comp1,safSu=Su1,safSg=Sg1,safApp=App1

In this example the configured recovery is nodeSwitchover(4): the remaining components on the node are quiesced and their active assignments fail over to the standby node.
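For context, a component typically receives AMF callbacks through a dispatch loop built around the AMF selection object.  The sketch below assumes the standard AMF C API, with error handling again omitted; if the thread running this loop is blocked elsewhere, saAmfDispatch() is never reached, the healthcheck callback never runs, and the callback response timer from step 1 expires:

#include <sys/select.h>
#include <saAis.h>
#include <saAmf.h>

extern SaAmfHandleT amf_handle;   /* from saAmfInitialize() */

static void dispatch_loop(void)
{
        SaSelectionObjectT amf_fd;
        fd_set fds;

        /* File descriptor that becomes readable when AMF has a
         * pending callback for this component. */
        saAmfSelectionObjectGet(amf_handle, &amf_fd);

        for (;;) {
                FD_ZERO(&fds);
                FD_SET((int)amf_fd, &fds);

                /* This matches the select() calls visible in the
                 * strace output below. */
                select((int)amf_fd + 1, &fds, NULL, NULL, NULL);

                if (FD_ISSET((int)amf_fd, &fds))
                        saAmfDispatch(amf_handle, SA_DISPATCH_ALL);
        }
}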

To determine where the component is getting stuck, the Linux strace utility can attach to a running process (identified by its Process ID) and log every system call the process makes.  For example:

# strace -p 4226
...
write(3, "Jun 17 20:23:39.677742 ava [4226:"..., 67) = 67
select(21, [12 14 16 18 20], NULL, NULL, {3, 0}) = 0 (Timeout)
time(NULL)                              = 1339964622
write(3, "Jun 17 20:23:42.679122 ava [4226:"..., 72) = 72
write(3, "Jun 17 20:23:42.679229 ava [4226:"..., 76) = 76
futex(0xfe3c50c, FUTEX_WAIT_PRIVATE, 2, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGTERM (Terminated) @ 0 (0) ---
write(1, "PID File created successfully.\n\n"..., 32) = 32
write(1, "sigterm_handler: exiting"..., 24) = 24
exit_group(0)                           = ?
Process 4226 detached

Note that the SIGTERM is sent by AMF as it attempts to recover the component.

Correlating the strace output with the component log, we can see that the last line written to the log matches the final write() system call, after which the process blocks indefinitely in futex() (a pthread mutex wait) and never responds to the healthcheck:

Jun 17 20:23:42.679229 ava [4226:lga_mds.c:1165] >> lga_mds_msg_async_send
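The futex(..., FUTEX_WAIT_PRIVATE, ...) call with no timeout indicates the thread was blocked waiting on a pthread mutex inside the logging path.  The hypothetical fragment below, with invented names purely for illustration, shows this general failure pattern: a hung thread holding a lock that the healthcheck path also needs blocks the response indefinitely:

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;

/* A worker thread that hangs while holding the lock. */
static void *stuck_worker(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&log_lock);
        pause();                        /* hangs, lock never released */
        pthread_mutex_unlock(&log_lock);
        return NULL;
}

/* Called from the healthcheck callback path; once stuck_worker() has
 * taken the lock, this blocks forever and appears in strace as
 * futex(..., FUTEX_WAIT_PRIVATE, ...). */
static void log_and_respond(void)
{
        pthread_mutex_lock(&log_lock);
        /* ... write the log record, then call saAmfResponse() ... */
        pthread_mutex_unlock(&log_lock);
}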

Solution
