Solaris Cluster x86 Panics With "pm_tick delay" or cacao Reports "Common Agent Container exited abnormally" When CPU C-states Are Enabled

(Doc ID 1368608.1)

Last updated on APRIL 11, 2017

Applies to:

Solaris Cluster - Version 3.2 to 4.3 [Release 3.2 to 4.3]
Solaris Cluster Geographic Edition - Version 3.2 to 4.3 [Release 3.2 to 4.3]
Oracle Solaris on x86 (32-bit)
Oracle Solaris on x86-64 (64-bit)

Symptoms

This problem occurs only on Intel Xeon processors (Nehalem/Westmere CPUs) that support the ACPI power management feature. This feature allows the operating system to place the processor in a low-power state (deep C-state) when it is not in use. The result can be very long I/O latency, and the system may appear to hang.

On Solaris Cluster x86 running on such a system (for example, an X4470), the following errors have been seen:

A) pm_tick delay messages, which can lead to a node panic

Oct 12 05:25:08 node1 genunix: [ID 313806 kern.notice] NOTICE: pm_tick delay of 4749 ms exceeds 2147 ms
Oct 12 05:25:08 node1 genunix: [ID 313806 kern.notice] NOTICE: pm_tick delay of 4750 ms exceeds 2147 ms
Oct 12 05:25:09 node1 last message repeated 3 times
Oct 12 05:25:09 node1 genunix: [ID 313806 kern.notice] NOTICE: pm_tick delay of 4749 ms exceeds 2147 ms
Oct 12 05:25:09 node1 genunix: [ID 313806 kern.notice] NOTICE: pm_tick delay of 4750 ms exceeds 2147 ms
Oct 12 05:25:11 node1 cl_dlpitrans: [ID 624622 kern.notice] Notifying cluster that this node is panicking
Oct 12 05:25:11 node1 unix: [ID 836849 kern.notice]
Oct 12 05:25:11 node1 ^Mpanic[cpu6]/thread=fffffe80006bec60:
Oct 12 05:25:11 node1 genunix: [ID 898738 kern.notice] Aborting node because pm_tick delay of 6804 ms exceeds 5050 ms
Oct 12 05:25:11 node1 unix: [ID 100000 kern.notice]
Oct 12 05:25:11 node1 genunix: [ID 655072 kern.notice] fffffe80006be9c0 genunix:vcmn_err+13 ()
Oct 12 05:25:11 node1 genunix: [ID 655072 kern.notice] fffffe80006be9d0 cl_runtime:__1cZsc_syslog_msg_log_no_args6FpviipkcpnR__va_list_element__nZsc_syslog_msg_status_enum__+24 ()
Oct 12 05:25:11 node1 genunix: [ID 655072 kern.notice] fffffe80006beab0 cl_runtime:__1cCosNsc_syslog_msgDlog6MiipkcE_nZsc_syslog_msg_status_enum__+9d ()
Oct 12 05:25:12 node1 genunix: [ID 655072 kern.notice] fffffe80006beb40 cl_comm:__1cMpath_managerHpm_tick6Mn0APcyclic_caller_t__v_+17b ()
Oct 12 05:25:12 node1 genunix: [ID 655072 kern.notice] fffffe80006beb50 cl_comm:__1cbDpath_manager_cyclic_interface6FnMpath_managerPcyclic_caller_t__v_+14 ()
Oct 12 05:25:12 node1 genunix: [ID 655072 kern.notice] fffffe80006beb60 cl_comm:cluster_heartbeat+d ()
Oct 12 05:25:12 node1 genunix: [ID 655072 kern.notice] fffffe80006bebf0 genunix:cyclic_softint+ba ()
Oct 12 05:25:12 node1 genunix: [ID 655072 kern.notice] fffffe80006bec00 unix:cbe_softclock+17 ()
Oct 12 05:25:12 node1 genunix: [ID 655072 kern.notice] fffffe80006bec40 unix:av_dispatch_softvect+62 ()
Oct 12 05:25:12 node1 genunix: [ID 655072 kern.notice] fffffe80006bec50 unix:intr_thread+b4 ()
Oct 12 05:25:12 node1 unix: [ID 100000 kern.notice]
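
To check whether a node shows symptom A, the messages file can be scanned for the pm_tick lines shown above. The following is a minimal Python sketch, not an Oracle-supplied tool: the function names parse_pm_tick and summarize are hypothetical, and the regular expression is derived only from the message text in this excerpt. Lines such as "last message repeated 3 times" are not expanded, so the warning count is a lower bound.

```python
import re

# Pattern taken from the log excerpt above; it matches both the NOTICE
# warnings and the fatal "Aborting node" message.
PM_TICK_RE = re.compile(r"pm_tick delay of (\d+) ms exceeds (\d+) ms")


def parse_pm_tick(line):
    """Return (delay_ms, threshold_ms, is_abort), or None if the line does not match."""
    match = PM_TICK_RE.search(line)
    if match is None:
        return None
    delay_ms = int(match.group(1))
    threshold_ms = int(match.group(2))
    return delay_ms, threshold_ms, "Aborting node" in line


def summarize(lines):
    """Count pm_tick warnings and aborts and record the largest delay seen."""
    warnings = 0
    aborts = 0
    max_delay_ms = 0
    for line in lines:
        parsed = parse_pm_tick(line)
        if parsed is None:
            continue
        delay_ms, _threshold_ms, is_abort = parsed
        max_delay_ms = max(max_delay_ms, delay_ms)
        if is_abort:
            aborts += 1
        else:
            warnings += 1
    return {"warnings": warnings, "aborts": aborts, "max_delay_ms": max_delay_ms}
```

One possible use, assuming the syslog output is in /var/adm/messages, is to open that file and pass the file object to summarize(). A non-zero "aborts" value corresponds to the panic shown in the excerpt.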


B) Problem with Solaris Cluster Manager

Aug 25 21:58:11 node1 cacao_launcher[19819]: [ID 335192 daemon.error] Timeout occured on heartbeat channel, cleanup engaged
Aug 25 21:59:11 node1 cacao_launcher[19819]: [ID 920319 daemon.error] watchdog : warning, sub child (19820) still alive after sending SIGQUIT.
Aug 25 21:59:25 node1 cacao_launcher[19819]: [ID 219817 daemon.error] SUNWcacaort launcher : Common Agent Container exited abnormaly
Aug 25 21:59:25 node1 cacao_launcher[19819]: [ID 314456 daemon.error] SUNWcacaort launcher : no retries available
Aug 25 21:59:26 node1 cacao_v2: [ID 702911 daemon.error] Cannot connect to agent: Connection refused
Aug 26 00:45:31 node1 cacao_launcher[17272]: [ID 335192 daemon.error] Timeout occured on heartbeat channel, cleanup engaged
Aug 26 00:46:31 node1 cacao_launcher[17272]: [ID 920319 daemon.error] watchdog : warning, sub child (17273) still alive after sending SIGQUIT.
Aug 26 00:46:46 node1 cacao_launcher[17272]: [ID 219817 daemon.error] SUNWcacaort launcher : Common Agent Container exited abnormaly
Aug 26 00:46:46 node1 cacao_launcher[17272]: [ID 314456 daemon.error] SUNWcacaort launcher : no retries available
Aug 26 00:46:47 node1 cacao_v2: [ID 702911 daemon.error] Cannot connect to agent: Connection refused
Aug 26 07:17:26 node1 cacao_launcher[29830]: [ID 335192 daemon.error] Timeout occured on heartbeat channel, cleanup engaged
Aug 26 07:18:26 node1 cacao_launcher[29830]: [ID 920319 daemon.error] watchdog : warning, sub child (29831) still alive after sending SIGQUIT.
Aug 26 07:18:40 node1 cacao_launcher[29830]: [ID 219817 daemon.error] SUNWcacaort launcher : Common Agent Container exited abnormaly
Aug 26 07:18:40 node1 cacao_launcher[29830]: [ID 314456 daemon.error] SUNWcacaort launcher : no retries available
Aug 26 07:18:41 node1 cacao_v2: [ID 702911 daemon.error] Cannot connect to agent: Connection refused
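
Symptom B follows a repeating sequence per cacao_launcher process: heartbeat timeout, watchdog warning, abnormal exit, no retries. The sketch below groups those messages by launcher PID so that each failed instance can be seen at a glance. It is an illustration only: the function name cacao_failures and the stage labels are hypothetical, and the match strings are copied from the excerpt above, including the spellings "occured" and "abnormaly" as they appear in the log.

```python
import re
from collections import defaultdict

# Captures the launcher PID and the message text of a cacao_launcher error line.
LAUNCHER_RE = re.compile(r"cacao_launcher\[(\d+)\]: \[ID \d+ daemon\.error\] (.*)")

# Substrings copied verbatim from the log excerpt (original spelling kept).
# The stage labels on the right are hypothetical names chosen for this sketch.
STAGES = {
    "Timeout occured on heartbeat channel": "heartbeat_timeout",
    "still alive after sending SIGQUIT": "watchdog_warning",
    "Common Agent Container exited abnormaly": "abnormal_exit",
    "no retries available": "no_retries",
}


def cacao_failures(lines):
    """Map each cacao_launcher PID to the ordered list of failure stages it logged."""
    stages_by_pid = defaultdict(list)
    for line in lines:
        match = LAUNCHER_RE.search(line)
        if match is None:
            continue  # for example, the cacao_v2 "Connection refused" lines
        pid = int(match.group(1))
        message = match.group(2)
        for needle, stage in STAGES.items():
            if needle in message:
                stages_by_pid[pid].append(stage)
                break
    return dict(stages_by_pid)
```

Run against the excerpt above, each launcher PID that went through the full sequence maps to all four stages, which makes repeated occurrences across a day easy to count.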

Cause
