Sun Fire [TM] SF3800/SF4800/SF4810/SF6800 - E4900/E6900 Server: Domains running firmware 5.15.x or later with hang-policy set to "notify" may lose critical troubleshooting data (Doc ID 1018813.1)

Last updated on FEBRUARY 02, 2017

Applies to:

Sun Fire 6800 Server - Version Not Applicable and later
Sun Fire E4900 Server - Version Not Applicable and later
Sun Fire 3800 Server - Version Not Applicable and later
Sun Fire 4800 Server - Version Not Applicable and later
Sun Fire 4810 Server - Version Not Applicable and later
All Platforms

Symptoms

Starting with firmware level 5.15.0, ScApp detects a hung domain and, depending
on the setting of the domain hang-policy variable, can attempt to reset it.
On systems initially installed with 5.15.0 or later, hang-policy defaults to
"reset", so the SC will attempt to reset a hung domain.

The hang-policy variable was also present in earlier firmware versions.
However, on systems that were initially installed with an earlier firmware
version, hang-policy is set to "notify" by default. When these systems are
upgraded to 5.15.0 or later, the current value of hang-policy and all other
existing domain and platform settings are left intact. This causes two issues.

First, the SC will not attempt to automatically reset a domain that has
hang-policy=notify, which negates the new automatic reset feature in ScApp.

Second, and possibly more importantly, with 5.15.x the SC logs a notice that
it has detected the hung domain, and it repeats this notice each time it polls
the domain to determine whether it is active. The SC writes the notice both to
the loghost server and to its internal log buffer, which the showlogs command
reads. This internal buffer is circular: each new entry displaces the oldest
entry still in the buffer. As a result, a domain hang with hang-policy set to
notify fills the circular buffer with repeated notices and eliminates any
useful data from "showlogs -d x" that would indicate the initial condition
that caused the hang. An example of these messages:

...
Aug 09 07:55:12 sunfire-sc0 Domain-C.SC: [ID 180731 local0.notice] Domain C is active again
Aug 09 07:55:12 sunfire-sc0 Domain-C.SC: [ID 690470 local0.error] Domain watchdog timer expired.
Aug 09 07:55:12 sunfire-sc0 Domain-C.SC: [ID 398807 local0.notice] hang-policy is NOTIFY. Not resetting domain.
Aug 09 07:55:13 sunfire-sc0 Domain-C.SC: [ID 180731 local0.notice] Domain C is active again
Aug 09 07:55:13 sunfire-sc0 Domain-C.SC: [ID 690470 local0.error] Domain watchdog timer expired.
Aug 09 07:55:13 sunfire-sc0 Domain-C.SC: [ID 398807 local0.notice] hang-policy is NOTIFY. Not resetting domain.
...
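
The effect of the circular buffer can be illustrated with a minimal Python sketch. It is only a model of the behavior described above, not the SC's actual implementation: the buffer capacity of 12 entries and the text of the "original fault" message are assumptions chosen for illustration, since the real buffer size is not stated here.

```python
from collections import deque


def simulate_log_buffer(capacity, initial_messages, polls):
    """Model a circular log buffer that receives three notices per poll.

    `capacity` is a hypothetical size; the real SC buffer size is not
    documented in this article.
    """
    # A deque with maxlen silently discards the oldest entry on append,
    # which matches the circular behavior described in the text.
    buffer = deque(maxlen=capacity)
    for message in initial_messages:
        buffer.append(message)
    for _ in range(polls):
        buffer.append("Domain C is active again")
        buffer.append("Domain watchdog timer expired.")
        buffer.append("hang-policy is NOTIFY. Not resetting domain.")
    return list(buffer)


# Hypothetical placeholder for whatever was logged before the hang.
root_cause = ["hypothetical message recording the original fault"]

after_few = simulate_log_buffer(capacity=12, initial_messages=root_cause, polls=3)
after_many = simulate_log_buffer(capacity=12, initial_messages=root_cause, polls=4)

print(root_cause[0] in after_few)   # True: 10 entries still fit in 12 slots
print(root_cause[0] in after_many)  # False: 13 entries, the oldest was displaced
```

With one poll per second, as in the sample messages, even a much larger buffer would be overwritten within minutes of the hang.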


If no working loghost is configured for the domain, the overwritten messages
are lost and the failure cannot be troubleshot.
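
Where a loghost is available, the repeated notices can bury the earlier messages in the loghost file as well. The following Python sketch shows one way to filter them out. It assumes that each syslog entry occupies a single line and that the three message IDs shown in the sample above (180731, 690470, 398807) identify the repeated notices; the function name and the first sample line are hypothetical and are not taken from a real system.

```python
import re

# Message IDs copied from the sample messages in this article.
HANG_NOTICE_IDS = {"180731", "690470", "398807"}

# Matches the "[ID nnnnnn facility.level]" tag of a syslog entry.
ID_PATTERN = re.compile(r"\[ID (\d+) [\w.]+\]")


def drop_hang_notices(lines):
    """Return the lines that are not one of the repeated hang notices."""
    kept = []
    for line in lines:
        match = ID_PATTERN.search(line)
        if match and match.group(1) in HANG_NOTICE_IDS:
            continue  # one of the per-poll notices, skip it
        kept.append(line)
    return kept


sample = [
    # Hypothetical placeholder entry standing in for an earlier message.
    "Aug 09 07:54:58 sunfire-sc0 Domain-C.SC: [ID 000000 local0.error] hypothetical earlier message",
    "Aug 09 07:55:12 sunfire-sc0 Domain-C.SC: [ID 180731 local0.notice] Domain C is active again",
    "Aug 09 07:55:12 sunfire-sc0 Domain-C.SC: [ID 690470 local0.error] Domain watchdog timer expired.",
    "Aug 09 07:55:12 sunfire-sc0 Domain-C.SC: [ID 398807 local0.notice] hang-policy is NOTIFY. Not resetting domain.",
]

for line in drop_hang_notices(sample):
    print(line)
```

Note that this filter also drops any legitimate "active again" or watchdog messages logged outside the hang, so the first occurrence of each notice in the unfiltered file is still worth checking for the time at which the hang began.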
