Solaris: ocssd becomes unresponsive when DTrace is pinned on a CPU
(Doc ID 2112932.1)
Last updated on AUGUST 04, 2018
Applies to: Oracle Database - Enterprise Edition - Version 188.8.131.52 and later
Oracle Solaris on SPARC (64-bit)
A node was evicted and ocssd was found to be unresponsive:
1. Network heartbeat (NHB) missing:
[OCSSD(7547)]CRS-1612: Network communication with node node2 (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.106 seconds
[OCSSD(7547)]CRS-1611: Network communication with node node2 (1) missing for 75% of timeout interval. Removal of this node from cluster in 7.102 seconds
[OCSSD(7547)]CRS-1610: Network communication with node node2 (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.099 seconds
[OCSSD(7547)]CRS-1632: Node node2 is being removed from the cluster in cluster incarnation 219844280
[OCSSD(7547)]CRS-1601: CSSD Reconfiguration complete. Active nodes are node1 node3 .
- The OCSSD log files for nodes asegp3099/asegp3100 show the missing NHB reported in the alert messages above.
2. OSW (OSWatcher) stops collecting data.
3. Reboot advisory message like:
[OHASD(4883)]CRS-8011: reboot advisory message from host: node2 component: cssagent, with time stamp: L-2015-11-13-02:35:40.625
[OHASD(4883)]CRS-8013: reboot advisory message text: Rebooting HUB node after limit 26801 exceeded; disk timeout 26801, network timeout 0, last heartbeat from CSSD at epoch seconds 1447400113.708, 26884 milliseconds ago based on invariant clock value of 1562985081
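The timing fields in the messages above can be pulled out with a short script when building a timeline of the eviction. The following is a minimal sketch in Python, not an Oracle tool: the function names and regular expressions are illustrative assumptions based only on the message text quoted in this note, and other Grid Infrastructure versions may word these messages differently.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical pattern for the CRS-1612/1611/1610 missed-heartbeat warnings,
# derived from the sample lines quoted in this note.
NHB_RE = re.compile(
    r"CRS-161[012]: Network communication with node (?P<node>\S+) "
    r"\(\d+\) missing for (?P<pct>\d+)% of timeout interval\.\s+"
    r"Removal of this node from cluster in (?P<left>[\d.]+) seconds"
)

# Hypothetical pattern for the heartbeat fields of the CRS-8013 advisory text.
ADVISORY_RE = re.compile(
    r"last heartbeat from CSSD at epoch seconds (?P<epoch>\d+)\.(?P<frac>\d+), "
    r"(?P<ago>\d+) milliseconds ago"
)


def parse_nhb(line):
    """Return node name, percent of timeout elapsed, and seconds left, or None."""
    m = NHB_RE.search(line)
    if m is None:
        return None
    return {
        "node": m["node"],
        "percent": int(m["pct"]),
        "seconds_left": float(m["left"]),
    }


def parse_advisory(line):
    """Return (last CSSD heartbeat, advisory time) as UTC datetimes, or None."""
    m = ADVISORY_RE.search(line)
    if m is None:
        return None
    # Integer arithmetic avoids float rounding of the fractional seconds.
    micros = int(m["frac"].ljust(6, "0")[:6])
    last = datetime.fromtimestamp(int(m["epoch"]), tz=timezone.utc)
    last += timedelta(microseconds=micros)
    written = last + timedelta(milliseconds=int(m["ago"]))
    return last, written
```

Applied to the CRS-8013 line above, the sketch places the last CSSD heartbeat at 2015-11-13 07:35:13.708 UTC and the advisory 26.884 seconds later, at 07:35:40.592 UTC. That is close to the advisory's own time stamp of 02:35:40.625 if the node's local time is five hours behind UTC, which is an assumption and is not stated in the note.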
Using the kernel dump, the Solaris team found that DTrace may have caused the hang. The stack below is from Solaris 10:
<leaf trap>int dtrace:dtrace_highbit+0x10((uint64_t)0x1b4b)
void dtrace:dtrace_dynvar_clean+0x20((dtrace_dstate_t *)0x301ef9326b0)
void dtrace:dtrace_state_clean+0x14((dtrace_state_t *)0x301ef932640)