Unexpected OCFS2 Ring Fencing and Restarts For Compute Nodes In An Exalogic Virtual System When Deleting Projects/Filesystems or Receiving ZFS Replication Streams (Doc ID 1961508.1)

Last updated on NOVEMBER 19, 2016

Applies to:

Oracle Exalogic Elastic Cloud Software - Version 2.0.4.0.0 and later
Linux x86-64
Oracle Virtual Server x86-64

Symptoms

You have an Exalogic system running Exalogic Elastic Cloud Software (EECS) version 2.0.4.x or 2.0.6.x in a virtual deployment, and one or more compute nodes have restarted unexpectedly. At the time of the restarts, either the deletion of one or more Projects and/or Shares had been initiated on the rack's ZFS Appliance, or the ZFS Appliance was the target of replicated snapshot data. In such circumstances the ZFS Appliance may stop responding to NFS operations for long enough to cause the Oracle Clustered File System (OCFS2) software component to "ring fence" compute nodes off from the OCFS2 cluster by forcing them to reboot automatically.

To ensure the integrity of its cluster, each node within an OCFS2 cluster periodically updates its node-specific block in a clustered file that all nodes update to show they are still alive. In an Exalogic system the clustered file system is mapped onto an NFSv3 file system, so each compute node's update to the centralized "heartbeat" file translates to a write to an NFSv3-mounted file system. If the OCFS2 monitor process running on a compute node does not see its node's update being successfully written over NFS, it starts a timer to track the period since the last successful write. If the timer exceeds a configured threshold, which for Exalogic is 5 minutes (300000 milliseconds), the monitor process triggers the compute node to reboot itself in an attempt to recover from the NFS write problems so that it can successfully rejoin the OCFS2 cluster.
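
The timer logic described above can be sketched roughly as follows. This is an illustrative Python model only, not OCFS2 or o2hbmonitor code: the class name `HeartbeatMonitor` and its methods are hypothetical, and the only value taken from this document is the 300000 millisecond threshold.

```python
from dataclasses import dataclass

# Threshold cited in this document for Exalogic: 5 minutes.
FENCE_THRESHOLD_MS = 300_000


@dataclass
class HeartbeatMonitor:
    """Hypothetical model of the heartbeat timer; not actual OCFS2 code."""

    threshold_ms: int = FENCE_THRESHOLD_MS
    last_success_ms: int = 0  # time of the last successful heartbeat write

    def record_success(self, now_ms: int) -> None:
        """Note that a heartbeat write over NFS completed at now_ms."""
        self.last_success_ms = now_ms

    def elapsed_ms(self, now_ms: int) -> int:
        """Milliseconds since the last successful heartbeat write."""
        return now_ms - self.last_success_ms

    def should_fence(self, now_ms: int) -> bool:
        """True once the elapsed time exceeds the configured threshold."""
        return self.elapsed_ms(now_ms) > self.threshold_ms


# Simulated clock values in milliseconds (assumed for illustration).
monitor = HeartbeatMonitor()
monitor.record_success(now_ms=1_000)
print(monitor.should_fence(now_ms=120_000))  # NFS stalled for about 2 minutes
print(monitor.should_fence(now_ms=302_000))  # stalled for just over 5 minutes
```

In this model a single successful write resets the timer, so only a sustained NFS stall longer than the threshold leads to a reboot decision.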

The presence of one or more of the following symptoms can be used to confirm that an unexpected compute node restart in an Exalogic system was triggered by the OCFS2 "ring-fencing" mechanism:

  1. Sustained o2hbmonitor "ping" failures being logged to /var/log/messages that indicate the last successful heartbeat update occurred more than 300000 milliseconds ago

    • Log messages similar to the following are logged to /var/log/messages every two seconds once ongoing ping failures have lasted for more than 50% of the period after which ring-fencing occurs:
       

Cause
