
After Node Reprovision/Other Server Resiliency Commands - HDFS and Zookeeper Services in "Bad" Health on BDA V4.0 (Doc ID 1954228.1)

Last updated on DECEMBER 10, 2019

Applies to:

Big Data Appliance Integrated Software - Version 4.0 and later
Linux x86-64

Symptoms

NOTE: In the examples that follow, user details, cluster names, hostnames, directory paths, filenames, etc. represent a fictitious sample (and are used to provide an illustrative example only). Any similarity to actual persons, or entities, living or dead, is purely coincidental and not intended in any manner.

  

After reprovisioning a node which previously had its services migrated off, for example Node 4, using:

# bdacli admin_cluster reprovision <HOSTNAME>
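For example, with a fictitious hostname in the style used throughout this note (substitute the actual name of the node being reprovisioned):

# bdacli admin_cluster reprovision bdanode04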

The following is observed:


1. In Cloudera Manager (CM) the hdfs service is in 'Bad' health.
The hdfs service reports:

On NameNode, bdanode01 (Standby) 1 failed status directories: /opt/hadoop/dfs/nn. Critical threshold: any.
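The reported directory can be inspected directly on the Standby NameNode with standard tools (a sketch only; the path is taken from the CM alert above):

# ls -ld /opt/hadoop/dfs/nn
# df -h /opt/hadoop/dfs/nn

With the filesystem full, writes to this directory fail, which CM surfaces as a failed status directory.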


2. The zookeeper service is in 'Bad' health. The Zookeeper Server Status (from CM > zookeeper) reports that one Zookeeper Follower is down, as shown below:

The Zookeeper Leader on Node 2 is up
The Zookeeper Follower on Node 3 is up
The Zookeeper Follower on Node 1 is down
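The same server states can be confirmed from the shell with ZooKeeper's four-letter 'stat' command (a sketch, assuming the default client port 2181 and that nc is available on the node):

# echo stat | nc bdanode01 2181 | grep Mode
# echo stat | nc bdanode02 2181 | grep Mode
# echo stat | nc bdanode03 2181 | grep Mode

A running server replies with 'Mode: leader' or 'Mode: follower'; the down server on Node 1 returns nothing because the process has exited.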

3. The Zookeeper log reports:

9:38:21.556 AM ERROR org.apache.zookeeper.server.SyncRequestProcessor
Severe unrecoverable error, exiting
java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:345)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at org.apache.zookeeper.server.persistence.FileTxnLog.commit(FileTxnLog.java:309)
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.commit(FileTxnSnapLog.java:355)
at org.apache.zookeeper.server.ZKDatabase.commit(ZKDatabase.java:491)
at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:164)
at org
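The "No space left on device" error points at the filesystem holding the ZooKeeper transaction log. Free space can be checked with df (a sketch; the assumption here, consistent with the symptom below, is that /tmp and the ZooKeeper data directory share the same filesystem):

# df -h /tmp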


4. Examining /tmp on Node 1 shows a large amount of storage being used to hold the <CLUSTER_NAME>-cluster-install-summary logs.  For example:

a) Changing directory to /tmp and examining disk usage shows the high consumption:
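A hedged sketch of commands that surface the largest consumers (assuming GNU sort with human-numeric support, standard on the Oracle Linux releases shipped with BDA):

# cd /tmp
# du -sh * | sort -rh | head

In this scenario the <CLUSTER_NAME>-cluster-install-summary logs appear at the top of such a listing.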

 

Cause

As the symptoms above show, the <CLUSTER_NAME>-cluster-install-summary logs accumulating under /tmp consume the remaining space on the node's filesystem. With no space left on the device, the ZooKeeper transaction log cannot be written and the NameNode storage directory is reported as failed.

