After Cloudera Manager dfs.namenode.acls.enabled Property Change Cluster Fails to Start, No DataNode Logs Generated Since: BPServiceActorActionException: Failed to report bad block on the BDA (Doc ID 2061212.1)

Last updated on MAY 30, 2017

Applies to:

Big Data Appliance Integrated Software - Version 4.2.0 and later
Linux x86-64

Symptoms

1. After updating the NameNode Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml (in Cloudera Manager (CM), navigate to Home > hdfs > Configuration and search for: NameNode Advanced Configuration Snippet) with:

<property>
  <name>dfs.namenode.acls.enabled</name>
  <value>true</value>
</property>


restarting the cluster fails; the cluster will not come back up.
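A minimal sketch for checking whether the ACL setting was picked up by a node's client configuration; 'hdfs getconf -confKey' is a standard HDFS command, and the guard below only echoes the command on hosts where the hdfs CLI is not available:

```shell
# Verify the effective value of dfs.namenode.acls.enabled on this node.
CONF_KEY="dfs.namenode.acls.enabled"
if command -v hdfs >/dev/null 2>&1; then
  hdfs getconf -confKey "$CONF_KEY"
else
  echo "hdfs CLI not found; on a cluster node run: hdfs getconf -confKey $CONF_KEY"
fi
```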

2. Navigating to CM > Running Commands, shows:

Command Progress
Completed 1 of 6 steps.

Execute command ZkStartPreservingDatastore on service zookeeper
Failed to execute command Start on service zookeeper
...


3. Reverting the above change still does not allow the cluster to come back up.  In particular, the DataNode on Node 3, the Cloudera Manager Node, will not start.

4. On Node 3, the Cloudera Manager Node, the DataNode log and JournalNode log show the errors below:

a) DataNode: /var/log/hadoop-hdfs/hadoop-cmf-hdfs-DATANODE-bdanode03.example.com.log.out:

Failed to report bad block BP-1798906644-10.192.2.21-1357858507180:blk_1082722151_1099532929917 to namenode:
bdanode01.example.com/*.*.*.*:8020
org.apache.hadoop.hdfs.server.datanode.BPServiceActorActionException: Failed to report bad block
BP-1798906644-*.*.*.*-1357858507180:blk_1082722151_1099532929917 to namenode:
at org.apache.hadoop.hdfs.server.datanode.ReportBadBlockAction.reportTo(ReportBadBlockAction.java:63)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processQueueMessages(BPServiceActor.java:1025)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:771)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:861)
at java.lang.Thread.run(Thread.java:745)
2:49:31.886 PM ERROR org.apache.hadoop.hdfs.server.datanode.DataNode RECEIVED SIGNAL 15: SIGTERM
2:49:31.891 PM INFO org.apache.hadoop.hdfs.server.datanode.DataNode SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at bdanode03.example.com/*.*.*.*
************************************************************/



b) JournalNode: /var/log/hadoop-hdfs/hadoop-cmf-hdfs-JOURNALNODE-bdanode01.example.com.log.out:

2015-09-28 14:47:58,939 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager:
Finalizing edits file /opt/hadoop/dfs/jn/<cluster-name>-journal/current/edits_inprogress_0000000000126045315 ->
/opt/hadoop/dfs/jn/<cluster-name>-journal/current/edits_0000000000126045315-0000000000126045576
2015-09-28 14:49:33,547 ERROR org.apache.hadoop.hdfs.qjournal.server.JournalNode: RECEIVED SIGNAL 15: SIGTERM
2015-09-28 14:49:33,549 INFO org.apache.hadoop.hdfs.qjournal.server.JournalNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down JournalNode at bdanode03.example.com/*.*.*.*
************************************************************/
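To check other nodes for the same signatures, the HDFS log directory can be searched directly; a minimal sketch, assuming the default BDA log location shown above:

```shell
# List HDFS log files on this node that contain the reporting failure or
# a SIGTERM shutdown like the excerpts above.
LOG_DIR="/var/log/hadoop-hdfs"
PATTERN="BPServiceActorActionException\|RECEIVED SIGNAL 15"
if [ -d "$LOG_DIR" ]; then
  grep -l "$PATTERN" "$LOG_DIR"/*.log.out 2>/dev/null
else
  echo "log directory $LOG_DIR not found on this host"
fi
```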



5. The DataNode for Node 3 is decommissioned, and 'hadoop fsck /' shows many associated missing/corrupt blocks.
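A minimal sketch for surveying the damage after a DataNode is lost; 'hdfs fsck / -list-corruptfileblocks' is a standard HDFS command and should be run as the hdfs superuser on a cluster node:

```shell
# List the files with corrupt or missing blocks across HDFS.
FSCK_CMD="hdfs fsck / -list-corruptfileblocks"
if command -v hdfs >/dev/null 2>&1; then
  sudo -u hdfs $FSCK_CMD
else
  echo "hdfs CLI not found; on a cluster node run: $FSCK_CMD"
fi
```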


6.  On Node 3 it is observed that no updates are being made to the DataNode log file.

a) The timestamp of the DataNode log file in /var/log/hadoop-hdfs shows no updates made for several hours.

b) Running "tail -f" on the DataNode log file in /var/log/hadoop-hdfs while restarting the DataNode in CM shows no new entries written to the log.
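The staleness check in 6a can be sketched as a small script that reports how many minutes ago the log file was last written; the hard-coded path below is the default BDA log location from this article, and the log file name (built from the local hostname) is an assumption that may need adjusting:

```shell
# Report the age, in minutes, of a DataNode log file.
log_age_minutes() {
  # stat -c %Y prints the file's last-modification time as a Unix timestamp
  now=$(date +%s)
  mtime=$(stat -c %Y "$1")
  echo $(( (now - mtime) / 60 ))
}

# Assumed default location; pass a different file as the first argument.
LOG_FILE="${1:-/var/log/hadoop-hdfs/hadoop-cmf-hdfs-DATANODE-$(hostname).log.out}"
if [ -f "$LOG_FILE" ]; then
  echo "$LOG_FILE last modified $(log_age_minutes "$LOG_FILE") minutes ago"
else
  echo "log file $LOG_FILE not found on this host"
fi
```

A log that has not been modified for hours while the DataNode is supposedly restarting is consistent with symptom 6a above.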

7. In CM, the status of the DataNode on Node 3 and on other nodes shows a long timeout when restarting.


Cause
