After Removing MIT Kerberos on BDA V4.5 with bdacli HDFS is in Bad Health, Failover Controllers are Down, and Both NameNodes in Standby (Doc ID 2197964.1)

Last updated on SEPTEMBER 19, 2017

Applies to:

Big Data Appliance Integrated Software - Version 4.5.0 and later
Linux x86-64

Symptoms

After removing MIT Kerberos with "bdacli disable kerberos" (following: Instructions to Disable Kerberos on Oracle Big Data Appliance with Mammoth V3.*/V4.* Releases (Doc ID 1919431.1)), the HDFS service is in bad health: neither Failover Controller will start, and both NameNodes are in standby. This issue is more likely to occur if the cluster has been expanded.

Running "bdacli disable kerberos" finishes successfully, but the cluster verification checks run afterwards with "./mammoth -c" show many failing tests.

1. The Failover Controller log, e.g. hadoop-cmf-hdfs-FAILOVERCONTROLLER-systscbd-bdanode0x.example.com.log.out, shows a FATAL error like:

2016-10-27 07:09:05,997 FATAL org.apache.hadoop.hdfs.tools.DFSZKFailoverController: Got a fatal error, exiting now
java.lang.RuntimeException: ZK Failover Controller failed: Received create error from Zookeeper. code:NOAUTH for path /hadoop-ha/<cluster_name>-ns/ActiveStandbyElectorLock
at org.apache.hadoop.ha.ZKFailoverController.mainLoop(ZKFailoverController.java:366)
at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:237)
at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:60)
at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:171)
at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:167)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:412)
at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:167)
at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:190)
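
To confirm that a given Failover Controller log carries this signature, the log text can be searched for the NOAUTH message. The following is a minimal sketch, not an Oracle-supplied tool: the function name find_noauth_paths and the regular expression are assumptions modeled on the excerpt above, and the exact message wording may differ between releases.

```python
import re

# Pattern modeled on the log excerpt above (assumption: the wording
# "code:NOAUTH for path <znode>" is stable across releases).
NOAUTH_PATTERN = re.compile(r"code:NOAUTH for path (\S+)")


def find_noauth_paths(log_text):
    """Return the ZooKeeper znode paths named in NOAUTH errors in log_text."""
    return NOAUTH_PATTERN.findall(log_text)


# Example using the message from the excerpt, with a hypothetical
# cluster name "mycluster" substituted for <cluster_name>.
sample = (
    "java.lang.RuntimeException: ZK Failover Controller failed: "
    "Received create error from Zookeeper. code:NOAUTH for path "
    "/hadoop-ha/mycluster-ns/ActiveStandbyElectorLock"
)
for znode in find_noauth_paths(sample):
    print("NOAUTH reported for znode:", znode)
```

In practice the text passed to find_noauth_paths would be the contents of the Failover Controller log read from disk on each NameNode host.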

2. In Cloudera Manager (CM), forcing a failover to bring one of the NameNodes into active mode fails. The role log and stderr report:

a) From the role log there is an error like:

unable to failover from namenode<x> to namenode<y> of nameservice <cluster_name>-ns; see stderr log.

b) From stderr there is an error like:

+ acquire_kerberos_tgt hdfs.keytab
+ '[' -z hdfs.keytab ']'
+ '[' -n '' ']'
+ '[' validate-writable-empty-dirs = failover ']'
+ '[' file-operation = failover ']'
+ '[' bootstrap = failover ']'
+ '[' failover = failover ']'
+ ACTIVE='Failover failed: Can'\''t failover to an active service'
+ NS=<cluster_name>-ns
+ FROM_NN=namenode<x>
+ TO_NN=namenode<y>
+ FORCE=true

The reference to hdfs.keytab indicates that Kerberos is not fully cleaned up.
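
The same check can be applied to a saved copy of the stderr output. The sketch below is illustrative only: the function name keytab_references is hypothetical, and the two substrings it looks for are taken from the stderr trace above rather than from any documented Cloudera Manager interface.

```python
def keytab_references(stderr_text):
    """Return trace lines that mention a keytab or the acquire_kerberos_tgt step."""
    hits = []
    for line in stderr_text.splitlines():
        # Both markers come from the stderr excerpt above (assumption:
        # they appear only when Kerberos settings are still in effect).
        if "acquire_kerberos_tgt" in line or ".keytab" in line:
            hits.append(line.strip())
    return hits


# Example using the first lines of the stderr trace above.
trace = """+ acquire_kerberos_tgt hdfs.keytab
+ '[' -z hdfs.keytab ']'
+ '[' -n '' ']'
+ '[' failover = failover ']'
"""
for hit in keytab_references(trace):
    print("Kerberos residue:", hit)
```

A non-empty result suggests the role is still being launched with Kerberos-related settings, which is consistent with the symptom described here.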

Cause
