A BDA DataNode Fails to Start - Can Not Bind to DataNode Port 1004 as it is Already in Use (Doc ID 2042065.1)

Last updated on SEPTEMBER 17, 2017

Applies to:

Big Data Appliance Integrated Software - Version 4.2.0 to 4.4.0 [Release 4.2 to 4.4]
Linux x86-64

Symptoms

A DataNode (DN) cannot start. This leaves HDFS in "bad" health.

/var/log/hadoop-hdfs/jsvc.err shows that the process cannot bind to DataNode port 1004 (0.0.0.0:1004) because the port is already in use. Note, however, that there is no other obvious logging on the system to indicate why the DN is failing to come up.

This can also happen with port 1006, which is also a DataNode port.

CM reports a DataNode failing to start, bringing HDFS into "bad" health.

1. Initially, no obvious logging is found that explains the problem.

a) No updates at all are being made to the DataNode log file at: /var/log/hadoop-hdfs/hadoop-cmf-hdfs-DATANODE-<FQDN>-log.out.

The latest entry in that log is from the time the DataNode initially went down; there is nothing from subsequent restarts. The initial error shows a lack of connectivity between the DataNode and the NameNode, for example:

java.io.EOFException: End of File Exception between local host is: "<FQDN-DN>/<IB IP>"; destination host is: "<FQDN-NN>":8022; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
at org.apache.hadoop.ipc.Client.call(Client.java:1472)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy16.sendHeartbeat(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:140)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:598)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:696)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:861)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1071)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:966)
2015-07-28 12:05:17,708 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: RECEIVED SIGNAL 15: SIGTERM
2015-07-28 12:05:17,710 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at <FQDN-DN>/<IB IP>
************************************************************/

b) The associated process logs under /var/run/cloudera-scm-agent/process/<latest>-hdfs-DATANODE are updated, but nothing in them sheds light on the inability to start up.

c) The Cloudera Manager (CM) agent logs at /var/log/cloudera-scm-agent do not indicate a problem either.

d) The CM Host Inspector does not show a problem.
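The EOFException in the DataNode log above indicates a dropped RPC connection to the NameNode on service port 8022. A generic way to confirm basic TCP reachability from the DataNode host is sketched below in plain Python; this is an illustration, not part of the BDA or CM tooling, and `can_reach` is a hypothetical helper name:

```python
import socket

def can_reach(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Substitute the NameNode FQDN from the exception message; 8022 is the
# service RPC port shown in the stack trace above.
# can_reach("<FQDN-NN>", 8022)
```

A successful TCP connect only rules out basic network-level problems; an RPC-level EOFException can still occur if the NameNode closes the connection after accepting it.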


2. The output in /var/log/hadoop-hdfs/jsvc.err shows the DN cannot start because it cannot bind to port 1004, which is already in use:

Initializing secure datanode resources
Opened streaming server at /0.0.0.0:1004
java.net.BindException: Address already in use
    at sun.nio.ch.Net.bind0(Native Method)
    at sun.nio.ch.Net.bind(Net.java:437)
    at sun.nio.ch.Net.bind(Net.java:429)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
    at org.mortbay.jetty.nio.SelectChannelConnector.open(SelectChannelConnector.java:216)
    at org.apache.hadoop.hdfs.server.datanode.SecureDataNodeStarter.getSecureResources(SecureDataNodeStarter.java:131)
    at org.apache.hadoop.hdfs.server.datanode.SecureDataNodeStarter.init(SecureDataNodeStarter.java:73)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.commons.daemon.support.DaemonLoader.load(DaemonLoader.java:207)
Cannot load daemon
Service exit with a return value of 3
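The BindException above is the JVM surfacing the kernel's EADDRINUSE error: some socket, visible or not, already holds 0.0.0.0:1004 at the moment the secure DataNode starter binds it. The mechanism can be reproduced with a minimal, generic Python sketch (not BDA-specific; `bind_listener` is a hypothetical helper):

```python
import errno
import socket

def bind_listener(port):
    """Bind a listening TCP socket on 0.0.0.0:port, as jsvc does for 1004."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("0.0.0.0", port))
    s.listen(1)
    return s

# Hold an arbitrary free port (0 lets the OS choose one), then bind it
# again: the second attempt fails the same way the DataNode start does.
holder = bind_listener(0)
port = holder.getsockname()[1]
try:
    bind_listener(port)
except OSError as e:
    print(e.errno == errno.EADDRINUSE)  # True
holder.close()
```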


3. After the failed start, port 1004 appears to be free:

a) "netstat -pan | grep 1004" returns nothing for port 1004.

 

The same can be the case with port 1006.
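Since netstat shows nothing, an independent cross-check is to test whether the port can actually be bound at that moment. The sketch below is a hypothetical helper in plain Python, not part of the BDA or CM tooling; note that binding 1004 itself requires root, since it is a privileged port:

```python
import socket

def port_bindable(port, host="0.0.0.0"):
    """Return True if a TCP listening socket can be bound on host:port now."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, port))
        s.listen(1)
        return True
    except OSError:
        return False
    finally:
        s.close()

# Run as root on the affected node:
# port_bindable(1004)
```

If the port is bindable here yet the DataNode still fails with BindException, the holder is only present around startup time, for example a racing process grabbing the port between the check and the DataNode start.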

Cause
