A BDA DataNode Fails to Start - Can Not Bind to DataNode Port 1004 as it is Already in Use (Doc ID 2042065.1)

Last updated on FEBRUARY 16, 2016

Applies to:

Big Data Appliance Integrated Software - Version 4.2.0 to 4.4.0 [Release 4.2 to 4.4]
Linux x86-64

Symptoms

A DataNode (DN) cannot start, leaving HDFS in "bad" health.

/var/log/hadoop-hdfs/jsvc.err shows that the process cannot bind to DataNode port 1004 (0.0.0.0:1004) because the port is already in use. Note, however, that there is no other obvious logging on the system to indicate why the DN fails to come up.


CM reports a DataNode failing to start, bringing HDFS into "bad" health.

1. Initially no obvious logging is found to explain the problem.

a) No updates at all are being made to the DataNode log file at: /var/log/hadoop-hdfs/hadoop-cmf-hdfs-DATANODE-<FQDN>-log.out.

The latest entry in that log dates from the initial time the DataNode went down; nothing is written during subsequent restarts. The initial error shows a loss of connectivity between the DataNode and NameNode like:

java.io.EOFException: End of File Exception between local host is: "<FQDN-DN>/<IB IP>"; destination host is: "<FQDN-NN>":8022; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
at org.apache.hadoop.ipc.Client.call(Client.java:1472)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy16.sendHeartbeat(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:140)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:598)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:696)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:861)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1071)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:966)
2015-07-28 12:05:17,708 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: RECEIVED SIGNAL 15: SIGTERM
2015-07-28 12:05:17,710 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at <FQDN-DN>/<IB IP>
************************************************************/

b) The associated process logs under /var/run/cloudera-scm-agent/process/<latest>-hdfs-DATANODE are updated, but nothing there sheds light on the inability to start up.

c) The Cloudera Manager (CM) agent logs at /var/log/cloudera-scm-agent do not indicate a problem either.

d) The CM Host Inspector does not show a problem.
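The silence described in a) above can be confirmed quickly from the shell. This is a hypothetical sketch (the log path is taken from the symptom text, with the hostname filled in by `hostname -f`): if the log's last-modified time predates the recent restart attempts, the DataNode never got far enough to log anything.

```shell
# Hypothetical freshness check for the DataNode log named in a) above.
# If the last write predates the restart attempts, the DN is failing
# before it can open its log.
dn_log="/var/log/hadoop-hdfs/hadoop-cmf-hdfs-DATANODE-$(hostname -f)-log.out"
if [ -f "$dn_log" ]; then
  stat -c "last write to ${dn_log}: %y" "$dn_log"
else
  echo "log not found: ${dn_log}"
fi
```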


2. The output in /var/log/hadoop-hdfs/jsvc.err shows the DN cannot start because it cannot bind to port 1004, which is already in use:

Initializing secure datanode resources
Opened streaming server at /0.0.0.0:1004
java.net.BindException: Address already in use
    at sun.nio.ch.Net.bind0(Native Method)
    at sun.nio.ch.Net.bind(Net.java:437)
    at sun.nio.ch.Net.bind(Net.java:429)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
    at org.mortbay.jetty.nio.SelectChannelConnector.open(SelectChannelConnector.java:216)
    at org.apache.hadoop.hdfs.server.datanode.SecureDataNodeStarter.getSecureResources(SecureDataNodeStarter.java:131)
    at org.apache.hadoop.hdfs.server.datanode.SecureDataNodeStarter.init(SecureDataNodeStarter.java:73)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.commons.daemon.support.DaemonLoader.load(DaemonLoader.java:207)
Cannot load daemon
Service exit with a return value of 3

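Since jsvc is the launcher that writes jsvc.err for the secure DataNode (visible as SecureDataNodeStarter in the trace above), one hedged diagnostic step is to sweep for a leftover launcher from an earlier start attempt; a stale instance could transiently hold port 1004. This is a general sketch, not the documented cause; the bracketed grep patterns keep grep from matching its own command line.

```shell
# Hypothetical sweep for a leftover secure-DataNode launcher process.
# A stale jsvc from a previous start attempt could explain a bind
# failure even when netstat later shows the port free.
stale=$(ps -eo pid,args | grep -E '[j]svc|[S]ecureDataNodeStarter')
if [ -n "$stale" ]; then
  echo "possible stale launcher process(es):"
  echo "$stale"
else
  echo "no jsvc/SecureDataNodeStarter processes found"
fi
```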

3. After the failed start, port 1004 appears to be free:

a) "netstat -pan | grep 1004" does not return anything for port 1004.
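Note that a bare "grep 1004" also matches PIDs and longer port numbers such as 10040, so an empty result is not fully conclusive. As a cross-check, the sketch below (an assumption-laden illustration, not part of the original note) reads the kernel socket tables in /proc directly: port 1004 appears as hex 03EC in the local-address column, and connection state 0A means LISTEN.

```shell
# Cross-check the netstat result against /proc/net/tcp[6] directly.
# Port 1004 in hex is 03EC; awk field 2 is the local address and
# field 4 is the socket state (0A = LISTEN).
port=1004
hex=$(printf '%04X' "$port")
if cat /proc/net/tcp /proc/net/tcp6 2>/dev/null \
     | awk -v p=":${hex}" '$2 ~ p"$" && $4 == "0A"' | grep -q .; then
  echo "port ${port}: LISTEN socket still present"
else
  echo "port ${port}: free"
fi
```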

Cause
