On Oracle Big Data Appliance with BDS enabled, Server Crash / Hang Noticed when Executing a Map Job (Doc ID 2014635.1)

Last updated on OCTOBER 11, 2016

Applies to:

Big Data Appliance Integrated Software - Version 4.1.0 and later
Linux x86-64

Symptoms

On Oracle Big Data Appliance (BDA), executing a MapReduce (MR) job throws the errors below. Oracle Big Data SQL (BDS) is enabled on the BDA, so cgroups are turned on. High CPU usage is also observed on the Resource Manager (RM) nodes, causing them to crash.

15/05/18 19:10:09 INFO mapreduce.Job: map 8% reduce 0%
15/05/18 19:26:09 INFO mapreduce.Job: Task Id : attempt_1431987421855_0002_m_000207_0, Status : FAILED
Error: java.io.IOException: Failing write. Tried pipeline recovery 5 times without success.
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:939)
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:496)
15/05/18 19:28:22 INFO mapreduce.Job: map 9% reduce 0%
15/05/18 19:29:32 INFO mapreduce.Job: Task Id : attempt_1431987421855_0002_m_000314_0, Status : FAILED
Error: java.io.IOException: Failing write. Tried pipeline recovery 5 times without success.
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:939)
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:496)
............................

15/05/18 19:44:12 INFO mapreduce.Job: Task Id : attempt_1431987421855_0002_m_000295_0, Status : FAILED
Error: java.io.IOException: Failing write. Tried pipeline recovery 5 times without success.
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:939)
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:496)
................


Increasing the following YARN service memory/heap parameters in Cloudera Manager resolved the high CPU usage:

mapreduce.map.java.opts.max.heap
mapreduce.map.memory.mb
yarn.nodemanager.resource.memory-mb

For details about YARN memory configuration settings, refer to https://support.oracle.com/epmos/main/hadoop/Adjusting-for-Mappers-and-Reducers-in-YARN
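
The three parameters above are related: the map JVM heap has to fit inside the map container, and the map container has to fit inside the memory the NodeManager offers to YARN. The following is a minimal sketch of a consistency check, not an Oracle or Cloudera tool. All numeric values, the function names, the physical RAM figure, and the 0.8 heap-to-container ratio are illustrative assumptions; substitute the values from your own Cloudera Manager configuration. It does not diagnose the instability described in this note.

```python
# Hypothetical values in MB; replace them with the values from your own cluster.
settings = {
    "mapreduce.map.java.opts.max.heap": 3276,
    "mapreduce.map.memory.mb": 4096,
    "yarn.nodemanager.resource.memory-mb": 65536,
}


def check_map_memory(settings, physical_ram_mb, heap_ratio=0.8):
    """Return a list of warnings for inconsistent map memory settings.

    heap_ratio is an assumed rule of thumb for how much of the container
    the JVM heap should occupy; it is not a value taken from this note.
    """
    heap = settings["mapreduce.map.java.opts.max.heap"]
    container = settings["mapreduce.map.memory.mb"]
    node = settings["yarn.nodemanager.resource.memory-mb"]

    warnings = []
    if heap >= container:
        warnings.append("map heap does not fit inside the map container")
    elif heap > heap_ratio * container:
        warnings.append("map heap leaves little room for non-heap JVM memory")
    if container > node:
        warnings.append("a single map container exceeds NodeManager memory")
    if node >= physical_ram_mb:
        warnings.append("NodeManager memory leaves no RAM for the OS and other services")
    return warnings


def max_map_containers(settings):
    """Upper bound on concurrent map containers per node, by memory alone."""
    return (settings["yarn.nodemanager.resource.memory-mb"]
            // settings["mapreduce.map.memory.mb"])


if __name__ == "__main__":
    # 131072 MB (128 GB) of physical RAM per node is an assumed figure.
    for warning in check_map_memory(settings, physical_ram_mb=131072):
        print("WARNING:", warning)
    print("Max map containers per node:", max_map_containers(settings))
```

Running a check like this before and after a change in Cloudera Manager makes it easier to see how many map containers a node may run concurrently and whether any memory is left for other services on the node.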

However, after these changes, the BDA cluster became unstable while executing the map job: a couple of nodes went into kernel panic, and on some nodes the Ethernet / admin network went down.

Cause
