Resource Manager Exposed to a Reduced Number of Cores and Memory When NMs are Unhealthy Due to DNs at Full Disk Capacity on BDA 4.1 Causing Failed Jobs (Doc ID 2033197.1)

Last updated on JULY 21, 2015

Applies to:

Big Data Appliance Integrated Software - Version 4.1.0 and later
Linux x86-64

Symptoms

The YARN service in a BDA cluster (in this case BDA V4.1/OL5) sees a reduced number of cores and a reduced amount of memory.  For example, YARN has access to 720 virtual cores and 2 TB of memory when it should have access to about 1200 virtual cores and 2.96 TB of RAM.  This leads to failing jobs, since jobs have access to only about 2/3 of the resources they should.
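A quick way to confirm the totals the ResourceManager is reporting is to query its REST API or the yarn CLI. This is a minimal sketch, assuming the ResourceManager runs on bdanode03.example.com with the default web port 8088 (as in the job output below):

# Total memory and virtual cores the ResourceManager currently sees
# (check totalMB and totalVirtualCores in the returned clusterMetrics)
curl -s http://bdanode03.example.com:8088/ws/v1/cluster/metrics

# List all NodeManagers with their state; UNHEALTHY or LOST nodes do not
# contribute their cores and memory to the totals above
yarn node -list -all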

For example, MapReduce jobs fail with:

15/07/16 08:02:35 INFO input.CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 54, size left: 525183244
15/07/16 08:02:35 WARN split.JobSplitWriter: Max block location exceeded for split: Paths:/<path>,<path> Locations:bdanode04.example.com:bdanode06.example.com:...:bdanode0n.example.com:; splitsize: 18 maxsize: 10
15/07/16 08:02:35 INFO mapreduce.JobSubmitter: number of splits:148
15/07/16 08:02:35 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_*_0177
15/07/16 08:02:35 INFO impl.YarnClientImpl: Submitted application application_*_0177
15/07/16 08:02:35 INFO mapreduce.Job: The url to track the job: http://bdanode03.example.com:8088/proxy/application_*_0177/
15/07/16 08:02:35 INFO mapreduce.Job: Running job: job_*_0177
15/07/16 08:02:51 INFO mapreduce.Job: Job job_*_0177 running in uber mode : false
15/07/16 08:02:51 INFO mapreduce.Job: map 0% reduce 0%
15/07/16 08:18:58 INFO mapreduce.Job: map 1% reduce 0%
15/07/16 08:19:20 INFO mapreduce.Job: map 2% reduce 0%
15/07/16 08:19:25 INFO mapreduce.Job: map 3% reduce 0%
15/07/16 08:19:36 INFO mapreduce.Job: map 4% reduce 0%
15/07/16 08:21:36 INFO mapreduce.Job: map 0% reduce 0%
15/07/16 08:21:39 INFO mapreduce.Job: map 100% reduce 0%
15/07/16 08:21:39 INFO mapreduce.Job: Job job_*_0177 failed with state KILLED due to: MAP capability required is more than the supported max container capability in the cluster. Killing the Job. mapResourceRequest: maxContainerCapability:
Job received Kill while in RUNNING state.
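
The "MAP capability required is more than the supported max container capability" message indicates that the memory requested for each map container is larger than the biggest container the ResourceManager will currently grant (reported as maxContainerCapability). As a hedged first check, the job's request can be compared against the scheduler maximum, assuming the client configuration lives under /etc/hadoop/conf:

# Memory requested per map container by the job
grep -A1 mapreduce.map.memory.mb /etc/hadoop/conf/mapred-site.xml

# Largest container the scheduler will hand out
grep -A1 yarn.scheduler.maximum-allocation-mb /etc/hadoop/conf/yarn-site.xml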


Other symptoms:

1. In Cloudera Manager (CM), many NodeManagers are in an unhealthy state.

2. Many DataNodes are nearing full disk capacity, with a handful of them having full data drives.  For example, the data disks (/u01 - /u12) are 100% full or close to it.

Drilling down, a major usage point is the set of files under /u01/hadoop/dfs/current/finalized/subdir*/subdir*, where /u*/hadoop/dfs is the DataNode Data Directory (dfs.data.dir, dfs.datanode.data.dir).
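
To see where the space is going on an affected DataNode, the data drives and the DataNode data directories can be checked directly. A minimal sketch, assuming the drives are mounted as /u01 - /u12 as above (du over a nearly full data directory can take some time):

# Per-drive usage on the DataNode
df -h /u01 /u02 /u03 /u04 /u05 /u06 /u07 /u08 /u09 /u10 /u11 /u12

# Space consumed by finalized HDFS block files on one drive
du -sh /u01/hadoop/dfs/current/finalized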

3. Running 'hdfs fsck /' shows the HDFS file system is healthy.

For example, for a few users the log usage under '/tmp/log/<userx>' is close to capacity.  These are the YARN job history logs.  Note that if the replication factor is the default of 3, then the true usage is three times higher.
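
The per-user usage under /tmp/log can be summarised from HDFS. A sketch, with <userx> left as a placeholder; the sizes reported by -du are logical sizes, so with replication factor 3 the raw disk consumed is roughly three times larger:

# Logical size of each user's aggregated YARN job logs
hdfs dfs -du -h /tmp/log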

4. A certain amount of space is allocated to /tmp; if /tmp exceeds its quota, jobs can no longer write there, causing them to fail.
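
Whether /tmp has run out of quota can be checked with the standard HDFS count command; the output columns are QUOTA, REMAINING_QUOTA, SPACE_QUOTA, REMAINING_SPACE_QUOTA, followed by directory/file counts and content size:

# Show the name and space quotas on /tmp and how much of each remains
hdfs dfs -count -q /tmp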

Cause
