Formula to Calculate Threshold Percentage Passed to Balancer for Efficiently Balancing the HDFS Cluster
(Doc ID 1588427.1)
Last updated on DECEMBER 08, 2017
Applies to: Big Data Appliance Integrated Software - Version 2.0.1 and later
HDFS data might not always be placed uniformly across the DataNodes. One common reason is the addition of new DataNodes to an existing cluster. While placing new blocks (data for a file is stored as a series of blocks), the NameNode considers various parameters before choosing the DataNodes to receive these blocks. Some of the considerations are:
- Policy to keep one of the replicas of a block on the same node as the node that is writing the block.
- Need to spread different replicas of a block across the racks so that the cluster can survive the loss of a whole rack.
- One of the replicas is usually placed on the same rack as the node writing to the file so that cross-rack network I/O is reduced.
- Spread HDFS data uniformly across the DataNodes in the cluster.
Due to multiple competing considerations, data might not be uniformly placed across the DataNodes. HDFS provides administrators with a tool called the balancer, which analyzes block placement and rebalances the data across the DataNodes.
This document explains the formula used by the Balancer to balance the data on the Hadoop Distributed File System (HDFS), which helps in choosing a threshold value that balances the HDFS cluster efficiently. The threshold parameter is a fraction, expressed as a percentage, in the range of (0%, 100%) with a default value of 10%.
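As an illustration of how the threshold is applied, the sketch below uses the criterion described in the Apache Hadoop Balancer documentation: a cluster is considered balanced when the utilization of every DataNode differs from the overall cluster utilization by no more than the threshold. The function name `classify_datanodes`, the node names, and the capacity figures are hypothetical and chosen only for this example; this is not the Balancer's actual code and may not match the full formula in this note.

```python
def classify_datanodes(nodes, threshold=10.0):
    """Classify DataNodes against a balancer-style threshold.

    nodes: dict mapping a node name to (dfs_used, capacity), same units.
    threshold: allowed deviation from cluster utilization, in percent.
    Returns (cluster_utilization_percent, {name: status}).
    All names here are illustrative assumptions, not Hadoop APIs.
    """
    total_used = sum(used for used, _ in nodes.values())
    total_capacity = sum(capacity for _, capacity in nodes.values())
    cluster_util = 100.0 * total_used / total_capacity

    status = {}
    for name, (used, capacity) in nodes.items():
        node_util = 100.0 * used / capacity
        if node_util > cluster_util + threshold:
            status[name] = "over-utilized"
        elif node_util < cluster_util - threshold:
            status[name] = "under-utilized"
        else:
            status[name] = "within threshold"
    return cluster_util, status


# Hypothetical cluster: two older nodes plus one newly added, empty node.
example_nodes = {
    "dn1": (80, 100),
    "dn2": (70, 100),
    "dn3": (0, 100),
}

if __name__ == "__main__":
    for t in (10.0, 30.0):
        util, result = classify_datanodes(example_nodes, threshold=t)
        print(f"threshold={t}% cluster={util}% {result}")
```

In this hypothetical cluster the overall utilization is 50%. With the default threshold of 10%, both older nodes count as over-utilized and the new node as under-utilized, so data would need to move. Raising the threshold to 30% leaves only the new node outside the band, which suggests why a larger threshold finishes sooner but leaves the cluster less evenly balanced.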