As files are added, removed, or modified in an E-MapReduce (EMR) cluster, data distribution across DataNodes can become uneven: some DataNodes fill up while others sit largely idle. The HDFS balancer moves data blocks between DataNodes to correct this imbalance, reducing the risk of data loss and improving storage efficiency.
How it works
The Hadoop Distributed File System (HDFS) uses a master-slave architecture. The NameNode manages file system metadata—file names, block locations, and block information—while DataNodes store the actual data blocks.
A cluster is balanced when the disk usage of every DataNode differs from the overall cluster disk usage by no more than the configured threshold. For example, with the default 10% threshold, if overall cluster disk usage is 40%, each DataNode must stay between 30% and 50%. This means the maximum spread between the most-used and least-used DataNode can be up to 20%.
Prerequisites
Before you begin, ensure that you have:
-
Access to the master node of your EMR cluster. For details, see Log on to a cluster.
-
The
hdfsuser account to run the balancer.
Check DataNode storage usage
Run the following command on the master node to see the capacity and storage usage of each DataNode:
hdfs dfsadmin -report
The output shows the total capacity, used storage, remaining storage, and disk usage percentage for each DataNode. If data distribution is extremely uneven—for example, when any DataNode's usage differs from the overall cluster average by more than the balance threshold (default: 10%)—start the HDFS balancer.
Run the HDFS balancer
Method 1: Run the hdfs balancer command
Run the following command to start the balancer with optional parameters:
hdfs balancer
[-threshold <threshold>]
[-policy <policy>]
[-exclude [-f <hosts-file> | <comma-separated list of hosts>]]
[-include [-f <hosts-file> | <comma-separated list of hosts>]]
[-source [-f <hosts-file> | <comma-separated list of hosts>]]
[-blockpools <comma-separated list of blockpool ids>]
[-idleiterations <idleiterations>]
| Parameter | Default | Description |
|---|---|---|
threshold |
10% | The maximum allowed difference between a DataNode's disk usage and the cluster's overall disk usage. For example, if overall cluster usage is 40%, each DataNode must stay between 30% and 50%. To achieve a finer balance, set a lower value such as 5%. |
policy |
datanode |
The balancing policy. datanode: the cluster is balanced when all DataNodes are balanced. blockpool: the cluster is balanced when all block pools in each DataNode are balanced. |
exclude |
— | DataNodes to exclude from balancing. |
include |
— | DataNodes to include in the balancing operation. |
source |
— | DataNodes to use as source nodes for block moves. |
blockpools |
— | Block pools in which to run the balancer. |
idleiterations |
5 | Maximum number of idle iterations before the balancer exits. |
Method 2: Use the start-balancer.sh script
You can also use the start-balancer.sh tool by running the hdfs daemon start balancer command. To use the start-balancer.sh tool, perform the following steps:
-
Log on to a node in the cluster. For details, see Log on to a cluster.
-
(Optional) Set the maximum bandwidth for the balancer to limit its impact on running workloads:
hdfs dfsadmin -setBalancerBandwidth <bandwidth in bytes per second>Set a lower value on heavily loaded clusters to protect ongoing jobs, and a higher value on idle clusters to finish balancing faster. Common values:
Target bandwidth Value (bytes/s) Calculation 20 MB/s (heavy load) 2097152020 × 1024 × 1024 200 MB/s 209715200200 × 1024 × 1024 1 GB/s (idle cluster) 10737418241 × 1024 × 1024 × 1024 -
Switch to the
hdfsuser and run the balancer:-
DataLake cluster:
su hdfs /opt/apps/HDFS/hdfs-current/sbin/start-balancer.sh -threshold 5 -
Hadoop cluster:
su hdfs /usr/lib/hadoop-current/sbin/start-balancer.sh -threshold 5
The
-threshold 5flag sets the balance threshold to 5%. Adjust this value based on your requirements. -
-
Monitor the balancer log to track progress:
-
DataLake cluster:
tail -f /var/log/emr/hadoop-hdfs/hadoop-hdfs-balancer-master-1-1.c-xxx.log -
Hadoop cluster:
tail -f /var/log/hadoop-hdfs/hadoop-hdfs-balancer-emr-header-1.cluster-xxx.log
When the log contains
Successfully, the balancer has completed. -
Tune the HDFS balancer
Run the balancer during off-peak hours to minimize the impact on production workloads. The default configuration works for most clusters. To adjust the parameters, go to the Configure tab on the HDFS service page in the EMR console and click hdfs-site.xml.
Client parameters
| Parameter | Default | Description |
|---|---|---|
dfs.balancer.dispatcherThreads |
200 | Number of dispatcher threads used to identify blocks to move. The balancer repeatedly retrieves block lists until enough data is scheduled for transfer. |
dfs.balancer.rpc.per.sec |
20 | Number of RPCs sent by dispatcher threads per second. Dispatcher threads send getBlocks() RPCs to the NameNode, which can increase NameNode load. On heavily loaded clusters, reduce this value by 5 or 10 to limit the impact on overall cluster performance. |
dfs.balancer.getBlocks.size |
2 GB | Total data size of the blocks fetched in each getBlocks() call. The NameNode holds a lock while processing each RPC, so a large value can slow down writes. Tune this based on NameNode load. |
dfs.balancer.moverThreads |
1000 | Total number of threads used to move blocks. Each block move uses one thread. |
DataNode parameters
| Parameter | Default | Description |
|---|---|---|
dfs.datanode.balance.bandwidthPerSec |
— | Per-DataNode bandwidth cap for balancing operations. Set to 100 MB/s as a starting point. Adjust this at runtime using hdfs dfsadmin -setBalancerBandwidth without restarting DataNodes. Increase the value when load is low; decrease it when load is high. |
dfs.datanode.balance.max.concurrent.moves |
5 | Maximum number of concurrent block moves on a single DataNode. Set this to 4 × number of disks as the upper limit. For example, for a DataNode with 28 disks, set the balancer-side value to 28 and the DataNode-side value to 112. Increase when load is low; decrease when load is high. |
FAQ
Why is the difference between DataNodes still around 20% after running the balancer with `threshold` set to 10%?
This is expected behavior. The threshold controls how far each DataNode's usage can deviate from the cluster average—not how far DataNodes can deviate from each other. With a 10% threshold and 40% overall cluster usage, each DataNode lands between 30% and 50%, which means the maximum spread between any two DataNodes is 20%. To reduce the spread, lower the threshold to 5%.