Use the HDFS balancer and configure the tuning parameters of the balancer - E-MapReduce

You can use the Hadoop Distributed File System (HDFS) balancer to analyze the distribution of data blocks and redistribute data that is stored in DataNodes. This topic describes how to use the HDFS balancer and configure the tuning parameters of the balancer.

Background information

HDFS adopts a master-slave architecture. The NameNode manages the metadata of the file system, such as file names, block information, and file locations. Actual data blocks are stored in multiple DataNodes. The master-slave architecture allows data to be stored in different locations, which improves the fault tolerance of the file system.

As files are added, removed, or modified, data distribution among DataNodes may be uneven. The storage space of specific DataNodes is nearly full, and the storage space of other DataNodes remains idle. Uneven data distribution affects the storage efficiency of the file system and increases the risk of data loss. This is because DataNodes that store a large amount of data are more susceptible to hardware failures.

To resolve the preceding issues, HDFS provides the balancer tool, which is a command line tool used to rebalance data distribution among DataNodes. The HDFS balancer moves data blocks between DataNodes to rebalance data distribution. This ensures that the storage resources of a cluster can be used in an efficient manner.

View the capacity and the storage space usage of DataNodes

You can view the capacity and the storage space usage of DataNodes to learn the allocation of storage resources, identify and resolve issues about insufficient storage at the earliest opportunity, and make sure that data is evenly distributed among nodes. This improves the overall performance and stability of the system.

Log on to the master node of the cluster that you want to manage. For more information, see Log on to a cluster.
Run the following command to view the capacity and the storage space usage of each DataNode:

hdfs dfsadmin -report

The result describes information, such as the total capacity, used storage space, storage space usage, and remaining storage space, of each DataNode. This helps you identify storage imbalance issues.

If data distribution is extremely uneven, you can start the HDFS balancer. For example, the storage space usage of specific DataNodes is much higher than that of other nodes, and the difference exceeds the default or specified balance threshold. In most cases, the threshold is 10%.

Use the HDFS balancer

Method 1: Run the hdfs balancer command

Run the following command to configure the HDFS balancer:

hdfs balancer
[-threshold <threshold>]
[-policy <policy>]
[-exclude [-f <hosts-file> | <comma-separated list of hosts>]]
[-include [-f <hosts-file> | <comma-separated list of hosts>]]
[-source [-f <hosts-file> | <comma-separated list of hosts>]]
[-blockpools <comma-separated list of blockpool ids>]
[-idleiterations <idleiterations>]

The following table describes the parameters of the HDFS balancer.

Parameter	Description
threshold	The threshold of the disk usage difference, in percentage. Default value: 10%. This value ensures that the disk usage on each DataNode differs from the overall disk usage of the cluster by no more than 10%. If the overall disk usage of the cluster is high, set this parameter to a smaller value. If you added a large number of nodes to the cluster, you can set this parameter to a larger value to move data from high-usage nodes to low-usage nodes.
policy	The balancing policy. Valid values: datanode: If data is evenly distributed to all DataNodes, your cluster is balanced. This is the default value. blockpool: If all block pools in each DataNode are balanced, your cluster is balanced.
exclude	Excludes specific DataNodes.
include	Specifies the DataNodes on which you want to perform the balancing operation.
source	The DataNode that serves as the source node.
blockpools	The block pools in which you want to run the HDFS balancer.
idleiterations	The maximum number of idle loops that are allowed. Default value: 5.

Method 2: Use the start-balancer.sh tool

You can use the start-balancer.sh tool by running the hdfs daemon start balancer command. To use the start-balancer.sh tool, perform the following operations:

Log on to a node of the cluster that you want to manage. For more information, see Log on to a cluster.
Optional. Run the following command to modify the maximum bandwidth of the HDFS balancer:
```
hdfs dfsadmin -setBalancerBandwidth <bandwidth in bytes per second>
```
Note
<bandwidth in bytes per second> specifies the maximum bandwidth per second. For example, if you want to configure a maximum bandwidth of 200 MB/s, set <bandwidth in bytes per second> to 209715200, in bytes, which is calculated based on the following formula: 200 × 1024 × 1024. The complete command is hdfs dfsadmin -setBalancerBandwidth 209715200. To make full use of network resources and ensure the continuity of core business, we recommend that you specify a small value for the maximum bandwidth if the cluster is heavily loaded. For example, you can set the value to 20971520, which indicates 20 MB/s. To accelerate the data balancing process, we recommend that you specify a large value for the maximum bandwidth if the cluster is idle. For example, you can set the value to 1073741824, which indicates 1 GB/s.
Run the following commands to switch to the hdfs user and run the HDFS balancer:
- DataLake cluster
```
su hdfs
/opt/apps/HDFS/hdfs-current/sbin/start-balancer.sh -threshold 5
```
- Hadoop cluster
```
su hdfs
/usr/lib/hadoop-current/sbin/start-balancer.sh -threshold 5
```
  Note
  The -threshold parameter specifies the threshold for data balancing. If you set the threshold to 5%, the HDFS balancer considers that data is evenly distributed and no longer moves data blocks from or to the DataNode when the difference between the data storage capacity of a DataNode and the average storage capacity of the cluster is less than or equal to 5%. You can configure this parameter based on your business requirements to achieve the expected balance effect.
Run the following command to check the status of the HDFS balancer:
- DataLake cluster
```
tail -f /var/log/emr/hadoop-hdfs/hadoop-hdfs-balancer-master-1-1.c-xxx.log
```
- Hadoop cluster
```
tail -f /var/log/hadoop-hdfs/hadoop-hdfs-balancer-emr-header-1.cluster-xxx.log
```
  Note
  hadoop-hdfs-balancer-master-1-1.c-xxx.log and hadoop-hdfs-balancer-emr-header-xx.cluster-xxx.log in the command are the log names obtained in the previous step.
If the command output includes Successfully, the HDFS balancer is run as expected.

Tuning parameters of the HDFS balancer

The HDFS balancer consumes system resources. We recommend that you use the HDFS balancer during off-peak hours. By default, you do not need to modify the parameters of the HDFS balancer. If you want to modify the parameters of the HDFS balancer, go to the Configure tab of the HDFS service page in the E-MapReduce (EMR) console and click hdfs-site.xml. On the hdfs-site.xml tab, adjust the configurations of the client and DataNodes based on your business requirements.

Client configurations

Parameter	Description
dfs.balancer.dispatcherThreads	The number of dispatcher threads used by the HDFS balancer to determine the blocks that need to be moved. Before the HDFS balancer moves a specific amount of data between two DataNodes, the balancer repeatedly retrieves block lists for moving blocks until the required amount of data is scheduled. Note The default value is 200.
dfs.balancer.rpc.per.sec	The number of remote procedure calls (RPCs) sent by dispatcher threads per second. Default value: 20. Before the HDFS balancer moves data between two DataNodes, the balancer uses dispatcher threads to repeatedly send the getBlocks() RPC to the NameNode. This results in a heavy load on the NameNode. To prevent this issue and balance the cluster load, we recommend that you configure this parameter to limit the number of RPCs sent per second. For example, you can decrease the value of the parameter by 10 or 5 for a cluster that has a high load to minimize the impact on the overall moving progress.
dfs.balancer.getBlocks.size	The total data size of the blocks moved each time. Before the HDFS balancer moves data between two DataNodes, the balancer repeatedly retrieves block lists for moving blocks until the required amount of data is scheduled. By default, the size of blocks in each block list is 2 GB. When the NameNode receives a getBlocks() RPC, the NameNode is locked. If an RPC queries a large number of blocks, the NameNode is locked for a long period of time. This slows down data writing. To prevent this issue, we recommend that you configure this parameter based on the NameNode load.
dfs.balancer.moverThreads	The total number of threads that are used to move blocks. Each block move requires a thread. Default value: 1000.

DataNode configurations

Parameter

Description

dfs.datanode.balance.bandwidthPerSec

The bandwidth of each DataNode that is used to balance the workloads of the cluster. We recommend that you set the bandwidth to 100 MB/s. You can also configure the dfsadmin -setBalancerBandwidth parameter to adjust the bandwidth. You do not need to restart DataNodes.

For example, you can increase the bandwidth when the cluster load is low and decrease the bandwidth when the cluster load is high.

dfs.datanode.balance.max.concurrent.moves

The maximum number of concurrent block moves that are allowed in a DataNode. Default value: 5. You can configure this parameter based on the number of disks. We recommend that you set this parameter to 4 × Number of disks as the upper limit for a DataNode.

For example, if a DataNode has 28 disks, set this parameter to 28 on the HDFS balancer and 112 on the DataNode. You can adjust the value based on the cluster load. You can increase the value when the cluster load is low and decrease the value when the cluster load is high.

FAQ

Q: Why is the difference approximately 20% after the balancing operation is performed even if the threshold parameter is set to 10%?

A: The threshold parameter is used to prevent the usage of each DataNode from becoming much higher or lower than the average usage of the cluster. As a result, the difference between the DataNodes that have the highest and the lowest usage may be 20% after the balancing operation is performed. To reduce the difference, you can set the threshold parameter to 5%.