You can use a Hadoop Distributed File System (HDFS) balancer to analyze the distribution of data blocks and redistribute data that is stored on DataNodes. This topic describes how to use an HDFS balancer and optimize the performance of the HDFS balancer.

Prerequisites

Use the HDFS balancer

Run the following command to configure the HDFS balancer:
hdfs balancer
[-threshold <threshold>]
[-policy <policy>]
[-exclude [-f <hosts-file> | <comma-separated list of hosts>]]
[-include [-f <hosts-file> | <comma-separated list of hosts>]]
[-source [-f <hosts-file> | <comma-separated list of hosts>]]
[-blockpools <comma-separated list of blockpool ids>]
[-idleiterations <idleiterations>]
The following table describes the parameters of the HDFS balancer.
Parameter Description
threshold The threshold of the disk usage, in percentage.

Default value: 10%. This value ensures that the disk usage on each DataNode differs from the overall usage in the cluster by no more than 10%.

If the overall usage of the cluster is high, set this parameter to a smaller value.

If a large number of new nodes are added to the cluster, you can set this parameter to a larger value to move data from the high-usage nodes to the low-usage nodes.

policy The balancing policy. Valid values:
  • datanode: If data is evenly distributed to all DataNodes, your cluster is balanced. This is the default value.
  • blockpool: If all block pools in each DataNode are balanced, your cluster is balanced.
exclude Excludes specified DataNodes.
include Specifies DataNodes on which the balancing operation is performed.
source The DataNode that serves as the source node.
blockpools The block pools in which the HDFS balancer runs.
idleterations The maximum number of idle loops that are allowed. If this threshold is exceeded, the balancing operation is stopped. Default value: 5.

Example

An Alibaba Cloud EMR cluster is used.

  1. Log on to a node of your cluster.
    In this example, log on to the master node of your cluster. For more information, see Log on to a cluster.
  2. Switch to the hdfs user and start the HDFS balancer:
    /usr/lib/hadoop-current/sbin/start-balancer.sh -threshold 10
  3. Run the following command to go to the hadoop-hdfs directory:
    cd /var/log/hadoop-hdfs
  4. Run the ll command to view the logs of the HDFS balancer:
    The information similar to the output in the following figure is returned. balancer
  5. Run the following command to check the status of the HDFS balancer:
    tailf /var/log/hadoop-hdfs/hadoop-hdfs-balancer-emr-header-xx.cluster-xxx.log
    If the command output includes Successfully, the HDFS balancer is running.
    Note hadoop-hdfs-balancer-emr-header-xx.cluster-xxx.log is the name of the log file that is obtained in Step 4.

Parameters used to optimize the performance of the HDFS balancer

The HDFS balancer occupies system resources. Therefore, we recommend that you use the HDFS balancer during off-peak hours. By default, you do not need to adjust the settings of the Balancer parameters. You can adjust the configurations on clients and DataNodes based on your business requirements.
  • To adjust the configurations on clients, perform the following steps: Go to the Configure tab on the HDFS service page. Click the hdfs-site tab in the Service Configuration section, click Custom Configuration in the upper-right corner of the Service Configuration section, and then add the parameters that are described in the following table. Then, rerun the HDFS balancer for the modifications to take effect. For more information about how to add parameters, see Add parameters.
    Parameter Description
    dfs.balancer.dispatcherThreads The number of dispatcher threads used by the HDFS balancer to determine the blocks that need to be moved. Before the HDFS balancer moves a specific amount of data between two DataNodes, the HDFS balancer repeatedly retrieves block lists for moving blocks until the required amount of data is scheduled.
    Note Default value: 200.
    dfs.balancer.rpc.per.sec The number of remote procedure calls (RPCs) sent by dispatcher threads per second. Default value: 20.

    Before the HDFS balancer moves data between two DataNodes, the HDFS balancer uses dispatcher threads to repeatedly send the getBlocks() RPC to the NameNode. This results in a heavy load on the NameNode. To avoid this issue and balance the cluster load, we recommend that you set this parameter to limit the number of RPCs sent per second.

    For example, you can decrease the value of the parameter by 10 or 5 for a cluster with a high load to minimize the impact on the overall moving progress.

    dfs.balancer.getBlocks.size The total data size of the blocks moved each time. Before the HDFS balancer moves data between two DataNodes, the HDFS balancer repeatedly retrieves block lists for moving blocks until the required amount of data is scheduled. By default, the size of blocks in each block list is 2 GB. When the NameNode receives a getBlocks() RPC, the NameNode is locked. If an RPC queries a large number of blocks, the NameNode is locked for a long period of time. This slows down data writing. To avoid this issue, we recommend that you set this parameter based on the NameNode load.
    dfs.balancer.moverThreads Default value: 1000.

    Each block move requires a thread. This parameter limits the total number of concurrent moves.

  • To adjust the configurations on DataNodes, perform the following steps: Go to the Configure tab on the HDFS service page. Click the hdfs-site tab in the Service Configuration section, click Custom Configuration in the upper-right corner of the Service Configuration section, and then add the parameters that are described in the following table. Then, restart the specific DataNodes and rerun the HDFS balancer for the modifications to take effect. For more information about how to add parameters, see Add parameters.
    Parameter Description
    dfs.datanode.balance.bandwidthPerSec

    Specifies the bandwidth for each DataNode to balance the workloads of the cluster. We recommend that you set the bandwidth to 100 MB/s. You can also set the dfsadmin -setBalancerBandwidth parameter to adjust the bandwidth. You do not need to restart DataNodes.

    For example, you can increase the bandwidth when the cluster load is low and decrease the bandwidth when the cluster load is high.

    dfs.datanode.balance.max.concurrent.moves

    Specifies the maximum number of concurrent block moves that are allowed in a DataNode.

    Default value: 5. Set this parameter based on the number of disks. We recommend that you set this parameter to 4 × Number of disks as the upper limit for a DataNode.

    For example, if a DataNode has 28 disks, set this parameter to 28 on the HDFS balancer and 112 on the DataNode. Adjust the value based on the cluster load. Increase the value when the cluster load is low and decrease the value when the cluster load is high.