If data is unevenly distributed among the disks on a node due to a large number of data writes and deletions, or due to disk replacement and expansion operations on the cluster, the concurrent read and write performance of Hadoop Distributed File System (HDFS) may decrease. You can use the HDFS Diskbalancer to evenly distribute data across all disks on a DataNode. This topic describes how to use the HDFS Diskbalancer and optimize its performance.
Background information
The HDFS Diskbalancer is a command-line tool provided by Hadoop 3.x. This tool can evenly distribute data across all disks on a DataNode. Unlike the HDFS Balancer, which balances data among the DataNodes of a cluster, the HDFS Diskbalancer operates on a specific DataNode and moves blocks from one disk to another.
The HDFS Diskbalancer works by creating a plan and then executing that plan on the specified DataNode. A plan is a set of statements that describe how much data should be moved between disks. A plan consists of multiple move steps, and each step specifies the source disk, the destination disk, and the number of bytes to move. The plan runs on the node that stores the data to be moved. The Diskbalancer limits the amount of data copied per second to prevent interference with other processes on the DataNode.
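The following commands show a minimal sketch of the typical workflow, assuming a hypothetical DataNode hostname dn-example; the plan file path is a placeholder that follows the pattern of the samples in this topic and is printed by the plan command. The individual subcommands and their options are described in the following sections.
# Generate a plan for a DataNode. The command prints the path of the generated plan file.
hdfs diskbalancer -plan dn-example
# Execute the plan file that the previous command generated.
hdfs diskbalancer -execute /system/diskbalancer/dn-example.plan.json
# Query the execution status on the DataNode.
hdfs diskbalancer -query dn-example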
Use the HDFS Diskbalancer
You can run the following command to view the help information of a Diskbalancer subcommand:
hdfs diskbalancer --help <command>
plan
Generate a plan for the specified DataNode.
hdfs diskbalancer -plan <hostname> [options]
Sample code:
hdfs diskbalancer -plan core-1-1.c-xxxxxxxxxxx
Parameter in command | Description |
-bandwidth <arg> | The maximum disk bandwidth that can be used by the Diskbalancer. Unit: MB/s. The value must be an integer. Example: 10 MB/s. |
-maxerror <arg> | The maximum number of errors that can be tolerated when data is replicated between a pair of disks. |
-out <arg> | The path to which the generated plan file is written. If this parameter is not specified, the default path is used. |
-thresholdPercentage <arg> | The percentage of data skew that is allowed before the Diskbalancer starts to work. For example, if the total amount of data on a two-disk node is 100 GB, the Diskbalancer calculates an expected amount of 50 GB for each disk. With a threshold of 10%, the amount of data on a disk must exceed 60 GB, which is the expected 50 GB plus 10% of the total amount of data (10 GB), before the Diskbalancer starts to work. |
-v | The verbose mode. This parameter forces the plan command to print a summary of the plan to the standard output. |
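For example, the following sketch combines several of the preceding options. The hostname core-1-1.c-xxxxxxxxxxx is taken from the sample above, and the option values are illustrative only.
# Generate a plan, limit the disk bandwidth to 20 MB/s, start balancing only if the skew
# exceeds 5%, and print a summary of the plan to the standard output.
hdfs diskbalancer -plan core-1-1.c-xxxxxxxxxxx -bandwidth 20 -thresholdPercentage 5 -v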
execute
Run the Diskbalancer by specifying a plan file that is generated by the plan command.
hdfs diskbalancer -execute <planfile>
Sample code:
hdfs diskbalancer -execute /system/diskbalancer/core-1-1.c-xxxxxxxxxxx.plan.json
Parameter in command | Description |
-skipDateCheck | Specifies whether to skip the date check and force the execution of the plan. |
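For example, if a plan was generated some time ago and would otherwise be rejected by the date check, you can force its execution as shown in the following sketch; the plan file path is taken from the sample above.
# Execute an older plan and skip the date check.
hdfs diskbalancer -execute /system/diskbalancer/core-1-1.c-xxxxxxxxxxx.plan.json -skipDateCheck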
query
Query the execution status of the Diskbalancer from the specified DataNode.
hdfs diskbalancer -query <hostname> [options]
Sample code:
hdfs diskbalancer -query core-1-1.c-xxxxxxxxxxx
Parameter in command | Description |
-v | Prints the details of the plan that is running on the node. |
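The following sketch polls the status until the plan completes. It assumes that the query output contains a status string such as PLAN_DONE when the plan finishes, and it uses the sample hostname from above; adjust the string to match the output that you actually see.
# Poll the Diskbalancer status every 60 seconds until the plan is reported as done.
while ! hdfs diskbalancer -query core-1-1.c-xxxxxxxxxxx -v | grep -q "PLAN_DONE"; do
  sleep 60
done
echo "Diskbalancer plan finished."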
cancel
Cancel the plan that is running. You can also restart the DataNode to cancel the plan.
hdfs diskbalancer -cancel <planFile> | -cancel <planID> -node <hostname>
Sample code:
hdfs diskbalancer -cancel /system/diskbalancer/nodename.plan.json
or
hdfs diskbalancer -cancel planID -node nodename
You can run the query command to obtain the value of the planID parameter from a DataNode.
Parameter in command | Description |
-node <arg> | Cancels the running plan by specifying its planID and the hostname of the DataNode. |
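If you no longer have the plan file path at hand, the following sketch shows one way to locate it under the default /system/diskbalancer path used in the samples above and then cancel the plan; the exact directory layout may differ depending on the -out setting that was used by the plan command.
# List plan files that were written to the default output path in HDFS.
hdfs dfs -ls -R /system/diskbalancer
# Cancel the plan by using its plan file path.
hdfs diskbalancer -cancel /system/diskbalancer/core-1-1.c-xxxxxxxxxxx.plan.json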
report
The report command provides a detailed report of the specified nodes or of the top nodes that would benefit most from running the Diskbalancer. You can specify nodes by using a hosts file or by separating multiple node names with commas (,).
hdfs diskbalancer -fs http://namenode.uri -report -node <file://> | [<DataNodeID|IP|Hostname>,...]
or
hdfs diskbalancer -fs http://namenode.uri -report -top topnum
Parameter in command | Description |
-node <arg> | The address of the DataNode. The value can be an ID, IP address, or hostname of a DataNode. |
-top <arg> | Specifies the number of nodes to be listed with unbalanced data. |
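For example, the following sketches list the five most unbalanced nodes and report on two specific nodes. The namenode URI is the placeholder from the syntax above, and the hostnames are illustrative placeholders that follow the sample naming in this topic.
# Report the top 5 nodes that would benefit most from disk balancing.
hdfs diskbalancer -fs http://namenode.uri -report -top 5
# Report on specific DataNodes, separated by commas.
hdfs diskbalancer -fs http://namenode.uri -report -node core-1-1.c-xxxxxxxxxxx,core-1-2.c-xxxxxxxxxxx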
Parameters used to optimize the performance of the HDFS Diskbalancer
If you need to modify the parameters of the HDFS Diskbalancer, you can add or adjust the following configurations in the HDFS section of the Services tab in the E-MapReduce (EMR) console:
Parameter | Description |
dfs.disk.balancer.enabled | Specifies whether to enable the HDFS Diskbalancer. Default value: true. |
dfs.disk.balancer.max.disk.throughputInMBperSec | The maximum disk bandwidth that can be used when a Diskbalancer is running. Unit: MB/s. Default value: 10. |
dfs.disk.balancer.max.disk.errors | The maximum number of errors that can be tolerated when data is moved between disks. If the number of errors exceeds this value, the balancing fails. Default value: 5. |
dfs.disk.balancer.plan.valid.interval | The maximum validity period of the Diskbalancer plan. Default value: 1. Unit: days. |
dfs.disk.balancer.block.tolerance.percent | The tolerance for the difference between the actual amount of data moved to a disk and the ideal amount of data on the disk during balancing. Valid values: 1 to 100. Default value: 10. For example, if the ideal amount of data on each disk is 100 GB and this parameter is set to 10, a disk is considered to have reached the expected value once the amount of data stored on it reaches 90 GB. |
dfs.disk.balancer.plan.threshold.percent | The threshold of data density difference between two disks that can be tolerated when balancing is running. Valid values: 1 to 100. Default value: 10. If the absolute value of the data density difference between any two disks exceeds the threshold, the disks need to be balanced. For example, if the total amount of data on a two-disk node is 100 GB, the Diskbalancer calculates an expected amount of 50 GB for each disk. With a threshold of 10%, the amount of data on a disk must exceed 60 GB, which is the expected 50 GB plus 10% of the total amount of data (10 GB), before the Diskbalancer starts to work. |
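After the modified configurations take effect, you can verify the values that the HDFS client on a node sees by running hdfs getconf, as shown in the following sketch; this reads the client-side configuration on the node where you run the command.
# Check the current values of the Diskbalancer parameters from the client-side configuration.
hdfs getconf -confKey dfs.disk.balancer.enabled
hdfs getconf -confKey dfs.disk.balancer.max.disk.throughputInMBperSec
hdfs getconf -confKey dfs.disk.balancer.plan.valid.interval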