This topic provides suggestions on how to use Hadoop Distributed File System (HDFS) to perform scenario-specific configurations in E-MapReduce (EMR). The suggestions can help improve the performance and stability of HDFS.


Configure a recycle bin mechanism

  • Background information: In HDFS, deleted files are moved to a recycle bin so that data can be recovered if it is deleted by mistake.

    You can set a time threshold for files to be retained in the recycle bin. When the storage duration of a file in the recycle bin exceeds the threshold, the system automatically deletes the file from the recycle bin and the file cannot be recovered. You can also manually delete files in the recycle bin.

  • Suggestion: Go to the Configure tab on the HDFS service page in the EMR console, search for the fs.trash.interval parameter in the Configuration Filter section, and specify the parameter. The following table describes the parameter.

    Parameter: fs.trash.interval
    Description: The period for which data is retained in the recycle bin, in minutes. If the storage duration of data in the recycle bin exceeds this period, the system deletes the data and the data cannot be recovered. If you set this parameter to 0, the recycle bin mechanism is disabled.
    Default value: 1440

    Note We recommend that you use the default value of this parameter so that you can recover files that are deleted by mistake. Do not set this parameter to an excessively large value. Otherwise, files in the recycle bin may occupy excessive storage space in your cluster.
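If you manage Hadoop configuration files directly instead of using the EMR console, the same setting corresponds to the fs.trash.interval property in core-site.xml. A minimal fragment, using the default retention period:

```xml
<!-- core-site.xml: keep deleted files in the recycle bin for 1,440 minutes (1 day) -->
<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
</property>
```

To empty the recycle bin ahead of schedule, you can run hdfs dfs -expunge, which permanently removes trash checkpoints that are older than the configured threshold.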

Control the number of small files

  • Background information: The HDFS NameNode loads all file metadata into memory. If the disk capacity of your cluster is fixed and a large number of small files are generated, the memory capacity of the NameNode becomes a bottleneck.
  • Suggestion: Control the number of small files. We recommend that you merge existing small files into large files.
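To see why small files matter, you can estimate NameNode heap usage from the file count. The calculation below uses a commonly cited rule of thumb (roughly 150 bytes of heap per metadata object); treat it as an order-of-magnitude sketch, not an exact figure:

```shell
# Rule-of-thumb estimate (assumption): the NameNode uses roughly 150 bytes
# of heap per metadata object, and each file contributes at least two
# objects: one inode and one block.
NUM_FILES=100000000          # 100 million small files, one block each
BYTES_PER_OBJECT=150
HEAP_MB=$(( NUM_FILES * 2 * BYTES_PER_OBJECT / 1024 / 1024 ))
echo "Estimated NameNode heap: ${HEAP_MB} MB"
```

For 100 million small files, the estimate comes to roughly 28 GB of heap for metadata alone. One common way to merge existing small files is the Hadoop archive tool, for example hadoop archive -archiveName data.har -p /path/to/input /path/to/output.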

Set the number of files in a directory of HDFS

  • Background information: When your cluster is running, multiple clients or components, such as Spark and YARN, may continuously write files to the same directory in HDFS. A limit is imposed on the number of files that can be written to a directory in HDFS. Therefore, you must plan data storage in advance to prevent the number of files in a directory from exceeding the threshold. An excessive number of files may cause job failures.
  • Suggestion: Go to the Configure tab on the HDFS service page in the EMR console. Click the hdfs-site tab in the Service Configuration section. Click Custom Configuration, add the dfs.namenode.fs-limits.max-directory-items parameter to specify the maximum number of files that can be stored in a directory, and then click Save to save the configuration. For more information about how to add a parameter, see Add parameters.
    Note You must plan data storage in advance. You can classify files by time or business type. We recommend that you set the maximum number of files that can be written to a directory to 1,000,000.
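If you edit configuration files directly rather than adding the parameter in the console, the setting corresponds to the following property in hdfs-site.xml; the value shown is the recommended limit from this topic:

```xml
<!-- hdfs-site.xml: cap the number of items that a single directory can hold -->
<property>
  <name>dfs.namenode.fs-limits.max-directory-items</name>
  <value>1000000</value>
</property>
```

Partitioning output paths by date or business type, for example /logs/2024-01-01/, keeps each directory well below this limit.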

Reduce the probability that an exception occurs on a running client in case of network instability

  • Background information: In case of network instability, the probability that an exception occurs on a running client increases.
  • Suggestion: Set the ipc.client.connect.max.retries.on.timeouts and ipc.client.connect.timeout parameters to large values to increase the maximum number of connection retries and extend the timeout period for establishing a connection. This helps reduce the probability that an exception occurs on a running client. To specify these two parameters, perform the following steps: Go to the Configure tab on the HDFS service page in the EMR console. Search for the ipc.client.connect.max.retries.on.timeouts and ipc.client.connect.timeout parameters in the Configuration Filter section and increase the values of these two parameters.
    Parameter: ipc.client.connect.max.retries.on.timeouts
    Description: The maximum number of retries for the client to establish a socket connection with the server.
    Default value: 45

    Parameter: ipc.client.connect.timeout
    Description: The timeout period for the client to establish a socket connection with the server. Unit: milliseconds.
    Default value: 20000
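In configuration-file terms, these settings live in core-site.xml. The values below are illustrative examples of "larger than the defaults" (double the defaults of 45 retries and 20,000 ms), not prescribed values:

```xml
<!-- core-site.xml: illustrative values raised above the defaults -->
<property>
  <name>ipc.client.connect.max.retries.on.timeouts</name>
  <value>90</value>
</property>
<property>
  <name>ipc.client.connect.timeout</name>
  <value>40000</value>
</property>
```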

Specify the maximum number of volumes that are allowed to be damaged

  • Background information: By default, if multiple data storage volumes are configured for a DataNode and one of the volumes is damaged, the DataNode will no longer provide services.
  • Suggestion: Specify the maximum number of volumes that are allowed to be damaged. If the number of damaged volumes on a DataNode does not exceed the upper limit, the DataNode can still provide services. To specify the maximum number of damaged volumes that are allowed, perform the following steps: Go to the Configure tab on the HDFS service page in the EMR console, search for the dfs.datanode.failed.volumes.tolerated parameter in the Configuration Filter section, and change the value of this parameter.
    Parameter: dfs.datanode.failed.volumes.tolerated
    Description: The maximum number of data storage volumes that are allowed to be damaged before a DataNode stops providing services. By default, at least one available volume must be configured.
    Default value: 0

    Note If a DataNode has a defective disk and no other nodes can take over the services on the DataNode, you can temporarily increase the value of this parameter so that the DataNode can continue to provide services.
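For direct file-based configuration, the parameter belongs in hdfs-site.xml. The value 1 below is an example for a DataNode with several data disks, allowing one disk to fail without taking the node out of service:

```xml
<!-- hdfs-site.xml: tolerate one failed data volume per DataNode (example value) -->
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>1</value>
</property>
```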

Prevent directories from being deleted by mistake

  • Background information: In HDFS, some directories can be configured as protected directories to prevent these directories from being deleted by mistake. However, you can still move these directories to the recycle bin.
  • Suggestion: Go to the Configure tab on the HDFS service page in the EMR console. Click the core-site tab in the Service Configuration section. Click Custom Configuration, add the fs.protected.directories parameter, and then save the configuration. The parameter value is the directory that you want to configure as a protected directory. If you want to set this parameter to multiple directories, separate the directories with commas (,).
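The equivalent file-based setting is the fs.protected.directories property in core-site.xml. The directory paths below are placeholders; substitute the directories that you want to protect:

```xml
<!-- core-site.xml: comma-separated list of directories that cannot be deleted
     while non-empty (example paths) -->
<property>
  <name>fs.protected.directories</name>
  <value>/user/warehouse,/apps/hive</value>
</property>
```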

Use a balancer to balance capacity

  • Background information: HDFS in an EMR cluster may experience unbalanced disk usage among DataNodes in scenarios such as the addition of DataNodes. If data distribution in HDFS is unbalanced, individual DataNodes may be overloaded.
  • Suggestion: Use a balancer to balance capacity.
    Note The balancer occupies network bandwidth resources of DataNodes. Therefore, we recommend that you use the balancer during off-peak hours.
    1. Log on to a node of your cluster.

      In this example, log on to the master node of your cluster. For more information, see Log on to a cluster.

    2. Optional: Run the following command to change the maximum bandwidth of the balancer:
      hdfs dfsadmin -setBalancerBandwidth <bandwidth in bytes per second>
      Note <bandwidth in bytes per second> is the maximum bandwidth. For example, if you want to set the maximum bandwidth to 20 MB/s, the value of <bandwidth in bytes per second> is 20971520. The complete command is hdfs dfsadmin -setBalancerBandwidth 20971520. If the cluster load is high, you can change the value to 209715200 (200 MB/s). If the cluster is in the idle state, you can change the value to 1073741824 (1 GB/s).
    3. Run the following commands to switch to the hdfs user and start the HDFS balancer:
      su hdfs
      /usr/lib/hadoop-current/sbin/start-balancer.sh -threshold 10
    4. Run the following command to go to the hadoop-hdfs directory:
      cd /var/log/hadoop-hdfs
    5. Run the ll command to view the logs of the HDFS balancer:
      A log file with a name in the format of hadoop-hdfs-balancer-emr-header-xx.cluster-xxx.log is returned.
    6. Run the following command to check the status of the HDFS balancer:
      tailf /var/log/hadoop-hdfs/hadoop-hdfs-balancer-emr-header-xx.cluster-xxx.log
      If the command output includes Successfully, the HDFS balancer is running.
      Note hadoop-hdfs-balancer-emr-header-xx.cluster-xxx.log is the name of the log file that is obtained in Step 5.
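The bandwidth values in step 2 are byte counts per second. A small shell helper (the function name is ours, for illustration) converts a target bandwidth in MiB/s into the value that hdfs dfsadmin -setBalancerBandwidth expects:

```shell
# Convert a bandwidth in MiB/s to the bytes-per-second value expected by
# `hdfs dfsadmin -setBalancerBandwidth`. The function name is illustrative.
mib_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

mib_to_bytes 20      # 20971520   (20 MB/s)
mib_to_bytes 200     # 209715200  (200 MB/s)
mib_to_bytes 1024    # 1073741824 (1 GB/s)
```

For example, hdfs dfsadmin -setBalancerBandwidth "$(mib_to_bytes 200)" sets the balancer bandwidth to 200 MB/s.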