This topic provides suggestions on scenario-specific configurations of Hadoop Distributed File System (HDFS) in E-MapReduce (EMR). These suggestions can help improve the performance and stability of HDFS.

Control the number of small files

  • Background information: The HDFS NameNode loads the metadata of all files into memory. If the disk capacity of your cluster stays constant while a large number of small files are generated, NameNode memory becomes the bottleneck.
  • Suggestion: Control the number of small files. We recommend that you merge existing small files into large files, as shown in the sketch below.
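    A minimal sketch that uses Hadoop Archives (HAR) to pack small files into a single archive; the source directory /tmp/small-files and the destination directory /tmp/archives are hypothetical paths that you must replace with your own:
      # Archive everything under /tmp/small-files into one file named data.har.
      hadoop archive -archiveName data.har -p /tmp/small-files /tmp/archives
      # List the archived files through the har:// scheme to confirm the result.
      hdfs dfs -ls har:///tmp/archives/data.har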

Specify the maximum number of files that can be stored in a directory of HDFS

  • Background information: When your cluster is running, multiple clients or components, such as Spark and YARN, may continuously write files to the same directory in HDFS. A limit is imposed on the number of files that can be stored in a directory of HDFS. Therefore, you must plan data storage in advance to prevent the number of files in a directory from exceeding the limit. An excessive number of files may cause job failures.
  • Suggestion: Go to the Configure tab on the HDFS service page in the EMR console. Click the hdfs-site tab in the Service Configuration section. Click Add Configuration Item, add the dfs.namenode.fs-limits.max-directory-items parameter to specify the maximum number of files that can be stored in a directory, and then click Save to save the configuration. For more information about how to add a parameter, see Add parameters.
    Note You must plan data storage in advance. You can classify files by time or business type. We recommend that you set the maximum number of files that can be stored in a directory to 1,000,000. After the change is delivered, you can verify the value as shown in the sketch below.
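    A minimal verification sketch that you can run on a cluster node. Note that hdfs getconf reads the local client configuration files, not the live NameNode state:
      # Query the configured limit on the number of items in a directory.
      hdfs getconf -confKey dfs.namenode.fs-limits.max-directory-items
      # Expected output if the setting was delivered: 1000000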

Specify the maximum number of damaged volumes that are allowed

  • Background information: By default, if multiple data storage volumes are configured for a DataNode and one of the volumes is damaged, the DataNode no longer provides services.
  • Suggestion: Specify the maximum number of damaged volumes that are allowed. If the number of damaged volumes on a DataNode does not exceed this limit, the DataNode can still provide services. To specify the limit, go to the Configure tab on the HDFS service page in the EMR console, search for the dfs.datanode.failed.volumes.tolerated parameter in the Configuration Filter section, and then change the value of the parameter.
    Parameter: dfs.datanode.failed.volumes.tolerated
    Description: The maximum number of damaged data storage volumes that are allowed before a DataNode stops providing services. By default, at least one available volume must be configured.
    Default value: 0
    Note If a DataNode has a damaged disk and no other node can take over the services on the DataNode, you can temporarily increase the value of this parameter so that the DataNode continues to provide services. The value must be smaller than the number of data directories configured for the DataNode; otherwise, the DataNode cannot start. You can check the related values as shown in the sketch below.
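    A minimal sketch for checking the related values on a cluster node; hdfs getconf reads the local client configuration files:
      # List the data directories that are configured for DataNodes. The tolerated value
      # must stay smaller than the number of entries so that at least one volume remains.
      hdfs getconf -confKey dfs.datanode.data.dir
      # Check the current tolerance. The default value 0 means that one damaged volume
      # is enough to stop the DataNode.
      hdfs getconf -confKey dfs.datanode.failed.volumes.tolerated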

Use a balancer to balance capacity

  • Background information: HDFS in an EMR cluster may experience unbalanced disk usage among DataNodes in scenarios such as the addition of DataNodes. If data distribution in HDFS is unbalanced, individual DataNodes may be overloaded.
  • Suggestion: Use a balancer to balance capacity.
    Note The balancer occupies network bandwidth resources of DataNodes. Therefore, we recommend that you use the balancer during off-peak hours.
    1. Log on to a node of your cluster.
    2. Optional: Run the following command to change the maximum bandwidth of the balancer:
      hdfs dfsadmin -setBalancerBandwidth <bandwidth in bytes per second>
      Note Replace <bandwidth in bytes per second> in the preceding command with the maximum bandwidth in bytes per second. For example, to set the maximum bandwidth to 20 MB/s, use 20971520. The complete command is hdfs dfsadmin -setBalancerBandwidth 20971520. If the cluster load is high, you can change the value to 209715200 (200 MB/s). If the cluster is idle, you can change the value to 1073741824 (1 GB/s).
    3. Run the following commands to switch to the hdfs user and start the balancer. The -threshold 10 option instructs the balancer to keep moving data until the disk usage of each DataNode deviates from the average usage of the cluster by no more than 10 percentage points:
      su hdfs
      /opt/apps/HDFS/hdfs-current/sbin/start-balancer.sh -threshold 10
    4. Run the following command to go to the Hadoop log directory:
      cd /var/log/emr/hadoop/
    5. Run the ll command to list the log files and find the name of the balancer log file.
    6. Run the following command to check the status of the balancer:
      tail -f hadoop-hdfs-balancer-master-1-1.c-xxx.log
      Note Replace hadoop-hdfs-balancer-master-1-1.c-xxx.log in the preceding command with the log file name that you obtained in the previous step.

      If the command output includes Successfully, the balancer is running.
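
      The balancer exits on its own after the disk usage of all DataNodes falls within the threshold. If you need to stop it earlier, for example because business traffic increases, Hadoop provides a matching stop script in the same directory as the start script. A minimal sketch, assuming the path from step 3:
        su hdfs
        # Stop a running balancer before it finishes. Data that has already been moved
        # stays in place, and a later run of start-balancer.sh continues from that state.
        /opt/apps/HDFS/hdfs-current/sbin/stop-balancer.sh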