This topic provides suggestions on how to use Hadoop Distributed File System (HDFS) to perform scenario-specific configurations in E-MapReduce (EMR). The suggestions can help improve the performance and stability of HDFS.


Configure a recycle bin mechanism

  • Background information: In HDFS, deleted files are moved to a recycle bin so that data can be recovered if it is deleted by mistake.

    You can set a time threshold for files to be retained in the recycle bin. When the storage duration of a file in the recycle bin exceeds the threshold, the system automatically deletes the file from the recycle bin and the file cannot be recovered. You can also manually delete files in the recycle bin.

  • Suggestion: Go to the Configure tab on the HDFS service page in the EMR console, search for the fs.trash.interval parameter in the Configuration Filter section, and specify the parameter. The following table describes the parameter.

    Parameter: fs.trash.interval
    Description: The period for which data is retained in the recycle bin, in minutes. If the storage duration of data in the recycle bin exceeds this period, the system deletes the data and the data cannot be recovered. If you set this parameter to 0, the recycle bin mechanism is disabled.
    Default value: 1440

    Note We recommend that you use the default value of this parameter so that you can recover files that are deleted by mistake. Do not set this parameter to an excessively large value. Otherwise, files in the recycle bin may occupy excessive storage space in your cluster.
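If you manage Hadoop configuration files directly instead of using the EMR console, the same setting corresponds to the fs.trash.interval property in core-site.xml. A minimal fragment, using the default retention period:

```xml
<!-- core-site.xml: keep deleted files in the recycle bin for 1,440 minutes (1 day) -->
<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
</property>
```

To empty the recycle bin ahead of schedule, you can run hdfs dfs -expunge, which permanently removes trash checkpoints that are older than the configured threshold.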

Control the number of small files

  • Background information: The HDFS NameNode loads all file metadata into memory. If the disk capacity of your cluster is fixed and a large number of small files are generated, the memory capacity of the NameNode becomes a bottleneck.
  • Suggestion: Control the number of small files. We recommend that you merge existing small files into large files.
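To see why small files matter, you can estimate NameNode heap usage from the file count. The calculation below uses a commonly cited rule of thumb (roughly 150 bytes of heap per metadata object); treat it as an order-of-magnitude sketch, not an exact figure:

```shell
# Rule-of-thumb estimate (assumption): the NameNode uses roughly 150 bytes
# of heap per metadata object, and each file contributes at least two
# objects: one inode and one block.
NUM_FILES=100000000          # 100 million small files, one block each
BYTES_PER_OBJECT=150
HEAP_MB=$(( NUM_FILES * 2 * BYTES_PER_OBJECT / 1024 / 1024 ))
echo "Estimated NameNode heap: ${HEAP_MB} MB"
```

For 100 million small files, the estimate comes to roughly 28 GB of heap for metadata alone. One common way to merge existing small files is the Hadoop archive tool, for example hadoop archive -archiveName data.har -p /path/to/input /path/to/output.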

Set the number of files in a directory of HDFS

  • Background information: When your cluster is running, multiple clients or components, such as Spark and YARN, may continuously write files to the same directory in HDFS. A limit is imposed on the number of files that can be written to a directory in HDFS. Therefore, you must plan data storage in advance to prevent the number of files in a directory from exceeding the threshold. An excessive number of files may cause job failures.
  • Suggestion: Go to the Configure tab on the HDFS service page in the EMR console. Click the hdfs-site tab in the Service Configuration section. Click Custom Configuration, add the dfs.namenode.fs-limits.max-directory-items parameter to specify the maximum number of files that can be stored in a directory, and then click Save to save the configuration. For more information about how to add a parameter, see Add parameters.
    Note You must plan data storage in advance. You can classify files by time or business type. We recommend that you set the maximum number of files that can be written to a directory to 1,000,000.
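If you edit configuration files directly rather than adding the parameter in the console, the setting corresponds to the following property in hdfs-site.xml; the value shown is the recommended limit from this topic:

```xml
<!-- hdfs-site.xml: cap the number of items that a single directory can hold -->
<property>
  <name>dfs.namenode.fs-limits.max-directory-items</name>
  <value>1000000</value>
</property>
```

Partitioning output paths by date or business type, for example /logs/2024-01-01/, keeps each directory well below this limit.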

Reduce the probability that an exception occurs on a running client in case of network instability

  • Background information: In case of network instability, the probability that an exception occurs on a running client increases.
  • Suggestion: Set the ipc.client.connect.max.retries.on.timeouts and ipc.client.connect.timeout parameters to large values to increase the maximum number of connection retries and extend the timeout period for establishing a connection. This helps reduce the probability that an exception occurs on a running client. To specify these two parameters, perform the following steps: Go to the Configure tab on the HDFS service page in the EMR console. Search for the ipc.client.connect.max.retries.on.timeouts and ipc.client.connect.timeout parameters in the Configuration Filter section and increase the values of these two parameters.
    Parameter: ipc.client.connect.max.retries.on.timeouts
    Description: The maximum number of retries for the client to establish a socket connection with the server.
    Default value: 45

    Parameter: ipc.client.connect.timeout
    Description: The timeout period for the client to establish a socket connection with the server. Unit: milliseconds.
    Default value: 20000
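In configuration-file terms, these settings live in core-site.xml. The values below are illustrative examples of "larger than the defaults" (double the defaults of 45 retries and 20,000 ms), not prescribed values:

```xml
<!-- core-site.xml: illustrative values raised above the defaults -->
<property>
  <name>ipc.client.connect.max.retries.on.timeouts</name>
  <value>90</value>
</property>
<property>
  <name>ipc.client.connect.timeout</name>
  <value>40000</value>
</property>
```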

Specify the maximum number of volumes that are allowed to be damaged

  • Background information: By default, if multiple data storage volumes are configured for a DataNode and one of the volumes is damaged, the DataNode will no longer provide services.
  • Suggestion: Specify the maximum number of volumes that are allowed to be damaged. If the number of damaged volumes on a DataNode does not exceed the upper limit, the DataNode can still provide services. To specify the maximum number of damaged volumes that are allowed, perform the following steps: Go to the Configure tab on the HDFS service page in the EMR console, search for the dfs.datanode.failed.volumes.tolerated parameter in the Configuration Filter section, and change the value of this parameter.
    Parameter: dfs.datanode.failed.volumes.tolerated
    Description: The maximum number of data storage volumes that are allowed to be damaged before a DataNode stops providing services. By default, at least one available volume must be configured.
    Default value: 0

    Note If a DataNode has a defective disk and no other nodes can take over the services on the DataNode, you can temporarily increase the value of this parameter so that the DataNode can continue to provide services.
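For direct file-based configuration, the parameter belongs in hdfs-site.xml. The value 1 below is an example for a DataNode with several data disks, allowing one disk to fail without taking the node out of service:

```xml
<!-- hdfs-site.xml: tolerate one failed data volume per DataNode (example value) -->
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>1</value>
</property>
```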

Prevent directories from being deleted by mistake

  • Background information: In HDFS, some directories can be configured as protected directories to prevent these directories from being deleted by mistake. However, you can still move these directories to the recycle bin.
  • Suggestion: Go to the Configure tab on the HDFS service page in the EMR console. Click the core-site tab in the Service Configuration section. Click Custom Configuration, add the fs.protected.directories parameter, and then save the configuration. The parameter value is the directory that you want to configure as a protected directory. If you want to set this parameter to multiple directories, separate the directories with commas (,).
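The equivalent file-based setting is the fs.protected.directories property in core-site.xml. The directory paths below are placeholders; substitute the directories that you want to protect:

```xml
<!-- core-site.xml: comma-separated list of directories that cannot be deleted
     while non-empty (example paths) -->
<property>
  <name>fs.protected.directories</name>
  <value>/user/warehouse,/apps/hive</value>
</property>
```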

Use a balancer to balance capacity

  • Background information: HDFS in an EMR cluster may experience unbalanced disk usage among DataNodes in scenarios such as the addition of DataNodes. If data distribution in HDFS is unbalanced, individual DataNodes may be overloaded.
  • Suggestion: Use a balancer to balance capacity.
    Note The balancer occupies network bandwidth resources of DataNodes. Therefore, we recommend that you use the balancer during off-peak hours.
    1. Log on to a node of your cluster.

      In this example, log on to the master node of your cluster. For more information, see Log on to a cluster.

    2. Optional: Run the following command to change the maximum bandwidth of the balancer:
      hdfs dfsadmin -setBalancerBandwidth <bandwidth in bytes per second>
      Note <bandwidth in bytes per second> is the maximum bandwidth. For example, if you want to set the maximum bandwidth to 20 MB/s, the value of <bandwidth in bytes per second> is 20971520. The complete command is hdfs dfsadmin -setBalancerBandwidth 20971520. If the cluster load is high, you can change the value to 209715200 (200 MB/s). If the cluster is in the idle state, you can change the value to 1073741824 (1 GB/s).
    3. Run the following commands to switch to the hdfs user and start the HDFS balancer:
      su hdfs
      /usr/lib/hadoop-current/sbin/start-balancer.sh -threshold 10
    4. Run the following command to go to the hadoop-hdfs directory:
      cd /var/log/hadoop-hdfs
    5. Run the ll command to view the logs of the HDFS balancer:
      A log file with a name in the format of hadoop-hdfs-balancer-emr-header-xx.cluster-xxx.log is returned.
    6. Run the following command to check the status of the HDFS balancer:
      tailf /var/log/hadoop-hdfs/hadoop-hdfs-balancer-emr-header-xx.cluster-xxx.log
      If the command output includes Successfully, the HDFS balancer is running.
      Note hadoop-hdfs-balancer-emr-header-xx.cluster-xxx.log is the name of the log file that is obtained in Step 5.
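The bandwidth values in step 2 are byte counts per second. A small shell helper (the function name is ours, for illustration) converts a target bandwidth in MiB/s into the value that hdfs dfsadmin -setBalancerBandwidth expects:

```shell
# Convert a bandwidth in MiB/s to the bytes-per-second value expected by
# `hdfs dfsadmin -setBalancerBandwidth`. The function name is illustrative.
mib_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

mib_to_bytes 20      # 20971520   (20 MB/s)
mib_to_bytes 200     # 209715200  (200 MB/s)
mib_to_bytes 1024    # 1073741824 (1 GB/s)
```

For example, hdfs dfsadmin -setBalancerBandwidth "$(mib_to_bytes 200)" sets the balancer bandwidth to 200 MB/s.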