This topic provides answers to some frequently asked questions about HDFS.

Why does the NameNode take a long period of time to restart?

  • Problem description: When an exception occurs on the running NameNode, the NameNode is restarted. However, the restart is very slow and is still in progress after more than 10 minutes, at which point the NameNode is automatically restarted again. The log information shows that the FsImage and EditLog files are being loaded during the restart of the NameNode.
  • Cause: When the FsImage and EditLog files are loaded during the restart of the NameNode, a large amount of memory space is occupied.
  • Solution: Increase the heap size of the NameNode. For more information, see Adjust the memory size of the JVM for the NameNode.
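
If you maintain the Hadoop configuration files yourself instead of using the EMR console, the heap size of the NameNode is typically raised through the JVM options in hadoop-env.sh. The following is a minimal sketch only: the variable name depends on your Hadoop version, and the 16 GB heap size is an illustrative placeholder that you must size based on the number of files and blocks in your cluster.

  # Sketch: raise the NameNode heap in hadoop-env.sh (values are illustrative).
  # Hadoop 2.x reads HADOOP_NAMENODE_OPTS; Hadoop 3.x reads HDFS_NAMENODE_OPTS.
  export HADOOP_NAMENODE_OPTS="-Xms16g -Xmx16g ${HADOOP_NAMENODE_OPTS}"

  # Restart the NameNode afterward so that the new heap size takes effect.
  # In EMR, restart the NameNode component from the console instead.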

Why does the NameNode fail to respond to access requests?

  • Problem description: The NameNode remains under full load for a long period of time, the CPU utilization of the node on which the NameNode runs reaches 100%, and the NameNode no longer responds to access requests.
  • Cause: The NameNode does not have sufficient memory to hold the metadata of an excessive number of files, and full garbage collection is frequently triggered.
  • Solution: Increase the heap size of the NameNode. For more information, see Adjust the memory size of the JVM for the NameNode.
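
To confirm that frequent full garbage collection is the cause, you can watch the GC activity of the NameNode process with the standard JDK tools jps and jstat. This is a diagnostic sketch; the 5-second sampling interval is arbitrary.

  # Find the NameNode process ID (excludes the SecondaryNameNode process).
  NN_PID=$(jps | awk '/NameNode/ && !/Secondary/ {print $1}')

  # Print heap occupancy and GC counters every 5 seconds. A fast-growing FGC
  # column together with an old generation (O) that stays near 100% indicates
  # that the heap is too small for the current amount of metadata.
  jstat -gcutil ${NN_PID} 5000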

Why are a large number of EditLog files generated?

  • Problem description: A large number of EditLog files are generated in the data directory of the NameNode. The data directory of the NameNode occupies a large amount of disk space.
  • Cause: The EditLog files are not merged in a timely manner. This is because the Secondary NameNode in a non-HA cluster or the standby NameNode in an HA cluster is abnormal. The exception may be caused by insufficient memory space.
  • Solution: Increase the heap size of the NameNode to allow the NameNode to run as expected. For more information, see Adjust the memory size of the JVM for the NameNode.
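
You can also verify that checkpointing has stalled by inspecting the NameNode metadata directory: the number of EditLog segments keeps growing while the latest FsImage stays old. A minimal sketch, assuming the metadata is stored under /mnt/disk1/hdfs (the directory mentioned later in this topic) with the usual current/ subdirectory; adjust the path to your deployment.

  # Assumed NameNode metadata directory; adjust to your deployment.
  NN_DIR=/mnt/disk1/hdfs/current

  # Count the accumulated EditLog segments. A steadily growing count means the
  # Secondary NameNode (non-HA) or standby NameNode (HA) is not merging them.
  ls ${NN_DIR} | grep -c '^edits_'

  # Check when the most recent FsImage was written. An old timestamp confirms
  # that checkpoints have stopped.
  ls -lt ${NN_DIR}/fsimage_* | head -n 2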

Why are a large number of under replicated blocks generated?

  • Problem description: A large number of under replicated blocks are reported after the fsck command is run.
  • Cause: A node is decommissioned or an abnormal node is removed from your cluster. As a result, the number of replicas of the affected blocks falls below 3, and it takes a long period of time for the replicas to be restored to 3.
  • Solution: To allow the number of replicas to change back to 3, go to the Configure tab on the HDFS service page in the EMR console. Search for the following parameters in the Configuration Filter section and increase their values.

    Parameter: dfs.namenode.replication.work.multiplier.per.iteration
    Description: Default value: 100. We recommend that you set this parameter to 200. The parameter value cannot exceed 500. This parameter determines the total number of block transfers to begin in parallel at a DataNode for replication. The parameter value is a coefficient: the actual number of block transfers is the coefficient multiplied by the number of nodes in your cluster.

    Parameter: dfs.namenode.replication.max-streams
    Description: We recommend that you set this parameter to 100. You can use this parameter to adjust the parallelism of low-priority replication streams.

    Parameter: dfs.namenode.replication.max-streams-hard-limit
    Description: Default value: 100. We recommend that you set this parameter to 200. The parameter value cannot exceed 500. You can use this parameter to adjust the parallelism of replication streams of all priorities, including the highest-priority replication streams.
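
After you increase these values, you can track how quickly the under replicated blocks are being repaired by rerunning fsck periodically, as in the following sketch. The grep pattern only filters the summary lines of the standard fsck report.

  # Print the replication-related summary lines of the fsck report. The
  # "Under replicated blocks" count should decrease faster after the
  # parameters above are increased.
  hdfs fsck / | grep -iE 'under.?replicated|missing|corrupt'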

How do I troubleshoot the issue of missing blocks or corrupted blocks?

  • Problem description: After the fsck command is run, an error message indicating missing blocks or corrupted blocks appears.
  • Cause: DataNodes may have stopped providing services, or a disk may be damaged or abnormal.
  • Solution: If the issue occurs because DataNodes stopped providing services, restart the DataNodes. If the issue occurs because a disk is damaged or abnormal, run the hdfs fsck / -files command to check the status of all files and identify the damaged files. Then, export all files, delete the damaged files, and re-upload the remaining files.
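
The following sketch shows one way to identify the affected files before you export and re-upload data. The report output path and the file path are examples only.

  # List the files that currently have corrupt or missing blocks.
  hdfs fsck / -list-corruptfileblocks

  # Check the status of all files and keep the report for reference
  # (the output path is an example).
  hdfs fsck / -files -blocks -locations > /tmp/fsck-report.txt

  # After exporting the data that you still need, delete a damaged file so
  # that it is no longer reported (the path is an example).
  hdfs dfs -rm /path/to/damaged/file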

How do I troubleshoot the issue that the NameNode fails to restart when the txid values of EditLog files are not continuous?

  • Problem description: The NameNode fails to restart after a JournalNode is powered off, the disk space of a data directory is exhausted, or the network is abnormal.
  • Cause: The txid values of the EditLog files on the JournalNode may not be continuous. In this case, some EditLog files may be damaged.
  • Solution: Perform the following steps to resolve the issue of damaged EditLog files:
    1. Back up the metadata storage directory /mnt/disk1/hdfs of the NameNode to prevent data loss caused by accidental operations.
    2. Observe the NameNode startup log and record the txid values of the EditLog files that fail to be loaded.
    3. Access another NameNode, find the EditLog files that have the same txid values as the EditLog files that fail to be loaded, and then copy these files to the current NameNode to overwrite the EditLog files that fail to be loaded.
    4. Restart the NameNode and check whether the operation is successful.
      Note If the restart operation still fails, .
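
The following is a minimal shell sketch of steps 1 and 3. It assumes that the metadata directory /mnt/disk1/hdfs from step 1 contains the usual current/ subdirectory, and it uses a placeholder hostname and txid range that you must replace with the values recorded from your startup log.

  # Step 1: back up the NameNode metadata directory before any changes.
  cp -a /mnt/disk1/hdfs /mnt/disk1/hdfs.bak

  # Step 3: copy the intact EditLog segment from the other NameNode. The
  # hostname and the txid range in the file name are placeholders.
  scp other-namenode-host:/mnt/disk1/hdfs/current/edits_0000000000000001000-0000000000000001099 \
      /mnt/disk1/hdfs/current/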