This topic describes how to troubleshoot the issue that files cannot be closed when data is being written to Hadoop Distributed File System (HDFS).

Error message

java.io.IOException: Unable to close file because the last block xxx:xxx does not have enough number of replicas.

Cause

In most cases, this error occurs because DataNodes are under heavy write load and cannot report newly written blocks to the NameNode in time. As a result, the last block of the file does not reach the required number of replicas before the client exhausts its retries to close the file.

Solution

We recommend that you refer to the following instructions to resolve the issue:
  • View the configurations of the HDFS service.
    Check the value of the dfs.client.block.write.locateFollowingBlock.retries parameter in the hdfs-site.xml file. This parameter specifies the number of times the client retries to close a file after data is written to its data blocks. By default, the system tries to close a file five times within 30 seconds. We recommend that you set this parameter to 8, which indicates that the system tries to close a file eight times within 2 minutes. You can further increase the value of this parameter for a cluster with a high load. For a client-side example, see the first sketch after this list.
    Note If you increase the value of the dfs.client.block.write.locateFollowingBlock.retries parameter, the wait time before the system closes a file is prolonged when a node is busy, but data writing is not affected.
  • Check whether the cluster has only a small number of DataNodes but a large number of task nodes. If a large number of jobs are submitted concurrently, a large number of JAR files are uploaded to HDFS at the same time, which may place heavy load on the DataNodes. In this case, you can increase the value of the dfs.client.block.write.locateFollowingBlock.retries parameter or increase the number of DataNodes.
  • Check whether jobs that consume a large amount of DataNode resources exist in the cluster. For example, Flink checkpoints create and delete a large number of small files, which places heavy load on the DataNodes. In such scenarios, you can run Flink on an independent cluster so that its checkpoints are written to a separate HDFS cluster. You can also use the Object Storage Service (OSS) Connector or OSS-HDFS Connector so that checkpoints are written to OSS instead of the cluster's HDFS. For an example of pointing Flink checkpoints at OSS, see the second sketch after this list.
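The following is a minimal client-side sketch of the first method. It assumes a Hadoop 2.x or 3.x Java client whose default file system points to the affected HDFS cluster; the class name and output path are placeholders. Setting the property in the hdfs-site.xml file on the client or gateway node has the same effect.

// Minimal sketch: raise the close-retry count on the HDFS client side.
// Assumption: fs.defaultFS points to the HDFS cluster; /tmp/example.txt is a placeholder path.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithMoreCloseRetries {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Retry closing a file up to 8 times (default: 5) while waiting for the
        // last block to reach the required number of replicas.
        conf.setInt("dfs.client.block.write.locateFollowingBlock.retries", 8);

        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"))) {
            out.writeBytes("hello hdfs");
        } // close() now waits longer before throwing "Unable to close file ...".
    }
}

Because this is a client-side property, it takes effect per application and does not require a restart of the HDFS service.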
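The following is a minimal sketch of the Flink approach, assuming Flink 1.13 or later with an OSS file system plugin already configured; the bucket name and checkpoint path are placeholders.

// Minimal sketch: write Flink checkpoints to OSS instead of the cluster's HDFS.
// Assumption: an OSS file system plugin (for example, flink-oss-fs-hadoop) is enabled;
// oss://my-bucket/flink/checkpoints is a placeholder location.
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointToOss {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000L); // trigger a checkpoint every 60 seconds
        // Store checkpoint data in OSS so that it does not load the HDFS DataNodes.
        env.getCheckpointConfig().setCheckpointStorage("oss://my-bucket/flink/checkpoints");
        // ... define sources, transformations, and sinks, then call env.execute() ...
    }
}

With this layout, the small-file traffic generated by checkpoints goes to OSS, and the HDFS DataNodes serve only regular job data.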