This topic describes how to troubleshoot the issue in which files cannot be written because the number of replicas that can be written is less than the minimum threshold.
Error message
The following error message is returned. [X] indicates the number of DataNodes that are running in your cluster. [Y] indicates the number of DataNodes that are excluded from the write operation.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /foo/file1 could only be written to 0 of the 1 minReplication nodes, there are [X] datanode(s) running and [Y] node(s) are excluded in this operation.
Cause
The error message indicates that the number of DataNodes to which replicas can be written is less than the minimum replication threshold of your cluster. As a result, files cannot be written to the specified path in Hadoop Distributed File System (HDFS).
Solutions
If [X] is 0, the DataNodes in the cluster fail to report information to the NameNode. In this case, you need to check the following items on each DataNode:
Check whether the DataNode can connect to the NameNode. Run the following command on the DataNode:
hadoop dfs -ls /
If the command times out or no command output is returned after an extended period of time, check whether the NameNode and the DataNode are in the same security group. You can also use the sample commands after these checks to test connectivity.
Check whether the DataNode is started.
Log on to the node on which the DataNode is deployed. For more information, see Log on to a cluster.
Switch to the hdfs user.
su - hdfs
Run the following command to check whether a DataNode process is running:
jps
If the command output contains DataNode, a DataNode process is running.
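The following commands are a minimal sketch of these checks. They assume that the NameNode RPC service listens on the default port 8020 and that the nc utility is installed on the DataNode; nn-host is a placeholder for the host name of your NameNode, and the port may differ in your cluster.
# On the DataNode: test network reachability of the NameNode RPC port (nn-host and port 8020 are assumptions).
nc -zv nn-host 8020
# As the hdfs user: check how many DataNodes the NameNode currently considers live.
hdfs dfsadmin -report | grep -E 'Live datanodes|Dead datanodes'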
If [Y] is not 0, some DataNodes are excluded from the write operation by the HDFS client. In this case, you need to check the following items:
Check whether the HDFS client can connect to all DataNodes. You can use the sample commands after these items to test connectivity to a DataNode.
Check whether the load on a DataNode is excessively high. If a DataNode is overloaded or its capacity is insufficient, the DataNode cannot process requests to write blocks for multiple files at the same time. In this case, you must scale up the DataNode.
Note: You can view the load of DataNodes on the web UIs of HDFS components. For more information, see Web UIs of HDFS components.
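The following commands are a minimal sketch of these checks. They assume that the DataNode data transfer service listens on the default port 9866 of Hadoop 3.x (50010 in Hadoop 2.x) and that the nc utility is installed on the client host; dn-host is a placeholder for the host name of a DataNode in your cluster.
# On the client host: test network reachability of a DataNode's data transfer port (dn-host and port 9866 are assumptions).
nc -zv dn-host 9866
# As the hdfs user: view the capacity, usage, and remaining space of each DataNode.
hdfs dfsadmin -report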
If [X] is not 0 and [X] is greater than [Y], you can check whether the following scenario occurs:
If the cluster has fewer than 10 DataNodes or the DataNodes are close to their storage capacity limit, the NameNode may fail to find an appropriate DataNode to write data while jobs in the cluster are writing a large number of small files. This issue may occur, for example, when Flink checkpoints are written to HDFS, or when Hive or Spark jobs write data to dynamic partitions and a large number of partitions need to be created.
To resolve this issue, increase the number or capacity of DataNodes in the cluster. We recommend that you increase the number of DataNodes, because this better handles the writing of a large number of small files and you can easily scale in the DataNodes later if they are no longer needed. You can also use OSS-HDFS to decouple storage from computing and reduce costs.
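The following commands are a minimal sketch that you can use to confirm this scenario. The path /user/flink/checkpoints is only an example; replace it with the directory to which your jobs write small files.
# Show the number of directories and files and the total size under a directory that receives a large number of small files.
hdfs dfs -count /user/flink/checkpoints
# Check how much storage capacity remains on the DataNodes.
hdfs dfsadmin -report | grep -E 'DFS Remaining|DFS Used%'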