This topic provides suggestions on how to use Hadoop Distributed File System (HDFS) to perform scenario-specific configurations for real-time computing in E-MapReduce (EMR). The suggestions can help improve the stability of HDFS.
Adjust the number of Xceiver connections on DataNodes
- Background information: In most cases, a real-time computing framework keeps a large number of file write streams open to continuously write data to files in HDFS. However, a limit is imposed on the number of data transfer streams that each DataNode can process at the same time. The limit is specified by the dfs.datanode.max.transfer.threads parameter.
- Suggestion: Go to the Configure tab on the HDFS service page in the EMR console, search for the dfs.datanode.max.transfer.threads parameter, and increase its value. This parameter specifies the maximum number of threads that a DataNode can use to process read or write streams. Default value: 4096. Increase the value of this parameter when one of the following error messages appears:
- Error message in the DataNode service log stored in the /var/log/emr/hadoop/ directory
java.io.IOException: Xceiver count 4097 exceeds the limit of concurrent xcievers: 4096 at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:150)
- Error message in the client running log
DataXceiver error processing WRITE_BLOCK operation src: /10.*.*.*:35692 dst: /10.*.*.*:50010 java.io.IOException: Premature EOF from inputStream
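If you manage the Hadoop configuration files directly instead of using the EMR console, the same setting can be applied in hdfs-site.xml on each DataNode; the value 8192 below is only an illustrative example, not a recommended value:

```xml
<!-- hdfs-site.xml: raise the Xceiver limit on each DataNode,
     then restart the DataNode service for the change to take effect. -->
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>8192</value>
  <description>Maximum number of threads used for data transfer (Xceivers).</description>
</property>
```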
Reserve disk space
- Background information: HDFS reserves 128 MB of disk space (one default-sized block) for each file that is open for writing, to ensure that data can continue to be written to the file. The reserved space is released when the file is closed: if the actual size of the file is small, such as 8 MB, only 8 MB of disk space is occupied after you call the close() method on the output stream.
In most cases, a real-time computing framework keeps a large number of file write streams open to continuously write data to files in HDFS. If many files are open for writing at the same time, HDFS reserves a correspondingly large amount of disk space. If the remaining disk space is insufficient, new files cannot be created.
- Suggestion: Use the following formula to calculate the minimum disk space that must be reserved on your cluster: N × 128 MB × Number of replicas. In this formula, N represents the number of files that you want to open for writing at the same time.
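The formula can be checked with a short calculation. The numbers below (1,000 concurrently open files and 3 replicas, the HDFS default replication factor) are illustrative assumptions, not values from this topic:

```python
# Minimum disk space that HDFS reserves across the cluster for open write streams.
BLOCK_SIZE_MB = 128  # default HDFS block size, matching the 128 MB in the formula

def reserved_space_mb(open_files: int, replicas: int, block_size_mb: int = BLOCK_SIZE_MB) -> int:
    """Return the minimum disk space (in MB) to reserve: N x block size x replicas."""
    return open_files * block_size_mb * replicas

# Example: 1,000 open files with 3 replicas each.
mb = reserved_space_mb(open_files=1000, replicas=3)
print(f"{mb} MB (~{mb / 1024:.0f} GB)")  # 384000 MB (~375 GB)
```

If the cluster's remaining capacity falls below this figure, file creation can start to fail even though the files eventually written are much smaller than the reserved blocks.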