This topic provides suggestions on how to use Hadoop Distributed File System (HDFS) to perform scenario-specific configurations for real-time computing in E-MapReduce (EMR). The suggestions can help improve the stability of HDFS.

Adjust the number of Xceiver connections on DataNodes

  • Background information: Real-time computing frameworks typically keep a large number of write streams open to continuously write data to files in HDFS. However, each DataNode limits the number of concurrent data transfer threads (Xceivers), which in turn caps the number of files that can be read or written at the same time. This limit is specified by the dfs.datanode.max.transfer.threads parameter.
  • Suggestion: Go to the Configure tab on the HDFS service page in the EMR console. Search for the dfs.datanode.max.transfer.threads parameter in the configuration filter section, and change the parameter value. This parameter specifies the maximum number of threads that can be used to process read or write streams. Default value: 4096. You can increase the value of this parameter when one of the following error messages appears:
    • Error message in the DataNode service log stored in the /var/log/emr/hadoop/ directory
      java.io.IOException: Xceiver count 4097 exceeds the limit of concurrent xcievers: 4096
              at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:150)
    • Error message in the client running log
      DataXceiver error processing WRITE_BLOCK operation  src: /10.*.*.*:35692 dst: /10.*.*.*:50010
      java.io.IOException: Premature EOF from inputStream
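
    As a sketch, the same parameter can also be set directly in hdfs-site.xml on each DataNode if you manage the configuration files yourself. The value 8192 below is only an illustrative increase, not a recommendation from this topic:

    ```xml
    <!-- hdfs-site.xml: raise the maximum number of concurrent transfer threads -->
    <property>
      <name>dfs.datanode.max.transfer.threads</name>
      <value>8192</value>
    </property>
    ```

    The DataNode service typically must be restarted for this change to take effect.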

Reserve disk space

  • Background information: HDFS reserves one full block of disk space, 128 MB by default, for each file that is open for writing to ensure that data can be written to the file. If the amount of data actually written is small, such as 8 MB, only 8 MB of disk space remains occupied after you call the close() method on the output stream.

    In most cases, a real-time computing framework keeps a large number of write streams open to continuously write data to files in HDFS. The more files that are open at the same time, the more disk space HDFS reserves. If the remaining disk space is insufficient, new files cannot be created.

  • Suggestion: Use the following formula to calculate the minimum disk space that must be reserved on your cluster: N × 128 MB × Number of replicas. In this formula, N represents the number of files that you want to open at the same time.
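
    The formula above can be expressed as a short sketch. The class and method names, and the example values of 1,000 open files with 3 replicas, are illustrative and not from this topic:

    ```java
    public class HdfsReservedSpace {
        // Disk space HDFS reserves per open write stream: one full block (128 MB by default).
        static final long BLOCK_RESERVATION_BYTES = 128L * 1024 * 1024;

        // Minimum disk space to reserve: N open files x 128 MB x number of replicas.
        static long minReservedBytes(long openFiles, int replicas) {
            return openFiles * BLOCK_RESERVATION_BYTES * replicas;
        }

        public static void main(String[] args) {
            // Example: 1,000 files open for writing with a replication factor of 3.
            long bytes = minReservedBytes(1000, 3);
            System.out.println(bytes / (1024L * 1024 * 1024) + " GiB"); // prints "375 GiB"
        }
    }
    ```

    In this example, a cluster writing to 1,000 files concurrently with 3 replicas should keep at least 375 GiB of disk space free across its DataNodes.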