This topic describes how to adjust the memory size of your Java Virtual Machines (JVMs) for the NameNode and DataNodes to enhance the stability of the Hadoop Distributed File System (HDFS).

Prerequisites

An E-MapReduce (EMR) cluster is created. For more information, see Create a cluster.

Adjust the memory size of the JVM for the NameNode

  • Background information: In HDFS, the NameNode keeps the metadata of every file in memory, so metadata consumes heap space in proportion to the number of files and blocks. The default JVM memory configuration is sufficient for common workloads. However, as more files are written to HDFS and the total number of stored files keeps growing, the metadata can exceed the default heap size, and new files can no longer be stored. In this case, you must increase the memory size.
  • Suggestion: Adjust the memory size of the JVM for the NameNode by using one of the following methods:
    • High availability (HA) cluster

      Go to the Configure tab on the HDFS service page in the EMR console. Search for the hadoop_namenode_heapsize parameter in the Configuration Filter section and change the value of this parameter based on your business requirements.

    • Non-HA cluster

      Go to the Configure tab on the HDFS service page in the EMR console. Search for the hadoop_namenode_heapsize and hadoop_secondary_namenode_heapsize parameters in the Configuration Filter section and change the values of the parameters based on your business requirements.

      Note After the configuration is complete, you must restart the NameNode or the secondary NameNode for the configuration to take effect.
    You can view the number of files and the number of data blocks on the web UI of HDFS. For more information, see Access the web UIs of open source components. You can use the following formula to calculate the memory size of the JVM for the NameNode:
    Recommended memory size (MB) = (Number of files in millions + Number of data blocks in millions) × 512 MB

    For example, you have 10 million files, all of which are small- and medium-sized files, and the size of each file does not exceed the size of a block. Therefore, the number of blocks is also 10 million. In this example, the recommended memory size is 10,240 MB, which is calculated by using the following formula: (10 + 10) × 512 MB.

    The following table lists the recommended memory size for different numbers of files, assuming that most files are no larger than one block.
    Number of files | Recommended memory size (MB)
    10,000,000      | 10,240
    20,000,000      | 20,480
    50,000,000      | 51,200
    100,000,000     | 102,400
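
    The sizing formula above can be sketched as a small Python helper. The function name is illustrative and not part of EMR; the constant 512 MB per million objects comes from the formula in this topic.

    ```python
    def recommended_namenode_heap_mb(num_files: int, num_blocks: int) -> int:
        """Recommended NameNode JVM heap in MB:
        (files in millions + blocks in millions) x 512 MB.
        """
        millions = (num_files + num_blocks) / 1_000_000
        return int(millions * 512)

    # 10 million small files, each no larger than one block,
    # so the number of blocks is also 10 million.
    print(recommended_namenode_heap_mb(10_000_000, 10_000_000))  # 10240
    ```

    The result matches the table above: 10 million files map to 10,240 MB, and 100 million files map to 102,400 MB.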

Adjust the memory size of the JVM for each DataNode

  • Background information: In HDFS, each DataNode keeps the metadata of the data blocks that it stores, so metadata consumes heap space in proportion to the number of blocks. The number of blocks on the DataNodes grows with the number of files. The default JVM memory configuration is sufficient for common workloads. However, as more files are written to HDFS, the metadata can exceed the default heap size, and new files can no longer be stored. In this case, you must increase the memory size.
  • Suggestion: Go to the Configure tab on the HDFS service page in the EMR console. Search for the hadoop_datanode_heapsize parameter in the Configuration Filter section and change the value of this parameter based on your business requirements.
    You can use the following formula to calculate the memory size of the JVM for each DataNode:
    Number of replicas per DataNode in an EMR cluster = (Number of data blocks × 3) / Number of DataNodes, where 3 is the default replication factor
    Recommended memory size (MB) = Number of replicas per DataNode in millions × 2048 MB
    Note After the configuration is complete, you must restart DataNodes to apply the configuration.

    For example, HDFS in an EMR cluster provides triplicate storage, the Elastic Compute Service (ECS) instances deployed in the EMR cluster belong to a big data instance family, and the number of core nodes is 6. If you have 10 million files and the number of data blocks is also 10 million because all the files are small- and medium-sized files, the number of replicas per DataNode is 5,000,000, which is calculated by using the following formula: (10,000,000 × 3) / 6. The recommended memory size is 10,240 MB, which is calculated by using the following formula: 5 × 2048 MB.

    The following table lists the recommended memory size for different numbers of replicas per DataNode, assuming that most files are no larger than one block.
    Number of replicas per DataNode | Recommended memory size (MB)
    1,000,000                       | 2,048
    2,000,000                       | 4,096
    5,000,000                       | 10,240
    Note The recommended value includes the memory space for the JVM kernel and the memory space to run jobs during peak hours. You can directly use the recommended value under normal circumstances.
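
    The two DataNode formulas above can be sketched in Python. The function names are illustrative and not part of EMR; the default replication factor of 3 and the 2,048 MB per million replicas come from the formulas in this topic.

    ```python
    def replicas_per_datanode(num_blocks: int, num_datanodes: int,
                              replication: int = 3) -> float:
        """Replicas stored on each DataNode:
        (blocks x replication factor) / number of DataNodes.
        """
        return num_blocks * replication / num_datanodes

    def recommended_datanode_heap_mb(replicas: float) -> int:
        """Recommended DataNode JVM heap in MB:
        replicas in millions x 2048 MB.
        """
        return int(replicas / 1_000_000 * 2048)

    # 10 million blocks, triplicate storage, 6 core nodes
    replicas = replicas_per_datanode(10_000_000, 6)
    print(int(replicas))                        # 5000000
    print(recommended_datanode_heap_mb(replicas))  # 10240
    ```

    This reproduces the worked example above: 5,000,000 replicas per DataNode map to a recommended heap of 10,240 MB, in line with the table.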