As the number of files in your Hadoop Distributed File System (HDFS) cluster grows, the Java Virtual Machine (JVM) heap memory on the NameNode and DataNodes must grow with it. Without enough heap, new writes fail. This topic shows you how to calculate the required heap size and apply the configuration in the E-MapReduce (EMR) console.
Prerequisites
Before you begin, ensure that you have:
- Access to the EMR console with permissions to modify cluster service configurations
- The current file count and data block count from the HDFS web user interface (UI). For instructions on opening the HDFS web UI, see Access the web UIs of open source components.
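If you prefer to read the two counts programmatically rather than from the web UI page, the following minimal sketch parses them from a NameNode JMX response. It assumes that your NameNode web endpoint serves a `/jmx` page and that the `FSNamesystem` bean reports `FilesTotal` and `BlocksTotal`, as stock Apache Hadoop does; the helper name `counts_from_jmx` and the sample payload are hypothetical and only illustrate the shape of the data.

```python
import json
from typing import Tuple

# Bean name used by stock Apache Hadoop for namespace metrics (assumption:
# your EMR cluster exposes the same bean on the NameNode /jmx page).
FSNAMESYSTEM_BEAN = "Hadoop:service=NameNode,name=FSNamesystem"


def counts_from_jmx(payload: str) -> Tuple[int, int]:
    """Return (file_count, block_count) from a NameNode JMX JSON payload."""
    beans = json.loads(payload).get("beans", [])
    for bean in beans:
        if bean.get("name") == FSNAMESYSTEM_BEAN:
            return int(bean["FilesTotal"]), int(bean["BlocksTotal"])
    raise KeyError(f"bean {FSNAMESYSTEM_BEAN!r} not found in payload")


# Hypothetical, trimmed-down payload used only for illustration.
sample_payload = json.dumps({
    "beans": [
        {
            "name": FSNAMESYSTEM_BEAN,
            "FilesTotal": 10_000_000,
            "BlocksTotal": 10_000_000,
        }
    ]
})

files, blocks = counts_from_jmx(sample_payload)
print(f"files={files:,} blocks={blocks:,}")
```

In practice you would replace `sample_payload` with the text returned by the JMX page of your own NameNode.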
Adjust the NameNode heap size
Calculate the recommended heap size
Use the following formula:
Recommended memory size = (Number of files in millions + Number of data blocks in millions) × 512 MB
Example: A cluster has 10 million files. All files are small to medium sized (each fits within one block), so the block count also equals 10 million. Recommended heap size: (10 + 10) × 512 MB = 10,240 MB.
The following table shows recommended heap sizes for common file counts, assuming most files fit within one block.
| Number of files | Recommended memory size (MB) |
|---|---|
| 10,000,000 | 10,240 |
| 20,000,000 | 20,480 |
| 50,000,000 | 51,200 |
| 100,000,000 | 102,400 |
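The NameNode formula can be sketched as a small Python helper, which is handy when your block count differs from your file count. The function name `namenode_heap_mb` is hypothetical; the 512 MB factor comes from the formula above, and rounding up to a whole megabyte is an assumption made here for convenience.

```python
MB_PER_MILLION_OBJECTS = 512  # factor from the NameNode formula above


def namenode_heap_mb(num_files: int, num_blocks: int) -> int:
    """Recommended NameNode heap in MB for the given file and block counts.

    Uses integer ceiling division so partial millions round up to a whole MB.
    """
    if num_files < 0 or num_blocks < 0:
        raise ValueError("counts must be non-negative")
    total_objects = num_files + num_blocks
    return -(-total_objects * MB_PER_MILLION_OBJECTS // 1_000_000)


# Reproduce the table above, assuming one block per file.
for file_count in (10_000_000, 20_000_000, 50_000_000, 100_000_000):
    heap = namenode_heap_mb(file_count, file_count)
    print(f"{file_count:>11,} files -> {heap:,} MB")
```

Pass the real block count as the second argument if many of your files span more than one block.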
Apply the configuration
The procedure differs depending on whether your cluster uses high availability (HA).
HA cluster
- Log on to the EMR console.
- Find the target cluster and click Services in the Actions column.
- On the Services tab, find the HDFS service and click Configure.
- On the Configure tab, search for hadoop_namenode_heapsize.
- Set the value based on your calculation.
- Restart the NameNode for the change to take effect.
Non-HA cluster
- Log on to the EMR console.
- Find the target cluster and click Services in the Actions column.
- On the Services tab, find the HDFS service and click Configure.
- On the Configure tab, search for hadoop_namenode_heapsize and hadoop_secondary_namenode_heapsize.
- Set the values based on your calculation.
- Restart the NameNode or the Secondary NameNode for the change to take effect.
Adjust the DataNode heap size
Calculate the recommended heap size
The heap demand on each DataNode depends on how many block replicas that node holds, not on the total file count. Use the following formulas:
Number of replicas per DataNode = Number of data blocks × 3 / Number of DataNodes
Recommended memory size = Number of replicas per DataNode in millions × 2,048 MB
The recommended value already includes headroom for the JVM itself and for peak-hour job memory, so you can normally use it as is.
Example: A cluster stores three replicas of each block, runs on Elastic Compute Service (ECS) instances of the big data instance family, and has 6 core nodes, which host the DataNodes. With 10 million files and 10 million data blocks (all files small to medium sized):
- Replicas per DataNode: 10,000,000 × 3 / 6 = 5,000,000
- Recommended heap size: 5 × 2,048 MB = 10,240 MB
The following table shows recommended heap sizes for common numbers of replicas per DataNode.
| Number of replicas per DataNode | Recommended memory size (MB) |
|---|---|
| 1,000,000 | 2,048 |
| 2,000,000 | 4,096 |
| 5,000,000 | 10,240 |
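The two DataNode formulas can be combined into one calculation, sketched below in Python. The function names are hypothetical, and the `replication` parameter is an assumption that generalizes the fixed factor of 3 in the formula above; leave it at the default to match this topic.

```python
MB_PER_MILLION_REPLICAS = 2048  # factor from the DataNode formula above


def replicas_per_datanode(num_blocks: int, num_datanodes: int,
                          replication: int = 3) -> float:
    """Average number of block replicas held by each DataNode."""
    if num_datanodes <= 0:
        raise ValueError("num_datanodes must be positive")
    return num_blocks * replication / num_datanodes


def datanode_heap_mb(num_blocks: int, num_datanodes: int,
                     replication: int = 3) -> int:
    """Recommended DataNode heap in MB, rounded up to a whole MB."""
    if num_datanodes <= 0:
        raise ValueError("num_datanodes must be positive")
    # Integer ceiling division avoids floating-point rounding surprises.
    numerator = num_blocks * replication * MB_PER_MILLION_REPLICAS
    denominator = num_datanodes * 1_000_000
    return -(-numerator // denominator)


# Worked example from this topic: 10 million blocks on 6 DataNodes.
print(f"{replicas_per_datanode(10_000_000, 6):,.0f} replicas per DataNode")
print(f"{datanode_heap_mb(10_000_000, 6):,} MB recommended heap")
```

Because the result depends on the number of DataNodes, recompute it after you scale the cluster out or in.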
Apply the configuration
- Log on to the EMR console.
- Find the target cluster and click Services in the Actions column.
- On the Services tab, find the HDFS service and click Configure.
- On the Configure tab, search for hadoop_datanode_heapsize in the Configuration Filter section.
- Set the value based on your calculation.
- Restart the DataNodes for the change to take effect.