E-MapReduce (EMR) supports several Elastic Compute Service (ECS) instance families. Each instance family is optimized for a different combination of compute, memory, and storage. Choosing the right instance family for each node type directly affects cluster performance, storage cost, and data reliability.
Instance families at a glance
| Instance family | Storage | vCPU:memory ratio | Best for |
|---|---|---|---|
| General-purpose | Cloud disk | 1:4 (e.g., 8 vCPUs / 32 GiB) | Balanced compute and memory; master nodes and small-scale core nodes |
| Compute-optimized | Cloud disk | 1:2 (e.g., 8 vCPUs / 16 GiB) | CPU-intensive workloads; core nodes with moderate data volumes |
| Memory-optimized | Cloud disk | 1:8 (e.g., 8 vCPUs / 64 GiB) | Memory-intensive workloads; master nodes and in-memory processing |
| Big data | Local SATA disk | — | Large-scale HDFS storage (10 terabytes or more); lower cost per GiB |
| Local SSD type | Local SSD | — | High random IOPS and throughput; latency-sensitive workloads |
| Shared type (entry level) | — | — | Entry-level users only; not suitable for enterprise customers |
| GPU | — | — | Machine learning and heterogeneous computing |
Choose an instance family by node type
Each node type in an EMR cluster has a distinct role, and that role determines its resource profile. Match the instance family to what the node actually does.
Master nodes
Master nodes coordinate cluster services and manage metadata. They need reliable memory and stable storage, not raw throughput.
Use general-purpose or memory-optimized instances for master nodes. Both use cloud disks, which provide high data reliability for coordination state and metadata.
Core nodes
Core nodes process tasks and store data in HDFS. The right instance family depends on your data volume:
- Below 10 terabytes: Use general-purpose, compute-optimized, or memory-optimized instances. These store data on cloud disks and work well when OSS is the primary storage layer.
- 10 terabytes or more: Use the big data instance family. Local SATA disks offer significantly lower cost per GiB for large-scale HDFS storage.
When core nodes use local disks (big data or Local SSD type), HDFS data is stored on those local disks, which do not provide the data reliability guarantees of cloud disks; HDFS replication is what protects that data.
Core nodes with the big data instance family can only be created in Hadoop, Data Science, Dataflow, and Druid clusters.
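The selection rule above can be sketched as a small helper. This is an illustrative function, not part of any EMR API; the function name, parameters, and workload labels are assumptions made for this example.

```python
def choose_core_instance_family(hdfs_tb: float, workload: str = "balanced") -> str:
    """Suggest a core-node instance family from expected HDFS volume.

    hdfs_tb:  expected HDFS data volume in terabytes.
    workload: resource profile used below the 10 TB threshold;
              one of "balanced", "cpu", or "memory" (labels are
              illustrative, not EMR terminology).
    """
    if hdfs_tb >= 10:
        # Local SATA disks: lowest cost per GiB for large-scale HDFS,
        # but data reliability depends on HDFS replication.
        return "big data"
    # Below 10 TB, cloud-disk families apply; pick by bottleneck.
    cloud_disk_families = {
        "balanced": "general-purpose",
        "cpu": "compute-optimized",
        "memory": "memory-optimized",
    }
    return cloud_disk_families.get(workload, "general-purpose")
```

For example, `choose_core_instance_family(20)` returns `"big data"`, while `choose_core_instance_family(5, "memory")` returns `"memory-optimized"`.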
Task nodes
Task nodes add compute capacity to a cluster without storing HDFS data. Because they have no data persistence role, all instance families except the big data type are supported. Choose based on your workload's resource bottleneck:
- CPU-bound jobs: compute-optimized
- Memory-bound jobs: memory-optimized or general-purpose
- Machine learning inference: GPU
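The task-node mapping can likewise be expressed as a lookup. This is a hypothetical sketch for illustration; the function name and bottleneck labels are assumptions, and the only source-backed constraint encoded here is that the big data family is not supported on task nodes.

```python
# Families available for task nodes: everything except "big data",
# because task nodes store no HDFS data.
TASK_NODE_FAMILIES = {
    "cpu": "compute-optimized",
    "memory": "memory-optimized",
    "ml-inference": "GPU",
}

def choose_task_instance_family(bottleneck: str) -> str:
    """Suggest a task-node instance family from the workload bottleneck.

    bottleneck: "cpu", "memory", or "ml-inference" (illustrative labels).
    Falls back to general-purpose when no specific bottleneck applies.
    """
    return TASK_NODE_FAMILIES.get(bottleneck, "general-purpose")
```

Memory-bound jobs can also run acceptably on general-purpose instances (1:4 ratio) when the 1:8 ratio of memory-optimized instances is more than the workload needs.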
Instance family details
General-purpose
Stores data on cloud disks with a 1:4 vCPU-to-memory ratio. Suitable for a wide range of workloads that do not have extreme memory or compute requirements.
Compute-optimized
Stores data on cloud disks with a 1:2 vCPU-to-memory ratio. Use when your jobs are CPU-bound and memory is not the bottleneck.
Memory-optimized
Stores data on cloud disks with a 1:8 vCPU-to-memory ratio. Use for workloads that keep large datasets in memory, such as Spark in-memory processing.
Big data
Uses local SATA disks for storage. The cost per GiB is significantly lower than cloud disks, making this the recommended choice when storing large volumes of HDFS data on core nodes. Available only in Hadoop, Data Science, Dataflow, and Druid clusters.
Local SSD type
Uses local SSDs for storage. Delivers high random IOPS and throughput, suited for latency-sensitive workloads.
Shared type (entry level)
Instances share physical CPU resources, which can lead to inconsistent performance under heavy compute loads. Suitable for entry-level users, but not for enterprise customers.
GPU
A heterogeneous instance type backed by GPU hardware. Use for machine learning and other GPU-accelerated workloads on task nodes.