Region and storage are two of the most consequential decisions you make before creating an E-MapReduce (EMR) cluster. The right region eliminates cross-region transfer costs and reduces latency. The right storage architecture determines how your cluster scales, what it costs, and whether your data survives cluster termination.
Select a region
Deploy your EMR cluster in the same region as your data. For example, if your source data is in an Object Storage Service (OSS) bucket or ApsaraDB RDS instance in China (Shanghai), create your cluster in China (Shanghai). If you also write output to OSS, create that bucket in the same region. Co-locating the cluster and its data eliminates cross-region transfer fees and reduces latency.
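If you are unsure which region a bucket belongs to, you can query it programmatically before creating the cluster. The following minimal sketch uses the oss2 Python SDK; the endpoint, bucket name, and credentials are placeholders for illustration only.

```python
# Check the region of the source bucket before choosing a region for the cluster.
# The endpoint, bucket name, and credentials below are placeholders.
import oss2

auth = oss2.Auth("<AccessKeyId>", "<AccessKeySecret>")
bucket = oss2.Bucket(auth, "https://oss-cn-shanghai.aliyuncs.com", "my-source-bucket")

# Prints a location string such as "oss-cn-shanghai"; create the EMR cluster
# in the matching region to avoid cross-region transfer fees and extra latency.
print(bucket.get_bucket_location().location)
```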
Beyond data locality, consider the following factors when selecting a region.
| Factor | What to check |
|---|---|
| EMR service availability | Confirm that EMR is available in the region. Some services—such as OSS-HDFS and Data Lake Formation (DLF)—are not available in all regions. Instance types with local SSDs are also region-specific. |
| ECS instance pricing | Prices for Elastic Compute Service (ECS) instances vary by region. Use the ECS Price Calculator to compare costs before committing. |
| Service topology | Place EMR in the same region as dependent services—such as Virtual Private Cloud (VPC), Server Load Balancer (SLB), and databases—to avoid cross-region operation fees. For hybrid cloud deployments, choose the region closest to your data center's access point. |
Supported regions
| Geography | Regions |
|---|---|
| Asia Pacific - China | China (Hangzhou), China (Shanghai), China (Qingdao), China (Beijing), China (Zhangjiakou), China (Hohhot), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong) |
| Asia Pacific - others | Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta) |
| Europe and America | Germany (Frankfurt), UK (London), US (Silicon Valley), US (Virginia) |
| Middle East | UAE (Dubai) |
Plan storage
Choose a storage architecture
EMR supports two storage architectures: compute-storage separation and compute-storage integration.
HDFS is ephemeral. All data stored in HDFS is permanently deleted when an EMR cluster is released. If you use HDFS, back up critical data to OSS or OSS-HDFS before releasing the cluster.
| | Compute-storage separation (OSS-HDFS or OSS) | Compute-storage integration (HDFS) |
|---|---|---|
| Use when | Data lake architecture; cold data analysis | Low-latency reads and writes required |
| Data persistence | Data is retained after the cluster is released | Data is deleted when the cluster is released |
| Data durability | 99.9999999999% (twelve nines) | Relies on replica mechanism; no cross-region disaster recovery |
| Data reliability | OSS supports locally redundant storage (LRS) and zone-redundant storage (ZRS), providing cross-zone high reliability | Three replicas for local disks, two replicas for cloud disks; replicas are confined to the cluster |
| Scaling | Add compute nodes (CNs) independently without touching storage | Must scale compute and storage together; node removal is sequential and rebalancing is required |
| Storage cost | USD 0.0170 per GB-month (OSS Standard storage). OSS-HDFS also generates auxiliary data, which incurs additional OSS storage fees. See OSS billing and OSS pricing. | USD 0.051 per GiB-month. See EBS block storage billing and ECS pricing. |
| O&M | CNs are stateless and can be replaced quickly; storage capacity expands without manual cluster adjustment | DataNode failures require manual rebalancing; cluster resizing requires manual intervention |
| Access (see the sketch after this table) | oss://bucket-name.endpoint/path/to/data. See Getting started. | HA cluster: hdfs://namespace/path; non-HA cluster: hdfs://namenode-host:port/path |
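The following PySpark sketch illustrates the two access schemes from the table. The bucket name, endpoint, namespace, and paths are placeholders, and the oss:// scheme assumes the cluster is configured with the OSS or OSS-HDFS connector.

```python
# Reading and writing data through the two storage architectures.
# All bucket names, endpoints, namespaces, and paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-access-demo").getOrCreate()

# Compute-storage separation: data lives in OSS/OSS-HDFS and survives cluster release.
oss_df = spark.read.parquet(
    "oss://my-bucket.oss-cn-shanghai-internal.aliyuncs.com/warehouse/events/"
)
oss_df.write.mode("overwrite").parquet(
    "oss://my-bucket.oss-cn-shanghai-internal.aliyuncs.com/warehouse/events_cleaned/"
)

# Compute-storage integration: data lives in the cluster's HDFS and is deleted
# with the cluster. Use the namespace form on HA clusters.
hdfs_df = spark.read.parquet("hdfs://emr-cluster/warehouse/events/")
```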
Choose a disk type
Each node in an EMR cluster has a system disk and optionally one or more data disks.
| Disk | Purpose | Supported types |
|---|---|---|
| System disk | Hosts the operating system; does not store business data | Cloud disks only |
| Data disk | Stores data, local logs, and shuffle data | Cloud disks and local disks |
For the same total storage capacity, configuring multiple data disks instead of a single large disk improves service availability: services that support fault tolerance can keep running when one disk fails, and the failure does not affect the remaining data disks.
Cloud disks
Cloud disks use a distributed triplicate mechanism and provide 99.9999999% (nine nines) data reliability. EMR supports three cloud disk types.
| Disk type | Latency | IOPS and throughput | Use when |
|---|---|---|---|
| ESSD | 0.2 ms | High; supports performance levels PL0 to PL3. See ESSDs. | Latency-sensitive or I/O-intensive workloads: large-scale OLTP databases, NoSQL databases, Elasticsearch |
| Standard SSD | 0.5–2 ms | Relatively high | I/O-intensive applications; small and medium-sized relational and NoSQL databases |
| Ultra disk | 1–3 ms | Medium | Development and testing; system disks |
For a detailed comparison of cloud disk and local disk performance, see Block storage performance.
Local disks
Local disks are physically attached to the host server and provide extremely low latency and high throughput for massive-scale data storage.
In the EMR console, local disks are attached when you set the node group Type to Big Data or Local SSD.
Local disks are only supported on core and task nodes, not master nodes.
Data on local disks can be lost if the host hardware fails. Configure a backup policy when using local disks for business data.
Evaluate storage capacity
After selecting a storage architecture, estimate the required disk capacity based on your data volume and growth trend. Plan for at least six months of data growth.
| Data type | Description | Calculation |
|---|---|---|
| Raw data | Data generated directly by your business, such as logs | Required space = raw data volume |
| Intermediate data | Temporary data generated during processing, such as ETL results | Required space = raw data volume × 1.5 (adjust based on workload complexity) |
| Result data | Final output that must be retained | Required space = raw data volume × 10%–50% (adjust based on retention requirements) |
Compute-storage integration (HDFS): Add replica overhead to the total. HDFS defaults to three replicas for local disks and two replicas for cloud disks.
Compute-storage separation (OSS-HDFS or OSS): Data disks only need to accommodate temporary compute results, local logs, and shuffle data. Business data is stored durably in OSS.
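As a rough worked example, the sketch below applies the multipliers from the table to a hypothetical 10 TB of raw data on a compute-storage integration (HDFS) cluster; the growth factor and retention ratio are assumptions that you should replace with your own figures.

```python
# Rough disk-capacity estimate for a compute-storage integration (HDFS) cluster.
# Input figures are hypothetical; replace them with your own measurements.

raw_tb = 10.0                      # raw business data, in TB
intermediate_tb = raw_tb * 1.5     # temporary data such as ETL results
result_tb = raw_tb * 0.30          # retained results (10%-50% of raw data)
growth_factor = 2.0                # head room for at least six months of growth (assumption)
replication = 3                    # HDFS default for local disks; use 2 for cloud disks

logical_tb = (raw_tb + intermediate_tb + result_tb) * growth_factor
physical_tb = logical_tb * replication

print(f"Logical data volume: {logical_tb:.1f} TB")
print(f"Disk capacity to provision ({replication} replicas): {physical_tb:.1f} TB")
```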