In Alibaba Cloud E-MapReduce (EMR), the region and storage configurations of an EMR cluster directly affect cluster performance and cost. An appropriate region helps you reduce network latency, meet data localization requirements, and reduce resource costs. Proper storage configurations, such as the use of HDFS, Object Storage Service (OSS), or OSS-HDFS, help you improve data read and write efficiency, reduce storage costs, and ensure data reliability. This topic provides strategies and key factors to help you quickly select a region and plan storage configurations.
Region selection strategy
You can select a region based on the core factors described in the following table to ensure an optimal match between your business and required resources.
Factor | Description |
Data localization (higher priority) |
|
EMR service availability |
|
Price differences of ECS instances | The pricing of Elastic Compute Service (ECS) instances varies based on the region that you select. For more information, see ECS Price Calculator. |
Service topology optimization |
|
Regions that support EMR:
Asia Pacific - China
China (Hangzhou), China (Shanghai), China (Qingdao), China (Beijing), China (Zhangjiakou), China (Hohhot), China (Ulanqab), China (Shenzhen), China (Chengdu), and China (Hong Kong)
Asia Pacific - Others
Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), and Indonesia (Jakarta)
Europe and America
Germany (Frankfurt), UK (London), US (Silicon Valley), and US (Virginia)
Middle East
UAE (Dubai)
Storage planning
Storage architecture selection
EMR supports the compute-storage separation and compute-storage integration architectures. OSS-HDFS and OSS adopt the compute-storage separation architecture, and HDFS adopts the compute-storage integration architecture. You can select a storage architecture based on your data requirements and cost budget. The following table describes the differences between the architectures.
Comparison item | Compute-storage separation (OSS-HDFS or OSS) | Compute-storage integration (HDFS) |
Characteristic |
| Computing and storage resources are integrated, and data is stored in HDFS deployed in an EMR cluster. |
Scenario |
| Low-latency reads and writes are required. |
Data reliability |
|
|
Data durability |
| Data is deleted after an EMR cluster is released. |
Scaling flexibility | Computing and storage resources are separated, so you can add compute nodes (CNs) independently. | Computing and storage resources are integrated, so you must scale computing and storage resources together. |
Storage cost (example) | USD 0.0170 per GB-month (OSS Standard storage) | USD 0.051 per GiB-month |
O&M complexity |
|
|
Access method | For information about how to access OSS or OSS-HDFS, see Getting started. |
|
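The example prices in the table use different units (GB for OSS, GiB for HDFS), so a direct comparison needs a unit conversion. The following is a minimal sketch of that arithmetic, not an official pricing calculator. It assumes that the two example prices from the table apply as flat rates, and that HDFS keeps 3 replicas of each block (the default mentioned in the storage capacity evaluation section). The function names are hypothetical, and actual prices vary by region and billing method.

```python
GB = 10**9    # 1 GB  = 10^9 bytes (unit of the OSS example price)
GIB = 2**30   # 1 GiB = 2^30 bytes (unit of the HDFS example price)

# Example prices taken from the comparison table; treat them as illustrative.
OSS_USD_PER_GB_MONTH = 0.0170
HDFS_USD_PER_GIB_MONTH = 0.051


def oss_monthly_cost(data_bytes: int) -> float:
    """Monthly cost of storing data_bytes in OSS Standard storage."""
    return data_bytes / GB * OSS_USD_PER_GB_MONTH


def hdfs_monthly_cost(data_bytes: int, replicas: int = 3) -> float:
    """Monthly cost of the disk space that HDFS needs for data_bytes.

    Assumption: every byte is stored `replicas` times on data disks.
    """
    return data_bytes * replicas / GIB * HDFS_USD_PER_GIB_MONTH


if __name__ == "__main__":
    ten_tb = 10 * 10**12  # 10 TB of business data
    print(f"OSS : USD {oss_monthly_cost(ten_tb):,.2f} per month")
    print(f"HDFS: USD {hdfs_monthly_cost(ten_tb):,.2f} per month")
```

Pass `replicas=1` to compare the raw per-unit prices without replica overhead.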
Disk selection
EMR provides system disks and data disks for nodes in an EMR cluster.
Disk type | Description |
System disk | System disks are used to install operating systems and do not store business data. |
Data disk | Data disks are used to store data, local logs, and shuffled data. Evaluate the capacity based on the storage architecture that you select. For more information, see Storage capacity evaluation. Note: For the same total storage capacity, you can configure multiple data disks to improve service availability. With multiple data disks, specific services can tolerate a disk failure, so the failure of one disk does not affect the overall functionality of the data disks. |
Disk types
EMR provides the following types of disks for you to store data.
Cloud disks
Cloud disks are block-level data storage devices provided by Alibaba Cloud for ECS. Cloud disks use a distributed triplicate mechanism to achieve 99.9999999% (nine nines) data reliability for ECS instances.
Cloud disks are classified into standard SSDs, ultra disks, and enhanced SSDs (ESSDs) based on the disk performance.
Disk type | Characteristic | Scenario |
ESSD | | Latency-sensitive applications or I/O-intensive business scenarios |
Standard SSD |
|
|
Ultra disk |
|
|
For information about the performance of cloud disks and local disks, see Block storage performance.
Local disks
Local disks provide local storage for ECS instances and reside on the physical machines that host the instances. Local disks are suitable for scenarios that require high storage I/O performance and high cost-effectiveness for massive data storage.
Scenarios
When you configure a node group in the EMR console and set the Type parameter to Big Data or Local SSD, the data disks are local disks that are physically attached to the host server. These disks provide extremely low latency and high throughput.
Local disks are suitable only for core and task nodes.
If you use local disks as data disks, data loss may occur. We recommend that you configure backup policies when you use local disks to store big data.
Storage capacity evaluation
After you select the storage architecture, you must evaluate the required storage capacity based on the scale and growth trend of your business data. This helps you ensure that disk configuration meets your business requirements.
Data type | Description | Calculation rule |
Raw data | Initial data directly generated by your business, such as logs | Required storage space = Raw data volume |
Intermediate data | Temporary data generated during processing, such as the results of extract, transform, and load (ETL) operations | Required storage space = Raw data volume × 1.5 (adjust based on your business complexity) |
Result data | Final output data that needs to be stored | Required storage space = Raw data volume × 10% to 50% (adjust based on your business requirements) |
When you evaluate the required storage capacity, account for data growth over at least the next 6 months.
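The calculation rules in the table can be combined into a single estimate. The following is a minimal sketch under stated assumptions: the function name and parameters are hypothetical, the result ratio of 0.3 is just one value from the 10% to 50% range, and growth is modeled as a compound monthly rate, which is an assumption and not a rule from this topic.

```python
def estimate_data_volume_gb(
    raw_gb: float,
    intermediate_factor: float = 1.5,  # table rule: raw data volume x 1.5
    result_ratio: float = 0.3,         # table rule: 10% to 50% of raw data volume
    monthly_growth: float = 0.0,       # assumed compound growth rate per month
    months: int = 6,                   # plan for at least the next 6 months
) -> dict:
    """Estimate the data volume per data type, before any replica overhead."""
    # Project the raw data volume to the end of the planning window.
    projected_raw = raw_gb * (1 + monthly_growth) ** months
    intermediate = projected_raw * intermediate_factor
    result = projected_raw * result_ratio
    return {
        "raw": projected_raw,
        "intermediate": intermediate,
        "result": result,
        "total": projected_raw + intermediate + result,
    }


if __name__ == "__main__":
    # 1,000 GB of raw data today, assumed to grow 5% per month.
    for name, size in estimate_data_volume_gb(1000, monthly_growth=0.05).items():
        print(f"{name:>12}: {size:,.0f} GB")
```

Adjust `intermediate_factor` and `result_ratio` to match the complexity of your workloads, as the table suggests.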
Compute-storage integration (HDFS)
You need to evaluate the data disk capacity based on raw data, intermediate data, result data, and replica redundancy (3 replicas by default).
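As a rough illustration of how replica redundancy affects disk sizing, the sketch below multiplies the sum of the three data types by the replica count and spreads the result evenly across core nodes. The function names are hypothetical. Applying the replica factor uniformly to all three data types and dividing evenly across nodes are simplifying assumptions.

```python
import math


def hdfs_required_capacity_gb(
    raw_gb: float,
    intermediate_gb: float,
    result_gb: float,
    replicas: int = 3,  # HDFS keeps 3 replicas by default
) -> float:
    """Total data disk capacity needed across the cluster."""
    return (raw_gb + intermediate_gb + result_gb) * replicas


def per_node_capacity_gb(total_gb: float, core_nodes: int) -> int:
    """Capacity each core node must provide, rounded up to a whole GB."""
    if core_nodes <= 0:
        raise ValueError("core_nodes must be a positive integer")
    return math.ceil(total_gb / core_nodes)


if __name__ == "__main__":
    total = hdfs_required_capacity_gb(1000, 1500, 300)
    print(f"Cluster-wide capacity: {total:,.0f} GB")
    print(f"Per core node (4 nodes): {per_node_capacity_gb(total, 4):,} GB")
```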
Compute-storage separation (OSS-HDFS or OSS)
Business data is persistently stored in OSS. Data disks are only used to store temporary computing results, local logs, and shuffled data of tasks.