Appropriate hardware configuration and network design are key to cluster performance, cost-effectiveness, and reliability when you create an Alibaba Cloud EMR cluster. This topic describes how to select high availability services, node specifications, and network configurations based on your big data processing requirements.
High availability service selection
You can enable or disable the high availability feature based on your business scenarios and actual requirements. When high availability is enabled, the cluster runs in multi-master mode, which eliminates single-node failure risks and ensures service continuity through distributed deployment and failover mechanisms.
| Dimension | Single-master node cluster | Multi-master node cluster |
| --- | --- | --- |
| Scenarios | Development, testing, and non-critical workloads that can tolerate downtime. | Production workloads that require continuous service availability. |
| Core features | Single-node architecture with simple deployment, but the single node is a single point of failure. | Multi-master architecture that eliminates single points of failure through distributed deployment and failover mechanisms. |
| Failback | No automatic recovery: failures require manual troubleshooting and a restart. | Automatic failback: the EMR service automatically replaces a failed master node and applies the same environment and bootstrap actions as the original node. |
| Cost | Low cost: only one master node is required. | Higher cost: three master nodes are required. Three nodes allow the consensus algorithms used in distributed systems to reach majority decisions, satisfy the strong-consistency requirements of open-source components (such as ZooKeeper and HDFS), tolerate a single node failure, and avoid split-brain (see the quorum sketch after this table). |
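The three-master requirement follows from majority quorum: a cluster of N voting members stays available only while floor(N/2) + 1 members are alive, so it tolerates floor((N-1)/2) failures, and 3 is the smallest size that survives one failure. A minimal sketch of this arithmetic:

```python
# Majority quorum: a cluster of n voting members stays available
# as long as more than half of them are alive.
def quorum(n: int) -> int:
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    return n - quorum(n)

for n in (1, 2, 3, 5):
    print(f"{n} masters: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
# 1 masters: quorum=1, tolerates 0 failure(s)
# 2 masters: quorum=2, tolerates 0 failure(s)
# 3 masters: quorum=2, tolerates 1 failure(s)   <- smallest HA size
# 5 masters: quorum=3, tolerates 2 failure(s)
```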
Node specification selection
The cluster configuration process is as follows:
1. Determine the business scenario: Choose a scenario (for example, data lake, data analysis, real-time data stream, data service, or a custom cluster) based on your business.
2. Select a storage architecture: Decide between coupled storage and compute (HDFS) and decoupled storage and compute (OSS-HDFS/OSS) based on your scenario.
3. Configure node specifications and disk size:
   - Configure node specifications: Select appropriate ECS instance types (such as general-purpose, compute-optimized, memory-optimized, or big data instances) for each node type (Master, Core, Task) based on the selected storage architecture, cluster scale, and business characteristics.
   - Configure disk size: Estimate storage capacity from the current data volume and expected growth, then size disks accordingly (see the sizing sketch after this list).
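As referenced in the disk-size step, raw capacity must account for replication and free-space headroom, not just the logical data volume. A minimal sizing sketch, assuming HDFS triple replication, 30% reserved headroom, and illustrative workload numbers (none of these values are EMR defaults):

```python
def hdfs_raw_capacity_gib(logical_data_gib: float,
                          monthly_growth_rate: float,
                          months: int,
                          replication: int = 3,
                          headroom: float = 0.30) -> float:
    """Estimate the raw HDFS disk capacity needed over a planning horizon.

    logical_data_gib: current logical (pre-replication) data volume.
    monthly_growth_rate: e.g. 0.10 for 10% growth per month.
    months: planning horizon in months.
    replication: HDFS replication factor (3 by default).
    headroom: fraction of capacity kept free for temp data and rebalancing.
    """
    projected = logical_data_gib * (1 + monthly_growth_rate) ** months
    return projected * replication / (1 - headroom)

# Example: 10 TiB today, 10% monthly growth, 12-month horizon.
total = hdfs_raw_capacity_gib(10 * 1024, 0.10, 12)
print(f"raw capacity needed: {total:.0f} GiB")  # ~137,732 GiB
```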
Data lake scenario
Coupled storage and compute (HDFS)
| Node type | Recommended specification |
| --- | --- |
| Master: Manages the cluster and coordinates tasks. Services deployed: NameNode, ResourceManager, HiveServer, HiveMetastore, SparkHistoryServer. | |
| Core: Provides computing power and storage resources. Services deployed: DataNode, NodeManager. | Core node instance specifications depend on resource requirements. |
| Task: Provides computing power only and does not store data; mainly used to supplement the CPU and memory of Core nodes. Services deployed: NodeManager. | For peak-valley workloads, scale Task nodes elastically (see the sizing sketch after this table). |
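As referenced in the Task row above, a common peak-valley pattern is to size Core nodes for the steady valley load and let elastic Task nodes absorb the peak. A minimal sizing sketch, with all per-node and workload numbers as illustrative assumptions:

```python
import math

def size_for_peak_valley(peak_vcores: int,
                         valley_vcores: int,
                         core_node_vcores: int = 16,
                         task_node_vcores: int = 16) -> tuple[int, int]:
    """Size Core nodes for the valley load and Task nodes for the peak-valley gap.

    All numbers here are illustrative assumptions, not EMR defaults.
    """
    core_nodes = math.ceil(valley_vcores / core_node_vcores)
    task_nodes = math.ceil(max(peak_vcores - valley_vcores, 0) / task_node_vcores)
    return core_nodes, task_nodes

# Example: the valley needs 96 vCores around the clock, the peak needs 256.
core, task = size_for_peak_valley(peak_vcores=256, valley_vcores=96)
print(f"Core nodes: {core}, elastic Task nodes at peak: {task}")
# Core nodes: 6, elastic Task nodes at peak: 10
```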
Decoupled storage and compute (OSS-HDFS/OSS)
| Node type | Recommended specification |
| --- | --- |
| Master: Manages the cluster and coordinates tasks. Services deployed: ResourceManager, HiveServer, HiveMetastore, SparkHistoryServer. | |
| Core: Functions like a Task node and does not store data. Services deployed: NodeManager. | Core nodes do not support elastic scaling. We recommend using only Task nodes and not configuring Core nodes. |
| Task: Provides computing power. Services deployed: NodeManager. | |
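With decoupled storage, jobs address data through OSS paths instead of HDFS. A minimal PySpark sketch, assuming a hypothetical bucket name and a cluster already configured for OSS-HDFS/OSS access:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oss-demo").getOrCreate()

# Paths use the oss:// scheme instead of hdfs://; "my-bucket" is a placeholder.
df = spark.read.parquet("oss://my-bucket/warehouse/orders/")
daily = df.groupBy("order_date").count()
daily.write.mode("overwrite").parquet("oss://my-bucket/warehouse/orders_daily/")
```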
Data analysis scenario
Coupled storage and compute
| Node type | Recommended specifications |
| --- | --- |
| Master: Manages the cluster and coordinates tasks. Services deployed: StarRocks FE, Doris FE, ZooKeeper. | |
| Core: Provides computing power and storage resources. Services deployed: StarRocks BE, Doris BE, ClickhouseKeeper, ClickhouseServer. | Core node instance specifications depend on computing requirements and data storage volume. |
| Task: Provides computing power. Services deployed: StarRocks CN. | Only the StarRocks Compute Node (CN) can be deployed on Task nodes. If you do not use StarRocks components, you do not need Task nodes. For peak-valley workloads, scale Task nodes elastically. |
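The StarRocks FE on the master nodes is the SQL entry point and speaks the MySQL protocol (query port 9030 by default), while BE/CN nodes execute queries. A minimal connectivity check, assuming the pymysql package is installed and using placeholder host and credentials:

```python
import pymysql

# "emr-master-1" and the credentials are placeholders for your deployment.
conn = pymysql.connect(host="emr-master-1", port=9030,
                       user="root", password="")
with conn.cursor() as cur:
    cur.execute("SHOW BACKENDS;")  # lists BE nodes registered with the FE
    for row in cur.fetchall():
        print(row)
conn.close()
```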
Decoupled storage and compute
Only StarRocks 3.x supports the decoupled storage and compute architecture.
| Node type | Recommended specifications |
| --- | --- |
| Master: Manages the cluster and coordinates tasks. Services deployed: StarRocks FE, ZooKeeper. | |
| Task: Provides computing power. Services deployed: StarRocks CN. | In the StarRocks decoupled storage and compute architecture, there are no Core nodes, only Task nodes. Evaluate instance specifications against actual computing requirements; ≥ 16 cores and 64 GiB is the general guideline. The number of nodes can be elastically scaled with business load. |
Real-time data stream scenario
Coupled storage and compute (HDFS)
| Node type | Recommended specifications |
| --- | --- |
| Master: Manages the cluster and coordinates tasks. Services deployed: NameNode, ResourceManager, FlinkHistoryServer, ZooKeeper. | |
| Core: Provides computing power and storage resources. Services deployed: DataNode, NodeManager. | Core node instance specifications depend on business type and resource requirements. |
| Task: Provides computing power only and does not store data; mainly used to supplement the CPU and memory of Core nodes. Services deployed: NodeManager. | For peak-valley workloads, scale Task nodes elastically. |
Decoupled storage and compute (OSS-HDFS/OSS)
| Node type | Recommended specifications |
| --- | --- |
| Master: Manages the cluster and coordinates tasks. Services deployed: ResourceManager, FlinkHistoryServer, ZooKeeper. | |
| Core: Functions like a Task node and does not store data. Services deployed: NodeManager. | Core nodes do not support elastic scaling. We recommend using only Task nodes and not configuring Core nodes. |
| Task: Provides computing power. Services deployed: NodeManager. | |
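In the decoupled architecture, Flink checkpoints typically target OSS rather than HDFS. A minimal PyFlink sketch, assuming a placeholder bucket and a Flink version whose Python CheckpointConfig exposes set_checkpoint_storage_dir (1.15 or later):

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60_000)  # checkpoint every 60 s

# "my-bucket" is a placeholder; the cluster must already be able to access OSS.
env.get_checkpoint_config().set_checkpoint_storage_dir(
    "oss://my-bucket/flink/checkpoints")
```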
Data service scenario
Coupled storage and compute (HDFS)
| Node type | Recommended specifications |
| --- | --- |
| Master: Manages the cluster and coordinates tasks. Services deployed: NameNode, HMaster, ZooKeeper. | |
| Core: Provides computing power and storage resources. Services deployed: DataNode, HRegionServer. | Core node instance specifications depend on request volume and storage volume. |
| Task: Provides computing power only and does not store data; mainly used to supplement the CPU and memory of Core nodes. Services deployed: HRegionServer. | Because data is stored on Core nodes, Task nodes are typically not recommended in the data service scenario; this preserves data locality. |
Decoupled storage and compute (OSS-HDFS/OSS)
| Node type | Recommended specifications |
| --- | --- |
| Master: Manages the cluster and coordinates tasks. Services deployed: NameNode, HMaster, ZooKeeper. | |
| Core: Provides computing power and storage resources. Services deployed: DataNode, HRegionServer. | Storing the HBase HLog on OSS-HDFS/OSS significantly degrades write performance; we recommend keeping the HLog on HDFS. Core node instance specifications depend on request volume. General-purpose instances with ≥ 500 GiB of disk space are recommended. |
| Task: Provides computing power. Services deployed: HRegionServer. | For peak-valley workloads, scale Task nodes elastically. |
Custom cluster scenario
When your business mixes multiple scenarios, such as offline ETL, real-time ETL, complex aggregation analysis, and high-concurrency query services:
Recommended approach: combine multiple cluster types. By deploying clusters with different characteristics independently (such as offline batch processing clusters, real-time stream processing clusters, analytical clusters, and query acceleration clusters), you achieve resource isolation and per-scenario adaptation, which ensures the performance and stability of each workload.
If your business scale is small and the scenarios do not compete for resources, choose a custom cluster to reduce deployment complexity and improve resource utilization through flexible configuration.
Coupled storage and compute (HDFS)
| Node type | Recommended specifications |
| --- | --- |
| Master: Manages the cluster and coordinates tasks. | |
| Core: Provides computing power and storage resources. | Core node instance specifications depend on business type and resource requirements. |
| Task: Provides computing power only and does not store data; mainly used to supplement the CPU and memory of Core nodes. | For peak-valley workloads, scale Task nodes elastically. |
Decoupled storage and compute (OSS-HDFS/OSS)
| Node type | Recommended specifications |
| --- | --- |
| Master: Manages the cluster and coordinates tasks. | Small clusters (≤ 8 instances): general-purpose instances with 8 cores and 32 GiB; select cloud disks. |
| Core: Functions like a Task node and does not store data. | |
| Task: Provides computing power. | If you configure only Task nodes, size them for the full workload and scale them elastically. If you configure both Core and Task nodes, plan for peak-valley scenarios: let Core nodes cover the steady valley load and Task nodes absorb peaks. |
Network configuration recommendations
| Key dimension | Configuration recommendations |
| --- | --- |
| VPC network configuration | |
| Security group configuration | |
| Network connectivity configuration | |
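When planning the VPC, verify that the vSwitch CIDR block leaves enough private IP addresses for the planned node count plus elastic-scaling headroom. A minimal check using only the Python standard library (the reserved-address count and all numbers are illustrative assumptions, not Alibaba Cloud guarantees):

```python
import ipaddress

def vswitch_capacity_ok(cidr: str, planned_nodes: int, scale_headroom: int) -> bool:
    """Check that a vSwitch CIDR can hold the planned nodes plus scaling headroom.

    Cloud vSwitches reserve a few addresses per subnet; 4 is used here as an
    illustrative assumption.
    """
    net = ipaddress.ip_network(cidr)
    usable = net.num_addresses - 4
    return usable >= planned_nodes + scale_headroom

# Example: a /24 for a 20-node cluster that may scale out by 100 Task nodes.
print(vswitch_capacity_ok("192.168.0.0/24", planned_nodes=20, scale_headroom=100))  # True
```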
Appendix: ECS instance types
Refer to Instance family for the characteristics, specifications, and applicable scenarios of the available ECS instance families. It provides a reference for configuring node instance specifications in the EMR console.
| Instance type | Features |
| --- | --- |
| General-purpose | vCPU:memory = 1:4. Abbreviated as the g series. |
| Compute-optimized | vCPU:memory = 1:2, providing more computing resources. Abbreviated as the c series. |
| Memory-optimized | vCPU:memory = 1:8, providing more memory resources. Abbreviated as the r series. |
| Local SSD | vCPU:memory = 1:4, uses local SSD disks with high random IOPS and high throughput, but carries a risk of data loss. Not available for master nodes. Abbreviated as the i series. |
| Big data | vCPU:memory = 1:4, uses local SATA disks with high storage cost-effectiveness; the recommended instance type for large data volumes (TB-level). Abbreviated as the d series. |
| Shared | Shared-CPU instance type; not stable enough for heavy computational loads and suitable only for entry-level learning. Not recommended for enterprise customers. Available only for Task nodes. |
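The vCPU-to-memory ratio is often the quickest first filter when shortlisting an instance family. A minimal helper based on the ratios in the table above; the selection logic is an illustration, not an EMR rule:

```python
# Memory GiB per vCPU for each series, taken from the table above.
SERIES_RATIOS = {"c": 2, "g": 4, "i": 4, "d": 4, "r": 8}

def shortlist_series(vcpus: int, memory_gib: int) -> list[str]:
    """Return the series whose vCPU:memory ratio matches the workload profile."""
    ratio = memory_gib / vcpus
    return [s for s, r in SERIES_RATIOS.items() if r == ratio]

# Example: a job profile needing 16 vCPUs and 128 GiB points to the r series.
print(shortlist_series(16, 128))  # ['r']
```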