Create a cluster

Alibaba Cloud E-MapReduce (EMR) helps you build and run open source big data frameworks—Hadoop, Spark, Hive, and Presto—for large-scale data processing and analysis. This topic walks you through creating an EMR on ECS cluster and explains each configuration option.

This topic covers cluster creation via the EMR console. To create a cluster using the API, see CreateCluster.

If you created an EMR cluster for the first time after 17:00 (UTC+8) on December 19, 2022, you cannot select the Hadoop, Data Science, Presto, or ZooKeeper cluster types.

Prerequisites

Before you begin, ensure that you have:

Completed RAM authorization. For more information, see Alibaba Cloud account role authorization.

Usage notes

After a cluster is created, you cannot change its configurations except the cluster name. Confirm all settings carefully before clicking Confirm.
Services selected during cluster creation cannot be uninstalled after the cluster is created.
For DataLake, DataFlow, DataServing, and Custom clusters running EMR 5.12.1 or later, or EMR 3.46.1 or later: if the selected services do not depend on core nodes, you can click Remove Node Group in the Node Group section to remove the core node group.
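
The version rule in the last note can be sketched as a small check. This is only an illustration: the function names and the accepted version string format (for example `EMR-5.12.1`) are assumptions, not part of any EMR SDK. The check also does not decide whether the selected services depend on core nodes.

```python
import re

# Cluster types and minimum versions taken from the usage note above.
SUPPORTED_TYPES = {"DataLake", "DataFlow", "DataServing", "Custom"}
MIN_VERSIONS = {5: (5, 12, 1), 3: (3, 46, 1)}


def parse_version(version: str) -> tuple:
    """Parse a string such as 'EMR-5.12.1' or '3.46.1' into (major, minor, patch).

    The accepted string format is an assumption made for this sketch.
    """
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def can_remove_core_group(cluster_type: str, version: str) -> bool:
    """Return True if the cluster type and version fall in the documented range."""
    parsed = parse_version(version)
    minimum = MIN_VERSIONS.get(parsed[0])  # None for release lines other than 3.x and 5.x
    return cluster_type in SUPPORTED_TYPES and minimum is not None and parsed >= minimum
```

For example, `can_remove_core_group("DataLake", "EMR-5.12.0")` returns `False` because 5.12.0 is below the 5.12.1 threshold.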

Procedure

Log on to the E-MapReduce console.
In the top navigation bar, select a region and a resource group.
- Region: The cluster is created in the selected region and cannot be moved after creation.
- Resource Group: Defaults to all resources in your account.
Click Create Cluster.
Complete the Software configuration, Hardware configuration, and Basic configuration steps described below.
Review your settings and click Confirm.
- Pay-as-you-go clusters: Creation starts immediately. The cluster status changes to Running when ready.
- Subscription clusters: An order is generated. The cluster is created after payment.

Configuration details

Software configuration

Configuration	Description
Region	The geographic location of the data center. Select a region close to your users to reduce latency. The region cannot be changed after the cluster is created.
Business scenario	Select a scenario based on your use case. See Choose a business scenario below for details.
Product version	The EMR release version. For more information, see Release versions.
High service availability	Disabled by default. When enabled, EMR creates multiple master nodes to support high availability for ResourceManager and NameNode, distributing them across different hardware to reduce the risk of failure.
Optional services (select at least one)	Select services based on your workload. The more services you select, the higher the instance type requirements. Select your instance type accordingly in the hardware configuration step.<br><br>Important: Services cannot be uninstalled after the cluster is created.
Collect service operational logs	Enabled by default. Collects service logs used only for cluster diagnostics. To change this after cluster creation, update the Collection Status of Service Operational Logs on the Basic Information page.<br><br>Important: Disabling log collection limits EMR health checks and technical support, though other features remain available. See How do I stop collecting service logs?
Metadata	Determines how metadata is stored. See Choose a metadata option below.
Root storage directory of cluster	Required when you select the OSS-HDFS service. Not required if you select HDFS.<br><br>Select a bucket with OSS-HDFS enabled in the same region, or click Create OSS-HDFS Bucket to create one.<br><br>Important: Buckets created via Create OSS-HDFS Bucket in the EMR console can only be read from and written to through EMR. Operations through the OSS console or API are not supported.<br><br>For first-time use of OSS-HDFS, the Alibaba Cloud account must complete authorization. For RAM users, the Alibaba Cloud account must grant the AliyunEMRDlsFullAccess permission and the AliyunOSSDlsDefaultRole and AliyunEMRDlsDefaultRole roles. For more information, see Grant permissions to a RAM user.<br><br>Note: OSS-HDFS is supported for DataLake, DataFlow, DataServing, and Custom clusters running EMR 5.12.1 or later, or EMR 3.46.1 or later. Before using OSS-HDFS, confirm it is available in your selected region. See Activate and authorize access to the OSS-HDFS service for supported regions.

Choose a business scenario

Select the scenario that matches your workload:

Data Lake: Runs big data compute engines on a flexible, reliable cluster. Supports building a data lake architecture with JindoFS for acceleration, and supports Hadoop Distributed File System (HDFS) on Object Storage Service (OSS) as a fully managed storage layer that reduces O&M costs and is billed based on usage. For more information, see DataLake cluster.
Data Analytics (OLAP): Loads data into an online analytical processing (OLAP) engine—ClickHouse or StarRocks—for real-time analytics. Suited for user personas, audience segmentation, BI reports, and business analytics.
Real-time Data Streaming (DataFlow): A one-stop real-time computing solution built on Kafka (distributed, high-throughput messaging) and a commercial Flink kernel from Ververica (based on Apache Flink). Handles real-time extract, transform, and load (ETL), log collection, and streaming analytics end to end. You can also use Kafka or Flink independently.
Data Service (DataServing): Provides a semi-managed HBase cluster with compute-storage decoupling via OSS-HDFS (JindoFS). Supports JindoData local cache to improve read/write performance. For more information, see DataServing cluster.
Custom Cluster: Lets you combine services freely. In production environments, avoid deploying multiple storage services on the same node group.

Choose a metadata option

Option	Description	Recommended for
DLF Unified Metadata (Recommended)	Stores metadata in Data Lake Formation (DLF). The default DLF Catalog uses your UID. To use different catalogs for different clusters, click Create Catalog, enter a catalog ID, click OK, then select it from the DLF Catalog drop-down list.	Test and production environments
Self-managed RDS	Uses your own or an Alibaba Cloud RDS instance as the metastore. Configure the RDS parameters after selecting this option. See Configure a self-managed RDS database.	Production environments
Built-in MySQL (Not recommended)	Stores metadata in MySQL within the cluster's local environment.	Not recommended for any environment

Service and version-specific settings

The following parameters appear only for specific EMR versions and services.

For EMR 5.12.0 and earlier or EMR 3.46.0 and earlier — Hive service:

Parameter	Description
Hive storage mode	Sets the storage location for the Hive data warehouse: OSS-HDFS or OSS when the check box is selected, or cluster HDFS when it is cleared (default). If you select the check box, also configure Hive Data Warehouse Path. For OSS-HDFS, select a bucket with OSS-HDFS enabled. Confirm that you have access to the selected OSS or OSS-HDFS bucket.

For EMR 5.12.0 and earlier or EMR 3.46.0 and earlier — HBase service:

Parameter	Description
HBase storage mode	Sets the storage for HBase data files: OSS-HDFS or OSS. If you select OSS-HDFS, also configure HBase Storage Path and select a bucket with OSS-HDFS enabled.

For EMR 5.12.1 and later or EMR 3.46.1 and later — OSS-HDFS + HBase services:

Parameter	Description
HBase log storage	Selected by default. Stores HBase HLog files in HDFS. After the cluster is created, an HBASE-HDFS service is generated. For more information, see HBASE-HDFS.

Advanced settings (optional)

Configuration	Description
Kerberos authentication	Disabled by default. Kerberos is a symmetric-key identity authentication protocol that secures access to cluster services. For more information, see Kerberos.<br><br>Important: Knox does not support Kerberos authentication. Kudu requires extra configuration even when Kerberos is enabled. See Authentication in the Apache Kudu documentation.
Custom software configuration	Disabled by default. Specify a JSON file to configure cluster software (Hadoop, Spark, Hive, and others). For more information, see Configure custom software.<br><br>Note: For guidance on setting Hive job concurrency, see How do I estimate the upper limit for the concurrency of Hive jobs?
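
As a rough illustration of the custom software configuration option, the sketch below generates a JSON document from a Python dictionary. The four field names (`ApplicationName`, `ConfigFileName`, `ConfigItemKey`, `ConfigItemValue`) and the helper `build_config_items` are assumptions made for this example. Verify the exact schema for your EMR version in Configure custom software before using the output.

```python
import json


def build_config_items(overrides):
    """Flatten {(application, file): {key: value}} into a list of config items.

    The field names below are assumptions about the expected JSON layout.
    """
    items = []
    for (application, file_name), settings in overrides.items():
        for key, value in settings.items():
            items.append(
                {
                    "ApplicationName": application,
                    "ConfigFileName": file_name,
                    "ConfigItemKey": key,
                    "ConfigItemValue": str(value),  # values are written as strings
                }
            )
    return items


# Example overrides; the property names are standard YARN and Hive settings.
overrides = {
    ("YARN", "yarn-site.xml"): {"yarn.nodemanager.resource.cpu-vcores": 8},
    ("HIVE", "hive-site.xml"): {"hive.exec.parallel": "true"},
}
config_json = json.dumps(build_config_items(overrides), indent=2)
print(config_json)
```

Generating the file from a dictionary keeps each override in one place and avoids hand-editing JSON, which is easy to break with a missing comma or quote.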

Legacy cluster types

Note

If you created an EMR cluster for the first time after 17:00 (UTC+8) on December 19, 2022, the following cluster types are unavailable.

Machine Learning (Data Science): Designed for big data and AI workloads. Includes a distributed deep learning framework, 200+ classic machine learning algorithm packages, AutoML capabilities, and 10+ deep learning algorithms covering recommendation and advertising scenarios.
Old Data Lake: Builds large-scale data processing pipelines. Supports the following cluster types:
- Hadoop: A comprehensive set of open source components, fully compatible with the Hadoop ecosystem. Supports offline processing, real-time processing, and interactive search, along with data lake architecture and JindoFS acceleration.
- ZooKeeper: An independent distributed coordination service for large-scale Hadoop, HBase, and Kafka clusters.
- Presto: A memory-based distributed SQL engine for interactive queries across petabyte-scale data and multiple data sources.

Hardware configuration

Configuration	Description
Billing method	Defaults to Subscription. Choose based on your use case:<br><br>- Pay-as-you-go: Billed hourly based on actual usage. Best for short-term testing or flexible workloads.<br>- Subscription: Pre-paid. For subscription instances, select a Subscription Duration and whether to enable Auto-renewal. Auto-renewal defaults to enabled with a six-month renewal period and runs seven days before expiry. For more information, see Renewal policy.<br><br>Tip: Use pay-as-you-go for testing. After validating your setup, create a subscription cluster for production.
Zone	A distinct physical area within a region. Zones in the same region communicate over the internal network. The default zone works for most cases.
VPC	An isolated virtual private cloud (VPC) network you control. Select an existing VPC, or click Create VPC to create one in the VPC console. For more information, see Create and manage a VPC.<br><br>Note: The cluster's private IP is bound to the VPC and cannot be changed after cluster creation.
vSwitch	A basic network module that connects cloud resources within a VPC. Select an existing vSwitch, or click Create vSwitch to create one in the VPC console. For more information, see Create and manage a vSwitch.
Default security group	A virtual firewall controlling inbound and outbound traffic. Select an existing security group, or click create a new security group to create one in the ECS console. For more information, see Security group overview and Create a security group.<br><br>Important: Do not use advanced security groups created on ECS.
Node group	Configure node groups based on your workload. See Node group reference below for details.
Cluster scaling	Select a scaling rule:<br><br>- Do Not Use Auto Scaling Rules (default)<br>- Custom Auto Scaling Rule: Scales based on time or load metrics. See Create a custom auto scaling rule.<br>- Managed Auto Scaling Rule: Pre-allocates task nodes when the cluster starts. See Create a managed auto scaling rule.<br><br>Note: Scaling rules apply only when the task node billing method is pay-as-you-go or spot instance. Clusters containing Trino, Presto, StarRocks, Impala, or ClickHouse cannot use managed scaling rules.
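
The scaling constraints in the Cluster scaling row can be summarized as a small eligibility check. This is a sketch only: the function name, the rule labels (`none`, `custom`, `managed`), and the billing labels are hypothetical and do not correspond to console or API values.

```python
# Services that rule out managed scaling, as listed in the note above.
MANAGED_SCALING_BLOCKERS = {"Trino", "Presto", "StarRocks", "Impala", "ClickHouse"}
# Task node billing methods for which scaling rules apply (labels are hypothetical).
SCALABLE_BILLING = {"pay-as-you-go", "spot"}


def allowed_scaling_rules(task_billing, services):
    """Return the scaling rule choices available for a planned cluster."""
    rules = ["none"]  # not using auto scaling is always possible
    if task_billing not in SCALABLE_BILLING:
        return rules
    rules.append("custom")
    if not (MANAGED_SCALING_BLOCKERS & set(services)):
        rules.append("managed")
    return rules
```

For example, a cluster with pay-as-you-go task nodes that includes StarRocks gets `["none", "custom"]`, while subscription task nodes always get `["none"]`.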

Node group reference

Node group	Role	Default instances	Notes
Master	Runs control processes (ResourceManager, NameNode)	1; multiple when high availability is enabled	Added to a deployment set by default when high availability is enabled
Core	Stores all cluster data	2; adjustable	Can scale out after cluster creation
Task	Provides additional compute capacity (no data storage)	Disabled by default	Supported billing methods: pay-as-you-go, spot instance, subscription

For each node group, configure:

System disk: Standard SSD, Enhanced SSD (ESSD), or ultra disk. For ESSD, set the performance level (PL). System disks support PL0, PL1 (default), and PL2.
Data disk: Standard SSD, ESSD, or ultra disk. For ESSD, set the PL. Data disks support PL0, PL1 (default), PL2, and PL3. For more information, see Cloud disk overview.
Additional security group: Attach up to two additional security groups per node group to control access to external resources.
Assign public network IP: Disabled by default. Assigns an elastic IP address (EIP) to the cluster. Only DataLake clusters support public IPs at the node group level. To access the cluster via a public IP after creation without enabling this option, apply for a public IP on ECS. See Apply for an EIP.
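
The performance level rules for system and data disks can be expressed as a lookup. The sketch below is illustrative: the function name and the labels `system`, `data`, and `essd` are assumptions chosen for this example, not console or API values.

```python
from typing import Optional

# ESSD performance levels per disk role, as listed above.
ESSD_LEVELS = {
    "system": ("PL0", "PL1", "PL2"),
    "data": ("PL0", "PL1", "PL2", "PL3"),
}
DEFAULT_LEVEL = "PL1"


def resolve_performance_level(
    disk_role: str, category: str, level: Optional[str] = None
) -> Optional[str]:
    """Return the effective performance level, or None for non-ESSD disks."""
    if disk_role not in ESSD_LEVELS:
        raise ValueError(f"unknown disk role: {disk_role!r}")
    if category != "essd":
        if level is not None:
            raise ValueError("performance levels apply only to ESSDs")
        return None
    chosen = level or DEFAULT_LEVEL
    if chosen not in ESSD_LEVELS[disk_role]:
        raise ValueError(f"{chosen} is not available for {disk_role} disks")
    return chosen
```

With this check, requesting PL3 for a system disk raises an error, while a data disk accepts it.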

For guidance on selecting instance types, see ECS instance types and Instance families.

Basic configuration

Configuration	Description
Cluster name	1–64 characters. Accepts Chinese characters, letters, digits, hyphens (-), and underscores (_). The cluster name is the only configuration you can change after creation.
Identity credentials	Controls SSH access to the master node. See Log on to a cluster for logon steps.<br><br>- Key Pair (default): Select an existing key pair, or click Create Key Pair to create one. Consists of a public key and a private key; supported on Linux instances only. See SSH key pairs.<br>- Password: Sets the root account password for the master node. Length: 8–30 characters, and must include uppercase letters, lowercase letters, digits, and special characters (! @ # $ % ^ & *).
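
The naming and password rules above can be checked before you submit the form. The following is a minimal sketch with hypothetical function names. It assumes that "Chinese characters" means the basic CJK Unified Ideographs range (U+4E00 to U+9FFF), and it only checks password length and the four required character classes. It does not reject other characters, because the rules above do not say whether they are allowed.

```python
import re
import string

# 1-64 characters: Chinese characters (assumed range), letters, digits, '-', '_'.
NAME_PATTERN = re.compile(r"[\u4e00-\u9fffA-Za-z0-9_-]{1,64}")
SPECIAL_CHARS = set("!@#$%^&*")


def is_valid_cluster_name(name: str) -> bool:
    """Check the documented length and character rules for a cluster name."""
    return NAME_PATTERN.fullmatch(name) is not None


def is_valid_password(password: str) -> bool:
    """Check length (8-30) and presence of all four required character classes."""
    return (
        8 <= len(password) <= 30
        and any(c in string.ascii_uppercase for c in password)
        and any(c in string.ascii_lowercase for c in password)
        and any(c in string.digits for c in password)
        and any(c in SPECIAL_CHARS for c in password)
    )
```

For example, `is_valid_cluster_name("bad name")` returns `False` because spaces are not in the allowed character set.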

Advanced settings (optional)

Configuration	Description
ECS application role	Allows programs running on EMR compute nodes to access other Alibaba Cloud services (such as OSS) without an AccessKey. EMR automatically requests a temporary AccessKey. This role controls the permissions for that AccessKey.
Bootstrap actions	Scripts executed before the cluster starts. Use them to install third-party software or modify the runtime environment. See Execute a script using a bootstrap action.
Release protection	Prevents accidental cluster deletion. When enabled, you must disable release protection before releasing the cluster. Available for pay-as-you-go clusters at creation or at any time after creation. See Enable and disable release protection.
Tags	Attach tags to identify and manage cluster resources. Tags can also be added after cluster creation. See Set tags.
Resource group	Assign the cluster to a resource group organized by purpose, permission, or ownership. See Use resource groups.
Data disk encryption	Encrypts data in transit and at rest on the data disk. Can only be enabled at cluster creation. See Enable data disk encryption.
System disk encryption	Encrypts the operating system, program files, and system data on the system disk. Can only be enabled at cluster creation. See Enable system disk encryption.
Remarks	Notes about the cluster. Edit this field on the Basic Information page after creation if not set at creation time.

Save as cluster template (optional)

If you selected Key Pair under Identity credentials, you can save the current configuration as a reusable template.

Click Save as Cluster Template.

In the dialog box, fill in the following:

Parameter	Description
Cluster template name	1–64 characters. Accepts Chinese characters, letters, digits, hyphens (-), and underscores (_).
Cluster template resource group	Select an existing resource group. To create a new one, click Create Resource Group. See Create a resource group.

Click OK.

The template is added to the Manage Cluster Templates panel. For more information, see Create a cluster template.

FAQ

The "EntityNotExist.Role" error appears when creating a cluster

This error means the account has not been granted the roles or permissions required to create an EMR cluster.

Alibaba Cloud account: Click Cloud Resource Access Authorization to grant the required roles (AliyunEmrEcsDefaultRole, AliyunEMRDefaultRole, AliyunECSInstanceForEMRRole, AliyunECSInstanceForEMRStudioRole). For more information, see Authorize an Alibaba Cloud account.

RAM user: Use the Alibaba Cloud account to grant the AliyunEMRFullAccess policy to the RAM user. For more information, see Grant permissions to a RAM user.

What's next

Add a service — Add services to a running cluster.
Log on to a cluster — Access the master node via SSH.
FAQ about cluster management — Common cluster creation and management questions.
FAQ — Frequently asked questions about individual components.