This topic describes how to create an EMR on ACK cluster using the EMR console. EMR on ACK runs big data workloads — Spark, Presto, Flink, and Shuffle Service — on Container Service for Kubernetes (ACK) clusters, letting you separate compute and storage and use Kubernetes for resource scheduling.
Prerequisites
Complete the following steps before creating a cluster. If you have already completed a step, skip it.
Attach the AliyunOSSFullAccess and AliyunDLFFullAccess policies to a RAM role. For more information, see Attach policies to a RAM role.
Create an ACK cluster. For more information, see Create an ACK dedicated cluster or Create an ACK managed cluster.
Create a node pool. For more information, see Create a node pool.
Activate Object Storage Service (OSS). For more information, see Activate OSS.
Choose a cluster type
EMR on ACK supports four cluster types. The cluster type cannot be changed after creation, so select the type that matches your workload before proceeding.
| Cluster type | Best for | Key characteristics |
|---|---|---|
| Shuffle Service | Spark clusters on ACK nodes without local disks | Provides a remote shuffle service using Celeborn. Requires nodes from big data instance families or instance families with local SSDs. Supports dynamic resource allocation. |
| Presto | Interactive queries on large datasets | An in-memory distributed SQL engine that supports various data sources, suitable for complex analysis of petabytes of data and cross-source queries. |
| Spark | ETL, batch processing, and data modeling | A general-purpose distributed big data processing engine. To associate a Spark cluster with a Shuffle Service cluster, both clusters must use the same EMR version — for example, EMR-5.x-ack. |
| Flink | Stateful processing on bounded or unbounded data streams | Developed based on EMR on ACK and Flink Kubernetes Operator 1.0.1. Uses the Flink Enterprise Edition kernel by default, requiring no additional configuration. |
Shuffle Service requirements:
Nodes in the dedicated node pool or the associated ACK cluster must belong to big data instance families or instance families with local SSDs. Otherwise, the remote shuffle service fails to deploy.
Shuffle Service clusters include a built-in cleanup task named rss-pvc-clean that automatically removes unused PersistentVolumeClaim (PVC) resources, preventing storage consumption by stale data.
Create a cluster
Log on to the EMR console. In the left-side navigation pane, click EMR on ACK.
On the EMR on ACK page, click Create Cluster.
On the E-MapReduce on ACK page, configure the parameters described in the following table.
Important: The region cannot be changed after the cluster is created.
| Parameter | Description |
|---|---|
| Region | The region where the cluster is created. |
| Cluster type | The type of the cluster. For details, see Choose a cluster type. |
| Product version | The version of EMR. The latest version is selected by default. Keep the default unless you need a specific version. |
| Component version | The type and version of the component deployed in the cluster of the specified type. |
| ACK cluster | Select an existing ACK cluster or create one in the ACK console. To dedicate nodes to EMR workloads, click Configure Dedicated Nodes to add taints and labels to a node pool. If no node pool is available, create a node pool. |
| OSS bucket | Select an existing bucket or create one in the OSS console. |
| Cluster name | A name for the cluster. Must be 1–64 characters and can contain only letters, digits, hyphens (-), and underscores (_). Example: my-emr-cluster. |
Click Create.
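The naming rule for the Cluster name parameter can be checked locally before you fill in the form, for example when cluster names are generated by a script. The following is a minimal sketch, not part of the EMR console or any EMR SDK. The function name is_valid_cluster_name is hypothetical, and the sketch assumes that "letters" means ASCII letters (A-Z, a-z); the console may accept a wider character set.

```python
import re

# Assumption: "letters" is interpreted as ASCII letters only.
# The pattern allows 1 to 64 characters drawn from letters, digits,
# hyphens (-), and underscores (_), as stated in the parameter table.
_CLUSTER_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_valid_cluster_name(name: str) -> bool:
    """Return True if name satisfies the documented cluster name rule.

    Hypothetical helper for illustration; it does not call any EMR API.
    """
    # fullmatch requires the entire string to match, so trailing
    # newlines or spaces cause the check to fail.
    return _CLUSTER_NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["my-emr-cluster", "bad name", ""]:
        print(repr(candidate), is_valid_cluster_name(candidate))
```

The check is deliberately strict: a name that passes here should satisfy the rule in the table, but a name that fails may still be accepted by the console if the ASCII-only assumption does not hold.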
Verify the cluster
After you click Create, the cluster enters a provisioning state.
In the left-side navigation pane, click EMR on ACK.
Locate the cluster in the cluster list.
When the cluster status changes to Running, the cluster is ready to use.