
E-MapReduce: Get started with EMR on ACK

Last Updated: Mar 26, 2026

Run your first Spark job on E-MapReduce (EMR) on Container Service for Kubernetes (ACK). This guide walks you through assigning the required role, creating a Spark cluster, and submitting a Spark job using a custom resource definition (CRD).

This guide uses a JAR file pre-packaged in the EMR image. To use your own JAR file, upload it to Object Storage Service (OSS) and replace local:///opt/spark/examples/spark-examples.jar in the job manifest with your OSS path: oss://<yourBucketName>/<path>.jar. For upload instructions, see Simple upload.
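For example, after uploading a JAR file to OSS, the corresponding manifest entry might look like the following. The bucket name and object path below are placeholders, not values from your account:

```yaml
spec:
  # Placeholder OSS path; replace the bucket name and object key with your own
  mainApplicationFile: "oss://my-bucket/jars/my-spark-job.jar"
```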

Prerequisites

Before you begin, ensure that you have:

  • An ACK cluster (dedicated or managed). See Create an ACK dedicated cluster or Create an ACK managed cluster

  • The AliyunOSSFullAccess and AliyunDLFFullAccess policies attached to your Alibaba Cloud account. AliyunOSSFullAccess allows EMR to read and write job artifacts in OSS; AliyunDLFFullAccess allows EMR to interact with Data Lake Formation metadata. See Attach policies to a RAM role

  • kubectl configured to connect to your ACK cluster. Note your cluster namespace — you need it when submitting the job

  • (Optional) OSS activated, if you plan to store JAR files in OSS. See Activate OSS
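After configuring kubectl, a quick sanity check confirms connectivity. This is a sketch that assumes your kubeconfig already points at the ACK cluster:

```shell
# List cluster endpoints to confirm kubectl can reach the ACK cluster
kubectl cluster-info

# List namespaces; the EMR cluster namespace should appear here after Step 2
kubectl get namespaces
```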

Before you start, collect the following values:

Value | Where to find it
ACK cluster name | ACK console, cluster list
Cluster namespace | EMR console > Cluster Details tab, after cluster creation
OSS bucket name | OSS console, bucket list

Step 1: Assign a role

Assign the system default role AliyunEMROnACKDefaultRole to your Alibaba Cloud account. This role grants EMR on ACK the permissions it needs to manage compute resources in your ACK cluster. For instructions, see Assign a role to an Alibaba Cloud account.

Step 2: Create a cluster

Create a Spark cluster on the EMR on ACK page. For full parameter reference, see Create a cluster.

  1. Log on to the EMR console. In the left-side navigation pane, click EMR on ACK.

  2. Click Create Cluster.

  3. On the E-MapReduce on ACK page, configure the following parameters.

    To associate a Spark cluster with a Shuffle Service cluster, both must share the same major EMR version (for example, EMR-5.x-ack with EMR-5.x-ack).
    To configure dedicated nodes for EMR workloads, click Configure Dedicated Nodes. Configure taints and labels on a node pool rather than individual nodes. If no node pool exists, create one first. See Create a node pool and Node pool overview.
    Parameter | Example | Description
    Region | China (Hangzhou) | The region where the cluster is created. Cannot be changed after creation.
    Cluster Type | Spark | The compute framework. Spark supports extract, transform, and load (ETL), batch processing, and data modeling.
    Product Version | EMR-5.6.0-ack | The EMR version. Defaults to the latest version.
    Component Version | SPARK (3.2.1) | The component type and version deployed in the cluster.
    ACK Cluster | Emr-ack | Select an existing ACK cluster. The same ACK cluster cannot be associated with multiple clusters of the same type.
    OSS Bucket | oss-spark-test | Select an existing bucket or create one in the OSS console.
    Cluster Name | Emr-Spark | 1–64 characters. Allowed: letters, digits, hyphens (-), and underscores (_).
  4. Click Create. The cluster is ready when its status changes to Running.

Step 3: Submit a job

Submit a Spark job to the cluster using a SparkApplication CRD manifest. For other job types, see Submit a Spark job, Use the CLI to submit a Presto job, and Submit a Flink job.

  1. Connect to your ACK cluster using kubectl. See Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.

  2. Create a file named spark-pi.yaml with the following content.

    apiVersion: "sparkoperator.k8s.io/v1beta2"
    kind: SparkApplication
    metadata:
      name: spark-pi-simple
    spec:
      type: Scala
      sparkVersion: 3.2.1
      mainClass: org.apache.spark.examples.SparkPi
      mainApplicationFile: "local:///opt/spark/examples/spark-examples.jar"
      arguments:
        - "1000"
      driver:
        cores: 1
        coreLimit: 1000m
        memory: 4g
      executor:
        cores: 1
        coreLimit: 1000m
        memory: 8g
        memoryOverhead: 1g
        instances: 1

    This example uses Spark 3.2.1 for EMR V5.6.0. Adjust sparkVersion if you use a different version. For a full field reference, see spark-on-k8s-operator API docs.

  3. Submit the job.

    kubectl apply -f spark-pi.yaml --namespace <namespace>

    Replace <namespace> with the namespace of your EMR cluster. To find it, log on to the EMR console and go to the Cluster Details tab. A successful submission returns:

    sparkapplication.sparkoperator.k8s.io/spark-pi-simple created

    spark-pi-simple is the name of the submitted Spark job.
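    You can track the job after submission by querying the SparkApplication object with kubectl. This is a sketch: the driver pod name follows the operator's <name>-driver convention, and <namespace> is the namespace of your EMR cluster.

    ```shell
    # Check the state of the SparkApplication (SUBMITTED, RUNNING, COMPLETED, ...)
    kubectl get sparkapplication spark-pi-simple --namespace <namespace>

    # Inspect events and status details if the job does not start
    kubectl describe sparkapplication spark-pi-simple --namespace <namespace>

    # Follow the driver log; the SparkPi result appears here when the job finishes
    kubectl logs -f spark-pi-simple-driver --namespace <namespace>
    ```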

  4. Optional: View details of the submitted Spark job on the Job Details tab in the EMR console.
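When you no longer need the job object, deleting the SparkApplication removes the driver and executor pods it created. The cluster itself is released separately, as described in Step 4:

```shell
# Remove the finished job object and its pods from the EMR cluster namespace
kubectl delete sparkapplication spark-pi-simple --namespace <namespace>
```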

Step 4: (Optional) Release the cluster

Release the cluster when you no longer need it to avoid unnecessary charges.

  1. On the EMR on ACK page, find the cluster and click Release in the Actions column.

  2. In the Release Cluster dialog, click OK.

What's next

  • View cluster information — check all clusters in your account.

  • View jobs — monitor and manage jobs running in your cluster.

  • Quick start — learn more about Spark self-contained applications (select your language and Spark version).