Create an EMR cluster based on Kubernetes and run jobs in the cluster

This topic describes how to use an Alibaba Cloud account to log on to the Alibaba Cloud E-MapReduce (EMR) console, create clusters on the EMR on ACK page, and then run jobs in the console.

Precautions

In this topic, the example JAR file is packaged into the image. If you use your own JAR file, you can upload it to Alibaba Cloud Object Storage Service (OSS). For more information about how to upload a file, see Simple upload.

In that case, replace local:///opt/spark/examples/spark-examples.jar in the job file with the path of the JAR file in OSS. The path is in the oss://<yourBucketName>/<path>.jar format.
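If you keep the job definition in a file, the path substitution described above can be scripted. The following Python sketch is illustrative only: the helper names oss_jar_uri and use_oss_jar are hypothetical, and the bucket name and object key are placeholders for your own values.

```python
# Hypothetical helpers for illustration; not part of EMR or any Alibaba Cloud SDK.
DEFAULT_JAR = "local:///opt/spark/examples/spark-examples.jar"


def oss_jar_uri(bucket: str, key: str) -> str:
    """Build a path in the oss://<yourBucketName>/<path>.jar format."""
    key = key.lstrip("/")
    if not bucket:
        raise ValueError("bucket name must not be empty")
    if not key.endswith(".jar"):
        raise ValueError("object key must point to a .jar file")
    return f"oss://{bucket}/{key}"


def use_oss_jar(job_text: str, bucket: str, key: str) -> str:
    """Replace the JAR path that is built into the image with the OSS path."""
    if DEFAULT_JAR not in job_text:
        raise ValueError("default JAR path not found in the job definition")
    return job_text.replace(DEFAULT_JAR, oss_jar_uri(bucket, key))


if __name__ == "__main__":
    # "my-bucket" and "jobs/my-app.jar" are placeholder values.
    line = f'mainApplicationFile: "{DEFAULT_JAR}"'
    print(use_oss_jar(line, "my-bucket", "jobs/my-app.jar"))
```

The sketch fails loudly when the default path is missing, so a job file that was already edited is not silently left unchanged.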

Preparations

Before you can create a cluster on the EMR on ACK page, you must perform the following operations in the Container Service for Kubernetes (ACK) console:

Create an ACK cluster. For more information, see Create an ACK dedicated cluster or Create an ACK managed cluster.
Attach the AliyunOSSFullAccess and AliyunDLFFullAccess policies to the Alibaba Cloud account. For more information, see Attach policies to a RAM role.

Note

If you want to store the JAR package in OSS, you must activate OSS first. For more information, see Activate OSS.

Step 1: Assign a role

Before you can use EMR on ACK, your Alibaba Cloud account must be assigned the system default role AliyunEMROnACKDefaultRole. For more information, see Assign a role to an Alibaba Cloud account.

Step 2: Create a cluster

Create a Spark cluster on the EMR on ACK page. For more information, see Create a cluster.

Log on to the EMR console. In the left-side navigation pane, click EMR on ACK.
On the EMR on ACK page, click Create Cluster.

On the E-MapReduce on ACK page, configure the parameters. The following table describes the parameters.

Parameter	Example	Description
Region	China (Hangzhou)	The region in which you want to create a cluster. You cannot change the region after the cluster is created.
Cluster Type	Spark	The type of the cluster. Spark is a common distributed big data processing engine that provides various capabilities, such as extract, transform, and load (ETL), batch processing, and data modeling. Important: If you want to associate a Spark cluster with a Shuffle Service cluster, the major EMR versions of the clusters must be the same. For example, a Spark cluster whose EMR version is EMR-5.x-ack can be associated with only a Shuffle Service cluster whose EMR version is EMR-5.x-ack.
Product Version	EMR-5.6.0-ack	The version of EMR. By default, the latest version is selected.
Component Version	SPARK (3.2.1)	Displays the type and version of the component that is deployed in the cluster of the specified type.
ACK Cluster	Emr-ack	Select an existing ACK cluster or create an ACK cluster in the ACK console. To reserve nodes for EMR, click Configure Dedicated Nodes. A node or node pool is dedicated to EMR by adding taints and labels to it. Note: We recommend that you configure dedicated nodes in a node pool. If no node pool is available, create one. For more information about how to create a node pool, see Create a node pool. For more information about node pools, see Node pool overview. Important: The same ACK cluster cannot be associated with multiple clusters of the same type that are created on the EMR on ACK page.
OSS Bucket	oss-spark-test	Select an existing bucket or create a bucket in the OSS console.
Cluster Name	Emr-Spark	The name of the cluster. The name must be 1 to 64 characters in length and can contain only letters, digits, hyphens (-), and underscores (_).
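Two constraints in the table can be checked before you open the console: the cluster name rule and the requirement that associated Spark and Shuffle Service clusters share a major EMR version. The following Python sketch shows one way to express them. The function names are hypothetical, "letters" is assumed to mean ASCII letters, and version strings are assumed to follow the EMR-5.6.0-ack pattern used in the example column.

```python
import re

# Assumption: "letters" means ASCII letters; the console may accept more.
_CLUSTER_NAME = re.compile(r"[A-Za-z0-9_-]{1,64}")
# Assumption: versions look like "EMR-5.6.0-ack", as in the table above.
_EMR_VERSION = re.compile(r"EMR-(\d+)\.\d+\.\d+-ack")


def is_valid_cluster_name(name: str) -> bool:
    """Check length (1 to 64) and the allowed character set."""
    return _CLUSTER_NAME.fullmatch(name) is not None


def emr_major_version(version: str) -> int:
    """Extract the major version number, for example 5 from EMR-5.6.0-ack."""
    match = _EMR_VERSION.fullmatch(version)
    if match is None:
        raise ValueError(f"unrecognized EMR version: {version!r}")
    return int(match.group(1))


def can_associate(spark_version: str, shuffle_version: str) -> bool:
    """Spark and Shuffle Service clusters must share the same major version."""
    return emr_major_version(spark_version) == emr_major_version(shuffle_version)


if __name__ == "__main__":
    print(is_valid_cluster_name("Emr-Spark"))
    print(can_associate("EMR-5.6.0-ack", "EMR-5.8.0-ack"))
```

The console remains the authority on what is accepted. These checks only mirror the rules as they are stated in the table.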

Click Create.
If the status of the cluster changes to Running, the cluster is created.

Step 3: Submit a job

After a cluster is created, you can submit jobs. This section describes how to submit a Spark job by using a custom resource definition (CRD). For more information about Spark, see Quick Start. When you view information in Quick Start, select a programming language type and a Spark version.

For information about how to submit different types of jobs, see the following topics:

Connect to the ACK cluster by using kubectl. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.

Create a job file named spark-pi.yaml. The following code shows the content in the file:

```
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi-simple
spec:
  type: Scala
  sparkVersion: 3.2.1
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/spark-examples.jar"
  arguments:
    - "1000"
  driver:
    cores: 1
    coreLimit: 1000m
    memory: 4g
  executor:
    cores: 1
    coreLimit: 1000m
    memory: 8g
    memoryOverhead: 1g
    instances: 1
```

For information about the fields in the code, see spark-on-k8s-operator.

Note

You can specify a custom file name. In this example, spark-pi.yaml is used.
In this example, Spark 3.2.1 for EMR V5.6.0 is used. If your cluster uses another version of Spark, set the sparkVersion parameter to that version.
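If you submit the same job against clusters that run different Spark versions, you can generate the job definition instead of editing it by hand. The following Python sketch mirrors the fields of the spark-pi.yaml example and emits JSON, which kubectl apply -f accepts in addition to YAML. The function name build_spark_pi_job is hypothetical, and the default values are copied from the example in this topic.

```python
import json


def build_spark_pi_job(
    spark_version: str = "3.2.1",
    jar: str = "local:///opt/spark/examples/spark-examples.jar",
) -> dict:
    """Return a SparkApplication resource with the same fields as spark-pi.yaml."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": "spark-pi-simple"},
        "spec": {
            "type": "Scala",
            "sparkVersion": spark_version,
            "mainClass": "org.apache.spark.examples.SparkPi",
            "mainApplicationFile": jar,
            "arguments": ["1000"],
            "driver": {"cores": 1, "coreLimit": "1000m", "memory": "4g"},
            "executor": {
                "cores": 1,
                "coreLimit": "1000m",
                "memory": "8g",
                "memoryOverhead": "1g",
                "instances": 1,
            },
        },
    }


if __name__ == "__main__":
    # Redirect the output to a file such as spark-pi.json before you submit it.
    print(json.dumps(build_spark_pi_job(), indent=2))
```

Only sparkVersion and mainApplicationFile are parameterized here because they are the two values that this topic tells you to adapt.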

Run the following command to submit a job:
```
kubectl apply -f spark-pi.yaml --namespace <Namespace in which the cluster resides>
```
Replace <Namespace in which the cluster resides> with the namespace of your EMR cluster. To view the namespace, log on to the EMR console and go to the Cluster Details tab.
The following information is returned:
```
sparkapplication.sparkoperator.k8s.io/spark-pi-simple created
```
Note
spark-pi-simple is the name of the submitted Spark job.
(Optional) View the information about the submitted Spark job on the Job Details tab.
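Besides the Job Details tab, you can query the job state from the command line. The following Python sketch wraps kubectl get sparkapplication and reads the state from the returned JSON. It assumes that the operator reports the state under status.applicationState.state, as the open source spark-on-k8s-operator does. The function names are hypothetical, and the namespace is a placeholder.

```python
import json
import subprocess
from typing import List


def get_job_command(job_name: str, namespace: str) -> List[str]:
    """Build the kubectl command that returns the SparkApplication as JSON."""
    return [
        "kubectl", "get", "sparkapplication", job_name,
        "--namespace", namespace, "-o", "json",
    ]


def extract_state(resource: dict) -> str:
    """Return status.applicationState.state, or UNKNOWN if it is not reported yet."""
    # Assumption: field layout of the open source spark-on-k8s-operator.
    return (
        resource.get("status", {})
        .get("applicationState", {})
        .get("state", "UNKNOWN")
    )


def job_state(job_name: str, namespace: str) -> str:
    """Run kubectl and return the job state. Requires a configured kubeconfig."""
    result = subprocess.run(
        get_job_command(job_name, namespace),
        capture_output=True, text=True, check=True,
    )
    return extract_state(json.loads(result.stdout))
```

For example, job_state("spark-pi-simple", "<Namespace in which the cluster resides>") returns the state of the job that was submitted in this step. The command and the parsing are kept in separate functions so that the parsing can be checked without access to a cluster.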

Step 4: (Optional) Release the cluster

If you no longer require a cluster, you can release the cluster to reduce costs.

On the EMR on ACK page, find the cluster that you want to release and click Release in the Actions column.
In the Release Cluster message, click OK.

References

For information about how to view clusters in the current Alibaba Cloud account, see View cluster information.
For information about how to view jobs in your cluster, see View jobs.