This topic describes how to use an Alibaba Cloud account to log on to the Alibaba Cloud E-MapReduce (EMR) console, create clusters on the EMR on ACK page, and then run jobs in the console.
In this topic, the JAR file that is used in the example is packaged into an image. If you want to use your own JAR file, you can upload it to Alibaba Cloud Object Storage Service (OSS). For more information about how to upload a file, see Simple upload. In this case, you must replace local:///opt/spark/examples/spark-examples.jar in the sample code with the actual OSS path of the JAR file. The path is specified in the mainApplicationFile parameter of the job file.
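For example, if the JAR file is stored in OSS, the relevant line of the job file in Step 3 would look similar to the following sketch. The bucket name and object path are hypothetical placeholders, and the sketch assumes that the cluster can access OSS:

  mainApplicationFile: "oss://<your-bucket>/<path>/spark-examples.jar"   # replace with the actual OSS path of the JAR file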
Before you can create a cluster on the EMR on ACK page, you must complete the following operations:
Attach the AliyunOSSFullAccess and AliyunDLFFullAccess policies to the Alibaba Cloud account. For more information, see Attach policies to a RAM role.
If you want to store the JAR file in OSS, you must activate OSS first. For more information, see Activate OSS.
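If you store the JAR file in OSS, you can upload it in the OSS console or by using the ossutil command-line tool. The following command is a minimal sketch that assumes ossutil is installed and configured; the bucket name and path are hypothetical placeholders:

  ossutil cp spark-examples.jar oss://<your-bucket>/<path>/spark-examples.jar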
Step 1: Assign a role
Before you can use EMR on ACK, your Alibaba Cloud account must be assigned the system default role AliyunEMROnACKDefaultRole. For more information, see Assign a role to an Alibaba Cloud account.
Step 2: Create a cluster
Create a Spark cluster on the EMR on ACK page. For more information, see Create a cluster.
Log on to the EMR console. In the left-side navigation pane, click EMR on ACK.
On the EMR on ACK page, click Create Cluster.
On the E-MapReduce on ACK page, configure the following parameters:
Region: The region in which you want to create the cluster. You cannot change the region after the cluster is created.
Cluster Type: The type of the cluster. Spark is a common distributed big data processing engine that provides various capabilities, such as extract, transform, and load (ETL), batch processing, and data modeling.
Important: If you want to associate a Spark cluster with a Shuffle Service cluster, the major EMR versions of the two clusters must be the same. For example, a Spark cluster whose EMR version is EMR-5.x-ack can be associated only with a Shuffle Service cluster whose EMR version is EMR-5.x-ack.
Product Version: The version of EMR. By default, the latest version is selected.
Component: The type and version of each component that is deployed in a cluster of the selected type.
ACK Cluster: Select an existing ACK cluster or create an ACK cluster in the ACK console. You can click Configure Dedicated Nodes to configure an EMR-dedicated node or node pool by adding taints and labels to the node or node pool. This way, the node or node pool can be used only for EMR.
OSS Bucket: Select an existing bucket or create a bucket in the OSS console.
Cluster Name: The name of the cluster. The name must be 1 to 64 characters in length and can contain only letters, digits, hyphens (-), and underscores (_).
After you configure the parameters, create the cluster. If the status of the cluster changes to Running, the cluster is created.
Step 3: Submit a job
After the cluster is created, you can submit jobs. This section describes how to submit a Spark job by using a custom resource definition (CRD). For more information about Spark, see Quick Start. When you read Quick Start, select a programming language and a Spark version.
For information about how to submit different types of jobs, see the following topics:
Connect to the ACK cluster by using kubectl. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.
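After you obtain the kubeconfig file, you can verify the connection by listing the nodes of the ACK cluster. The kubeconfig path in the following sketch is a placeholder:

  export KUBECONFIG=/path/to/kubeconfig   # point kubectl at the downloaded kubeconfig file
  kubectl get nodes                       # if the connection works, the nodes of the ACK cluster are listed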
Create a job file named spark-pi.yaml. The following code shows the content of the file:
apiVersion: "sparkoperator.k8s.io/v1beta2" kind: SparkApplication metadata: name: spark-pi-simple spec: type: Scala sparkVersion: 3.2.1 mainClass: org.apache.spark.examples.SparkPi mainApplicationFile: "local:///opt/spark/examples/spark-examples.jar" arguments: - "1000" driver: cores: 1 coreLimit: 1000m memory: 4g executor: cores: 1 coreLimit: 1000m memory: 8g memoryOverhead: 1g instances: 1
For information about the fields in the code, see spark-on-k8s-operator.
Note:
- You can specify a custom file name. In this example, spark-pi.yaml is used.
- In this example, Spark 3.2.1 for EMR V5.6.0 is used. If you use another version of Spark, configure the sparkVersion parameter based on your business requirements.
Run the following command to submit a job:
kubectl apply -f spark-pi.yaml --namespace <Namespace in which the cluster resides>
Replace <Namespace in which the cluster resides> with the actual namespace. To view the namespace, log on to the EMR console and go to the Cluster Details tab.
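For example, if the namespace of the cluster is c-d2232277b8e6****, the command looks like the following. The namespace is a hypothetical placeholder:

  kubectl apply -f spark-pi.yaml --namespace c-d2232277b8e6****   # hypothetical namespace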
Output similar to the following is returned:

  sparkapplication.sparkoperator.k8s.io/spark-pi-simple created

spark-pi-simple is the name of the submitted Spark job.
Optional. View the information about the submitted Spark job on the Job Details tab.
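You can also check the job from the command line. The following sketch assumes the spark-on-k8s-operator conventions, in which the driver pod is named after the job with a -driver suffix:

  kubectl get sparkapplication spark-pi-simple --namespace <Namespace in which the cluster resides>
  kubectl logs spark-pi-simple-driver --namespace <Namespace in which the cluster resides>   # the driver log contains the computed value of Pi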
Step 4: (Optional) Release the cluster
If you no longer require a cluster, you can release the cluster to reduce costs.
On the EMR on ACK page, find the cluster that you want to release and click Release in the Actions column.
In the Release Cluster message, click OK.