
E-MapReduce: Submit a Spark job

Last Updated: Aug 29, 2023

Alibaba Cloud E-MapReduce (EMR) allows you to submit a Spark job by using a CustomResourceDefinition (CRD), by running the spark-submit command, or in the EMR console. This topic describes each of these methods.

Prerequisites

A Spark cluster is created on the EMR on ACK page. For more information, see Create a cluster.

Precautions

In this topic, the desired JAR file is packaged into an image. If you are using your own JAR file, you can upload the JAR file to Alibaba Cloud Object Storage Service (OSS). For more information about how to upload a file, see Simple upload.

In that case, replace local:///opt/spark/examples/spark-examples.jar in the commands with the actual OSS path of the JAR file, in the oss://<yourBucketName>/<path>.jar format.
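For example, in the SparkApplication manifest used in Method 1, the mainApplicationFile field would then point to OSS instead of the image, and the JAR path in the spark-submit commands would change in the same way (the bucket name and path are placeholders):

    mainApplicationFile: "oss://<yourBucketName>/<path>.jar"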

Submit a Spark job

Method 1: Submit a Spark job by using a CRD

  1. Connect to an Alibaba Cloud Container Service for Kubernetes (ACK) cluster by using kubectl. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.

  2. Create a job file named spark-pi.yaml. The following code shows the content in the file:

    apiVersion: "sparkoperator.k8s.io/v1beta2"
    kind: SparkApplication
    metadata:
      name: spark-pi-simple
    spec:
      type: Scala
      sparkVersion: 3.2.1
      mainClass: org.apache.spark.examples.SparkPi
      mainApplicationFile: "local:///opt/spark/examples/spark-examples.jar"
      arguments:
        - "1000"
      driver:
        cores: 1
        coreLimit: 1000m
        memory: 4g
      executor:
        cores: 1
        coreLimit: 1000m
        memory: 8g
        memoryOverhead: 1g
        instances: 1

    For information about the fields in the code, see spark-on-k8s-operator.

    Note
    • You can specify a custom file name. In this example, spark-pi.yaml is used.

    • In this example, Spark 3.2.1 for EMR V5.6.0 is used. If you use another version of Spark, configure the sparkVersion parameter based on your business requirements.
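    Optionally, you can validate the manifest before you submit it by performing a client-side dry run. This is standard kubectl behavior and does not create any resources:

    kubectl apply -f spark-pi.yaml --dry-run=client --namespace <Namespace in which the cluster resides>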

  3. Run the following command to submit a job:

    kubectl apply -f spark-pi.yaml --namespace <Namespace in which the cluster resides>

    Replace <Namespace in which the cluster resides> with the actual namespace of your cluster. To view the namespace, log on to the EMR console and go to the Cluster Details tab.

    The following information is returned:

    sparkapplication.sparkoperator.k8s.io/spark-pi-simple created
    Note

    spark-pi-simple is the name of the submitted Spark job.
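    Because the job is represented by a SparkApplication custom resource, you can also inspect or stop it with standard kubectl commands. A minimal sketch:

    # Check the state of the job as reported by the Spark operator.
    kubectl get sparkapplication spark-pi-simple --namespace <Namespace in which the cluster resides>
    # Terminate the job by deleting the custom resource.
    kubectl delete sparkapplication spark-pi-simple --namespace <Namespace in which the cluster resides>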

  4. Optional. View the information about the submitted Spark job on the Job Details tab.

Method 2: Submit a Spark job by running the spark-submit command

  1. Connect to an Alibaba Cloud ACK cluster by using kubectl. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.

  2. Run the following commands to download the emr-spark-ack tool that is provided by EMR and grant execute permissions on the tool:

    wget https://ecm-repo-cn-hangzhou.oss-cn-hangzhou.aliyuncs.com/emr-on-ack/util/emr-spark-ack
    chmod 755 emr-spark-ack
  3. Submit a Spark job by using the emr-spark-ack tool.

    Syntax:

     ./emr-spark-ack -n <Namespace in which the cluster resides> <Spark command>
    Note

    You can replace <Spark command> with a spark-submit, spark-sql, spark-shell, or pyspark command, depending on how you want to run Spark.

    • Cluster mode:

      Run the spark-submit command to submit a job named spark-pi-submit. Syntax:

      ./emr-spark-ack -n <Namespace in which the cluster resides> spark-submit \
          --name spark-pi-submit \
          --deploy-mode cluster \
          --class org.apache.spark.examples.SparkPi \
          local:///opt/spark/examples/spark-examples.jar \
          1000
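      Because <Spark command> here is an ordinary spark-submit command, standard spark-submit options such as --conf should pass through as usual. The following sketch adds example resource settings; the values are illustrative, not recommendations:

      ./emr-spark-ack -n <Namespace in which the cluster resides> spark-submit \
          --name spark-pi-submit \
          --deploy-mode cluster \
          --class org.apache.spark.examples.SparkPi \
          --conf spark.executor.instances=2 \
          --conf spark.executor.memory=4g \
          local:///opt/spark/examples/spark-examples.jar \
          1000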
    • Client mode:

      • Run the spark-sql command to submit a Spark job. Syntax:

        # Prepare an SQL file on your on-premises machine.
        echo "select 1+1">test.sql
        # Submit a Spark job.
        ./emr-spark-ack -n <Namespace in which the cluster resides> spark-sql -f test.sql

        In Spark 3 or later for EMR V5.X, the emr-spark-ack tool automatically uploads the local files that a submit command depends on, including files specified by options such as --jars, --files, and -f, to the Spark cluster. This lets you submit a Spark job that references local files to an ACK cluster, as shown in the following example.

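        For example, a job that depends on an additional local JAR could be submitted as follows; my-udfs.jar is a hypothetical local file, and both it and test.sql would be uploaded automatically:

        ./emr-spark-ack -n <Namespace in which the cluster resides> spark-sql \
            --jars my-udfs.jar \
            -f test.sql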

      • Run the spark-shell command to submit a Spark job:

        ./emr-spark-ack -n <Namespace in which the cluster resides> spark-shell

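        After the shell starts, you can run standard Spark code interactively. A minimal sketch that uses the SparkContext (sc) provided by spark-shell:

        // Distribute a small dataset across the cluster and aggregate it.
        val rdd = sc.parallelize(1 to 1000)
        println(rdd.map(_ * 2).sum())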

  4. Optional. View the information about the submitted job on the Job Details tab.

  5. Optional. Use the emr-spark-ack tool to terminate the Spark job.

    Syntax:

     ./emr-spark-ack -n <Namespace in which the cluster resides> kill <Spark_app_id>
    Note

    The emr-spark-ack tool generates the Spark application ID (<Spark_app_id>) when you submit a job. You can find the ID in the output log of the submission.
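    For example, if the output log showed an application ID of spark-a1b2c3 (a hypothetical value; use the ID from your own log), you would run:

    ./emr-spark-ack -n <Namespace in which the cluster resides> kill spark-a1b2c3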

Method 3: Submit a Spark job in the EMR console

  1. Go to the Access Links and Ports tab.

    1. Log on to the EMR console. In the left-side navigation pane, click EMR on ACK.

    2. On the EMR on ACK page, find the desired cluster and click the name of the cluster in the Cluster ID/Name column.

    3. On the page that appears, click the Access Links and Ports tab.

  2. On the Access Links and Ports tab, click the link that corresponds to SparkSubmitGateway UI in the Access URL column.

    Then, you can go to the Shell terminal.

  3. In the Shell terminal, run one of the following commands:

    • spark-sql command:

      spark-sql

      After you run the spark-sql command, you can perform interactive queries.

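      For example, you can run ordinary Spark SQL statements directly in the session; the following statements are illustrative:

      SELECT 1 + 1;
      SHOW DATABASES;
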
    • spark-submit command:

      spark-submit \
          --name spark-pi-submit \
          --deploy-mode cluster \
          --class org.apache.spark.examples.SparkPi \
          local:///opt/spark/examples/spark-examples.jar \
          1000
  4. Optional. View the information about the submitted Spark job on the Job Details tab.
