This article describes how to use the spark-submit command line interface (CLI) to submit a Spark job after E-MapReduce (EMR) Serverless Spark is connected to Elastic Compute Service (ECS).
Before you begin, make sure that Java Development Kit (JDK) 1.8 or later is installed.
1. Click emr-serverless-spark-tool-0.1.0-bin.zip to download the installation package.
2. Run the following command to decompress and install the spark-submit tool for EMR Serverless:
unzip emr-serverless-spark-tool-0.1.0-bin.zip
1. Run the following command to modify the configuration in connection.properties:
vim emr-serverless-spark-tool-0.1.0/conf/connection.properties
2. The following sample code provides an example of the key-value pairs in the file:
accessKeyId=yourAccessKeyId
accessKeySecret=yourAccessKeySecret
# securityToken=yourSecurityToken
regionId=cn-hangzhou
endpoint=emr-serverless-spark.cn-hangzhou.aliyuncs.com
workspaceId=w-xxxxxxxxxxxx
resourceQueueId=dev_queue
# networkServiceId=xxxxxx
releaseVersion=esr-2.1 (Spark 3.3.1, Scala 2.12, Java Runtime)
The following table describes the parameters:
Parameter | Required | Description |
---|---|---|
accessKeyId | Yes | The AccessKey ID of the Alibaba Cloud account or RAM user that is used to run the Spark job. |
accessKeySecret | Yes | The AccessKey secret of the Alibaba Cloud account or RAM user that is used to run the Spark job. |
securityToken | No | The Security Token Service (STS) token of the RAM user. Note: This parameter is required only if you use STS authentication. |
regionId | Yes | The region ID. In this example, the China (Hangzhou) region is used. |
endpoint | Yes | The endpoint of EMR Serverless Spark. Format: emr-serverless-spark.<yourRegionId>.aliyuncs.com. In this example, the China (Hangzhou) region is used, so the value is emr-serverless-spark.cn-hangzhou.aliyuncs.com. |
workspaceId | Yes | The ID of the EMR Serverless Spark workspace. |
resourceQueueId | No | The name of the queue. Default value: dev_queue. |
networkServiceId | No | The name of the network connection. Note: This parameter is required only if the Spark job needs to access a virtual private cloud (VPC). For more information, see Network connection between EMR Serverless Spark and other VPCs. |
releaseVersion | No | The version of EMR Serverless Spark. Default value: esr-2.1 (Spark 3.3.1, Scala 2.12, Java Runtime). |
1. Run the following command to go to the spark-submit tool directory:
cd emr-serverless-spark-tool-0.1.0
2. Submit the Spark job by using one of the following methods:
• Spark job launched from Java/Scala
In this example, the test JAR package spark-examples_2.12-3.3.1.jar is used. Click spark-examples_2.12-3.3.1.jar to download the test JAR package, and then upload it to Object Storage Service (OSS). The JAR package is a simple example provided by Spark that estimates the value of pi; a sketch of the computation it performs follows the command.
./bin/spark-submit --name SparkPi \
--queue dev_queue \
--num-executors 5 \
--driver-memory 1g \
--executor-cores 2 \
--executor-memory 2g \
--class org.apache.spark.examples.SparkPi \
oss://<yourBucket>/path/to/spark-examples_2.12-3.3.1.jar \
10000
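The trailing argument (10000 in this example) is the number of slices that the SparkPi example uses to parallelize its Monte Carlo estimate of pi. The following PySpark sketch illustrates, for reference only, roughly what the Scala SparkPi class computes; the submitted job itself runs the class from the JAR package, not this script.
import random
import sys

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("SparkPiSketch").getOrCreate()
    sc = spark.sparkContext

    # Number of slices (partitions), mirroring the trailing spark-submit argument.
    slices = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * slices  # total number of random points to sample

    def inside(_):
        # Sample a point in the square [-1, 1] x [-1, 1] and check whether it
        # falls inside the unit circle.
        x = random.random() * 2 - 1
        y = random.random() * 2 - 1
        return 1 if x * x + y * y <= 1 else 0

    count = sc.parallelize(range(n), slices).map(inside).reduce(lambda a, b: a + b)
    print("Pi is roughly %f" % (4.0 * count / n))

    spark.stop()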
• Spark job launched from PySpark
In this example, the test files DataFrame.py and employee.csv are used. Click DataFrame.py and employee.csv to download the test files, and then upload them to OSS.
Note
o The DataFrame.py file contains the code that is used to process data in Object Storage Service (OSS) by using the Apache Spark framework. A minimal sketch of such a script is provided after the following command.
o The employee.csv file contains data such as employee names, departments, and salaries.
./bin/spark-submit --name PySpark \
--queue dev_queue \
--num-executors 5 \
--driver-memory 1g \
--executor-cores 2 \
--executor-memory 2g \
--conf spark.tags.key=value \
--files oss://<yourBucket>/path/to/employee.csv \
oss://<yourBucket>/path/to/DataFrame.py \
10000
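The contents of DataFrame.py are not reproduced in this article. The following minimal sketch shows what such a script might look like, assuming that employee.csv has header columns named department and salary and that the file is read directly from OSS; adjust the path and column names to match the actual test files.
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("PySpark").getOrCreate()

    # Placeholder OSS path; replace <yourBucket>/path/to with the actual location.
    # The actual test script may instead read the copy of employee.csv that the
    # --files option distributes to the working directory of the job.
    csv_path = "oss://<yourBucket>/path/to/employee.csv"

    # The header option and the column names (department, salary) are assumptions
    # about the structure of the test data.
    df = spark.read.option("header", "true").option("inferSchema", "true").csv(csv_path)

    # Example aggregation: average salary for each department.
    df.groupBy("department").avg("salary").show()

    spark.stop()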
The following tables describe the parameters.
• Parameters that are compatible with the open source spark-submit tool:
Parameter | Example | Description |
---|---|---|
--class | org.apache.spark.examples.SparkPi | The entry class of the Spark job. This parameter is required if the Spark job is launched from Java or Scala. You do not need to configure this parameter if the Spark job is launched from Python. |
--num-executors | 10 | The number of executors of the Spark job. |
--driver-cores | 1 | The number of driver cores of the Spark job. |
--driver-memory | 4g | The size of the driver memory of the Spark job. |
--executor-cores | 1 | The number of executor cores of the Spark job. |
--executor-memory | 1024m | The size of the executor memory of the Spark job. |
--files | oss://<yourBucket>/file1,oss://<yourBucket>/file2 | The resource files used by the Spark job. Only resource files stored in OSS are supported. Separate multiple files with commas (,). |
--py-files | oss://<yourBucket>/file1.py,oss://<yourBucket>/file2.py | The Python scripts used by the Spark job. Only Python scripts stored in OSS are supported. Separate multiple scripts with commas (,). This parameter is valid only if the Spark job is launched from PySpark. |
--jars | oss://<yourBucket>/file1.jar,oss://<yourBucket>/file2.jar | The JAR packages used by the Spark job. Only JAR packages stored in OSS are supported. Separate multiple packages with commas (,). |
--archives | oss://<yourBucket>/archive.tar.gz#env,oss://<yourBucket>/archive2.zip | The archive packages used by the Spark job. Only archive packages stored in OSS are supported. Separate multiple packages with commas (,). |
--queue | root_queue | The name of the queue in which the Spark job runs. The queue name must be the same as that in the EMR Serverless Spark workspace. |
--conf | spark.tags.key=value | The custom parameter of the Spark job. |
--status | jr-8598aa9f459d**** | Queries the state of the Spark job. |
--kill | jr-8598aa9f459d**** | Terminates the Spark job. |
• Parameters that are specific to the spark-submit tool for EMR Serverless Spark:
Parameter | Example | Description |
---|---|---|
--detach | No value is required. | Exits the spark-submit tool immediately after the Spark job is submitted. If you use this parameter, the tool does not wait for and return the job state. |
--detail | jr-8598aa9f459d**** | Queries the details of the Spark job. |
• Parameters of the open source spark-submit tool that are not supported:
o --deploy-mode
o --master
o --proxy-user
o --repositories
o --keytab
o --principal
o --total-executor-cores
o --driver-library-path
o --driver-class-path
o --supervise
o --verbose
You can view the information about a Spark job by using the CLI or the UI.
• Use the CLI
Query the Spark job state:
cd emr-serverless-spark-tool-0.1.0
./bin/spark-submit --status <jr-8598aa9f459d****>
Query the Spark job details:
cd emr-serverless-spark-tool-0.1.0
./bin/spark-submit --detail <jr-8598aa9f459d****>
• Use the UI
a) In the left-side navigation pane of the EMR Serverless Spark page, click Job Runs.
b) On the Development Job Runs tab of the Job Runs page, you can view all submitted jobs.
To terminate a Spark job, run the following commands:
cd emr-serverless-spark-tool-0.1.0
./bin/spark-submit --kill <jr-8598aa9f459d****>
Note
You can terminate a job only if the job is in the RUNNING state.