
Use Cases for EMR Serverless Spark | Use the spark-submit CLI to Submit a Spark Job


This article describes how to use the spark-submit command line interface (CLI) to submit a Spark job after E-MapReduce (EMR) Serverless Spark is connected to Elastic Compute Service (ECS).

Prerequisites

Java Development Kit (JDK) V1.8 or later is installed.
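You can quickly confirm that a suitable JDK is available by checking the version. The exact output depends on your JDK distribution:

java -version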

Procedure

Step 1: Download and install the spark-submit tool for EMR Serverless

1.  Click emr-serverless-spark-tool-0.1.0-bin.zip to download the installation package.

2.  Run the following command to decompress and install the spark-submit tool for EMR Serverless:

unzip emr-serverless-spark-tool-0.1.0-bin.zip
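The exact contents of the extracted directory may vary by version, but the following steps assume at least a bin directory that contains the spark-submit script and a conf directory that contains connection.properties. You can confirm this after extraction:

ls emr-serverless-spark-tool-0.1.0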

Step 2: Configure parameters

1.  Run the following command to modify the configuration in connection.properties:

vim emr-serverless-spark-tool-0.1.0/conf/connection.properties

2.  The following sample code provides an example of the key-value pairs in the file:

accessKeyId=yourAccessKeyId
accessKeySecret=yourAccessKeySecret
# securityToken=yourSecurityToken
regionId=cn-hangzhou
endpoint=emr-serverless-spark.cn-hangzhou.aliyuncs.com
workspaceId=w-xxxxxxxxxxxx
resourceQueueId=dev_queue
# networkServiceId=xxxxxx
releaseVersion=esr-2.1 (Spark 3.3.1, Scala 2.12, Java Runtime)

The following table describes the parameters:

Parameter | Required | Description
accessKeyId | Yes | The AccessKey ID of the Alibaba Cloud account or RAM user that is used to run the Spark job.
accessKeySecret | Yes | The AccessKey secret of the Alibaba Cloud account or RAM user that is used to run the Spark job.
securityToken | No | The Security Token Service (STS) token of the RAM user. Note: This parameter is required only if you use STS authentication.
regionId | Yes | The region ID. In this example, the China (Hangzhou) region is used.
endpoint | Yes | The endpoint of the EMR Serverless Spark workspace. Format: emr-serverless-spark.<yourRegionId>.aliyuncs.com. In this example, the China (Hangzhou) region is used, so the value is emr-serverless-spark.cn-hangzhou.aliyuncs.com.
workspaceId | Yes | The ID of the EMR Serverless Spark workspace.
resourceQueueId | No | The queue name. Default value: dev_queue.
networkServiceId | No | The network connection name. Note: This parameter is required only if the Spark job needs to access virtual private clouds (VPCs). For more information, see Network connection between EMR Serverless Spark and other VPCs.
releaseVersion | No | The version of EMR Serverless Spark. Default value: esr-2.1 (Spark 3.3.1, Scala 2.12, Java Runtime).
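If you authenticate with temporary credentials issued by STS, uncomment the securityToken line and use the AccessKey ID, AccessKey secret, and security token from the same STS response. A hypothetical example, with placeholder values:

accessKeyId=yourStsAccessKeyId
accessKeySecret=yourStsAccessKeySecret
securityToken=yourStsSecurityToken
# The remaining parameters (regionId, endpoint, workspaceId, and so on) are unchanged.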

Step 3: Submit a Spark job

1.  Run the following command to go to the spark-submit tool directory:

cd emr-serverless-spark-tool-0.1.0

2.  Submit the Spark job in one of the following ways:

• Spark job launched from Java/Scala

In this example, the test package spark-examples_2.12-3.3.1.jar is used. Click spark-examples_2.12-3.3.1.jar to download the test JAR package, and then upload it to Object Storage Service (OSS). The JAR package is a simple example provided by Spark that calculates the value of pi.

./bin/spark-submit  --name SparkPi \
--queue dev_queue  \
--num-executors 5 \
--driver-memory 1g \
--executor-cores 2 \
--executor-memory 2g \
--class org.apache.spark.examples.SparkPi \
 oss://<yourBucket>/path/to/spark-examples_2.12-3.3.1.jar \
10000

• Spark job launched from PySpark

In this example, the test files DataFrame.py and employee.csv are used. Click DataFrame.py and employee.csv to download the test files, and then upload them to OSS.

Note

o The DataFrame.py file contains the code that is used to process data stored in OSS with Apache Spark. A hypothetical sketch of this kind of script is shown after the submission command below.

o The employee.csv file contains data such as employee names, departments, and salaries.

./bin/spark-submit --name PySpark \
--queue dev_queue  \
--num-executors 5 \
--driver-memory 1g \
--executor-cores 2 \
--executor-memory 2g \
--conf spark.tags.key=value \
--files oss://<yourBucket>/path/to/employee.csv \
oss://<yourBucket>/path/to/DataFrame.py \
10000
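Note: The exact contents of DataFrame.py are not reproduced in this article. The following is only a hypothetical sketch of the kind of processing the file is described as performing; the OSS path and the column names (department, salary) are assumptions based on the description of employee.csv above.

from pyspark.sql import SparkSession

if __name__ == "__main__":
    # Create a Spark session for the job.
    spark = SparkSession.builder.appName("PySpark").getOrCreate()

    # Read the employee data directly from OSS (placeholder path).
    df = spark.read.csv(
        "oss://<yourBucket>/path/to/employee.csv",
        header=True,
        inferSchema=True,
    )

    # Example aggregation: average salary per department (column names are assumptions).
    df.groupBy("department").avg("salary").show()

    spark.stop()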

The following tables describe the parameters:

• Parameters compatible with the open source spark-submit tool (a combined example that uses several of these parameters follows the parameter lists below):

Parameter | Example | Description
--class | org.apache.spark.examples.SparkPi | The entry class of the Spark job. This parameter is required if the Spark job is launched from Java or Scala. Do not configure this parameter if the Spark job is launched from Python.
--num-executors | 10 | The number of executors of the Spark job.
--driver-cores | 1 | The number of driver cores of the Spark job.
--driver-memory | 4g | The size of the driver memory of the Spark job.
--executor-cores | 1 | The number of executor cores of the Spark job.
--executor-memory | 1024m | The size of the executor memory of the Spark job.
--files | oss://<yourBucket>/file1,oss://<yourBucket>/file2 | The resource files used by the Spark job. Only resource files stored in OSS are supported. Separate multiple files with commas (,).
--py-files | oss://<yourBucket>/file1.py,oss://<yourBucket>/file2.py | The Python scripts used by the Spark job. Only Python scripts stored in OSS are supported. Separate multiple scripts with commas (,). This parameter is valid only if the Spark job is launched from PySpark.
--jars | oss://<yourBucket>/file1.jar,oss://<yourBucket>/file2.jar | The JAR packages used by the Spark job. Only JAR packages stored in OSS are supported. Separate multiple packages with commas (,).
--archives | oss://<yourBucket>/archive.tar.gz#env,oss://<yourBucket>/archive2.zip | The archive packages used by the Spark job. Only archive packages stored in OSS are supported. Separate multiple packages with commas (,).
--queue | root_queue | The name of the queue in which the Spark job runs. The queue name must match a queue in the EMR Serverless Spark workspace.
--conf | spark.tags.key=value | A custom configuration parameter of the Spark job.
--status | jr-8598aa9f459d**** | Queries the state of the Spark job.
--kill | jr-8598aa9f459d**** | Terminates the Spark job.

• Parameters specific to this spark-submit tool (not available in the open source tool):

Parameter | Example | Description
--detach | (no value) | Exits the spark-submit tool immediately after the job is submitted instead of waiting for the job state to be returned.
--detail | jr-8598aa9f459d**** | Queries the details of the Spark job.

• Open source spark-submit parameters that are not supported by this tool:

o --deploy-mode
o --master
o --proxy-user
o --repositories
o --keytab
o --principal
o --total-executor-cores
o --driver-library-path
o --driver-class-path
o --supervise
o --verbose
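For reference, the following hypothetical submission combines several of the compatible and tool-specific parameters described above. All OSS paths and file names are placeholders, and --detach makes the tool return immediately after the job is submitted:

./bin/spark-submit --name PySparkWithDeps \
--queue dev_queue \
--num-executors 5 \
--driver-memory 1g \
--executor-cores 2 \
--executor-memory 2g \
--py-files oss://<yourBucket>/path/to/utils.py \
--files oss://<yourBucket>/path/to/employee.csv \
--conf spark.tags.key=value \
--detach \
oss://<yourBucket>/path/to/DataFrame.py

Because --detach exits without waiting for the job to finish, you can query the run later with --status or --detail, as described in Step 4.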

Step 4: Query the Spark job

• Use the CLI

Query the Spark job state (a polling sketch that uses this command appears after the UI steps below)

cd emr-serverless-spark-tool-0.1.0
./bin/spark-submit --status <jr-8598aa9f459d****>

Query the Spark job details

cd emr-serverless-spark-tool-0.1.0
./bin/spark-submit --detail <jr-8598aa9f459d****>

• Use the UI

a) In the left-side navigation pane of the EMR Serverless Spark page, click Job Runs.

b) On the Development Job Runs tab of the Job Runs page, you can view all submitted jobs.
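If you submit a job with --detach and want to follow its progress from a script, you can poll --status in a loop. This is only a sketch: the job ID is a placeholder, and the loop simply prints whatever --status returns every 30 seconds for up to 10 minutes.

cd emr-serverless-spark-tool-0.1.0
for i in $(seq 1 20); do
  ./bin/spark-submit --status <jr-8598aa9f459d****>
  sleep 30
done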


(Optional) Step 5: Terminate the Spark job

cd emr-serverless-spark-tool-0.1.0
./bin/spark-submit --kill <jr-8598aa9f459d****>

Note
You can terminate a job only if the job is in the RUNNING state.

References

  1. E-MapReduce Official Website: https://www.alibabacloud.com/en/product/emapreduce
  2. Service Console: https://emr-next.console.aliyun.com/
  3. Product Documentation: https://www.alibabacloud.com/help/en/emr/emr-serverless-spark/