This topic describes how to develop Spark jobs using the EMR Serverless spark-submit command-line interface (CLI). The examples use an ECS instance that is connected to EMR Serverless Spark.
Prerequisites
Java 1.8 or later is installed.
If you use a RAM user to submit a Spark job, you must add the RAM user to the Serverless Spark workspace and grant the user the Developer role or a higher-level role. For more information, see Manage users and roles.
Procedure
Step 1: Download and install the EMR Serverless spark-submit tool
Click emr-serverless-spark-tool-0.11.3-SNAPSHOT-bin.zip to download the installation package.
Upload the installation package to the ECS instance. For more information, see Upload or download files.
Run the following command to decompress and install the EMR Serverless spark-submit tool.
unzip emr-serverless-spark-tool-0.11.3-SNAPSHOT-bin.zip
Step 2: Configure parameters
If the SPARK_CONF_DIR environment variable is set in your Spark environment, you must place the configuration file in the directory specified by SPARK_CONF_DIR. Otherwise, an error occurs. For example, in an EMR cluster, this directory is usually /etc/taihao-apps/spark-conf.
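For example, after you edit connection.properties in the next step, you can copy it into the directory that SPARK_CONF_DIR points to. The following commands are a minimal sketch; the installation path matches the one used in this topic.
# Copy the tool's configuration file into the directory that SPARK_CONF_DIR points to,
# for example /etc/taihao-apps/spark-conf in an EMR cluster.
cp emr-serverless-spark-tool-0.11.3-SNAPSHOT/conf/connection.properties "${SPARK_CONF_DIR}/"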
Run the following command to modify the configuration file connection.properties:

vim emr-serverless-spark-tool-0.11.3-SNAPSHOT/conf/connection.properties

Configure the parameters in the file in the key=value format. The following code shows an example:

accessKeyId=<ALIBABA_CLOUD_ACCESS_KEY_ID>
accessKeySecret=<ALIBABA_CLOUD_ACCESS_KEY_SECRET>
regionId=cn-hangzhou
endpoint=emr-serverless-spark.cn-hangzhou.aliyuncs.com
workspaceId=w-xxxxxxxxxxxx

Important: The RAM user or role that corresponds to this AccessKey pair must be granted RAM authorization and added to the Serverless Spark workspace.
For more information about RAM authorization, see Grant permissions to a RAM user.
For more information about user and role management in a Serverless Spark workspace, see Manage users and roles.
The following table describes the parameters.
Parameter
Required
Description
accessKeyId
Yes
The AccessKey ID of the Alibaba Cloud account or RAM user that you use to run the Spark job.
Important: When you configure the accessKeyId and accessKeySecret parameters, make sure that the user that corresponds to the AccessKey pair has read and write permissions on the Object Storage Service (OSS) bucket that is attached to the workspace. To view the OSS bucket that is attached to the workspace, go to the Spark page and click Details in the Actions column of the workspace.
accessKeySecret
Yes
The AccessKey secret of the Alibaba Cloud account or RAM user that you use to run the Spark job.
regionId
Yes
The region ID. This topic uses the China (Hangzhou) region as an example.
endpoint
Yes
The endpoint of EMR Serverless Spark. For more information, see Service endpoints.
This topic uses the public endpoint of the China (Hangzhou) region as an example. The parameter is set to emr-serverless-spark.cn-hangzhou.aliyuncs.com.
Note: If the ECS instance cannot access the public network, use a VPC endpoint.
workspaceId
Yes
The ID of the EMR Serverless Spark workspace.
Step 3: Submit a Spark job
Run the following command to navigate to the directory of the EMR Serverless spark-submit tool.
cd emr-serverless-spark-tool-0.11.3-SNAPSHOT

Select a submission method based on the job type.
When you submit a job, you must specify the dependent resources, such as JAR packages or Python scripts. These files can be stored in OSS or on a local disk. The storage location depends on your use case and requirements. The examples in this topic use resources that are stored in OSS.
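If you store the resources in OSS, you can upload them with any OSS client. For example, if you use the ossutil command-line tool, commands similar to the following upload the test files used in this topic; the bucket name and paths are placeholders.
ossutil cp spark-examples_2.12-3.3.1.jar oss://<yourBucket>/path/to/spark-examples_2.12-3.3.1.jar
ossutil cp DataFrame.py oss://<yourBucket>/path/to/DataFrame.py
ossutil cp employee.csv oss://<yourBucket>/path/to/employee.csv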
Using spark-submit
spark-submit is a general-purpose job submission tool provided by Spark. It is suitable for Java, Scala, and PySpark jobs.
Java/Scala jobs
This example uses spark-examples_2.12-3.3.1.jar, which is a simple example included with Spark that calculates the value of Pi (π). You can click spark-examples_2.12-3.3.1.jar to download the test JAR package and upload it to OSS.
./bin/spark-submit --name SparkPi \
--queue dev_queue \
--num-executors 5 \
--driver-memory 1g \
--executor-cores 2 \
--executor-memory 2g \
--class org.apache.spark.examples.SparkPi \
oss://<yourBucket>/path/to/spark-examples_2.12-3.3.1.jar \
10000

PySpark jobs
This example uses DataFrame.py and employee.csv. You can click DataFrame.py and employee.csv to download the test files and upload them to OSS.
Note: The DataFrame.py file contains code that uses the Apache Spark framework to process data in Object Storage Service (OSS).
The employee.csv file contains data such as employee names, departments, and salaries.
./bin/spark-submit --name PySpark \
--queue dev_queue \
--num-executors 5 \
--driver-memory 1g \
--executor-cores 2 \
--executor-memory 2g \
--conf spark.tags.key=value \
oss://<yourBucket>/path/to/DataFrame.py \
oss://<yourBucket>/path/to/employee.csv

The following sections describe the parameters:
Compatible open-source parameters
Parameter
Example value
Description
--name
SparkPi
The name of the Spark application. This parameter is used to identify the job.
--class
org.apache.spark.examples.SparkPi
The entry class name of the Spark job for a Java or Scala program. This parameter is not required for Python programs.
--num-executors
5
The number of executors for the Spark job.
--driver-cores
1
The number of driver cores for the Spark job.
--driver-memory
1g
The driver memory size for the Spark job.
--executor-cores
2
The number of executor cores for the Spark job.
--executor-memory
2g
The executor memory size for the Spark job.
--files
oss://<yourBucket>/file1,oss://<yourBucket>/file2
The resource files that the Spark job references. The files can be OSS resources or local files. Separate multiple files with commas (,).
--py-files
oss://<yourBucket>/file1.py,oss://<yourBucket>/file2.py
The Python scripts that the Spark job references. The scripts can be OSS resources or local files. Separate multiple files with commas (,). This parameter is valid only for PySpark programs.
--jars
oss://<yourBucket>/file1.jar,oss://<yourBucket>/file2.jar
The JAR packages that the Spark job references. The packages can be OSS resources or local files. Separate multiple files with commas (,).
--archives
oss://<yourBucket>/archive.tar.gz#env,oss://<yourBucket>/archive2.zip
The archive packages that the Spark job references. The packages can be OSS resources or local files. Separate multiple files with commas (,).
--queue
root_queue
The name of the queue in which the Spark job runs. The name must be the same as the queue name in the queue management section of the EMR Serverless Spark workspace.
--proxy-user
test
The value of this parameter overwrites the HADOOP_USER_NAME environment variable. The behavior is the same as in the open-source version.
--conf
spark.tags.key=value
Custom parameters for the Spark job.
--status
jr-8598aa9f459d****
View the status of a Spark job.
--kill
jr-8598aa9f459d****
Stop a Spark job.
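The following command is a minimal sketch that shows how several of these parameters can be combined in a single PySpark submission. The dependency file names and OSS paths are placeholders, not files provided by this topic.
./bin/spark-submit --name PySparkWithDeps \
--queue dev_queue \
--num-executors 5 \
--executor-cores 2 \
--executor-memory 2g \
--py-files oss://<yourBucket>/path/to/utils.py \
--files oss://<yourBucket>/path/to/config.json \
oss://<yourBucket>/path/to/DataFrame.py \
oss://<yourBucket>/path/to/employee.csv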
Enhanced parameters
Parameter
Example value
Description
--detach
No value required
If you use this parameter, spark-submit exits immediately after submitting the job and does not wait for or query the job status.
--detail
jr-8598aa9f459d****
View the details of a Spark job.
--release-version
esr-4.1.1 (Spark 3.5.2, Scala 2.12)
The Spark engine version. Enter the engine version that is displayed in the console.
--enable-template
No value required
Enables the template feature. The job uses the default configuration template of the workspace.
If you created a Configuration Template in Configuration Management, you can specify the template ID by setting the spark.emr.serverless.templateId parameter in --conf. The job then directly applies the specified template. For more information about how to create a template, see Configuration management.
If you specify only --enable-template, the job applies the default configuration template of the workspace.
If you specify only the template ID in --conf, the job applies the specified template.
If you specify both --enable-template and --conf spark.emr.serverless.templateId, the template ID in --conf overwrites the default template.
If you specify neither --enable-template nor --conf spark.emr.serverless.templateId, the job does not apply any template configuration.
A combined example that uses the enhanced parameters appears after the list of unsupported parameters below.
--timeout
60
The job timeout period. Unit: seconds.
--workspace-id
w-4b4d7925a797****
Specifies the workspace ID at the job level. This value overwrites the workspaceId parameter in the connection.properties file.
Unsupported open-source parameters
--deploy-mode
--master
--repositories
--keytab
--principal
--total-executor-cores
--driver-library-path
--driver-class-path
--supervise
--verbose
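The following command is a minimal sketch that combines several of the enhanced parameters with the SparkPi example. The release version and template ID are placeholders; replace them with the values from your own workspace.
./bin/spark-submit --name SparkPi \
--queue dev_queue \
--detach \
--release-version esr-4.1.1 \
--conf spark.emr.serverless.templateId=<templateId> \
--class org.apache.spark.examples.SparkPi \
oss://<yourBucket>/path/to/spark-examples_2.12-3.3.1.jar \
10000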
Using spark-sql
spark-sql is a tool that is used to run SQL queries or scripts. It is suitable for scenarios in which you run SQL statements directly.
Example 1: Run an SQL statement directly
spark-sql -e "SHOW TABLES"

This command lists all tables in the current database.
Example 2: Run an SQL script file
spark-sql -f oss://<yourBucketname>/path/to/your/example.sql

This example uses example.sql. You can click example.sql to download the test file and upload it to OSS.
The following table describes the parameters.
Parameter
Example value
Description
-e "<sql>"
-e "SELECT * FROM table"
Runs an SQL statement inline from the command line.
-f <path>
-f oss://path/script.sql
Runs the SQL script file at the specified path.
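For reference, the SQL script that you pass to -f is an ordinary text file of SQL statements separated by semicolons. The following lines are only an illustration, not the contents of the downloadable example.sql:
SHOW DATABASES;
CREATE TABLE IF NOT EXISTS demo_table (id INT, name STRING);
SHOW TABLES;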
Step 4: Query a Spark job
Using the CLI
Query the status of a Spark job
cd emr-serverless-spark-tool-0.11.3-SNAPSHOT
./bin/spark-submit --status <jr-8598aa9f459d****>

Query the details of a Spark job
cd emr-serverless-spark-tool-0.11.3-SNAPSHOT
./bin/spark-submit --detail <jr-8598aa9f459d****>

Using the UI
On the EMR Serverless Spark page, click Job History in the navigation pane on the left.
On the Development Jobs tab of the Job History page, you can view the submitted jobs.

(Optional) Step 5: Stop a Spark job
cd emr-serverless-spark-tool-0.11.3-SNAPSHOT
./bin/spark-submit --kill <jr-8598aa9f459d****>

You can stop only running jobs.
FAQ
How do I specify network connectivity when I submit a batch job using the spark-submit tool?
Prepare a network connection. For more information, see Add a network connection.
In the spark-submit command, use --conf to specify the network connection:

--conf spark.emr.serverless.network.service.name=<networkname>

Replace <networkname> with the name of your network connection.
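For example, a complete submission that specifies a network connection might look like the following. The network connection name and the OSS path are placeholders.
./bin/spark-submit --name SparkPi \
--queue dev_queue \
--conf spark.emr.serverless.network.service.name=<networkname> \
--class org.apache.spark.examples.SparkPi \
oss://<yourBucket>/path/to/spark-examples_2.12-3.3.1.jar \
10000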