Use spark-submit to run and manage Spark jobs on EMR Serverless Spark

Prerequisites

Before you begin, make sure you have:

Java 1.8 or later installed on your ECS instance
(For RAM users) The RAM user added to the Serverless Spark workspace with the Developer role or higher. See Manage users and roles.
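
One way to check the Java prerequisite is to parse the output of `java -version`. The following is a minimal sketch, not part of the tool: the function name `java_feature_version` is hypothetical, and it assumes the usual `version "x.y.z"` banner that JDK distributions print.

```shell
# Hypothetical helper (not shipped with the CLI): print the Java feature
# version (8, 11, 17, ...) found in a `java -version` banner.
java_feature_version() {
  # Pull the quoted version string, for example 1.8.0_292 or 17.0.1.
  version=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
  [ -n "$version" ] || return 1
  case "$version" in
    # Legacy scheme: 1.8.0_292 means Java 8.
    1.*) version=${version#1.} ;;
  esac
  # Keep only the leading number.
  echo "${version%%[._-]*}"
}

# Usage (java -version writes its banner to stderr):
#   feature=$(java_feature_version "$(java -version 2>&1)")
#   [ "$feature" -ge 8 ] || echo "Java 1.8 or later is required" >&2
```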

Placeholders in this topic

Replace the following placeholders with your actual values before running the commands:

Placeholder	Description	Example
`<ALIBABA_CLOUD_ACCESS_KEY_ID>`	AccessKey ID of the Alibaba Cloud account or RAM user	`LTAI5tXxx`
`<ALIBABA_CLOUD_ACCESS_KEY_SECRET>`	AccessKey secret of the Alibaba Cloud account or RAM user	`xXxXxXx`
`<region-id>`	Region ID where your workspace is deployed	`cn-hangzhou`
`<workspace-id>`	EMR Serverless Spark workspace ID	`w-xxxxxxxxxxxx`
`<your-bucket>`	OSS bucket name bound to the workspace	`my-bucket`
`<job-run-id>`	Job run ID returned after submitting a job	`jr-8598aa9f459d****`

Step 1: Download and install the CLI

Download the installation package: emr-serverless-spark-tool-0.11.3-SNAPSHOT-bin.zip.
Upload the package to your ECS instance. See Upload or download files.
Decompress the package:
```
unzip emr-serverless-spark-tool-0.11.3-SNAPSHOT-bin.zip
```
After decompression, a directory named emr-serverless-spark-tool-0.11.3-SNAPSHOT is created in your current working directory.

Step 2: Configure the connection

Open the configuration file:
```
vim emr-serverless-spark-tool-0.11.3-SNAPSHOT/conf/connection.properties
```

Important

If the SPARK_CONF_DIR environment variable is set in your environment (for example, in an EMR cluster, this is typically /etc/taihao-apps/spark-conf), place connection.properties in that directory instead. Otherwise, the system returns an error.

Set the following required parameters using key=value format:

Important

The RAM user or role associated with the AccessKey must be granted RAM authorization and added to the Serverless Spark workspace. See RAM user authorization and Manage users and roles.

Parameter	Description
`accessKeyId`	AccessKey ID of the Alibaba Cloud account or RAM user used to run Spark jobs. The associated user must have read and write permissions on the Object Storage Service (OSS) bucket bound to the workspace. To find the bound OSS bucket, go to the Spark page and click Details in the Actions column.
`accessKeySecret`	AccessKey secret paired with `accessKeyId`.
`regionId`	Region ID of your workspace. This example uses `cn-hangzhou` (China (Hangzhou)).
`endpoint`	EMR Serverless Spark service endpoint. This example uses the public endpoint for China (Hangzhou): `emr-serverless-spark.cn-hangzhou.aliyuncs.com`. For all endpoints, see Service endpoints. If your ECS instance cannot access the public network, use the VPC endpoint instead.
`workspaceId`	ID of the EMR Serverless Spark workspace.

accessKeyId=<ALIBABA_CLOUD_ACCESS_KEY_ID>
accessKeySecret=<ALIBABA_CLOUD_ACCESS_KEY_SECRET>
regionId=<region-id>
endpoint=emr-serverless-spark.<region-id>.aliyuncs.com
workspaceId=<workspace-id>
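
If you prefer not to edit the file by hand, the same five keys can be generated from environment variables. This is a minimal sketch under stated assumptions: the function name `write_connection_properties` and the variable names `REGION_ID` and `WORKSPACE_ID` are illustrative and not required by the tool, and the endpoint follows the public endpoint pattern shown above (use the VPC endpoint instead if your instance has no public network access).

```shell
# Hypothetical helper: write connection.properties into the given directory.
# Assumes these variables are set: ALIBABA_CLOUD_ACCESS_KEY_ID,
# ALIBABA_CLOUD_ACCESS_KEY_SECRET, REGION_ID, WORKSPACE_ID.
write_connection_properties() {
  conf_dir=$1
  for name in ALIBABA_CLOUD_ACCESS_KEY_ID ALIBABA_CLOUD_ACCESS_KEY_SECRET REGION_ID WORKSPACE_ID; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      return 1
    fi
  done
  mkdir -p "$conf_dir" || return 1
  (
    # Restrict permissions on a newly created file, because it holds a secret.
    umask 077
    cat > "$conf_dir/connection.properties" <<EOF
accessKeyId=${ALIBABA_CLOUD_ACCESS_KEY_ID}
accessKeySecret=${ALIBABA_CLOUD_ACCESS_KEY_SECRET}
regionId=${REGION_ID}
endpoint=emr-serverless-spark.${REGION_ID}.aliyuncs.com
workspaceId=${WORKSPACE_ID}
EOF
  )
}

# Usage:
#   write_connection_properties emr-serverless-spark-tool-0.11.3-SNAPSHOT/conf
```

When SPARK_CONF_DIR is set, pass that directory as the argument instead, as described in the note above.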

The table above describes each parameter in this file.

Step 3: Submit a Spark job

Navigate to the tool directory:

cd emr-serverless-spark-tool-0.11.3-SNAPSHOT

Choose a submission method based on your job type.

Submit Java or Scala jobs with spark-submit

spark-submit is Spark's general-purpose job submission tool for Java, Scala, and PySpark jobs.

The following example runs a built-in Spark job that calculates the value of Pi (π). Before running the command, download the test JAR file that matches your engine version and upload it to your OSS bucket:

For the esr-4.x engine version: spark-examples_2.12-3.5.2.jar
For the esr-5.x engine version: spark-examples_2.13-4.0.1.jar

./bin/spark-submit --name SparkPi \
  --queue dev_queue \
  --num-executors 5 \
  --driver-memory 1g \
  --executor-cores 2 \
  --executor-memory 2g \
  --class org.apache.spark.examples.SparkPi \
  oss://<your-bucket>/path/to/spark-examples_2.12-3.5.2.jar \
  10000
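
The command above references the esr-4.x JAR. If a script has to handle both engine lines, the mapping listed earlier can be captured in a small helper. This is a sketch only: the function name `example_jar_for_release` is hypothetical, and it covers just the two JAR files named in this topic.

```shell
# Hypothetical helper: map an engine version string to the example JAR
# listed in this topic. Accepts short forms such as esr-4.1.1 as well as the
# console form "esr-4.1.1 (Spark 3.5.2, Scala 2.12)".
example_jar_for_release() {
  case "$1" in
    esr-4.*) echo "spark-examples_2.12-3.5.2.jar" ;;
    esr-5.*) echo "spark-examples_2.13-4.0.1.jar" ;;
    *)
      echo "unknown engine version: $1" >&2
      return 1
      ;;
  esac
}

# Usage:
#   jar=$(example_jar_for_release "esr-4.1.1 (Spark 3.5.2, Scala 2.12)")
#   echo "oss://<your-bucket>/path/to/$jar"
```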

Submit PySpark jobs with spark-submit

The following example uses DataFrame.py and employee.csv. DataFrame.py processes data in OSS using the Apache Spark framework; employee.csv contains employee names, departments, and salaries. Download both files and upload them to your OSS bucket before running the command.

./bin/spark-submit --name PySpark \
  --queue dev_queue \
  --num-executors 5 \
  --driver-memory 1g \
  --executor-cores 2 \
  --executor-memory 2g \
  --conf spark.tags.key=value \
  oss://<your-bucket>/path/to/DataFrame.py \
  oss://<your-bucket>/path/to/employee.csv

Run SQL queries with spark-sql

Use spark-sql to execute SQL statements or scripts directly, without writing application code.

Run an SQL statement inline:

spark-sql -e "SHOW TABLES"

This command lists all tables in the current database.

Run an SQL script file from OSS:

spark-sql -f oss://<your-bucket>/path/to/your/example.sql

Before running the command, download the test file example.sql and upload it to your OSS bucket. The file contains:

CREATE TABLE IF NOT EXISTS employees (
    id INT,
    name STRING,
    age INT,
    department STRING
);

INSERT INTO employees VALUES
(1, 'Alice', 30, 'Engineering'),
(2, 'Bob', 25, 'Marketing'),
(3, 'Charlie', 35, 'Sales');

SELECT * FROM employees;

spark-sql parameters:

Parameter	Example	Description
`-e "<sql>"`	`-e "SELECT * FROM table"`	Execute an SQL statement inline.
`-f <path>`	`-f oss://path/script.sql`	Execute an SQL script file at the specified path.

spark-submit parameters

Supported open source parameters

Parameter	Example	Description
`--name`	`SparkPi`	Application name of the Spark job, used for identification.
`--class`	`org.apache.spark.examples.SparkPi`	Entry class name for Java or Scala programs. Not applicable to Python programs.
`--num-executors`	`5`	Number of executors for the Spark job.
`--driver-cores`	`1`	Number of CPU cores allocated to the Spark driver.
`--driver-memory`	`1g`	Memory allocated to the Spark driver.
`--executor-cores`	`2`	Number of CPU cores allocated to each executor.
`--executor-memory`	`2g`	Memory allocated to each executor.
`--files`	`oss://<your-bucket>/file1,oss://<your-bucket>/file2`	Resource files required by the job. Accepts OSS paths and local paths. Separate multiple files with commas.
`--py-files`	`oss://<your-bucket>/file1.py,oss://<your-bucket>/file2.py`	Python scripts required by the job. Accepts OSS paths and local paths. Separate multiple files with commas. Applies to PySpark jobs only.
`--jars`	`oss://<your-bucket>/file1.jar,oss://<your-bucket>/file2.jar`	JAR files required by the job. Accepts OSS paths and local paths. Separate multiple files with commas.
`--archives`	`oss://<your-bucket>/archive.tar.gz#env,oss://<your-bucket>/archive2.zip`	Archive packages required by the job. Accepts OSS paths and local paths. Separate multiple files with commas.
`--queue`	`root_queue`	Queue where the job runs. The name must match a queue configured in the EMR Serverless Spark workspace.
`--proxy-user`	`test`	Overrides the `HADOOP_USER_NAME` environment variable, matching open source behavior.
`--conf`	`spark.tags.key=value`	Custom Spark configuration properties.
`--status`	`jr-8598aa9f459d****`	Query the status of a Spark job by job run ID.
`--kill`	`jr-8598aa9f459d****`	Stop a Spark job by job run ID.

Enhanced parameters

These parameters are specific to EMR Serverless Spark and are not part of standard Spark.

Parameter	Example	Description
`--detach`	N/A	Exit immediately after submitting the job without waiting for or checking the job status.
`--detail`	`jr-8598aa9f459d****`	View the details of a Spark job by job run ID.
`--release-version`	`esr-4.1.1 (Spark 3.5.2, Scala 2.12)`	Specify the engine version. Use the version string shown in the console.
`--enable-template`	N/A	Apply a configuration template to the job. See the behavior notes below.
`--timeout`	`60`	Job timeout period, in seconds.
`--workspace-id`	`w-4b4d7925a797****`	Specify the workspace at the job level, overriding `workspaceId` in `connection.properties`.

`--enable-template` behavior:

Parameters specified	Result
`--enable-template` only	Applies the workspace's default configuration template.
`--conf spark.emr.serverless.templateId` only	Applies the specified template directly.
Both `--enable-template` and `--conf spark.emr.serverless.templateId`	The template ID in `--conf` overrides the default template.
Neither	No template is applied.

To create a configuration template, see Configuration management.
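
The four rows of the `--enable-template` table reduce to a simple precedence rule: an explicit template ID wins, then the workspace default, then nothing. The sketch below only models that rule for illustration. The function name `resolve_template` and its output labels are hypothetical and do not reflect how the service is implemented.

```shell
# Hypothetical model of the table above.
#   $1: "yes" if --enable-template is passed, anything else otherwise
#   $2: value of spark.emr.serverless.templateId, or empty if not set
resolve_template() {
  enable=$1
  template_id=${2:-}
  if [ -n "$template_id" ]; then
    # An explicit template ID applies with or without --enable-template.
    echo "specified:$template_id"
  elif [ "$enable" = "yes" ]; then
    echo "default"
  else
    echo "none"
  fi
}

# Usage:
#   resolve_template yes ""    # the workspace default template applies
```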

Unsupported parameters

The following standard Spark parameters are not supported. Because EMR Serverless Spark manages cluster deployment, resource scheduling, and security at the platform level, these settings do not need to be configured per job.

--deploy-mode
--master
--repositories
--keytab
--principal
--total-executor-cores
--driver-library-path
--driver-class-path
--supervise
--verbose
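
When you migrate an existing spark-submit script, it can help to scan its arguments for these flags before submitting. The following is a rough sketch rather than a feature of the tool: the function name `find_unsupported_flags` is hypothetical, and it scans every argument, so an application argument that happens to look like one of these flags is also reported.

```shell
# Hypothetical pre-flight check: print each unsupported flag found in the
# argument list and return non-zero if any is present.
find_unsupported_flags() {
  unsupported="--deploy-mode --master --repositories --keytab --principal --total-executor-cores --driver-library-path --driver-class-path --supervise --verbose"
  status=0
  for arg in "$@"; do
    # Also catch the --flag=value form.
    flag=${arg%%=*}
    for bad in $unsupported; do
      if [ "$flag" = "$bad" ]; then
        echo "$bad"
        status=1
      fi
    done
  done
  return $status
}

# Usage:
#   find_unsupported_flags --name SparkPi --master yarn --deploy-mode cluster
```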

Step 4: Query a Spark job

Using the CLI

Query job status:

cd emr-serverless-spark-tool-0.11.3-SNAPSHOT
./bin/spark-submit --status <job-run-id>

Query job details:

cd emr-serverless-spark-tool-0.11.3-SNAPSHOT
./bin/spark-submit --detail <job-run-id>
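
To wait for a job submitted with `--detach`, you can repeat the status query until its output matches a pattern you choose. This is a generic sketch: the function name `poll_until` is hypothetical, and because this topic does not list the state names that `--status` prints (other than Running), the pattern in the usage comment is a placeholder that you need to replace after inspecting the real output.

```shell
# Hypothetical helper: run a command repeatedly until its output matches an
# extended regular expression, or until the attempts run out.
#   $1: command to run   $2: pattern   $3: seconds between tries   $4: max tries
poll_until() {
  cmd=$1
  pattern=$2
  interval=$3
  max_tries=$4
  i=0
  while [ "$i" -lt "$max_tries" ]; do
    out=$(sh -c "$cmd" 2>&1)
    if printf '%s\n' "$out" | grep -Eq "$pattern"; then
      # Print the matching output so the caller can inspect it.
      printf '%s\n' "$out"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  return 1
}

# Usage (the job run ID and the state pattern are placeholders):
#   poll_until "./bin/spark-submit --status jr-8598aa9f459d****" "STATE_PATTERN" 30 40
```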

Using the console

On the EMR Serverless Spark page, click Job History in the left navigation pane.
Click the Development Jobs tab to view submitted jobs.

Step 5 (optional): Stop a Spark job

cd emr-serverless-spark-tool-0.11.3-SNAPSHOT
./bin/spark-submit --kill <job-run-id>

Note

You can stop only jobs in the Running state.

FAQ

How do I specify network connectivity when submitting a batch job?

First, set up a network connection. See Add a network connection.

Then pass the connection name in your spark-submit command:

--conf spark.emr.serverless.network.service.name=<network-connection-name>