Connect an Elastic Compute Service (ECS) instance to EMR Serverless Spark and submit Spark jobs directly from the command line using the EMR Serverless spark-submit command-line interface (CLI).
Prerequisites
Before you begin, make sure you have:
- Java 1.8 or later installed on your ECS instance.
- (For RAM users) The RAM user added to the Serverless Spark workspace with the Developer role or higher. See Manage users and roles.
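One way to confirm the Java prerequisite is to inspect the output of java -version on the ECS instance. The following is a minimal sketch, not part of the CLI: java_major_version is a hypothetical helper that reduces both legacy version strings (such as 1.8.0_392) and modern ones (such as 17.0.2) to a major version number.

```shell
# Hypothetical helper (not part of the EMR Serverless CLI): reduce a Java
# version string to its major version number.
java_major_version() {
  version="$1"
  major="${version%%.*}"            # text before the first dot
  if [ "$major" = "1" ]; then       # legacy scheme: 1.8.0_392 -> 8
    rest="${version#1.}"
    major="${rest%%.*}"
  fi
  major="${major%%[!0-9]*}"         # drop suffixes such as "-ea"
  printf '%s\n' "$major"
}

# Check the installed Java, if any. java -version writes to stderr, and the
# version is the first double-quoted string in its output.
if command -v java >/dev/null 2>&1; then
  installed="$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')"
  major="$(java_major_version "$installed")"
  if [ -n "$major" ] && [ "$major" -ge 8 ]; then
    echo "Java $installed satisfies the 1.8-or-later requirement."
  else
    echo "Java $installed does not satisfy the 1.8-or-later requirement." >&2
  fi
else
  echo "java was not found on the PATH." >&2
fi
```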
Placeholders in this topic
Replace the following placeholders with your actual values before running the commands:
| Placeholder | Description | Example |
|---|---|---|
| <ALIBABA_CLOUD_ACCESS_KEY_ID> | AccessKey ID of the Alibaba Cloud account or RAM user | LTAI5tXxx |
| <ALIBABA_CLOUD_ACCESS_KEY_SECRET> | AccessKey secret of the Alibaba Cloud account or RAM user | xXxXxXx |
| <region-id> | Region ID where your workspace is deployed | cn-hangzhou |
| <workspace-id> | EMR Serverless Spark workspace ID | w-xxxxxxxxxxxx |
| <your-bucket> | OSS bucket name bound to the workspace | my-bucket |
| <job-run-id> | Job run ID returned after submitting a job | jr-8598aa9f459d**** |
Step 1: Download and install the CLI
- Download the installation package: emr-serverless-spark-tool-0.11.3-SNAPSHOT-bin.zip.
- Upload the package to your ECS instance. See Upload or download files.
- Decompress the package:
  unzip emr-serverless-spark-tool-0.11.3-SNAPSHOT-bin.zip
  After decompression, a directory named emr-serverless-spark-tool-0.11.3-SNAPSHOT is created in your current working directory.
Step 2: Configure the connection
- Open the configuration file:
  vim emr-serverless-spark-tool-0.11.3-SNAPSHOT/conf/connection.properties
  Important: If the SPARK_CONF_DIR environment variable is set in your environment (for example, in an EMR cluster, this is typically /etc/taihao-apps/spark-conf), place connection.properties in that directory instead. Otherwise, the system returns an error.
- Set the following required parameters using key=value format:
  accessKeyId=<ALIBABA_CLOUD_ACCESS_KEY_ID>
  accessKeySecret=<ALIBABA_CLOUD_ACCESS_KEY_SECRET>
  regionId=<region-id>
  endpoint=emr-serverless-spark.<region-id>.aliyuncs.com
  workspaceId=<workspace-id>
  Important: The RAM user or role associated with the AccessKey must be granted RAM authorization and added to the Serverless Spark workspace. See RAM user authorization and Manage users and roles.
  The following table describes all parameters:
| Parameter | Description |
|---|---|
| accessKeyId | AccessKey ID of the Alibaba Cloud account or RAM user used to run Spark jobs. The associated user must have read and write permissions on the Object Storage Service (OSS) bucket bound to the workspace. To find the bound OSS bucket, go to the Spark page and click Details in the Actions column. |
| accessKeySecret | AccessKey secret paired with accessKeyId. |
| regionId | Region ID of your workspace. This example uses cn-hangzhou (China (Hangzhou)). |
| endpoint | EMR Serverless Spark service endpoint. This example uses the public endpoint for China (Hangzhou): emr-serverless-spark.cn-hangzhou.aliyuncs.com. For all endpoints, see Service endpoints. If your ECS instance cannot access the public network, use the VPC endpoint instead. |
| workspaceId | ID of the EMR Serverless Spark workspace. |
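If you prefer not to type credentials into an editor, the file can also be generated from environment variables. The following is a sketch under stated assumptions: write_connection_properties is a hypothetical helper, REGION_ID and WORKSPACE_ID are illustrative variable names, and the endpoint is built from the public endpoint pattern used in this topic (a VPC endpoint would need a different value).

```shell
# Hypothetical helper: write connection.properties into the directory given
# as $1. REGION_ID and WORKSPACE_ID are assumed variable names, chosen for
# this example only.
write_connection_properties() {
  conf_dir="$1"
  if [ -z "$conf_dir" ]; then
    echo "Usage: write_connection_properties <conf-dir>" >&2
    return 1
  fi
  # Refuse to write a partial file if any required value is missing.
  for name in ALIBABA_CLOUD_ACCESS_KEY_ID ALIBABA_CLOUD_ACCESS_KEY_SECRET \
              REGION_ID WORKSPACE_ID; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing environment variable: $name" >&2
      return 1
    fi
  done
  mkdir -p "$conf_dir" || return 1
  cat > "$conf_dir/connection.properties" <<EOF
accessKeyId=${ALIBABA_CLOUD_ACCESS_KEY_ID}
accessKeySecret=${ALIBABA_CLOUD_ACCESS_KEY_SECRET}
regionId=${REGION_ID}
endpoint=emr-serverless-spark.${REGION_ID}.aliyuncs.com
workspaceId=${WORKSPACE_ID}
EOF
  # The file holds a secret, so restrict it to the owner.
  chmod 600 "$conf_dir/connection.properties"
}

# Example call (honors SPARK_CONF_DIR when it is set):
# write_connection_properties "${SPARK_CONF_DIR:-emr-serverless-spark-tool-0.11.3-SNAPSHOT/conf}"
```

Passing the directory as an argument keeps the helper usable in both cases described above: the tool's own conf directory, or the directory that SPARK_CONF_DIR points to.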
Step 3: Submit a Spark job
Navigate to the tool directory:
cd emr-serverless-spark-tool-0.11.3-SNAPSHOT
Choose a submission method based on your job type.
Submit Java or Scala jobs with spark-submit
spark-submit is Spark's general-purpose job submission tool for Java, Scala, and PySpark jobs.
The following example runs a built-in Spark job that calculates the value of Pi (π). Download the test JAR file and upload it to your OSS bucket before running the command.
- For the esr-4.x engine version: spark-examples_2.12-3.5.2.jar
- For the esr-5.x engine version: spark-examples_2.13-4.0.1.jar
./bin/spark-submit --name SparkPi \
--queue dev_queue \
--num-executors 5 \
--driver-memory 1g \
--executor-cores 2 \
--executor-memory 2g \
--class org.apache.spark.examples.SparkPi \
oss://<your-bucket>/path/to/spark-examples_2.12-3.5.2.jar \
10000
Submit PySpark jobs with spark-submit
The following example uses DataFrame.py and employee.csv. DataFrame.py processes data in OSS using the Apache Spark framework; employee.csv contains employee names, departments, and salaries. Download both files and upload them to your OSS bucket before running the command.
./bin/spark-submit --name PySpark \
--queue dev_queue \
--num-executors 5 \
--driver-memory 1g \
--executor-cores 2 \
--executor-memory 2g \
--conf spark.tags.key=value \
oss://<your-bucket>/path/to/DataFrame.py \
oss://<your-bucket>/path/to/employee.csv
Run SQL queries with spark-sql
Use spark-sql to execute SQL statements or scripts directly, without writing application code.
Run an SQL statement inline:
spark-sql -e "SHOW TABLES"
This command lists all tables in the current database.
Run an SQL script file from OSS:
spark-sql -f oss://<your-bucket>/path/to/your/example.sql
Download the test file example.sql and upload it to your OSS bucket. The file contains:
spark-sql parameters:
| Parameter | Example | Description |
|---|---|---|
| -e "<sql>" | -e "SELECT * FROM table" | Execute an SQL statement inline. |
| -f <path> | -f oss://path/script.sql | Execute an SQL script file at the specified path. |
spark-submit parameters
Supported open source parameters
| Parameter | Example | Description |
|---|---|---|
| --name | SparkPi | Application name of the Spark job, used for identification. |
| --class | org.apache.spark.examples.SparkPi | Entry class name for Java or Scala programs. Not applicable to Python programs. |
| --num-executors | 5 | Number of executors for the Spark job. |
| --driver-cores | 1 | Number of CPU cores allocated to the Spark driver. |
| --driver-memory | 1g | Memory allocated to the Spark driver. |
| --executor-cores | 2 | Number of CPU cores allocated to each executor. |
| --executor-memory | 2g | Memory allocated to each executor. |
| --files | oss://<your-bucket>/file1,oss://<your-bucket>/file2 | Resource files required by the job. Accepts OSS paths and local paths. Separate multiple files with commas. |
| --py-files | oss://<your-bucket>/file1.py,oss://<your-bucket>/file2.py | Python scripts required by the job. Accepts OSS paths and local paths. Separate multiple files with commas. Applies to PySpark jobs only. |
| --jars | oss://<your-bucket>/file1.jar,oss://<your-bucket>/file2.jar | JAR files required by the job. Accepts OSS paths and local paths. Separate multiple files with commas. |
| --archives | oss://<your-bucket>/archive.tar.gz#env,oss://<your-bucket>/archive2.zip | Archive packages required by the job. Accepts OSS paths and local paths. Separate multiple files with commas. |
| --queue | root_queue | Queue where the job runs. The name must match a queue configured in the EMR Serverless Spark workspace. |
| --proxy-user | test | Overwrites the HADOOP_USER_NAME environment variable, matching open source behavior. |
| --conf | spark.tags.key=value | Custom Spark configuration properties. |
| --status | jr-8598aa9f459d**** | Query the status of a Spark job by job run ID. |
| --kill | jr-8598aa9f459d**** | Stop a Spark job by job run ID. |
Enhanced parameters
These parameters are specific to EMR Serverless Spark and are not part of standard Spark.
| Parameter | Example | Description |
|---|---|---|
| --detach | N/A | Exit immediately after submitting the job without waiting for or checking the job status. |
| --detail | jr-8598aa9f459d**** | View the details of a Spark job by job run ID. |
| --release-version | esr-4.1.1 (Spark 3.5.2, Scala 2.12) | Specify the engine version. Use the version string shown in the console. |
| --enable-template | N/A | Apply a configuration template to the job. See the behavior notes below. |
| --timeout | 60 | Job timeout period, in seconds. |
| --workspace-id | w-4b4d7925a797**** | Specify the workspace at the job level, overriding workspaceId in connection.properties. |
`--enable-template` behavior:
| Parameters specified | Result |
|---|---|
| --enable-template only | Applies the workspace's default configuration template. |
| --conf spark.emr.serverless.templateId only | Applies the specified template directly. |
| Both --enable-template and --conf spark.emr.serverless.templateId | The template ID in --conf overwrites the default template. |
| Neither | No template is applied. |
To create a configuration template, see Configuration management.
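The four cases in the behavior table reduce to a simple precedence rule: an explicit template ID wins, the flag alone selects the workspace default, and otherwise no template applies. The sketch below only models that documented behavior for illustration. It is not the CLI's implementation, and resolve_template and its output labels are hypothetical.

```shell
# Hypothetical model of the --enable-template behavior table.
#   $1: "true" if --enable-template is passed, anything else otherwise
#   $2: value of spark.emr.serverless.templateId, or empty if not set
resolve_template() {
  enable_template="$1"
  template_id="$2"
  if [ -n "$template_id" ]; then
    # An explicit template ID applies with or without --enable-template.
    echo "template:$template_id"
  elif [ "$enable_template" = "true" ]; then
    echo "workspace-default"
  else
    echo "none"
  fi
}
```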
Unsupported parameters
The following standard Spark parameters are not supported. Because EMR Serverless Spark manages cluster deployment, resource scheduling, and security at the platform level, these settings do not need to be configured per job.
- --deploy-mode
- --master
- --repositories
- --keytab
- --principal
- --total-executor-cores
- --driver-library-path
- --driver-class-path
- --supervise
- --verbose
Step 4: Query a Spark job
Using the CLI
Query job status:
cd emr-serverless-spark-tool-0.11.3-SNAPSHOT
./bin/spark-submit --status <job-run-id>
Query job details:
cd emr-serverless-spark-tool-0.11.3-SNAPSHOT
./bin/spark-submit --detail <job-run-id>
Using the console
- On the EMR Serverless Spark page, click Job History in the left navigation pane.
- Click the Development Jobs tab to view submitted jobs.

Step 5 (optional): Stop a Spark job
cd emr-serverless-spark-tool-0.11.3-SNAPSHOT
./bin/spark-submit --kill <job-run-id>
You can stop only jobs in the Running state.
FAQ
How do I specify network connectivity when submitting a batch job?
First, set up a network connection. See Add a network connection.
Then pass the connection name in your spark-submit command:
--conf spark.emr.serverless.network.service.name=<network-connection-name>
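To show where that option fits in a full command, the following sketch wraps spark-submit in a small function. submit_with_network and the SPARK_SUBMIT variable are hypothetical conveniences, not part of the CLI. Setting SPARK_SUBMIT=echo prints the arguments instead of submitting a job, which is useful for a dry run.

```shell
# Hypothetical wrapper: prepend the network connection setting to any
# spark-submit invocation. Run it from the tool directory, or point
# SPARK_SUBMIT at the spark-submit script.
#   $1: network connection name; remaining arguments go to spark-submit
submit_with_network() {
  if [ "$#" -lt 1 ] || [ -z "$1" ]; then
    echo "Usage: submit_with_network <network-connection-name> [spark-submit args...]" >&2
    return 1
  fi
  connection_name="$1"
  shift
  "${SPARK_SUBMIT:-./bin/spark-submit}" \
    --conf "spark.emr.serverless.network.service.name=${connection_name}" \
    "$@"
}

# Example call, using values from the SparkPi example in this topic:
# submit_with_network my-connection --name SparkPi --queue dev_queue \
#   --class org.apache.spark.examples.SparkPi \
#   oss://my-bucket/path/to/spark-examples_2.12-3.5.2.jar 10000
```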