E-MapReduce: Submit a Spark job using the spark-submit command-line interface (CLI)

Last Updated: Mar 26, 2026

Connect an Elastic Compute Service (ECS) instance to EMR Serverless Spark and submit Spark jobs from it using the EMR Serverless spark-submit command-line interface (CLI).

Prerequisites

Before you begin, make sure you have:

  • Java 1.8 or later installed on your ECS instance

  • (For RAM users) A RAM user that has been added to the Serverless Spark workspace with the Developer role or higher. See Manage users and roles.
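One way to confirm the Java requirement on the ECS instance is to parse the output of java -version. The following is a minimal sketch; the helper names java_major and check_java are illustrative and are not part of the EMR Serverless tooling.

```shell
# Print the major Java version for a version string such as "1.8.0_392" or "17.0.9".
java_major() {
  jm_v="$1"
  case "$jm_v" in
    1.*) jm_v="${jm_v#1.}" ;;   # legacy scheme: 1.8.0_392 -> 8.0_392
  esac
  echo "${jm_v%%[._-]*}"        # keep everything before the first ".", "_", or "-"
}

# Check that the installed Java is 1.8 (major version 8) or later.
check_java() {
  # java -version writes to stderr, for example: openjdk version "1.8.0_392"
  cj_raw=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
  if [ -z "$cj_raw" ]; then
    echo "Java not found or version string not recognized" >&2
    return 1
  fi
  if [ "$(java_major "$cj_raw")" -ge 8 ]; then
    echo "Java $cj_raw is supported"
  else
    echo "Java $cj_raw is too old; install 1.8 or later" >&2
    return 1
  fi
}
```

After defining both functions, run check_java on the ECS instance before installing the CLI.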

Placeholders in this topic

Replace the following placeholders with your actual values before running the commands:

Placeholder | Description | Example
<ALIBABA_CLOUD_ACCESS_KEY_ID> | AccessKey ID of the Alibaba Cloud account or RAM user | LTAI5tXxx
<ALIBABA_CLOUD_ACCESS_KEY_SECRET> | AccessKey secret of the Alibaba Cloud account or RAM user | xXxXxXx
<region-id> | Region ID where your workspace is deployed | cn-hangzhou
<workspace-id> | EMR Serverless Spark workspace ID | w-xxxxxxxxxxxx
<your-bucket> | OSS bucket name bound to the workspace | my-bucket
<job-run-id> | Job run ID returned after submitting a job | jr-8598aa9f459d****

Step 1: Download and install the CLI

  1. Download the installation package: emr-serverless-spark-tool-0.11.3-SNAPSHOT-bin.zip.

  2. Upload the package to your ECS instance. See Upload or download files.

  3. Decompress the package:

    unzip emr-serverless-spark-tool-0.11.3-SNAPSHOT-bin.zip

    After decompression, a directory named emr-serverless-spark-tool-0.11.3-SNAPSHOT is created in your current working directory.

Step 2: Configure the connection

  1. Open the configuration file:

    vim emr-serverless-spark-tool-0.11.3-SNAPSHOT/conf/connection.properties

    Important

    If the SPARK_CONF_DIR environment variable is set (for example, on an EMR cluster it is typically /etc/taihao-apps/spark-conf), place connection.properties in that directory instead. Otherwise, the system returns an error.

  2. Set the following required parameters using key=value format:

    Important

    The RAM user or role associated with the AccessKey must be granted RAM authorization and added to the Serverless Spark workspace. See RAM user authorization and Manage users and roles.

    accessKeyId=<ALIBABA_CLOUD_ACCESS_KEY_ID>
    accessKeySecret=<ALIBABA_CLOUD_ACCESS_KEY_SECRET>
    regionId=<region-id>
    endpoint=emr-serverless-spark.<region-id>.aliyuncs.com
    workspaceId=<workspace-id>

    The following table describes all parameters:

    Parameter | Description
    accessKeyId | AccessKey ID of the Alibaba Cloud account or RAM user used to run Spark jobs. The associated user must have read and write permissions on the Object Storage Service (OSS) bucket bound to the workspace. To find the bound OSS bucket, go to the Spark page and click Details in the Actions column.
    accessKeySecret | AccessKey secret paired with accessKeyId.
    regionId | Region ID of your workspace. This example uses cn-hangzhou (China (Hangzhou)).
    endpoint | EMR Serverless Spark service endpoint. This example uses the public endpoint for China (Hangzhou): emr-serverless-spark.cn-hangzhou.aliyuncs.com. For all endpoints, see Service endpoints. If your ECS instance cannot access the public network, use the VPC endpoint instead.
    workspaceId | ID of the EMR Serverless Spark workspace.
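As an alternative to editing the file by hand, the five required keys can be written from environment variables. This is a sketch under assumptions: the function name write_connection_properties and the variable names REGION_ID and WORKSPACE_ID are hypothetical, and the endpoint line follows the public endpoint pattern shown in this topic (use the VPC endpoint instead if the instance has no public network access).

```shell
# Write connection.properties from environment variables.
# $1: path to the unpacked tool directory
write_connection_properties() {
  # If SPARK_CONF_DIR is set, the file belongs there; otherwise use <tool-dir>/conf.
  wcp_dir="${SPARK_CONF_DIR:-$1/conf}"
  if [ -z "${ALIBABA_CLOUD_ACCESS_KEY_ID:-}" ] || [ -z "${ALIBABA_CLOUD_ACCESS_KEY_SECRET:-}" ] ||
     [ -z "${REGION_ID:-}" ] || [ -z "${WORKSPACE_ID:-}" ]; then
    echo "set ALIBABA_CLOUD_ACCESS_KEY_ID, ALIBABA_CLOUD_ACCESS_KEY_SECRET, REGION_ID, and WORKSPACE_ID first" >&2
    return 1
  fi
  mkdir -p "$wcp_dir" || return 1
  (
    umask 077  # a newly created file is readable and writable by its owner only
    cat > "$wcp_dir/connection.properties" <<EOF
accessKeyId=$ALIBABA_CLOUD_ACCESS_KEY_ID
accessKeySecret=$ALIBABA_CLOUD_ACCESS_KEY_SECRET
regionId=$REGION_ID
endpoint=emr-serverless-spark.$REGION_ID.aliyuncs.com
workspaceId=$WORKSPACE_ID
EOF
  )
}
```

For example, after exporting the four variables, run write_connection_properties emr-serverless-spark-tool-0.11.3-SNAPSHOT. Keeping the AccessKey pair in the environment avoids pasting the secret into shell history or scripts.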

Step 3: Submit a Spark job

Navigate to the tool directory:

cd emr-serverless-spark-tool-0.11.3-SNAPSHOT

Choose a submission method based on your job type.

Submit Java or Scala jobs with spark-submit

spark-submit is Spark's general-purpose job submission tool for Java, Scala, and PySpark jobs.

The following example runs a built-in Spark job that calculates the value of Pi (π). Download the test JAR file and upload it to your OSS bucket before running the command.

./bin/spark-submit --name SparkPi \
  --queue dev_queue \
  --num-executors 5 \
  --driver-memory 1g \
  --executor-cores 2 \
  --executor-memory 2g \
  --class org.apache.spark.examples.SparkPi \
  oss://<your-bucket>/path/to/spark-examples_2.12-3.5.2.jar \
  10000

Submit PySpark jobs with spark-submit

The following example uses DataFrame.py and employee.csv. DataFrame.py processes data in OSS using the Apache Spark framework; employee.csv contains employee names, departments, and salaries. Download both files and upload them to your OSS bucket before running the command.

./bin/spark-submit --name PySpark \
  --queue dev_queue \
  --num-executors 5 \
  --driver-memory 1g \
  --executor-cores 2 \
  --executor-memory 2g \
  --conf spark.tags.key=value \
  oss://<your-bucket>/path/to/DataFrame.py \
  oss://<your-bucket>/path/to/employee.csv
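The Java/Scala and PySpark examples pass the same queue and resource flags. If you submit many jobs, a small wrapper can hold those defaults in one place. The sketch below is illustrative only: the function name submit_job and the environment variables QUEUE, NUM_EXECUTORS, DRIVER_MEMORY, EXECUTOR_CORES, EXECUTOR_MEMORY, and DRY_RUN are hypothetical and not part of the CLI. It assumes you run it from the tool directory.

```shell
# Submit a job with shared resource defaults; all remaining arguments are passed through unchanged.
# Set DRY_RUN=1 to print the command instead of running it.
submit_job() {
  sj_name="$1"; shift
  set -- ./bin/spark-submit --name "$sj_name" \
    --queue "${QUEUE:-dev_queue}" \
    --num-executors "${NUM_EXECUTORS:-5}" \
    --driver-memory "${DRIVER_MEMORY:-1g}" \
    --executor-cores "${EXECUTOR_CORES:-2}" \
    --executor-memory "${EXECUTOR_MEMORY:-2g}" \
    "$@"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"   # show the assembled command line
  else
    "$@"                 # run spark-submit with the assembled arguments
  fi
}
```

For example, DRY_RUN=1 submit_job PySpark oss://<your-bucket>/path/to/DataFrame.py oss://<your-bucket>/path/to/employee.csv prints the full command so you can review it before submitting.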

Run SQL queries with spark-sql

Use spark-sql to execute SQL statements or scripts directly, without writing application code.

Run an SQL statement inline:

spark-sql -e "SHOW TABLES"

This command lists all tables in the current database.

Run an SQL script file from OSS:

spark-sql -f oss://<your-bucket>/path/to/your/example.sql

Download the test file example.sql and upload it to your OSS bucket. The file contains:


CREATE TABLE IF NOT EXISTS employees (
    id INT,
    name STRING,
    age INT,
    department STRING
);

INSERT INTO employees VALUES
(1, 'Alice', 30, 'Engineering'),
(2, 'Bob', 25, 'Marketing'),
(3, 'Charlie', 35, 'Sales');

SELECT * FROM employees;

spark-sql parameters:

Parameter | Example | Description
-e "<sql>" | -e "SELECT * FROM table" | Execute an SQL statement inline.
-f <path> | -f oss://path/script.sql | Execute an SQL script file at the specified path.

spark-submit parameters

Supported open source parameters

Parameter | Example | Description
--name | SparkPi | Application name of the Spark job, used for identification.
--class | org.apache.spark.examples.SparkPi | Entry class name for Java or Scala programs. Not applicable to Python programs.
--num-executors | 5 | Number of executors for the Spark job.
--driver-cores | 1 | Number of CPU cores allocated to the Spark driver.
--driver-memory | 1g | Memory allocated to the Spark driver.
--executor-cores | 2 | Number of CPU cores allocated to each executor.
--executor-memory | 2g | Memory allocated to each executor.
--files | oss://<your-bucket>/file1,oss://<your-bucket>/file2 | Resource files required by the job. Accepts OSS paths and local paths. Separate multiple files with commas.
--py-files | oss://<your-bucket>/file1.py,oss://<your-bucket>/file2.py | Python scripts required by the job. Accepts OSS paths and local paths. Separate multiple files with commas. Applies to PySpark jobs only.
--jars | oss://<your-bucket>/file1.jar,oss://<your-bucket>/file2.jar | JAR files required by the job. Accepts OSS paths and local paths. Separate multiple files with commas.
--archives | oss://<your-bucket>/archive.tar.gz#env,oss://<your-bucket>/archive2.zip | Archive packages required by the job. Accepts OSS paths and local paths. Separate multiple files with commas.
--queue | root_queue | Queue where the job runs. The name must match a queue configured in the EMR Serverless Spark workspace.
--proxy-user | test | Overwrites the HADOOP_USER_NAME environment variable, matching open source behavior.
--conf | spark.tags.key=value | Custom Spark configuration properties.
--status | jr-8598aa9f459d**** | Query the status of a Spark job by job run ID.
--kill | jr-8598aa9f459d**** | Stop a Spark job by job run ID.
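Because --files, --py-files, --jars, and --archives each take a single comma-separated value, a small helper can build that value from a list of paths. This is a sketch; join_comma is a hypothetical helper name, not a CLI feature.

```shell
# Join all arguments with commas and print the result on one line.
join_comma() {
  jc_out=""
  for jc_item in "$@"; do
    # Prepend a comma only when something has already been collected.
    jc_out="${jc_out:+$jc_out,}$jc_item"
  done
  printf '%s\n' "$jc_out"
}
```

For example, --jars "$(join_comma oss://<your-bucket>/file1.jar oss://<your-bucket>/file2.jar)" expands to the comma-separated form shown in the table.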

Enhanced parameters

These parameters are specific to EMR Serverless Spark and are not part of standard Spark.

Parameter | Example | Description
--detach | N/A | Exit immediately after submitting the job without waiting for or checking the job status.
--detail | jr-8598aa9f459d**** | View the details of a Spark job by job run ID.
--release-version | esr-4.1.1 (Spark 3.5.2, Scala 2.12) | Specify the engine version. Use the version string shown in the console.
--enable-template | N/A | Apply a configuration template to the job. See the behavior notes below.
--timeout | 60 | Job timeout period, in seconds.
--workspace-id | w-4b4d7925a797**** | Specify the workspace at the job level, overriding workspaceId in connection.properties.

--enable-template behavior:

Parameters specified | Result
--enable-template only | Applies the workspace's default configuration template.
--conf spark.emr.serverless.templateId only | Applies the specified template directly.
Both --enable-template and --conf spark.emr.serverless.templateId | The template ID in --conf takes precedence over the default template.
Neither | No template is applied.

To create a configuration template, see Configuration management.
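The precedence rules for --enable-template can be read as a short decision procedure. The function below only mirrors the behavior table for illustration; resolve_template and its output labels are hypothetical and do not call the service.

```shell
# Report which configuration template would apply.
# $1: "true" if --enable-template is passed, otherwise "false"
# $2: value of spark.emr.serverless.templateId, or an empty string if it is not set
resolve_template() {
  if [ -n "$2" ]; then
    echo "template:$2"         # an explicit template ID applies with or without --enable-template
  elif [ "$1" = "true" ]; then
    echo "workspace-default"   # only --enable-template: the workspace default template
  else
    echo "none"                # neither is specified: no template
  fi
}
```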

Unsupported parameters

The following standard Spark parameters are not supported. Because EMR Serverless Spark manages cluster deployment, resource scheduling, and security at the platform level, these settings do not need to be configured per job.

  • --deploy-mode

  • --master

  • --repositories

  • --keytab

  • --principal

  • --total-executor-cores

  • --driver-library-path

  • --driver-class-path

  • --supervise

  • --verbose
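When migrating existing spark-submit scripts, it can help to scan the arguments for the unsupported parameters listed above before submitting. The following is a minimal sketch; find_unsupported is a hypothetical helper, and it checks every argument, so an application argument that happens to share one of these names is also reported.

```shell
# Report arguments that match a parameter listed as unsupported.
# Returns 0 if none are found, 1 otherwise.
find_unsupported() {
  fu_list="--deploy-mode --master --repositories --keytab --principal --total-executor-cores --driver-library-path --driver-class-path --supervise --verbose"
  fu_found=0
  for fu_arg in "$@"; do
    fu_name="${fu_arg%%=*}"          # also handle the --flag=value form
    for fu_bad in $fu_list; do
      if [ "$fu_name" = "$fu_bad" ]; then
        echo "unsupported parameter: $fu_name" >&2
        fu_found=1
      fi
    done
  done
  [ "$fu_found" -eq 0 ]
}
```

For example, find_unsupported --master yarn --name SparkPi prints a message for --master and returns a non-zero status.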

Step 4: Query a Spark job

Using the CLI

Query job status:

cd emr-serverless-spark-tool-0.11.3-SNAPSHOT
./bin/spark-submit --status <job-run-id>

Query job details:

cd emr-serverless-spark-tool-0.11.3-SNAPSHOT
./bin/spark-submit --detail <job-run-id>
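To wait for a job to finish, you can poll --status in a loop. This topic does not show the output format of --status, so the sketch below does not assume one: you supply an extended regular expression that matches the terminal states as they appear in your CLI's actual output. The function name wait_for_state and its arguments are hypothetical.

```shell
# Run a command repeatedly until its output matches a pattern.
# $1: extended regular expression for terminal states (supplied by you)
# $2: seconds to wait between attempts
# $3: maximum number of attempts
# remaining arguments: the command to run
wait_for_state() {
  wfs_pattern="$1"; wfs_interval="$2"; wfs_max="$3"; shift 3
  wfs_try=0
  while [ "$wfs_try" -lt "$wfs_max" ]; do
    wfs_out=$("$@" 2>&1) || true      # keep polling even if the command exits non-zero
    if printf '%s\n' "$wfs_out" | grep -Eq "$wfs_pattern"; then
      printf '%s\n' "$wfs_out"        # print the matching output and stop
      return 0
    fi
    wfs_try=$((wfs_try + 1))
    sleep "$wfs_interval"
  done
  echo "no terminal state after $wfs_max attempts" >&2
  return 1
}
```

For example, wait_for_state '<terminal-state-pattern>' 10 60 ./bin/spark-submit --status <job-run-id> checks every 10 seconds for up to 60 attempts.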

Using the console

  1. On the EMR Serverless Spark page, click Job History in the left navigation pane.

  2. Click the Development Jobs tab to view submitted jobs.


Step 5 (optional): Stop a Spark job

cd emr-serverless-spark-tool-0.11.3-SNAPSHOT
./bin/spark-submit --kill <job-run-id>

Note

You can stop only jobs in the Running state.

FAQ

How do I specify network connectivity when submitting a batch job?

First, set up a network connection. See Add a network connection.

Then pass the connection name in your spark-submit command:

--conf spark.emr.serverless.network.service.name=<network-connection-name>