E-MapReduce:Submit jobs using spark-submit

Last Updated: Nov 21, 2025

This topic describes how to submit and manage Spark jobs by using the EMR Serverless spark-submit command-line interface (CLI). The examples use an ECS instance that is connected to EMR Serverless Spark.

Prerequisites

  • Java 1.8 or later is installed. You can verify this as shown after this list.

  • If you use a RAM user to submit a Spark job, you must add the RAM user to the Serverless Spark workspace and grant the user the Developer role or a higher-level role. For more information, see Manage users and roles.
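
To check the installed Java version, run:

    java -version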

Procedure

Step 1: Download and install the EMR Serverless spark-submit tool

  1. Click emr-serverless-spark-tool-0.11.3-SNAPSHOT-bin.zip to download the installation package.

  2. Upload the installation package to the ECS instance. For more information, see Upload or download files.

  3. Run the following command to decompress and install the EMR Serverless spark-submit tool.

    unzip emr-serverless-spark-tool-0.11.3-SNAPSHOT-bin.zip
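
    You can optionally list the extracted directory to confirm the installation. It should contain at least the bin and conf directories that are used in the following steps:

    ls emr-serverless-spark-tool-0.11.3-SNAPSHOT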

Step 2: Configure parameters

Important

If the SPARK_CONF_DIR environment variable is set in your Spark environment, you must place the configuration file in the directory specified by SPARK_CONF_DIR. Otherwise, an error occurs. For example, in an EMR cluster, this directory is usually /etc/taihao-apps/spark-conf.
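
For example, on a node where SPARK_CONF_DIR is set, you can copy the configuration file into that directory. The following sketch assumes that you extracted the tool into the current directory:

    cp emr-serverless-spark-tool-0.11.3-SNAPSHOT/conf/connection.properties "$SPARK_CONF_DIR/"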

  1. Run the following command to modify the configuration in connection.properties.

    vim emr-serverless-spark-tool-0.11.3-SNAPSHOT/conf/connection.properties
  2. Configure the parameters in the file. The parameter format is key=value. The following code shows an example.

    accessKeyId=<ALIBABA_CLOUD_ACCESS_KEY_ID>
    accessKeySecret=<ALIBABA_CLOUD_ACCESS_KEY_SECRET>
    regionId=cn-hangzhou
    endpoint=emr-serverless-spark.cn-hangzhou.aliyuncs.com
    workspaceId=w-xxxxxxxxxxxx
    Important

    The RAM user or RAM role that corresponds to this AccessKey pair must be granted the required RAM permissions and must be added to the Serverless Spark workspace.

    The following list describes the parameters:

    • accessKeyId (required): The AccessKey ID of the Alibaba Cloud account or RAM user that you use to run the Spark job.

    • accessKeySecret (required): The AccessKey secret of the Alibaba Cloud account or RAM user that you use to run the Spark job.

      Important: Make sure that the user that corresponds to the AccessKey pair has read and write permissions on the Object Storage Service (OSS) bucket that is attached to the workspace. To view the OSS bucket attached to the workspace, go to the Spark page and click Details in the Actions column of the workspace.

    • regionId (required): The region ID. This topic uses the China (Hangzhou) region as an example.

    • endpoint (required): The endpoint of EMR Serverless Spark. For more information, see Service endpoints. This topic uses the public endpoint of the China (Hangzhou) region as an example: emr-serverless-spark.cn-hangzhou.aliyuncs.com.

      Note: If the ECS instance cannot access the public network, use a VPC endpoint.

    • workspaceId (required): The ID of the EMR Serverless Spark workspace.

Step 3: Submit a Spark job

  1. Run the following command to navigate to the directory of the EMR Serverless spark-submit tool.

    cd emr-serverless-spark-tool-0.11.3-SNAPSHOT
  2. Select a submission method based on the job type.

    When you submit a job, you must specify the dependent resources, such as JAR packages or Python scripts. These files can be stored in OSS or on a local disk. The storage location depends on your use case and requirements. The examples in this topic use resources that are stored in OSS.
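
    For example, if you use the ossutil command-line tool, you can upload a local file to OSS with a command similar to the following. The bucket name and path are placeholders:

    ossutil cp spark-examples_2.12-3.3.1.jar oss://<yourBucket>/path/to/spark-examples_2.12-3.3.1.jar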

    Using spark-submit

    spark-submit is a general-purpose job submission tool provided by Spark. It is suitable for Java, Scala, and PySpark jobs.

    Java/Scala jobs

    This example uses spark-examples_2.12-3.3.1.jar, which is a simple example included with Spark that calculates the value of Pi (π). You can click spark-examples_2.12-3.3.1.jar to download the test JAR package and upload it to OSS.

    ./bin/spark-submit --name SparkPi \
    --queue dev_queue \
    --num-executors 5 \
    --driver-memory 1g \
    --executor-cores 2 \
    --executor-memory 2g \
    --class org.apache.spark.examples.SparkPi \
    oss://<yourBucket>/path/to/spark-examples_2.12-3.3.1.jar \
    10000
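
    As noted above, the referenced resources can also be read from a local disk. The following sketch submits the same example JAR from a local path instead of OSS; the local path is a placeholder:

    ./bin/spark-submit --name SparkPi \
    --queue dev_queue \
    --class org.apache.spark.examples.SparkPi \
    /path/to/spark-examples_2.12-3.3.1.jar \
    1000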

    PySpark jobs

    This example uses DataFrame.py and employee.csv. You can click DataFrame.py and employee.csv to download the test files and upload them to OSS.

    Note
    • The DataFrame.py file contains the code that is used to process data in Object Storage Service (OSS) under the Apache Spark framework.

    • The employee.csv file contains data such as employee names, departments, and salaries.

    ./bin/spark-submit --name PySpark \
    --queue dev_queue  \
    --num-executors 5 \
    --driver-memory 1g \
    --executor-cores 2 \
    --executor-memory 2g \
    --conf spark.tags.key=value \
    oss://<yourBucket>/path/to/DataFrame.py \
    oss://<yourBucket>/path/to/employee.csv
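
    If your entry script imports additional Python modules, you can ship them by using the --py-files parameter that is described in the parameter list below. The helpers.py file in this sketch is a hypothetical example:

    ./bin/spark-submit --name PySpark \
    --queue dev_queue \
    --py-files oss://<yourBucket>/path/to/helpers.py \
    oss://<yourBucket>/path/to/DataFrame.py \
    oss://<yourBucket>/path/to/employee.csv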

    The following sections describe the parameters:

    • Compatible open-source parameters

      • --name (example: SparkPi): The name of the Spark application. This parameter is used to identify the job.

      • --class (example: org.apache.spark.examples.SparkPi): The entry class name of the Spark job for a Java or Scala program. This parameter is not required for Python programs.

      • --num-executors (example: 5): The number of executors for the Spark job.

      • --driver-cores (example: 1): The number of driver cores for the Spark job.

      • --driver-memory (example: 1g): The driver memory size for the Spark job.

      • --executor-cores (example: 2): The number of executor cores for the Spark job.

      • --executor-memory (example: 2g): The executor memory size for the Spark job.

      • --files (example: oss://<yourBucket>/file1,oss://<yourBucket>/file2): The resource files that the Spark job references. The files can be OSS resources or local files. Separate multiple files with commas (,).

      • --py-files (example: oss://<yourBucket>/file1.py,oss://<yourBucket>/file2.py): The Python scripts that the Spark job references. The scripts can be OSS resources or local files. Separate multiple files with commas (,). This parameter is valid only for PySpark programs.

      • --jars (example: oss://<yourBucket>/file1.jar,oss://<yourBucket>/file2.jar): The JAR packages that the Spark job references. The packages can be OSS resources or local files. Separate multiple files with commas (,).

      • --archives (example: oss://<yourBucket>/archive.tar.gz#env,oss://<yourBucket>/archive2.zip): The archive packages that the Spark job references. The packages can be OSS resources or local files. Separate multiple files with commas (,).

      • --queue (example: root_queue): The name of the queue in which the Spark job runs. The name must be the same as a queue name in the queue management section of the EMR Serverless Spark workspace.

      • --proxy-user (example: test): Overrides the HADOOP_USER_NAME environment variable. The behavior is the same as in the open-source version.

      • --conf (example: spark.tags.key=value): Custom parameters for the Spark job.

      • --status (example: jr-8598aa9f459d****): Views the status of a Spark job.

      • --kill (example: jr-8598aa9f459d****): Stops a Spark job.

    • Enhanced parameters

      • --detach (no value required): If you specify this parameter, spark-submit exits immediately after it submits the job and does not wait for or query the job status.

      • --detail (example: jr-8598aa9f459d****): Views the details of a Spark job.

      • --release-version (example: esr-4.1.1 (Spark 3.5.2, Scala 2.12)): The Spark version. Enter the engine version that is displayed in the console.

      • --enable-template (no value required): Enables the template feature. The job uses the default configuration template of the workspace.

        If you created a configuration template in Configuration Management, you can specify the template ID by setting the spark.emr.serverless.templateId parameter in --conf. The job then applies the specified template. For more information about how to create a template, see Configuration management. The behavior is as follows (see the example command after these parameter lists):

        • If you specify only --enable-template, the job applies the default configuration template of the workspace.

        • If you specify only the template ID in --conf, the job applies the specified template.

        • If you specify both --enable-template and --conf spark.emr.serverless.templateId, the template ID in --conf overrides the default template.

        • If you specify neither --enable-template nor --conf spark.emr.serverless.templateId, the job does not apply any template configuration.

      • --timeout (example: 60): The job timeout period. Unit: seconds.

      • --workspace-id (example: w-4b4d7925a797****): Specifies the workspace ID at the job level. This value overrides the workspaceId parameter in the connection.properties file.

    • Unsupported open-source parameters

      • --deploy-mode

      • --master

      • --repositories

      • --keytab

      • --principal

      • --total-executor-cores

      • --driver-library-path

      • --driver-class-path

      • --supervise

      • --verbose
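
    The following sketch shows how a configuration template might be applied when you submit a job, as referenced in the description of --enable-template. The template ID is a placeholder; use the ID that is shown in Configuration Management:

      ./bin/spark-submit --name SparkPi \
      --queue dev_queue \
      --class org.apache.spark.examples.SparkPi \
      --conf spark.emr.serverless.templateId=<templateId> \
      oss://<yourBucket>/path/to/spark-examples_2.12-3.3.1.jar \
      1000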

    Using spark-sql

    spark-sql is a tool used to run SQL queries or scripts. It is suitable for scenarios in which you run SQL statements directly.

    • Example 1: Run an SQL statement directly

      spark-sql -e "SHOW TABLES"

      This command lists all tables in the current database.

    • Example 2: Run an SQL script file

      spark-sql -f oss://<yourBucketname>/path/to/your/example.sql

      This example uses example.sql. You can click example.sql to download the test file and upload it to OSS.

      Example content of the example.sql file

      CREATE TABLE IF NOT EXISTS employees (
          id INT,
          name STRING,
          age INT,
          department STRING
      );
      
      INSERT INTO employees VALUES
      (1, 'Alice', 30, 'Engineering'),
      (2, 'Bob', 25, 'Marketing'),
      (3, 'Charlie', 35, 'Sales');
      
      SELECT * FROM employees;
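
      After the script completes, you can, for example, check the loaded data by running an inline query against the employees table that the script creates:

      spark-sql -e "SELECT department, COUNT(*) FROM employees GROUP BY department"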
      

    The following list describes the parameters:

    • -e "<sql>" (example: -e "SELECT * FROM table"): Runs an SQL statement inline from the command line.

    • -f <path> (example: -f oss://path/script.sql): Runs the SQL script file at the specified path.

Step 4: Query a Spark job

Using the CLI

Query the status of a Spark job

cd emr-serverless-spark-tool-0.11.3-SNAPSHOT
./bin/spark-submit --status <jr-8598aa9f459d****>

Query the details of a Spark job

cd emr-serverless-spark-tool-0.11.3-SNAPSHOT
./bin/spark-submit --detail <jr-8598aa9f459d****>
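
The --workspace-id parameter described in the enhanced parameters can also be useful here: because it overrides the workspaceId value in connection.properties at the job level, a sketch like the following could query a job in another workspace. The IDs are placeholders, and this combination is an assumption based on the parameter descriptions rather than a verified command:

./bin/spark-submit --status <jr-8598aa9f459d****> --workspace-id <w-4b4d7925a797****>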

Using the UI

  1. On the EMR Serverless Spark page, click Job History in the navigation pane on the left.

  2. On the Development Jobs tab of the Job History page, you can view the submitted jobs.

(Optional) Step 5: Stop a Spark job

cd emr-serverless-spark-tool-0.11.3-SNAPSHOT
./bin/spark-submit --kill <jr-8598aa9f459d****>
Note

You can stop only running jobs.

FAQ

How do I specify network connectivity when I submit a batch job using the spark-submit tool?

  1. Prepare a network connection. For more information, see Add a network connection.

  2. In the spark-submit command, use --conf to specify the network connection.

    --conf spark.emr.serverless.network.service.name=<networkname>

    Replace <networkname> with the name of your network connection.
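
    For example, a full submission command that specifies a network connection might look like the following sketch, which reuses the SparkPi example from Step 3:

    ./bin/spark-submit --name SparkPi \
    --queue dev_queue \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.emr.serverless.network.service.name=<networkname> \
    oss://<yourBucket>/path/to/spark-examples_2.12-3.3.1.jar \
    1000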