
E-MapReduce: Use DolphinScheduler to submit Spark jobs

Last Updated: Apr 11, 2025

DolphinScheduler is a distributed and extensible open source workflow orchestration platform that provides a powerful visual interface for Directed Acyclic Graphs (DAGs). It helps you efficiently run and manage workflows over large amounts of data. This topic describes how to create, edit, and schedule Spark jobs on the web UI of DolphinScheduler.

Background information

The code of the DolphinScheduler AliyunServerlessSpark Task Plugin has been merged into the main branch of Apache DolphinScheduler and will be included in subsequent official releases. Until a new version is released, you can compile the main branch code of AliyunServerlessSpark Task Plugin or cherry-pick the related pull request (PR) to integrate the plugin into your project.

Prerequisites

  • Java Development Kit (JDK) 1.8 or later is installed.

  • AliyunServerlessSpark Task Plugin is installed. You can compile the plugin from the main branch code or cherry-pick the related PR into your project, as described in the Background information section.

Procedure

Step 1: Create a data source

  1. Access the web UI of DolphinScheduler. In the top navigation bar, click Datasource.

  2. Click Create DataSource. In the Choose DataSource Type dialog box, select ALIYUN_SERVERLESS_SPARK.

  3. In the Create DataSource dialog box, configure the following parameters.

    • Datasource Name: The name of the data source.

    • Access Key Id: The AccessKey ID of your Alibaba Cloud account.

    • Access Key Secret: The AccessKey secret of your Alibaba Cloud account.

    • Region Id: The ID of the region where the E-MapReduce (EMR) Serverless Spark workspace resides. Example: cn-beijing. For information about supported regions, see Supported regions.

  4. Click Test Connect. After the data source passes the connectivity test, click Confirm.

Step 2: Create a project

  1. In the top navigation bar, click Project.

  2. Click Create Project.

  3. In the Create Project dialog box, configure the parameters, such as Project Name and User. For more information, see Project.

Step 3: Create a workflow

  1. Click the name of the created project. In the left-side navigation pane, choose Workflow > Workflow Definition to go to the Workflow Definition page.

  2. Click Create Workflow. The workflow DAG edit page appears.

  3. In the left-side navigation pane, select ALIYUN_SERVERLESS_SPARK and drag it to the right-side canvas.

  4. In the Current node settings dialog box, configure the parameters and click Confirm.

    The parameter configurations vary based on the type of the job that you want to submit.

    Parameters required to submit Java Archive (JAR) jobs

    • Datasource types: Select ALIYUN_SERVERLESS_SPARK.

    • Datasource instances: Select the created data source.

    • workspace id: The ID of the EMR Serverless Spark workspace.

    • resource queue id: The ID of the resource queue in the EMR Serverless Spark workspace. Default value: root_queue.

    • code type: The type of the job. Set the parameter to JAR.

    • job name: The name of the EMR Serverless Spark job. Example: ds-emr-spark-jar.

    • entry point: The path of the JAR file. Example: oss://<yourBucketName>/spark-resource/examples/jars/spark-examples_2.12-3.3.1.jar.

    • entry point arguments: The arguments that are passed to the Spark job. You can use the number sign (#) as a delimiter to separate multiple arguments. For an illustration of how the delimited string is split into arguments, see the sketch that follows this list.

    • spark submit parameters: The parameters required to submit Spark JAR jobs. Example:

      --class org.apache.spark.examples.SparkPi --conf spark.executor.cores=4 --conf spark.executor.memory=20g --conf spark.driver.cores=4 --conf spark.driver.memory=8g --conf spark.executor.instances=1

    • is production: Turn on the switch if the job runs in the production environment.

    • engine release version: The engine version. Default value: esr-2.1-native (Spark 3.3.1, Scala 2.12, Native Runtime).
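
    The entry point arguments field accepts a single string in which individual arguments are separated by the number sign (#). The following Python snippet is a rough illustration of how such a string corresponds to separate arguments; the splitting logic shown here is an assumption for illustration only, not the plugin's actual implementation. The sample strings come from the examples in this topic.

      # Rough illustration only: split a '#'-delimited entry point arguments
      # string into individual arguments. The splitting shown here is an
      # assumption for illustration, not the plugin's actual source code.
      def split_entry_point_arguments(raw):
          return raw.split("#") if raw else []

      # A SQL job that runs a script stored in OSS (example used later in this topic):
      print(split_entry_point_arguments(
          "-f#oss://<yourBucketName>/spark-resource/examples/sql/show_db.sql"))
      # ['-f', 'oss://<yourBucketName>/spark-resource/examples/sql/show_db.sql']

      # A PySpark job that takes a single argument (number of partitions):
      print(split_entry_point_arguments("1"))
      # ['1']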

    Parameters required to submit SQL jobs

    • Datasource types: Select ALIYUN_SERVERLESS_SPARK.

    • Datasource instances: Select the created data source.

    • workspace id: The ID of the EMR Serverless Spark workspace.

    • resource queue id: The ID of the resource queue in the EMR Serverless Spark workspace. Default value: root_queue.

    • code type: The type of the job. Set the parameter to SQL.

    • job name: The name of the EMR Serverless Spark job. Example: ds-emr-spark-sql.

    • entry point: The file path. You must enter a valid file path.

    • entry point arguments: The arguments that are passed to the Spark SQL job. You can use the number sign (#) as a delimiter to separate multiple arguments. Examples:

      • Submit SQL statements directly:

        -e#show tables;show tables;
      • Submit an SQL script that is stored in OSS (a sketch that uploads such a script to OSS follows this list):

        -f#oss://<yourBucketName>/spark-resource/examples/sql/show_db.sql

    • spark submit parameters: The parameters required to submit Spark SQL jobs. Example:

      --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver --conf spark.executor.cores=4 --conf spark.executor.memory=20g --conf spark.driver.cores=4 --conf spark.driver.memory=8g --conf spark.executor.instances=1

    • is production: Turn on the switch if the job runs in the production environment.

    • engine release version: The engine version. Default value: esr-2.1-native (Spark 3.3.1, Scala 2.12, Native Runtime).
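
    When you use the -f option, the SQL script must already exist in Object Storage Service (OSS). The following snippet is a minimal sketch that uploads such a script by using the OSS Python SDK (oss2); the AccessKey pair, endpoint, bucket name, object key, and SQL content are placeholders that you need to replace with your own values.

      # Minimal sketch: upload an SQL script to OSS with the oss2 SDK so that a
      # Spark SQL job can reference it through the -f option.
      # The AccessKey pair, endpoint, bucket name, object key, and SQL text are placeholders.
      import oss2

      auth = oss2.Auth("<yourAccessKeyId>", "<yourAccessKeySecret>")
      bucket = oss2.Bucket(auth, "https://oss-cn-beijing.aliyuncs.com", "<yourBucketName>")

      # Replace the content with your own SQL statements.
      sql_script = "SHOW DATABASES;\n"
      bucket.put_object("spark-resource/examples/sql/show_db.sql", sql_script)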

    Parameters required to submit PySpark jobs

    • Datasource types: Select ALIYUN_SERVERLESS_SPARK.

    • Datasource instances: Select the created data source.

    • workspace id: The ID of the EMR Serverless Spark workspace.

    • resource queue id: The ID of the resource queue in the EMR Serverless Spark workspace. Default value: root_queue.

    • code type: The type of the job. Set the parameter to PYTHON.

    • job name: The name of the EMR Serverless Spark job. Example: ds-emr-spark-jar.

    • entry point: The path of the Python file. Example: oss://<yourBucketName>/spark-resource/examples/src/main/python/pi.py. A sketch of such a script follows this list.

    • entry point arguments: The arguments that are passed to the Spark job. You can use the number sign (#) as a delimiter to separate multiple arguments. Example: 1.

    • spark submit parameters: The parameters required to submit PySpark jobs. Example:

      --conf spark.executor.cores=4 --conf spark.executor.memory=20g --conf spark.driver.cores=4 --conf spark.driver.memory=8g --conf spark.executor.instances=1

    • is production: Turn on the switch if the job runs in the production environment.

    • engine release version: The engine version. Default value: esr-2.1-native (Spark 3.3.1, Scala 2.12, Native Runtime).
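
    The entry point of a PySpark job is a Python file, such as pi.py in the preceding example, and the entry point arguments are passed to that file as command-line arguments. The following snippet is a minimal sketch along the lines of the pi.py example that ships with Spark: it estimates pi and reads the number of partitions from its first argument, which is why the example argument above is 1. Treat it as an illustration rather than the exact contents of the file in your bucket.

      # Minimal PySpark sketch in the spirit of Spark's bundled pi.py example.
      # It estimates pi with a Monte Carlo method and reads the number of
      # partitions from the first command-line argument (the entry point argument).
      import sys
      from operator import add
      from random import random

      from pyspark.sql import SparkSession

      if __name__ == "__main__":
          spark = SparkSession.builder.appName("PythonPi").getOrCreate()
          partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
          n = 100000 * partitions

          def inside(_):
              x = random() * 2 - 1
              y = random() * 2 - 1
              return 1 if x ** 2 + y ** 2 <= 1 else 0

          count = (
              spark.sparkContext.parallelize(range(1, n + 1), partitions)
              .map(inside)
              .reduce(add)
          )
          print("Pi is roughly %f" % (4.0 * count / n))
          spark.stop()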

References

For more information about DolphinScheduler, see Apache DolphinScheduler.