Apache DolphinScheduler is a distributed, extensible, open-source workflow orchestration platform with a visual Directed Acyclic Graph (DAG) interface. This guide walks you through connecting DolphinScheduler to E-MapReduce (EMR) Serverless Spark and submitting Java Archive (JAR), SQL, and PySpark jobs from the DolphinScheduler web UI.
Background
The AliyunServerlessSpark Task Plugin has been merged into the main branch of Apache DolphinScheduler and will ship in a future official release. Until then, install it using one of the methods described in the prerequisites below.
Prerequisites
Before you begin, ensure that you have:
- Java Development Kit (JDK) 1.8 or later installed
- AliyunServerlessSpark Task Plugin installed using one of the following methods:
  - Method 1 (compile from source): Clone the main branch and compile. See apache/dolphinscheduler on GitHub.
  - Method 2 (cherry-pick): Integrate the plugin into your project using the cherry-pick pull request (PR). See [Feature-16127] Support emr serverless spark #16126 on GitHub.
Step 1: Create a data source
- Open the DolphinScheduler web UI and click Datasource in the top navigation bar.
- Click Create DataSource. In the Choose DataSource Type dialog box, select ALIYUN_SERVERLESS_SPARK.
- In the CreateDataSource dialog box, configure the following parameters:

| Parameter | Description |
|---|---|
| Datasource Name | A name for the data source |
| Access Key Id | Your Alibaba Cloud AccessKey ID |
| Access Key Secret | Your Alibaba Cloud AccessKey secret |
| Region Id | The region where your EMR Serverless Spark workspace resides, for example, cn-beijing. For supported regions, see Supported regions. |

- Click Test Connect. After the connectivity test passes, click Confirm.
Step 2: Create a project
- Click Project in the top navigation bar.
- Click Create Project.
- In the Create Project dialog box, set Project Name, User, and any other required fields. For details, see Project.
Step 3: Create a workflow
- Click the project name. In the left navigation pane, choose Workflow > Workflow Definition.
- Click Create Workflow. The workflow DAG edit page opens.
- In the left navigation pane, drag ALIYUN_SERVERLESS_SPARK onto the canvas.
- In the Current node settings dialog box, configure the node parameters based on your job type, then click Confirm.

The following sections list the parameters for each job type. Parameters shared across all three job types are listed first; job-specific parameters follow in each section.
Shared parameters
These parameters apply to JAR, SQL, and PySpark jobs.
| Parameter | Description |
|---|---|
| Datasource types | Select ALIYUN_SERVERLESS_SPARK |
| Datasource instances | Select the data source created in Step 1 |
| workspace id | The ID of your EMR Serverless Spark workspace |
| resource queue id | The ID of the resource queue in the EMR Serverless Spark workspace. Default: root_queue |
| is production | Enable this toggle if the job runs in a production environment |
| engine release version | The engine version. Default: esr-2.1-native (Spark 3.3.1, Scala 2.12, Native Runtime) |
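
If you keep task settings in version control before entering them in the UI, the shared parameters can be modeled as a small record. The following is a minimal sketch, not part of the plugin: the class name, field names, and the `missing_fields` helper are hypothetical, and the off-by-default value for the production toggle is an assumption. Only the two documented defaults come from the table above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SharedSparkTaskParams:
    """Hypothetical record mirroring the shared node parameters."""

    workspace_id: str
    datasource_instance: str
    datasource_type: str = "ALIYUN_SERVERLESS_SPARK"
    resource_queue_id: str = "root_queue"  # default listed in the table
    is_production: bool = False  # assumption: the toggle is off unless enabled
    engine_release_version: str = (
        "esr-2.1-native (Spark 3.3.1, Scala 2.12, Native Runtime)"  # default listed in the table
    )

    def missing_fields(self) -> List[str]:
        """Return the names of fields without a default that are blank."""
        required = {
            "workspace_id": self.workspace_id,
            "datasource_instance": self.datasource_instance,
        }
        return [name for name, value in required.items() if not value.strip()]


# Placeholder values; replace them with your own workspace ID and data source name.
params = SharedSparkTaskParams(
    workspace_id="w-example",
    datasource_instance="my-emr-datasource",
)
print(params.resource_queue_id)
print(params.missing_fields())
```
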
Submit a JAR job
Set code type to JAR, then configure the following parameters:
| Parameter | Description | Example |
|---|---|---|
| code type | Job type | JAR |
| job name | Name of the EMR Serverless Spark job | ds-emr-spark-jar |
| entry point | Path to the JAR file in OSS | oss://<yourBucketName>/spark-resource/examples/jars/spark-examples_2.12-3.3.1.jar |
| entry point arguments | Arguments passed to the job. Use # as the delimiter between arguments. | — |
| spark submit parameters | Spark configuration flags passed to spark-submit | See the example below |
Example spark submit parameters for a JAR job:
--class org.apache.spark.examples.SparkPi --conf spark.executor.cores=4 --conf spark.executor.memory=20g --conf spark.driver.cores=4 --conf spark.driver.memory=8g --conf spark.executor.instances=1
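
Typing a long parameter string by hand is error-prone, so it can help to generate it from a dictionary of Spark settings. The following is a minimal sketch; `build_spark_submit_params` is a hypothetical helper, not something DolphinScheduler or the plugin provides. It only assembles text to paste into the spark submit parameters field.

```python
from typing import Dict, Optional


def build_spark_submit_params(conf: Dict[str, str], main_class: Optional[str] = None) -> str:
    """Join an optional --class flag and --conf key=value pairs into one string."""
    parts = []
    if main_class:
        parts.append(f"--class {main_class}")
    for key, value in conf.items():
        parts.append(f"--conf {key}={value}")
    return " ".join(parts)


# Same settings as the JAR example above.
jar_conf = {
    "spark.executor.cores": "4",
    "spark.executor.memory": "20g",
    "spark.driver.cores": "4",
    "spark.driver.memory": "8g",
    "spark.executor.instances": "1",
}
jar_params = build_spark_submit_params(jar_conf, main_class="org.apache.spark.examples.SparkPi")
print(jar_params)
```

Omitting `main_class` yields a string with `--conf` flags only.
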
Submit an SQL job
Set code type to SQL, then configure the following parameters:
| Parameter | Description | Example |
|---|---|---|
| code type | Job type | SQL |
| job name | Name of the EMR Serverless Spark job | ds-emr-spark-sql |
| entry point | A valid file path | — |
| entry point arguments | The SQL script to run. Use # as the delimiter. | See the examples below |
| spark submit parameters | Spark configuration flags passed to spark-submit | See the example below |
Examples of entry point arguments:
- Submit an inline SQL script:
  -e#show tables;show tables;
- Submit an SQL script stored in OSS:
  -f#oss://<yourBucketName>/spark-resource/examples/sql/show_db.sql
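
The # delimiter in the examples above can be illustrated with a short sketch. It assumes the field is split on every # with no escaping mechanism, which this guide does not confirm. Both function names are hypothetical and only show how a flag and its value map to the single-line field.

```python
from typing import List

DELIMITER = "#"  # delimiter named in the parameter table


def join_entry_point_arguments(args: List[str]) -> str:
    """Build the field value; reject arguments that contain the delimiter."""
    for arg in args:
        if DELIMITER in arg:
            # Assumption: no escaping exists, so such an argument would be split.
            raise ValueError(f"argument contains {DELIMITER!r}: {arg!r}")
    return DELIMITER.join(args)


def split_entry_point_arguments(field: str) -> List[str]:
    """Split the field value the way this sketch assumes it is interpreted."""
    return field.split(DELIMITER) if field else []


inline_sql = join_entry_point_arguments(["-e", "show tables;show tables;"])
print(inline_sql)
print(split_entry_point_arguments("-f#oss://<yourBucketName>/spark-resource/examples/sql/show_db.sql"))
```

Under this assumption, an SQL statement that itself contains # cannot be passed inline; storing the script in OSS and using -f avoids the problem.
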
Example spark submit parameters for an SQL job:
--class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver --conf spark.executor.cores=4 --conf spark.executor.memory=20g --conf spark.driver.cores=4 --conf spark.driver.memory=8g --conf spark.executor.instances=1
Submit a PySpark job
Set code type to PYTHON, then configure the following parameters:
| Parameter | Description | Example |
|---|---|---|
| code type | Job type | PYTHON |
| job name | Name of the EMR Serverless Spark job | ds-emr-spark-pyspark |
| entry point | Path to the Python script in OSS | oss://<yourBucketName>/spark-resource/examples/src/main/python/pi.py |
| entry point arguments | Arguments passed to the script. Use # as the delimiter. | 1 |
| spark submit parameters | Spark configuration flags passed to spark-submit | See the example below |
Example spark submit parameters for a PySpark job:
--conf spark.executor.cores=4 --conf spark.executor.memory=20g --conf spark.driver.cores=4 --conf spark.driver.memory=8g --conf spark.executor.instances=1
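
Before pasting a parameter string into the node settings, you may want to check that it is well formed. The following is a minimal sketch using Python's standard `shlex` module. Both functions are hypothetical, the parser handles only the `--class` and `--conf` flags used in this guide, and the rule that JAR and SQL jobs carry `--class` while PySpark jobs do not is inferred from the examples here, not from a documented requirement.

```python
import shlex
from typing import Dict, Optional, Tuple


def parse_spark_submit_params(params: str) -> Tuple[Optional[str], Dict[str, str]]:
    """Parse a string of --class and --conf flags into (main_class, conf)."""
    tokens = shlex.split(params)
    main_class: Optional[str] = None
    conf: Dict[str, str] = {}
    i = 0
    while i < len(tokens):
        flag = tokens[i]
        if i + 1 >= len(tokens):
            raise ValueError(f"flag {flag!r} has no value")
        value = tokens[i + 1]
        if flag == "--class":
            main_class = value
        elif flag == "--conf":
            key, sep, setting = value.partition("=")
            if not sep:
                raise ValueError(f"--conf value is not key=value: {value!r}")
            conf[key] = setting
        else:
            raise ValueError(f"flag not handled by this sketch: {flag!r}")
        i += 2
    return main_class, conf


def check_class_flag(code_type: str, params: str) -> Optional[str]:
    """Return a warning if --class usage differs from the examples in this guide."""
    main_class, _ = parse_spark_submit_params(params)
    expects_class = code_type in ("JAR", "SQL")  # inferred from the examples only
    if expects_class and main_class is None:
        return f"{code_type} examples in this guide pass --class, but none was found"
    if not expects_class and main_class is not None:
        return f"{code_type} example in this guide does not pass --class"
    return None


pyspark_params = "--conf spark.executor.cores=4 --conf spark.executor.memory=20g"
print(parse_spark_submit_params(pyspark_params))
print(check_class_flag("PYTHON", pyspark_params))
```
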
What's next
For more information about DolphinScheduler workflows, task types, and scheduling options, see Apache DolphinScheduler documentation.