DolphinScheduler is a distributed and extensible open source workflow orchestration platform with powerful Directed Acyclic Graph (DAG) visual interfaces. DolphinScheduler can help you efficiently execute and manage workflows for large amounts of data. This topic describes how to create, edit, and schedule Spark jobs on the web UI of DolphinScheduler with ease.
Background information
The code of the DolphinScheduler AliyunServerlessSpark Task Plugin has been merged into the main branch of Apache DolphinScheduler and will be included in a subsequent official release. Until a new version is released, you can compile the main branch code of the plugin or integrate the plugin into your project by cherry-picking the related pull request (PR).
Prerequisites
Java Development Kit (JDK) 1.8 or later is installed.
AliyunServerlessSpark Task Plugin is installed by using one of the following methods:
Method 1: Compile the main branch code of AliyunServerlessSpark Task Plugin. For more information, see dolphinscheduler.
Method 2: Integrate AliyunServerlessSpark Task Plugin into your project by cherry-picking the related PR. For more information, see [Feature-16127] Support emr serverless spark #16126.
Procedure
Step 1: Create a data source
Access the web UI of DolphinScheduler. In the top navigation bar, click Datasource.
Click Create DataSource. In the Choose DataSource Type dialog box, select ALIYUN_SERVERLESS_SPARK.
In the Create DataSource dialog box, configure the parameters described in the following table.
Parameter
Description
Datasource Name
The name of the data source.
Access Key Id
The AccessKey ID of your Alibaba Cloud account.
Access Key Secret
The AccessKey secret of your Alibaba Cloud account.
Region Id
The ID of the region where the E-MapReduce (EMR) Serverless Spark workspace resides. Example: cn-beijing.
For information about supported regions, see Supported regions.
Click Test Connect. After the data source passes the connectivity test, click Confirm.
Step 2: Create a project
In the top navigation bar, click Project.
Click Create Project.
In the Create Project dialog box, configure the parameters, such as Project Name and User. For more information, see Project.
Step 3: Create a workflow
Click the name of the created project. In the left-side navigation pane, choose Workflow > Workflow Definition to go to the Workflow Definition page.
Click Create Workflow. The workflow DAG edit page appears.
In the left-side navigation pane, select ALIYUN_SERVERLESS_SPARK and drag it to the right-side canvas.
In the Current node settings dialog box, configure the parameters and click Confirm.
The parameter configurations vary based on the type of the job that you want to submit.
Parameters required to submit Java Archive (JAR) jobs
Parameter
Description
Datasource types
Select ALIYUN_SERVERLESS_SPARK.
Datasource instances
Select the created data source.
workspace id
The ID of the EMR Serverless Spark workspace.
resource queue id
The ID of the resource queue in the EMR Serverless Spark workspace. Default value: root_queue.
code type
The type of the job. Set this parameter to JAR.
job name
The name of the EMR Serverless Spark job. Example: ds-emr-spark-jar.
entry point
The file path of the JAR file. Example: oss://<yourBucketName>/spark-resource/examples/jars/spark-examples_2.12-3.3.1.jar.
entry point arguments
The arguments passed to the Spark job. You can use number signs (#) as delimiters to separate multiple arguments, as illustrated in the sketch after this table.
spark submit parameters
The parameters required to submit Spark JAR jobs. Example:
--class org.apache.spark.examples.SparkPi --conf spark.executor.cores=4 --conf spark.executor.memory=20g --conf spark.driver.cores=4 --conf spark.driver.memory=8g --conf spark.executor.instances=1
is production
Turn on the switch if the Spark job runs in the production environment.
engine release version
The engine version. Default value: esr-2.1-native (Spark 3.3.1, Scala 2.12, Native Runtime).
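The entry point arguments field takes a single string in which individual job arguments are separated by number signs (#). The following minimal Python sketch only illustrates this delimiter convention; the argument values are hypothetical and are not part of the plugin or of the example above.

# Illustration of the "#" delimiter convention used by the entry point arguments field.
# The values below are hypothetical; only the splitting behavior is the point.
single = "1000"                               # one argument, for example a SparkPi slice count
multi = "1000#oss://<yourBucketName>/output"  # hypothetical two-argument job

print(single.split("#"))   # ['1000']
print(multi.split("#"))    # ['1000', 'oss://<yourBucketName>/output']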
Parameters required to submit SQL jobs
Parameter
Description
Datasource types
Select ALIYUN_SERVERLESS_SPARK.
Datasource instances
Select the created data source.
workspace id
The ID of the EMR Serverless Spark workspace.
resource queue id
The ID of the resource queue in the EMR Serverless Spark workspace. Default value: root_queue.
code type
The type of the job. Set this parameter to SQL.
job name
The name of the EMR Serverless Spark job. Example: ds-emr-spark-sql.
entry point
The file path. You must enter a valid file path.
entry point arguments
The SQL code of the Spark job. You can use number signs (#) as delimiters to separate the parameters in the code. Examples:
Submit SQL statements directly: -e#show tables;show tables;
Submit an SQL script stored in OSS: -f#oss://<yourBucketName>/spark-resource/examples/sql/show_db.sql
See the sketch after this table for a hypothetical illustration of such statements.
spark submit parameters
The parameters required to submit Spark SQL jobs. Example:
--class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver --conf spark.executor.cores=4 --conf spark.executor.memory=20g --conf spark.driver.cores=4 --conf spark.driver.memory=8g --conf spark.executor.instances=1
is production
Turn on the switch if the Spark job runs in the production environment.
engine release version
The engine version. Default value: esr-2.1-native (Spark 3.3.1, Scala 2.12, Native Runtime).
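The statements submitted through -e, or through a script referenced by -f, are ordinary Spark SQL. Purely as an illustration, the following minimal PySpark sketch runs the kind of statements such a script might contain; the statements below are assumptions and are not the contents of the actual show_db.sql sample file.

from pyspark.sql import SparkSession

# Minimal sketch: run the kind of statements an SQL job might submit.
# The script contents below are hypothetical, not the actual show_db.sql sample.
spark = SparkSession.builder.appName("sql-job-sketch").enableHiveSupport().getOrCreate()

sql_script = """
SHOW DATABASES;
SHOW TABLES;
"""

# Execute each non-empty statement in order, as the SQL job would.
for statement in (s.strip() for s in sql_script.split(";")):
    if statement:
        spark.sql(statement).show()

spark.stop()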
Parameters required to submit PySpark jobs
Parameter
Description
Datasource types
Select ALIYUN_SERVERLESS_SPARK.
Datasource instances
Select the created data source.
workspace id
The ID of the EMR Serverless Spark workspace.
resource queue id
The ID of the resource queue in the EMR Serverless Spark workspace. Default value: root_queue.
code type
The type of the job. Set this parameter to PYTHON.
job name
The name of the EMR Serverless Spark job. Example: ds-emr-spark-jar.
entry point
The file path of the Python file. Example: oss://<yourBucketName>/spark-resource/examples/src/main/python/pi.py.
entry point arguments
The arguments passed to the Spark job. You can use number signs (#) as delimiters to separate multiple arguments. Example: 1 (see the sketch after this table).
spark submit parameters
The parameters required to submit PySpark jobs. Example:
--conf spark.executor.cores=4 --conf spark.executor.memory=20g --conf spark.driver.cores=4 --conf spark.driver.memory=8g --conf spark.executor.instances=1
is production
Turn on the switch if the Spark job runs in the production environment.
engine release version
The engine version. Default value: esr-2.1-native (Spark 3.3.1, Scala 2.12, Native Runtime).
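The entry point in the PySpark example is the pi.py sample that ships with Spark. For reference, the following is a minimal sketch along the lines of that sample, not necessarily the exact contents of the file in your OSS bucket; it shows how the single entry point argument (1 in the example) is read as the number of partitions.

import sys
from operator import add
from random import random

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("PythonPi").getOrCreate()

    # The single entry point argument ("1" in the example above) sets the partition count.
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def point_in_circle(_):
        # Sample a random point in the unit square; count it if it falls inside the circle.
        x, y = random() * 2 - 1, random() * 2 - 1
        return 1 if x * x + y * y <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions) \
        .map(point_in_circle).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    spark.stop()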
References
For more information about DolphinScheduler, see Apache DolphinScheduler.