DolphinScheduler is a distributed, extensible open source workflow orchestration platform with a visual Directed Acyclic Graph (DAG) editor. Use it to create, schedule, and monitor Spark jobs for AnalyticDB for MySQL clusters.
How it works
DolphinScheduler connects to AnalyticDB for MySQL Spark in two ways, depending on execution mode:
Batch mode and JAR jobs: DolphinScheduler uses a SHELL task to invoke the spark-submit command-line tool, which submits the job to an AnalyticDB for MySQL job resource group.
Interactive mode: DolphinScheduler uses a SQL task to connect to a Spark interactive resource group over JDBC (port 10000) and sends SQL statements directly.
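Both patterns can be exercised from a shell before you build any workflow, which is a quick way to validate connectivity. The sketch below assumes the spark-submit toolkit is installed under /root/adb-spark-toolkit-submit (the path used in the examples later in this guide) and uses Apache Hive's beeline client to test the JDBC path; the endpoint, database, and account values are placeholders.

    # Batch mode / JAR jobs: a SHELL task ultimately runs a command like this one
    # (the class and JAR match the Spark SQL batch example later in this guide).
    /root/adb-spark-toolkit-submit/bin/spark-submit \
      --class com.aliyun.adb.spark.sql.OfflineSqlTemplate \
      local:///opt/spark/jars/offline-sql.jar "show databases"

    # Interactive mode: a SQL task connects over JDBC (port 10000). beeline is used
    # here only to verify the endpoint; replace the host, database, and account.
    beeline -u "jdbc:hive2://amv-t4naxpqk****sparkwho.ads.aliyuncs.com:10000/adb_demo" \
      -n your_database_account -p your_password \
      -e "SHOW DATABASES;"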
Prerequisites
Before you begin, make sure you have:
An AnalyticDB for MySQL Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster
A job resource group or a Spark interactive resource group created for the cluster
Java Development Kit (JDK) V1.8 or later installed
DolphinScheduler installed (version 3.2.1 used in this guide)
The IP address of the DolphinScheduler server added to the IP address whitelist of the cluster
Schedule Spark SQL jobs
AnalyticDB for MySQL supports Spark SQL in batch or interactive mode. The steps differ by mode.
Batch mode
In batch mode, DolphinScheduler runs a SHELL task that calls the spark-submit tool to submit Spark SQL to a job resource group.
Steps in this section:
Install and configure spark-submit
Create a project
Create a workflow with a SHELL task
Run the workflow
View execution results
Step 1: Install and configure spark-submit
Install the spark-submit command-line tool and configure the required parameters.
For Spark SQL batch jobs, configure only the following parameters: keyId, secretId, regionId, clusterId, and rgName.
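A minimal sketch of such a configuration follows, assuming the toolkit reads its settings from conf/spark-defaults.conf under the installation directory; the file location and all values here are placeholders, so follow the spark-submit installation instructions for the authoritative format.

    # Hypothetical contents of /root/adb-spark-toolkit-submit/conf/spark-defaults.conf
    keyId       LTAI5t****************    # AccessKey ID of your Alibaba Cloud account
    secretId    D2kh******************    # AccessKey secret
    regionId    cn-hangzhou               # Region ID of the AnalyticDB for MySQL cluster
    clusterId   amv-bp1***************    # Cluster ID
    rgName      test_rg                   # Name of the job resource group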
Step 2: Create a project
Open the DolphinScheduler web interface. In the top navigation bar, click Project.
Click Create Project.
In the Create Project dialog box, enter a Project Name and configure Owned Users.
Step 3: Create a workflow
Click the project name. In the left-side navigation pane, choose Workflow > Workflow Definition.
Click Create Workflow to open the workflow DAG edit page.
In the left-side list, select SHELL and drag it onto the canvas.
In the Current node settings dialog box, configure the following parameters.
Important: Always specify the full installation path of spark-submit in the script. If the path is omitted, the scheduling task cannot find the spark-submit command.
For other SHELL task parameters, see DolphinScheduler Task Parameters Appendix.
Node name: A name for the workflow node.
Script: The full installation path of spark-submit, followed by the job arguments. Example:
/root/adb-spark-toolkit-submit/bin/spark-submit --class com.aliyun.adb.spark.sql.OfflineSqlTemplate local:///opt/spark/jars/offline-sql.jar "show databases" "select 100"
Click Confirm.
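The Script field accepts a multi-line shell script, so the example above can also be entered with line continuations for readability (same command, unchanged semantics):

    /root/adb-spark-toolkit-submit/bin/spark-submit \
      --class com.aliyun.adb.spark.sql.OfflineSqlTemplate \
      local:///opt/spark/jars/offline-sql.jar \
      "show databases" \
      "select 100"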
Click Save in the upper-right corner. In the Basic Information dialog box, enter a Workflow Name and click Confirm.
Step 4: Run the workflow
Find the workflow in the list and click the publish icon in the Operation column to publish it.
Click the run icon in the Operation column.
In the Please set the parameters before starting dialog box, configure the parameters.
Click Confirm to start the workflow.
Step 5: View execution results
In the left-side navigation pane, choose Task > Task Instance.
Find the task and click the log icon in the Operation column to view the execution results and logs.
Interactive mode
In interactive mode, DolphinScheduler uses a SQL task that connects to a Spark interactive resource group over JDBC. This approach lets you send SQL statements without managing the spark-submit command.
Steps in this section:
Get the connection URL of the Spark interactive resource group
Create a data source in DolphinScheduler
Create a project
Create a workflow with a SQL task
Run the workflow
View execution results
Step 1: Get the connection URL
Log on to the AnalyticDB for MySQL console. In the upper-left corner, select a region. In the left-side navigation pane, click Clusters.
On the Enterprise Edition, Basic Edition, or Data Lakehouse Edition tab, find your cluster and click the cluster ID.
In the left-side navigation pane, choose Cluster Management > Resource Management. Click the Resource Groups tab.
Find the Spark interactive resource group and click Details in the Actions column. Click the copy icon next to the port number to copy the internal or public connection URL.
Apply for a public endpoint if either of the following conditions is true:
The client tool is deployed on premises.
The client tool runs on an Elastic Compute Service (ECS) instance in a different virtual private cloud (VPC) from the cluster.
To apply, click Apply for Endpoint next to Public Endpoint.
Step 2: Create a data source
In the DolphinScheduler top navigation bar, click Datasource.
Click Create DataSource.
In the Create DataSource dialog box, configure the following parameters.
For other optional parameters, see MySQL.
DataSource: The data source type. Select SPARK.
Datasource name: A name for the data source.
IP: The JDBC endpoint from Step 1, modified as follows: replace default with the actual database name, and remove the resource_group=<resource group name> suffix. Example: jdbc:hive2://amv-t4naxpqk****sparkwho.ads.aliyuncs.com:10000/adb_demo
Port: The port number for Spark interactive resource groups. Enter 10000.
User name: The database account name for the AnalyticDB for MySQL cluster.
Database name: The name of the database in the cluster.
Click Test Connect. After the test succeeds, click Confirm.
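To make the IP transformation above concrete, the following sketch derives the final value from a raw console URL. The exact separator before the resource_group suffix is an assumption here and may differ in your console; the host name is a placeholder.

    # Hypothetical URL as copied from the console:
    RAW_URL='jdbc:hive2://amv-t4naxpqk****sparkwho.ads.aliyuncs.com:10000/default;resource_group=test_rg'

    # Replace "default" with the real database name and drop the resource_group suffix.
    DB_NAME='adb_demo'
    CLEAN_URL=$(echo "$RAW_URL" | sed -e 's/;resource_group=[^;]*//' -e "s#/default#/${DB_NAME}#")
    echo "$CLEAN_URL"
    # -> jdbc:hive2://amv-t4naxpqk****sparkwho.ads.aliyuncs.com:10000/adb_demo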
Step 3: Create a project
In the top navigation bar, click Project.
Click Create Project.
In the Create Project dialog box, enter a Project Name and configure Owned Users.
Step 4: Create a workflow
Click the project name. In the left-side navigation pane, choose Workflow > Workflow Definition.
Click Create Workflow to open the workflow DAG edit page.
In the left-side list, select SQL and drag it onto the canvas.
In the Current node settings dialog box, configure the following parameters.
Datasource types: The data source type. Select SPARK.
Datasource instances: The data source created in Step 2.
SQL type: The type of SQL job. Valid values: Query (statements that return a result set, such as SELECT) and Non Query (statements that do not, such as INSERT or CREATE TABLE).
SQL statement: The SQL statement to run.
Click Confirm.
Click Save in the upper-right corner. In the Basic Information dialog box, enter a Workflow Name and click Confirm.
Step 5: Run the workflow
Find the workflow and click the publish icon in the Operation column to publish it.
Click the run icon in the Operation column.
In the Please set the parameters before starting dialog box, configure the parameters.
Click Confirm to start the workflow.
Step 6: View execution results
In the left-side navigation pane, choose Task > Task Instance.
Find the task and click the log icon in the Operation column to view the execution results and logs.
Schedule Spark JAR jobs
Spark JAR jobs follow the same SHELL task pattern as Spark SQL batch mode, with spark-submit invoking your JAR file directly.
Steps in this section:
Install and configure spark-submit
Create a project
Create a workflow with a SHELL task
Run the workflow
View execution results
Step 1: Install and configure spark-submit
Install the spark-submit command-line tool and configure the required parameters.
Configure at minimum: keyId, secretId, regionId, clusterId, and rgName. If the JAR package is stored on your local device rather than in Object Storage Service (OSS), also specify OSS parameters such as ossUploadPath.
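Extending the batch-mode configuration sketch from earlier, a local JAR setup might add a line like the following to the same hypothetical conf/spark-defaults.conf; ossUploadPath is the only OSS parameter named in this guide, and the bucket path is a placeholder.

    # OSS path that local JAR files are uploaded to before submission (placeholder value).
    ossUploadPath   oss://your-bucket/spark-jars/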
Step 2: Create a project
Open the DolphinScheduler web interface. In the top navigation bar, click Project.
Click Create Project.
In the Create Project dialog box, enter a Project Name and configure Owned Users.
Step 3: Create a workflow
Click the project name. In the left-side navigation pane, choose Workflow > Workflow Definition.
Click Create Workflow to open the workflow DAG edit page.
In the left-side list, select SHELL and drag it onto the canvas.
In the Current node settings dialog box, configure the following parameters.
Important: Always specify the full installation path of spark-submit in the script. If the path is omitted, the scheduling task cannot find the spark-submit command.
For other SHELL task parameters, see DolphinScheduler Task Parameters Appendix.
Node name: A name for the workflow node.
Script: The full installation path of spark-submit, followed by the JAR job arguments. Example:
/root/adb-spark-toolkit-submit/bin/spark-submit --class org.apache.spark.examples.SparkPi --name SparkPi --conf spark.driver.resourceSpec=medium --conf spark.executor.instances=2 --conf spark.executor.resourceSpec=medium local:///tmp/spark-examples.jar 1000
Click Confirm.
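As with the Spark SQL example, the Script field can hold the command split across lines for readability (same command, unchanged semantics):

    /root/adb-spark-toolkit-submit/bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --name SparkPi \
      --conf spark.driver.resourceSpec=medium \
      --conf spark.executor.instances=2 \
      --conf spark.executor.resourceSpec=medium \
      local:///tmp/spark-examples.jar 1000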
Click Save in the upper-right corner. In the Basic Information dialog box, enter a Workflow Name and click Confirm.
Step 4: Run the workflow
Find the workflow and click the publish icon in the Operation column to publish it.
Click the run icon in the Operation column.
In the Please set the parameters before starting dialog box, configure the parameters.
Click Confirm to start the workflow.
Step 5: View execution results
In the left-side navigation pane, choose Task > Task Instance.
Find the task and click the log icon in the Operation column to view the execution results and logs.