Azkaban is an open-source batch workflow job scheduler for creating, executing, and managing workflows with complex dependencies. Use it to schedule AnalyticDB for MySQL Spark jobs from the Azkaban web interface.
Prerequisites
Before you begin, make sure you have:
An AnalyticDB for MySQL Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster
A job resource group or a Spark interactive resource group created for the cluster
Beeline installed (required only for Spark SQL in interactive mode)
The IP address of the Azkaban server added to the cluster's IP address whitelist
Schedule Spark SQL jobs
AnalyticDB for MySQL supports Spark SQL in batch and interactive mode. The steps differ based on which mode you use.
Batch mode
In batch mode, submit Spark SQL through the spark-submit command-line tool from the adb-spark-toolkit-submit package.
Install the spark-submit tool and configure the required parameters.
Configure only the following parameters: keyId, secretId, regionId, clusterId, and rgName.
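How these parameters are set depends on the toolkit release. As a minimal sketch, assuming the adb-spark-toolkit-submit package reads its defaults from conf/spark-defaults.conf (verify the file name and property keys against your installation), the configuration could look like the following; every value is a placeholder:

```bash
# Minimal sketch: set the required AnalyticDB for MySQL connection parameters.
# Assumes the toolkit reads defaults from conf/spark-defaults.conf; the file name,
# property keys, and all values below are placeholders to adapt to your setup.
cd /<your path>/adb-spark-toolkit-submit
cat >> conf/spark-defaults.conf <<'EOF'
keyId = <yourAccessKeyId>
secretId = <yourAccessKeySecret>
regionId = <yourRegionId>
clusterId = <yourClusterId>
rgName = <yourResourceGroupName>
EOF
```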
Write a workflow file. Azkaban uses the Flow 2.0 YAML format: each entry under nodes is a job with name, type, and config.command fields. Use dependsOn to define dependencies between jobs. Compress the workflow folder as a ZIP file (a packaging sketch follows the example).
Important:
- Replace <your path> with the actual installation path of the spark-submit tool.
- Do not use backslashes (\) in the command.
```yaml
nodes:
  - name: SparkPi
    type: command
    config:
      command: /<your path>/adb-spark-toolkit-submit/bin/spark-submit --class com.aliyun.adb.spark.sql.OfflineSqlTemplate local:///opt/spark/jars/offline-sql.jar "show databases" "select 100"
    dependsOn:
      - jobA
      - jobB
  - name: jobA
    type: command
    config:
      command: echo "This is an echoed text."
  - name: jobB
    type: command
    config:
      command: pwd
```
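An Azkaban Flow 2.0 project is uploaded as a ZIP archive that contains the .flow file plus a .project file declaring the flow version. A minimal packaging sketch, assuming the workflow above is saved as spark.flow (all file and archive names here are placeholders):

```bash
# Minimal packaging sketch for an Azkaban Flow 2.0 project.
# File and archive names below are placeholders.
mkdir spark-project
# The .project file declares the Flow 2.0 format and must be included in the ZIP.
echo "azkaban-flow-version: 2.0" > spark-project/flow20.project
# Copy the workflow definition written above into the project folder as a .flow file.
cp spark.flow spark-project/
# Package the folder contents; upload the resulting ZIP in the Azkaban web interface.
cd spark-project && zip -r ../spark-project.zip .
```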
Create a project and upload the workflow file.
Open the Azkaban web interface. In the top navigation bar, click Projects.
Click Create Project in the upper-right corner.
In the Create Project dialog, fill in the Name and Description fields, then click Create Project.
Click Upload in the upper-right corner.
In the Upload Project Files dialog, select the ZIP file and click Upload.
Run the workflow.
On the Projects page, click the Flows tab.
Click Execute Flow.
Click Execute.
In the Flow submitted message, click Continue.
View workflow details.
In the top navigation bar, click Executing.
Click the Recently Finished tab.
Click the execution ID of the workflow, then click the Job List tab to view details for each job.
Click Logs to view job logs.
Interactive mode
In interactive mode, submit Spark SQL through the Beeline client, which connects to a Spark interactive resource group via JDBC.
Get the connection URL of the Spark interactive resource group.
Log on to the AnalyticDB for MySQL console. In the upper-left corner, select a region. In the left-side navigation pane, click Clusters.
On the Enterprise Edition, Basic Edition, or Data Lakehouse Edition tab, find the cluster and click the cluster ID.
In the left-side navigation pane, choose Cluster Management > Resource Management. Click the Resource Groups tab.
Find the Spark interactive resource group, then click Details in the Actions column to view the internal or public connection URL.
Apply for a public endpoint by clicking Apply for Endpoint next to Public Endpoint when:
- The Beeline client runs on-premises.
- The Beeline client runs on an Elastic Compute Service (ECS) instance in a different virtual private cloud (VPC) from your AnalyticDB for MySQL cluster.
Write a workflow file and compress the workflow folder as a ZIP file. Each node runs a Beeline command that connects to the resource group and executes SQL statements. Use dependsOn to define dependencies between nodes.
```yaml
nodes:
  - name: jobB
    type: command
    config:
      command: <path> -u "jdbc:hive2://amv-t4n83e67n7b****sparkwho.ads.aliyuncs.com:10000/adb_demo" -n spark_interactive_prod/spark_user -p "spark_password" -e "show databases;show tables;"
    dependsOn:
      - jobA
  - name: jobA
    type: command
    config:
      command: <path> -u "jdbc:hive2://amv-t4n83e67n7b****sparkwho.ads.aliyuncs.com:10000/adb_demo" -n spark_interactive_prod/spark_user -p "spark_password" -e "show tables;"
```
Replace the placeholders with your actual values:

| Parameter | Description | Example |
| --- | --- | --- |
| <path> | Path to the Beeline client | /path/to/spark/bin/beeline |
| -u | JDBC connection URL from step 1. Replace default in the URL with your database name and remove the resource_group=<resource group name> suffix. | jdbc:hive2://amv-t4naxpqk****sparkwho.ads.aliyuncs.com:10000/adb_demo |
| -n | Database account and resource group in the format resource_group_name/database_account_name | spark_interactive_prod/spark_user |
| -p | Password of the database account | |
| -e | SQL statements to run. Separate multiple statements with semicolons (;). | show databases;show tables; |
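Before scheduling the flow, it can help to run the same Beeline command once from a shell to confirm that the endpoint, account, and resource group are correct. A sketch using the placeholder values from the example above:

```bash
# Optional connectivity check before scheduling: run the same Beeline command
# that the workflow node will execute. All values are placeholders taken from
# the example above; replace them with your endpoint, account, and password.
/path/to/spark/bin/beeline \
  -u "jdbc:hive2://amv-t4n83e67n7b****sparkwho.ads.aliyuncs.com:10000/adb_demo" \
  -n spark_interactive_prod/spark_user \
  -p "spark_password" \
  -e "show databases;"
```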
Create a project and upload the workflow file.
Open the Azkaban web interface. In the top navigation bar, click Projects.
Click Create Project in the upper-right corner.
In the Create Project dialog, fill in the Name and Description fields, then click Create Project.
Click Upload in the upper-right corner.
In the Upload Project Files dialog, select the ZIP file and click Upload.
Run the workflow.
On the Projects page, click the Flows tab.
Click Execute Flow.
Click Execute.
In the Flow submitted message, click Continue.
View workflow details.
In the top navigation bar, click Executing.
Click the Recently Finished tab.
Click the execution ID of the workflow, then click the Job List tab to view details for each job.
Click Logs to view job logs.
Schedule Spark JAR jobs
Submit Spark JAR jobs using the spark-submit tool. The workflow structure is the same as for Spark SQL jobs in batch mode, with --class and the JAR path pointing to your application.
Install the spark-submit tool and configure the required parameters.
Configure only the following parameters: keyId, secretId, regionId, clusterId, and rgName. If the Spark JAR package is stored on your local machine, also specify Object Storage Service (OSS) parameters such as ossUploadPath.
Write a workflow file and compress the workflow folder as a ZIP file.
Important:
- Replace <your path> with the actual installation path of the spark-submit tool.
- Do not use backslashes (\) in the command.
```yaml
nodes:
  - name: SparkPi
    type: command
    config:
      command: /<your path>/adb-spark-toolkit-submit/bin/spark-submit --class org.apache.spark.examples.SparkPi --name SparkPi --conf spark.driver.resourceSpec=medium --conf spark.executor.instances=2 --conf spark.executor.resourceSpec=medium local:///tmp/spark-examples.jar 1000
    dependsOn:
      - jobA
      - jobB
  - name: jobA
    type: command
    config:
      command: echo "This is an echoed text."
  - name: jobB
    type: command
    config:
      command: pwd
```
Key spark-submit parameters:

| Parameter | Description |
| --- | --- |
| --class | The fully qualified main class of your application |
| --name | A display name for the Spark job |
| --conf spark.driver.resourceSpec | Resource size for the Spark driver (e.g., medium) |
| --conf spark.executor.instances | Number of Spark executors to launch |
| --conf spark.executor.resourceSpec | Resource size for each Spark executor (e.g., medium) |
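Before scheduling the flow, the same spark-submit command can be run once from a shell to confirm the toolkit configuration and the JAR path. A sketch using the placeholder values from the example above:

```bash
# Optional check before scheduling: submit the job once from the shell to verify
# the toolkit configuration and the JAR location. <your path> and the JAR path
# are placeholders taken from the example above.
/<your path>/adb-spark-toolkit-submit/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --name SparkPi \
  --conf spark.driver.resourceSpec=medium \
  --conf spark.executor.instances=2 \
  --conf spark.executor.resourceSpec=medium \
  local:///tmp/spark-examples.jar 1000
```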
Create a project and upload the workflow file.
Open the Azkaban web interface. In the top navigation bar, click Projects.
Click Create Project in the upper-right corner.
In the Create Project dialog, fill in the Name and Description fields, then click Create Project.
Click Upload in the upper-right corner.
In the Upload Project Files dialog, select the ZIP file and click Upload.
Run the workflow.
On the Projects page, click the Flows tab.
Click Execute Flow.
Click Execute.
In the Flow submitted message, click Continue.
View workflow details.
In the top navigation bar, click Executing.
Click the Recently Finished tab.
Click the execution ID of the workflow, then click the Job List tab to view details for each job.
Click Logs to view job logs.