AnalyticDB: Schedule Spark jobs in DataWorks

Last Updated: Nov 19, 2025

DataWorks provides end-to-end capabilities for big data development and administration and supports multiple compute engines, including AnalyticDB for MySQL. The DataWorks Data Development (DataStudio) module supports visual workflow development, managed scheduling, and operations and maintenance (O&M). This lets you easily manage and schedule tasks based on time and dependencies. You can use ADB Spark SQL and ADB Spark nodes in DataWorks to develop and schedule Spark jobs, such as Spark SQL jobs and Spark application jobs.

Prerequisites

  • Your AnalyticDB for MySQL cluster must meet the following requirements:

    • An AnalyticDB for MySQL Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster is created.

    • An Object Storage Service (OSS) bucket is created in the same region as the AnalyticDB for MySQL cluster.

    • A database account is created for the AnalyticDB for MySQL cluster.

    • A resource group is created for the AnalyticDB for MySQL cluster. For more information, see Create and manage resource groups.

      • To develop Spark SQL jobs, create an Interactive resource group in AnalyticDB for MySQL. The engine type for the Interactive resource group must be Spark.

      • To develop Spark JAR or PySpark jobs, create a Job resource group in AnalyticDB for MySQL.

    • AnalyticDB for MySQL is authorized to assume the AliyunADBSparkProcessingDataRole role to access other cloud resources.

    • The log storage path of Spark applications is configured.

      Note

      Log on to the AnalyticDB for MySQL console. Find the cluster that you want to manage and click the cluster ID. In the left-side navigation pane, choose Job Development > Spark JAR Development. Click Log Settings. In the dialog box that appears, select the default path or specify a custom storage path. You cannot set the custom storage path to the root directory of OSS. Make sure that the custom storage path contains at least one layer of folders.

  • Your DataWorks workspace must meet the following requirements:

    • The AnalyticDB for MySQL cluster and the DataWorks workspace are in the same region.

    • A DataWorks workspace is created and uses the new Data Studio.

    • A resource group is created and attached. For more information, see Create and attach a resource group.

      Note

      When you create the resource group, make sure that its VPC is the same as the VPC of the AnalyticDB for MySQL cluster.

    • The IPv4 CIDR block of the vSwitch that is used by the resource group attached to the DataWorks workspace is added to the whitelist of the AnalyticDB for MySQL cluster. For more information, see Setting a Whitelist.

    • An AnalyticDB for Spark computing resource is attached to your DataWorks workspace. For more information, see Attach an AnalyticDB for Spark computing resource.

Schedule Spark SQL jobs in DataWorks

In AnalyticDB for MySQL, you can develop jobs for both external and internal tables. This topic uses an external table job as an example to describe the steps for developing and scheduling a Spark SQL job in DataWorks.

Step 1: Create an ADB Spark SQL node

  1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

  2. Click the image icon next to Project Folder, and select Create Node > ADB > ADB Spark SQL.

  3. In the dialog box that appears, enter a name for the node and press Enter.

Step 2: Develop the ADB Spark SQL node

  1. Create an external database in the ADB Spark SQL node. This topic uses an external database as an example. For information about how to create an internal table, see Create an internal table using Spark SQL.

    CREATE DATABASE IF NOT EXISTS `adb_spark_db` LOCATION 'oss://testBucketname/db_dome';
  2. Create an external table adb_spark_db.tb_order in the ADB Spark SQL node.

    CREATE TABLE IF NOT EXISTS adb_spark_db.tb_order(id int, name string, age int) 
    USING parquet 
    LOCATION 'oss://testBucketname/db_dome/tb1' 
    TBLPROPERTIES ('parquet.compress'='SNAPPY');
  3. Query data.

    After the external table is created, you can use a SELECT statement in AnalyticDB for MySQL to query the Parquet data.

    SELECT * FROM adb_spark_db.tb_order limit 100;
  4. Create a Delta Lake table adb_spark_db.raw_order in the ADB Spark SQL node.

    CREATE TABLE IF NOT EXISTS adb_spark_db.raw_order(id int, name string, age int) 
    USING delta;
  5. Import data from adb_spark_db.tb_order to adb_spark_db.raw_order.

    INSERT INTO adb_spark_db.raw_order SELECT * FROM adb_spark_db.tb_order;
  6. Create a database in AnalyticDB for MySQL. If a database is already created, you can skip this step. The following is an example:

    CREATE DATABASE adb_demo; 
  7. Create an internal table in AnalyticDB for MySQL to store the data imported from the Delta Lake table. The following is an example:

    CREATE TABLE adb_demo.order_xuanwu_format (
        `id` int, 
        `name` string, 
        `age` int) 
    using adb 
        TBLPROPERTIES (
        'distributeType'='HASH',
        'distributeColumns' = 'id',
        'storagePolicy' = 'hot'
    );
  8. Import the data from the Delta Lake table adb_spark_db.raw_order to order_xuanwu_format.

    INSERT OVERWRITE adb_demo.order_xuanwu_format SELECT * FROM adb_spark_db.raw_order;
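
    Optionally, you can run a quick query to verify the import. The following is a minimal check that only assumes the table names used in this example:

    SELECT COUNT(*) FROM adb_demo.order_xuanwu_format;
    SELECT * FROM adb_demo.order_xuanwu_format LIMIT 10;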

Step 3: Configure and run the ADB Spark SQL node

  1. In the right-side pane, click Debugging Configurations to configure runtime parameters for the ADB Spark SQL node.

    • Computing Resource

      • Computing Resource: Select the attached AnalyticDB for Spark computing resource.

      • AnalyticDB Computing Resource Group: Select the Interactive resource group with the Spark engine that you created in the AnalyticDB for MySQL cluster.

    • Resource Group

      • Resource Group: Select the DataWorks resource group that passed the connectivity test when you attached the AnalyticDB for Spark computing resource.

      • CUs For Computing: The node uses the default CU value. You do not need to modify this value.

    • Script Parameters

      • Parameter Name: The name of the parameter that you specified in the ADB Spark SQL node. For example, you can configure the $[yyyymmdd] parameter in the script to perform batch synchronization of daily new data, as shown in the example after these steps. For the supported parameters and their formats, see Configure Scheduling Parameters.

        Note

        The system automatically detects the parameter names configured in the node.

      • Parameter Value: The value of the parameter. At runtime, the parameter is dynamically replaced with its actual value.

  2. (Optional) To run node tasks on a schedule, in the Scheduling pane on the right side of the node, configure the Computing Resource, AnalyticDB Computing Resource Group, and Resource Group parameters in the Scheduling Policies section. Then, configure the parameters in the Scheduling Parameters section.

  3. After you configure the debug settings, click the image icon to save the SQL node. Then, click the image icon to test the SQL script and verify that it runs as expected.

  4. After the scheduling configuration is complete, you can submit and publish the node to the production environment.

  5. After you publish a task, it runs periodically based on the configured parameters. You can view and manage the published auto triggered tasks on the Operation Center > Node O&M > Auto Triggered Node O&M > Auto Triggered Nodes page. For more information, see Introduction to Operation Center.
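
The Script Parameters example above ($[yyyymmdd]) can be referenced in the node script through the ${parameter_name} placeholder, which DataWorks replaces at run time. The following is a minimal sketch only: the table adb_spark_db.tb_order_daily and its dt partition column are hypothetical, and it assumes that a script parameter named bizdate is assigned the value $[yyyymmdd] for this node.

    -- Assumption: the script parameter bizdate=$[yyyymmdd] is configured for this node.
    -- DataWorks replaces ${bizdate} with the resolved date, such as 20250101, at run time.
    INSERT INTO adb_spark_db.raw_order
    SELECT id, name, age
    FROM adb_spark_db.tb_order_daily
    WHERE dt = '${bizdate}';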

Schedule Spark JAR jobs in DataWorks

Step 1: Create an ADB Spark node

  1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

  2. Click the image icon next to Project Folder, and select Create Node > ADB > ADB Spark.

  3. In the dialog box that appears, enter a name for the node and press Enter.

Step 2: Develop the ADB Spark node

ADB Spark nodes support development in Java/Scala and Python.

Java/Scala development

  1. Prepare a sample JAR package.

    You can download the sample JAR package spark-examples_2.12-3.2.0.jar to develop and schedule the ADB Spark node.

  2. Upload the sample code spark-examples_2.12-3.2.0.jar to an OSS bucket that is in the same region as the AnalyticDB for MySQL cluster. For more information, see Upload files using the console.

  3. Configure the ADB Spark node.

    Language: Java/Scala

    • Main JAR Resource: The storage path of the JAR package in OSS. Example: oss://testBucketname/db_dome/spark-examples_2.12-3.2.0.jar.

    • Main Class: The name of the main class to run. Example: com.work.SparkWork.

    • Parameters: The parameters to pass to the code.

    • Configuration Items: The runtime parameters for the Spark application. For more information, see Spark application configuration parameters. Example: spark.driver.resourceSpec: medium. A fuller sketch follows this list.
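
    For illustration, the following is a minimal sketch of how several configuration items might be combined. The property names spark.driver.resourceSpec, spark.executor.resourceSpec, and spark.executor.instances are AnalyticDB for MySQL Spark application parameters (see Spark application configuration parameters); the values shown are placeholders, so adjust them to your workload.

    spark.driver.resourceSpec: medium
    spark.executor.resourceSpec: medium
    spark.executor.instances: 2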

Python development

  1. Prepare test data.

    Create a text file named data.txt for Spark to read, and add the following content to the file.

    Hello,Dataworks
    Hello,OSS
  2. Write sample code.

    Create a file named spark_oss.py. Add the following content to the spark_oss.py file.

    import sys
    
    from pyspark.sql import SparkSession
    
    # Initialize Spark.
    spark = SparkSession.builder.appName('OSS Example').getOrCreate()
    # Read the specified file. The file path is passed in as the first argument (sys.argv[1]).
    textFile = spark.sparkContext.textFile(sys.argv[1])
    # Calculate and print the number of lines in the file.
    print("File total lines: " + str(textFile.count()))
    # Print the first line of the file.
    print("First line is: " + textFile.first())
    
  3. Upload the test data data.txt and the sample code spark_oss.py to an OSS bucket that is in the same region as the AnalyticDB for MySQL cluster. For more information, see Upload files using the console.

  4. Configure the ADB Spark node.

    Language: Python

    • Main Package: The OSS path of the spark_oss.py file that you uploaded in Step 3 of this example. Example: oss://testBucketname/db_dome/spark_oss.py.

    • Parameters: The OSS path of the data.txt file that you uploaded in Step 3 of this example. Example: oss://testBucketname/db_dome/data.txt.

    • Configuration Items: The runtime parameters for your Spark application. For more information, see Spark application configuration parameters. Example: spark.driver.resourceSpec: medium.

Step 3: Configure and run the ADB Spark node

  1. In the right-side pane, click Debugging Configurations to configure runtime parameters for the ADB Spark node.

    • Computing Resource

      • Computing Resource: Select the attached AnalyticDB for Spark computing resource.

      • AnalyticDB Computing Resource Group: Select the Job resource group that you created in the AnalyticDB for MySQL cluster.

    • Resource Group

      • Resource Group: Select the DataWorks resource group that passed the connectivity test when you attached the AnalyticDB for Spark computing resource.

      • CUs For Computing: The node uses the default CU value. You do not need to modify this value.

    • Script Parameters

      • Parameter Name: The name of the parameter that you configured in the ADB Spark JAR job.

        Note

        The system automatically detects the parameter names configured in the node.

      • Parameter Value: The value of the parameter. At runtime, the parameter is dynamically replaced with its actual value.

  2. (Optional) To run a node task on a schedule, in the Scheduling pane on the right side of the node, configure the Computing Resource, AnalyticDB Computing Resource Group, and Resource Group parameters in the Scheduling Policies section. Then, configure the parameters in the Scheduling Parameters section.

  3. After you configure the debug settings, click the image icon to save the node. Then, click the image icon to test the script and verify that it runs as expected.

  4. After you complete the scheduling configuration, you can submit and publish the node to the production environment.

  5. A published auto triggered task runs periodically based on the configured parameters. You can view and manage the task on the Operation Center > Node O&M > Auto Triggered Node O&M > Auto Triggered Nodes page. For more information, see Introduction to Operation Center.