AnalyticDB: Use DataWorks to schedule Spark jobs

Last Updated: Jan 20, 2025

DataWorks is an end-to-end big data development and governance platform that supports various compute engines, such as AnalyticDB. The Data Studio module of DataWorks provides visualized workflow development, O&M, and fully managed job scheduling based on time and dependencies. In DataWorks, you can use AnalyticDB Spark SQL nodes to develop and schedule Spark SQL jobs and AnalyticDB Spark nodes to develop and schedule Spark JAR jobs.

Prerequisites

  • An AnalyticDB for MySQL Data Lakehouse Edition cluster is created.

  • An Object Storage Service (OSS) bucket is created in the same region as the AnalyticDB for MySQL cluster.

  • A DataWorks workspace is created in the same region as the AnalyticDB for MySQL cluster.

  • A resource group is created for the AnalyticDB for MySQL cluster. For more information, see Create and manage a resource group.

    • To develop Spark SQL jobs, an interactive resource group that uses the Spark engine is required for the AnalyticDB for MySQL cluster.

    • To develop Spark JAR jobs, a job resource group is required for the AnalyticDB for MySQL cluster.

  • Participate in Public Preview of DataStudio of New Version is turned on for the DataWorks workspace. A serverless resource group is associated with the DataWorks workspace. For more information, see Create and use a serverless resource group.

    Note

    You can turn on Participate in Public Preview of DataStudio of New Version when you create a DataWorks workspace. To turn on Participate in Public Preview of DataStudio of New Version for an existing workspace, submit a ticket.

  • An AnalyticDB for Spark computing resource is created for the DataWorks workspace. For more information, see Associate a computing resource with a workspace (Participate in Public Preview of Data Studio turned on).

  • The serverless resource group is associated with the virtual private cloud (VPC) of the AnalyticDB for MySQL cluster. For more information, see Network connectivity solutions.

  • The CIDR block of the vSwitch that is associated with the serverless resource group is added to an IP address whitelist of the AnalyticDB for MySQL cluster. For more information, see IP address whitelists.

Use DataWorks to schedule Spark SQL jobs

AnalyticDB for MySQL allows you to develop jobs on internal and external tables. This section provides an example of how to use DataWorks to develop and schedule Spark SQL jobs on an external table.

Step 1: Create an AnalyticDB Spark SQL node

  1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

  2. Click the + icon next to Workspace Directories and choose Create Node > ADB > ADB Spark SQL.

  3. In the dialog box that appears, enter a node name and press Enter.

Step 2: Develop the AnalyticDB Spark SQL node

  1. In the AnalyticDB Spark SQL node, execute the following statement to create an external database. For information about how to create an internal table, see Use Spark SQL to create an internal table.

    CREATE DATABASE IF NOT EXISTS `adb_spark_db` LOCATION 'oss://testBucketname/db_dome';
  2. Create external tables named adb_spark_db.tb_order and adb_spark_db.tb_order_result on the AnalyticDB Spark SQL node.

    CREATE TABLE IF NOT EXISTS adb_spark_db.tb_order(id int, name string, age int) 
    USING parquet 
    LOCATION 'oss://testBucketname/db_dome/tb1' 
    TBLPROPERTIES ('parquet.compress'='SNAPPY');
    
    CREATE TABLE IF NOT EXISTS adb_spark_db.tb_order_result(id int, name string, age int) 
    USING parquet 
    LOCATION 'oss://testBucketname/db_dome/tb2' 
    TBLPROPERTIES ('parquet.compress'='SNAPPY');
  3. Import the data of the adb_spark_db.tb_order table to the adb_spark_db.tb_order_result table.

    INSERT INTO adb_spark_db.tb_order_result SELECT * FROM adb_spark_db.tb_order;

Step 3: Configure and run the AnalyticDB Spark SQL node

  1. On the right side of the page, click Debugging Configurations and configure the following parameters of the AnalyticDB Spark SQL node.

    • Computing Resource

      • Computing Resource: The AnalyticDB for Spark computing resource that is associated with the workspace.

      • AnalyticDB Computing Resource Group: The interactive resource group that you created in the AnalyticDB for MySQL cluster.

    • DataWorks Configurations

      • Resource Group: The serverless resource group that passed the connectivity test and is associated with the AnalyticDB for Spark computing resource.

      • CUs for Computing: The number of compute units (CUs) that are used for computing. The current node uses the default number of CUs. You do not need to change the value.

    • Script Parameters

      • Parameter Name: The name of a parameter that you specified for the AnalyticDB Spark SQL node. For example, you can specify the $[yyyymmdd] parameter in the script to perform batch synchronization on data that is updated daily. For information about the supported parameters and formats, see Configure scheduling parameters. A sketch of a parameterized script is provided after this procedure. Note: The system automatically displays the names of the parameters specified for the node.

      • Parameter Value: The value of the parameter. When the node is run, the system dynamically replaces the parameter value with the actual value.

  2. (Optional) If you want to run the node on a regular basis, click Properties on the right side of the page. Click the Scheduling Policies tab and configure the Computing Resource, AnalyticDB Computing Resource Group, and Resource Group for Scheduling parameters. Click the Scheduling Parameters tab and configure the relevant parameters.

  3. After the debugging configurations are complete, click the Save icon to save the SQL node. Then, click the Run icon to run the SQL script and check whether the results meet your expectations.

  4. If the SQL script runs as expected, release the SQL node to the production environment.

  5. Go to the Operation Center page. In the left-side navigation pane, choose Auto Triggered Node O&M > Auto Triggered Nodes. Check and manage the released node. For more information, see Getting started with Operation Center.
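
The following statement is a minimal sketch of how a script parameter can be referenced in the node code. It assumes a script parameter named bizdate that is assigned a scheduling value such as $[yyyymmdd], and a hypothetical dt column that identifies the daily partition; neither is part of the example tables created in Step 2.

    -- Minimal sketch: bizdate is an assumed script parameter (for example, assigned $[yyyymmdd]),
    -- and dt is a hypothetical column that does not exist in the example tables.
    INSERT INTO adb_spark_db.tb_order_result
    SELECT id, name, age
    FROM adb_spark_db.tb_order
    WHERE dt = '${bizdate}';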

Use DataWorks to schedule Spark JAR jobs

Step 1: Create an AnalyticDB Spark node

  1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

  2. Click the + icon next to Workspace Directories and choose Create Node > ADB > ADB Spark.

  3. In the dialog box that appears, enter a node name and press Enter.

Step 2: Develop the AnalyticDB Spark node

AnalyticDB Spark nodes can be developed in Java, Scala, or Python.

Development in Java or Scala

  1. Prepare a sample JAR package.

    You can download the spark-examples_2.12-3.2.0.jar package for developing and scheduling the AnalyticDB Spark node.

  2. Upload the spark-examples_2.12-3.2.0.jar package to the OSS bucket that resides in the same region as the AnalyticDB for MySQL cluster. For more information, see Upload objects.

  3. Configure the AnalyticDB Spark node. For development in Java or Scala, configure the following parameters:

    • Main JAR Resource: The OSS URL of the JAR package. Example: oss://testBucketname/db_dome/spark-examples_2.12-3.2.0.jar.

    • Main Class: The name of the main class that you want to run. Example: com.work.SparkWork. For a minimal sketch of such a main class, see the example after this list.

    • Parameters: The arguments that are passed to the main class.

    • Configuration Items: The Spark configuration parameters. For more information, see Spark application configuration parameters. Example: spark.driver.resourceSpec: medium.
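
The main class com.work.SparkWork in the preceding example is only a placeholder name. The following Scala code is a minimal, hypothetical sketch of what such a main class might look like; it is not the content of the spark-examples_2.12-3.2.0.jar package. The class reads the text file whose path is passed as the first argument (the Parameters field) and prints its line count, similar to the Python example in the next section.

    package com.work
    
    import org.apache.spark.sql.SparkSession
    
    // Hypothetical main class for illustration. Package it into a JAR, upload the JAR to OSS,
    // set Main JAR Resource to the OSS URL of the JAR, and set Main Class to com.work.SparkWork.
    object SparkWork {
      def main(args: Array[String]): Unit = {
        // Initialize a Spark application.
        val spark = SparkSession.builder().appName("OSS Example").getOrCreate()
        // Read the text file whose path is passed as the first argument (the Parameters field).
        val textFile = spark.sparkContext.textFile(args(0))
        // Count and display the number of lines in the text file.
        println(s"File total lines: ${textFile.count()}")
        // Display the first line of the text file.
        println(s"First line is: ${textFile.first()}")
        // Release resources.
        spark.stop()
      }
    }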

Development in Python

  1. Prepare test data.

    Create a text file named data.txt for Spark to read and add the following content to the file:

    Hello,Dataworks
    Hello,OSS
  2. Write sample code.

    Create a file named spark_oss.py and add the following content to the file:

    import sys
    
    from pyspark.sql import SparkSession
    
    # Initialize a Spark application.
    spark = SparkSession.builder.appName('OSS Example').getOrCreate()
    # Read the specified text file. The file path is specified by the args parameter.
    textFile = spark.sparkContext.textFile(sys.argv[1])
    # Count and display the number of lines in the text file.
    print("File total lines: " + str(textFile.count()))
    # Display the first line of the text file.
    print("First line is: " + textFile.first())
    
  3. Upload the data.txt and spark_oss.py files to the OSS bucket that resides in the same region as the AnalyticDB for MySQL cluster. For more information, see Upload objects.

  4. Configure the AnalyticDB Spark node. For development in Python, configure the following parameters:

    • Main Package: The OSS URL of the spark_oss.py file. Example: oss://testBucketname/db_dome/spark_oss.py.

    • Parameters: The OSS URL of the data.txt file, which is passed to the script as sys.argv[1]. Example: oss://testBucketname/db_dome/data.txt.

    • Configuration Items: The Spark configuration parameters. For more information, see Spark application configuration parameters. Example: spark.driver.resourceSpec: medium.

Step 3: Configure and run the AnalyticDB Spark node

  1. On the right side of the page, click Debugging Configurations and configure the following parameters of the AnalyticDB Spark node.

    • Computing Resource

      • Computing Resource: The AnalyticDB for Spark computing resource that is associated with the workspace.

      • AnalyticDB Computing Resource Group: The job resource group that you created in the AnalyticDB for MySQL cluster.

    • DataWorks Configurations

      • Resource Group: The serverless resource group that passed the connectivity test and is associated with the AnalyticDB for Spark computing resource.

      • CUs for Computing: The number of CUs that are used for computing. The current node uses the default number of CUs. You do not need to change the value.

    • Script Parameters

      • Parameter Name: The name of a parameter that you specified for the AnalyticDB Spark node. Note: The system automatically displays the names of the parameters specified for the node.

      • Parameter Value: The value of the parameter. When the node is run, the system dynamically replaces the parameter value with the actual value.

  2. (Optional) If you want to run the node on a regular basis, click Properties on the right side of the page. Click the Scheduling Policies tab and configure the Computing Resource, AnalyticDB Computing Resource Group, and Resource Group for Scheduling parameters. Click the Scheduling Parameters tab and configure the relevant parameters.

  3. After the debugging configurations are complete, click the Save icon to save the node. Then, click the Run icon to run the node and check whether the results meet your expectations.

  4. If the node runs as expected, release it to the production environment.

  5. Go to the Operation Center page. In the left-side navigation pane, choose Auto Triggered Node O&M > Auto Triggered Nodes. Check and manage the released node. For more information, see Getting started with Operation Center.