DataWorks: ADB Spark node

Last Updated: Feb 08, 2025

DataWorks provides ADB Spark nodes that you can use to develop and periodically schedule AnalyticDB Spark tasks and integrate them with other types of tasks. This topic describes how to use an ADB Spark node to develop tasks.

Background information

AnalyticDB for MySQL Spark is a compute engine designed for running large-scale Apache Spark data processing tasks in AnalyticDB for MySQL. It supports real-time data analysis, complex queries, and machine learning applications. AnalyticDB for MySQL Spark simplifies development in languages such as Java, Scala, and Python, and scales automatically to optimize performance and reduce costs. You can upload JAR packages or .py files to configure tasks. This makes the engine suitable for scenarios that require efficient processing of large amounts of data and real-time insights, and helps enterprises extract value from their data to drive business development.

Prerequisites

AnalyticDB for MySQL:

  • An AnalyticDB for MySQL Basic Edition cluster is created in the same region as the desired DataWorks workspace. For more information, see Create a cluster.

  • A job resource group is configured in the AnalyticDB for MySQL cluster. For more information, see Create and manage a resource group.

    Note
    • When you use DataWorks to develop Spark applications, you must create a job resource group.

    • If you want to use Object Storage Service (OSS) for storage in ADB Spark nodes, make sure that the OSS bucket resides in the same region as the AnalyticDB for MySQL cluster.

DataWorks:

  • A DataWorks workspace is created, and an AnalyticDB for Spark computing resource is associated with the workspace.

  • A resource group that passed the connectivity test and is associated with the AnalyticDB for Spark computing resource is available.

Step 1: Develop the ADB Spark node

On the configuration tab of the ADB Spark node, you can configure the node by using the sample JAR package spark-examples_2.12-3.2.0.jar or the sample code file spark_oss.py, based on the language that you use. For more information about node development, see Use spark-submit to develop Spark applications.

Configure the ADB Spark node in Java or Scala

Prepare a JAR package

You must upload the sample JAR package to OSS so that you can reference and run the package when you configure the node.

  1. Prepare a sample JAR package.

    You can download the sample JAR package for developing the ADB Spark node.

  2. Upload the sample code to OSS.

    1. Log on to the OSS console. In the left-side navigation pane, click Buckets.

    2. On the Buckets page, click Create Bucket. In the Create Bucket panel, set the Region parameter to the region in which the AnalyticDB for MySQL cluster resides. Then, configure other parameters to create a bucket.

      Note

      In this example, a bucket named dw-1127 is created.

    3. Create an external storage directory.

      After the bucket is created, click Go to Bucket. On the Objects page, click Create Directory to create an external storage directory named db_home.

    4. Upload the sample JAR package spark-examples_2.12-3.2.0.jar to the db_home directory. For more information, see Upload objects.
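
If you prefer to upload the package programmatically instead of through the OSS console, you can use the OSS Python SDK (oss2). The following is a minimal sketch, not part of the official procedure: the bucket name dw-1127, the db_home directory, and the object name come from this example, while the endpoint, the credentials, and the local file path are placeholders that you must replace.

    # upload_jar.py: a minimal sketch that uploads the sample JAR package to OSS
    # by using the oss2 SDK (pip install oss2). Credentials and endpoint are placeholders.
    import oss2

    auth = oss2.Auth('<your-access-key-id>', '<your-access-key-secret>')
    # Use the OSS endpoint of the region in which the AnalyticDB for MySQL cluster resides.
    bucket = oss2.Bucket(auth, 'https://oss-cn-hangzhou.aliyuncs.com', 'dw-1127')

    # Upload the local JAR package to the db_home directory of the bucket.
    bucket.put_object_from_file(
        'db_home/spark-examples_2.12-3.2.0.jar',  # object name in the bucket
        'spark-examples_2.12-3.2.0.jar',          # local path of the downloaded package
    )
    print('Uploaded to oss://dw-1127/db_home/spark-examples_2.12-3.2.0.jar')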

Configure the ADB Spark node

You can configure the following parameters for the ADB Spark node.

Language: Java or Scala

  • Main JAR Resource: The OSS path in which the JAR package is stored. Example: oss://dw-1127/db_home/spark-examples_2.12-3.2.0.jar.

  • Main Class: The main class of the task in the compiled JAR package. The main class in the sample code is org.apache.spark.examples.SparkPi.

  • Parameters: The parameters that you want to pass to the code. You can configure parameters as dynamic parameters in the ${var} format. In this example, ${var} is set to 1000.

  • Configuration Items: The Spark configuration parameters. For more information, see Spark application configuration parameters. Example: spark.driver.resourceSpec:medium.
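
For reference, the following PySpark sketch approximates what the sample's org.apache.spark.examples.SparkPi class computes: a Monte Carlo estimate of pi in which the argument (1000 in this example) sets the number of partitions. It assumes a local PySpark installation and only illustrates the sample logic; the node itself runs the JAR package on the AnalyticDB for MySQL cluster.

    # pi_sketch.py: an illustrative PySpark approximation of the SparkPi example.
    # Run locally with: python pi_sketch.py 1000
    import sys
    from operator import add
    from random import random

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master('local[*]').appName('PiSketch').getOrCreate()

    # The first command-line argument maps to the value in Parameters, for example 1000.
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def inside(_):
        # Sample a random point in the square [-1, 1] x [-1, 1] and
        # count it if it falls inside the unit circle.
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x * x + y * y <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(inside).reduce(add)
    print('Pi is roughly %f' % (4.0 * count / n))

    spark.stop()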

Configure the ADB Spark node in Python

Prepare a Python file

Perform the following operations to prepare a test data file and upload the file together with the sample code to OSS. This way, the node can run the sample code to read the test data file.

  1. Prepare test data.

    Create a text file named data.txt and add the following content to the file:

    Hello,Dataworks
    Hello,OSS
  2. Write sample code.

    Create a file named spark_oss.py and add the following content to the file (you can verify the script locally before you upload it, as shown in the sketch after this procedure):

    import sys

    from pyspark.sql import SparkSession

    # Initialize a Spark application.
    spark = SparkSession.builder.appName('OSS Example').getOrCreate()
    # Read the specified text file. The file path is passed as the first command-line argument.
    textFile = spark.sparkContext.textFile(sys.argv[1])
    # Count and display the number of lines in the text file.
    print("File total lines: " + str(textFile.count()))
    # Display the first line of the text file.
    print("First line is: " + textFile.first())
    # Release the Spark session after the job is done.
    spark.stop()
    
  3. Upload the test data file and sample code to OSS.

    1. Log on to the OSS console. In the left-side navigation pane, click Buckets.

    2. On the Buckets page, click Create Bucket. In the Create Bucket panel, set the Region parameter to the region in which the AnalyticDB for MySQL cluster resides. Then, configure other parameters to create a bucket.

      Note

      In this example, a bucket named dw-1127 is created.

    3. Create an external storage directory.

      After the bucket is created, click Go to Bucket. On the Objects page, click Create Directory to create an external storage directory named db_home.

    4. Upload the test data file data.txt and the sample code file spark_oss.py to the db_home directory. For more information, see Upload objects.
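
Before you upload spark_oss.py, you can verify its logic on your machine. The following is a minimal sketch that assumes PySpark is installed locally (for example, pip install pyspark) and that data.txt is in the current directory; it forces local execution instead of running on the AnalyticDB for MySQL cluster.

    # local_test.py: a minimal local smoke test for the logic in spark_oss.py.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master('local[*]')                  # run locally instead of on the cluster
        .appName('OSS Example local test')
        .getOrCreate()
    )

    textFile = spark.sparkContext.textFile('data.txt')    # local copy of the test data
    print('File total lines: ' + str(textFile.count()))   # expected output: 2
    print('First line is: ' + textFile.first())           # expected output: Hello,Dataworks

    spark.stop()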

Configure the ADB Spark node

You can configure the following parameters for the ADB Spark node.

Language: Python

  • Main Package: The OSS path in which the sample code file is stored. Example: oss://dw-1127/db_home/spark_oss.py.

  • Parameters: The parameters that you want to pass to the code. Example: oss://dw-1127/db_home/data.txt, which is the storage location of the test data file.

  • Configuration Items: The Spark configuration parameters. For more information, see Spark application configuration parameters. Example: spark.driver.resourceSpec:medium.
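
When the node runs, the value of Parameters is passed to the script as a command-line argument, which is why spark_oss.py reads sys.argv[1]. The following minimal sketch shows how a script might consume several such values; the second parameter here is hypothetical and only illustrates the ordering.

    import sys

    # sys.argv[0] is the script itself; the values configured in Parameters follow in order.
    input_path = sys.argv[1]    # e.g. oss://dw-1127/db_home/data.txt
    extra_args = sys.argv[2:]   # any additional, hypothetical parameters
    print('input_path =', input_path)
    print('extra_args =', extra_args)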

Step 2: Debug the ADB Spark node

  1. Configure debugging properties for the ADB Spark node.

    On the Debugging Configurations tab in the right-side navigation pane of the node configuration tab, configure the following parameters.

    Computing Resource

      • Computing Resource: Select the AnalyticDB for Spark computing resource that is associated with the workspace.

      • AnalyticDB Computing Resource Group: Select the job resource group that you created in the AnalyticDB for MySQL cluster. For more information, see Resource group overview.

    DataWorks Configurations

      • Resource Group: Select the resource group that passed the connectivity test and is associated with the AnalyticDB for Spark computing resource.

      • CUs For Computing: The number of compute units (CUs) that are used for computing. The current node uses the default number of CUs. You do not need to change the value.

  2. Debug the ADB Spark node.

    Save and run the node.

Step 3: Schedule the ADB Spark node

  1. Configure scheduling properties for the ADB Spark node.

    If you want the ADB Spark node to be run on a regular basis, configure the following parameters in the Scheduling Policies section of the Properties tab in the right-side navigation pane of the node configuration tab.

    • Computing Resource: Select the AnalyticDB for Spark computing resource that is associated with the workspace.

    • AnalyticDB Computing Resource Group: Select the job resource group that you created in the AnalyticDB for MySQL cluster. For more information, see Resource group overview.

    • Resource Group For Scheduling: Select the resource group that passed the connectivity test and is associated with the AnalyticDB for Spark computing resource.

    • CUs For Computing: The number of CUs that are used for computing. The current node uses the default number of CUs. You do not need to change the value.

  2. Deploy the ADB Spark node.

    After the node is configured, deploy the node.

What to do next

After you deploy the node, view the status of the node in Operation Center. For more information, see Getting started with Operation Center.