DataWorks provides ADB Spark nodes to develop, schedule, and integrate AnalyticDB Spark tasks with other task types. This topic describes how to use an ADB Spark node to develop tasks.
Background
ADB Spark is a compute engine in AnalyticDB for MySQL that runs large-scale Apache Spark data processing tasks. It supports real-time data analysis, complex queries, and machine learning applications, lets you develop in Java, Scala, or Python, and can automatically scale to optimize performance and reduce costs. To configure a task, you upload a JAR package or a .py file. ADB Spark is ideal for scenarios that require efficient processing of large datasets and real-time insights, and helps enterprises extract valuable information from data to drive business growth.
Prerequisites
The following prerequisites for AnalyticDB for MySQL are met:
An AnalyticDB for MySQL Basic Edition cluster is created in the same region as your DataWorks workspace. For more information, see Create a cluster.
A job resource group is configured in the AnalyticDB for MySQL cluster. For more information, see Create a job resource group.
Note: When you use DataWorks to develop Spark applications, you must create a job resource group.
If you use OSS for storage in an ADB Spark node, ensure that the OSS bucket is in the same region as the AnalyticDB for MySQL cluster.
The following prerequisites for DataWorks are met:
A workspace is created, the Use Data Studio (New Version) option is selected, and a resource group is attached to the workspace. For more information, see Create a workspace.
The resource group is attached to the same VPC as the AnalyticDB for MySQL cluster. An IP address whitelist is configured for the resource group in the AnalyticDB for MySQL cluster. For more information, see Configure a whitelist.
The AnalyticDB for MySQL cluster instance is added to DataWorks as a compute engine of the AnalyticDB for Spark type. The connectivity between the resource group and the compute engine is tested. For more information, see Attach a compute engine.
A workflow folder is created. For more information, see Create a workflow folder.
An ADB Spark node is created. For more information, see Create a node for a workflow.
Step 1: Develop the ADB Spark node
On an ADB Spark node, you can configure the node content based on the selected language. You can use the sample JAR package spark-examples_2.12-3.2.0.jar or the sample spark_oss.py file. For more information about developing node content, see Develop a Spark application using the spark-submit command-line tool.
Configure the ADB Spark node content (Java/Scala)
Prepare the file to run (JAR)
Upload the sample JAR package to OSS so that the ADB Spark node can reference and run the JAR package.
Prepare a sample JAR package.
Download the spark-examples_2.12-3.2.0.jar sample JAR package to use for the ADB Spark node.
Upload the sample code to OSS.
Log on to the OSS console. In the navigation pane on the left, click Buckets.
On the Buckets page, click Create Bucket. In the Create Bucket panel, create a bucket in the same region as the AnalyticDB for MySQL cluster.
Note: This topic uses a bucket named dw-1127 as an example.
Create an external storage folder.
After you create the bucket, click Go to Bucket. On the Objects page, click Create Directory to create an external storage folder for your database. Set Directory Name to db_home.
Upload the sample code file spark-examples_2.12-3.2.0.jar to the db_home folder. For more information, see Upload a file using the console.
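If you prefer to upload the file programmatically instead of in the console, the following minimal sketch uses the OSS Python SDK (oss2). The endpoint, the credential placeholders, and the local file path are assumptions for this example; the dw-1127 bucket and the db_home folder come from the steps above. Replace all placeholders with your own values.

```python
# A minimal sketch that uploads the sample JAR package with the OSS Python SDK (oss2).
# The endpoint, credentials, and local path below are placeholders, not values from this topic.
import oss2

auth = oss2.Auth('<your-access-key-id>', '<your-access-key-secret>')

# Use the OSS endpoint of the region in which the dw-1127 bucket resides.
bucket = oss2.Bucket(auth, 'https://oss-cn-hangzhou.aliyuncs.com', 'dw-1127')

# Upload the local JAR package to the db_home folder in the bucket.
bucket.put_object_from_file(
    'db_home/spark-examples_2.12-3.2.0.jar',
    '/path/to/spark-examples_2.12-3.2.0.jar',
)
```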
Configure the ADB Spark node
Configure the ADB Spark node content using the following parameters.
| Language | Parameter | Description |
| --- | --- | --- |
| Java/Scala | Main JAR Resource | The OSS storage path of the JAR package resource. Example: oss://dw-1127/db_home/spark-examples_2.12-3.2.0.jar. |
| | Main Class | The main class of the task in the compiled JAR package. For the sample JAR package, an example main class is org.apache.spark.examples.SparkPi. |
| | Parameters | The parameters that you want to pass to the code. You can also configure this parameter as a dynamic parameter and assign its value in the scheduling properties of the node. |
| | Configuration Items | The runtime configuration parameters of the Spark application. For more information, see Spark application configuration parameters. |
Configure the ADB Spark node content (Python)
Prepare the file to run (Python)
Upload the test data file and sample code to OSS. This allows the sample code in the node configuration to read the test data file.
Prepare test data.
Create a data.txt file and add the following content to the file:

```
Hello,Dataworks
Hello,OSS
```

Write sample code.
Create a spark_oss.py file and add the following content to the spark_oss.py file:

```python
import sys
from pyspark.sql import SparkSession

# Initialize Spark.
spark = SparkSession.builder.appName('OSS Example').getOrCreate()

# Read the specified file. The file path is specified by the value passed in through the arguments.
textFile = spark.sparkContext.textFile(sys.argv[1])

# Calculate and print the number of lines in the file.
print("File total lines: " + str(textFile.count()))

# Print the first line of the file.
print("First line is: " + textFile.first())
```

Upload the test data and sample code to OSS.
Log on to the OSS console. In the navigation pane on the left, click Buckets.
On the Buckets page, click Create Bucket. In the Create Bucket panel, create a bucket in the same region as the AnalyticDB for MySQL cluster.
Note: This topic uses a bucket named dw-1127 as an example.
Create an external storage folder.
After the bucket is created, click Go to Bucket. On the Objects page, click Create Directory to create an external storage folder for the database. Set Directory Name to db_home.
Upload the test data file data.txt and the sample code file spark_oss.py to the db_home folder. For more information, see Upload a file using the console.
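Optionally, before you configure the node, you can verify the logic of spark_oss.py on a machine that has PySpark installed. The following sketch is only a local test under that assumption: it uses a local Spark session and a hypothetical local copy of data.txt instead of the OSS path that the node passes in.

```python
# A minimal local test of the spark_oss.py logic, assuming PySpark is installed locally.
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').appName('OSS Example Local Test').getOrCreate()

# Simulate the argument that the node's Parameters field passes to the script.
# file:///tmp/data.txt is a hypothetical local path used only for this test.
input_path = sys.argv[1] if len(sys.argv) > 1 else 'file:///tmp/data.txt'

textFile = spark.sparkContext.textFile(input_path)
print("File total lines: " + str(textFile.count()))
print("First line is: " + textFile.first())

spark.stop()
```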
Configure the ADB Spark node
Configure the ADB Spark node content using the following parameters.
| Language | Parameter | Description |
| --- | --- | --- |
| Python | Main Package | The OSS storage location of the sample code file that you want to run. Example: oss://dw-1127/db_home/spark_oss.py. |
| | Parameters | The parameters that you want to pass. In this example, the value is the OSS storage location of the test data file to read. Example: oss://dw-1127/db_home/data.txt. |
| | Configuration Items | The runtime configuration parameters of the Spark application. For more information, see Spark application configuration parameters. |
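If your job needs more than one value from the Parameters field, the following sketch is an assumption-based extension of spark_oss.py that accepts an input path and an output path as separate arguments (for example, oss://dw-1127/db_home/data.txt and a hypothetical oss://dw-1127/db_home/output/) and writes a small summary back to OSS. The two-argument layout and the output path are illustrative assumptions, not part of the original sample.

```python
# A hedged extension of the spark_oss.py sample that takes two arguments,
# an input path and an output path, both supplied through the Parameters field.
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('OSS Example With Output').getOrCreate()

input_path = sys.argv[1]   # for example, oss://dw-1127/db_home/data.txt
output_path = sys.argv[2]  # hypothetical example: oss://dw-1127/db_home/output/

textFile = spark.sparkContext.textFile(input_path)

# Build a small summary and write it back to OSS as text files.
summary = spark.sparkContext.parallelize([
    "File total lines: " + str(textFile.count()),
    "First line is: " + textFile.first(),
])
summary.saveAsTextFile(output_path)

spark.stop()
```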
Step 2: Debug the ADB Spark node
Configure the debug properties for the ADB Spark node.
In the Run Configuration section on the right of the node, configure the Computing Resource, AnalyticDB Computing Resource Group, Resource Group, and CUs for Computing as follows.
| Parameter type | Parameter | Description |
| --- | --- | --- |
| Computing Resource | Computing Resource | Select the AnalyticDB for Spark compute engine that you attached. |
| | AnalyticDB Computing Resource Group | Select the job resource group that you created in the AnalyticDB for MySQL cluster. For more information, see Resource group overview. |
| Resource Group | Resource Group | Select the resource group that passed the connectivity test when you attached the AnalyticDB for Spark compute engine. |
| | CUs for Computing | The node uses the default CU value. You do not need to modify it. |
Debug and run the ADB Spark node.
To run the node task, click Save and then Run.
Step 3: Schedule the ADB Spark node
Configure the scheduling properties for the ADB Spark node.
To run a node task periodically, configure the following parameters in the Scheduling Policies section on the Scheduling tab, which is located on the right side of the node. For more information about parameter configuration, see Configure scheduling for a node.
| Parameter | Description |
| --- | --- |
| Compute Resource | Select the AnalyticDB for Spark compute engine that you attached. |
| AnalyticDB Computing Resource Group | Select the job resource group that you created in the AnalyticDB for MySQL cluster. For more information, see Resource group overview. |
| Resource Group | Select the resource group that passed the connectivity test when you attached the AnalyticDB for Spark compute engine. |
| CUs for Computing | The node uses the default CU value. You do not need to modify it. |
Publish the ADB Spark node.
After you configure the node task, you must publish the node. For more information, see Publish a node or workflow.
Next steps
After the task is published, you can view the running status of the auto-triggered task in the Operation Center. For more information, see Get started with Operation Center.