DataWorks provides ADB Spark nodes that you can use to develop and periodically schedule AnalyticDB Spark tasks and integrate AnalyticDB Spark tasks with other types of tasks. This topic describes how to use an ADB Spark node to develop tasks.
Background information
AnalyticDB for MySQL Spark is a compute engine designed for running large-scale Apache Spark data processing tasks in AnalyticDB. It supports real-time data analysis, complex queries, and machine learning applications. AnalyticDB for MySQL Spark simplifies development in languages such as Java, Scala, and Python, and automatically scales to optimize performance and reduce costs. You can upload JAR packages or .py files to configure tasks. This makes AnalyticDB for MySQL Spark suitable for industries that require efficient processing of large amounts of data and real-time insights, and helps enterprises obtain valuable information from data and promote business development.
Prerequisites
AnalyticDB for MySQL:
An AnalyticDB for MySQL Basic Edition cluster that resides in the same region as a desired DataWorks workspace is created. For more information, see Create a cluster.
A job resource group is configured in the AnalyticDB for MySQL cluster. For more information, see Create and manage a resource group.
Note: When you use DataWorks to develop Spark applications, you must create a job resource group.
If you want to use Object Storage Service (OSS) for storage in ADB Spark nodes, make sure that an OSS bucket that is created resides in the same region as the AnalyticDB for MySQL cluster.
DataWorks:
A workspace for which Participate in Public Preview of Data Studio is turned on is created, and a resource group is associated with the workspace. For more information, see Create a workspace.
The associated resource group is deployed in the same virtual private cloud (VPC) as the AnalyticDB for MySQL cluster, and a resource group IP address whitelist is configured in the AnalyticDB for MySQL cluster. For more information, see IP address whitelists.
The created AnalyticDB for MySQL cluster is added to DataWorks as a computing resource and passes the network connectivity test. For more information, see Associate a computing resource with a workspace (Participate in Public Preview of Data Studio turned on). The computing resource type is AnalyticDB for Spark.
A workspace directory is created.
An ADB Spark node is created.
Step 1: Develop the ADB Spark node
On the configuration tab of the ADB Spark node, configure the node by using the sample JAR package spark-examples_2.12-3.2.0.jar or the sample code file spark_oss.py, depending on the language that you use. For more information about node development, see Use spark-submit to develop Spark applications.
Configure the ADB Spark node in Java or Scala
Prepare a JAR package
You must upload a sample JAR package to OSS so that you can run the JAR package in node configuration.
Prepare a sample JAR package.
You can download the sample JAR package for developing the ADB Spark node.
Upload the sample code to OSS.
Log on to the OSS console. In the left-side navigation pane, click Buckets.
On the Buckets page, click Create Bucket. In the Create Bucket panel, set the Region parameter to the region in which the AnalyticDB for MySQL cluster resides. Then, configure other parameters to create a bucket.
Note: In this example, a bucket named dw-1127 is created.
Create an external storage directory.
After the bucket is created, click Go to Bucket. On the Objects page, click Create Directory to create an external storage directory named db_home.
Upload the sample JAR package spark-examples_2.12-3.2.0.jar to the db_home directory. For more information, see Upload objects.
Configure the ADB Spark node
Configure the parameters that are described in the following table for the ADB Spark node.

| Language | Parameter | Description |
| --- | --- | --- |
| Java or Scala | Main JAR Resource | The OSS path in which the JAR package is stored. Example: oss://dw-1127/db_home/spark-examples_2.12-3.2.0.jar. |
| Java or Scala | Main Class | The main class of the task in the compiled JAR package. For the sample JAR package, specify the main class of the Spark example that you want to run, such as org.apache.spark.examples.SparkPi. |
| Java or Scala | Parameters | The parameters that you want to configure in the code. The values are passed to the main class as arguments. You can also reference scheduling parameters in the code as dynamic parameters. |
| Java or Scala | Configuration Items | The Spark configuration parameters. For more information, see Spark application configuration parameters. |
Configure the ADB Spark node in Python
Prepare a Python file
Perform the following operations to prepare a test data file and sample code and upload them to OSS. This way, you can run the sample code in node configuration to read the test data file.
Prepare test data.
Create a text file named data.txt and add the following content to the file:

```
Hello,Dataworks
Hello,OSS
```
Write sample code.
Create a file named spark_oss.py and add the following content to the file:

```python
import sys

from pyspark.sql import SparkSession

# Initialize a Spark application.
spark = SparkSession.builder.appName('OSS Example').getOrCreate()

# Read the specified text file. The file path is specified by the args parameter.
textFile = spark.sparkContext.textFile(sys.argv[1])

# Count and display the number of lines in the text file.
print("File total lines: " + str(textFile.count()))

# Display the first line of the text file.
print("First line is: " + textFile.first())
```
Upload the test data file and sample code to OSS.
Log on to the OSS console. In the left-side navigation pane, click Buckets.
On the Buckets page, click Create Bucket. In the Create Bucket panel, set the Region parameter to the region in which the AnalyticDB for MySQL cluster resides. Then, configure other parameters to create a bucket.
Note: In this example, a bucket named dw-1127 is created.
Create an external storage directory.
After the bucket is created, click Go to Bucket. On the Objects page, click Create Directory to create an external storage directory named db_home.
Upload the test data file data.txt and the sample code file spark_oss.py to the db_home directory. For more information, see Upload objects.
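To confirm that both files are in place, you can list the objects under the db_home directory with the OSS Python SDK (oss2), as shown in the following sketch. The endpoint and AccessKey pair are placeholders that you must replace with your own values.

```python
# Hedged sketch: list the objects under db_home/ to confirm the uploads.
import oss2

auth = oss2.Auth("<your-access-key-id>", "<your-access-key-secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "dw-1127")  # example endpoint

for obj in oss2.ObjectIterator(bucket, prefix="db_home/"):
    print(obj.key)  # Expected to include db_home/data.txt and db_home/spark_oss.py
```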
Configure the ADB Spark node
Configure the parameters that are described in the following table for the ADB Spark node.

| Language | Parameter | Description |
| --- | --- | --- |
| Python | Main Package | The OSS path in which the sample code file is stored. Example: oss://dw-1127/db_home/spark_oss.py. |
| Python | Parameters | The parameters that you want to configure in the code. In this example, specify the OSS path of the test data file data.txt, such as oss://dw-1127/db_home/data.txt. |
| Python | Configuration Items | The Spark configuration parameters. For more information, see Spark application configuration parameters. |
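The following sketch illustrates how the node configuration typically reaches the job at run time: values in the Parameters field are passed to the script as command-line arguments (sys.argv), and values set in Configuration Items can be read back from the Spark configuration. The parameter value and configuration key shown here are assumptions for illustration.

```python
# Hedged sketch: read the node's Parameters and Configuration Items values at run time.
import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumes the Parameters field contains the OSS path of data.txt.
input_path = sys.argv[1]  # For example: oss://dw-1127/db_home/data.txt

# Assumes spark.executor.memory was set in Configuration Items; falls back to "not set".
executor_memory = spark.sparkContext.getConf().get("spark.executor.memory", "not set")

print("Input path: " + input_path)
print("Executor memory: " + executor_memory)
```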
Step 2: Debug the ADB Spark node
Configure debugging properties for the ADB Spark node.
On the Debugging Configurations tab in the right-side navigation pane of the configuration tab of the ADB Spark node, configure the parameters that are described in the following table.
| Section | Parameter | Description |
| --- | --- | --- |
| Computing Resource | Computing Resource | Select the AnalyticDB for Spark computing resource that is associated with the workspace. |
| Computing Resource | AnalyticDB Computing Resource Group | Select the job resource group that you created in the AnalyticDB for MySQL cluster. For more information, see Resource group overview. |
| DataWorks Configurations | Resource Group | Select the resource group that passed the connectivity test and is associated with the AnalyticDB for Spark computing resource. |
| DataWorks Configurations | CUs For Computing | The number of compute units (CUs) that are used for computing. The current node uses the default number of CUs. You do not need to change the value. |
Debug the ADB Spark node.
Save and run the node.
Step 3: Schedule the ADB Spark node
Configure scheduling properties for the ADB Spark node.
If you want the ADB Spark node to be run on a regular basis, configure the parameters that are described in the following table in the Scheduling Policies section of the Properties tab in the right-side navigation pane of the configuration tab of the ADB Spark node.
| Parameter | Description |
| --- | --- |
| Computing Resource | Select the AnalyticDB for Spark computing resource that is associated with the workspace. |
| AnalyticDB Computing Resource Group | Select the job resource group that you created in the AnalyticDB for MySQL cluster. For more information, see Resource group overview. |
| Resource Group For Scheduling | Select the resource group that passed the connectivity test and is associated with the AnalyticDB for Spark computing resource. |
| CUs For Computing | The number of CUs that are used for computing. The current node uses the default number of CUs. You do not need to change the value. |
Deploy the ADB Spark node.
After the node is configured, deploy the node.
What to do next
After you deploy the node, view the status of the node in Operation Center. For more information, see Getting started with Operation Center.