DataWorks is an end-to-end big data development and governance platform that supports various compute engines, such as AnalyticDB. The DataStudio module of DataWorks supports visualized workflow development, job scheduling, and O&M, and provides fully managed task scheduling based on time and dependencies. You can use AnalyticDB Spark SQL nodes and AnalyticDB Spark nodes to develop and schedule Spark SQL jobs and Spark JAR jobs in DataWorks.
Prerequisites
An AnalyticDB for MySQL Data Lakehouse Edition cluster is created.
An Object Storage Service (OSS) bucket is created in the same region as the AnalyticDB for MySQL cluster.
A DataWorks workspace is created in the same region as the AnalyticDB for MySQL cluster.
A resource group is created for the AnalyticDB for MySQL cluster. For more information, see Create and manage a resource group.
To develop Spark SQL jobs, an interactive resource group that uses the Spark engine must be created for the AnalyticDB for MySQL cluster.
To develop Spark JAR jobs, a job resource group must be created for the AnalyticDB for MySQL cluster.
Participate in Public Preview of DataStudio of New Version is turned on for the DataWorks workspace. A serverless resource group is associated with the DataWorks workspace. For more information, see Create and use a serverless resource group.
Note: You can turn on Participate in Public Preview of DataStudio of New Version when you create a DataWorks workspace. To turn on Participate in Public Preview of DataStudio of New Version for an existing workspace, submit a ticket.
An AnalyticDB for Spark computing resource is created for the DataWorks workspace. For more information, see Associate a computing resource with a workspace (Participate in Public Preview of Data Studio turned on).
The serverless resource group is associated with the virtual private cloud (VPC) of the AnalyticDB for MySQL cluster. For more information, see Network connectivity solutions.
The CIDR block of the vSwitch that is associated with the serverless resource group is added to an IP address whitelist of the AnalyticDB for MySQL cluster. For more information, see IP address whitelists.
Use DataWorks to schedule Spark SQL jobs
AnalyticDB for MySQL allows you to develop jobs on internal and external tables. This section provides an example of how to use DataWorks to develop and schedule Spark SQL jobs on an external table.
Step 1: Create an AnalyticDB Spark SQL node
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select the desired region. Find the desired workspace and go to Data Studio from the Actions column.
Click the create icon next to Workspace Directories and choose the option to create an AnalyticDB Spark SQL node.
In the dialog box that appears, enter a node name and press Enter.
Step 2: Develop the AnalyticDB Spark SQL node
Create an external database on the AnalyticDB Spark SQL node. For information about how to create an internal table, see Use Spark SQL to create an internal table.
CREATE DATABASE IF NOT EXISTS `adb_spark_db` LOCATION 'oss://testBucketname/db_dome';
Create external tables named adb_spark_db.tb_order and adb_spark_db.tb_order_result on the AnalyticDB Spark SQL node.
CREATE TABLE IF NOT EXISTS adb_spark_db.tb_order(id int, name string, age int) USING parquet LOCATION 'oss://testBucketname/db_dome/tb1' TBLPROPERTIES ('parquet.compress'='SNAPPY');
CREATE TABLE IF NOT EXISTS adb_spark_db.tb_order_result(id int, name string, age int) USING parquet LOCATION 'oss://testBucketname/db_dome/tb2' TBLPROPERTIES ('parquet.compress'='SNAPPY');
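The two tables are created empty. If you want the import step that follows to move actual data, you can first add a few sample rows to the source table. This is an optional, minimal sketch; the values are made up for illustration.
-- Optional: insert illustrative sample rows into the source table.
INSERT INTO adb_spark_db.tb_order VALUES (1, 'Tom', 18), (2, 'Jerry', 20), (3, 'Spike', 25);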
Import the data of the adb_spark_db.tb_order table to the adb_spark_db.tb_order_result table.
INSERT INTO adb_spark_db.tb_order_result SELECT * FROM adb_spark_db.tb_order;
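To check whether the import produced the expected result, you can compare the row counts of the two tables. This is a minimal sketch that uses the tables created in the preceding steps.
-- Both queries should return the same count after the import completes.
SELECT COUNT(*) FROM adb_spark_db.tb_order;
SELECT COUNT(*) FROM adb_spark_db.tb_order_result;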
Step 3: Configure and run the AnalyticDB Spark SQL node
On the right side of the page, click Debugging Configurations and configure the AnalyticDB Spark SQL node parameters that are described in the following table.
Parameter type | Parameter name | Description
Computing Resource | Computing Resource | The AnalyticDB for Spark computing resource that is associated with the workspace.
Computing Resource | AnalyticDB Computing Resource Group | The interactive resource group that you created in the AnalyticDB for MySQL cluster.
DataWorks Configurations | Resource Group | The serverless resource group that passed the connectivity test and is associated with the AnalyticDB for Spark computing resource.
DataWorks Configurations | CUs for Computing | The number of compute units (CUs) that are used for computing. The current node uses the default number of CUs. You do not need to change the value.
Script Parameters | Parameter Name | The name of the parameter that you specified for the AnalyticDB Spark SQL node. For example, you can specify the $[yyyymmdd] parameter in the script to perform batch synchronization on data that is updated daily (see the sketch after this table). For information about the supported parameters and formats, see Configure scheduling parameters. Note: The system automatically displays the names of the parameters specified for the node.
Script Parameters | Parameter Value | The value of the parameter. When a node is run, the system dynamically replaces the parameter value with the actual value.
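The following is a minimal sketch of how a script parameter might be used in the SQL script. It assumes that you define a script parameter named bizdate with the value $[yyyymmdd] and reference it as ${bizdate}, which is the common DataWorks convention for script parameters; the tables and the dt date column are hypothetical and are not part of the preceding example.
-- Hypothetical sketch: bizdate is a script parameter set to $[yyyymmdd];
-- tb_order_daily and tb_order_result_daily are hypothetical tables that contain a date column named dt.
INSERT INTO adb_spark_db.tb_order_result_daily
SELECT * FROM adb_spark_db.tb_order_daily
WHERE dt = '${bizdate}';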
(Optional) If you want to run the node on a regular basis, click Properties on the right side of the page. Click the Scheduling Policies tab and configure the Computing Resource, AnalyticDB Computing Resource Group, and Resource Group for Scheduling parameters. Click the Scheduling Parameters tab and configure the relevant parameters.
After the debugging configurations are complete, click the Save icon to save the SQL node. Then, click the Run icon to run the SQL script and check whether the results meet expectations.
If the SQL script runs as expected, release the SQL node to the production environment.
Go to the Operation Center page. In the left-side navigation pane, choose Auto Triggered Node O&M > Auto Triggered Nodes. Check and manage the released node. For more information, see Getting started with Operation Center.
Use DataWorks to schedule Spark JAR jobs
Step 1: Create an AnalyticDB Spark node
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select the desired region. Find the desired workspace and go to Data Studio from the Actions column.
Click the create icon next to Workspace Directories and choose the option to create an AnalyticDB Spark node.
In the dialog box that appears, enter a node name and press Enter.
Step 2: Develop the AnalyticDB Spark node
AnalyticDB Spark nodes can be developed in Java, Scala, or Python.
Development in Java or Scala
Prepare a sample JAR package.
You can download the spark-examples_2.12-3.2.0.jar package for developing and scheduling the AnalyticDB Spark node.
Upload the spark-examples_2.12-3.2.0.jar package to the OSS bucket that resides in the same region as the AnalyticDB for MySQL cluster. For more information, see Upload objects.
Configure the AnalyticDB Spark node.
Language | Parameter name | Description
Java or Scala | Main JAR Resource | The OSS URL of the JAR package. Example: oss://testBucketname/db_dome/spark-examples_2.12-3.2.0.jar.
Java or Scala | Main Class | The name of the main class that you want to run. Example: com.work.SparkWork.
Java or Scala | Parameters | The parameters that you want to specify in the code.
Java or Scala | Configuration Items | The Spark configuration parameters. For more information, see Spark application configuration parameters. Example: spark.driver.resourceSpec: medium.
Development in Python
Prepare test data.
Create a text file named data.txt for Spark to read. Add the following content to the file:
Hello,Dataworks
Hello,OSS
Write sample code.
Create a file named spark_oss.py and add the following content to the file:
import sys
from pyspark.sql import SparkSession

# Initialize a Spark application.
spark = SparkSession.builder.appName('OSS Example').getOrCreate()

# Read the specified text file. The file path is specified by the args parameter.
textFile = spark.sparkContext.textFile(sys.argv[1])

# Count and display the number of lines in the text file.
print("File total lines: " + str(textFile.count()))

# Display the first line of the text file.
print("First line is: " + textFile.first())
Upload the data.txt and spark_oss.py files to the OSS bucket that resides in the same region as the AnalyticDB for MySQL cluster. For more information, see Upload objects.
Configure the AnalyticDB Spark node.
Language | Parameter name | Description
Python | Main Package | The OSS URL of the spark_oss.py file. Example: oss://testBucketname/db_dome/spark_oss.py.
Python | Parameters | The OSS URL of the data.txt file. Example: oss://testBucketname/db_dome/data.txt.
Python | Configuration Items | The Spark configuration parameters. For more information, see Spark application configuration parameters. Example: spark.driver.resourceSpec: medium.
Step 3: Configure and run the AnalyticDB Spark node
On the right side of the page, click Debugging Configurations and configure the AnalyticDB Spark node parameters that are described in the following table.
Parameter type | Parameter name | Description
Computing Resource | Computing Resource | The AnalyticDB for Spark computing resource that is associated with the workspace.
Computing Resource | AnalyticDB Computing Resource Group | The job resource group that you created in the AnalyticDB for MySQL cluster.
DataWorks Configurations | Resource Group | The serverless resource group that passed the connectivity test and is associated with the AnalyticDB for Spark computing resource.
DataWorks Configurations | CUs for Computing | The number of CUs that are used for computing. The current node uses the default number of CUs. You do not need to change the value.
Script Parameters | Parameter Name | The name of the parameter that you specified for the AnalyticDB Spark node. Note: The system automatically displays the names of the parameters specified for the node.
Script Parameters | Parameter Value | The value of the parameter. When a node is run, the system dynamically replaces the parameter value with the actual value.
(Optional) If you want to run the node on a regular basis, click Properties on the right side of the page. Click the Scheduling Policies tab and configure the Computing Resource, AnalyticDB Computing Resource Group, and Resource Group for Scheduling parameters. Click the Scheduling Parameters tab and configure the relevant parameters.
After the debugging configurations are complete, click the Save icon to save the node. Then, click the Run icon to run the node and check whether the results meet expectations.
If the node runs as expected, release it to the production environment.
Go to the Operation Center page. In the left-side navigation pane, choose Auto Triggered Node O&M > Auto Triggered Nodes. Check and manage the released node. For more information, see Getting started with Operation Center.