DataWorks provides end-to-end capabilities for big data development and administration and supports multiple compute engines, including AnalyticDB for MySQL. The DataWorks Data Development (DataStudio) module supports visual workflow development, managed scheduling, and operations and maintenance (O&M). This lets you easily manage and schedule tasks based on time and dependencies. You can use ADB Spark SQL and ADB Spark nodes in DataWorks to develop and schedule Spark jobs, such as Spark SQL jobs and Spark application jobs.
Prerequisites
Your AnalyticDB for MySQL cluster must meet the following requirements:
An AnalyticDB for MySQL Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster is created.
An Object Storage Service (OSS) bucket is created in the same region as the AnalyticDB for MySQL cluster.
A database account is created for the AnalyticDB for MySQL cluster.
If you use an Alibaba Cloud account, you only need to create a privileged account.
If you use a Resource Access Management (RAM) user, you must create a privileged account and a standard account and associate the standard account with the RAM user.
A resource group is created for the AnalyticDB for MySQL cluster. For more information, see Create and manage resource groups.
To develop Spark SQL jobs, create an Interactive resource group in AnalyticDB for MySQL. The engine type for the Interactive resource group must be Spark.
To develop Spark JAR or PySpark jobs, create a Job resource group in AnalyticDB for MySQL.
AnalyticDB for MySQL is authorized to assume the AliyunADBSparkProcessingDataRole role to access other cloud resources.
The log storage path of Spark applications is configured.
Note: Log on to the AnalyticDB for MySQL console. Find the cluster that you want to manage and click the cluster ID. In the left-side navigation pane, choose . Click Log Settings. In the dialog box that appears, select the default path or specify a custom storage path. You cannot set the custom storage path to the root directory of OSS. Make sure that the custom storage path contains at least one layer of folders.
Your DataWorks workspace must meet the following requirements:
The AnalyticDB for MySQL cluster and the DataWorks workspace are in the same region.
A DataWorks workspace is created and uses the new Data Studio.
A resource group is created and attached. For more information, see Create and attach a resource group.
Note: When you create the resource group, make sure that its VPC is the same as the VPC of the AnalyticDB for MySQL cluster.
The IPv4 CIDR block of the vSwitch for the resource group to which the DataWorks workspace is attached has been added to the whitelist of the AnalyticDB for MySQL cluster. For more information, see Setting a Whitelist.
An AnalyticDB for Spark computing resource is attached to your DataWorks workspace. For more information, see Attach an AnalyticDB for Spark computing resource.
Schedule Spark SQL jobs in DataWorks
In AnalyticDB for MySQL, you can develop jobs for both external and internal tables. This topic uses an external table job as an example to describe the steps for developing and scheduling a Spark SQL job in DataWorks.
Step 1: Create an ADB Spark SQL node
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose in the Actions column.
Click the icon next to Project Folder and select the option to create an ADB Spark SQL node. In the dialog box that appears, enter a name for the node and press Enter.
Step 2: Develop the ADB Spark SQL node
This section provides an example of how to develop an external table job in an ADB Spark SQL node. For information about how to create an internal table, see Create an internal table using Spark SQL.

Create an external database in the ADB Spark SQL node.

CREATE DATABASE IF NOT EXISTS `adb_spark_db` LOCATION 'oss://testBucketname/db_dome';

Create an external table adb_spark_db.tb_order in the ADB Spark SQL node.

CREATE TABLE IF NOT EXISTS adb_spark_db.tb_order(id int, name string, age int) USING parquet LOCATION 'oss://testBucketname/db_dome/tb1' TBLPROPERTIES ('parquet.compress'='SNAPPY');

Query data. After the external table is created, you can use a SELECT statement in AnalyticDB for MySQL to query the Parquet data.

SELECT * FROM adb_spark_db.tb_order LIMIT 100;

Create a Delta Lake table adb_spark_db.raw_order in the ADB Spark SQL node.

CREATE TABLE IF NOT EXISTS adb_spark_db.raw_order(id int, name string, age int) USING delta;

Import data from adb_spark_db.tb_order to adb_spark_db.raw_order.

INSERT INTO adb_spark_db.raw_order SELECT * FROM adb_spark_db.tb_order;

Create a database in AnalyticDB for MySQL. If a database is already created, you can skip this step. The following is an example:

CREATE DATABASE adb_demo;

Create an internal table in AnalyticDB for MySQL to store the data imported from the Delta Lake table. The following is an example:

CREATE TABLE adb_demo.order_xuanwu_format (`id` int, `name` string, `age` int) USING adb TBLPROPERTIES ('distributeType'='HASH', 'distributeColumns'='id', 'storagePolicy'='hot');

Import the data from the Delta Lake table adb_spark_db.raw_order to adb_demo.order_xuanwu_format.

INSERT OVERWRITE adb_demo.order_xuanwu_format SELECT * FROM adb_spark_db.raw_order;
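As an optional sanity check that is not part of the original walkthrough, you can compare row counts between the source external table and the internal table after the import. The counts should match because the data flows from adb_spark_db.tb_order through adb_spark_db.raw_order into adb_demo.order_xuanwu_format.

-- Both queries should return the same number of rows if the import succeeded.
SELECT COUNT(*) FROM adb_spark_db.tb_order;
SELECT COUNT(*) FROM adb_demo.order_xuanwu_format;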
Step 3: Configure and run the ADB Spark SQL node
In the right-side pane, click Debugging Configurations to configure runtime parameters for the ADB Spark SQL node.
| Parameter Type | Parameter | Description |
| --- | --- | --- |
| Computing Resource | Computing Resource | Select the attached AnalyticDB for Spark computing resource. |
| Computing Resource | AnalyticDB Computing Resource Group | Select the Interactive resource group with the Spark engine that you created in the AnalyticDB for MySQL cluster. |
| Resource Group | Resource Group | Select the DataWorks resource group that passed the connectivity test when you attached the AnalyticDB for Spark computing resource. |
| Resource Group | CUs For Computing | The node uses the default CU value. You do not need to modify the CU. |
| Script Parameters | Parameter Name | The name of the parameter that you specified for the ADB Spark SQL node. For example, you can configure the $[yyyymmdd] parameter in the script to perform batch synchronization of daily new data; a sketch follows this table. For the supported parameters and their formats, see Configure Scheduling Parameters. Note: The system automatically detects the parameter names configured in the node. |
| Script Parameters | Parameter Value | Configure the parameter value. At runtime, the parameter is dynamically replaced with its actual value. |
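The following is a minimal sketch of how a script parameter might be used in the SQL of the node. The parameter name dt, the table adb_spark_db.orders_daily, and its dt column are hypothetical, and the ${dt} placeholder follows the common DataWorks script-parameter convention; you would assign a value such as $[yyyymmdd] to dt in the Script Parameters section so that it is substituted at run time.

-- Hypothetical example: ${dt} is replaced with the value configured for the dt
-- parameter, for example $[yyyymmdd], each time the node runs.
SELECT * FROM adb_spark_db.orders_daily WHERE dt = '${dt}';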
(Optional) To run node tasks on a schedule, in the Scheduling pane on the right side of the node, configure the Computing Resource, AnalyticDB Computing Resource Group, and Resource Group parameters in the Scheduling Policies section. Then, configure the parameters in the Scheduling Parameters section.
After you configure the debug settings, click the save icon to save the SQL node. Then, click the run icon to test the SQL script and verify that it runs as expected. After the scheduling configuration is complete, you can submit and publish the node to the production environment.
After you publish a task, it runs periodically based on the configured parameters. You can view and manage the published auto triggered tasks on the Operation Center page. For more information, see Introduction to Operation Center.
Schedule Spark JAR jobs in DataWorks
Step 1: Create an ADB Spark node
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose in the Actions column.
Click the icon next to Project Folder and select the option to create an ADB Spark node. In the dialog box that appears, enter a name for the node and press Enter.
Step 2: Develop the ADB Spark node
ADB Spark nodes support development in Java/Scala and Python.
Java/Scala development
Prepare a sample JAR package. You can download the sample JAR package spark-examples_2.12-3.2.0.jar to develop and schedule the ADB Spark node.

Upload the sample code spark-examples_2.12-3.2.0.jar to an OSS bucket that is in the same region as the AnalyticDB for MySQL cluster. For more information, see Upload files using the console.

Configure the ADB Spark node.
| Language | Parameter | Description |
| --- | --- | --- |
| Java/Scala | Main JAR Resource | The storage path of the JAR package in OSS. Example: oss://testBucketname/db_dome/spark-examples_2.12-3.2.0.jar. |
| Java/Scala | Main Class | The name of the main class to run. Example: com.work.SparkWork. |
| Java/Scala | Parameters | Enter the parameters to pass to the code. |
| Java/Scala | Configuration Items | Configure the runtime parameters for the Spark application. For more information, see Spark application configuration parameters. Example: spark.driver.resourceSpec: medium. A sketch with additional configuration items follows this table. |
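For reference, a set of configuration items for a small job might look like the following sketch. Only spark.driver.resourceSpec: medium is taken from this topic; spark.executor.resourceSpec and spark.executor.instances are assumed here as commonly used AnalyticDB for MySQL Spark parameters, so confirm the exact keys and supported values in Spark application configuration parameters.

spark.driver.resourceSpec: medium
spark.executor.resourceSpec: medium
spark.executor.instances: 2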
Python development
Prepare test data. Create a TXT file named data.txt that you want to read using Spark, and add the following content to the file:

Hello,Dataworks
Hello,OSS

Write sample code. Create a file named spark_oss.py and add the following content to the spark_oss.py file:

import sys
from pyspark.sql import SparkSession

# Initialize Spark.
spark = SparkSession.builder.appName('OSS Example').getOrCreate()

# Read the specified file. The file path is specified by the value passed in by args.
textFile = spark.sparkContext.textFile(sys.argv[1])

# Calculate and print the number of lines in the file.
print("File total lines: " + str(textFile.count()))

# Print the first line of the file.
print("First line is: " + textFile.first())

Upload the test data data.txt and the sample code spark_oss.py to an OSS bucket that is in the same region as the AnalyticDB for MySQL cluster. For more information, see Upload files using the console.

Configure the ADB Spark node.
| Language | Parameter | Description |
| --- | --- | --- |
| Python | Main Package | The OSS path to which you uploaded spark_oss.py. Example: oss://testBucketname/db_dome/spark_oss.py. |
| Python | Parameters | The OSS path to which you uploaded data.txt. Example: oss://testBucketname/db_dome/data.txt. This value is passed to the script as sys.argv[1]. |
| Python | Configuration Items | Configure the runtime parameters for your Spark program. For more information, see Spark application configuration parameters. Example: spark.driver.resourceSpec: medium. |
Step 3: Configure and run the ADB Spark node
In the right-side pane, click Debugging Configurations to configure runtime parameters for the ADB Spark node.
| Parameter Type | Parameter | Description |
| --- | --- | --- |
| Computing Resource | Computing Resource | Select the attached AnalyticDB for Spark computing resource. |
| Computing Resource | AnalyticDB Computing Resource Group | Select the Job resource group that you created in the AnalyticDB for MySQL cluster. |
| Resource Group | Resource Group | Select the DataWorks resource group that passed the connectivity test when you attached the AnalyticDB for Spark computing resource. |
| Resource Group | CUs For Computing | The node uses the default CU value. You do not need to modify the CU. |
| Script Parameters | Parameter Name | The name of the parameter that you configured in the ADB Spark JAR job. Note: The system automatically detects the parameter names configured in the node. |
| Script Parameters | Parameter Value | Configure the parameter value. At runtime, the parameter is dynamically replaced with its actual value. |
(Optional) To run a node task on a schedule, in the Scheduling pane on the right side of the node, configure the Computing Resource, AnalyticDB Computing Resource Group, and Resource Group parameters in the Scheduling Policies section. Then, configure the parameters in the Scheduling Parameters section.
After you configure the debug settings, click the save icon to save the node. Then, click the run icon to test the script and verify that it runs as expected. After you complete the scheduling configuration, you can publish the node to the production environment.
A published auto triggered task runs periodically based on the configured parameters. You can view and manage the task on the Operation Center page. For more information, see Introduction to Operation Center.