Build CDH Spark SQL Nodes for Structured Data Processing

Spark SQL nodes allow you to use a distributed SQL query engine to process structured data. This improves the running efficiency of jobs. DataWorks provides CDH Spark SQL nodes that you can use to develop and periodically schedule CDH Spark SQL tasks and integrate the tasks with other types of tasks. This topic describes how to create and use a CDH Spark SQL node.

Prerequisites

A workflow is created in DataStudio.

In DataStudio, development tasks are organized into workflows. You must create a workflow before you can create a node. For more information, see Create a workflow.
A CDH cluster is created and registered with your DataWorks workspace.

You must register your CDH cluster with a DataWorks workspace before creating CDH nodes and tasks. For more information, see Bind a CDH compute resource in the old version of DataStudio.
A serverless resource group is purchased and configured. The configurations include association with a workspace and network configuration. For more information, see Create and use a serverless resource group.

Limits

Tasks on CDH Spark SQL nodes can be run on serverless resource groups or old-version exclusive resource groups for scheduling. We recommend that you run tasks on serverless resource groups.

Step 1: Create a CDH Spark SQL node

Log on to the DataWorks console. In the target region, click Data Development and O&M > Data Development in the left-side navigation pane. Select a workspace from the drop-down list and click Go to Data Development.
Right-click the target workflow and choose Create Node > CDH > CDH Spark SQL.
In the Create Node dialog box, enter a Name for the node and click OK. The node editor opens, where you can develop and configure the task.

Step 2: Develop a CDH Spark SQL task

(Optional) Select a CDH cluster

If multiple Cloudera's Distribution including Apache Hadoop (CDH) clusters are registered with the current workspace, select the target cluster from the Engine Instance CDH drop-down list. If only one CDH cluster is registered with the workspace, that cluster is used for development automatically.

If the task needs to access a domain that is protected by an IP address whitelist, use an exclusive resource group for scheduling.

Develop SQL code

Develop SQL code: Simple example

In the code editor on the configuration tab of the CDH Spark SQL node, write task code.

This example creates the test_lineage_table_f1 and test_lineage_table_t2 tables in the test_spark database and then copies data from test_lineage_table_f1 to test_lineage_table_t2. Sample code:

Note

This example is for reference only. You can write code based on your business requirements.

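-- Source table, partitioned by ds.
CREATE TABLE IF NOT EXISTS test_spark.test_lineage_table_f1 (`id` BIGINT, `name` STRING)
PARTITIONED BY (`ds` STRING);
-- Target table, created from the source table's columns and rows if it does not exist yet.
CREATE TABLE IF NOT EXISTS test_spark.test_lineage_table_t2 AS SELECT * FROM test_spark.test_lineage_table_f1;
-- Copy all rows from the source table to the target table.
INSERT INTO test_spark.test_lineage_table_t2 SELECT * FROM test_spark.test_lineage_table_f1;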
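
The copy only produces visible rows if the source table holds data in at least one partition. The following is a minimal sketch of loading and checking a partition, assuming the two tables above already exist. The sample rows and the partition value 20240101 are hypothetical and only serve as an illustration.

```sql
-- Assumption: illustrative sample rows written to a hypothetical partition.
INSERT INTO test_spark.test_lineage_table_f1 PARTITION (ds = '20240101')
VALUES (1, 'alice'), (2, 'bob');

-- Copy only that partition's rows into the target table.
INSERT INTO test_spark.test_lineage_table_t2
SELECT * FROM test_spark.test_lineage_table_f1
WHERE ds = '20240101';

-- Count the copied rows per ds value in the target table.
SELECT ds, COUNT(*) AS row_cnt
FROM test_spark.test_lineage_table_t2
GROUP BY ds;
```

Because the target table is created with SELECT *, it carries ds as a regular column, which is why the final query can group by it.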

Develop SQL code: Use scheduling parameters

DataWorks provides scheduling parameters that let you dynamically pass values to variables in the code of scheduled tasks. Define a variable in your task code in the ${variable_name} format. Then, in the right-side navigation pane of the node editor page, go to Scheduling Settings > Scheduling Parameter and assign a value to the variable. For more information about supported formats and configuration, see Supported formats of scheduling parameters and Configure and use scheduling parameters.

Sample code:

SELECT '${var}'; -- You can assign a specific scheduling parameter to the var variable.
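
A common use of a scheduling parameter is to select the partition that a scheduled run should read. The sketch below assumes a variable named bizdate (a name chosen here for illustration) that is assigned the built-in $bizdate value under Scheduling Settings > Scheduling Parameter. It reuses the table from the simple example.

```sql
-- Assumption: the scheduling parameter is configured as bizdate=$bizdate,
-- so ${bizdate} is replaced with the data timestamp before the query runs.
SELECT id, name
FROM test_spark.test_lineage_table_f1
WHERE ds = '${bizdate}';
```

The placeholder stays inside the single quotes so that the substituted value is compared with ds as a string.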

(Optional) Configure advanced parameters

On the CDH Spark SQL node editor page, click Advanced Settings in the right-side navigation pane to configure task runtime parameters. For example:

"spark.driver.memory": "2g": specifies the amount of memory allocated to the Spark driver.
"spark.yarn.queue": "haha": specifies the YARN queue to which the application is submitted. The value haha is an example queue name.

For more information about how to configure advanced parameters, see Spark Configuration.

Step 3: Configure task scheduling

To periodically run the node task, click Scheduling on the right side of the node editing page and configure scheduling settings based on your needs. For more information, see Overview of task scheduling properties.

Note

You must configure the node’s Rerun attribute and Parent Nodes before you can submit the node.

Step 4: Test task code

Perform the following test operations as needed to verify that the task behaves as expected.

(Optional) Select a resource group and assign custom parameter values.
- Click the icon in the toolbar. In the Parameter dialog box, select the resource group for scheduling that you want to use for testing.
- If your task code uses scheduling parameter variables, assign values to them here for testing. For more information about parameter assignment logic, see Task debugging process.
Save and run the task code.

Click the icon in the toolbar to save your task code. Then click the icon to run the task.
(Optional) Perform smoke testing.

To verify in the development environment that the scheduled node task runs as expected, perform smoke testing when you submit the node or after the node is submitted. For more information, see Perform smoke testing.

Step 5: Submit and publish the task

After configuring the node task, submit and publish it. Once published, the node runs periodically based on its scheduling configuration.

Click the icon in the toolbar to save the node.
Click the icon in the toolbar to submit the node task.

In the Submission dialog box, enter a Change Description. Optionally, choose whether to require code review after submission.
Note
- You must configure the node’s Rerun attribute and Parent Nodes before you can submit the node.
- Code review helps ensure code quality and prevents errors caused by unreviewed code being published directly to production. If code review is enabled, the submitted node code must be approved by reviewers before it can be published. For more information, see Code review.

If you are using a workspace in standard mode, after successfully submitting the task, click Publish in the upper-right corner of the node editing page to deploy the task to the production environment. For more information, see Publish a task.

What to do next

Task O&M: After a task is submitted and published, it runs periodically based on the node's configuration. You can click O&M in the upper-right corner of the node editing page to go to the Operation Center and view the scheduling and running status of periodic tasks. For more information, see Manage periodic tasks.
View lineages: After you submit and publish the task, you can view its lineage on the Data Map page. For example, you can see where the original data comes from and which database the table data flows to, and then analyze the upstream and downstream impact at each lineage level based on your business requirements. For more information, see View lineages.