Spark SQL nodes allow you to use a distributed SQL query engine to process structured data, which improves job execution efficiency. DataWorks provides CDH Spark SQL nodes that you can use to develop and periodically schedule CDH Spark SQL tasks and integrate the tasks with other types of tasks. This topic describes how to create and use a CDH Spark SQL node.
Prerequisites
An Alibaba Cloud CDH cluster is created and registered to DataWorks. For more information, see Register a CDH or CDP cluster to DataWorks.
(Required if you use a RAM user to develop tasks) The desired RAM user is added to your DataWorks workspace as a member and is assigned the Develop or Workspace Administrator role. The Workspace Administrator role has more permissions than necessary; exercise caution when you assign it. For more information about how to add a member, see Add workspace members and assign roles to them.
Note: If you use an Alibaba Cloud account, you can skip this operation.
A Hive data source is added to the workspace, and the data source has passed the network connectivity test. For more information, see Add and manage data sources.
A CDH Spark SQL node is created.
Procedure
On the configuration tab of the CDH Spark SQL node, perform the following operations to develop a task based on the CDH Spark SQL node:
Develop SQL code
In the SQL editor, develop the task code. You can define variables in the ${Variable name} format in the task code and assign values to them by configuring scheduling parameters in the Scheduling Parameters section of the Properties tab in the right-side navigation pane of the configuration tab of the CDH Spark SQL node. This way, the values of the scheduling parameters dynamically replace the variables in the node code when the node is scheduled to run. Sample code:
CREATE TABLE IF NOT EXISTS test_spark.test_lineage_table_f1 (`id` BIGINT, `name` STRING) PARTITIONED BY (`ds` STRING);

CREATE TABLE IF NOT EXISTS test_spark.test_lineage_table_t2 AS
SELECT * FROM test_spark.test_lineage_table_f1;

INSERT INTO test_spark.test_lineage_table_t2
SELECT id, ${var} FROM test_spark.test_lineage_table_f1;
Note: In this example, you create the test_lineage_table_f1 and test_lineage_table_t2 tables in the test_spark database and copy data from the test_lineage_table_f1 table to the test_lineage_table_t2 table. The field name is assigned to the ${var} parameter in the code as the value; the resolved statement is shown below. This example is for reference only. You can write code based on your business requirements.
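For example, under the assumption that you add a parameter named var with the value name in the Scheduling Parameters section, the INSERT statement resolves at run time to the following:

-- ${var} is replaced with the column name `name` when the node is scheduled to run.
INSERT INTO test_spark.test_lineage_table_t2
SELECT id, name FROM test_spark.test_lineage_table_f1;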
Run the task on the CDH Spark SQL node
On the Debugging Configurations tab in the right-side navigation pane of the configuration tab of the CDH Spark SQL node, configure the Computing Resource and Resource Group parameters.
For Computing Resource, select the CDH cluster that you registered in DataWorks.
For Resource Group, select a resource group for scheduling that has passed the network connectivity test with the CDH computing resource. For more information, see Network connectivity solutions.
Note: You can also configure the CUs for Computing parameter based on the resources required for task execution. The default value of this parameter is 0.25.
Click Run to run the task on the node. After the run succeeds, you can verify the result with a quick query, as shown in the sketch below.
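The following is a minimal verification sketch, assuming the sample tables from the preceding code exist in the test_spark database; adjust the database and table names to match your environment:

-- Count the rows copied into the target table.
SELECT COUNT(*) FROM test_spark.test_lineage_table_t2;
-- Inspect a few rows to confirm the expected columns.
SELECT * FROM test_spark.test_lineage_table_t2 LIMIT 10;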
If you want to run the task on the node on a regular basis, configure the scheduling information on the Properties tab based on your business requirements.
After the task on the node is configured, commit and deploy the task.
After you deploy the task on the node, view the status of the task in Operation Center. For more information, see Getting started with Operation Center.