CDH Spark SQL nodes run a distributed SQL query engine on a cluster that runs Cloudera's Distribution Including Apache Hadoop (CDH) to process structured data, which improves job execution efficiency. Use them to develop Spark SQL tasks in DataWorks, schedule them on a recurring basis, and integrate them with other task types.
Prerequisites
Before you begin, make sure you have the following:
A workflow in DataStudio — All development in DataStudio is organized within workflows. If you haven't created one yet, see Create a workflow.
A CDH cluster registered to DataWorks — CDH nodes require a registered CDH cluster. If no cluster is registered, node creation and task execution will fail. See Register a CDH or CDP cluster to DataWorks.
A serverless resource group, purchased and configured — The resource group must be associated with your workspace and have network connectivity configured. See Create and use a serverless resource group.
Limitations
CDH Spark SQL tasks can run on serverless resource groups or on old-version exclusive resource groups for scheduling. Serverless resource groups are recommended.
Step 1: Create a CDH Spark SQL node
Log on to the DataWorks console. In the top navigation bar, select your region. In the left-side navigation pane, choose Data Development and O&M > Data Development. Select your workspace from the drop-down list and click Go to Data Development.
On the DataStudio page, find your workflow, right-click the workflow name, and choose Create Node > CDH > CDH Spark SQL.
In the Create Node dialog box, set the Name parameter and click Confirm.
Step 2: Develop a CDH Spark SQL task
Select a CDH cluster (optional)
If multiple CDH clusters are registered to the workspace, select one from the Engine Instance CDH drop-down list. If only one cluster is registered, it is used automatically.

Write SQL code
In the code editor on the node's configuration tab, write your task code.
Example: create tables and copy data
The following example creates test_lineage_table_f1 and test_lineage_table_t2 in the test_spark database, then copies data between them. Modify the code to match your requirements.
CREATE TABLE IF NOT EXISTS test_spark.test_lineage_table_f1 (`id` BIGINT, `name` STRING)
PARTITIONED BY (`ds` STRING);
CREATE TABLE IF NOT EXISTS test_spark.test_lineage_table_t2 AS SELECT * FROM test_spark.test_lineage_table_f1;
INSERT INTO test_spark.test_lineage_table_t2 SELECT * FROM test_spark.test_lineage_table_f1;
Example: use scheduling parameters
DataWorks scheduling parameters are dynamically replaced in task code at runtime. Define variables in the ${Variable} format and assign values in the Scheduling Parameter section of the Properties tab.
SELECT '${var}'; -- You can assign a specific scheduling parameter to the var variable.
For supported formats and configuration steps, see Supported formats of scheduling parameters and Configure and use scheduling parameters.
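As an illustrative sketch of a common pattern, a scheduling parameter can supply the partition value for an insert. The following assumes a variable named ds is defined in the Scheduling Parameter section and assigned a date expression such as $bizdate; the table and variable names are hypothetical:

```sql
-- Hypothetical example: ${ds} is replaced at runtime with the value
-- assigned to the ds variable in the Scheduling Parameter section.
INSERT OVERWRITE TABLE test_spark.test_lineage_table_f1 PARTITION (ds = '${ds}')
SELECT `id`, `name` FROM test_spark.test_lineage_table_t2;
```

At each scheduled run, DataWorks substitutes the variable before the SQL is submitted, so each instance writes to the partition for its own business date.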
Configure advanced Spark settings (optional)
In the right-side navigation pane of the configuration tab, click Advanced Settings to configure Spark properties. The following are common examples:
| Property | Example value | Description |
|---|---|---|
| spark.driver.memory | "2g" | Memory allocated to the Spark driver. |
| spark.yarn.queue | "haha" | YARN queue to which the application is submitted. |
For the full list of available properties, see Spark configuration.
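For orientation, the sketch below shows how a handful of commonly tuned properties might look if collected in a spark-defaults style file. The keys are standard Spark properties; the values are illustrative placeholders, not recommendations:

```
# Illustrative values only; tune for your workload and cluster capacity.
spark.driver.memory      2g
spark.executor.memory    4g
spark.executor.instances 2
spark.yarn.queue         default
```

In DataWorks, you enter each property and its value individually in the Advanced Settings panel rather than editing a file.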
Step 3: Configure scheduling properties
To run the task on a schedule, click Properties in the right-side navigation pane to configure scheduling settings.
Configure the Rerun and Parent Nodes parameters before committing the task.
For details on scheduling options, see Overview.
Step 4: Debug the task
Optional: select a resource group and assign parameter values. Click the run icon in the top toolbar to open the Parameters dialog box. Select the resource group to use for debugging. If your code uses scheduling parameters, assign test values to the variables. For the value assignment logic, see Debugging procedure.
Save and run the task. Click the save icon in the top toolbar to save the task, then click the run icon to run it.
Optional: run smoke testing. When you commit the node, or after you have committed it, you can run smoke testing in the development environment to verify that the node behaves as expected. See Perform smoke testing.
Step 5: Commit and deploy the task
Click the save icon to save the task.
Click the commit icon. In the Submit dialog box, fill in the Change description field. You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task. If code review is enabled, the node can be deployed only after the code passes review. See Code review.
If your workspace is in standard mode, deploy the task to the production environment after committing. Click Deploy in the upper-right corner of the configuration tab. See Deploy tasks.
What's next
Monitor task runs — After the task is committed and deployed, it runs on the configured schedule. Click Operation Center in the upper-right corner to view scheduling status and run history. See View and manage auto triggered tasks.
View data lineage — After deployment, view data lineage on the Data Map page to trace data sources and downstream table flows. See View lineages.