
DataWorks: CDH Spark SQL node

Last Updated: Mar 26, 2026

The CDH Spark SQL node in DataWorks lets you run Spark SQL tasks on a CDH cluster, configure periodic scheduling, and integrate them with other jobs in your data pipeline.

This topic covers how to create, develop, and run a CDH Spark SQL node using the DataWorks Data Studio UI.

Prerequisites

Before you begin, make sure you have:

  • A CDH cluster bound to your DataWorks workspace, with the Spark component installed and Spark-related configurations set during binding. For more information, see Data Studio: Associate a CDH computing resource.

  • A Hive data source configured in DataWorks that has passed the connectivity test. For more information, see Data Source Management.

  • (RAM users only) Your RAM user added to the workspace with the Developer or Workspace Administrator role. For more information, see Add members to a workspace.

    Important

    The Workspace Administrator role has extensive permissions; grant it with caution. If you are using your root account, skip this step.

Create a node

For instructions, see Create a node.

Develop the node

Write your Spark SQL code in the SQL editor. To pass dynamic values into scheduled runs, define variables in your SQL using the ${variable_name} format, then assign their values under Scheduling configuration > Scheduling parameter on the right side of the editor. For more information, see Sources and expressions of scheduling parameters.
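For example, assuming a variable named bizdate that you assign the expression $[yyyymmdd-1] (the scheduled date minus one day) under Scheduling parameter, DataWorks replaces the placeholder with the resolved value before submitting the SQL. The table name below is hypothetical:

-- On an instance scheduled for 2025-01-02, '${bizdate}' resolves to '20250101'.
SELECT id, name FROM test_spark.demo_source WHERE ds = '${bizdate}';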

The following example creates two tables in the test_spark database and uses a scheduling parameter to populate the name field:

-- Create a partitioned source table.
CREATE TABLE IF NOT EXISTS test_spark.test_lineage_table_f1 (`id` BIGINT, `name` STRING)
PARTITIONED BY (`ds` STRING);
-- Create the target table with the same schema (id, name, ds) as the source.
CREATE TABLE IF NOT EXISTS test_spark.test_lineage_table_t2 AS SELECT * FROM test_spark.test_lineage_table_f1;
-- Copy rows into the target table, filling the name field from the ${var} scheduling parameter.
INSERT INTO test_spark.test_lineage_table_t2 SELECT id, '${var}', ds FROM test_spark.test_lineage_table_f1;
Note

This example creates test_lineage_table_f1 and test_lineage_table_t2 in the test_spark database, then copies data from test_lineage_table_f1 to test_lineage_table_t2. The ${var} parameter provides the value for the name field. Adapt the database and table names to your own environment.
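After the node runs successfully, you can verify the result with a quick query. A minimal check, assuming the tables created above:

-- Preview the rows that the INSERT statement wrote to the target table.
SELECT id, name, ds FROM test_spark.test_lineage_table_t2 LIMIT 10;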

Debug the node

To run the node, specify the CDH cluster that executes the task and the scheduling resource group that handles connectivity to your data source.

  1. In the Run Configuration section, configure the following fields:

    • Compute resources: The CDH cluster registered in DataWorks that runs the Spark SQL task.
    • Resource group: The scheduling resource group that passed the data source connectivity test. The resource group must have network access to your Hive data source. For more information, see Network connectivity solutions.
  2. On the toolbar, click Run.

What's next

  • Schedule the node: To run the node on a recurring schedule, configure its Time Property and related settings in the Scheduling configuration panel. For more information, see Node scheduling configuration.

  • Publish the node: To make the node available for scheduling, click the publish icon on the toolbar to publish the node to the production environment. Only nodes published to the production environment are scheduled.

  • Task O&M: After publishing, track scheduled runs in Operation Center. For more information, see Getting started with Operation Center.