
DataWorks: Create a CDH Spark node

Last Updated: Mar 26, 2026

DataWorks supports Cloudera's Distribution Including Apache Hadoop (CDH) Spark nodes, which let you develop and periodically schedule Spark tasks without leaving the DataWorks interface. This topic describes how to create a CDH Spark node, upload a JAR file, configure scheduling, and commit the task to production.

Prerequisites

Before you begin, make sure you have:

Limitations

CDH Spark tasks can run on serverless resource groups or old-version exclusive resource groups for scheduling. We recommend that you use serverless resource groups.

Prepare Spark task code and a JAR file

Before scheduling a CDH Spark task in DataWorks, prepare your Spark task code in CDH and compile it into a JAR file. For details, see Overview.

Note

You must upload the JAR file to the DataWorks console before DataWorks can schedule your CDH Spark tasks.
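
For reference, the task behind the spark-examples JAR used later in this topic looks roughly like the following minimal Scala sketch. It only illustrates the shape of a Spark task that can be compiled into a JAR; the package, object name, and logic of your own task will differ.

    package org.apache.spark.examples

    import org.apache.spark.sql.SparkSession
    import scala.util.Random

    // Minimal Pi-estimation task, modeled on the SparkPi example referenced later in this topic.
    object SparkPi {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("Spark Pi").getOrCreate()
        // The trailing argument of the spark-submit command (for example, 100) arrives as args(0).
        val slices = if (args.length > 0) args(0).toInt else 2
        val n = 100000 * slices
        val count = spark.sparkContext
          .parallelize(1 to n, slices)
          .map { _ =>
            val x = Random.nextDouble() * 2 - 1
            val y = Random.nextDouble() * 2 - 1
            if (x * x + y * y <= 1) 1 else 0
          }
          .reduce(_ + _)
        println(s"Pi is roughly ${4.0 * count / n}")
        spark.stop()
      }
    }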

Step 1: Create a CDH Spark node

  1. Log on to the DataWorks console. In the top navigation bar, select the target region. In the left navigation pane, choose Data Development and O&M > Data Development, select a workspace, and click Go to Data Development.

  2. On the DataStudio page, find your workflow, right-click its name, and choose Create Node > CDH > CDH Spark.

  3. In the Create Node dialog box, set the Engine Instance, Path, and Name parameters.

  4. Click Confirm.

Step 2: Create and reference a CDH JAR resource

Reference a JAR file in the CDH Spark node, then write a spark-submit command to submit the task.

Create a CDH JAR resource

Find your workflow and click CDH. Right-click Resource and choose Create Resource > CDH JAR. In the Create Resource dialog box, click Upload and select the JAR file to upload.
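
The JAR file that you upload here is the artifact produced when you compile your task code. If you build with sbt, for example (Maven works just as well), a minimal build definition that yields a JAR named like the one used in this topic might look as follows; the project name and versions are illustrative assumptions, and Spark is marked provided because the CDH cluster supplies it at run time.

    // build.sbt -- minimal sketch; names and versions are illustrative
    name := "spark-examples"
    version := "2.4.0"
    scalaVersion := "2.11.12"

    // Spark is provided by the CDH cluster at run time, so it is not bundled into the JAR.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "2.4.0" % "provided",
      "org.apache.spark" %% "spark-sql"  % "2.4.0" % "provided"
    )

With this build.sbt, running sbt package produces target/scala-2.11/spark-examples_2.11-2.4.0.jar, which is the kind of file to select in the Create Resource dialog box.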


Reference the CDH JAR resource

  1. Go to the configuration tab of the node you created.

  2. Under Resource in the CDH folder, right-click the resource name and select Insert Resource Path. This example uses a resource named spark-examples_2.11-2.4.0.jar. A clause in the ##@resource_reference{""} format appears on the configuration tab when the resource is successfully referenced:

    ##@resource_reference{"spark-examples_2.11-2.4.0.jar"}
    spark-examples_2.11-2.4.0.jar


  3. Edit the node code to add the spark-submit command. Use the following format:

    Important

    Do not add comments in CDH Spark node code. Comments cause a runtime error.

    ##@resource_reference{"spark-examples_2.11-2.4.0.jar"}
    spark-submit --class org.apache.spark.examples.SparkPi --master yarn spark-examples_2.11-2.4.0.jar 100

    Parameter                          | Description
    org.apache.spark.examples.SparkPi  | The main class of the task in your compiled JAR file.
    spark-examples_2.11-2.4.0.jar      | The name of the JAR file that you uploaded.

Step 3: Configure task scheduling properties

To run the task on a schedule, click Properties in the right navigation pane of the node configuration tab.

Note

Configure the Rerun and Parent Nodes parameters before you commit the task.

Step 4: Debug task code

  1. (Optional) Select a resource group and assign values to scheduling parameters.

    • Click the Run with Parameters icon in the top toolbar to open the Parameters dialog box. Select the resource group for debugging.

    • If your task code uses scheduling parameters, assign values to those parameters for the debug run (a sketch of how a task might consume such a value appears after this list). See Differences in value assignment logic across run modes for details.

  2. Click the Save icon to save the task, and then click the Run icon to run it.

  3. (Optional) Run a smoke test in the development environment before or after committing the task. See Perform smoke testing.
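
If your task code consumes scheduling parameter values, one common pattern is to pass the value as a trailing argument of the spark-submit command and read it in the main class, roughly as in the following sketch. The object name, the argument convention, and the table name are assumptions for illustration only.

    import org.apache.spark.sql.SparkSession

    // Hypothetical task that receives a business date such as "20240101" as its first argument.
    object DailyReport {
      def main(args: Array[String]): Unit = {
        // Value supplied by the scheduler in production, or by the Parameters dialog box in a debug run.
        val bizdate = if (args.nonEmpty) args(0) else "19700101"
        val spark = SparkSession.builder
          .appName(s"DailyReport-$bizdate")
          .enableHiveSupport() // needed to query Hive tables on the CDH cluster
          .getOrCreate()
        // Count the rows of the partition for that date; demo_db.orders is a placeholder table.
        spark.sql(s"SELECT COUNT(*) FROM demo_db.orders WHERE ds = '$bizdate'").show()
        spark.stop()
      }
    }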

What's next

Commit and deploy the task

  1. Click the Save icon to save the task.

  2. Click the Submit icon to commit the task.

  3. In the Submit dialog box, enter a Change description and click Confirm.

  4. If your workspace runs in standard mode, deploy the task to the production environment after committing: click Deploy in the DataStudio top navigation bar. See Deploy tasks.

View the task in Operation Center

  1. Click Operation Center in the upper-right corner of the node configuration tab to go to Operation Center in the production environment.

  2. View and manage your scheduled task. See View and manage auto triggered tasks.

For a full Operation Center overview, see Overview.

Use cases

DataWorks handles job scheduling and monitoring so your Spark jobs are reliably submitted to CDH clusters at the configured schedule. Typical uses include:

  • Data analysis: Use Spark SQL, Dataset, and DataFrame APIs to aggregate, filter, and transform complex datasets for rapid insight (see the sketch after this list).

  • Stream processing: Use Spark Streaming to process real-time data streams and act on results immediately.

  • Machine learning: Use MLlib to preprocess data, extract features, and run model training and evaluation at scale.

  • Large-scale ETL (extract, transform, and load): Process large datasets and load results into data warehouses or other storage systems.
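
As a concrete illustration of the data analysis case, a job scheduled this way might aggregate data with the DataFrame API, as in the following minimal sketch; the in-memory sample data and column names are placeholders only.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{avg, sum}

    // Minimal DataFrame aggregation: total and average amount per category.
    object CategorySummary {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("CategorySummary").getOrCreate()
        import spark.implicits._

        // Placeholder in-memory data; a real job would read from HDFS, Hive, or another source.
        val orders = Seq(("books", 12.5), ("books", 30.0), ("toys", 7.99)).toDF("category", "amount")

        orders.groupBy("category")
          .agg(sum("amount").as("total"), avg("amount").as("average"))
          .show()

        spark.stop()
      }
    }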