
DataWorks:CDH Spark node

Last Updated: Mar 26, 2026

The CDH Spark node lets you develop and schedule Apache Spark jobs on your Cloudera's Distribution Including Apache Hadoop (CDH) cluster through DataWorks. Upload a compiled JAR package, configure the node with a spark-submit command, and run it on a recurring schedule.

How it works

  1. Develop your Spark job in your CDH environment and compile it into a JAR package.

  2. Upload the JAR package to DataWorks as a CDH JAR resource.

  3. Create a CDH Spark node, reference the JAR resource, write a spark-submit command, and run the node.

Prerequisites

Before you begin, ensure that you have:

  • A CDH cluster registered in DataWorks with the Spark component installed. For setup instructions, see Data Studio: Associate a CDH computing resource.

    Important

    When registering your CDH cluster in DataWorks, include the Spark component information in the registration form.

  • A Hive data source configured in DataWorks with a successful connectivity test. See Data Source Management.

  • (Optional, RAM users only) Your RAM user added to the DataWorks workspace with the Developer or Workspace Administrator role. The Workspace Administrator role has broad permissions — grant it with caution. See Add members to a workspace. Root account users can skip this step.

Step 1: Prepare your Spark JAR

Develop and test your Spark job in your CDH environment, then compile it into a JAR package. For guidance, see the CDH Spark development overview.
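As a concrete sketch, a Maven-based project can be packaged and inspected before upload. The project path, artifact name, and main class below are hypothetical examples, not part of the product; adapt them to your own build:

```shell
# Hypothetical build-and-check sketch for a Maven-based Spark project.
# The directory, artifact name, and class path are examples only.
cd my-spark-job/                 # your Spark project directory
mvn -q clean package             # produces target/my-spark-job-1.0.jar

# Confirm the JAR actually contains your main class before uploading it:
unzip -l target/my-spark-job-1.0.jar | grep 'com/example/SparkPi.class'
```

Verifying the main class up front avoids a common failure mode where spark-submit reports ClassNotFoundException only at runtime.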

Step 2: Upload the JAR as a resource

Upload the JAR package to DataWorks so the CDH Spark node can reference it at runtime.

  1. Go to Resource Management and click Upload. Select your JAR file from your local computer. For details, see Resource management.

  2. Set the following fields:

    Field             Description
    Storage Path      Location in DataWorks where the resource is stored
    Data Source       The Hive data source connected to your CDH cluster
    Resource Group    The scheduling resource group with a working data source connection
  3. Click Save.

Step 3: Create a CDH Spark node

Follow the instructions in Create a node to create a CDH Spark node in your workflow.

Step 4: Configure the node

Reference the JAR resource

  1. Open the CDH Spark node you created.

  2. In the Resource Management panel on the left, right-click the JAR resource and select Reference Resource. A resource reference statement appears in the code editor:

    ##@resource_reference{"spark-examples_2.11-2.4.0.jar"}
    spark-examples_2.11-2.4.0.jar

Write the spark-submit command

Add the spark-submit command below the resource reference line. The following example runs the SparkPi calculation:

##@resource_reference{"spark-examples_2.11-2.4.0.jar"}
spark-submit --class org.apache.spark.examples.SparkPi --master yarn spark-examples_2.11-2.4.0.jar 100

Key parameters:

Parameter    Description                                  Example
--class      The fully qualified main class in your JAR   org.apache.spark.examples.SparkPi
--master     The cluster manager                          yarn
JAR name     The name of the uploaded CDH JAR resource    spark-examples_2.11-2.4.0.jar
Important

The CDH Spark node editor does not support code comments. Remove all comments from your code before you run the node; otherwise, the job fails. The generated ##@resource_reference annotation is not a comment and must remain in place.
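To make the command's structure explicit, the following plain-bash sketch assembles the spark-submit line from the parts described in the table above. The variable names are illustrative; substitute your own main class, resource name, and application arguments:

```shell
#!/usr/bin/env bash
# Sketch: compose the spark-submit line for a CDH Spark node from its parts.
# The values mirror the SparkPi example above; replace them with your own.
MAIN_CLASS="org.apache.spark.examples.SparkPi"
MASTER="yarn"                               # CDH Spark jobs typically run on YARN
JAR_NAME="spark-examples_2.11-2.4.0.jar"    # must match the uploaded resource name exactly
APP_ARGS="100"                              # application arguments (here: number of tasks)

CMD="spark-submit --class ${MAIN_CLASS} --master ${MASTER} ${JAR_NAME} ${APP_ARGS}"
echo "${CMD}"
```

Note that the JAR is referenced by bare file name: at runtime the resource reference makes the uploaded JAR available to the node, so no path is needed, but the name must match the resource name character for character.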

Step 5: Run the node

  1. In the Run Configuration section, set:

    Parameter           Description
    Compute Resources   Select the CDH cluster registered in DataWorks
    Resource Group      Select a scheduling resource group with a working connection to the data source. For network setup options, see Network connectivity solutions
    Compute CUs         (Optional) Adjust based on your job's resource requirements. Default: 0.5
  2. Click Run in the toolbar.

What's next

  • Schedule the node: Configure Time Property and other scheduling settings in the Scheduling configuration panel to run the node on a recurring schedule. See Node scheduling configuration.

  • Publish the node: Click the publish icon to publish the node to the production environment. Only published nodes are scheduled. See Publish a node.

  • Monitor runs: After publishing, track scheduled runs in the O&M Center. See Getting started with Operation Center.