The CDH Spark node lets you develop and schedule Apache Spark jobs on your Cloudera Distribution Hadoop (CDH) cluster through DataWorks. Upload a compiled JAR package, configure the node with a spark-submit command, and run it on a recurring schedule.
How it works
Develop your Spark job in your CDH environment and compile it into a JAR package.
Upload the JAR package to DataWorks as a CDH JAR resource.
Create a CDH Spark node, reference the JAR resource, write a spark-submit command, and run the node.
Prerequisites
Before you begin, ensure that you have:
A CDH cluster registered in DataWorks with the Spark component installed. For setup instructions, see Data Studio: Associate a CDH computing resource.
Important: When registering your CDH cluster in DataWorks, include the Spark component information in the registration form.
A Hive data source configured in DataWorks with a successful connectivity test. See Data Source Management.
(Optional, RAM users only) Your RAM user added to the DataWorks workspace with the Developer or Workspace Administrator role. The Workspace Administrator role has broad permissions — grant it with caution. See Add members to a workspace. Root account users can skip this step.
Step 1: Prepare your Spark JAR
Develop and test your Spark job in your CDH environment, then compile it into a JAR package. For guidance, see the CDH Spark development overview.
Step 2: Upload the JAR as a resource
Upload the JAR package to DataWorks so the CDH Spark node can reference it at runtime.
Go to Resource Management and click Upload. Select your JAR file from your local computer. For details, see Resource management.
Set the following fields:
| Field | Description |
|---|---|
| Storage Path | Location in DataWorks where the resource is stored |
| Data Source | The Hive data source connected to your CDH cluster |
| Resource Group | The scheduling resource group with a working data source connection |

Click Save.
Step 3: Create a CDH Spark node
Follow the instructions in Create a node to create a CDH Spark node in your workflow.
Step 4: Configure the node
Reference the JAR resource
Open the CDH Spark node you created.
In the Resource Management panel on the left, right-click the JAR resource and select Reference Resource. A resource reference statement appears in the code editor:
##@resource_reference{"spark-examples_2.11-2.4.0.jar"}
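A common runtime failure is a mismatch between the JAR name in the `@resource_reference` annotation and the JAR name passed to spark-submit. The following is a hypothetical helper (not part of DataWorks) that extracts both names from a node script and checks they agree; the sample script text is the SparkPi example from this article.

```shell
#!/bin/sh
# Hypothetical check: the JAR named in ##@resource_reference must match
# the JAR argument on the spark-submit line, or the job cannot find the file.

node_script='##@resource_reference{"spark-examples_2.11-2.4.0.jar"}
spark-submit --class org.apache.spark.examples.SparkPi --master yarn spark-examples_2.11-2.4.0.jar 100'

# Name inside @resource_reference{"..."}
ref_jar=$(printf '%s\n' "$node_script" | sed -n 's/.*@resource_reference{"\([^"]*\)"}.*/\1/p')

# First *.jar token on the spark-submit line
submit_jar=$(printf '%s\n' "$node_script" | grep '^spark-submit' | tr ' ' '\n' | grep '\.jar$' | head -n 1)

if [ "$ref_jar" = "$submit_jar" ]; then
  echo "OK: $ref_jar"
else
  echo "MISMATCH: reference=$ref_jar submit=$submit_jar" >&2
  exit 1
fi
```

Running the same check against a script where the two names differ (for example, underscores in one and hyphens in the other) exits non-zero, which is exactly the situation to catch before scheduling the node.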
Write the spark-submit command
Add the spark-submit command below the resource reference line. The following example runs the SparkPi calculation:
##@resource_reference{"spark-examples_2.11-2.4.0.jar"}
spark-submit --class org.apache.spark.examples.SparkPi --master yarn spark-examples_2.11-2.4.0.jar 100

Key parameters:
| Parameter | Description | Example |
|---|---|---|
| --class | The fully qualified main class in your JAR | org.apache.spark.examples.SparkPi |
| --master | The cluster manager | yarn |
| JAR name | The name of the uploaded CDH JAR resource | spark-examples_2.11-2.4.0.jar |
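Beyond the three parameters above, spark-submit accepts standard Spark tuning flags such as --deploy-mode, --driver-memory, --executor-memory, --executor-cores, and --num-executors. The sketch below assembles a fuller command as a string; the flag names are standard Spark options, but the specific values shown are illustrative examples, not DataWorks requirements.

```shell
#!/bin/sh
# Sketch: a fuller spark-submit command for the node. Memory sizes and
# executor counts below are example values -- tune them for your cluster.

MAIN_CLASS="org.apache.spark.examples.SparkPi"
JAR_NAME="spark-examples_2.11-2.4.0.jar"

CMD="spark-submit --class $MAIN_CLASS --master yarn --deploy-mode cluster --driver-memory 1g --executor-memory 2g --executor-cores 2 --num-executors 4 $JAR_NAME 100"

echo "$CMD"
```

In the node editor this whole command would follow the resource reference line, just as in the minimal SparkPi example above.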
The CDH Spark node editor does not support code comments. Remove all comments from your code before running, or the job will fail.
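If your job script accumulates comments during development, they must be stripped before pasting into the node editor, while the `##@resource_reference` annotation, which also starts with `#`, must be kept. The following is a hypothetical cleanup sketch (not a DataWorks tool) using a sed filter under that assumption.

```shell
#!/bin/sh
# Hypothetical pre-paste cleanup: delete shell-style comment lines from a
# node script, but keep ##@resource_reference annotations, which begin
# with '#' yet must survive.

script='##@resource_reference{"spark-examples_2.11-2.4.0.jar"}
# TODO: tune executor memory
spark-submit --class org.apache.spark.examples.SparkPi --master yarn spark-examples_2.11-2.4.0.jar 100'

# On lines that are NOT resource references, delete leading-# comment lines.
cleaned=$(printf '%s\n' "$script" | sed '/^##@resource_reference/!{/^[[:space:]]*#/d;}')

printf '%s\n' "$cleaned"
```

The filtered output keeps the annotation and the spark-submit command and drops only the TODO comment line.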
Step 5: Run the node
In the Run Configuration section, set:
| Parameter | Description |
|---|---|
| Compute Resources | Select the CDH cluster registered in DataWorks |
| Resource Group | Select a scheduling resource group with a working connection to the data source. See Network connectivity solutions for network setup options |
| Compute CUs | (Optional) Adjust based on your job's resource requirements. Default: 0.5 |

Click Run in the toolbar.
What's next
Schedule the node: Configure Time Property and other scheduling settings in the Scheduling configuration panel to run the node on a recurring schedule. See Node scheduling configuration.
Publish the node: Click the publish icon to publish the node to the production environment. Only published nodes are scheduled. See Publish a node.
Monitor runs: After publishing, track scheduled runs in the O&M Center. See Getting started with Operation Center.