
DataWorks:CDH MR node

Last Updated: Nov 18, 2025

In DataWorks, you can create a Cloudera Distribution of Apache Hadoop (CDH) MapReduce (MR) node to process large datasets. This topic describes how to configure and use a CDH MR node in DataWorks.

Prerequisites

  • An Alibaba Cloud CDH cluster is created and registered with DataWorks. For more information, see DataStudio (legacy version): Associate a CDH computing resource.

  • (Required if you use a RAM user to develop tasks) The desired RAM user is added to your DataWorks workspace as a member and is assigned the Develop or Workspace Administrator role. The Workspace Administrator role has more permissions than necessary. Exercise caution when you assign the Workspace Administrator role. For more information about how to add a member, see Add workspace members and assign roles to them.

    Note

    If you use an Alibaba Cloud account, you can skip this operation.

  • A Hive data source is added to the workspace, and the data source has passed the network connectivity test. For more information, see Data Source Management.

Create a CDH JAR resource

You can upload a JAR package that contains your MapReduce task to DataWorks, and then reference the package in a CDH MR node for periodic scheduling.

  1. Click the Upload button to upload a JAR package from your local computer to the resource storage folder. For more information, see Resource Management.

  2. Select a Storage Path, Data Source, and Resource Group.

  3. Click the Save button.

Create a node

For more information about how to create a node, see Create a node.

Develop the node

Perform the following steps to develop the CDH MR node:

  1. Open the created CDH MR node. The code editor page opens.

  2. In the Resource Management pane of the navigation pane on the left, find the resource to reference. Right-click the resource and select Reference Resource.

  3. After you reference the resource, a statement in the ##@resource_reference{""} format is added to the code editor, indicating that the resource is referenced. Then, complete the node code as shown in the following example. The JAR package name, main class, bucket name, and paths in the example are for illustration only. Replace them with your actual values.

##@resource_reference{"onaliyun_mr_wordcount-1.0-SNAPSHOT.jar"}
onaliyun_mr_wordcount-1.0-SNAPSHOT.jar cn.apache.hadoop.onaliyun.examples.EmrWordCount oss://onaliyun-bucket-2/cdh/datas/wordcount02/inputs oss://onaliyun-bucket-2/cdh/datas/wordcount02/outputs
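The JAR referenced above implements a standard MapReduce word count: the map phase emits a (word, 1) pair for each token, and the reduce phase sums the counts per word. The following is a minimal, Hadoop-free sketch of that data flow. The class and method names are illustrative only; a real CDH MR task would implement org.apache.hadoop.mapreduce.Mapper and Reducer and read from and write to the input and output paths passed on the command line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch of the map -> shuffle -> reduce phases of a word count
// job. This plain-Java version only demonstrates the data flow; it is not the
// actual EmrWordCount class from the example command.
public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for each whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle + reduce phases: group the pairs by key and sum the counts.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : new String[] {"hello world", "hello cdh"}) {
            pairs.addAll(map(line));
        }
        System.out.println(reduce(pairs)); // prints {cdh=1, hello=2, world=1}
    }
}
```

In an actual job, the shuffle step is performed by the Hadoop framework between the map and reduce tasks, which is why only the Mapper and Reducer classes need to be packaged in the JAR.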

Debug the node

  1. In the Debug Configuration dialog box, go to the Computing Resource section and configure the Computing Resource and Resource Group parameters.

    1. For Computing Resource, select the name of the CDH cluster that you registered in DataWorks.

    2. For Resource Group, select the scheduling resource group that passed the connectivity test with the data source. For more information, see Network connectivity solutions.

  2. Click Run in the toolbar at the top of the node editing page.

What to do next

  • Node scheduling: To periodically execute a node in the project folder, set a Scheduling Policy in the Scheduling Configuration panel on the right and configure the scheduling properties.

  • Node publishing: If the task must run in the production environment, click the publish icon in the toolbar to publish the task. Nodes in the project folder are periodically scheduled only after they are published to the production environment.

  • Task O&M: After the task is published, you can view the status of the auto triggered task in the Operation Center. For more information, see Getting started with Operation Center.