DataWorks: Create a CDH MR node

Last Updated: Mar 13, 2024

In DataWorks DataStudio, you can create a CDH MapReduce (MR) node to process ultra-large datasets. CDH stands for Cloudera's Distribution Including Apache Hadoop. This topic describes how to create and use a CDH MR node in DataWorks.

Prerequisites

Step 1: Create a CDH MR node

  1. Go to the DataStudio page.

    Log on to the DataWorks console. In the left-side navigation pane, choose Data Modeling and Development > DataStudio. On the page that appears, select the desired workspace from the drop-down list and click Go to DataStudio.

  2. On the DataStudio page, find the desired workflow, right-click the workflow name, and then choose Create Node > CDH > CDH MR.

  3. In the Create Node dialog box, configure the Engine Instance, Path, and Name parameters.

  4. Click Confirm. Then, you can use the created node to develop and configure tasks.

Step 2: Create and reference a CDH JAR resource

Before you can reference a resource in a node, you must upload the resource from your on-premises machine to DataStudio. Perform the following operations to create and reference a CDH JAR resource:

  1. Create a CDH JAR resource.

    Find the desired workflow and click CDH. Right-click Resource and choose Create Resource > CDH JAR. In the Create Resource dialog box, click Upload and select the JAR file that you want to upload.

  2. Reference the CDH JAR resource.

    1. Go to the configuration tab of the created CDH MR node.

    2. Find the resource that you want to reference under Resource in the CDH folder, right-click the resource name, and then select Insert Resource Path. In this example, a resource named onaliyun_mr_wordcount-1.0-SNAPSHOT.jar is used.

      If a statement in the ##@resource_reference{""} format appears on the configuration tab of the node, the resource is referenced. Then, run the following code. Replace the resource package name, bucket name, and directory in the code with your actual values.

      ##@resource_reference{"onaliyun_mr_wordcount-1.0-SNAPSHOT.jar"}
      onaliyun_mr_wordcount-1.0-SNAPSHOT.jar cn.apache.hadoop.onaliyun.examples.EmrWordCount oss://onaliyun-bucket-2/cdh/datas/wordcount02/inputs oss://onaliyun-bucket-2/cdh/datas/wordcount02/outputs

      Note: Do not add comments when you write code for a CDH MR node.
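The node code in this example has two lines: the resource reference and a command made up of the JAR name, the main class, and two OSS paths. The following Python sketch assembles that text from its parts, which can help when only the bucket or directory changes. The function name build_cdh_mr_node_code and its parameter names are hypothetical and are not part of DataWorks. Treating the two paths as the input and output of the job is an assumption based on the path names in the example.

```python
def build_cdh_mr_node_code(jar_name, main_class, input_path, output_path):
    """Return the two-line CDH MR node code in the layout used in this topic.

    Hypothetical helper: DataWorks does not provide this function.
    """
    for value in (jar_name, main_class, input_path, output_path):
        # The command is space-separated, so each part must be a single token.
        if not value or any(ch.isspace() for ch in value):
            raise ValueError(f"expected a non-empty value without whitespace: {value!r}")

    # Line 1: the reference that DataStudio inserts for the uploaded JAR.
    reference = '##@resource_reference{"%s"}' % jar_name
    # Line 2: JAR, main class, then the two paths (assumed input and output).
    command = " ".join([jar_name, main_class, input_path, output_path])
    return reference + "\n" + command


node_code = build_cdh_mr_node_code(
    jar_name="onaliyun_mr_wordcount-1.0-SNAPSHOT.jar",
    main_class="cn.apache.hadoop.onaliyun.examples.EmrWordCount",
    input_path="oss://onaliyun-bucket-2/cdh/datas/wordcount02/inputs",
    output_path="oss://onaliyun-bucket-2/cdh/datas/wordcount02/outputs",
)
print(node_code)
```

Paste the printed text into the configuration tab of the node. The helper emits no extra comment lines, in keeping with the note above.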

What to do next

  1. Commit and deploy the node.

    1. Click the Save icon in the top toolbar to save the node.

    2. Click the Submit icon in the top toolbar to commit the node.

    3. In the Commit Node dialog box, configure the Change description parameter.

    4. Click OK.

    If you use a workspace in standard mode, you must deploy the node to the production environment after you commit the node. On the left side of the top navigation bar, click Deploy. For more information, see Deploy nodes.

  2. View the node.

    1. Click Operation Center in the upper-right corner of the configuration tab of the node to go to Operation Center in the production environment.

    2. View the scheduled node. For more information, see View and manage auto triggered nodes.

    To view more information about the node, click Operation Center in the top navigation bar of the DataStudio page. For more information, see Overview.