Run MapReduce (MR) jobs on an Alibaba Cloud CDH cluster by creating a CDH MR node in DataWorks. Upload a compiled JAR package as a resource, configure the compute resource and resource group, then run and schedule the node for recurring execution.
Prerequisites
Before you begin, ensure that you have:
A CDH cluster created in Alibaba Cloud and bound to your DataWorks workspace. See Data Studio: Associate a CDH computing resource.
A Hive data source configured in DataWorks that has passed the connectivity test. See Data Source Management.
(RAM user only) Your RAM user added to the workspace with the Developer or Workspace Administrator role. See Add members to a workspace.
Root account users can skip the RAM user step. Grant the Workspace Administrator role with caution, because it carries extensive permissions.
Create a CDH JAR resource
Upload your compiled JAR package to DataWorks so the CDH MR node can reference it during execution.
Go to Resource Management and click Click to Upload to select the JAR package from your local machine.
Set the following fields:
Storage Path: The path in DataWorks where the resource is stored.
Data Source: The data source associated with this resource.
Resource Group: The resource group used to manage and run the resource.
Click Save.
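Before uploading, you need a compiled JAR. A minimal build sketch, assuming a Maven project (the artifact name below is hypothetical; match it to your own pom.xml):

```shell
# Build the MapReduce JAR from the project root (Maven assumed).
mvn clean package -DskipTests

# The JAR to upload appears under target/, for example:
ls target/onaliyun_mr_wordcount-1.0-SNAPSHOT.jar
```

Upload the JAR produced under target/ in the step above.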
Create a node
See Create a node for instructions.
Develop the node
Reference your JAR package in the node editor and add the command to run the MapReduce job.
Open the CDH MR node. The code editor opens.
In the Resource Management pane on the left, right-click the JAR resource and select Reference Resource. DataWorks inserts a reference statement in the following format:
##@resource_reference{"<jar-filename>"}
Below the reference statement, add the command to run your MapReduce job. Use the following pattern:
<jar-filename> <main-class> <input-path> <output-path>
Example:
##@resource_reference{"onaliyun_mr_wordcount-1.0-SNAPSHOT.jar"}
onaliyun_mr_wordcount-1.0-SNAPSHOT.jar cn.apache.hadoop.onaliyun.examples.EmrWordCount oss://onaliyun-bucket-2/cdh/datas/wordcount02/inputs oss://onaliyun-bucket-2/cdh/datas/wordcount02/outputs
The bucket name and paths in this example are for illustration. Replace them with your actual OSS bucket and paths.
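If you have shell access to a CDH gateway host, you can sanity-check the same JAR and arguments with a plain hadoop jar invocation before wiring them into the node. A hedged sketch reusing the illustrative names and paths from the example (replace them with your own):

```shell
# Smoke test on a CDH gateway host (illustrative bucket and paths).
# The main class and JAR name mirror the node example above.
hadoop jar onaliyun_mr_wordcount-1.0-SNAPSHOT.jar \
  cn.apache.hadoop.onaliyun.examples.EmrWordCount \
  oss://onaliyun-bucket-2/cdh/datas/wordcount02/inputs \
  oss://onaliyun-bucket-2/cdh/datas/wordcount02/outputs
```

If this command succeeds on the cluster, the same arguments should work in the CDH MR node body.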
Run and debug the node
Configure the compute resource and resource group, then run the node to verify the job executes correctly.
In the Run Configuration section, set the following fields:
Compute Resource: Select the CDH cluster you registered in DataWorks.
Resource Group: Select a scheduling resource group that has network connectivity to the data source. See Network connectivity solutions for how to connect a resource group to a data source.
On the toolbar, click Run.
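After the run completes, you can confirm the job produced output by listing the output path from a cluster host. A sketch using the illustrative OSS paths from the earlier example (part-r-00000 is the standard name of the first reducer's output file):

```shell
# List the job output directory (illustrative paths; replace with yours).
hadoop fs -ls oss://onaliyun-bucket-2/cdh/datas/wordcount02/outputs

# Inspect the first reducer's output file.
hadoop fs -cat oss://onaliyun-bucket-2/cdh/datas/wordcount02/outputs/part-r-00000 | head
```

An empty or missing output directory usually means the job failed; check the run log in DataWorks for details.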
What's next
Schedule the node: To run the node on a recurring schedule, configure its Time Property and related scheduling properties in the Scheduling configuration panel on the right. See Node scheduling configuration.
Publish the node: On the toolbar, click the publish icon to publish the node to the production environment. Only published nodes are scheduled for execution. See Publish a node.
Monitor runs: After publishing, monitor scheduled runs in the O&M Center. See Getting started with Operation Center.