Use an ODPS MR node in DataWorks to schedule a MaxCompute MapReduce job and integrate it with other nodes in your data pipeline.
Prerequisites
Before you begin, make sure you have:
- Uploaded, committed, and deployed the required JAR resources. See Create and use MaxCompute resources.
- Created an ODPS MR node. See Create and manage MaxCompute nodes.
You must upload, commit, and deploy the JAR resources before you write code on the ODPS MR node.
MaxCompute MapReduce API versions
MapReduce is a programming framework for distributed computing. It combines your business logic with built-in MapReduce components into a complete distributed program whose tasks run concurrently across a cluster.
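The map, shuffle, and reduce phases described above can be sketched in plain Python. This is only an illustration of the flow, not MaxCompute code; real MaxCompute MapReduce jobs implement Mapper and Reducer classes in Java and are submitted as JAR resources.

```python
# Minimal sketch of the map -> shuffle -> reduce flow, for illustration only.
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: the framework groups intermediate values by key
    # before handing them to the reducers.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values for each key.
    return {key: sum(values) for key, values in groups.items()}

lines = ["a b a", "b c"]
result = reduce_phase(shuffle(map_phase(lines)))
print(result)  # {'a': 2, 'b': 2, 'c': 1}
```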
MaxCompute provides two MapReduce API versions:
| API version | Description |
|---|---|
| MaxCompute MapReduce | The native MapReduce API. Runs fast and does not expose the file system. |
| Extended MaxCompute MapReduce (MR2) | An extension of MaxCompute MapReduce that supports complex job scheduling logic. Uses the same implementation as MaxCompute MapReduce. |
For a full comparison, see Overview.
Limits
For limits that apply to MaxCompute MapReduce tasks, see Limits.
WordCount example
This example counts how many times each string appears in the wc_in table and writes the results to the wc_out table.
Step 1: Upload the JAR resource
Upload, commit, and deploy mapreduce-examples.jar as a MaxCompute resource. See Create and use MaxCompute resources.
For the implementation details inside mapreduce-examples.jar, see WordCount example.
Step 2: Write and run the code
Enter the following code on the ODPS MR node and run it.
```
-- Create the input table.
CREATE TABLE IF NOT EXISTS wc_in (key STRING, value STRING);
-- Create the output table.
CREATE TABLE IF NOT EXISTS wc_out (key STRING, cnt BIGINT);
-- Create a dual table if the workspace does not already have one.
DROP TABLE IF EXISTS dual;
CREATE TABLE dual (id BIGINT);
-- Initialize the dual table with a single row.
INSERT OVERWRITE TABLE dual SELECT COUNT(*) FROM dual;
-- Insert the sample data into the wc_in table.
INSERT OVERWRITE TABLE wc_in SELECT * FROM (
  SELECT 'project', 'val_pro' FROM dual
  UNION ALL
  SELECT 'problem', 'val_pro' FROM dual
  UNION ALL
  SELECT 'package', 'val_a' FROM dual
  UNION ALL
  SELECT 'pad', 'val_a' FROM dual
) b;
-- Reference the uploaded JAR resource. To insert this line, find the JAR
-- resource in the resource list, right-click it, and select Reference Resources.
--@resource_reference{"mapreduce-examples.jar"}
jar -resources mapreduce-examples.jar -classpath ./mapreduce-examples.jar com.aliyun.odps.mapred.open.example.WordCount wc_in wc_out
```
The jar command runs the WordCount class from mapreduce-examples.jar. It reads from wc_in and writes results to wc_out.
| Parameter | Description |
|---|---|
| --@resource_reference{"mapreduce-examples.jar"} | Added automatically when you right-click the resource name and select Reference Resources. |
| -resources mapreduce-examples.jar | The name of the referenced JAR resource. |
| -classpath ./mapreduce-examples.jar | The path of the JAR resource, in the format ./ followed by the resource name. To reference multiple JAR files, separate them with commas: -classpath ./xxxx1.jar,./xxxx2.jar. |
| com.aliyun.odps.mapred.open.example.WordCount | The fully qualified name of the main class to run. |
| wc_in | The input table. |
| wc_out | The output table. |
When the job completes successfully, OK is returned.
Step 3: Verify the results
Query the output table using an ODPS SQL node to confirm the results:
```sql
SELECT * FROM wc_out;
```
Expected output:
```
+------------+------------+
| key        | cnt        |
+------------+------------+
| package    | 1          |
| pad        | 1          |
| problem    | 1          |
| project    | 1          |
| val_a      | 2          |
| val_pro    | 2          |
+------------+------------+
```
Each unique string from wc_in appears as a row in wc_out, with cnt showing how many times it appeared.
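Note that the job counts the values in both columns of wc_in, which is why val_pro and val_a each get a cnt of 2. A quick sanity check of the expected output, assuming WordCount treats every column value as a separate word (this check is illustrative Python, not part of the job):

```python
# Verify the expected counts by flattening both columns of the sample data.
from collections import Counter

wc_in = [("project", "val_pro"), ("problem", "val_pro"),
         ("package", "val_a"), ("pad", "val_a")]

# Count every value across both columns, as the WordCount job does.
counts = Counter(word for row in wc_in for word in row)
for key in sorted(counts):
    print(key, counts[key])
# package 1 / pad 1 / problem 1 / project 1 / val_a 2 / val_pro 2
```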
More examples
For MapReduce use cases beyond WordCount, see the following topics:
What's next
After developing the node, complete the following steps before going to production:
- Configure scheduling properties: Set rerun settings and scheduling dependencies so the system runs the task on a schedule. See Overview.
- Debug the node: Test the code logic to confirm it produces the expected results. See Debugging procedure.
- Deploy the node: After all development is complete, deploy the node. The system then schedules it automatically based on the configured properties. See Deploy nodes.
For common issues, see FAQ about MaxCompute MapReduce.