
DataWorks:Develop a MaxCompute MapReduce task

Last Updated: Mar 27, 2026

Use an ODPS MR node in DataWorks to schedule a MaxCompute MapReduce job and integrate it with other nodes in your data pipeline.

Prerequisites

Before you begin, make sure you have:

Important

Upload, commit, and deploy JAR resources before you write code on the ODPS MR node.

MaxCompute MapReduce API versions

MapReduce is a programming framework for distributed computing. It combines your business logic with the framework's built-in components to produce a complete distributed program whose map and reduce tasks run concurrently across a distributed cluster.
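MaxCompute executes these phases on its own distributed runtime, but the flow can be sketched locally in plain Python. This is an illustrative simulation of the map, shuffle, and reduce phases, not the MaxCompute API:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every string field in each input record.
    for record in records:
        for word in record:
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group the emitted values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values for each key (here, a sum).
    return {key: sum(values) for key, values in groups.items()}

# Two sample records with the same shape as rows of a two-column table.
records = [("project", "val_pro"), ("problem", "val_pro")]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts)  # {'project': 1, 'val_pro': 2, 'problem': 1}
```

In the real framework, the shuffle also partitions keys across machines so that reducers run in parallel; the single-process version above keeps only the logical structure.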

MaxCompute provides two MapReduce API versions:

  • MaxCompute MapReduce: The native MapReduce API. Runs fast and does not expose the file system.

  • Extended MaxCompute MapReduce (MR2): An extension of MaxCompute MapReduce that supports complex job scheduling logic. Uses the same implementation as MaxCompute MapReduce.

For a full comparison, see Overview.

Limits

For limits that apply to MaxCompute MapReduce tasks, see Limits.

WordCount example

This example counts the number of times each string appears in the wc_in table and writes the results to the wc_out table.

Step 1: Upload the JAR resource

Upload, commit, and deploy mapreduce-examples.jar as a MaxCompute resource. See Create and use MaxCompute resources.

Note

For the implementation details inside mapreduce-examples.jar, see WordCount example.

Step 2: Write and run the code

Enter the following code on the ODPS MR node and run it.

-- Create the input table.
CREATE TABLE IF NOT EXISTS wc_in (key STRING, value STRING);
-- Create the output table.
CREATE TABLE IF NOT EXISTS wc_out (key STRING, cnt BIGINT);
-- Create a dual table if the current workspace does not have one.
DROP TABLE IF EXISTS dual;
CREATE TABLE dual (id BIGINT);
-- Initialize the dual table.
INSERT OVERWRITE TABLE dual SELECT COUNT(*) FROM dual;
-- Insert the sample data into the wc_in table.
INSERT OVERWRITE TABLE wc_in SELECT * FROM (
    SELECT 'project', 'val_pro' FROM dual
    UNION ALL
    SELECT 'problem', 'val_pro' FROM dual
    UNION ALL
    SELECT 'package', 'val_a' FROM dual
    UNION ALL
    SELECT 'pad', 'val_a' FROM dual
) b;
-- Reference the uploaded JAR resource. To add this reference, find the JAR resource in the resource list, right-click it, and select Reference Resources.
--@resource_reference{"mapreduce-examples.jar"}
jar -resources mapreduce-examples.jar -classpath ./mapreduce-examples.jar com.aliyun.odps.mapred.open.example.WordCount wc_in wc_out

The jar command runs the WordCount class from mapreduce-examples.jar. It reads from wc_in and writes results to wc_out.

  • --@resource_reference{"mapreduce-examples.jar"}: Added automatically when you right-click the resource name and select Reference Resources.

  • -resources mapreduce-examples.jar: The name of the referenced JAR resource.

  • -classpath ./mapreduce-examples.jar: The path to the JAR resource. Format: ./ followed by the resource name. To reference multiple JAR resources, separate the paths with commas: -classpath ./xxxx1.jar,./xxxx2.jar.

  • com.aliyun.odps.mapred.open.example.WordCount: The fully qualified name of the main class to run.

  • wc_in: The input table.

  • wc_out: The output table.

When the job completes successfully, the node returns OK.

Step 3: Verify the results

Query the output table using an ODPS SQL node to confirm the results:

select * from wc_out;

Expected output:

+------------+------------+
| key        | cnt        |
+------------+------------+
| package    | 1          |
| pad        | 1          |
| problem    | 1          |
| project    | 1          |
| val_a      | 2          |
| val_pro    | 2          |
+------------+------------+

Each unique string from wc_in appears as a row in wc_out, with cnt showing how many times it appeared.
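The counting logic can be checked locally with a small Python simulation of the WordCount job. This is an illustrative sketch over an in-memory copy of the sample rows, not the MaxCompute SDK:

```python
from collections import Counter

# In-memory copy of the four rows inserted into wc_in in Step 2.
wc_in = [
    ("project", "val_pro"),
    ("problem", "val_pro"),
    ("package", "val_a"),
    ("pad", "val_a"),
]

# Count every string across both columns, as the WordCount job does.
wc_out = Counter(word for row in wc_in for word in row)

# Print rows in key order, matching the expected output above.
for key in sorted(wc_out):
    print(key, wc_out[key])
# package 1
# pad 1
# problem 1
# project 1
# val_a 2
# val_pro 2
```

The counts match the expected wc_out rows: each of the four distinct keys appears once, and each of the two values appears twice.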

More examples

For MapReduce use cases beyond WordCount, see the following topics:

What's next

After developing the node, complete the following steps before going to production:

  • Configure scheduling properties: Set rerun settings and scheduling dependencies so the system runs the task on a schedule. See Overview.

  • Debug the node: Test the code logic to confirm it produces the expected results. See Debugging procedure.

  • Deploy the node: After all development is complete, deploy the node. The system then schedules it automatically based on the configured properties. See Deploy nodes.

For common issues, see FAQ about MaxCompute MapReduce.