DataWorks:MaxCompute MR node

Last Updated: Apr 23, 2026

MaxCompute provides MapReduce programming interfaces. You can create and schedule a MaxCompute MR node, and then use the MapReduce Java API to write MapReduce programs that process large-scale datasets in MaxCompute.

Background

MapReduce is a distributed computing framework. It combines user-written business logic with built-in components into a complete distributed program that runs concurrently on a Hadoop cluster. For more information, see MapReduce. MaxCompute provides two versions of the MapReduce programming interface:

  • MaxCompute MapReduce: The native API for MaxCompute. It enables fast execution and rapid development without exposing the file system.

  • Extended MaxCompute MapReduce (MR2): An extension of MaxCompute MapReduce that supports more complex job scheduling logic. It uses the same implementation as the native MaxCompute API.

In DataWorks, you can use a MaxCompute MR node to schedule and run MaxCompute MapReduce tasks and integrate them with other jobs.

Prerequisites

Note

You must upload and deploy the required resources before you create a MaxCompute MR node.

Limitations

For information about the limitations on MaxCompute MR nodes, see Limitations.

Procedure

  1. On the MaxCompute MR node editor page, follow the development steps below.

    Develop the MR code

    The following example shows how to use a MaxCompute MR node to count the occurrences of each string in the wc_in table and write the results to the wc_out table.

    1. Upload, submit, and deploy the mapreduce-examples.jar resource. For more information, see resource management.

      Note

      For information about the implementation logic in the mapreduce-examples.jar package, see WordCount example.

    2. In the MaxCompute MR node editor, enter the following sample code.

      -- Create the input table.
      CREATE TABLE IF NOT EXISTS wc_in (key STRING, value STRING);
      -- Create the output table.
      CREATE TABLE IF NOT EXISTS wc_out (key STRING, cnt BIGINT);
      -- Create the dual pseudo table. If the workspace does not have this table, you must create it and initialize its data.
      DROP TABLE IF EXISTS dual;
      CREATE TABLE dual (id BIGINT);
      -- Initialize data for the pseudo table.
      INSERT OVERWRITE TABLE dual SELECT count(*) FROM dual;
      -- Insert sample data into the input table wc_in.
      INSERT OVERWRITE TABLE wc_in SELECT * FROM (
          SELECT 'project', 'val_pro' FROM dual
          UNION ALL
          SELECT 'problem', 'val_pro' FROM dual
          UNION ALL
          SELECT 'package', 'val_a' FROM dual
          UNION ALL
          SELECT 'pad', 'val_a' FROM dual
      ) b;
      -- Reference the JAR resource that you uploaded. To reference it, right-click the resource in the resource management section.
      --@resource_reference{"mapreduce-examples.jar"}
      jar -resources mapreduce-examples.jar -classpath ./mapreduce-examples.jar com.aliyun.odps.mapred.open.example.WordCount wc_in wc_out

      Note

      The code is described as follows:

      • --@resource_reference: You can right-click a resource name in the resource management section and select Insert Resource Path to automatically generate this statement.

      • -resources: The name of the referenced JAR resource file.

      • -classpath: The path of the JAR package. Because the resource is already referenced, specify the path as ./ followed by the JAR file name.

      • com.aliyun.odps.mapred.open.example.WordCount: The main class called during execution. This value must be the same as the main class name in the JAR package.

      • wc_in: The name of the MR input table, which has been created in the preceding code.

      • wc_out: The name of the MR output table, which has been created in the preceding code.

      • When an MR task calls multiple JAR resources, the classpath is specified in the following format: -classpath ./xxxx1.jar,./xxxx2.jar. The paths are separated by commas (,).
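
As a rough illustration of how the parts of the jar statement fit together, the following Java sketch assembles the statement from a list of JAR resource names. The class and method names (JarCommandSketch, buildJarCommand) are hypothetical and are not part of DataWorks or MaxCompute. The sketch also assumes that -resources accepts the same comma-separated list of names as -classpath; the description above states the comma format only for -classpath.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not part of DataWorks or MaxCompute.
class JarCommandSketch {

    // Assembles a jar statement for a MaxCompute MR node.
    // Assumption: -resources takes a comma-separated list of resource names,
    // mirroring the comma-separated -classpath format described in this topic.
    static String buildJarCommand(List<String> jars, String mainClass,
                                  String inputTable, String outputTable) {
        // Resource names are listed as-is.
        String resources = String.join(",", jars);
        // Each classpath entry is the JAR file name prefixed with "./".
        String classpath = jars.stream()
                .map(jar -> "./" + jar)
                .collect(Collectors.joining(","));
        return "jar -resources " + resources
                + " -classpath " + classpath
                + " " + mainClass
                + " " + inputTable
                + " " + outputTable;
    }

    public static void main(String[] args) {
        // Prints the statement for the single resource used in this topic.
        System.out.println(buildJarCommand(
                List.of("mapreduce-examples.jar"),
                "com.aliyun.odps.mapred.open.example.WordCount",
                "wc_in",
                "wc_out"));
    }
}
```

For the single resource used in this topic, the helper returns the same statement as the sample code. With two placeholder resources such as xxxx1.jar and xxxx2.jar, the classpath part becomes -classpath ./xxxx1.jar,./xxxx2.jar.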

    Run the MR task

    1. In Run Configuration, configure Compute Resource, Compute Quota, and Resource Group.

      Note

      To access a data source over the Internet or in a VPC, you must use a scheduling resource group that has passed the connectivity test with the data source. For more information, see Network connectivity.

    2. In the parameter dialog on the toolbar, select the MaxCompute data source that you have created, and click Run to run the MR task.

    (Optional) Query the results

    Query the data in the output table wc_out by using a MaxCompute SQL node.

    SELECT * FROM wc_out;

    Result:

    +------------+------------+
    | key        | cnt        |
    +------------+------------+
    | package    | 1          |
    | pad        | 1          |
    | problem    | 1          |
    | project    | 1          |
    | val_a      | 2          |
    | val_pro    | 2          |
    +------------+------------+
  2. To periodically run the node task, configure the schedule settings based on your business requirements. For more information, see Schedule settings.

  3. After the node task is configured, you must deploy the node. For more information, see Deploy a node.

  4. After the task is deployed, you can view the running status of the scheduled task in Operation Center. For more information, see Scheduled tasks.
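
The result table in step 1 contains val_a and val_pro in addition to the four keys. This is consistent with a word count that treats every column value of every input row as a word. The following Java sketch reproduces that behavior locally on the four sample rows of wc_in. It is a plain-Java stand-in, not the code in mapreduce-examples.jar, and it does not use the MaxCompute MapReduce API. The class and method names (WordCountSketch, sampleRows, countWords) are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Local stand-in for illustration only; it does not use the MaxCompute MapReduce API.
class WordCountSketch {

    // The four sample rows that the sample code inserts into wc_in (key, value).
    static List<String[]> sampleRows() {
        return List.of(
                new String[] {"project", "val_pro"},
                new String[] {"problem", "val_pro"},
                new String[] {"package", "val_a"},
                new String[] {"pad", "val_a"});
    }

    // Map step stand-in: every column value of every row is emitted as a word with count 1.
    // Reduce step stand-in: the counts are summed per word.
    // Assumption: all columns are counted, which matches the result table in this topic.
    static Map<String, Long> countWords(List<String[]> rows) {
        Map<String, Long> counts = new TreeMap<>(); // TreeMap keeps the words sorted
        for (String[] row : rows) {
            for (String word : row) {
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints one "word<TAB>count" line per distinct word.
        countWords(sampleRows())
                .forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

Running main prints six lines sorted by word, with a count of 1 for each of the four keys and 2 for val_a and val_pro, which matches the query result shown for wc_out.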

References

For more information about developing MaxCompute MR tasks, see the following topic:

  • FAQ: Learn about common issues that may occur when MR tasks run so that you can troubleshoot them quickly.