OPEN MR

Last Updated: Nov 12, 2017

OPEN MR lets you use the MapReduce (MR) functions of MaxCompute more securely and easily to implement more complex computation logic.

This document describes how to develop an OPEN MR job. You only need to focus on the Mapper/Reducer logic; job submission is handled centrally by the platform. Variables for daily scheduling can be specified as parameters in the configuration when you create an OPEN MR node.

The ODPS_MR task type is available and recommended now.

Note:

OPEN_MR does not support references to resource tables or multiple Reduce operations.

Use cases and data description

This example describes how to use MaxCompute MapReduce on the Alibaba Cloud big data platform by using the classic WordCount sample. For more information about the WordCount sample, see WordCount sample.

Data tables involved in this example are described as follows:

  • Input data table: wc_in is used to store the word list.

  • Output data table: wc_out is used to store the result set processed by the MR program.

Prepare data tables

Create a data table

Create wc_in and wc_out tables as instructed in Table creation.

  CREATE TABLE wc_in (key STRING, value STRING) PARTITIONED BY (pt STRING);
  CREATE TABLE wc_out (key STRING, cnt BIGINT) PARTITIONED BY (pt STRING);

Insert sample data

To verify the result of the OPEN MR program on the big data platform, insert sample data into the input table (partition pt=20170101 of wc_in).

Procedure

  1. Navigate to New > New Script on the Data Development page.

  2. Complete the configurations in the New Script pop-up window, and then click Submit.

  3. In the MaxCompute code editor, write and run the MaxCompute SQL statements. For more information about SQL syntax, see SQL overview.

    The MaxCompute SQL script is as follows:

    -- Create the pseudo table dual.
    drop table if exists dual;
    create table dual(id bigint); -- If the project does not have the pseudo table, create it and initialize its data.
    -- Initialize data in the pseudo table.
    insert overwrite table dual select count(*) from dual;
    -- Insert sample data into partition pt=20170101 of the input table wc_in.
    insert overwrite table wc_in partition(pt=20170101) select * from (
    select 'project','val_pro' from dual
    union all
    select 'problem','val_pro' from dual
    union all
    select 'package','val_a' from dual
    union all
    select 'pad','val_a' from dual
    ) b;

  4. Write query statements to view the inserted sample data.
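
The check in the last step can be sketched as follows. This is a minimal example that assumes the wc_in table and the pt=20170101 partition created earlier in this section; the partition filter is included because the sample data exists only in that partition.

```sql
-- Confirm that the partition was created by the insert statement.
show partitions wc_in;

-- View the rows written to partition pt=20170101 of wc_in.
select key, value from wc_in where pt = '20170101';
```

If the insert succeeded, the query should return the four rows written by the script: project, problem, package, and pad, each with its value.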

Write MapReduce program

Before using the OPEN_MR node, write the MapReduce program locally based on the WordCount sample code of the MaxCompute MapReduce programming framework, build it into a JAR package, and add the package as a resource to the big data platform. For more information about MR development, see MaxCompute documentation. For the sample code, see the attachment WordCount.java.

Add a resource

The JAR package is run in both the MaxCompute console and the Alibaba Cloud big data platform. Therefore, generate the WordCount.jar package by using the Export function of Eclipse or another tool such as Ant, and then upload the package as a MaxCompute resource.

Procedure

  1. In the Resource Management module of the Data Development page, right-click the target directory and select Upload Resources.

  2. Complete the configurations in the Upload Resources pop-up window and select the Upload as ODPS resource checkbox.

  3. Click Submit.

Create an OPEN_MR node

After the new MaxCompute MapReduce program is uploaded as a resource to MaxCompute, an OPEN_MR node must be created to run the program.

Procedure

  1. Navigate to New > New Task on the Data Development page.

  2. Complete the following configurations in the New Task pop-up window.

    • Task name: WordCount sample.

    • Description: WordCount sample.

  3. Click Create.

  4. Complete the following configurations on the OPEN_MR configuration page.

    • MRJar package: Mandatory. The primary JAR resource package to run in this node.
    • Resource: Mandatory. The primary JAR resource to run, plus any other resources that this node calls.
    • Input/output tables: This sample uses partitioned tables in this project, and the partition value is the business date of each automatically scheduled daily run. The partition is therefore represented by a variable (a system scheduling parameter).
    • Parameter configuration: In this sample, the partition is represented by a system parameter rather than a custom variable, so no additional configuration is needed.

    Note:

    For more information about the usage of parameter variables, see System scheduling parameters.

  5. Click Save and Submit to switch to the workflow panel, and then click Test Run.

    Note:

    Because only partition pt=20170101 of the sample table contains data, select 2017-01-01 as the business date for the test run so that the system parameter resolves the partitions of the input/output tables to 20170101.

After the test task is generated, wait until the task runs successfully.

View results

Procedure

  1. Open the script that you used to insert sample data into wc_in.

  2. Write a MaxCompute SQL query statement against the output table wc_out.

  3. Click Run.

Check whether the test result is as expected.
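
A query along the following lines could be used for this check. It is a sketch that assumes the test run used 2017-01-01 as the business date, so that the node wrote its output to partition pt=20170101 of wc_out; the limit clause is added because MaxCompute SQL by default expects one together with order by.

```sql
-- View the word counts written by the OPEN_MR node to partition pt=20170101 of wc_out.
select key, cnt from wc_out where pt = '20170101' order by key limit 100;
```

Each returned row should pair a word with the number of times it was counted. Which input columns are counted depends on the logic in WordCount.java.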
