OPEN MR lets you use the MapReduce (MR) functions of MaxCompute more securely and easily to implement more complex calculation logic.
This document describes how to use OPEN MR to develop complex MR models easily. You only need to focus on the Mapper/Reducer logic. The job submission logic is handled centrally in the console. Variables for daily scheduling can be specified as parameters in the configuration when you create an OPEN MR node.
The ODPS_MR task type is now available and is recommended.
OPEN_MR does not support references to resource tables or multiple Reduce operations.
This example describes how to use MaxCompute MapReduce on the Alibaba Cloud big data platform by using the classic WordCount sample. For more information, see WordCount sample.
Data tables involved in this example are described as follows:
Input data table: wc_in is used to store the word list.
Output data table: wc_out is used to store the result set processed by the MR program.
Create wc_in and wc_out tables as instructed in Table creation.
CREATE TABLE wc_in (key STRING, value STRING) partitioned by (pt string );
CREATE TABLE wc_out (key STRING, cnt BIGINT) partitioned by (pt string );
To verify the result of running the OPEN MR program on the big data platform, insert sample data into the input table (partition pt=20170101 of wc_in).
Navigate to New > Create Script on the Data Development page.
Complete the configurations in the Create Script dialog box, and click Submit.
In the MaxCompute code editor, write and run the MaxCompute SQL statements. For more information about SQL syntax, see SQL overview.
The MaxCompute SQL script is as follows.
--- Create the system pseudo table dual
drop table if exists dual;
create table dual(id bigint); -- If the project does not have the pseudo table, create the table and initialize data.
--- Initialize data in the system pseudo table (writes one row)
insert overwrite table dual select count(*) from dual;
--- Insert sample data into partition pt=20170101 of the input table wc_in
insert overwrite table wc_in partition(pt=20170101) select * from (
select 'project','val_pro' from dual
union all
select 'problem','val_pro' from dual
union all
select 'package','val_a' from dual
union all
select 'pad','val_a' from dual
) b;
Write a query statement to view the inserted sample data.
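A minimal sketch of such a query, assuming the sample rows were written to partition pt=20170101 of wc_in. The partition value is quoted here because pt is declared as a string column, and the column names key and value come from the wc_in table definition.

```sql
-- View the sample rows in the input partition.
-- Four rows are expected if the insert statement ran successfully.
select key, value from wc_in where pt = '20170101';
```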
Before using the OPEN_MR node, write the MapReduce program locally as needed, based on the WordCount sample code of the MaxCompute MapReduce programming framework. Build it into a JAR package, and add it as a resource to the big data platform. For more information about MR development, see MaxCompute documentation. For more information about the sample code, see the attachment WordCount.java.
You must run the JAR command in both the MaxCompute console and the Alibaba Cloud big data platform. Generate the WordCount.jar package by using the Export function of Eclipse or another tool such as Ant, and then upload the package as a MaxCompute resource.
Click Upload in the Resource Management module of the Data Development page.
Complete the configurations in the Upload Resource dialog box and select the Upload as ODPS resource checkbox.
After the new MaxCompute MapReduce program is uploaded as a resource to MaxCompute, an OPEN_MR node must be created to run the program.
Navigate to New > Create Task on the Data Development page.
Complete the following configurations in the Create Task dialog box.
Task name: WordCount sample.
Description: WordCount sample.
Complete the following configurations on the OPEN_MR configuration page.
MRJar package: This is a required field and indicates the primary JAR resource package that must be run in this node.
Resource: This is a required field and indicates the primary JAR resource that must be run and the list of other resources that must be called in this node.
Input/output tables: In this sample, partitioned tables of this project are used, and the partition value is the business date that is scheduled automatically every day. Therefore, the partition is represented by a variable (system scheduling parameter).
Parameter configuration: In this sample, the partition is represented by a system parameter instead of a custom variable, so no additional configuration is needed.
For more information about the usage of parameter variables, see System scheduling parameters.
Click Save and Submit to switch to the workflow panel, and then click Test Run.
Because only the partition pt=20170101 of the sample table has data during the test run, select 2017/01/01 as the business date so that the system parameter replaces the partitions in the input/output tables with 20170101.
After the task is generated, wait until it runs successfully.
Open the script used to insert sample data into wc_in.
Enter the MaxCompute SQL query statement.
Check whether the test result is as expected.
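The check can be sketched as the following query, assuming the test run wrote its output to partition pt=20170101 of wc_out. The expected counts depend on how the Mapper in the uploaded JAR splits the input into words, so compare the returned rows against your own program's logic.

```sql
-- View the word counts written by the OPEN_MR node for the test business date.
select key, cnt from wc_out where pt = '20170101';
```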