This document helps users use the MaxCompute MR feature more safely and conveniently to implement more complicated computing logic. It focuses on OPEN MR development methods so that users can build complicated MR models more easily.
OPEN MR users only need to write the Mapper/Reducer logic; the platform handles job submission uniformly.
Variables used in daily scheduling can be specified as parameters in the configuration when you create the OPEN MR node.
This demo describes how to use MaxCompute MapReduce on Alibaba Cloud Data IDE with a classic WordCount example.
Data tables involved in this document are described as follows:
1) Input data table: “wc_in” is used to store the word list.
2) Output data table: “wc_out” is used to store the result set processed by MapReduce.
Refer to the Quick Start - New Table chapter to create the wc_in and wc_out tables:
CREATE TABLE wc_in (key STRING, value STRING) partitioned by (pt string );
CREATE TABLE wc_out (key STRING, cnt BIGINT) partitioned by (pt string );
To check the results of running OPEN MR on Alibaba Cloud Data IDE, you need to insert sample data into the input table (partition pt=20170101 of the “wc_in” data table). The steps are as follows:
Navigate to Data IDE, click New > New Script File.
Fill in the configuration items in the New Script File pop-up box, and then click “Submit”.
Write the MaxCompute SQL statements in the MaxCompute code editor and run them.
The MaxCompute SQL script is provided as follows:
---Create the system pseudo table dual.
drop table if exists dual;
create table dual(id bigint); --If the project does not have this pseudo table, create it and initialize its data.
---Initialize data in the pseudo table.
insert overwrite table dual select count(*) from dual;
---Insert sample data into partition pt=20170101 of the “wc_in” input table.
insert overwrite table wc_in partition(pt=20170101) select * from (
select 'project','val_pro' from dual
union all
select 'problem','val_pro' from dual
union all
select 'package','val_a' from dual
union all
select 'pad','val_a' from dual
) b;
You can preview the inserted sample data, as shown in the figure below:
Before using an OPEN MR node, you should write the WordCount demo code based on the MaxCompute MapReduce framework in your local environment, package the code into a JAR file, and add the JAR as a resource to Alibaba Cloud Data IDE.
For MR development, refer to the Help section on the MaxCompute official website. Link: MaxCompute Help Documents.
For details of the code in this example, see the WordCount.java appendix.
You need to run the MaxCompute MR node in the MaxCompute console or Alibaba Cloud Data IDE by using the “jar” command. First, generate a WordCount.jar package (using the Export function of Eclipse, Ant, or another tool), and then upload the package as a MaxCompute resource.
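Because the appendix is not reproduced here, the following is only a minimal plain-Java sketch of the kind of Mapper/Reducer logic a WordCount job implements. It deliberately uses no MaxCompute SDK classes, and the class and method names (WordCountSketch, map, reduce, wordCount) are hypothetical. It also assumes that the mapper counts the whitespace-separated tokens of every column in a record; the actual WordCount.java may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of WordCount; not the MaxCompute MapReduce API.
class WordCountSketch {

    // "Map" step: emit every whitespace-separated token of every column in one record.
    // Assumption: all columns are counted, not only the first one.
    static List<String> map(String[] record) {
        List<String> words = new ArrayList<>();
        for (String column : record) {
            if (column == null) {
                continue; // skip NULL column values
            }
            for (String word : column.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words;
    }

    // "Reduce" step: sum the occurrences of each emitted word.
    // A TreeMap keeps the words sorted, which makes the output easy to read.
    static Map<String, Long> reduce(List<String> words) {
        Map<String, Long> counts = new TreeMap<>();
        for (String word : words) {
            counts.merge(word, 1L, Long::sum);
        }
        return counts;
    }

    // Apply the map step to every record, then reduce all emitted words.
    static Map<String, Long> wordCount(List<String[]> records) {
        List<String> emitted = new ArrayList<>();
        for (String[] record : records) {
            emitted.addAll(map(record));
        }
        return reduce(emitted);
    }

    public static void main(String[] args) {
        // The four sample rows inserted into wc_in (pt=20170101).
        List<String[]> wcIn = Arrays.asList(
            new String[] {"project", "val_pro"},
            new String[] {"problem", "val_pro"},
            new String[] {"package", "val_a"},
            new String[] {"pad", "val_a"});
        wordCount(wcIn).forEach((word, cnt) -> System.out.println(word + "\t" + cnt));
    }
}
```

Under these assumptions, each word in the first column of the sample data is counted once, while val_pro and val_a are counted twice each. If the real job only counts one column, the expected contents of wc_out change accordingly.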
The specific steps are as follows:
Select “Manage Resources” in the left catalog, then right-click the catalog and choose Upload Resources.
Fill in the configuration items in the Resource Upload pop-up box. Note that you must select the ‘Upload as ODPS Resource’ check box.
After the newly created MaxCompute MapReduce program is uploaded as a resource to MaxCompute, you need to create a MaxCompute MR node to run the program. The specific steps are as follows:
Click New > New Job in the tool bar in the work zone.
Fill in the various configuration items in the New Job pop-up box.
Configuration items in the “New Job” pop-up box are described as follows:
■ Job Name: wordcount demo.
■ Description: wordcount demo.
Click “Create”.
Drag an OPEN MR node to the development canvas.
Double-click the node, or right-click it and view the node content, to open the OPEN MR configuration page.
- MRJar package: required. The primary JAR resource package to be run by this node.
- Resources: required. The primary JAR resource plus any other resources that this node calls when it runs.
- Input/Output Tables: This demo uses partitioned tables in this project, and the partition value is the business date of each daily scheduled run. Therefore, the partition is represented by a variable (a system scheduling parameter).
Parameter Configuration: The partition in this demo is represented by a system variable and there are no user-defined variables, so no additional configuration is required here.
Note: For more usage of parameter variables, refer to System Scheduling Parameters.
Click Save and then Submit, and switch to the process panel of the workflow. Click “Test Run”. Note: Only partition pt=20170101 of the example table contains data, so select 2017-01-01 as the business time for the test run. The system parameter then replaces the partition in the input/output tables with 20170101.
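The mapping from business time to partition value in the note can be illustrated with a short sketch. It assumes that the partition value is simply the business date formatted as yyyyMMdd, which matches the 2017-01-01 and 20170101 pair in this demo. The class and method names are hypothetical, and the real substitution is performed by the scheduling system, not by user code.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Illustration only: shows how a business date maps to the pt partition value.
class BizDatePartition {

    // Assumption: the partition value is the business date in yyyyMMdd form.
    private static final DateTimeFormatter PT_FORMAT = DateTimeFormatter.ofPattern("yyyyMMdd");

    // Build the partition spec (for example "pt=20170101") for a business date.
    static String partitionSpec(LocalDate businessDate) {
        return "pt=" + businessDate.format(PT_FORMAT);
    }

    public static void main(String[] args) {
        // Business time selected for the test run in this demo.
        LocalDate businessDate = LocalDate.parse("2017-01-01");
        System.out.println("wc_in  partition: " + partitionSpec(businessDate));
        System.out.println("wc_out partition: " + partitionSpec(businessDate));
    }
}
```

Selecting any other business time would point the node at a partition of wc_in that contains no sample data, which is why the test run uses 2017-01-01.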
After the test job is generated, wait for the run to succeed.
To display the test results, take the following steps:
Open the script file that was used to insert the sample data into “wc_in”.
Write the MaxCompute SQL query code.
Check whether the test results are consistent with expectations.