Use an ODPS MR node in DataWorks to schedule a MaxCompute MapReduce job and integrate it with other nodes in your data pipeline.
Prerequisites
Before you begin, make sure you have:
- Uploaded, committed, and deployed the required JAR resources. See Create and use MaxCompute resources.
- Created an ODPS MR node. See Create and manage MaxCompute nodes.
You must upload, commit, and deploy the JAR resources before you write code on the ODPS MR node.
MaxCompute MapReduce API versions
MapReduce is a programming framework for distributed computing. It combines your business logic with built-in MapReduce components into a complete distributed program whose tasks run concurrently across a cluster.
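The map, shuffle, and reduce phases described above can be sketched in plain Python. This is only an illustration of the flow, not MaxCompute code; real MaxCompute MapReduce jobs implement Mapper and Reducer classes in Java and are submitted as JAR resources.

```python
# Minimal sketch of the map -> shuffle -> reduce flow, for illustration only.
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: the framework groups intermediate values by key
    # before handing them to the reducers.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values for each key.
    return {key: sum(values) for key, values in groups.items()}

lines = ["a b a", "b c"]
result = reduce_phase(shuffle(map_phase(lines)))
print(result)  # {'a': 2, 'b': 2, 'c': 1}
```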
MaxCompute provides two MapReduce API versions:
| API version | Description |
|---|---|
| MaxCompute MapReduce | The native MapReduce API. Runs fast and does not expose the file system. |
| Extended MaxCompute MapReduce (MR2) | An extension of MaxCompute MapReduce that supports complex job scheduling logic. Uses the same implementation as MaxCompute MapReduce. |
For a full comparison, see Overview.
Limits
For limits that apply to MaxCompute MapReduce tasks, see Limits.
WordCount example
This example counts how many times each string appears in the wc_in table and writes the results to the wc_out table.
Step 1: Upload the JAR resource
Upload, commit, and deploy mapreduce-examples.jar as a MaxCompute resource. See Create and use MaxCompute resources.
For the implementation details inside mapreduce-examples.jar, see WordCount example.
Step 2: Write and run the code
Enter the following code on the ODPS MR node and run it.
```
-- Create the input table.
CREATE TABLE IF NOT EXISTS wc_in (key STRING, value STRING);
-- Create the output table.
CREATE TABLE IF NOT EXISTS wc_out (key STRING, cnt BIGINT);
-- Create a dual table if the workspace does not already have one.
DROP TABLE IF EXISTS dual;
CREATE TABLE dual (id BIGINT);
-- Initialize the dual table with a single row.
INSERT OVERWRITE TABLE dual SELECT COUNT(*) FROM dual;
-- Insert the sample data into the wc_in table.
INSERT OVERWRITE TABLE wc_in SELECT * FROM (
  SELECT 'project', 'val_pro' FROM dual
  UNION ALL
  SELECT 'problem', 'val_pro' FROM dual
  UNION ALL
  SELECT 'package', 'val_a' FROM dual
  UNION ALL
  SELECT 'pad', 'val_a' FROM dual
) b;
-- Reference the uploaded JAR resource. To insert this line, find the JAR
-- resource in the resource list, right-click it, and select Reference Resources.
--@resource_reference{"mapreduce-examples.jar"}
jar -resources mapreduce-examples.jar -classpath ./mapreduce-examples.jar com.aliyun.odps.mapred.open.example.WordCount wc_in wc_out
```
The jar command runs the WordCount class from mapreduce-examples.jar. It reads from wc_in and writes results to wc_out.
| Parameter | Description |
|---|---|
| --@resource_reference{"mapreduce-examples.jar"} | Added automatically when you right-click the resource name and select Reference Resources. |
| -resources mapreduce-examples.jar | The name of the referenced JAR resource. |
| -classpath ./mapreduce-examples.jar | The path of the JAR resource, in the format ./ followed by the resource name. To reference multiple JAR files, separate them with commas: -classpath ./xxxx1.jar,./xxxx2.jar. |
| com.aliyun.odps.mapred.open.example.WordCount | The fully qualified name of the main class to run. |
| wc_in | The input table. |
| wc_out | The output table. |
When the job completes successfully, OK is returned.
Step 3: Verify the results
Query the output table using an ODPS SQL node to confirm the results:
```sql
SELECT * FROM wc_out;
```
Expected output:
```
+------------+------------+
| key        | cnt        |
+------------+------------+
| package    | 1          |
| pad        | 1          |
| problem    | 1          |
| project    | 1          |
| val_a      | 2          |
| val_pro    | 2          |
+------------+------------+
```
Each unique string from wc_in appears as a row in wc_out, with cnt showing how many times it appeared.
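Note that the job counts the values in both columns of wc_in, which is why val_pro and val_a each get a cnt of 2. A quick sanity check of the expected output, assuming WordCount treats every column value as a separate word (this check is illustrative Python, not part of the job):

```python
# Verify the expected counts by flattening both columns of the sample data.
from collections import Counter

wc_in = [("project", "val_pro"), ("problem", "val_pro"),
         ("package", "val_a"), ("pad", "val_a")]

# Count every value across both columns, as the WordCount job does.
counts = Counter(word for row in wc_in for word in row)
for key in sorted(counts):
    print(key, counts[key])
# package 1 / pad 1 / problem 1 / project 1 / val_a 2 / val_pro 2
```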
More examples
For MapReduce use cases beyond WordCount, see the following topics:
What's next
After developing the node, complete the following steps before going to production:
- Configure scheduling properties: Set rerun settings and scheduling dependencies so the system runs the task on a schedule. See Overview.
- Debug the node: Test the code logic to confirm it produces the expected results. See Debugging procedure.
- Deploy the node: After all development is complete, deploy the node. The system then schedules it automatically based on the configured properties. See Deploy nodes.
For common issues, see FAQ about MaxCompute MapReduce.