An ODPS MR node runs a MaxCompute MapReduce program inside a DataWorks workflow. Use it to schedule MapReduce jobs written against the MaxCompute MapReduce Java APIs and integrate them with other nodes in your pipeline.
Prerequisites
Before you begin, ensure that you have:
Created an ODPS MR node. See Create and manage MaxCompute nodes.
Uploaded, committed, and deployed all required resources. See Create and use MaxCompute resources. The node cannot reference resources that are not yet deployed, so deploy them before you write code in the node.
MapReduce API versions
MaxCompute provides two MapReduce API versions. For a full comparison, see Overview.
MapReduce is a programming framework for building distributed computing programs. It combines your business logic with built-in MapReduce components into a complete distributed program whose jobs run concurrently across a cluster.
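The flow can be sketched with a minimal, framework-free word count in Python. This is an illustration of the map/shuffle/reduce model only, not the MaxCompute MapReduce Java API:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every string in every record.
    for record in records:
        for word in record:
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle: group the emitted pairs by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce: aggregate each group, here by summing the counts.
    return {key: sum(values) for key, values in groups.items()}

rows = [("project", "val_pro"), ("problem", "val_pro")]
print(reduce_phase(map_phase(rows)))
# {'project': 1, 'val_pro': 2, 'problem': 1}
```

A real MapReduce job distributes the map and reduce phases across many workers; the framework handles the shuffle between them.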
| API version | Description |
|---|---|
| MaxCompute MapReduce | The native API. Runs fast and does not expose the underlying file system. |
| Extended MaxCompute MapReduce (MR2) | An extension of MaxCompute MapReduce that supports complex job scheduling logic. The development approach is identical to MaxCompute MapReduce. |
For the limits that apply to MaxCompute MapReduce tasks, see Limits.
Run a MapReduce job on an ODPS MR node
The following example counts the occurrences of each string in an input table and writes the results to an output table. It uses the mapreduce-examples.jar file, which contains the standard WordCount implementation.
For the implementation details of mapreduce-examples.jar, see WordCount example.
Step 1: Upload the JAR resource
Download mapreduce-examples.jar, then upload, commit, and deploy it as a MaxCompute resource. See Create and use MaxCompute resources.
Step 2: Write and run the node code
Paste the following code into the ODPS MR node editor and run it.
-- Create an input table.
CREATE TABLE IF NOT EXISTS wc_in (key STRING, value STRING);
-- Create an output table.
CREATE TABLE IF NOT EXISTS wc_out (key STRING, cnt BIGINT);
-- Create a one-row helper table named dual. Skip this step if the
-- workspace already has a dual table.
DROP TABLE IF EXISTS dual;
CREATE TABLE dual (id BIGINT);
-- Seed the dual table with a single row.
INSERT OVERWRITE TABLE dual SELECT COUNT(*) FROM dual;
-- Populate the input table with sample data.
INSERT OVERWRITE TABLE wc_in SELECT * FROM (
    SELECT 'project', 'val_pro' FROM dual
    UNION ALL
    SELECT 'problem', 'val_pro' FROM dual
    UNION ALL
    SELECT 'package', 'val_a' FROM dual
    UNION ALL
    SELECT 'pad', 'val_a' FROM dual
) b;
-- Reference the uploaded JAR resource. To add this line automatically,
-- right-click the JAR resource in the resource list and select Reference Resources.
--@resource_reference{"mapreduce-examples.jar"}
jar -resources mapreduce-examples.jar -classpath ./mapreduce-examples.jar com.aliyun.odps.mapred.open.example.WordCount wc_in wc_out

The jar command has the following structure:

jar -resources <jar-resource-name> -classpath ./<jar-resource-name> <main-class> <input-table> <output-table>

| Parameter | Description |
|---|---|
| -resources | Name of the JAR resource as it appears in the DataWorks resource list. |
| -classpath | Path to the JAR resource. Enter ./ followed by the resource name. |
| <main-class> | Fully qualified name of the main class in the JAR to run (com.aliyun.odps.mapred.open.example.WordCount in this example). |
| <input-table> | Name of the input table (wc_in in this example). |
| <output-table> | Name of the output table (wc_out in this example). |
When the node runs successfully, it returns OK.
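As a sanity check, the counts that WordCount should produce for the sample rows can be simulated outside MaxCompute. The following plain-Python sketch is not the job itself; it only reproduces the counting logic on the four rows inserted above:

```python
from collections import Counter

# Sample rows inserted into wc_in above.
rows = [
    ("project", "val_pro"),
    ("problem", "val_pro"),
    ("package", "val_a"),
    ("pad", "val_a"),
]

# WordCount counts every string across both columns.
counts = Counter(word for row in rows for word in row)
for key in sorted(counts):
    print(key, counts[key])
```

The printed counts (package 1, pad 1, problem 1, project 1, val_a 2, val_pro 2) match the expected output shown in Step 3.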
Using multiple JAR resources: If a single node needs more than one JAR file, separate the paths of the referenced JAR resources with commas (,), such as -classpath ./xxxx1.jar,./xxxx2.jar.
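The comma-separated -classpath value follows a simple pattern; a hypothetical helper (illustration only, not part of DataWorks) shows how it is assembled:

```python
def build_classpath(jar_names):
    # Prefix each referenced JAR resource with ./ and join with commas,
    # matching the -classpath format expected by the jar command.
    return ",".join("./" + name for name in jar_names)

print(build_classpath(["xxxx1.jar", "xxxx2.jar"]))
# ./xxxx1.jar,./xxxx2.jar
```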
Step 3: Verify the output
Run the following query on an ODPS SQL node to confirm the results were written correctly.
SELECT * FROM wc_out;

Expected output:
+------------+------------+
| key | cnt |
+------------+------------+
| package | 1 |
| pad | 1 |
| problem | 1 |
| project | 1 |
| val_a | 2 |
| val_pro | 2 |
+------------+

More examples
For MapReduce development in other scenarios, see the additional examples in the MaxCompute MapReduce documentation.
What's next
After completing node development:
Configure scheduling properties: set up periodic scheduling, rerun settings, and scheduling dependencies so the system runs the node automatically. See Overview.
Debug the node: test your code logic before deploying. See Debugging procedure.
Deploy the node: deploy the node so the scheduler picks it up and runs it on the configured schedule. See Deploy nodes.
For common errors and their solutions, see FAQ about MaxCompute MapReduce.