Use a MaxCompute MR node to schedule and run MapReduce jobs written in Java within DataWorks, and integrate them with other nodes in your workflow.
Background
MaxCompute provides two MapReduce programming interfaces. MapReduce is a distributed computing framework that combines user-written business logic with built-in components into a complete distributed program that runs concurrently on a cluster.
MaxCompute MapReduce: The native interface. Delivers fast execution and rapid development without exposing the file system.
Extended MaxCompute MapReduce (MR2): An extension that supports more complex job scheduling. Its MapReduce implementation is consistent with the native interface.
For details on both interfaces, see MapReduce.
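To make the programming model concrete, the following standalone Java sketch shows the map, shuffle, and reduce phases that both interfaces implement. It is an illustration of the model only, not the MaxCompute SDK API; all class and method names here are invented for the example.

```java
import java.util.*;

public class MapReduceSketch {
    // Map phase: emit a (word, 1) pair for every word in every input line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                pairs.add(Map.entry(word, 1));
        return pairs;
    }

    // Shuffle + reduce phase: group pairs by key and sum the values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(reduce(map(List.of("a b", "b c")))); // {a=1, b=2, c=1}
    }
}
```

In a real job, the framework distributes the map and reduce calls across the cluster; only the two user functions change.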
Prerequisites
Before you begin, ensure that you have:
A MaxCompute compute resource attached to the DataWorks workspace. See Attach a MaxCompute compute resource.
All required resources uploaded and published before you create the node. See Resource management.
Limitations
For limits that apply to MaxCompute MR nodes, see Limits.
Develop and run a MaxCompute MR node
This section walks through a WordCount example that counts the occurrences of each string in the wc_in table and writes the results to the wc_out table.
Step 1: Develop the MR code
Upload, submit, and publish the mapreduce-examples.jar resource. See Resource management. For details on the internal logic of mapreduce-examples.jar, see the WordCount example.

In the MaxCompute MR node editor, enter the following code:

```sql
-- Create the input table.
CREATE TABLE IF NOT EXISTS wc_in (key STRING, value STRING);
-- Create the output table.
CREATE TABLE IF NOT EXISTS wc_out (key STRING, cnt BIGINT);
-- Create the system dual table. If the pseudo table does not exist in the workspace, you must create it and initialize its data.
DROP TABLE IF EXISTS dual;
CREATE TABLE dual (id BIGINT);
-- Initialize data in the system pseudo table.
INSERT OVERWRITE TABLE dual SELECT COUNT(*) FROM dual;
-- Insert sample data into the input table wc_in.
INSERT OVERWRITE TABLE wc_in
SELECT * FROM (
    SELECT 'project', 'val_pro' FROM dual
    UNION ALL
    SELECT 'problem', 'val_pro' FROM dual
    UNION ALL
    SELECT 'package', 'val_a' FROM dual
    UNION ALL
    SELECT 'pad', 'val_a' FROM dual
) b;
-- Reference the JAR resource that you uploaded. Right-click the resource name in the Resource section and choose Reference Resource to auto-generate this statement.
--@resource_reference{"mapreduce-examples.jar"}
jar -resources mapreduce-examples.jar -classpath ./mapreduce-examples.jar com.aliyun.odps.mapred.open.example.WordCount wc_in wc_out
```

The key parameters in the jar command are:

--@resource_reference: Auto-generated when you right-click a resource name in the Resource section and choose Reference Resource.
-resources: The name of the referenced JAR resource file.
-classpath: The path to the JAR file. When a resource is referenced, the path is always ./ followed by the JAR filename (for example, ./mapreduce-examples.jar).
com.aliyun.odps.mapred.open.example.WordCount: The main class in the JAR file that is called at execution. It must exactly match the main class defined in the JAR file.
wc_in: The input table for the MapReduce job.
wc_out: The output table for the MapReduce job.
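The counting logic itself lives inside mapreduce-examples.jar. As a reference point, the following standalone Java sketch reproduces that logic on the sample rows: each string in both columns of wc_in is counted once, and the sums become the rows of wc_out. This is a plain-Java illustration, not the actual SDK code in the JAR; the class and method names are invented for the example.

```java
import java.util.*;

public class WordCountSketch {
    // Count every string across all columns of the input rows,
    // mirroring what the WordCount job writes to wc_out.
    static Map<String, Long> count(String[][] rows) {
        Map<String, Long> counts = new TreeMap<>();
        for (String[] row : rows)
            for (String cell : row)
                counts.merge(cell, 1L, Long::sum);   // map emits (cell, 1); reduce sums
        return counts;
    }

    public static void main(String[] args) {
        // The sample rows inserted into wc_in (key and value columns).
        String[][] wcIn = {
            {"project", "val_pro"},
            {"problem", "val_pro"},
            {"package", "val_a"},
            {"pad",     "val_a"}
        };
        count(wcIn).forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```

Running this prints each distinct string with its count (package 1, pad 1, problem 1, project 1, val_a 2, val_pro 2), which matches the expected contents of wc_out shown in Step 3.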
Multiple JAR files
If the job depends on more than one JAR file, list all paths in -classpath, separated by commas:
-classpath ./xxxx1.jar,./xxxx2.jar

Step 2: Run the MR job
In the Run Configuration section on the right, set the Compute Engine Instance, Compute Quota, and Resource Group.
To access a data source on the Public Network or in a Virtual Private Cloud (VPC), use a scheduling resource group that can connect to the data source. See Network connectivity solutions.
In the toolbar, click Run. In the Parameters dialog box, select your MaxCompute data source and run the job.
Step 3: (Optional) Verify the results
Query the output table using a MaxCompute SQL node:
SELECT * FROM wc_out;

Expected output:
+------------+------------+
| key | cnt |
+------------+------------+
| package | 1 |
| pad | 1 |
| problem | 1 |
| project | 1 |
| val_a | 2 |
| val_pro | 2 |
+------------+------------+

Configure scheduling and deploy
After developing and testing the node in the editor, complete the following steps to put it into production:
Configure scheduling: If the node runs periodically, set its scheduling properties. See Node scheduling configuration.
Deploy the node: Deploy the node to make it available for scheduling. See Node and workflow deployment.
Monitor execution: After deployment, track the job status in Operation Center. See Getting started with Operation Center.
What's next
Explore more MaxCompute MR job examples for different scenarios.
For solutions to common issues, see FAQ about MaxCompute MapReduce.