Write and run a MapReduce WordCount job

Prerequisites

Make sure that the following prerequisites are met:

The MaxCompute client is installed and configured. For more information, see Install and configure the MaxCompute client.
MaxCompute Studio is installed and connected to a MaxCompute project. For more information, see Install MaxCompute Studio and Manage project connections.
You have a source data file saved on your local machine.

This topic uses a sample file named data.txt. The file contains the text hello,odps and is saved in the bin directory of the MaxCompute client.

Usage notes

If you use Maven to develop a MapReduce program, search for odps-sdk-mapred, odps-sdk-commons, and odps-sdk-core in the Maven Central Repository to find the Java SDK version that you need. This topic uses version 0.36.4-public as an example. Add the following dependencies to the pom.xml file.

<dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>odps-sdk-mapred</artifactId>
    <version>0.36.4-public</version>
</dependency>
<dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>odps-sdk-commons</artifactId>
    <version>0.36.4-public</version>
</dependency>
<dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>odps-sdk-core</artifactId>
    <version>0.36.4-public</version>
</dependency>

Procedure

Step 1: Develop a MapReduce program

Write, run, and debug a MapReduce program in MaxCompute Studio.

Step 2: Generate and upload a MapReduce JAR file

Package the compiled WordCount program into a JAR file and upload it to the MaxCompute project.

Step 3: Run the MapReduce job

Use the jar command to run the MapReduce job with the JAR file that you uploaded to the MaxCompute project.

Step 1: Develop a MapReduce program

Create a MaxCompute Java module.
1. Start IntelliJ IDEA. In the menu bar, select File > New > Module....
2. In the left-side navigation pane of the New Module dialog box, select MaxCompute Java.
3. Configure the Module SDK and click Next.
4. Enter a Module name, such as mapreduce, and click Finish.
Write, run, and debug the WordCount MapReduce program.
1. In the Project pane, right-click the source code directory of the module (src > main > java) and select New > MaxCompute Java.
2. In the Create new MaxCompute java class dialog box, click Driver, enter a Name such as WordCount, and press Enter.
3. In the newly created WordCount.java file, write the WordCount MapReduce program.
  
  For the complete WordCount sample code, see Code example.
4. In the left-side navigation pane, right-click the WordCount.java script and select Run.
5. In the Run/Debug Configurations dialog box, configure the MaxCompute project.
  
  For the MaxCompute project setting, select an Endpoint such as http://service.cn-hangzhou.maxcompute.aliyun.com/api, and then choose your target project, such as doc_test_dev.
6. Click OK. Run and debug the WordCount.java script to verify that it compiles successfully.
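
To make the map and reduce phases concrete, the following is a minimal plain-Java sketch of the word-count logic. It is not the sample from Code example and does not use the MaxCompute MapReduce SDK. The class name WordCountSketch and its methods are hypothetical. The sketch assumes that every column value of an input row counts as one word, which is consistent with the result shown at the end of this topic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only: it does not use the MaxCompute MapReduce SDK.
final class WordCountSketch {

    // Map phase: treat every column value of every input row as one word
    // and emit a (word, 1) pair for it. (Assumption made for this sketch.)
    static List<Map.Entry<String, Long>> map(List<String[]> rows) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String[] row : rows) {
            for (String word : row) {
                pairs.add(Map.entry(word, 1L));
            }
        }
        return pairs;
    }

    // Reduce phase: group the pairs by word and sum their counts.
    // A TreeMap keeps the result sorted by word.
    static Map<String, Long> reduce(List<Map.Entry<String, Long>> pairs) {
        Map<String, Long> counts = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // One row with two columns, mirroring a (key, value) input table.
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"hello", "odps"});
        reduce(map(rows)).forEach((word, cnt) -> System.out.println(word + "\t" + cnt));
    }
}
```

In an actual job, the framework groups the emitted pairs by key between the two phases. The sketch does everything in a single process only to show the data flow.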

Step 2: Generate and upload a MapReduce JAR file

In the left-side navigation pane of IntelliJ IDEA, right-click the WordCount.java script and select Deploy to server....
In the Package a jar and submit resource dialog box, configure the parameters and click OK.
Select the target MaxCompute project. The Resource file field displays the path of the packaged JAR file. Enter a Resource name, such as mapreduce-1.0-SNAPSHOT.jar. Optionally, add a comment in the Resource comment field. Select Force update if already exists to overwrite an existing resource with the same name.

For more information about the parameters, see Procedure.
Note
If you use Maven to develop the MapReduce program, after packaging it into a JAR file, you must manually upload it to the MaxCompute project using the MaxCompute client. For more information, see Add a resource. The following command is an example:
```
add jar mapreduce-1.0-SNAPSHOT.jar;
```

Step 3: Run the MapReduce job

Log on to the MaxCompute client or open the MaxCompute client in MaxCompute Studio.

The MaxCompute client is integrated into MaxCompute Studio, so you can run it directly there. For more information, see Integrate the MaxCompute client.
Create an input table and an output table.
The MapReduce job reads source data from the input table and writes the results to the output table. The following commands are examples:
```
-- Create the input table wc_in.
create table wc_in (key STRING, value STRING);
-- Create the output table wc_out.
create table wc_out (key STRING, cnt BIGINT);
```
For more information about the syntax for creating tables, see Create a table.
Use the tunnel upload command to upload the data in data.txt to the wc_in table.
The following command is an example:
```
tunnel upload data.txt wc_in;
```
For more information about Tunnel, see Tunnel.
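
The following plain-Java sketch illustrates how the single line in data.txt lines up with the two columns of wc_in. It assumes that fields are separated by a comma. The delimiter, the class name UploadLineSketch, and the method toColumns are assumptions made for illustration and are not part of the Tunnel tooling, so check the Tunnel documentation for the actual delimiter options.

```java
import java.util.regex.Pattern;

// Illustration only: mimics how one delimited text line maps to table columns.
final class UploadLineSketch {

    // Field delimiter assumed for this sketch.
    static final String FIELD_DELIMITER = ",";

    // Split one line of the source file into column values.
    // The -1 limit keeps trailing empty fields instead of dropping them.
    static String[] toColumns(String line) {
        return line.split(Pattern.quote(FIELD_DELIMITER), -1);
    }

    public static void main(String[] args) {
        String[] columns = toColumns("hello,odps");
        // wc_in has two STRING columns: key and value.
        System.out.println("key=" + columns[0] + ", value=" + columns[1]);
    }
}
```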
Use the jar command to run the MapReduce job with the generated JAR file.
The following command is an example:
```
jar -resources mapreduce-1.0-SNAPSHOT.jar -classpath mapreduce-1.0-SNAPSHOT.jar com.aliyun.odps.mapred.open.example.WordCount wc_in wc_out;
```
- -resources mapreduce-1.0-SNAPSHOT.jar: The -resources parameter specifies the name of the resource that the MapReduce job uses. In this example, the resource is the mapreduce-1.0-SNAPSHOT.jar file uploaded in Step 2.
- -classpath mapreduce-1.0-SNAPSHOT.jar: The -classpath parameter specifies the local path to the JAR file that contains the MainClass.
- com.aliyun.odps.mapred.open.example.WordCount: The MainClass defined in the MapReduce program.
- wc_in wc_out: The input table and output table.
For more information about the jar command, see Syntax.
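
As a rough illustration of the last two items, the sketch below shows how a driver class could read the input and output table names from its main arguments. It assumes that the arguments after the MainClass are passed to main in the order given. The class DriverArgsSketch, the nested class Tables, and the usage message are hypothetical and are not taken from the sample code.

```java
// Illustration only: shows how trailing command arguments could reach main().
final class DriverArgsSketch {

    // Hypothetical holder for the two table names.
    static final class Tables {
        final String input;
        final String output;

        Tables(String input, String output) {
            this.input = input;
            this.output = output;
        }
    }

    // Expect exactly two arguments: the input table, then the output table.
    static Tables parse(String[] args) {
        if (args.length != 2) {
            throw new IllegalArgumentException("Usage: WordCount <in_table> <out_table>");
        }
        return new Tables(args[0], args[1]);
    }

    public static void main(String[] args) {
        // Simulates the trailing "wc_in wc_out" part of the command.
        Tables tables = parse(new String[] {"wc_in", "wc_out"});
        System.out.println("input=" + tables.input + ", output=" + tables.output);
    }
}
```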

Run the following command to view the data in the wc_out table.

select * from wc_out;

The following result is returned:

+------------+------------+
| key        | cnt        |
+------------+------------+
| hello      | 1          |
| odps       | 1          |
+------------+------------+