This topic describes how to build a MapReduce program in DataWorks by using the WordCount example program.
Create a project
- Log on to the DataWorks console. Click Workspaces in the left-side navigation pane. On the Workspaces page, find the target workspace and click Data Analytics in the Actions column.
- On the DataStudio page that appears, click the DataWorks icon in the upper-left corner and choose .
- On the Function Studio page that appears, click Projects in the left-side navigation pane. On the Projects page, click Create Project from Code.
- On the Create Project page that appears, enter the name and description for the project and select the runtime environment. In this example, select UDFJava Project.
- After the preceding configuration is completed, click Submit. Then, you are redirected to the Function Studio page.
Write a program
In the left-side navigation pane of the Function Studio page, double-click WordCount.java in the src/main/java/com.alibaba.dataworks/mapred directory. The sample code for the WordCount program appears on the right.
The sample code is designed to count the number of occurrences of each word in an input table and write the count result to another table, which is the output table.
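The counting logic described above can be sketched in plain Java. This is not the WordCount.java file that Function Studio generates, and it does not use the MaxCompute MapReduce SDK. The class name WordCountSketch and its methods are hypothetical, rows are modeled as in-memory string arrays, and the sketch assumes that every column value in a row is treated as one word. The sample row mirrors the test data ('hello','odps') that this topic inserts into wc_input.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the MapReduce job; input rows are held in memory.
public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every column value of every input row.
    static List<Map.Entry<String, Long>> map(List<String[]> rows) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String[] row : rows) {
            for (String word : row) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1L));
            }
        }
        return pairs;
    }

    // Reduce phase: sum the counts of all pairs that share the same word.
    static Map<String, Long> reduce(List<Map.Entry<String, Long>> pairs) {
        Map<String, Long> counts = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Long::sum);
        }
        return counts;
    }

    // Runs both phases; the result plays the role of the (key, cnt) output table.
    static Map<String, Long> countWords(List<String[]> rows) {
        return reduce(map(rows));
    }

    public static void main(String[] args) {
        List<String[]> input = new ArrayList<>();
        input.add(new String[] {"hello", "odps"}); // same values as the wc_input test row
        countWords(input).forEach((word, cnt) -> System.out.println(word + "\t" + cnt));
    }
}
```

In the real job, the framework performs the grouping between the map and reduce phases and the results are written to the output table. The sketch folds both into a single in-memory map to keep the example self-contained.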
Debug the program
You cannot debug MapReduce programs in Function Studio. To debug a MapReduce program, commit the resource to the development environment and verify the code logic in DataStudio.
Deploy the program
- To compile and package the code into a JAR package in Function Studio and commit the JAR package as a resource to the development environment in DataStudio, follow these steps:
- Move the pointer over the icon on the Function Studio page and click Submit Resource to Development Environment.
- In the Submit Resource to DataStudio Development Environment dialog box that appears, select the destination workspace and destination workflow and enter the resource name. In this example, the resource name is WordCountDemo_1.0.0.jar.
- Select or clear the Force Overwrite check box as required. By default, this check box is selected. Then, click OK.
- After the resource is committed, the URL of the resource in DataStudio appears on the Output tab. Enter the URL in a browser to go to the page of DataStudio, where you can find the committed resource.
- To create an ODPS MR node in DataStudio, follow these steps:
- On the DataStudio page, find and open the target workflow in the left-side navigation pane.
- Right-click Data Analytics and choose .
- In the code editor of the ODPS MR node, write the following code to create an input table and an output table and insert test data into the input table:
--Create an input table.
create table if not exists wc_input (key string, value string);
--Create an output table.
create table if not exists wc_out (key string, cnt bigint);
--Insert data into the input table.
insert overwrite table wc_input values ('hello','odps');
- Right-click the resource of the WordCount program and select Insert Resource Path. Then, the following statement appears in the code editor of the ODPS MR node:
- Set the -classpath parameter for the referenced resource in the code. Use the following sample code:
jar -resources WordCountDemo_1.0.0.jar -classpath ./WordCountDemo_1.0.0.jar com.alibaba.dataworks.mapred.WordCount wc_input wc_out
- Click the Run icon in the top navigation bar of the code editor.
- Click Properties in the right-side navigation pane. On the Properties tab that appears, set the relevant parameters.
- After the scheduling configuration is completed, click the Save icon and then the Submit icon or the Commit and Unlock icon in the top navigation bar of the code editor. The ODPS MR node is committed to the development environment.
- Deploy the ODPS MR node to the production environment. For more information, see Deploy a node.
- Run the ODPS MR node in the production environment to debug the WordCount program. For more information, see Auto triggered nodes.