Improve UDF and MapReduce development experience using MaxCompute Studio
Posted time: Nov 15, 2016 10:03 AM
UDF is short for User Defined Function. MaxCompute provides many built-in functions to meet users' computing requirements. It also supports UDF creation to meet custom computing demands of users. There are three types of UDFs available for user extensions: UDF (User Defined Scalar Function), UDTF (User Defined Table Valued Function) and UDAF (User Defined Aggregation Function).
MaxCompute also provides a MapReduce programming interface. You can use the Java API provided by MapReduce to write a MapReduce program that processes data in MaxCompute.
MaxCompute Studio provides end-to-end support for both, so you can get started quickly and develop your own UDFs and MapReduce programs more efficiently. Next, we will introduce how to develop a UDF of your own using MaxCompute Studio:
Create a MaxCompute Java module
First, you should create a module in IntelliJ for developing the MaxCompute Java program. The specific steps are: File > New > Module... The module type should be MaxCompute Java. Configure the installation path of the Java JDK and MaxCompute console. Click Next, enter the module name and then click Finish.
There are two main reasons for configuring the console here:
• Compiling UDF and MapReduce programs depends on the JAR files of the MaxCompute framework. These JAR files are all present in the lib directory of the console. MaxCompute Studio can automatically import them as dependencies of the module.
• MaxCompute Studio integrates the console, so some actions can be performed conveniently through it.
At this point, a module for developing a MaxCompute Java program has been created (jDev in the figure below). Its main directories include:
• src (The source code directory for user-developed UDF and MapReduce programs)
• examples (The example code directory, including unit test examples. You can refer to the examples here to develop your own programs or write unit tests)
• warehouse (The schema and data required for running locally)
Create a UDF
Suppose we need a UDF that converts character strings to lower case. (The built-in function tolower already implements this logic; we use this simple requirement only to illustrate how to develop a UDF using MaxCompute Studio.) MaxCompute Studio provides templates for UDF, UDAF, UDTF, Mapper, Reducer and Driver classes. As a result, you only need to write your own business code, and the framework code is filled in automatically by the templates.
• 1. Right click the src directory, and select New > MaxCompute Java.
• 2. Enter the class name, such as myudf.MyLower, and select the kind. Here we choose UDF. Click OK.
• 3. The framework code has been automatically filled in by the template. We only need to write the function code for converting the character strings into lower-case strings.
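As a rough illustration of the function code in step 3, the lower-casing logic could look like the sketch below. It makes some assumptions: in a real module the class would sit in the myudf package and extend com.aliyun.odps.udf.UDF as generated by the template. Both are left out here so that the snippet compiles without the MaxCompute JAR files.

```java
import java.util.Locale;

// Sketch of the MyLower business logic only.
// Assumption: the Studio template would add "package myudf;" and
// "extends com.aliyun.odps.udf.UDF"; they are omitted to keep this standalone.
class MyLower {
    // A scalar UDF exposes an evaluate method that is applied to each input value.
    public String evaluate(String s) {
        if (s == null) {
            return null; // pass NULL input through unchanged
        }
        return s.toLowerCase(Locale.ROOT);
    }
}
```

Returning null for null input is a design choice made for this sketch; it keeps the function from failing on rows where the column is empty.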
Test the UDF
After a UDF or a MapReduce program is developed, the next step is to test your code and check whether it behaves as expected. MaxCompute Studio provides two testing methods:
Unit testing
This method relies on the Local Run framework provided by MaxCompute. You only need to provide the input data and assert on the output, as you would when writing an ordinary unit test, to conveniently test your own UDF or MapReduce program. There are various unit test examples in the examples directory, and you can refer to them to write your own unit test. Here we create a new MyLowerTest testing class to test our MyLower:
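A test class in this spirit might look like the following sketch. It is not the Local Run API or the example shipped in the examples directory: it uses plain Java checks instead of a test framework, and a small stand-in for MyLower is nested inside so that the snippet is self-contained. The method names check and runAll are hypothetical.

```java
import java.util.Locale;
import java.util.Objects;

// Hypothetical test class: provide inputs, assert on outputs.
class MyLowerTest {
    // Stand-in for the UDF under test (assumption: the real class
    // extends com.aliyun.odps.udf.UDF and lives in package myudf).
    static class MyLower {
        public String evaluate(String s) {
            return s == null ? null : s.toLowerCase(Locale.ROOT);
        }
    }

    // Runs the UDF on one input and fails loudly if the output differs.
    static void check(String input, String expected) {
        String actual = new MyLower().evaluate(input);
        if (!Objects.equals(expected, actual)) {
            throw new AssertionError(
                "evaluate(" + input + ") returned " + actual + ", expected " + expected);
        }
    }

    // With a test framework, each of these would be its own test case.
    static void runAll() {
        check("Hello World", "hello world");
        check("already lower", "already lower");
        check("", "");
        check(null, null);
    }
}
```

Calling MyLowerTest.runAll() completes silently when every case passes and throws an AssertionError on the first mismatch.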
Sample data testing
Many users have asked for a way to sample some data from online tables to the local machine for testing. MaxCompute Studio also supports this feature. In the editor, right click the UDF class MyLower.java and click Run; the Run/Debug Configurations dialog box pops up. Configure the MaxCompute project, MaxCompute table and Table columns. Here we want to convert the name field of table hy_test to lower case:
Click OK, and MaxCompute Studio will first automatically download the sample data from the table through the tunnel to the local warehouse (the highlighted data files in the figure), and then read the data of the specified column and run the UDF locally. You can see the log output and printed result in the console:
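Conceptually, the local run reads the chosen column from each sampled row and passes it to the UDF. The sketch below illustrates only that idea with in-memory rows. It does not reproduce how MaxCompute Studio reads the warehouse files; the class SampleRun, the method runOnColumn and the row layout are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of "read one column, run the UDF per row".
class SampleRun {
    // Stand-in for the UDF (assumption: the real class extends
    // com.aliyun.odps.udf.UDF).
    static class MyLower {
        public String evaluate(String s) {
            return s == null ? null : s.toLowerCase(Locale.ROOT);
        }
    }

    // Applies the UDF to the column at columnIndex of every sampled row.
    static List<String> runOnColumn(List<String[]> rows, int columnIndex) {
        MyLower udf = new MyLower();
        List<String> results = new ArrayList<>();
        for (String[] row : rows) {
            results.add(udf.evaluate(row[columnIndex]));
        }
        return results;
    }
}
```

For example, given rows of the form {id, name}, runOnColumn(rows, 1) would return the lower-cased name of each row in input order.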
Release a UDF
Okay. Our MyLower.java has passed the test. Next we will package it into a JAR resource (this step can be performed in the IDE by referring to the user manual) and upload it to the MaxCompute server:
• 1. In MaxCompute menu, select the Add Resource item:
• 2. Select the MaxCompute project you want to upload the resource to, the JAR file path, the resource name you want to register, and determine whether to force update when the resource or function already exists. Then click OK.
• 3. After the JAR package is successfully uploaded, you can proceed to register the UDF. In the MaxCompute menu, select the Create Function item.
• 4. Select the JAR resource you want to use, select the main class (MaxCompute Studio will automatically parse the main classes contained in the JAR resource for you to choose from), enter the function name and click OK.
Production and usage
The successfully uploaded JAR resource and registered function are now ready for production use. You can see them immediately under the Resources and Functions nodes of the corresponding project in the Project Explorer, and you can double click a node to display the decompiled source code. We can open the SQL editor of MaxCompute Studio and use the mylower function that we just wrote. Syntax highlighting and function signature display are supported as well:
MaxCompute Studio's support for the MapReduce development process is similar to its support for UDF development. The main differences include:
• The MapReduce program applies to the whole table and the input and output tables are specified in the Driver. So if you use sample data for testing, you only need to specify the project in Run/Debug Configurations.
• After a MapReduce program is developed, you only need to package it into a JAR resource and upload it, without the need to register the program.
• To run the MapReduce program in a real production environment, you can use the console, which is integrated into MaxCompute Studio. The specific steps are: in the Project Explorer window, right click Project, select Open in Console, and then enter a command similar to the following in the console command line:
jar -libjars wordcount.jar -classpath D:\odps\clt\wordcount.jar com.aliyun.odps.examples.mr.WordCount wc_in wc_out;
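For readers new to the model, the logic that a WordCount job carries out can be sketched in plain Java as below. This is not the MaxCompute MapReduce API: a real program would implement Mapper, Reducer and Driver classes against the framework (for instance via the Studio templates), reading from and writing to tables such as wc_in and wc_out. The class and method names here are hypothetical and everything runs in memory.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory sketch of the word count map and reduce phases.
class WordCountSketch {
    // Map phase: emit a (word, 1) pair for each whitespace-separated token.
    static List<Map.Entry<String, Long>> map(String record) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String word : record.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1L));
            }
        }
        return pairs;
    }

    // Reduce phase: sum the counts per word; grouping by key in the map
    // stands in for the shuffle step of a real framework.
    static Map<String, Long> reduce(List<Map.Entry<String, Long>> pairs) {
        Map<String, Long> counts = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Long::sum);
        }
        return counts;
    }

    // Driver: feeds every input record through map, then reduce.
    static Map<String, Long> run(List<String> records) {
        List<Map.Entry<String, Long>> all = new ArrayList<>();
        for (String record : records) {
            all.addAll(map(record));
        }
        return reduce(all);
    }
}
```

In the real job, the Driver is also where the input and output tables are specified, which is why only the project needs to be configured when testing with sample data.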