This topic describes how to write a user-defined table-valued function (UDTF) in Java.
UDTF code structure
- Java package: optional.
You can package defined Java classes into a file for future use.
- Base UDTF classes: required.
The following base UDTF classes must be included: com.aliyun.odps.udf.UDTF, com.aliyun.odps.udf.annotation.Resolve, and com.aliyun.odps.udf.UDFException. com.aliyun.odps.udf.annotation.Resolve provides the @Resolve annotation, and com.aliyun.odps.udf.UDFException is the exception type declared by the methods that are used to implement Java classes. If you need to use other UDTF classes or complex data types, add the required classes by following the instructions provided in MaxCompute SDK.
- Custom Java class: required.
A custom Java class is the organizational unit of UDTF code. This class defines the variables and methods that are used to meet your business requirements.
- @Resolve annotation: required.
The annotation is in the @Resolve(<signature>) format. The signature is a function signature that defines the data types of input parameters and return values of a UDTF. You cannot obtain function signatures for UDTFs by using the reflection feature. You can obtain a function signature only by using a @Resolve annotation, such as @Resolve("smallint->varchar(10)"). For more information about @Resolve annotations, see @Resolve annotations.
- Methods to implement Java classes: required.
The following table describes the methods that can be used to implement Java classes. You can select one of the methods based on your business requirements.
Method | Description |
---|---|
public void setup(ExecutionContext ctx) throws UDFException | The initialization method. Before a UDTF processes the input data, MaxCompute calls the user-defined initialization behavior. setup is called once for each worker. |
public void process(Object[] args) throws UDFException | process is called once for each SQL record. The parameters of process are the input parameters of the UDTF that are specified in SQL statements. The input parameters are passed to the process function as Object[], and the results are returned by using the forward function. You must call the forward function in the process function to determine the output data. |
public void close() throws UDFException | The method to terminate a UDTF. This method is called only once. It is called only after the last record is processed. |
public void forward(Object... o) throws UDFException | You can call the forward method to return data. One record is generated each time the forward function is called. When you call a UDTF in an SQL query statement, you can use the AS clause to rename the output of the forward method. |

You can use Java data types or Java writable types to write a Java UDTF. For more information about the mappings among the data types that are supported by MaxCompute projects, Java data types, and Java writable types, see Data types.
The following example shows the UDTF code.
// Declare the Java package org.alidata.odps.udtf.examples (optional).
package org.alidata.odps.udtf.examples;

// The base UDTF classes.
import com.aliyun.odps.udf.UDTF;
import com.aliyun.odps.udf.UDTFCollector;
import com.aliyun.odps.udf.annotation.Resolve;
import com.aliyun.odps.udf.UDFException;

// The custom Java class.
// The @Resolve annotation: two inputs (STRING, BIGINT), two output columns (STRING, BIGINT).
@Resolve("string,bigint->string,bigint")
public class MyUDTF extends UDTF {
    // The method that is used to implement the Java class.
    @Override
    public void process(Object[] args) throws UDFException {
        String a = (String) args[0];
        Long b = (Long) args[1];
        // Emit one output record per whitespace-separated token in the first column.
        for (String t : a.split("\\s+")) {
            forward(t, b);
        }
    }
}
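The call order described in the method table (setup once, process once per record, forward once per output record, close once) can be illustrated with a minimal plain-Java sketch. MiniUdtf, its run driver, and RowNumberer are hypothetical stand-ins written for this illustration. They are not part of the MaxCompute SDK, and a real UDTF extends com.aliyun.odps.udf.UDTF instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the SDK base class; this is NOT com.aliyun.odps.udf.UDTF.
abstract class MiniUdtf {
    final List<String> calls = new ArrayList<>();          // records the call order
    private final List<Object[]> rows = new ArrayList<>(); // collected output records

    void setup() { calls.add("setup"); }
    abstract void process(Object[] args);
    void close() { calls.add("close"); }

    // Each call emits exactly one output record.
    protected void forward(Object... o) {
        rows.add(o);
    }

    // Hypothetical driver that mimics the documented call order.
    final List<Object[]> run(List<Object[]> input) {
        setup();
        for (Object[] record : input) {
            calls.add("process");
            process(record);
        }
        close();
        return rows;
    }
}

// Numbers each input value, using state that setup() initializes.
class RowNumberer extends MiniUdtf {
    private long counter;

    @Override
    void setup() {
        super.setup();
        counter = 0L;
    }

    @Override
    void process(Object[] args) {
        counter++;
        forward(counter, args[0]);
    }
}

class LifecycleDemo {
    public static void main(String[] argv) {
        List<Object[]> input = new ArrayList<>();
        input.add(new Object[] {"x"});
        input.add(new Object[] {"y"});
        RowNumberer udtf = new RowNumberer();
        for (Object[] row : udtf.run(input)) {
            System.out.println(row[0] + "\t" + row[1]);
        }
        System.out.println(udtf.calls);
    }
}
```

In this sketch, two input records produce two output records, and the recorded call order is setup, process, process, close.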
Limits
- You cannot access the Internet by using user-defined functions (UDFs). If you want to access the Internet by using UDFs, fill in the network connection application form based on your business requirements and submit the application. After the application is approved, the MaxCompute technical support team will contact you and help you establish network connections. For more information about how to fill in the network connection application form, see Network connection process.
- If you use a UDTF in a SELECT statement, you cannot specify other columns or use other expressions in this statement. The following sample code shows an incorrect SQL statement.
  -- The statement contains a UDTF and another column.
  select value, user_udtf(key) as mycol ...
- UDTFs cannot be nested. The following sample code shows an incorrect SQL statement.
  -- A UDTF named user_udtf2 is nested in a UDTF named user_udtf1.
  select user_udtf1(user_udtf2(key)) as mycol...;
- A UDTF cannot be used with a GROUP BY, DISTRIBUTE BY, or SORT BY clause in the same SELECT statement. The following sample code shows an incorrect SQL statement.
  -- A UDTF is used together with a GROUP BY clause.
  select user_udtf(key) as mycol ... group by mycol;
Usage notes
- We recommend that you do not package classes that have the same name but different logic into the JAR files of different UDTFs. For example, the JAR file of UDTF 1 is named udtf1.jar and the JAR file of UDTF 2 is named udtf2.jar. Both packages contain a class named com.aliyun.UserFunction.class, but the class has different logic. If UDTF 1 and UDTF 2 are called in the same SQL statement, MaxCompute loads the com.aliyun.UserFunction.class from one of the two files. As a result, the UDTFs cannot run as expected and a compilation error may occur.
- The data type of an input parameter or a return value in a Java UDTF is an object. The first letter of the data types that you specify in the Java UDTF code must be in uppercase, such as String.
- NULL values in SQL are represented by null in Java. Primitive data types in Java, such as long or boolean, cannot hold null and therefore cannot represent NULL values in SQL. Use the corresponding object types, such as Long or Boolean, instead.
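The last note can be illustrated with a short plain-Java sketch. NullSafeArgs and its methods are hypothetical names used only for this illustration. The sketch assumes that a SQL NULL arrives as a null element in the Object[] that is passed to process.

```java
class NullSafeArgs {
    // Hypothetical helper: reads a BIGINT argument that may be SQL NULL.
    static Long readBigint(Object[] args, int i) {
        return (Long) args[i]; // casting null is allowed and yields null
    }

    // Returns the value plus one, or null if the input is SQL NULL.
    static Long plusOne(Object[] args) {
        Long b = readBigint(args, 0);
        if (b == null) {
            return null; // propagate NULL instead of unboxing it
        }
        return b + 1;
    }

    public static void main(String[] argv) {
        System.out.println(plusOne(new Object[] {41L}));
        System.out.println(plusOne(new Object[] {null}));

        Object[] row = {null};
        try {
            long bad = (Long) row[0]; // unboxing null into a primitive fails
            System.out.println(bad);
        } catch (NullPointerException e) {
            System.out.println("unboxing a NULL argument throws NullPointerException");
        }
    }
}
```

Keeping the value as Long until it has been checked for null avoids the NullPointerException that the primitive assignment triggers.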
@Resolve annotations
@Resolve annotation format: @Resolve(<signature>)
signature is a function signature string. This parameter is used to identify the data types of the input parameters and return values. When a UDTF is run, the input parameters and return values of the UDTF must be of the same data types as those specified in the function signature. The system checks whether the UDTF complies with the definition of the function signature during semantics parsing. If the data types of the UDTF are inconsistent with the data types specified in the function signature, an error is returned. The signature is in the following format: 'arg_type_list -> type_list'
Parameter description:
- type_list: indicates the data types of return values. A UDTF can return multiple columns. The following data types are supported: BIGINT, STRING, DOUBLE, BOOLEAN, DATETIME, DECIMAL, FLOAT, BINARY, DATE, DECIMAL(precision,scale), complex data types (ARRAY, MAP, and STRUCT), and nested complex data types.
- arg_type_list: indicates the data types of input parameters. If multiple input parameters are used, specify multiple data types and separate them with commas (,). The following data types are supported: BIGINT, STRING, DOUBLE, BOOLEAN, DATETIME, DECIMAL, FLOAT, BINARY, DATE, DECIMAL(precision,scale), CHAR, VARCHAR, complex data types (ARRAY, MAP, and STRUCT), and nested complex data types. arg_type_list can also be set to an asterisk (*) or left empty.
  - If arg_type_list is set to an asterisk (*), any number of input parameters can be used.
  - If arg_type_list is left empty, no input parameters are used.
The following table provides examples of @Resolve annotations.
@Resolve annotation | Description |
---|---|
@Resolve("bigint,boolean->string,datetime") | The data types of input parameters are BIGINT and BOOLEAN, and the data types of return values are STRING and DATETIME. |
@Resolve("*->string, datetime") | Any number of input parameters can be used. The data types of return values are STRING and DATETIME. |
@Resolve("->double, bigint, string") | No input parameters are used. The data types of return values are DOUBLE, BIGINT, and STRING. |
@Resolve("array<string>,struct<a1:bigint,b1:string>,string->map<string,bigint>,struct<b1:bigint>") | The data types of input parameters are ARRAY, STRUCT, and STRING. The data types of return values are MAP and STRUCT. |
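To make the 'arg_type_list -> type_list' format concrete, the following plain-Java sketch splits a signature string into its input and return type lists. SignatureSketch is a hypothetical class written for this illustration. It is not how MaxCompute parses signatures, and it does not validate type names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the MaxCompute SDK.
class SignatureSketch {
    final boolean anyArgs;          // true when arg_type_list is "*"
    final List<String> argTypes;    // empty when there are no inputs or anyArgs is true
    final List<String> returnTypes;

    private SignatureSketch(boolean anyArgs, List<String> argTypes, List<String> returnTypes) {
        this.anyArgs = anyArgs;
        this.argTypes = argTypes;
        this.returnTypes = returnTypes;
    }

    static SignatureSketch parse(String signature) {
        int arrow = signature.indexOf("->");
        if (arrow < 0) {
            throw new IllegalArgumentException("missing '->' in: " + signature);
        }
        String args = signature.substring(0, arrow).trim();
        String rets = signature.substring(arrow + 2).trim();
        boolean any = args.equals("*");
        List<String> argList = any ? new ArrayList<String>() : splitTopLevel(args);
        return new SignatureSketch(any, argList, splitTopLevel(rets));
    }

    // Splits on commas that are not nested inside <...> or (...).
    static List<String> splitTopLevel(String list) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        int depth = 0;
        for (char c : list.toCharArray()) {
            if (c == '<' || c == '(') depth++;
            if (c == '>' || c == ')') depth--;
            if (c == ',' && depth == 0) {
                out.add(cur.toString().trim());
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        String last = cur.toString().trim();
        if (!last.isEmpty()) {
            out.add(last);
        }
        return out;
    }

    public static void main(String[] argv) {
        SignatureSketch s = parse(
            "array<string>,struct<a1:bigint,b1:string>,string->map<string,bigint>,struct<b1:bigint>");
        System.out.println(s.argTypes);
        System.out.println(s.returnTypes);
    }
}
```

Tracking the nesting depth keeps the commas inside struct<...>, map<...>, and decimal(precision,scale) from being treated as separators between parameters.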
Data types
In MaxCompute, different data type editions support different data types. In MaxCompute V2.0 and later, more data types and complex data types, such as ARRAY, MAP, and STRUCT, are supported. For more information about MaxCompute data type editions, see Data type editions.
The following table describes the mappings among the data types that are supported by MaxCompute projects, Java data types, and Java writable types. You must write Java UDTFs based on the mappings to ensure data type consistency.
MaxCompute data type | Java data type | Java writable data type |
---|---|---|
TINYINT | java.lang.Byte | ByteWritable |
SMALLINT | java.lang.Short | ShortWritable |
INT | java.lang.Integer | IntWritable |
BIGINT | java.lang.Long | LongWritable |
FLOAT | java.lang.Float | FloatWritable |
DOUBLE | java.lang.Double | DoubleWritable |
DECIMAL | java.math.BigDecimal | BigDecimalWritable |
BOOLEAN | java.lang.Boolean | BooleanWritable |
STRING | java.lang.String | Text |
VARCHAR | com.aliyun.odps.data.Varchar | VarcharWritable |
BINARY | com.aliyun.odps.data.Binary | BytesWritable |
DATETIME | java.util.Date | DatetimeWritable |
TIMESTAMP | java.sql.Timestamp | TimestampWritable |
INTERVAL_YEAR_MONTH | N/A | IntervalYearMonthWritable |
INTERVAL_DAY_TIME | N/A | IntervalDayTimeWritable |
ARRAY | java.util.List | N/A |
MAP | java.util.Map | N/A |
STRUCT | com.aliyun.odps.data.Struct | N/A |
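As a rough illustration of how the mappings apply to the Object[] that process receives, the following plain-Java sketch checks a value against the Java type listed for a MaxCompute type. TypeMappingSketch is a hypothetical helper. It covers only the rows of the table whose Java type ships with the JDK, and it assumes that SQL NULL is passed as null.

```java
import java.math.BigDecimal;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the MaxCompute SDK.
class TypeMappingSketch {
    // Subset of the mapping table: MaxCompute type name -> Java object type.
    static final Map<String, Class<?>> JAVA_TYPES = new LinkedHashMap<>();
    static {
        JAVA_TYPES.put("TINYINT", Byte.class);
        JAVA_TYPES.put("SMALLINT", Short.class);
        JAVA_TYPES.put("INT", Integer.class);
        JAVA_TYPES.put("BIGINT", Long.class);
        JAVA_TYPES.put("FLOAT", Float.class);
        JAVA_TYPES.put("DOUBLE", Double.class);
        JAVA_TYPES.put("DECIMAL", BigDecimal.class);
        JAVA_TYPES.put("BOOLEAN", Boolean.class);
        JAVA_TYPES.put("STRING", String.class);
        JAVA_TYPES.put("DATETIME", Date.class);
        JAVA_TYPES.put("ARRAY", List.class);
        JAVA_TYPES.put("MAP", Map.class);
    }

    // Returns true if the value is null (SQL NULL) or an instance of the mapped Java type.
    static boolean matches(String maxComputeType, Object value) {
        Class<?> expected = JAVA_TYPES.get(maxComputeType.toUpperCase(Locale.ROOT));
        if (expected == null) {
            throw new IllegalArgumentException("type not covered by this sketch: " + maxComputeType);
        }
        return value == null || expected.isInstance(value);
    }

    public static void main(String[] argv) {
        System.out.println(matches("bigint", 1L)); // a Long matches BIGINT
        System.out.println(matches("bigint", 1));  // an Integer does not match BIGINT
    }
}
```

A check like this can help when debugging locally, for example to catch an Integer being cast to Long for a BIGINT column before a ClassCastException occurs.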
Instructions
- Use a UDF in a MaxCompute project: The method is similar to that of using built-in functions.
- Use a UDF across projects: Use a UDF of Project B in Project A. The following statement shows an example:
  select B:udf_in_other_project(arg0, arg1) as res from table_t;
  For more information about resource sharing across projects, see Package-based resource sharing across projects.
For more information about how to use MaxCompute Studio to develop and call a Java UDTF, see Example.
Example
This example describes how to use MaxCompute Studio to develop and call a Java UDTF.
- Make the following preparations on IntelliJ IDEA:
- Write UDTF code.
- In the left-side navigation pane of the Project tab, choose , right-click java, and then choose .
- In the Create new MaxCompute java class dialog box, click UDTF, enter a name in the Name field, and then press Enter. In this example, the Java class is named MyUDTF.
Name is the name of the MaxCompute Java class. If no package is created, enter packagename.classname. The system automatically creates a package.
- Write code in the code editor.
The following example shows the UDTF code.
package org.alidata.odps.udtf.examples;

import com.aliyun.odps.udf.UDTF;
import com.aliyun.odps.udf.UDTFCollector;
import com.aliyun.odps.udf.annotation.Resolve;
import com.aliyun.odps.udf.UDFException;

// TODO define input and output types, e.g., "string,string->string,bigint".
@Resolve("string,bigint->string,bigint")
public class MyUDTF extends UDTF {
    @Override
    public void process(Object[] args) throws UDFException {
        String a = (String) args[0];
        Long b = (Long) args[1];
        for (String t : a.split("\\s+")) {
            forward(t, b);
        }
    }
}
- Debug the UDTF on your on-premises machine to ensure that the code can run successfully.
For more information about debugging operations, see Perform a local run to debug the UDF.
Note: You can configure parameters based on the values in the preceding figure.
- Package the created UDTF into a JAR file, upload the file to your MaxCompute project, and then register the UDTF. In this example, the function name is user_udtf.
For more information about how to package a UDTF, see Package the code.
- In the left-side navigation pane of MaxCompute Studio, click Project Explorer. Right-click your MaxCompute project, select Open in Console from the drop-down list
to start the MaxCompute client, and then execute the SQL statement to call the new
UDTF.
The following example shows the data structure of the my_table table that you want to query.
+------------+------------+
| col0       | col1       |
+------------+------------+
| A B        | 1          |
| C D        | 2          |
+------------+------------+
Execute the following SQL statement to call the UDTF:
select user_udtf(col0, col1) as (c0, c1) from my_table;
The following result is returned:
+----+------------+
| c0 | c1         |
+----+------------+
| A  | 1          |
| B  | 1          |
| C  | 2          |
| D  | 2          |
+----+------------+
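The returned rows can be cross-checked without a MaxCompute project by running the same splitting logic in plain Java. SplitCheck is a hypothetical class used only for this check. It mirrors the body of process in MyUDTF but collects rows in a list instead of calling forward.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical local check; mirrors process() in MyUDTF without the MaxCompute SDK.
class SplitCheck {
    // One output row per whitespace-separated token in col0, each paired with col1.
    static List<Object[]> rows(String col0, Long col1) {
        List<Object[]> out = new ArrayList<>();
        for (String t : col0.split("\\s+")) {
            out.add(new Object[] {t, col1});
        }
        return out;
    }

    public static void main(String[] argv) {
        List<Object[]> all = new ArrayList<>();
        all.addAll(rows("A B", 1L)); // first row of my_table
        all.addAll(rows("C D", 2L)); // second row of my_table
        for (Object[] r : all) {
            System.out.println(r[0] + " | " + r[1]);
        }
    }
}
```

Each input row of my_table yields two rows here, which matches the four rows in the query result.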