This topic describes how to write and use a user-defined function (UDF).
Background information
StarRocks 2.2.0 and later support UDFs written in Java.
StarRocks 3.0 and later support global UDFs. Simply add the GLOBAL keyword to SQL statements, such as CREATE, SHOW, and DROP, to make the SQL statements globally take effect. You do not need to execute the SQL statements for each database one by one. You can configure UDFs based on your business scenarios to extend the function capabilities of StarRocks.
StarRocks supports the following types of UDFs:
User-defined scalar functions (scalar UDFs)
User-defined aggregate functions (UDAFs)
User-defined window functions (UDWFs)
User-defined table-valued functions (UDTFs)
Prerequisites
You must make sure that the following requirements are met before you use the Java UDF feature of StarRocks:
Install Apache Maven for creating and writing related Java projects.
Install Java Development Kit (JDK) 1.8 on the server.
Enable the UDF feature. In the FE section on the Instance Configuration tab of the details page of your EMR Serverless StarRocks instance, set the
enable_udfparameter toTRUEand then restart the instance for the configuration to take effect.
Mappings of data types
SQL type | Java type |
BOOLEAN | java.lang.Boolean |
TINYINT | java.lang.Byte |
SMALLINT | java.lang.Short |
INT | java.lang.Integer |
BIGINT | java.lang.Long |
FLOAT | java.lang.Float |
DOUBLE | java.lang.Double |
STRING/VARCHAR | java.lang.String |
Develop and use a UDF
You must create a Maven project and write the corresponding features in Java.
Step 1: Create a Maven project
Create a Maven project with the following basic directory structure:
project
|--pom.xml
|--src
| |--main
| | |--java
| | |--resources
| |--test
|--targetStep 2: Add dependencies
Add the following dependencies to the pom.xml file:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>udf</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<maven.compiler.source>8</maven.compiler.source>
<maven.compiler.target>8</maven.compiler.target>
</properties>
<dependencies>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.76</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-dependency-plugin</artifactId>
<version>2.10</version>
<executions>
<execution>
<id>copy-dependencies</id>
<phase>package</phase>
<goals>
<goal>copy-dependencies</goal>
</goals>
<configuration>
<outputDirectory>${project.build.directory}/lib</outputDirectory>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>3.3.0</version>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
</plugin>
</plugins>
</build>
</project>Step 3: Develop a UDF
You must develop a UDF in Java.
Develop a scalar UDF
You can use scalar UDFs to perform operations on a single row of data and generate a single row of results. When you use scalar UDFs to query data, each row of data will eventually appear in the result set by row. Typical scalar UDFs include UPPER, LOWER, ROUND, and ABS.
The following example shows how to extract JSON data. For example, the value of a field in JSON data may be a JSON string instead of a JSON object. In this case, if you want to extract the JSON string, nested calling of the function GET_JSON_STRING in the SQL statement is required, which is GET_JSON_STRING(GET_JSON_STRING('{"key":"{\\"k0\\":\\"v0\\"}"}', "$.key"), "$.k0").
To simplify the SQL statement, you can develop a UDF to directly extract the JSON string, such as MY_UDF_JSON_GET('{"key":"{\\"k0\\":\\"v0\\"}"}', "$.key.k0").
package com.starrocks.udf.sample;
import com.alibaba.fastjson.JSONPath;
public class UDFJsonGet {
public final String evaluate(String jsonObj, String key) {
if (obj == null || key == null) return null;
try {
// The JSONPath library can be fully expanded, even if the value of a field is a JSON string.
return JSONPath.read(jsonObj, key).toString();
} catch (Exception e) {
return null;
}
}
}User-defined classes must implement the following method.
The data types of the request parameters and response parameters in the method must be the same as those declared in the CREATE FUNCTION statement in Step 6, and the data type mappings between parameters in the method and parameters declared in the CREATE FUNCTION statement must conform to Mappings of data types.
Method | Description |
TYPE1 evaluate(TYPE2, ...) | The |
Develop a UDAF
You can use UDAFs to perform operations on multiple rows of data and generate a single row of results. Typical UDAFs include SUM, COUNT, MAX, and MIN. These functions can be used to aggregate multiple rows of data in each GROUP BY group and generate a single row of results.
In the following example, the function MY_SUM_INT is used. Different from the built-in function SUM whose return value is of the BIGINT data type, the data types of the input parameters and response parameters of the function MY_SUM_INT are INT.
package com.starrocks.udf.sample;
public class SumInt {
public static class State {
int counter = 0;
public int serializeLength() { return 4; }
}
public State create() {
return new State();
}
public void destroy(State state) {
}
public final void update(State state, Integer val) {
if (val != null) {
state.counter+= val;
}
}
public void serialize(State state, java.nio.ByteBuffer buff) {
buff.putInt(state.counter);
}
public void merge(State state, java.nio.ByteBuffer buffer) {
int val = buffer.getInt();
state.counter += val;
}
public Integer finalize(State state) {
return state.counter;
}
}User-defined classes must implement the following methods.
The data types of the input parameters and response parameters in the methods must be the same as those declared in the CREATE FUNCTION statement in Step 6, and the data type mappings between parameters in the methods and parameters declared in the CREATE FUNCTION statement must conform to Mappings of data types.
Method | Description |
State create() | Create a State. |
void destroy(State) | Destroy a State. |
void update(State, ...) | Update a State. The first parameter is State, and the remaining ones are the input parameters declared by the function. You can specify either a single input parameter or multiple ones. |
void serialize(State, ByteBuffer) | Serialize a State. |
void merge(State, ByteBuffer) | Merge and deserialize a State. |
TYPE finalize(State) | Use a State to obtain the final result of the function. |
When you develop a UDAF, you must use the buffer class java.nio.ByteBuffer to save and represent intermediate results and use the local variable serializeLength to specify the serialized length of the intermediate results.
Class and local variable | Description |
java.nio.ByteBuffer() | A buffer class used to save intermediate results. Due to the fact that intermediate results are serialized and deserialized when they are transmitted between different execution nodes, you must use serializeLength to specify the serialized length of the intermediate results. |
serializeLength() | The length of the intermediate results after serialization. Unit: bytes. The data type of serializeLength is fixed to INT. In the example, |
Take note of the following items related to java.nio.ByteBuffer when serialization is performed:
You cannot use the remaining() method of ByteBuffer to deserialize a State.
You cannot call the clear() method in ByteBuffer.
The value of
serializeLengthmust be consistent with the length of the data that is actually written. Otherwise, incorrect results may be obtained during serialization and deserialization.
Develop a UDWF
UDWF is the abbreviation for user-defined window function. Different from a common UDAF, a UDWF can be used to perform calculation for a set of rows (a window) and return a result for each row. In most cases, a UDWF contains an OVER clause that can be used to split data rows into multiple groups. The UDWF performs calculation based on the group (a window) in which each row of data resides and returns a result for each row.
In the following example, the function MY_WINDOW_SUM_INT is used. Different from the built-in function SUM whose return value is of the BIGINT data type, the data types of the input parameters and response parameters of the function MY_WINDOW_SUM_INT can be INT.
package com.starrocks.udf.sample;
public class WindowSumInt {
public static class State {
int counter = 0;
public int serializeLength() { return 4; }
@Override
public String toString() {
return "State{" +
"counter=" + counter +
'}';
}
}
public State create() {
return new State();
}
public void destroy(State state) {
}
public void update(State state, Integer val) {
if (val != null) {
state.counter+=val;
}
}
public void serialize(State state, java.nio.ByteBuffer buff) {
buff.putInt(state.counter);
}
public void merge(State state, java.nio.ByteBuffer buffer) {
int val = buffer.getInt();
state.counter += val;
}
public Integer finalize(State state) {
return state.counter;
}
public void reset(State state) {
state.counter = 0;
}
public void windowUpdate(State state,
int peer_group_start, int peer_group_end,
int frame_start, int frame_end,
Integer[] inputs) {
for (int i = (int)frame_start; i < (int)frame_end; ++i) {
state.counter += inputs[i];
}
}
}User-defined classes must implement the methods required by UDAFs and the windowUpdate() method. UDWFs are special UDAFs.
The data types of the request parameters and response parameters in the method must be the same as those declared in the CREATE FUNCTION statement in Step 6, and the data type mappings between parameters in the method and parameters declared in the CREATE FUNCTION statement must conform to Mappings of data types.
Other methods that need to be implemented
Method | Description |
| Update the window data. For more information about UDWFs, see Window_function. When you enter a row of data, the corresponding window information is obtained to update the intermediate result.
|
Develop a UDTF
A UDTF can be used to read a row of data and generate multiple values that can be considered as a table. UDTFs are commonly used to implement row-to-column conversion.
UDTFs support only returning of multiple rows in a single column.
In the following example, the function MY_UDF_SPLIT is used. The function MY_UDF_SPLIT supports splitting strings by using a space as the delimiter. The data types of the input parameters and response parameters are STRING.
package com.starrocks.udf.sample;
public class UDFSplit{
public String[] process(String in) {
if (in == null) return null;
return in.split(" ");
}
}User-defined classes must implement the following method.
The data types of the request parameters and response parameters in the method must be the same as those declared in the CREATE FUNCTION statement in Step 6, and the data type mappings between parameters in the method and parameters declared in the CREATE FUNCTION statement must conform to Mappings of data types.
Method | Description |
TYPE[] process() | The |
Step 4: Package the Java project
Run the following command to package the Java project:
mvn packageTwo files are generated in the target directory: udf-1.0-SNAPSHOT.jar and udf-1.0-SNAPSHOT-jar-with-dependencies.jar.
Step 5: Upload the project
Upload the file udf-1.0-SNAPSHOT-jar-with-dependencies.jar to Object Storage Service (OSS) and grant the public read permissions on the JAR package. For more information, see Simple upload and Bucket ACLs.
In Step 6, the frontend (FE) node verifies the JAR package of the UDF and calculates the verification value. The backend (BE) node downloads the JAR package of the UDF and executes the package.
Step 6: Create a UDF in StarRocks
StarRocks provides UDFs for both database- and global-level namespaces.
If you do not have special requirements on the visibility of UDFs, you can create a global UDF. When you reference a global UDF, you can directly call the function name without the need to use a catalog and a database as a prefix, which is more convenient for access.
If you have special requirements on the visibility of UDFs or you want to create UDFs with the same name in different databases, you can create a UDF in a database. In this case, if your session is in the database to which the UDF belongs, you can directly call the function name. If your session is in another catalog or database, you must include the catalog and database as a prefix, such as
catalog.database.function.
To create a global UDF, you must have the permissions to execute the system-level CREATE GLOBAL FUNCTION statement. To create a database-level UDF, you must have the permissions to execute the database-level CREATE FUNCTION statement. To use a UDF, you must have the USAGE permission on the UDF. For information about how to grant the required permissions, see GRANT.
After you upload the JAR package, you must create the corresponding UDF in StarRocks based on your business requirements. To create a global UDF, you need to only include the GLOBAL keyword in the SQL statement.
Syntax
CREATE [GLOBAL][AGGREGATE | TABLE] FUNCTION function_name(arg_type [, ...])
RETURNS return_type
[PROPERTIES ("key" = "value" [, ...]) ]Parameters
Parameter | Required | Description |
GLOBAL | No | If you want to create a global UDF, you must specify this keyword. This parameter is supported in StarRocks 3.0 and later. |
AGGREGATE | No | If you want to create UDAFs or UDWFs, you must specify this keyword. |
TABLE | No | If you want to create UDTFs, you must specify this keyword. |
function_name | Yes | The function name, which can contain a database name, such as |
arg_type | Yes | The data type of the parameters in the function. For information about the supported data types, see Mappings of data types. |
return_type | Yes | The data type of the return value in the function. For information about the supported data types, see Mappings of data types. |
properties | Yes | The properties of the function. You must configure the properties of the various types of UDFs that you create. For more information and examples, see the following sections. |
Create a scalar UDF
Run the following commands to create a scalar UDF in StarRocks:
CREATE [GLOBAL] FUNCTION MY_UDF_JSON_GET(string, string)
RETURNS string
PROPERTIES (
"symbol" = "com.starrocks.udf.sample.UDFJsonGet",
"type" = "StarrocksJar",
"file" = "http://<YourBucketName>.oss-cn-xxxx-internal.aliyuncs.com/<YourPath>/udf-1.0-SNAPSHOT-jar-with-dependencies.jar"
);Parameter | Description |
symbol | The class name of the project to which the UDF belongs. The class is named in the |
type | The UDF type. Set the value to |
file | The HTTP path of the JAR package to which the UDF belongs. Set the value to the HTTP URL corresponding to the internal endpoint in OSS. The format is |
Create a UDAF
Run the following commands to create a UDAF in StarRocks:
CREATE [GLOBAL] AGGREGATE FUNCTION MY_SUM_INT(INT)
RETURNS INT
PROPERTIES
(
"symbol" = "com.starrocks.udf.sample.SumInt",
"type" = "StarrocksJar",
"file" = "http://<YourBucketName>.oss-cn-xxxx-internal.aliyuncs.com/<YourPath>/udf-1.0-SNAPSHOT-jar-with-dependencies.jar"
);The descriptions of the parameters in PROPERTIES are the same as those in properties in Create a scalar UDF.
Create a UDWF
Run the following commands to create the UDWF that is used in Step 3 in StarRocks:
CREATE [GLOBAL] AGGREGATE FUNCTION MY_WINDOW_SUM_INT(Int)
RETURNS Int
PROPERTIES
(
"analytic" = "true",
"symbol" = "com.starrocks.udf.sample.WindowSumInt",
"type" = "StarrocksJar",
"file" = "http://<YourBucketName>.oss-cn-xxxx-internal.aliyuncs.com/<YourPath>/udf-1.0-SNAPSHOT-jar-with-dependencies.jar"
);analytic: Specifies whether the created function is a window function. Set the value to true. The descriptions of the remaining parameters are the same as those described in Create a scalar UDF.
Create a UDTF
Run the following commands to create the UDTF that is used in Step 3 in StarRocks:
CREATE [GLOBAL] TABLE FUNCTION MY_UDF_SPLIT(string)
RETURNS string
PROPERTIES
(
"symbol" = "com.starrocks.udf.sample.UDFSplit",
"type" = "StarrocksJar",
"file" = "http://<YourBucketName>.oss-cn-xxxx-internal.aliyuncs.com/<YourPath>/udf-1.0-SNAPSHOT-jar-with-dependencies.jar"
);The descriptions of the parameters in PROPERTIES are the same as those in properties in Create a scalar UDF.
Step 7: Use the UDF
After you create the UDF, you can test and use the UDF that you developed.
Use the scalar UDF
Run the following command to use the scalar UDF created in Step 6:
SELECT MY_UDF_JSON_GET('{"key":"{\\"in\\":2}"}', '$.key.in');Use the UDAF
Run the following command to use the UDAF created in Step 6:
SELECT MY_SUM_INT(col1);Use the UDWF
Run the following command to use the UDWF created in Step 6:
SELECT MY_WINDOW_SUM_INT(intcol)
OVER (PARTITION BY intcol2
ORDER BY intcol3
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM test_basic;Use the UDTF
Run the following commands to use the UDTF created in Step 6:
-- Assume that table t1 that contains columns a, b, and c1 exists.
SELECT t1.a,t1.b,t1.c1 FROM t1;
> output:
1,2.1,"hello world"
2,2.2,"hello UDTF."
-- Use the MY_UDF_SPLIT() function.
SELECT t1.a,t1.b, MY_UDF_SPLIT FROM t1, MY_UDF_SPLIT(t1.c1);
> output:
1,2.1,"hello"
1,2.1,"world"
2,2.2,"hello"
2,2.2,"UDTF."The first
MY_UDF_SPLITis the column alias generated after the functionMY_UDF_SPLITis called.You cannot use the
AS t2(f1)method to specify the table alias and column alias for a table that is returned by a UDTF.
View the UDF information
Run the following command to view the UDF information:
SHOW [GLOBAL] FUNCTIONS;Delete a UDF
Run the following command to delete the specified UDF:
DROP [GLOBAL] FUNCTION <function_name>(arg_type [, ...]);FAQ
Q: Can I use static variables when I develop a UDF? Do static variables of different UDFs affect each other?
A: You can use static variables when you develop a UDF. Static variables of different UDFs are isolated from each other and do not affect each other even if the UDFs have the same class name.