Build Java UDFs in StarRocks to Extend SQL Analytics - E-MapReduce

Java user-defined functions (UDFs) let you extend StarRocks with custom logic that built-in functions cannot express. EMR Serverless StarRocks supports four UDF types:

Type	What it does
Scalar UDF	Takes one row as input, returns one value. Equivalent to built-in functions like `UPPER` or `ROUND`.
UDAF (user-defined aggregate function)	Takes multiple rows as input, returns one value per group. Equivalent to built-in functions like `SUM` or `COUNT`.
UDWF (user-defined window function)	Operates over a window of rows defined by an `OVER` clause, returns one value per row.
UDTF (user-defined table-valued function)	Takes one row as input, returns multiple rows in a single column. Commonly used for row-to-column conversion.

StarRocks 2.2.0 and later support Java UDFs. StarRocks 3.0 and later support global UDFs — add the GLOBAL keyword to CREATE, SHOW, and DROP statements to make a UDF accessible across all databases without a catalog or database prefix.

Prerequisites

Before you begin, make sure you have:

Apache Maven installed (for building the Java project)
Java Development Kit (JDK) 1.8 installed on the server
The UDF feature enabled: on the Instance Configuration tab of your EMR Serverless StarRocks instance details page, go to the FE section, set enable_udf to TRUE, and restart the instance

Data type mappings

All parameter and return types in your Java class must map to a supported SQL type. The table below shows the supported mappings.

SQL type	Java type
BOOLEAN	java.lang.Boolean
TINYINT	java.lang.Byte
SMALLINT	java.lang.Short
INT	java.lang.Integer
BIGINT	java.lang.Long
FLOAT	java.lang.Float
DOUBLE	java.lang.Double
STRING/VARCHAR	java.lang.String
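
As an illustration of the mappings above, the sketch below shows a scalar UDF whose evaluate method would be declared in SQL as (INT, DOUBLE) returning BIGINT. The class name ScaleToLong and its rounding logic are hypothetical and not part of StarRocks. Every parameter and the return value use the boxed Java types from the table. As the null checks in the samples in this topic suggest, a SQL NULL is expected to arrive as a Java null, so the method checks for it before unboxing. The package declaration is omitted here; add your own.

```java
// Hypothetical scalar UDF: the SQL signature would be (INT, DOUBLE) -> BIGINT.
public class ScaleToLong {
    // INT -> Integer, DOUBLE -> Double, BIGINT -> Long (see the mapping table).
    public Long evaluate(Integer value, Double factor) {
        // Boxed types can be null; return null instead of unboxing a null.
        if (value == null || factor == null) {
            return null;
        }
        // Math.round(double) returns a long, which is boxed to Long on return.
        return Math.round(value * factor);
    }
}
```

Using the primitive types int or double in the signature would not match the table, which lists only the wrapper classes.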

Develop and deploy a UDF

The workflow has seven steps: create a Maven project, add dependencies, implement the UDF class, package the JAR, upload it to OSS, register the UDF in StarRocks, and call it in a query.

Step 1: Create a Maven project

Create a Maven project with the following directory structure:

project
|--pom.xml
|--src
|  |--main
|  |  |--java
|  |  |--resources
|  |--test
|--target

Step 2: Add dependencies

Add the following content to pom.xml:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>udf</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.76</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-dependency-plugin</artifactId>
                <version>2.10</version>
                <executions>
                    <execution>
                        <id>copy-dependencies</id>
                        <phase>package</phase>
                        <goals>
                            <goal>copy-dependencies</goal>
                        </goals>
                        <configuration>
                            <outputDirectory>${project.build.directory}/lib</outputDirectory>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.3.0</version>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

Step 3: Implement the UDF class

Scalar UDF

A scalar UDF must implement the evaluate method as a public member method. The method signature determines the SQL parameter and return types — they must match the types you declare in the CREATE FUNCTION statement (see Data type mappings).

Method	Description
`TYPE1 evaluate(TYPE2, ...)`	Invocation entry point. Must be a public member method.

The example below implements MY_UDF_JSON_GET, which extracts a nested JSON value using a dotted path expression. It replaces the nested GET_JSON_STRING(GET_JSON_STRING(...)) pattern with a single call: MY_UDF_JSON_GET('{"key":"{\\"k0\\":\\"v0\\"}"}', "$.key.k0").

package com.starrocks.udf.sample;
import com.alibaba.fastjson.JSONPath;

public class UDFJsonGet {
    public final String evaluate(String jsonObj, String key) {
        if (jsonObj == null || key == null) return null;
        try {
            // JSONPath.read fully expands nested JSON strings
            return JSONPath.read(jsonObj, key).toString();
        } catch (Exception e) {
            return null;
        }
    }
}

UDAF

A UDAF aggregates multiple rows per group into a single result. It uses a State inner class to hold intermediate results, which StarRocks serializes and deserializes when transmitting data between execution nodes.

Required methods — implement all six for every UDAF:

Method	Required	Description
`State create()`	Always	Allocate a new State object.
`void destroy(State)`	Always	Release resources held by the State.
`void update(State, ...)`	Always	Accumulate one input row into the State. The first parameter is State; the remaining parameters are the declared function inputs.
`void serialize(State, ByteBuffer)`	Always	Write the State into the buffer for inter-node transmission.
`void merge(State, ByteBuffer)`	Always	Deserialize a State from the buffer and merge it into the State passed as the first parameter.
`TYPE finalize(State)`	Always	Extract the final aggregate result from the State.

Intermediate state buffer — use java.nio.ByteBuffer to store intermediate results:

Item	Description
`java.nio.ByteBuffer`	Holds the serialized State during inter-node transmission.
`serializeLength()`	Returns the byte length of the serialized State (data type: INT). Must exactly match the number of bytes you write in `serialize`. For an `int` counter, return `4`; for a `long` counter, return `8`.

Warning

Do not call remaining() on the ByteBuffer to deserialize a State, and do not call clear() on it. If serializeLength does not match the bytes actually written in serialize, aggregation produces incorrect results.

The example below implements MY_SUM_INT, which takes INT values and returns an INT sum. The built-in SUM, by contrast, returns BIGINT for INT input:

package com.starrocks.udf.sample;

public class SumInt {
    public static class State {
        int counter = 0;
        public int serializeLength() { return 4; } // INT = 4 bytes
    }

    public State create() {
        return new State();
    }

    public void destroy(State state) {
    }

    public final void update(State state, Integer val) {
        if (val != null) {
            state.counter += val;
        }
    }

    public void serialize(State state, java.nio.ByteBuffer buff) {
        buff.putInt(state.counter);
    }

    public void merge(State state, java.nio.ByteBuffer buffer) {
        int val = buffer.getInt();
        state.counter += val;
    }

    public Integer finalize(State state) {
        return state.counter;
    }
}
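
When the State holds more than one field, serializeLength must cover all of them, and merge must read the fields in the same order that serialize writes them. The sketch below follows the same method set as SumInt for a hypothetical INT average. The class name AvgInt, the choice of a long sum plus an int count, and returning null for an empty group are assumptions made for illustration. The package declaration is omitted.

```java
import java.nio.ByteBuffer;

// Hypothetical UDAF: average of INT values, returned as DOUBLE.
public class AvgInt {
    public static class State {
        long sum = 0;   // 8 bytes when serialized
        int count = 0;  // 4 bytes when serialized
        // Must equal the bytes written in serialize(): 8 + 4 = 12.
        public int serializeLength() { return 12; }
    }

    public State create() {
        return new State();
    }

    public void destroy(State state) {
    }

    public void update(State state, Integer val) {
        if (val != null) {
            state.sum += val;
            state.count++;
        }
    }

    public void serialize(State state, ByteBuffer buff) {
        buff.putLong(state.sum);   // written first
        buff.putInt(state.count);  // written second
    }

    public void merge(State state, ByteBuffer buffer) {
        // Read in the same order as serialize() wrote.
        state.sum += buffer.getLong();
        state.count += buffer.getInt();
    }

    public Double finalize(State state) {
        // Assumption: an empty group yields NULL rather than 0.
        return state.count == 0 ? null : (double) state.sum / state.count;
    }
}
```

A quick local check before packaging is to serialize one State into a ByteBuffer allocated with serializeLength() bytes, confirm that the buffer position equals that length, then flip the buffer and merge it into a second State.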

UDWF

A UDWF (user-defined window function) is a special UDAF that returns one result per input row rather than one result per group. It uses an OVER clause to define the partition and window frame, and adds two methods, reset and windowUpdate, to the standard UDAF interface.

Implement all six UDAF methods plus the two methods below:

Method	Description
`void reset(State state)`	Reset the State when the window frame changes.
`void windowUpdate(State state, int peer_group_start, int peer_group_end, int frame_start, int frame_end, TYPE[] inputs)`	Update the State for the current row's window frame.

`windowUpdate` parameters:

Parameter	Description
`peer_group_start`	Start index of the current partition (rows sharing the same `PARTITION BY` key).
`peer_group_end`	End index of the current partition.
`frame_start`	Start index of the current window frame (e.g., `ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING`).
`frame_end`	End index of the current window frame.
`inputs`	Input column values for the window as a wrapper-class array. Use `Integer[]` for INT inputs.

The example below implements MY_WINDOW_SUM_INT, an INT window sum:

package com.starrocks.udf.sample;

public class WindowSumInt {
    public static class State {
        int counter = 0;
        public int serializeLength() { return 4; }
    }

    public State create() {
        return new State();
    }

    public void destroy(State state) {
    }

    public void update(State state, Integer val) {
        if (val != null) {
            state.counter += val;
        }
    }

    public void serialize(State state, java.nio.ByteBuffer buff) {
        buff.putInt(state.counter);
    }

    public void merge(State state, java.nio.ByteBuffer buffer) {
        int val = buffer.getInt();
        state.counter += val;
    }

    public Integer finalize(State state) {
        return state.counter;
    }

    public void reset(State state) {
        state.counter = 0;
    }

    public void windowUpdate(State state,
                            int peer_group_start, int peer_group_end,
                            int frame_start, int frame_end,
                            Integer[] inputs) {
        // frame_end is treated as exclusive; skip NULL inputs as update() does.
        for (int i = frame_start; i < frame_end; ++i) {
            if (inputs[i] != null) {
                state.counter += inputs[i];
            }
        }
    }
}

For more information about window function syntax, see Window functions.
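
To see how the frame indexes select rows, the following local simulation computes a sliding sum over one partition. It is a sketch only: the class WindowFrameDemo and its methods are hypothetical, and the code emulates a ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING frame by hand rather than reproducing how StarRocks schedules calls to reset and windowUpdate. It assumes the frame end index is exclusive, as the loop condition in the windowUpdate sample above implies, and it skips null inputs in the same way that update does.

```java
import java.util.Arrays;

// Local simulation only; not part of the UDF that you deploy.
public class WindowFrameDemo {
    // Sums inputs[frameStart, frameEnd), skipping nulls.
    public static int frameSum(Integer[] inputs, int frameStart, int frameEnd) {
        int sum = 0;
        for (int i = frameStart; i < frameEnd; ++i) {
            if (inputs[i] != null) {
                sum += inputs[i];
            }
        }
        return sum;
    }

    // Emulates ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING over one partition.
    public static int[] slidingSums(Integer[] inputs) {
        int[] out = new int[inputs.length];
        for (int row = 0; row < inputs.length; ++row) {
            int frameStart = Math.max(0, row - 1);
            int frameEnd = Math.min(inputs.length, row + 2); // exclusive
            out[row] = frameSum(inputs, frameStart, frameEnd);
        }
        return out;
    }

    public static void main(String[] args) {
        Integer[] partition = {1, 2, null, 4};
        // Prints [3, 3, 6, 4]
        System.out.println(Arrays.toString(slidingSums(partition)));
    }
}
```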

UDTF

A UDTF reads one input row and returns multiple rows, all in a single column. It must implement the process method, which returns an array.

Note

UDTFs support returning multiple rows in a single column only.

Method	Description
`TYPE[] process()`	Invocation entry point. Returns an array — each element becomes a separate output row.

The example below implements MY_UDF_SPLIT, which splits a string on spaces:

package com.starrocks.udf.sample;

public class UDFSplit {
    public String[] process(String in) {
        if (in == null) return null;
        return in.split(" ");
    }
}
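
Because String.split(" ") splits on every single space, an input such as "a  b" with two consecutive spaces yields an empty string between the two words, which would become an extra output row. The variant below is a hypothetical alternative, not part of the original sample: the class name UDFSplitWords is made up, and it trims the input and splits on runs of whitespace instead. The package declaration is omitted.

```java
// Hypothetical UDTF variant: one output element per whitespace-separated token.
public class UDFSplitWords {
    public String[] process(String in) {
        if (in == null) {
            return null;
        }
        String trimmed = in.trim();
        if (trimmed.isEmpty()) {
            // No tokens: an empty array has no elements to emit.
            return new String[0];
        }
        // "\\s+" treats any run of spaces, tabs, or newlines as one separator.
        return trimmed.split("\\s+");
    }
}
```

Whether you want this behavior depends on the data: if empty tokens are meaningful, for example in delimiter-separated fields, the single-character split in UDFSplit is the better fit.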

Step 4: Package the project

Run the following command to build the JAR:

mvn package

This generates two files in the target directory:

udf-1.0-SNAPSHOT.jar
udf-1.0-SNAPSHOT-jar-with-dependencies.jar

Step 5: Upload the JAR to OSS

Upload udf-1.0-SNAPSHOT-jar-with-dependencies.jar to an Object Storage Service (OSS) bucket and set the bucket ACL to allow public reads. For upload instructions, see Simple upload and Bucket ACLs.

Note

The frontend (FE) node verifies the JAR and computes its checksum. The backend (BE) node downloads and executes the JAR. The file property in Step 6 must use the OSS internal endpoint URL.

Step 6: Register the UDF in StarRocks

StarRocks supports two UDF namespaces: global and database-level.

Global UDF: Callable by name from any database without a catalog.database prefix. Use this for shared utility functions.
Database-level UDF: Callable by name within its own database. From a different database, use the catalog.database.function_name format. Use this when you need the same function name in multiple databases.

Required permissions:

Create a global UDF: system-level CREATE GLOBAL FUNCTION permission
Create a database-level UDF: database-level CREATE FUNCTION permission
Call a UDF: USAGE permission on the UDF

For permission setup, see GRANT.

Syntax

CREATE [GLOBAL] [AGGREGATE | TABLE] FUNCTION function_name(arg_type [, ...])
RETURNS return_type
PROPERTIES ("key" = "value" [, ...])

Parameters

Parameter	Required	Description
`GLOBAL`	No	Creates a global UDF. Supported in StarRocks 3.0 and later.
`AGGREGATE`	No	Required for UDAFs and UDWFs.
`TABLE`	No	Required for UDTFs.
`function_name`	Yes	The function name. Include a database name to create the UDF in a specific database (e.g., `db1.my_func`). A function with the same name and identical parameter types cannot be created twice in the same database; different parameter types are allowed.
`arg_type`	Yes	Parameter type(s). See Data type mappings.
`return_type`	Yes	Return type. See Data type mappings.
`PROPERTIES`	Yes	Function properties. See PROPERTIES parameters.

PROPERTIES parameters

Property	Required	Description
`symbol`	Yes	Fully qualified class name in `<package_name>.<class_name>` format.
`type`	Yes	Set to `StarrocksJar` for Java-based UDFs.
`file`	Yes	HTTP URL of the JAR using the OSS internal endpoint: `http://<YourBucketName>.oss-cn-xxxx-internal.aliyuncs.com/<YourPath>/<jar_package_name>`
`analytic`	No	Set to `true` for UDWFs. Not required for other UDF types.

Create a scalar UDF

CREATE [GLOBAL] FUNCTION MY_UDF_JSON_GET(string, string)
RETURNS string
PROPERTIES (
    "symbol" = "com.starrocks.udf.sample.UDFJsonGet",
    "type" = "StarrocksJar",
    "file" = "http://<YourBucketName>.oss-cn-xxxx-internal.aliyuncs.com/<YourPath>/udf-1.0-SNAPSHOT-jar-with-dependencies.jar"
);

Create a UDAF

CREATE [GLOBAL] AGGREGATE FUNCTION MY_SUM_INT(INT)
RETURNS INT
PROPERTIES (
    "symbol" = "com.starrocks.udf.sample.SumInt",
    "type" = "StarrocksJar",
    "file" = "http://<YourBucketName>.oss-cn-xxxx-internal.aliyuncs.com/<YourPath>/udf-1.0-SNAPSHOT-jar-with-dependencies.jar"
);

Create a UDWF

CREATE [GLOBAL] AGGREGATE FUNCTION MY_WINDOW_SUM_INT(INT)
RETURNS INT
PROPERTIES (
    "analytic" = "true",
    "symbol" = "com.starrocks.udf.sample.WindowSumInt",
    "type" = "StarrocksJar",
    "file" = "http://<YourBucketName>.oss-cn-xxxx-internal.aliyuncs.com/<YourPath>/udf-1.0-SNAPSHOT-jar-with-dependencies.jar"
);

Create a UDTF

CREATE [GLOBAL] TABLE FUNCTION MY_UDF_SPLIT(string)
RETURNS string
PROPERTIES (
    "symbol" = "com.starrocks.udf.sample.UDFSplit",
    "type" = "StarrocksJar",
    "file" = "http://<YourBucketName>.oss-cn-xxxx-internal.aliyuncs.com/<YourPath>/udf-1.0-SNAPSHOT-jar-with-dependencies.jar"
);
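
The four statements above differ only in the optional keywords, the signature, the symbol, and the analytic property. If you register many functions from a script, a small helper can assemble the statement text. The sketch below is one hypothetical way to do that: the class CreateFunctionSql and its build method are not part of StarRocks, and the helper performs no escaping or validation, so it assumes trusted input.

```java
// Hypothetical helper that assembles a CREATE FUNCTION statement as text.
public class CreateFunctionSql {
    // kind: "" for a scalar UDF, "AGGREGATE" for a UDAF or UDWF, "TABLE" for a UDTF.
    public static String build(boolean global, String kind, String name,
                               String argTypes, String returnType,
                               String symbol, String fileUrl, boolean analytic) {
        StringBuilder sql = new StringBuilder("CREATE ");
        if (global) {
            sql.append("GLOBAL ");
        }
        if (!kind.isEmpty()) {
            sql.append(kind).append(' ');
        }
        sql.append("FUNCTION ").append(name)
           .append('(').append(argTypes).append(")\n");
        sql.append("RETURNS ").append(returnType).append('\n');
        sql.append("PROPERTIES (\n");
        if (analytic) {
            // Only UDWFs set this property.
            sql.append("    \"analytic\" = \"true\",\n");
        }
        sql.append("    \"symbol\" = \"").append(symbol).append("\",\n");
        sql.append("    \"type\" = \"StarrocksJar\",\n");
        sql.append("    \"file\" = \"").append(fileUrl).append("\"\n");
        sql.append(");");
        return sql.toString();
    }

    public static void main(String[] args) {
        // The URL keeps the placeholders used in this topic; substitute your own values.
        System.out.println(build(false, "AGGREGATE", "MY_WINDOW_SUM_INT", "INT", "INT",
                "com.starrocks.udf.sample.WindowSumInt",
                "http://<YourBucketName>.oss-cn-xxxx-internal.aliyuncs.com/<YourPath>/udf-1.0-SNAPSHOT-jar-with-dependencies.jar",
                true));
    }
}
```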

Step 7: Call the UDF

Scalar UDF

SELECT MY_UDF_JSON_GET('{"key":"{\\"in\\":2}"}', '$.key.in');

UDAF

SELECT MY_SUM_INT(col1) FROM <YourTableName>;

UDWF

SELECT MY_WINDOW_SUM_INT(intcol)
    OVER (PARTITION BY intcol2
          ORDER BY intcol3
          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM test_basic;

UDTF

-- Assume table t1 has columns a, b, and c1
SELECT t1.a, t1.b, t1.c1 FROM t1;
-- Output:
-- 1, 2.1, "hello world"
-- 2, 2.2, "hello UDTF."

-- Split c1 into one word per row
SELECT t1.a, t1.b, MY_UDF_SPLIT FROM t1, MY_UDF_SPLIT(t1.c1);
-- Output:
-- 1, 2.1, "hello"
-- 1, 2.1, "world"
-- 2, 2.2, "hello"
-- 2, 2.2, "UDTF."

Note

MY_UDF_SPLIT in the SELECT list is the column alias generated when you call the function. You cannot use AS t2(f1) to assign a table alias or column alias to a UDTF result.

View UDFs

SHOW [GLOBAL] FUNCTIONS;

Delete a UDF

DROP [GLOBAL] FUNCTION <function_name>(arg_type [, ...]);

FAQ

Can I use static variables in a UDF? Do static variables from different UDFs affect each other?

Yes. Static variables are isolated per UDF class — they do not interfere with static variables from other UDF classes, even if two classes share the same name.
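
A common use of a static variable is to cache an object that is expensive to build, so that it is created once per class load instead of once per row. The sketch below shows this with a precompiled regular expression. The class name UDFIsDigits is hypothetical. This topic does not say how many threads may call a UDF class at the same time, so the example takes the conservative route and keeps the static field final and immutable. The package declaration is omitted.

```java
import java.util.regex.Pattern;

// Hypothetical scalar UDF: returns true when the input consists only of digits.
public class UDFIsDigits {
    // Compiled once when the class is loaded and reused by every call.
    // Pattern instances are immutable and safe to share between threads.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    public Boolean evaluate(String in) {
        if (in == null) {
            return null;
        }
        // A new Matcher is created per call; Matcher itself is not thread-safe.
        return DIGITS.matcher(in).matches();
    }
}
```

Mutable static state, such as a counter or a cache map, would need its own synchronization or a concurrent collection under the same assumption.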