Apache Hive provides many built-in functions for data processing. When built-in functions don't cover your logic—such as custom string transformations, data encryption, or domain-specific calculations—write a user-defined function (UDF) instead.
UDF types
Apache Hive supports three types of UDFs:
| Type | Full name | Behavior |
|---|---|---|
| UDF | User-defined scalar function | One-to-one mapping: reads one row, returns one value |
| UDTF | User-defined table-valued function | Returns multiple rows per input; the only type that can return multiple fields |
| UDAF | User-defined aggregate function | Many-to-one mapping: aggregates multiple rows into a single output value; used with GROUP BY |
Prerequisites
Before you begin, make sure you have:
An E-MapReduce (EMR) cluster with SSH access. See Log on to a cluster
Java Development Kit (JDK) installed
Apache Maven installed and configured
An integrated development environment (IDE) for Java development
Develop UDF code
This section walks through creating a simple UDF that appends :HelloWorld to a string input.
1. Create a Maven project
In your IDE, create a new Maven project with the following coordinates. Adjust groupId and artifactId to match your organization and project name.
<groupId>org.example</groupId>
<artifactId>hiveudf</artifactId>
<version>1.0-SNAPSHOT</version>2. Add the Hive dependency
Add the following dependency to your pom.xml:
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>2.3.7</version>
<exclusions>
<exclusion>
<groupId>org.pentaho</groupId>
<artifactId>*</artifactId>
</exclusion>
</exclusions>
</dependency>3. Implement the UDF class
Create a class that extends org.apache.hadoop.hive.ql.exec.UDF and implement the evaluate() method. The class name can be anything; this example uses MyUDF.
package org.example;
import org.apache.hadoop.hive.ql.exec.UDF;
public class MyUDF extends UDF {
public String evaluate(final String s) {
if (s == null) { return null; }
return s + ":HelloWorld";
}
}4. Build the JAR file
In the directory containing pom.xml, run:
mvn clean package -DskipTestsThe output JAR file hiveudf-1.0-SNAPSHOT.jar appears in the target directory.
Deploy and register the UDF
1. Transfer the JAR to your cluster
Use SSH Secure File Transfer Client to upload hiveudf-1.0-SNAPSHOT.jar to the root directory of your EMR cluster.
2. Upload the JAR to HDFS
Log on to your cluster in SSH mode.
Upload the JAR to Hadoop Distributed File System (HDFS):
hadoop fs -put hiveudf-1.0-SNAPSHOT.jar /user/hive/warehouse/Verify the upload succeeded:
hadoop fs -ls /user/hive/warehouse/Expected output:
Found 1 items -rw-r--r-- 1 xx xx 2668 2021-06-09 14:13 /user/hive/warehouse/hiveudf-1.0-SNAPSHOT.jar
3. Register the UDF in Hive
Open the Hive CLI:
hiveRegister the UDF as a permanent function:
create function myfunc as "org.example.MyUDF" using jar "hdfs:///user/hive/warehouse/hiveudf-1.0-SNAPSHOT.jar";In this command:
| Parameter | Description |
|---|---|
myfunc | The function name used in queries |
org.example.MyUDF | The fully qualified class name from the JAR |
hdfs:///user/hive/warehouse/hiveudf-1.0-SNAPSHOT.jar | The HDFS path to the JAR |
If registration succeeds, the output is similar to:
Added [/private/var/folders/2s/wzzsgpn13rn8rl_0fc4xxkc00000gp/T/40608d4a-a0e1-4bf5-92e8-b875fa6a1e53_resources/hiveudf-1.0-SNAPSHOT.jar] to class path
Added resources: [hdfs:///user/hive/warehouse/myfunc/hiveudf-1.0-SNAPSHOT.jar]4. Test the UDF
Call the UDF in a query the same way as any built-in function:
select myfunc("abc");Expected output:
OK
abc:HelloWorldTo confirm the function is registered, run:
SHOW FUNCTIONS LIKE '*myfunc*';Troubleshooting
The `create function` command fails with a class not found error
Make sure the class name in the create function statement matches the fully qualified class name in your JAR, including the package prefix. For example, if your class is MyUDF in package org.example, the reference must be org.example.MyUDF, not MyUDF.
The UDF returns `null` for all inputs
Check the evaluate() method's null-handling logic. The example implementation returns null when the input is null. If your function should handle null differently, update the method accordingly.