Hive provides many built-in functions. If the built-in functions cannot meet your computing requirements, you can create user-defined functions (UDFs). You can use UDFs in a similar way to common built-in functions. This topic describes how to develop and use UDFs.

Background information

The following table describes the types of UDFs.
UDF type Description
User-defined scalar function (UDF) The input and output of a UDF have a one-to-one mapping. Each time a UDF is called, it reads a single row of data and returns a single data value.
User-defined table-valued function (UDTF) Each time a UDTF is called, it returns multiple rows of data. Among all types of UDFs, only UDTFs can return multiple fields.
User-defined aggregate function (UDAF) The input and output of a UDAF have a many-to-one mapping. Each time a UDAF is called, it aggregates multiple input records into one output value. You can use a UDAF with the GROUP BY clause in an SQL statement.

Prerequisites

  • An E-MapReduce (EMR) cluster is created. For more information, see Create a cluster.
  • SSH Secure File Transfer Client is installed on your computer.

Develop UDF code

  1. Use an integrated development environment (IDE) to create a Maven project.
    The following code shows the basic information of the project. You can configure the groupId and artifactId parameters based on your business requirements.
    <groupId>org.example</groupId>
    <artifactId>hiveudf</artifactId>
    <version>1.0-SNAPSHOT</version>
  2. Add the following dependency to the pom.xml file:
    <dependency>
          <groupId>org.apache.hive</groupId>
          <artifactId>hive-exec</artifactId>
          <version>2.3.7</version>
          <exclusions>
             <exclusion>
               <groupId>org.pentaho</groupId>
               <artifactId>*</artifactId>
             </exclusion>
          </exclusions>
    </dependency>
  3. Create a class that inherits from the Hive UDF class.
    You can specify the class name. In this example, the class name is MyUDF.
    package org.example;
    
    import org.apache.hadoop.hive.ql.exec.UDF;
    
    /**
     * Hello world!
     *
     */
    public class MyUDF extends UDF
    {
        public String evaluate(final String s) {
            if (s == null) { return null; }
            return s + ":HelloWorld";
        }
    }
  4. Package the custom code into a JAR file.
    In the directory where the pom.xml file is stored, run the following command to create a JAR file:
    mvn clean package -DskipTests
    The hiveudf-1.0-SNAPSHOT.jar file appears in the target directory. The code development is complete.

Create and use a UDF

  1. Use SSH Secure File Transfer Client to upload the generated JAR file to the root directory of your EMR cluster.
  2. Upload the JAR file to HDFS.
    1. Log on to your EMR cluster in SSH mode. For more information, see Log on to a cluster.
    2. Run the following command to upload the JAR file to HDFS:
      hadoop fs -put hiveudf-1.0-SNAPSHOT.jar /user/hive/warehouse/
      You can run the hadoop fs -ls /user/hive/warehouse/ command to check whether the upload is successful. If the following information is returned, the upload is successful:
      Found 1 items
      -rw-r--r--   1 xx xx 2668 2021-06-09 14:13 /user/hive/warehouse/hiveudf-1.0-SNAPSHOT.jar
  3. Create a UDF.
    1. Run the following command to go to the Hive CLI:
      hive
    2. Run the following command to create a UDF based on the uploaded JAR file:
      create function myfunc as "org.example.MyUDF" using jar "hdfs:///user/hive/warehouse/hiveudf-1.0-SNAPSHOT.jar";

      In the command, myfunc is the name of the UDF, org.example.MyUDF is the class created in Develop UDF code, and dfs:///user/hive/warehouse/hiveudf-1.0-SNAPSHOT.jar is the HDFS path to which the JAR file is uploaded.

      If the following information is returned, the UDF is created:
      Added [/private/var/folders/2s/wzzsgpn13rn8rl_0fc4xxkc00000gp/T/40608d4a-a0e1-4bf5-92e8-b875fa6a1e53_resources/hiveudf-1.0-SNAPSHOT.jar] to class path
      Added resources: [hdfs:///user/hive/warehouse/myfunc/hiveudf-1.0-SNAPSHOT.jar]
      Note You can also run the SHOW FUNCTIONS LIKE '*myfunc*' command to check whether the UDF is created.
  4. Run the following command to use the UDF.
    You can specify the name of the UDF in the command to use the UDF in the same way as you use a built-in function.
    select myfunc("abc");
    The following information is returned:
    OK
    abc:HelloWorld

References