
Spark UDFs

Last Updated: Jun 16, 2021

This topic describes how to manage and use user-defined functions (UDFs) in the serverless Spark engine of Data Lake Analytics (DLA).

Manage UDFs

  • Create a UDF

    The metadata service of the serverless Spark engine allows you to create UDFs based on Hive 1.2.1. Syntax:

    CREATE FUNCTION function_name AS class_name USING resource_location_list;

    Parameters:

    function_name: the name of the UDF that you want to create. Before you create the UDF, execute the USE DatabaseName statement to specify the database in which the UDF is created. You can also explicitly qualify the function name with the database name.

    class_name: the fully qualified name of the UDF class, including the package name.

    resource_location_list: the locations of the JAR packages or files that are required to create the UDF. You must explicitly specify the required JAR packages and files and their uniform resource identifiers (URIs), for example, USING JAR 'oss://test/function.jar', FILE 'oss://test/model.csv'.
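
    For example, the following statements create a UDF. The database name, function name, class name, and OSS paths are placeholders for illustration only:

    USE mydb;
    CREATE FUNCTION my_add AS 'org.example.udf.MyAdd' USING JAR 'oss://mybucket/udfs/my-add.jar';
    -- You can also qualify the function name with the database name:
    CREATE FUNCTION mydb.my_add AS 'org.example.udf.MyAdd' USING JAR 'oss://mybucket/udfs/my-add.jar';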

  • Query all UDFs of a database

    USE databasename;
    SHOW USER FUNCTIONS;
    Note

    If the USER keyword is omitted, the default functions of the serverless Spark engine are also returned. The default functions cannot be dropped.
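
    For comparison, the following statement returns the default functions as well:

    USE databasename;
    SHOW FUNCTIONS;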

  • Drop a UDF

    USE databasename;
    DROP FUNCTION functionname;
    Note

    The metadata service of DLA does not support ALTER statements for UDFs. To modify a UDF, drop it and create it again.
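
    For example, to change the JAR package or class of an existing UDF, drop the UDF and create it again. The following statements are a sketch; the function name, class name, and OSS path are placeholders:

    USE databasename;
    DROP FUNCTION my_add;
    CREATE FUNCTION my_add AS 'org.example.udf.MyAdd' USING JAR 'oss://mybucket/udfs/my-add-v2.jar';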

Use UDFs

  1. Develop a UDF.

    Initialize a Maven project and add the following dependency to the pom.xml file.

    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-exec</artifactId>
      <version>1.2.1</version>
    </dependency>

    Implement Ten.java in the org.test.udf package. The UDF returns the input value plus 10.

    package org.test.udf;
    import org.apache.hadoop.hive.ql.exec.UDF;
    // A Hive UDF that returns the input value plus 10.
    public class Ten extends UDF {
      // Hive calls evaluate() once for each input row.
      public long evaluate(long value) {
        return value + 10;
      }
    }
  2. Create and use the UDF in the serverless Spark engine.

    Compile and package the Maven project into a udf.jar file, and upload the file to OSS. Then, create the UDF and call it in SQL statements in the serverless Spark engine.

    -- here is the spark conf
    set spark.driver.resourceSpec=medium;
    set spark.executor.instances=5;
    set spark.executor.resourceSpec=medium;
    set spark.app.name=sparksqltest;
    set spark.sql.hive.metastore.version=dla;
    set spark.dla.connectors=oss;
    -- here is your sql statement
    use db;
    CREATE FUNCTION addten AS 'org.test.udf.Ten' USING JAR 'oss://path/to/your/udf.jar';
    select addten(7);
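
    You can also apply the UDF to table data. The following query is illustrative only; it assumes a table t with a BIGINT column c exists in the db database:

    select addten(c) from t;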
  3. Check the result.

    After you create the UDF, you can call it directly.

    In this example, the addten UDF is created in the db database. After the Spark job that calls the UDF runs, the value 17 (7 plus 10) appears in the job logs.