
Data Lake Analytics - Deprecated: Manage Spark UDFs

Last Updated: Feb 07, 2024

This topic describes how to manage and use user-defined functions (UDFs) in the serverless Spark engine of Data Lake Analytics (DLA).

Important

DLA is discontinued. AnalyticDB for MySQL Data Lakehouse Edition provides the features of DLA, along with additional features and better performance. For more information about Spark SQL in AnalyticDB for MySQL, see Overview.

Manage UDFs by using the metadata service

  • Create a UDF

    The metadata service of the serverless Spark engine allows you to create UDFs based on Hive 1.2.1. The following sample code shows the syntax:

    CREATE FUNCTION function_name AS class_name USING resource_location_list;

    The parameters are described as follows:

    • function_name: the name of the UDF that you want to create. Before you create the UDF, execute the USE DatabaseName statement to specify the database in which the UDF is created. You can also explicitly specify the database.

    • class_name: the name of the UDF class. A complete class name must include the package information. For information about the format of a class name, see the development standards of Spark and Hive built-in functions.

    • resource_location_list: the directories in which the JAR packages or files required for creating the UDF are stored. You must use this parameter to explicitly specify the required JAR packages and files. Example: USING JAR 'oss://test/function.jar', FILE 'oss://test/model.csv'.
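
    For example, the following sketch creates a UDF in a specific database in the two ways described above. The function name and class name are placeholders, and the OSS paths are the same sample paths as in the preceding example.

    -- Specify the database first, then create the UDF.
    USE databasename;
    CREATE FUNCTION parse_model AS 'org.test.udf.ParseModel' USING JAR 'oss://test/function.jar', FILE 'oss://test/model.csv';
    -- Alternatively, explicitly specify the database in the function name.
    CREATE FUNCTION databasename.parse_model AS 'org.test.udf.ParseModel' USING JAR 'oss://test/function.jar', FILE 'oss://test/model.csv';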

  • Query all UDFs of a database

    USE databasename;
    SHOW USER FUNCTIONS;
    Note

    If you do not add the USER keyword, the default functions of the serverless Spark engine are also returned. The default functions cannot be dropped.
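
    The following sketch shows the difference between the two statements:

    USE databasename;
    -- Returns only the UDFs that are registered in the database.
    SHOW USER FUNCTIONS;
    -- Also returns the default functions of the serverless Spark engine.
    SHOW FUNCTIONS;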

  • Drop a UDF

    USE databasename;
    DROP FUNCTION functionname;
    Note

    The metadata service of DLA does not support ALTER statements for UDFs. To modify the metadata configurations of a UDF, drop the UDF and create it again.
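
    For example, to change the JAR package of a UDF, drop the UDF and register it again. The following is a sketch of this workflow; the function name, class name, and JAR path are placeholders.

    USE databasename;
    DROP FUNCTION functionname;
    CREATE FUNCTION functionname AS 'org.test.udf.NewImplementation' USING JAR 'oss://test/function_v2.jar';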

Use UDFs

  1. Develop a UDF.

    Initialize a Maven project and add the following dependency to the pom.xml file:

    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-exec</artifactId>
      <version>1.2.1</version>
    </dependency>

    Implement the UDF class. In this example, the Ten.java file is implemented in the org.test.udf package. The UDF returns the result of the input value plus 10.

    package org.test.udf;
    import org.apache.hadoop.hive.ql.exec.UDF;
    // Hive 1.2.1-style UDF that returns the input value plus 10.
    public class Ten extends UDF {
      public long evaluate(long value) {
        return value + 10;
      }
    }
  2. Create the UDF in the serverless Spark engine and use this UDF.

    Compile the code of the Maven project, package the code into a udf.jar file, and then upload the file to Object Storage Service (OSS). After the file is uploaded, you can create the UDF and execute SQL statements in the serverless Spark engine to access the UDF.

    -- here is the spark conf
    set spark.driver.resourceSpec=medium;
    set spark.executor.instances=5;
    set spark.executor.resourceSpec=medium;
    set spark.app.name=sparksqltest;
    set spark.sql.hive.metastore.version=dla;
    set spark.dla.connectors=oss;
    -- here is your sql statement
    use db;
    CREATE FUNCTION addten AS 'org.test.udf.Ten' USING JAR 'oss://path/to/your/udf.jar';
    select addten(7);
  3. Check the result.

    After the UDF is created, it is registered with the metadata service of the serverless Spark engine. You can invoke the UDF in subsequent Spark jobs without registering it again.

    In this example, the addten UDF is created in the db database. After a Spark job that invokes the UDF is run, you can view the result 17 in the logs of the job.
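
    For example, a later Spark job can call the UDF without creating it again. The following sketch reuses the configuration from the preceding example; only the application name is changed as a placeholder.

    -- here is the spark conf
    set spark.driver.resourceSpec=medium;
    set spark.executor.instances=5;
    set spark.executor.resourceSpec=medium;
    set spark.app.name=sparksqlreuse;
    set spark.sql.hive.metastore.version=dla;
    set spark.dla.connectors=oss;
    -- here is your sql statement
    use db;
    -- The addten UDF is already registered in the db database, so CREATE FUNCTION is not needed.
    select addten(100);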