DataWorks: Create an EMR function

Last Updated: Mar 26, 2026

Register a custom Java user-defined function (UDF) in DataWorks DataStudio to extend the SQL capabilities of your E-MapReduce (EMR) cluster. After registration, call the function by name in EMR SQL statements.

Prerequisites

Before you begin, ensure that you have:

  • An Alibaba Cloud EMR cluster with the following inbound rule added to its security group:

    Field                   Value
    Action                  Allow
    Protocol type           Custom TCP
    Port range              8898/8898
    Authorization object    100.104.0.0/16
    This security group rule allows DataWorks to connect to your EMR cluster. Contact your cluster administrator if you do not have permission to modify security group rules.
  • An EMR compute engine instance associated with your workspace. The EMR folder appears in DataStudio only after you complete this association on the Workspace Management page. For more information, see Create and manage workspaces.

  • The required JAR resources uploaded to DataWorks. For more information, see Create and use an EMR resource.
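
The security group rule in the first prerequisite authorizes the 100.104.0.0/16 range on port 8898. If you want to confirm that a given source address falls inside that range when reviewing rules or connection logs, the check can be sketched as below. This is a local helper only and is not part of DataWorks or EMR; the class name CidrCheck and the sample addresses are hypothetical.

```java
final class CidrCheck {
    /** Parses a dotted-quad IPv4 literal into its 32-bit value (held in a long). */
    static long toLong(String ipv4) {
        String[] parts = ipv4.trim().split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + ipv4);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("Octet out of range in: " + ipv4);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    /** Returns true if ip lies inside the CIDR block, for example 100.104.0.0/16. */
    static boolean inCidr(String ip, String cidr) {
        String[] pieces = cidr.split("/");
        if (pieces.length != 2) {
            throw new IllegalArgumentException("Not a CIDR block: " + cidr);
        }
        int prefix = Integer.parseInt(pieces[1]);
        if (prefix < 0 || prefix > 32) {
            throw new IllegalArgumentException("Prefix out of range in: " + cidr);
        }
        // Keep the top `prefix` bits; a /0 block matches every address.
        long mask = prefix == 0 ? 0L : (0xFFFFFFFFL << (32 - prefix)) & 0xFFFFFFFFL;
        return (toLong(ip) & mask) == (toLong(pieces[0]) & mask);
    }

    public static void main(String[] args) {
        // Hypothetical sample addresses, checked against the range from the rule.
        System.out.println(inCidr("100.104.3.7", "100.104.0.0/16"));
        System.out.println(inCidr("100.105.0.1", "100.104.0.0/16"));
    }
}
```

The first address shares the leading 16 bits with 100.104.0.0 and is reported as inside the range; the second differs in the second octet and is not.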

Create a function

Step 1: Open DataStudio

Log on to the DataWorks console. In the top navigation bar, select a region. In the left-side navigation pane, choose Data Development and Governance > Data Development. Select your workspace from the drop-down list and click Go to Data Development.

Step 2: Create a workflow

Create an auto triggered workflow to organize your function node. For more information, see Create an auto triggered workflow.

Step 3: Prepare your JAR package

Write your UDF logic in Java, package it as a JAR file, and upload it as a JAR resource to DataWorks.

The following is a minimal Java UDF example that converts a string to lowercase:

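import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hive locates the evaluate() method by reflection; Text is the Hadoop type for STRING values.
public class ToLowerUDF extends UDF {
    public Text evaluate(Text input) {
        // Pass SQL NULL through instead of throwing a NullPointerException.
        if (input == null) return null;
        return new Text(input.toString().toLowerCase());
    }
}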

Package the class as a JAR file using Maven:

mvn clean package -DskipTests

Then upload the resulting JAR file as a JAR resource to DataWorks. For more information, see Create and use an EMR resource.
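
Before packaging and uploading, it can help to exercise the core logic on its own, without Hive or Hadoop on the classpath. The sketch below mirrors the null handling and lowercase conversion of ToLowerUDF.evaluate using plain String values. The class name ToLowerLogic is hypothetical and is not something DataWorks requires.

```java
import java.util.Arrays;

final class ToLowerLogic {
    /** Same rule as ToLowerUDF.evaluate: null in, null out; otherwise lowercase. */
    static String toLower(String input) {
        if (input == null) {
            return null;
        }
        return input.toLowerCase();
    }

    public static void main(String[] args) {
        // Arrays.asList permits null elements, so the null case is covered too.
        for (String sample : Arrays.asList("Hello", "EMR", null)) {
            System.out.println(toLower(sample));
        }
    }
}
```

Note that String.toLowerCase() with no arguments uses the JVM default locale, so results for some characters can differ between environments. If locale-independent output matters, passing java.util.Locale.ROOT is one option.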

Step 4: Create the function node

  1. In the Business Flow section, click your workflow to expand it. Right-click EMR and choose Create Function.

  2. In the Create Function dialog box, set Name, Engine Instance, and Path, then click Create.

  3. In the Function information section on the configuration tab, set the following parameters.

    Parameter                Description
    Function Type            The category of the function. Valid values: Mathematical Operation Functions, Aggregate Functions, String Processing Functions, Date Functions, Window Functions, and Other Functions. Choose based on what your UDF computes: for example, select String Processing Functions for text transformation logic.
    Engine Instance          The EMR cluster associated with your workspace. By default, you cannot change the engine instance.
    Engine Type              The compute engine type. By default, you cannot change the engine type.
    EMR database             The database in the EMR cluster where the function is registered. Select one from the drop-down list. To create a database, click New Library, fill in the required fields, and click OK.
    Function Name            The name used to call the function in SQL statements. Must be globally unique and cannot be changed after creation. For example: to_lower.
    Owner                    Set automatically.
    Class Name               Required. The name of the class that implements the function.
    Resource                 Required. The JAR resource that contains your UDF class. Select a resource from the current workspace, or click Create Resource to upload a new one.
    Description              A description of the function. For example: Converts a string to lowercase.
    Expression Syntax        The call syntax of the function. For example: to_lower(col_name).
    Parameter Description    Descriptions of the input and output parameters. For example: input: STRING col_name, output: STRING.
    Return Value             Optional. The return value. For example: 1.
    Example                  Optional. A sample SQL statement that shows how to call the function. For example: SELECT to_lower(name) FROM my_table LIMIT 10;

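
Several of the parameters above (EMR database, Function Name, Class Name, and Resource) correspond to the parts of a Hive CREATE FUNCTION statement. The sketch below assembles such a statement to make that mapping concrete. It is an illustration only: DataWorks registers the function for you, and whether it issues this exact statement internally is an assumption. The helper class RegisterStatement, the database name my_db, and the JAR URI are hypothetical; the class name ToLowerUDF matches the earlier example, which declares no package.

```java
final class RegisterStatement {
    /**
     * Builds a Hive statement of the form
     * CREATE FUNCTION db.name AS 'class' USING JAR 'uri'.
     * Inputs are assumed to be trusted identifiers; no escaping is performed.
     */
    static String createFunction(String database, String functionName,
                                 String className, String jarUri) {
        return String.format("CREATE FUNCTION %s.%s AS '%s' USING JAR '%s'",
                database, functionName, className, jarUri);
    }

    public static void main(String[] args) {
        // Hypothetical values: my_db and the HDFS path are placeholders.
        System.out.println(createFunction(
                "my_db", "to_lower", "ToLowerUDF",
                "hdfs:///tmp/udfs/to-lower-udf.jar"));
    }
}
```

If your UDF class is declared in a package, Hive expects the fully qualified class name in the AS clause.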

Step 5: Save the function

Click the Save icon in the top toolbar to save your configuration.

Step 6: Commit the function

  1. Click the Commit icon in the top toolbar.

    You must select a resource group for scheduling when you commit the EMR function. We recommend that you use an exclusive resource group for scheduling. If no exclusive resource groups are available, you can purchase and configure one. For more information, see Create and use an exclusive resource group for scheduling.
  2. In the Commit Node dialog box, enter your comments in the Change description field.

  3. Click OK.
