Register a custom Java user-defined function (UDF) in DataWorks DataStudio to extend the SQL capabilities of your E-MapReduce (EMR) cluster. After registration, call the function by name in EMR SQL statements.
Prerequisites
Before you begin, ensure that you have:
An Alibaba Cloud EMR cluster with the following inbound rule added to its security group:
Field | Value
Action | Allow
Protocol type | Custom TCP
Port range | 8898/8898
Authorization object | 100.104.0.0/16
This security group rule allows DataWorks to connect to your EMR cluster. Contact your cluster administrator if you do not have permission to modify security group rules.
An EMR compute engine instance associated with your workspace. The EMR folder appears in DataStudio only after you complete this association on the Workspace Management page. For more information, see Create and manage workspaces.
The required JAR resources uploaded to DataWorks. For more information, see Create and use an EMR resource.
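To make the authorization object in the security group rule concrete, the following sketch checks whether an IPv4 address falls inside a CIDR block such as 100.104.0.0/16. It is only an illustration of what the /16 range covers (100.104.0.0 through 100.104.255.255). The class and method names (CidrCheck, contains, toInt) are hypothetical and are not part of DataWorks or EMR.

```java
// Hypothetical helper for illustration only; not part of any Alibaba Cloud SDK.
final class CidrCheck {

    // Converts a dotted-quad IPv4 string such as "100.104.0.0" to a 32-bit int.
    static int toInt(String ipv4) {
        String[] octets = ipv4.split("\\.");
        if (octets.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + ipv4);
        }
        int value = 0;
        for (String part : octets) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("Octet out of range: " + part);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    // Returns true if ipv4 lies inside the block written as "address/prefix".
    static boolean contains(String cidr, String ipv4) {
        String[] parts = cidr.split("/");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Not a CIDR block: " + cidr);
        }
        int prefix = Integer.parseInt(parts[1]);
        if (prefix < 0 || prefix > 32) {
            throw new IllegalArgumentException("Prefix out of range: " + prefix);
        }
        // A /16 prefix keeps the top 16 bits and ignores the rest.
        int mask = (prefix == 0) ? 0 : (-1 << (32 - prefix));
        return (toInt(parts[0]) & mask) == (toInt(ipv4) & mask);
    }

    public static void main(String[] args) {
        System.out.println(contains("100.104.0.0/16", "100.104.12.34"));
        System.out.println(contains("100.104.0.0/16", "100.105.0.1"));
    }
}
```

The first call reports an address inside the range and the second an address outside it, which is the distinction the security group applies to inbound traffic on port 8898.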
Create a function
Step 1: Open DataStudio
Log on to the DataWorks console. In the top navigation bar, select a region. In the left-side navigation pane, choose Data Development and Governance > Data Development. Select your workspace from the drop-down list and click Go to Data Development.
Step 2: Create a workflow
Create an auto triggered workflow to organize your function node. For more information, see Create an auto triggered workflow.
Step 3: Prepare your JAR package
Write your UDF logic in Java, package it as a JAR file, and upload it as a JAR resource to DataWorks.
The following is a minimal Java UDF example that converts a string to lowercase:
    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    public class ToLowerUDF extends UDF {
        public Text evaluate(Text input) {
            if (input == null) return null;
            return new Text(input.toString().toLowerCase());
        }
    }
Package the class as a JAR file using Maven:
    mvn clean package -DskipTests
Then upload the resulting JAR file as a JAR resource to DataWorks. For more information, see Create and use an EMR resource.
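Before packaging, it can help to exercise the conversion and null handling without Hive or Hadoop on the classpath. The sketch below mirrors the logic of the evaluate method on plain String values. The class name ToLowerLogic is hypothetical, and, unlike the example UDF, it passes Locale.ROOT to toLowerCase so that the result does not depend on the JVM default locale. Treat it as one possible way to check the logic, not as a required step.

```java
import java.util.Locale;

// Hypothetical stand-in for the UDF body: same logic, but on String instead of Hadoop Text.
final class ToLowerLogic {

    static String toLower(String input) {
        if (input == null) {
            return null; // a NULL input yields a NULL result
        }
        // Locale.ROOT avoids locale-specific case mappings (an assumption beyond the original UDF).
        return input.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(toLower("Hello EMR")); // prints "hello emr"
        System.out.println(toLower(null));        // prints "null"
    }
}
```

If the locale-independent behavior is what you want in production, the same Locale.ROOT argument can be used inside the real evaluate method.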
Step 4: Create the function node
In the Business Flow section, click your workflow to expand it. Right-click EMR and choose Create Function.
In the Create Function dialog box, set Name, Engine Instance, and Path, then click Create.
In the Function information section on the configuration tab, set the following parameters.
Parameter | Description
Function Type | The category of the function. Valid values: Mathematical Operation Functions, Aggregate Functions, String Processing Functions, Date Functions, Window Functions, and Other Functions. Choose based on what your UDF computes. For example, select String Processing Functions for text transformation logic.
Engine Instance | The EMR cluster associated with your workspace. By default, you cannot change the engine instance.
Engine Type | The compute engine type. By default, you cannot change the engine type.
EMR database | The database in the EMR cluster where the function is registered. Select one from the drop-down list. To create a database, click New Library, fill in the required fields, and click OK.
Function Name | The name used to call the function in SQL statements. Must be globally unique and cannot be changed after creation. For example: to_lower.
Owner | Set automatically.
Class Name | Required. The name of the class that implements the function.
Resource | Required. The JAR resource that contains your UDF class. Select a resource from the current workspace, or click Create Resource to upload a new one.
Description | A description of the function. For example: Converts a string to lowercase.
Expression Syntax | The call syntax of the function. For example: to_lower(col_name).
Parameter Description | Descriptions of the input and output parameters. For example: input: STRING col_name, output: STRING.
Return Value | Optional. The return value. For example: 1.
Example | Optional. A sample SQL statement that shows how to call the function. For example: SELECT to_lower(name) FROM my_table LIMIT 10;
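The class you enter in Class Name must be one that Hive can invoke. For UDFs built on the classic org.apache.hadoop.hive.ql.exec.UDF base class, as in the ToLowerUDF example, Hive locates a public evaluate method by reflection. The sketch below is a rough pre-registration check along those lines. It is not what DataWorks itself runs, and UdfShapeCheck, SampleUdf, and NotAUdf are hypothetical names used only for illustration.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Locale;

// Hypothetical helper: checks that a class exposes a public instance method named "evaluate".
final class UdfShapeCheck {

    static boolean hasEvaluate(Class<?> cls) {
        // getMethods() returns only public methods, including inherited ones.
        for (Method m : cls.getMethods()) {
            if (m.getName().equals("evaluate") && !Modifier.isStatic(m.getModifiers())) {
                return true;
            }
        }
        return false;
    }

    // Stand-in with the expected shape (no Hive dependency, for illustration only).
    static class SampleUdf {
        public String evaluate(String input) {
            return input == null ? null : input.toLowerCase(Locale.ROOT);
        }
    }

    // Stand-in that lacks an evaluate method.
    static class NotAUdf {
        public String apply(String input) {
            return input;
        }
    }

    public static void main(String[] args) {
        System.out.println(hasEvaluate(SampleUdf.class)); // prints "true"
        System.out.println(hasEvaluate(NotAUdf.class));   // prints "false"
    }
}
```

In a real project, the same check could be pointed at your own UDF class in a unit test that has the Hive dependency on its classpath.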
Step 5: Save the function
Click the save icon in the top toolbar to save your configuration.
Step 6: Commit the function
Click the commit icon in the top toolbar.
Note: You must select a resource group for scheduling when you commit the EMR function. An exclusive resource group for scheduling is recommended. If none is available, purchase and configure one. For more information, see Create and use an exclusive resource group for scheduling.
In the Commit Node dialog box, enter your comments in the Change description field.
Click OK.