When Spark SQL's built-in functions don't cover your logic, create a user-defined function (UDF) and register it in Spark SQL. This topic walks you through registering and calling Python and Java/Scala UDFs, using prebuilt sample files so you can test end-to-end without writing code from scratch.
Supported versions
The following E-MapReduce (EMR) Serverless Spark runtime versions support the examples in this topic:
-
esr-5.x: esr-5.0.0 and later
-
esr-4.x: esr-4.6.0 and later
-
esr-3.x: esr-3.5.0 and later
-
esr-2.x: esr-2.9.0 and later
Prerequisites
Before you begin, ensure that you have:
-
An Object Storage Service (OSS) bucket that you are authorized to access
-
A workspace with access to Spark SQL task creation. For more information, see Getting started with SparkSQL development
Python UDF
This example uses a main function, my_adds.my_add, that imports two dependency packages — module1 and module2. module1 adds 0.5 to the input integer and module2 adds 0.3. The function combines both results and returns a floating-point number.
Step 1: Upload files to OSS
Download the three sample files and upload them to your OSS bucket.
-
Download the example files:
-
module1.tgz: Adds
0.5to an integer and returns a floating-point number. -
module2.tgz: Adds
0.3to an integer and returns a floating-point number. -
my_adds.py: The main function file. Calls
PythonUDF.your_add(a)andPythonUDF2.add2(a), adds the results, and returns the final value.
-
-
Upload all three files to your OSS bucket. For more information, see Simple upload.
If your UDF references custom packages, extract them into the Spark Python environment viasys.path. Themy_adds.pyfile demonstrates this pattern:
### my_adds.py content
import sys
def my_add(a: int) -> float:
# Make sure the dependency path is added to sys.path
if not sys.path.__contains__("./module1.tgz"):
sys.path.insert(0, "./module1.tgz/")
from module1 import PythonUDF
if not sys.path.__contains__("./module2.tgz"):
sys.path.insert(0, "./module2.tgz/")
from module2 import PythonUDF2
b = PythonUDF.your_add(a) + PythonUDF2.add2(a)
return b
Step 2: Register the UDF
Register the function in Spark SQL using CREATE FUNCTION. Choose permanent registration for production environments or temporary registration for development and testing.
-
In your workspace, create a Spark SQL task.
-
On the Spark SQL tab, run one of the following statements. Replace
<bucket>with your OSS bucket name. Permanent function — saved to the data catalog and reusable across all SQL sessions:-- The format after AS is "[python_file_name].[function_name]" -- Use a public-read or authorized OSS bucket. CREATE OR REPLACE FUNCTION adds AS "my_adds.my_add" USING FILE "oss://<bucket>/demo/udf/my_adds.py", FILE "oss://<bucket>/demo/udf/module1.tgz", FILE "oss://<bucket>/demo/udf/module2.tgz";Temporary function — valid only for the current SQL session:
-- Use the TEMPORARY keyword to create a temporary function. CREATE TEMPORARY FUNCTION adds AS "my_adds.my_add" USING FILE "oss://<bucket>/demo/udf/my_adds.py", FILE "oss://<bucket>/demo/udf/module1.tgz", FILE "oss://<bucket>/demo/udf/module2.tgz";
Step 3: Call the UDF
Run the following SQL statements on the Spark SQL tab to test the function.
-- Prepare test data
CREATE TABLE IF NOT EXISTS test_tbl (id INT, name STRING);
TRUNCATE TABLE test_tbl;
INSERT INTO test_tbl VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Charlie'), (4, 'David');
-- Call the registered Python UDF
SELECT id, adds(id) AS result FROM test_tbl;
The expected output is similar to:
Java/Scala UDF
This example uses a precompiled Java UDF JAR file. The function takes a string, appends ":HelloWorld", and returns the result. The process for Scala UDFs is the same.
Step 1: Upload the JAR file to OSS
-
Download the example JAR file: udf-1.0-SNAPSHOT.jar: Appends
":HelloWorld"to an input string and returns the result. -
Upload the file to your OSS bucket. For more information, see Simple upload.
Step 2: Register the UDF
Register the function using CREATE FUNCTION with USING JAR to point to the OSS path of the JAR file.
-
In your workspace, create a Spark SQL task.
-
On the Spark SQL tab, run one of the following statements. Replace the OSS path with the actual location of your JAR file. Permanent function — saved to the data catalog (DLF, DLF 1.0, or HMS) and reusable across all SQL sessions:
-- The class created in the UDF follows AS. CREATE FUNCTION myfunc AS "org.example.MyUDF" USING JAR "oss://path/to/udf-1.0-SNAPSHOT.jar";Temporary function — valid only for the current SQL session:
-- The temporary function is valid only in the current session. CREATE TEMPORARY FUNCTION myfunc AS "org.example.MyUDF" USING JAR "oss://path/to/udf-1.0-SNAPSHOT.jar";
Step 3: Call the UDF
In the task editor, run the following statement to test the function:
SELECT myfunc("abc");
The expected output is similar to: