Develop a UDF in Python 3

Quick start

The following minimal example adds two integers and handles NULL inputs:

from odps.udf import annotate

@annotate("bigint,bigint->bigint")
class MyPlus(object):
    def evaluate(self, arg0, arg1):
        if None in (arg0, arg1):
            return None
        return arg0 + arg1

To run a Python 3 UDF, add the following session flag before your SQL statement:

SET odps.sql.python.version=cp37;
SELECT my_plus(col_a, col_b) FROM my_table;

UDF code structure

Every Python 3 UDF requires four components:

Component	Description
Module import	`from odps.udf import annotate` imports the `@annotate` decorator used to declare the function signature. To reference files or tables inside UDF code, also add `from odps.distcache import get_cache_file` or `from odps.distcache import get_cache_table`.
Function signature	`@annotate(<signature>)` declares the input and return types. MaxCompute validates type consistency during semantic parsing and returns an error if types do not match. See Function signatures and data types.
Custom Python class	The class is the organizational unit of your UDF. It defines the variables and methods that implement your business logic. Classes can also reference third-party libraries pre-installed in MaxCompute, or external files and tables. See Third-party libraries and Reference resources.
`evaluate` method	Defined inside the class, `evaluate` specifies the input parameters and return value of the UDF. Each class can have only one `evaluate` method.

Limitations

Internet access (enforced at runtime)

UDFs cannot access the internet by default. To enable internet access, submit a Network Connection Request Form. The MaxCompute technical support team will contact you to complete the setup. For instructions, see Network Access Process.

VPC access (enforced at runtime)

UDFs cannot access virtual private clouds (VPCs) by default. To access VPC resources from a UDF, first create a network connection between your MaxCompute project and the target VPC. For more information, see Access resources in a VPC using a UDF.

Reading table data (enforced at runtime)

UDFs, user-defined aggregate functions (UDAFs), and user-defined table-valued functions (UDTFs) cannot read data from the following table types:

Tables with modified schemas (Schema Evolution)
Tables that contain complex data types
Tables that contain the JSON data type
Transactional tables

Usage notes

Python 2 and Python 3 are not compatible. Do not mix Python 2 and Python 3 UDFs in the same SQL statement.

Python 2 reached end of life (EOL) in early 2020. For guidance on migrating existing UDFs, see Migrate Python 2 UDFs.

NULL handling

Handle NULLs explicitly inside your evaluate method:

def evaluate(self, arg):
    if arg is None:
        return None
    return arg.upper()
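
The same guard extends to UDFs that take several arguments. The following is a minimal sketch rather than part of the original example: the class name `SafeConcat` is hypothetical, and the fallback definition of `annotate` is an assumption that lets the class be exercised on a machine where the `odps` package is not installed.

```python
try:
    from odps.udf import annotate
except ImportError:
    # Assumption: outside MaxCompute the odps package may be missing,
    # so a no-op stand-in keeps the class importable for local checks.
    def annotate(signature):
        return lambda cls: cls


@annotate("string,string->string")
class SafeConcat(object):
    """Hypothetical UDF: upper-cases two strings and joins them."""

    def evaluate(self, left, right):
        # SQL NULL arrives as Python None; return None early instead of
        # letting None.upper() raise AttributeError.
        if left is None or right is None:
            return None
        return left.upper() + right.upper()


udf = SafeConcat()
print(udf.evaluate("ab", "cd"))   # ABCD
print(udf.evaluate(None, "cd"))   # None
```

Returning None for any NULL input mirrors the behavior of the MyPlus example in Quick start. Whether NULL should propagate or be replaced by a default is a design decision for each function.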

Develop a UDF

MaxCompute supports UDF development with MaxCompute Studio, DataWorks, and the MaxCompute client (odpscmd). All three tools follow the same workflow:

Write UDF code
Upload the Python file and register the function
Call the UDF in SQL

The following sections walk through the workflow for each tool, using the same example function GetUrlChar that extracts a URL segment by position.

Use MaxCompute Studio

Prerequisites

Before you begin, ensure that you have:

Write UDF code

In the Project panel, right-click scripts under the MaxCompute script module and choose New > MaxCompute Python.
In the Create new MaxCompute python class dialog, enter a class name in Name, select python UDF from the Kind drop-down list, and click OK.

Write your UDF code in the editor. For local UDF testing, see Test UDFs. Example:

from odps.udf import annotate

@annotate("string,bigint->string")
class GetUrlChar(object):

    def evaluate(self, url, n):
        if n == 0:
            return ""
        try:
            index = url.find(".htm")
            if index < 0:
                return ""
            a = url[:index]
            index = a.rfind("/")
            b = a[index + 1:]
            c = b.split("-")
            if len(c) < n:
                return ""
            return c[-n]
        except Exception:
            return "Internal error"
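
Because `evaluate` is an ordinary method, its logic can be checked on a development machine before the file is deployed. The sketch below is one possible approach, not the Studio test workflow: `run_cases` and `CASES` are hypothetical names, the fallback `annotate` is an assumption for environments without the `odps` package, and the class body repeats the example above only so that the block runs on its own.

```python
try:
    from odps.udf import annotate
except ImportError:
    def annotate(signature):  # local no-op stand-in (assumption)
        return lambda cls: cls


@annotate("string,bigint->string")
class GetUrlChar(object):
    # Same logic as the example above, repeated so this sketch runs alone.
    def evaluate(self, url, n):
        if n == 0:
            return ""
        try:
            index = url.find(".htm")
            if index < 0:
                return ""
            a = url[:index]
            index = a.rfind("/")
            b = a[index + 1:]
            c = b.split("-")
            if len(c) < n:
                return ""
            return c[-n]
        except Exception:
            return "Internal error"


def run_cases(udf, cases):
    """Return (args, expected, actual) for every case that does not match."""
    failures = []
    for args, expected in cases:
        actual = udf.evaluate(*args)
        if actual != expected:
            failures.append((args, expected, actual))
    return failures


# Each entry pairs the evaluate() arguments with the expected result.
CASES = [
    (("http://www.taobao.com/a.htm", 1), "a"),
    (("http://www.taobao.com/a-b-c.htm", 1), "c"),
    (("http://www.taobao.com/a-b-c.htm", 3), "a"),
    (("http://www.taobao.com/a-b-c.htm", 4), ""),   # fewer segments than n
    (("http://www.taobao.com/index", 1), ""),        # no ".htm" in the URL
    ((None, 1), "Internal error"),                   # NULL url hits the except branch
]

failures = run_cases(GetUrlChar(), CASES)
print(failures)  # [] when every case matches
```

The last case shows that a NULL URL currently yields the string "Internal error" instead of NULL. An explicit `if url is None` check would be needed if NULL should propagate.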

Upload the file and register the function

Right-click the Python file in the scripts folder and select Deploy to server.... In the Submit resource and register function dialog, enter the function name and click OK. For details, see Upload a Python program and create a MaxCompute UDF.

Call the UDF

In the Project Explorer tab, right-click your MaxCompute project, select Open Console, and run:

SET odps.sql.python.version=cp37;
SELECT UDF_GET_URL_CHAR("http://www.taobao.com/a.htm", 1);

Result:

+-----+
| _c0 |
+-----+
|  a  |
+-----+

Use DataWorks

Prerequisites

Before you begin, ensure that you have activated DataWorks and associated a DataWorks workspace with your MaxCompute project. For setup instructions, see DataWorks.

Write UDF code

Write the UDF code in any Python editor. Example:

from odps.udf import annotate

@annotate("string,bigint->string")
class GetUrlChar(object):

    def evaluate(self, url, n):
        if n == 0:
            return ""
        try:
            index = url.find(".htm")
            if index < 0:
                return ""
            a = url[:index]
            index = a.rfind("/")
            b = a[index + 1:]
            c = b.split("-")
            if len(c) < n:
                return ""
            return c[-n]
        except Exception:
            return "Internal error"

Upload the file and register the function

Upload the packaged code in the DataWorks console and create the UDF. See:

Call the UDF

Create an ODPS SQL node in the DataWorks console, then run:

SET odps.sql.python.version=cp37;
SELECT UDF_GET_URL_CHAR("http://www.taobao.com/a.htm", 1);

For more information about ODPS SQL nodes, see Develop a MaxCompute SQL task.

Use the MaxCompute client (odpscmd)

Prerequisites

Before you begin, ensure that you have downloaded, installed, and configured the MaxCompute client (odpscmd). For setup instructions, see MaxCompute client (odpscmd).

Write UDF code

Write the UDF code in any Python editor. Example:

from odps.udf import annotate

@annotate("string,bigint->string")
class GetUrlChar(object):

    def evaluate(self, url, n):
        if n == 0:
            return ""
        try:
            index = url.find(".htm")
            if index < 0:
                return ""
            a = url[:index]
            index = a.rfind("/")
            b = a[index + 1:]
            c = b.split("-")
            if len(c) < n:
                return ""
            return c[-n]
        except Exception:
            return "Internal error"

Upload the file and register the function

Upload the Python file and register the UDF using the following commands:

Call the UDF

Run the following SQL in the client:

SET odps.sql.python.version=cp37;
SELECT UDF_GET_URL_CHAR("http://www.taobao.com/a.htm", 1);

Third-party libraries

The built-in Python 3 runtime in MaxCompute does not include NumPy. To use NumPy, manually upload the NumPy wheel package. Download the package from PyPI or a mirror. The filename follows the pattern numpy-<version>-cp37-cp37m-manylinux1_x86_64.whl.

For upload instructions, see Resource operations or Use third-party packages in Python UDFs.

For a full list of standard libraries available in the Python 3.7 runtime, see The Python Standard Library.

Function signatures and data types

Before you write your UDF code, decide:

Which input types your function accepts and which type it returns
How your function handles NULL inputs (MaxCompute can pass NULLs to any UDF)

The function signature uses the @annotate decorator:

@annotate(<signature>)

The signature string format is:

'arg_type_list -> type'

Input types (`arg_type_list`)

Separate multiple input types with commas. The following types are supported:

BIGINT, STRING, DOUBLE, BOOLEAN, DATETIME, DECIMAL, FLOAT, BINARY, DATE, DECIMAL(precision,scale), CHAR, VARCHAR, and complex types ARRAY, MAP, STRUCT (including nested complex types).

Two special values for arg_type_list:

Value	Meaning
`*`	Accepts any number of arguments
`''` (empty string)	Accepts no arguments
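
As an illustration of the two special values, the sketch below defines one UDF for each. It assumes that with a `*` signature the arguments are passed positionally, so `*args` collects them, and that an empty argument list maps to an `evaluate` method with no parameters besides `self`. The class names `JoinArgs` and `Pi` are hypothetical, and the fallback `annotate` exists only so the code runs without the `odps` package.

```python
import math

try:
    from odps.udf import annotate
except ImportError:
    def annotate(signature):  # local no-op stand-in (assumption)
        return lambda cls: cls


@annotate("*->string")
class JoinArgs(object):
    """Hypothetical UDF: joins any number of non-NULL arguments with commas."""

    def evaluate(self, *args):
        present = [str(a) for a in args if a is not None]
        if not present:
            return None  # no usable input: return NULL
        return ",".join(present)


@annotate("->double")
class Pi(object):
    """Hypothetical UDF that takes no arguments."""

    def evaluate(self):
        return math.pi


print(JoinArgs().evaluate("a", 1, None))  # a,1
print(Pi().evaluate())
```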

Return type (`type`)

UDFs return a single column. The supported return types are:

BIGINT, STRING, DOUBLE, BOOLEAN, DATETIME, DECIMAL, FLOAT, BINARY, DATE, DECIMAL(precision,scale), and complex types ARRAY, MAP, STRUCT (including nested complex types).

The available types depend on the MaxCompute data type edition used by your project. For details, see Data type editions.

Signature examples

Signature	Description
`'bigint,double->string'`	Takes BIGINT and DOUBLE inputs, returns STRING
`'*->string'`	Takes any number of inputs, returns STRING
`'->double'`	Takes no inputs, returns DOUBLE
`'array<bigint>->struct<x:string, y:int>'`	Takes ARRAY<BIGINT>, returns STRUCT<x:STRING, y:INT>
`'->map<bigint, string>'`	Takes no inputs, returns MAP<BIGINT, STRING>
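
To make the `arg_type_list -> type` format concrete, the following sketch splits a signature string into its input types and return type. It is an illustrative helper only: `split_signature` is a hypothetical name and this is not how MaxCompute itself parses or validates signatures. Commas nested inside `<...>` or `(...)` are treated as part of a single type.

```python
def split_signature(signature):
    """Split 'arg_type_list->type' into (list_of_arg_types, return_type)."""
    args_part, arrow, return_type = signature.partition("->")
    if not arrow:
        raise ValueError("signature must contain '->'")

    args, current, depth = [], [], 0
    for ch in args_part:
        if ch in "<(":
            depth += 1
        elif ch in ">)":
            depth -= 1
        if ch == "," and depth == 0:
            # A top-level comma separates two input types.
            args.append("".join(current).strip())
            current = []
        else:
            current.append(ch)

    last = "".join(current).strip()
    if last:
        args.append(last)
    return args, return_type.strip()


print(split_signature("bigint,double->string"))
print(split_signature("->double"))
print(split_signature("map<bigint,string>,decimal(10,2)->array<bigint>"))
```

An empty list in the first position corresponds to a UDF that takes no arguments, and `['*']` corresponds to one that accepts any number of them.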

MaxCompute SQL to Python 3 type mappings

Write your UDF code using these type mappings to ensure consistency:

MaxCompute SQL type	Python 3 type
BIGINT	int
STRING	str
DOUBLE	float
BOOLEAN	bool
DATETIME	datetime.datetime
FLOAT	float
CHAR	str
VARCHAR	str
BINARY	bytes
DATE	datetime.date
DECIMAL	decimal.Decimal
ARRAY	list
MAP	dict
STRUCT	collections.namedtuple
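
The mappings determine which Python objects `evaluate` receives and which it should return. The sketch below assumes the mappings in the table above: ARRAY arrives as a list, MAP as a dict, and DATETIME as a `datetime.datetime`. The class names `WeightedTotal` and `DayString` are hypothetical, and the fallback `annotate` is only a local stand-in for environments without the `odps` package.

```python
import datetime

try:
    from odps.udf import annotate
except ImportError:
    def annotate(signature):  # local no-op stand-in (assumption)
        return lambda cls: cls


@annotate("array<bigint>,map<string,double>->double")
class WeightedTotal(object):
    """Hypothetical UDF: adds all array elements and all map values."""

    def evaluate(self, values, weights):
        if values is None or weights is None:
            return None
        # values is a list of int, weights a dict of str -> float;
        # individual elements may still be None (NULL).
        total = sum(v for v in values if v is not None)
        bonus = sum(w for w in weights.values() if w is not None)
        return float(total + bonus)  # DOUBLE maps to float


@annotate("datetime->string")
class DayString(object):
    """Hypothetical UDF: formats a DATETIME as YYYY-MM-DD."""

    def evaluate(self, value):
        if value is None:
            return None
        return value.strftime("%Y-%m-%d")  # STRING maps to str


print(WeightedTotal().evaluate([1, 2, 3], {"a": 0.5}))                 # 6.5
print(DayString().evaluate(datetime.datetime(2020, 1, 2, 3, 4, 5)))   # 2020-01-02
```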

Reference resources

Reference files or tables inside UDF code using the odps.distcache module.

Reference a file

odps.distcache.get_cache_file(resource_name, mode) returns the content of a file resource.

Parameter	Description
`resource_name`	Name of an existing file resource in your MaxCompute project. An exception is raised if the name is invalid or the resource does not exist.
`mode`	Open mode. `'t'` (default) for text, `'b'` for binary.

The return value is a file-like object. Call close() on it when done to release the file handle.

from odps.udf import annotate
from odps.distcache import get_cache_file

@annotate('bigint->string')
class DistCacheExample(object):
    def __init__(self):
        cache_file = get_cache_file('test_distcache.txt')
        kv = {}
        for line in cache_file:
            line = line.strip()
            if not line:
                continue
            k, v = line.split()
            kv[int(k)] = v
        cache_file.close()
        self.kv = kv

    def evaluate(self, arg):
        return self.kv.get(arg)
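
The parsing step in `__init__` can be factored into a helper and tried out locally. In the sketch below, `load_kv` is a hypothetical helper, and an `io.StringIO` object stands in for the file-like object returned by `get_cache_file`, on the assumption that the real object iterates over text lines in the same way. The `try`/`finally` ensures the handle is closed even if a line fails to parse.

```python
import io


def load_kv(file_obj):
    """Read 'key value' lines into a dict with int keys, then close the file."""
    kv = {}
    try:
        for line in file_obj:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            kv[int(parts[0])] = parts[1]
    finally:
        file_obj.close()  # release the handle in all cases
    return kv


# Stand-in for get_cache_file('test_distcache.txt') when running locally.
sample = io.StringIO("1 one\n\n2 two\nbad line here\n")
table = load_kv(sample)
print(table)          # {1: 'one', 2: 'two'}
print(sample.closed)  # True
```

Inside a UDF, the constructor could then call `load_kv(get_cache_file('test_distcache.txt'))` and store the result on `self`.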

Reference a table

odps.distcache.get_cache_table(resource_name) returns the content of a table resource.

Parameter	Description
`resource_name`	Name of an existing table resource in the current MaxCompute project. An exception is raised if the name is invalid or the resource does not exist.

The return value is a generator. Each iteration yields one record of the ARRAY type. Supported column types: BIGINT, STRING, DOUBLE, BOOLEAN, DATETIME, FLOAT, CHAR, VARCHAR, BINARY, DATE, DECIMAL, ARRAY, MAP, STRUCT.

from odps.udf import annotate
from odps.distcache import get_cache_table

@annotate('->string')
class DistCacheTableExample(object):
    def __init__(self):
        self.records = list(get_cache_table('udf_test'))
        self.counter = 0
        self.ln = len(self.records)

    def evaluate(self):
        if self.counter > self.ln - 1:
            return None
        ret = self.records[self.counter]
        self.counter += 1
        return str(ret)
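
A common alternative to returning records one by one is to load the table resource into a lookup dictionary once and consult it in `evaluate`. The sketch below assumes each record can be indexed by column position. `build_lookup` and `fake_cache_table` are hypothetical names, and the generator only imitates what `get_cache_table` would yield.

```python
def build_lookup(records, key_index=0, value_index=1):
    """Map one column of each record to another column."""
    lookup = {}
    for record in records:
        lookup[record[key_index]] = record[value_index]
    return lookup


def fake_cache_table():
    # Local stand-in for get_cache_table('udf_test'): yields one record
    # per iteration, each record holding the column values of a row.
    yield (1, "one")
    yield (2, "two")


lookup = build_lookup(fake_cache_table())
print(lookup.get(2))   # two
print(lookup.get(99))  # None, which a UDF would return as NULL
```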

Call a UDF

Enable Python 3

MaxCompute projects use Python 2 for UDFs by default. To run a Python 3 UDF, add the following line before your SQL statement:

SET odps.sql.python.version=cp37;

Call within the same project

Call the UDF the same way as a built-in function:

SET odps.sql.python.version=cp37;
SELECT my_udf(column1, column2) FROM my_table;

Call across projects

To use a UDF from another project (for example, to use a UDF from Project B in Project A), prefix the function call with the source project name:

SELECT B:udf_in_other_project(arg0, arg1) AS res FROM table_t;

For more information, see Access resources across projects using packages.

Migrate Python 2 UDFs

Python 2 reached end of life (EOL) in early 2020.

For new projects, write all Python UDFs in Python 3.

For existing projects with Python 2 UDFs, proceed with caution when switching to Python 3. Two approaches are available:

Write new UDFs in Python 3 and enable Python 3 at the session level for new jobs. For details, see Enable Python 3.
Rewrite existing Python 2 UDFs to be compatible with both Python 2 and Python 3. For guidance, see Porting Python 2 code to Python 3.

For UDFs shared across multiple projects, write code that is compatible with both Python 2 and Python 3.
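
String handling is often the main obstacle when one source file must run under both interpreters. The sketch below shows one possible pattern, not an official migration recipe: `to_text` and `NormalizeCity` are hypothetical names, the sketch assumes that string inputs may arrive as byte strings under Python 2 and as `str` under Python 3, and the fallback `annotate` is only a local stand-in for environments without the `odps` package.

```python
import sys

try:
    from odps.udf import annotate
except ImportError:
    def annotate(signature):  # local no-op stand-in (assumption)
        return lambda cls: cls

PY2 = sys.version_info[0] == 2


def to_text(value, encoding="utf-8"):
    """Return value as text on both Python 2 and Python 3."""
    if value is None:
        return None
    if isinstance(value, bytes):
        # Python 2 str and Python 3 bytes both take this branch.
        return value.decode(encoding)
    return value


@annotate("string->string")
class NormalizeCity(object):
    """Hypothetical UDF: trims and lower-cases a city name."""

    def evaluate(self, city):
        text = to_text(city)
        if text is None:
            return None
        result = text.strip().lower()
        # Under Python 2, encode back to a byte string before returning.
        return result.encode("utf-8") if PY2 else result


print(NormalizeCity().evaluate("  Hangzhou "))
print(NormalizeCity().evaluate(b"Beijing"))
```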
