Write a UDF in Python 2 - MaxCompute

UDF code structure

A Python 2 UDF consists of five components. Four are always required; the encoding declaration is required only when the code contains Chinese characters.

Component	Required	Purpose
Encoding declaration	Optional (required if code contains Chinese characters)	Declares the file encoding
Module import	Required	Imports the function signature and any resource modules
Function signature	Required	Defines input and return data types via `@annotate`
Custom Python class	Required	The organizational unit of UDF logic
`evaluate` method	Required	Defines the UDF's input parameters and return value; each class can have only one

Encoding declaration

Add an encoding declaration at the top of any UDF file that contains Chinese characters. Without it, MaxCompute returns an error at runtime. Both formats below are equivalent:

#coding:utf-8

# -*- coding: utf-8 -*-

Module import

Every UDF must import the function signature module:

from odps.udf import annotate

To reference files or tables in UDF code, also import from odps.distcache:

from odps.distcache import get_cache_file   # for file resources
from odps.distcache import get_cache_table  # for table resources

Function signature

The @annotate decorator defines the data types of input parameters and return value. MaxCompute checks type consistency during semantic parsing and returns an error if types do not match.

@annotate("bigint,bigint->bigint")

For the full signature syntax, see Function signatures and data types.

Minimal example

The following example shows a complete, working UDF that adds two integers. It covers all required components.

#coding:utf-8
# Import the function signature.
from odps.udf import annotate

# Define input types (BIGINT, BIGINT) and return type (BIGINT).
@annotate("bigint,bigint->bigint")
class MyPlus(object):
    def evaluate(self, arg0, arg1):
        if None in (arg0, arg1):
            return None
        return arg0 + arg1

Note: Always handle None inputs explicitly. NULL values in MaxCompute SQL map to None in Python 2, so an unchecked None raises a TypeError in expressions such as arg0 + arg1.
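The NULL handling above can be checked before uploading. The following is a minimal sketch, assuming the odps package may not be installed on the development machine; the fallback annotate decorator is a stand-in written for this example, not part of MaxCompute.

```python
# Outside MaxCompute the odps package may be missing, so fall back to a
# stand-in decorator. The stand-in is an assumption for local testing only.
try:
    from odps.udf import annotate
except ImportError:
    def annotate(signature):
        def decorator(cls):
            cls.signature = signature  # kept only so tests can inspect it
            return cls
        return decorator


@annotate("bigint,bigint->bigint")
class MyPlus(object):
    def evaluate(self, arg0, arg1):
        # SQL NULL arrives as None; return NULL instead of raising TypeError.
        if None in (arg0, arg1):
            return None
        return arg0 + arg1


udf = MyPlus()
result_sum = udf.evaluate(1, 2)       # both inputs present
result_null = udf.evaluate(None, 2)   # one input is NULL
```

Calling evaluate directly exercises only the Python logic. Type checking of the signature still happens in MaxCompute when the UDF is used in SQL.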

Limitations

Prohibited operations

MaxCompute runs Python 2 UDF code inside a sandbox. The following operations are not permitted:

Reading from or writing to local files
Starting subprocesses
Starting threads
Opening socket connections
Calling Python 2 UDFs from external systems

Because of these restrictions, uploaded code must be built on the Python standard library. Standard library modules, including C extension modules, that perform any of the prohibited operations above cannot be used.

Available standard library modules

All pure-Python modules in the Python standard library (those with no dependency on C extension modules) are available.

The following C extension modules are also available:

array, audioop
binascii, bisect
cmath, _codecs_cn, _codecs_hk, _codecs_iso2022, _codecs_jp, _codecs_kr, _codecs_tw, _collections, cStringIO
datetime
_functools, future_builtins
_heapq, _hashlib
itertools
_json
_locale, _lsprof
math, _md5, _multibytecodec
operator
_random
_sha256, _sha512, _sha, _struct, strop
time
unicodedata
_weakref
cPickle

Note: The maximum size of data that can be written to sys.stdout or sys.stderr is 20 KB. Any output beyond this limit is silently discarded.

Third-party libraries

Third-party libraries, such as NumPy, are pre-installed in the MaxCompute Python 2 environment as supplements to the standard library.

Note: Third-party library usage is subject to the same sandbox restrictions. Local data access is not allowed, and network I/O is limited. The related APIs in affected libraries are disabled.

Function signatures and data types

Signature format

@annotate('<arg_type_list>-><return_type>')

arg_type_list specifies input parameter types, separated by commas. It accepts two special forms:

`*`: accepts any number of input parameters
'' (empty string): accepts no input parameters

return_type specifies the type of the single return value.

Supported input and return types: BIGINT, STRING, DOUBLE, BOOLEAN, DATETIME, DECIMAL, FLOAT, BINARY, DATE, DECIMAL(precision,scale), CHAR, VARCHAR, and the complex types ARRAY, MAP, STRUCT (including nested complex types).

Note: The data types available in function signatures depend on the MaxCompute data type edition used by your project. For details, see Data type editions.

Signature examples

Signature	Description
`'bigint,double->string'`	Takes BIGINT and DOUBLE inputs; returns STRING
`'*->string'`	Takes any number of inputs; returns STRING
`'->double'`	Takes no inputs; returns DOUBLE
`'array<bigint>->struct<x:string, y:int>'`	Takes ARRAY<BIGINT>; returns STRUCT<x:STRING, y:INT>
`'->map<bigint, string>'`	Takes no inputs; returns MAP<BIGINT, STRING>
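As an illustration of the `*` form, the sketch below defines a UDF whose evaluate method collects its inputs with *args. It assumes that MaxCompute passes the inputs as positional arguments. JoinNonNull is a hypothetical class name, and the fallback annotate decorator is a stand-in for local runs only.

```python
try:
    from odps.udf import annotate
except ImportError:
    # Stand-in for local runs only; not part of MaxCompute.
    def annotate(signature):
        return lambda cls: cls


@annotate("*->string")
class JoinNonNull(object):
    """Hypothetical UDF: joins all non-NULL arguments with a comma."""

    def evaluate(self, *args):
        # Skip NULL inputs, which arrive as None.
        parts = [str(a) for a in args if a is not None]
        if not parts:
            return None  # no usable input: return NULL
        return ",".join(parts)


joined = JoinNonNull().evaluate(1, None, "a")
```

Because the signature does not constrain the input types, the method converts each value with str() rather than assuming a particular type.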

Data type mappings

Write Python UDF logic using the Python 2 types that correspond to MaxCompute SQL types. Type mismatches cause runtime errors.

MaxCompute SQL type	Python 2 type	Notes
BIGINT	int
STRING	str
DOUBLE	float
BOOLEAN	bool
DATETIME	int	Stored as milliseconds since 00:00:00 Thursday, January 1, 1970 (Unix epoch). Use the `datetime` module to work with these values.
FLOAT	float
CHAR	str
VARCHAR	str
BINARY	bytearray
DATE	int
DECIMAL	decimal.Decimal
ARRAY	list
MAP	dict
STRUCT	collections.namedtuple

Additional notes:

NULL in MaxCompute SQL maps to None in Python 2.
odps.udf.int(value) takes an additional silent parameter. If silent is set to True and the value cannot be converted to int, the function returns None instead of raising an error.
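Because DATETIME values arrive as integer milliseconds, a UDF usually converts them before doing date arithmetic. The following is a minimal sketch using only the datetime module. The helper names are hypothetical, and the result is treated as a naive UTC timestamp, which is an assumption because the source does not state how time zones are handled.

```python
import datetime

# Unix epoch as a naive datetime (assumed to represent UTC).
EPOCH = datetime.datetime(1970, 1, 1)


def millis_to_datetime(millis):
    """Convert a DATETIME value (int milliseconds) to a datetime object."""
    if millis is None:  # SQL NULL
        return None
    return EPOCH + datetime.timedelta(milliseconds=millis)


def datetime_to_millis(value):
    """Convert a datetime object back to int milliseconds since the epoch."""
    if value is None:
        return None
    delta = value - EPOCH
    whole_seconds = delta.days * 86400 + delta.seconds
    return whole_seconds * 1000 + delta.microseconds // 1000


one_day = millis_to_datetime(86400000)
```

Integer arithmetic on the timedelta fields avoids the rounding that floating-point timestamps can introduce.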

Reference resources

Use the odps.distcache module to load file or table resources into UDF code at initialization time.

Reference a file

get_cache_file(resource_name) returns a file-like object with the content of the specified file resource.

resource_name must be the name of an existing file resource in your MaxCompute project. If the name is invalid or the file does not exist, an error is returned.
Declare the file resource when you create the UDF. If you do not declare it, calling the UDF returns an error.
Call close() on the returned object when you are done with it.

from odps.udf import annotate
from odps.distcache import get_cache_file

@annotate('bigint->string')
class DistCacheExample(object):
    def __init__(self):
        cache_file = get_cache_file('test_distcache.txt')
        kv = {}
        for line in cache_file:
            line = line.strip()
            if not line:
                continue
            k, v = line.split()
            kv[int(k)] = v
        cache_file.close()
        self.kv = kv

    def evaluate(self, arg):
        return self.kv.get(arg)
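The parsing logic in the example above can be factored into a function and tested without MaxCompute. This sketch assumes the resource file holds one "key value" pair per line, as the parsing code implies. load_kv is a hypothetical helper, and io.StringIO stands in for the object that get_cache_file returns.

```python
import io


def load_kv(file_obj):
    """Parse 'key value' lines into a dict keyed by int, then close the file."""
    kv = {}
    try:
        for line in file_obj:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Split on the first run of whitespace only, so values may
            # themselves contain spaces.
            k, v = line.split(None, 1)
            kv[int(k)] = v
    finally:
        # Close the resource even if a line fails to parse.
        file_obj.close()
    return kv


# Stand-in for get_cache_file('test_distcache.txt'); sample content is made up.
sample = io.StringIO(u"1 apple\n\n2 banana split\n")
lookup = load_kv(sample)
```

Unlike the original line.split(), splitting at most once accepts values that contain spaces. Whether that is desirable depends on the actual file format.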

Reference a table

get_cache_table(resource_name) returns a generator. Each iteration yields one record as a list.

resource_name must be the name of an existing table resource in your MaxCompute project. If the name is invalid or the table does not exist, an error is returned.

from odps.udf import annotate
from odps.distcache import get_cache_table

@annotate('->string')
class DistCacheTableExample(object):
    def __init__(self):
        self.records = list(get_cache_table('udf_test'))
        self.counter = 0
        self.ln = len(self.records)

    def evaluate(self):
        if self.counter > self.ln - 1:
            return None
        ret = self.records[self.counter]
        self.counter += 1
        return str(ret)
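A table resource is often used as a lookup rather than replayed row by row. The sketch below builds a dict from the records once, so that evaluate could answer each call with a single dict access. build_lookup and fake_cache_table are hypothetical, and the column layout (key first, then the remaining values) is an assumption about the table.

```python
def build_lookup(records):
    """Map the first column of each record to a list of the remaining columns."""
    lookup = {}
    for record in records:
        values = list(record)
        lookup[values[0]] = values[1:]
    return lookup


def fake_cache_table():
    # Stand-in for get_cache_table('udf_test'); the rows are invented.
    yield [1, "apple", 0.5]
    yield [2, "banana", 0.25]


table_lookup = build_lookup(fake_cache_table())
```

In a UDF, build_lookup(get_cache_table(...)) would run once in __init__, and evaluate would return a value derived from table_lookup.get(arg).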

Development process

The development process for Python 2 UDFs (setup, writing code, uploading the Python program, creating the UDF, debugging, and calling it) is the same as for Python 3 UDFs.

For the full development process, see Development process.
For a step-by-step guide using MaxCompute Studio, see Develop a Python UDF.

Supported development tools:

MaxCompute Studio
DataWorks
MaxCompute client (odpscmd)

Call a Python 2 UDF

After developing a Python 2 UDF, call it from MaxCompute SQL using one of the following approaches:

Within a project: Call the UDF the same way you call a built-in function.
Across projects: Call a UDF defined in project B from project A using the following syntax:
```
SELECT B:udf_in_other_project(arg0, arg1) AS res FROM table_t;
```
For setup instructions, see Cross-project resource access based on packages.