Write and use a Python 2 UDTF - MaxCompute

MaxCompute runs Python 2 user-defined table-valued functions (UDTFs) using Python 2.7. A UDTF takes one input row and returns zero or more output rows, making it useful for operations like splitting or expanding data.

To create and use a Python 2 UDTF:

Write a Python class that extends BaseUDTF and implements the required methods.
Register the class as a UDTF in MaxCompute, then call it in MaxCompute SQL.

UDTF code structure

A Python 2 UDTF consists of up to five components.

Component	Required	Description
Encoding declaration	No	Declares the file encoding. Use `#coding:utf-8` or `# -*- coding: utf-8 -*-`. Add this if the code contains Chinese characters; without it, MaxCompute returns an error at runtime.
Module imports	Yes	Must include `from odps.udf import annotate` and `from odps.udf import BaseUDTF`. Add `from odps.distcache import get_cache_file` or `from odps.distcache import get_cache_table` if the UDTF references file or table resources.
Function signature	No	Annotates the UDTF with `@annotate(<signature>)` to declare input and output data types. Without a signature, MaxCompute accepts any input type but treats all output values as STRING.
Derived class	Yes	A Python class that extends `BaseUDTF`. This class contains all the UDTF logic.
Class methods	Yes	Implement `process` at minimum. See the methods table below.

Methods

Method	Required	Description
`BaseUDTF.__init__()`	No	Initializes state before the first record is processed. If you override `__init__`, call `super(BaseUDTF, self).__init__()` at the start. Use this to set up any state the UDTF needs to maintain across records.
`BaseUDTF.process([args, ...])`	Yes	Called once for each input row. The arguments match the UDTF's input parameters as declared in SQL.
`BaseUDTF.forward([args, ...])`	Yes (called inside `process`)	Emits one output row each time it is called. Call it once for each row you want to return. If no function signature is defined, convert all arguments to STRING before calling `forward`.
`BaseUDTF.close()`	No	Called once after the last record is processed. Use this to release resources or flush output.
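
The lifecycle described in the methods table can be sketched as follows. This is a minimal illustration, not a reference implementation: the class name `NumberedRows` is hypothetical, and the `except ImportError` branch defines stand-ins for `annotate` and `BaseUDTF` only so that the file can be run locally, outside MaxCompute. The `super` call uses the form given in the methods table.

```python
# -*- coding: utf-8 -*-
try:
    from odps.udf import annotate, BaseUDTF
except ImportError:
    # Hypothetical stand-ins so the sketch can run outside MaxCompute.
    def annotate(signature):
        return lambda cls: cls

    class BaseUDTF(object):
        def forward(self, *args):
            print(args)


@annotate('string -> string, bigint')
class NumberedRows(BaseUDTF):
    """Emits each non-NULL input value with a running sequence number."""

    def __init__(self):
        # Call the base initializer first, as the methods table requires.
        super(BaseUDTF, self).__init__()
        self.seen = 0  # state kept across records

    def process(self, value):
        # Called once per input row; NULL arrives as None and is skipped.
        if value is None:
            return
        self.seen += 1
        self.forward(value, self.seen)

    def close(self):
        # Called after the last record; drop state that is no longer needed.
        self.seen = 0
```

Keeping the counter on `self` is what lets the sequence number persist from one `process` call to the next.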

Example

The following UDTF splits a comma-separated string and emits each value as a separate row.

#coding:utf-8
from odps.udf import annotate
from odps.udf import BaseUDTF

@annotate('string -> string')
class Explode(BaseUDTF):
    def process(self, arg):
        # NULL input arrives as None; emit no rows for it
        if arg is None:
            return
        props = arg.split(',')
        for p in props:
            self.forward(p)

Function signatures and data types

Signature format

@annotate('arg_type_list -> type_list')

arg_type_list: comma-separated list of input parameter types. Use * to accept any number of arguments, or leave blank to accept no arguments.
type_list: comma-separated list of return value types. A UDTF can return multiple columns.

The following table shows valid signature examples.

Signature	Input types	Return types
`@annotate('bigint,boolean->string,datetime')`	BIGINT, BOOLEAN	STRING, DATETIME
`@annotate('*->string,datetime')`	Any number of arguments	STRING, DATETIME
`@annotate('->double,bigint,string')`	None	DOUBLE, BIGINT, STRING
`@annotate("array<string>,struct<a1:bigint,b1:string>,string->map<string,bigint>,struct<b1:bigint>")`	ARRAY, STRUCT, STRING	MAP, STRUCT

During semantic parsing, MaxCompute checks that the data types of actual arguments match the signature. A mismatch returns an error.
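
As an illustration of a variadic input list combined with a multi-column return type, the sketch below declares `'* -> bigint, string'` and emits one row per argument. The class name `Unpivot` is hypothetical, the stand-ins in the `except ImportError` branch exist only for local runs, and the sketch assumes that arguments matched by `*` arrive as positional parameters of `process`.

```python
try:
    from odps.udf import annotate, BaseUDTF
except ImportError:
    # Hypothetical stand-ins so the sketch can run outside MaxCompute.
    def annotate(signature):
        return lambda cls: cls

    class BaseUDTF(object):
        def forward(self, *args):
            print(args)


@annotate('* -> bigint, string')
class Unpivot(BaseUDTF):
    """Turns the arguments of one call into (position, value) rows."""

    def process(self, *args):
        for position, value in enumerate(args):
            # Each forward call supplies one value per declared output
            # column: a BIGINT position and a STRING value.
            text = None if value is None else str(value)
            self.forward(position, text)
```

Because the return types are declared, each `forward` call passes exactly two values, in the same order as the type list.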

Note

The available data types depend on the data type edition of your MaxCompute project. For more information, see Data type editions.

Data type mappings

Write UDTF code using the Python types that correspond to MaxCompute SQL types.

MaxCompute SQL type	Python 2 type
BIGINT	int
STRING	str
DOUBLE	float
BOOLEAN	bool
DATETIME	int (milliseconds since January 1, 1970, 00:00:00 UTC)
FLOAT	float
CHAR	str
VARCHAR	str
BINARY	bytearray
DATE	int
DECIMAL	decimal.Decimal
ARRAY	list
MAP	dict
STRUCT	collections.namedtuple

Additional notes on type handling:

NULL in MaxCompute SQL maps to None in Python.
odps.udf.int(value, silent=True) returns None instead of raising an error when the value cannot be converted to int.
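
Because DATETIME values arrive as integers, date arithmetic inside a UDTF usually needs a conversion step. The helpers below are a sketch of one way to do that with the standard library. The function names are hypothetical and not part of `odps.udf`, and both functions pass `None` through unchanged so that NULL stays NULL.

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)  # 1970-01-01 00:00:00 UTC


def datetime_from_millis(millis):
    """Converts a DATETIME argument (epoch milliseconds) to a naive UTC datetime."""
    if millis is None:  # SQL NULL arrives as None
        return None
    return EPOCH + datetime.timedelta(milliseconds=millis)


def millis_from_datetime(moment):
    """Converts a naive UTC datetime back to epoch milliseconds."""
    if moment is None:
        return None
    delta = moment - EPOCH
    whole_seconds = delta.days * 86400 + delta.seconds
    return whole_seconds * 1000 + delta.microseconds // 1000
```

A `process` method could call `datetime_from_millis` on its input, do its date arithmetic, and pass the result of `millis_from_datetime` to `forward` for a DATETIME output column.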

Reference file and table resources

Use the odps.distcache module to load file or table resources into your UDTF.

get_cache_file(resource_name): returns a file-like object for the named file resource. Call close() on the object when done. Declare the file resource when registering the UDTF—otherwise, the call fails at runtime.
get_cache_table(resource_name): returns a generator over the named table resource. Each iteration yields a record as a list (ARRAY type).

The following example loads a JSON file and a table resource, then uses them to look up ad IDs by page ID.

# -*- coding: utf-8 -*-
from odps.udf import annotate
from odps.udf import BaseUDTF
from odps.distcache import get_cache_file
from odps.distcache import get_cache_table

@annotate('string -> string, bigint')
class UDTFExample(BaseUDTF):
    def __init__(self):
        import json
        # Load the JSON file resource into a dict
        cache_file = get_cache_file('test_json.txt')
        self.my_dict = json.load(cache_file)
        cache_file.close()
        # Merge records from the table resource
        records = list(get_cache_table('table_resource1'))
        for record in records:
            self.my_dict[record[0]] = [record[1]]

    def process(self, pageid):
        # Emit one row per ad ID associated with the page
        for adid in self.my_dict[pageid]:
            self.forward(pageid, adid)

Call the UDTF in MaxCompute SQL

After you register the UDTF, call it from MaxCompute SQL:

Within a project: Call the UDTF the same way you call built-in functions.
Across projects: To use a UDTF from project B in project A, prefix the function name with the project name and a colon:
```
SELECT B:udf_in_other_project(arg0, arg1) AS res FROM table_t;
```
For more information, see Cross-project resource access based on packages.

Limitations

MaxCompute runs Python 2 UDTF code in a sandbox environment. The following operations are not allowed:

Reading from or writing to local files
Starting subprocesses
Starting threads
Opening socket connections
Calling Python 2 UDFs from other systems

Upload only code that uses Python standard libraries. Modules or C extension modules that depend on the restricted operations above are not available.

Available C extension modules

The following C extension modules are available in the sandbox:

array, audioop, binascii, bisect, cmath, _codecs_cn, _codecs_hk, _codecs_iso2022, _codecs_jp, _codecs_kr, _codecs_tw, _collections, cStringIO, datetime, _functools, future_builtins, _heapq, _hashlib, itertools, _json, _locale, _lsprof, math, _md5, _multibytecodec, operator, _random, _sha256, _sha512, _sha, _struct, strop, time, unicodedata, _weakref, cPickle

All modules implemented purely in Python that do not depend on extension modules are also available.

Output size limit

Writing to sys.stdout or sys.stderr is capped at 20 KB. Characters beyond this limit are silently dropped.
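
Since output past the cap disappears without warning, one option is to track how much has been written and stop deliberately. The helper below is a sketch of that idea. `BudgetedLog` is a hypothetical class, the budget assumes that "20 KB" means 20 × 1024 bytes, and whether the cap applies per stream or to both streams combined is not stated here, so the sketch tracks a single stream.

```python
import sys

LOG_BUDGET = 20 * 1024  # assumption: "20 KB" means 20480 bytes


class BudgetedLog(object):
    """Hypothetical helper that stops logging before the cap is reached."""

    def __init__(self, stream=None, budget=LOG_BUDGET):
        self.stream = stream if stream is not None else sys.stderr
        self.remaining = budget

    def write(self, message):
        """Writes one line if it fits; returns False once it would not."""
        line = message + '\n'
        # len() counts bytes for a Python 2 str, which is what the cap limits.
        if len(line) > self.remaining:
            return False
        self.stream.write(line)
        self.remaining -= len(line)
        return True
```

A UDTF could create one `BudgetedLog` in its initializer and check the return value of `write` to decide whether further diagnostics are worth producing.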

Third-party libraries

Third-party libraries, such as NumPy, are pre-installed in the MaxCompute Python 2 environment. Local data access and most network I/O APIs are disabled for third-party libraries—only limited network I/O is available.
