MaxCompute:Python 2 UDTF

Last Updated:Mar 26, 2026

MaxCompute runs Python 2 user-defined table-valued functions (UDTFs) using Python 2.7. A UDTF takes one input row and returns zero or more output rows, making it useful for operations like splitting or expanding data.

To create and use a Python 2 UDTF:

  1. Write a Python class that extends BaseUDTF and implements the required methods.

  2. Register the class as a UDTF in MaxCompute, then call it in MaxCompute SQL.

UDTF code structure

A Python 2 UDTF consists of up to five components.

| Component | Required | Description |
| --- | --- | --- |
| Encoding declaration | No | Declares the file encoding. Use #coding:utf-8 or # -*- coding: utf-8 -*-. Add this if the code contains Chinese characters; without it, MaxCompute returns an error at runtime. |
| Module imports | Yes | Must include from odps.udf import annotate and from odps.udf import BaseUDTF. Add from odps.distcache import get_cache_file or from odps.distcache import get_cache_table if the UDTF references file or table resources. |
| Function signature | No | Annotates the UDTF with @annotate(<signature>) to declare input and output data types. Without a signature, MaxCompute accepts any input type but treats all output values as STRING. |
| Derived class | Yes | A Python class that extends BaseUDTF. This class contains all the UDTF logic. |
| Class methods | Yes | Implement process at minimum. See the methods table below. |

Methods

| Method | Required | Description |
| --- | --- | --- |
| BaseUDTF.__init__() | No | Initializes state before the first record is processed. If you override __init__, call the base-class initializer at the start, for example super(MyUDTF, self).__init__(), where MyUDTF is the name of your derived class. Use this to set up any state the UDTF needs to maintain across records. |
| BaseUDTF.process([args, ...]) | Yes | Called once for each input row. The arguments match the UDTF's input parameters as declared in SQL. |
| BaseUDTF.forward([args, ...]) | Yes (called inside process) | Emits one output row each time it is called. Call it once for each row you want to return. If no function signature is defined, convert all arguments to STRING before calling forward. |
| BaseUDTF.close() | No | Called once after the last record is processed. Use this to release resources or flush output. |
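
The order in which these methods run can be sketched locally. The BaseUDTF class below is a hypothetical stand-in written for this illustration so that the snippet runs outside MaxCompute; in a real UDTF you would import BaseUDTF from odps.udf instead. NumberedSplit and run_udtf are illustrative names, not part of any MaxCompute API.

```python
class BaseUDTF(object):
    """Hypothetical local stand-in for odps.udf.BaseUDTF (assumption)."""

    def __init__(self):
        self.rows = []  # collects whatever forward() emits

    def forward(self, *args):
        # In MaxCompute, forward() emits an output row; here it is recorded.
        self.rows.append(args)

    def close(self):
        pass


class NumberedSplit(BaseUDTF):
    def __init__(self):
        # Call the base-class initializer first, then set up state.
        super(NumberedSplit, self).__init__()
        self.seq = 0         # running counter kept across input rows
        self.closed = False

    def process(self, line):
        # Called once per input row; may emit zero or more output rows.
        if line is None:
            return
        for token in line.split(','):
            self.seq += 1
            self.forward(self.seq, token)

    def close(self):
        # Called once, after all input rows have been processed.
        self.closed = True


def run_udtf(udtf, inputs):
    """Drive the lifecycle by hand: process() per row, then close()."""
    for value in inputs:
        udtf.process(value)
    udtf.close()
    return udtf.rows


udtf = NumberedSplit()
rows = run_udtf(udtf, ['a,b', None, 'c'])
print(rows)  # [(1, 'a'), (2, 'b'), (3, 'c')]
```

Because the counter lives on the instance, numbering continues across input rows instead of restarting for each call to process.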

Example

The following UDTF splits a comma-separated string and emits each value as a separate row.

#coding:utf-8
from odps.udf import annotate
from odps.udf import BaseUDTF

@annotate('string -> string')
class Explode(BaseUDTF):
    def process(self, arg):
        props = arg.split(',')
        for p in props:
            self.forward(p)
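
To check the splitting logic without a MaxCompute project, a variant of the example can be run against local stand-ins. This is a minimal sketch: the BaseUDTF and annotate definitions below are hypothetical replacements for the odps.udf imports, explode_locally is an illustrative helper, and the None check is an addition that the example above does not have.

```python
class BaseUDTF(object):
    """Hypothetical local stand-in for odps.udf.BaseUDTF (assumption)."""

    def __init__(self):
        self.rows = []

    def forward(self, *args):
        self.rows.append(args)


def annotate(signature):
    """Hypothetical stand-in for odps.udf.annotate: only stores the string."""
    def decorator(cls):
        cls.signature = signature
        return cls
    return decorator


@annotate('string -> string')
class Explode(BaseUDTF):
    def process(self, arg):
        if arg is None:  # SQL NULL arrives as None; emit no rows
            return
        for p in arg.split(','):
            self.forward(p)


def explode_locally(value):
    """Run one input value through Explode and return the emitted values."""
    udtf = Explode()
    udtf.process(value)
    return [row[0] for row in udtf.rows]


print(explode_locally('a,b,c'))  # ['a', 'b', 'c']
print(explode_locally(''))       # ['']
print(explode_locally(None))     # []
```

Note that an empty string still produces one row containing an empty string, because ''.split(',') returns [''] in Python.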

Function signatures and data types

Signature format

@annotate('arg_type_list -> type_list')
  • arg_type_list: comma-separated list of input parameter types. Use * to accept any number of arguments, or leave blank to accept no arguments.

  • type_list: comma-separated list of return value types. A UDTF can return multiple columns.

The following table shows valid signature examples.

| Signature | Input types | Return types |
| --- | --- | --- |
| @annotate('bigint,boolean->string,datetime') | BIGINT, BOOLEAN | STRING, DATETIME |
| @annotate('*->string,datetime') | Any number of arguments | STRING, DATETIME |
| @annotate('->double,bigint,string') | None | DOUBLE, BIGINT, STRING |
| @annotate("array<string>,struct<a1:bigint,b1:string>,string->map<string,bigint>,struct<b1:bigint>") | ARRAY, STRUCT, STRING | MAP, STRUCT |

During semantic parsing, MaxCompute checks that the data types of actual arguments match the signature. A mismatch returns an error.
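
To make the signature format concrete, the following sketch splits a signature string into its input and return type lists. It only illustrates how to read the format; it is not MaxCompute's parser, and split_types and parse_signature are hypothetical helper names.

```python
def split_types(type_list):
    """Split a comma-separated type list, ignoring commas nested inside <...>."""
    parts, current, depth = [], [], 0
    for ch in type_list:
        if ch == '<':
            depth += 1
        elif ch == '>':
            depth -= 1
        if ch == ',' and depth == 0:
            parts.append(''.join(current).strip())
            current = []
        else:
            current.append(ch)
    last = ''.join(current).strip()
    if last:
        parts.append(last)
    return parts


def parse_signature(signature):
    """Return (input_types, return_types) for an 'args -> returns' string."""
    args, _, returns = signature.partition('->')
    return split_types(args), split_types(returns)


print(parse_signature('bigint,boolean->string,datetime'))
# (['bigint', 'boolean'], ['string', 'datetime'])
print(parse_signature('->double,bigint,string'))
# ([], ['double', 'bigint', 'string'])
```

Commas inside complex types such as struct<a1:bigint,b1:string> belong to a single type, which is why the helper tracks angle-bracket depth instead of calling str.split directly.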

Note

The available data types depend on the data type edition of your MaxCompute project. For more information, see Data type editions.

Data type mappings

Write UDTF code using the Python types that correspond to MaxCompute SQL types.

| MaxCompute SQL type | Python 2 type |
| --- | --- |
| BIGINT | int |
| STRING | str |
| DOUBLE | float |
| BOOLEAN | bool |
| DATETIME | int (milliseconds since January 1, 1970, 00:00:00 UTC) |
| FLOAT | float |
| CHAR | str |
| VARCHAR | str |
| BINARY | bytearray |
| DATE | int |
| DECIMAL | decimal.Decimal |
| ARRAY | list |
| MAP | dict |
| STRUCT | collections.namedtuple |

Additional notes on type handling:

  • NULL in MaxCompute SQL maps to None in Python.

  • odps.udf.int(value, silent=True) returns None instead of raising an error when the value cannot be converted to int.
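
Because DATETIME values arrive as integer milliseconds and NULL arrives as None, UDTF code often needs small conversion helpers. The sketch below uses only the standard library; ms_to_datetime and datetime_to_ms are hypothetical helper names, and the result is a naive datetime that should be read as UTC.

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)


def ms_to_datetime(ms):
    """Convert epoch milliseconds (a DATETIME argument) to a naive UTC datetime."""
    if ms is None:  # SQL NULL
        return None
    return EPOCH + datetime.timedelta(milliseconds=ms)


def datetime_to_ms(dt):
    """Convert a naive UTC datetime back to epoch milliseconds."""
    if dt is None:
        return None
    delta = dt - EPOCH
    # Integer arithmetic avoids floating-point rounding on large values.
    return (delta.days * 86400 + delta.seconds) * 1000 + delta.microseconds // 1000


print(ms_to_datetime(0))                                # 1970-01-01 00:00:00
print(datetime_to_ms(datetime.datetime(2020, 1, 1)))    # 1577836800000
```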

Reference file and table resources

Use the odps.distcache module to load file or table resources into your UDTF.

  • get_cache_file(resource_name): returns a file-like object for the named file resource. Call close() on the object when done. Declare the file resource when registering the UDTF—otherwise, the call fails at runtime.

  • get_cache_table(resource_name): returns a generator over the named table resource. Each iteration yields a record as a list (ARRAY type).

The following example loads a JSON file and a table resource, then uses them to look up ad IDs by page ID.

# -*- coding: utf-8 -*-
from odps.udf import annotate
from odps.udf import BaseUDTF
from odps.distcache import get_cache_file
from odps.distcache import get_cache_table

@annotate('string -> string, bigint')
class UDTFExample(BaseUDTF):
    def __init__(self):
        import json
        # Load the JSON file resource into a dict
        cache_file = get_cache_file('test_json.txt')
        self.my_dict = json.load(cache_file)
        cache_file.close()
        # Merge records from the table resource
        records = list(get_cache_table('table_resource1'))
        for record in records:
            self.my_dict[record[0]] = [record[1]]

    def process(self, pageid):
        # Emit one row per ad ID associated with the page
        for adid in self.my_dict[pageid]:
            self.forward(pageid, adid)
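
The resource-loading logic in this example can be exercised locally by passing in substitutes for the two resources. In the sketch below, build_lookup and ads_for_page are hypothetical helpers, the io.StringIO object and the list iterator stand in for the return values of get_cache_file and get_cache_table, and the sample page and ad IDs are invented for illustration. Unlike the example above, a page ID that is missing from the dictionary yields no rows instead of raising KeyError.

```python
import io
import json


def build_lookup(cache_file, table_records):
    """Merge a JSON file resource and table records into one page-to-ads dict."""
    try:
        lookup = json.load(cache_file)
    finally:
        cache_file.close()  # close the file object even if parsing fails
    for record in table_records:
        lookup[record[0]] = [record[1]]
    return lookup


def ads_for_page(lookup, pageid):
    """Return the (pageid, adid) rows a UDTF would forward for one page."""
    return [(pageid, adid) for adid in lookup.get(pageid, [])]


# Hypothetical stand-ins for get_cache_file('test_json.txt')
# and get_cache_table('table_resource1').
fake_file = io.StringIO(u'{"front_page": [1, 2]}')
fake_table = iter([['contact_page', 3]])

lookup = build_lookup(fake_file, fake_table)
print(ads_for_page(lookup, 'front_page'))  # [('front_page', 1), ('front_page', 2)]
print(ads_for_page(lookup, 'missing'))     # []
```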

Call the UDTF in MaxCompute SQL

After completing the development process, call the UDTF from MaxCompute SQL.

Limitations

MaxCompute runs Python 2 UDTF code in a sandbox environment. The following operations are not allowed:

  • Reading from or writing to local files

  • Starting subprocesses

  • Starting threads

  • Opening socket connections

  • Calling Python 2 UDFs from other systems

Upload only code that uses Python standard libraries. Modules or C extension modules that depend on the restricted operations above are not available.

Available C extension modules

The following C extension modules are available in the sandbox:

array, audioop, binascii, bisect, cmath, _codecs_cn, _codecs_hk, _codecs_iso2022, _codecs_jp, _codecs_kr, _codecs_tw, _collections, cStringIO, datetime, _functools, future_builtins, _heapq, _hashlib, itertools, _json, _locale, _lsprof, math, _md5, _multibytecodec, operator, _random, _sha256, _sha512, _sha, _struct, strop, time, unicodedata, _weakref, cPickle

All modules implemented purely in Python that do not depend on extension modules are also available.

Output size limit

Writing to sys.stdout or sys.stderr is capped at 20 KB. Characters beyond this limit are silently dropped.
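
Since output past the limit is dropped without warning, one option is to track how much diagnostic output has been written and stop before reaching the cap. This is a minimal sketch, assuming that 20 KB means 20 * 1024 bytes and that a single budget per stream is appropriate; CappedLog is a hypothetical helper, not part of MaxCompute.

```python
import io
import sys


class CappedLog(object):
    """Writes diagnostic lines to a stream until a byte budget is used up."""

    def __init__(self, stream=None, limit=20 * 1024):
        self.stream = stream if stream is not None else sys.stderr
        self.remaining = limit  # assumed budget in bytes
        self.dropped = 0        # messages skipped after the budget ran out

    def write(self, message):
        line = message + u'\n'
        size = len(line.encode('utf-8'))
        if size > self.remaining:
            self.dropped += 1
            return False
        self.stream.write(line)
        self.remaining -= size
        return True


# Demonstration with a tiny budget and an in-memory stream.
buf = io.StringIO()
log = CappedLog(stream=buf, limit=10)
first = log.write(u'hello')   # 6 bytes, fits in the 10-byte budget
second = log.write(u'world')  # 6 bytes, exceeds the remaining 4 bytes
print(buf.getvalue() == u'hello\n', log.dropped)  # True 1
```

Counting dropped messages makes the truncation visible in your own logic, which the sandbox itself does not report.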

Third-party libraries

Third-party libraries, such as NumPy, are pre-installed in the MaxCompute Python 2 environment. Local data access and most network I/O APIs are disabled for third-party libraries—only limited network I/O is available.

What's next