MaxCompute runs Python 2 user-defined table-valued functions (UDTFs) using Python 2.7. A UDTF takes one input row and returns zero or more output rows, making it useful for operations like splitting or expanding data.
To create and use a Python 2 UDTF:
Write a Python class that extends BaseUDTF and implements the required methods.
Register the class as a UDTF in MaxCompute, then call it in MaxCompute SQL.
UDTF code structure
A Python 2 UDTF consists of up to five components.
| Component | Required | Description |
|---|---|---|
| Encoding declaration | No | Declares the file encoding. Use #coding:utf-8 or # -*- coding: utf-8 -*-. Add this if the code contains non-ASCII characters such as Chinese; without it, MaxCompute returns an error at runtime. |
| Module imports | Yes | Must include from odps.udf import annotate and from odps.udf import BaseUDTF. Add from odps.distcache import get_cache_file or from odps.distcache import get_cache_table if the UDTF references file or table resources. |
| Function signature | No | Annotates the UDTF with @annotate(<signature>) to declare input and output data types. Without a signature, MaxCompute accepts any input type but treats all output values as STRING. |
| Derived class | Yes | A Python class that extends BaseUDTF. This class contains all the UDTF logic. |
| Class methods | Yes | Implement process at minimum. See the methods table below. |
Methods
| Method | Required | Description |
|---|---|---|
| BaseUDTF.__init__() | No | Initializes state before the first record is processed. If you override __init__, call the base-class initializer at the start, for example super(MyUDTF, self).__init__() where MyUDTF is your derived class. Use this method to set up any state the UDTF needs to maintain across records. |
| BaseUDTF.process([args, ...]) | Yes | Called once for each input row. The arguments match the UDTF's input parameters as declared in SQL. |
| BaseUDTF.forward([args, ...]) | Yes (called inside process) | Emits one output row each time it is called. Call it once for each row you want to return. If no function signature is defined, convert all arguments to STRING before calling forward. |
| BaseUDTF.close() | No | Called once after the last record is processed. Use this to release resources or flush output. |
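The call order described in the methods table can be traced locally with a small sketch. Everything here is an assumption made for illustration: the BaseUDTF class below is a local stand-in (not the real odps.udf.BaseUDTF) that only records forwarded rows, CountingExplode is a hypothetical UDTF, and run_locally is a hypothetical driver that imitates what MaxCompute does for each input row.

```python
# Local stand-in for odps.udf.BaseUDTF so the sketch runs outside MaxCompute.
# It is an assumption for illustration, not the real base class.
class BaseUDTF(object):
    def __init__(self):
        self.rows = []              # collected output rows (stand-in only)

    def forward(self, *args):
        self.rows.append(args)      # the real forward() emits a row instead

    def close(self):
        pass


class CountingExplode(BaseUDTF):
    """Hypothetical UDTF: splits a string and emits a total in close()."""

    def __init__(self):
        # Initialize the base class first, then set up per-instance state.
        super(CountingExplode, self).__init__()
        self.total = 0              # state kept across input rows

    def process(self, arg):
        for part in arg.split(','):
            self.total += 1
            self.forward(part)      # one output row per call

    def close(self):
        # Runs once after the last input row: flush a summary row.
        self.forward('total=%d' % self.total)


def run_locally(udtf, inputs):
    """Drive the lifecycle by hand: process() per row, then close()."""
    for value in inputs:
        udtf.process(value)
    udtf.close()
    return udtf.rows


rows = run_locally(CountingExplode(), ['a,b', 'c'])
```

With the two sample inputs, rows holds one tuple per forward call, followed by the summary row emitted from close.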
Example
The following UDTF splits a comma-separated string and emits each value as a separate row.
    #coding:utf-8
    from odps.udf import annotate
    from odps.udf import BaseUDTF

    @annotate('string -> string')
    class Explode(BaseUDTF):
        def process(self, arg):
            props = arg.split(',')
            for p in props:
                self.forward(p)

Function signatures and data types
Signature format
    @annotate('arg_type_list -> type_list')

arg_type_list: comma-separated list of input parameter types. Use * to accept any number of arguments, or leave blank to accept no arguments.
type_list: comma-separated list of return value types. A UDTF can return multiple columns.
The following table shows valid signature examples.
| Signature | Input types | Return types |
|---|---|---|
| @annotate('bigint,boolean->string,datetime') | BIGINT, BOOLEAN | STRING, DATETIME |
| @annotate('*->string,datetime') | Any number of arguments | STRING, DATETIME |
| @annotate('->double,bigint,string') | None | DOUBLE, BIGINT, STRING |
| @annotate("array<string>,struct<a1:bigint,b1:string>,string->map<string,bigint>,struct<b1:bigint>") | ARRAY, STRUCT, STRING | MAP, STRUCT |
During semantic parsing, MaxCompute checks that the data types of actual arguments match the signature. A mismatch returns an error.
The available data types depend on the data type edition of your MaxCompute project. For more information, see Data type editions.
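To make the signature format concrete, the following sketch splits a signature string into its input and output type lists. It is not MaxCompute's parser: parse_signature and split_types are hypothetical helpers written only to show how the two sides of the arrow and the nested complex types are read, and they do no type validation.

```python
def split_types(type_list):
    """Split a comma-separated type list, ignoring commas inside <...>."""
    parts, depth, current = [], 0, []
    for ch in type_list:
        if ch == '<':
            depth += 1
        elif ch == '>':
            depth -= 1
        if ch == ',' and depth == 0:
            # A top-level comma ends the current type name.
            parts.append(''.join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = ''.join(current).strip()
    if tail:
        parts.append(tail)
    return parts


def parse_signature(signature):
    """Return (input_types, output_types) for 'arg_type_list -> type_list'."""
    # Split on the arrow first so its '>' is never counted as a bracket.
    args, _, returns = signature.partition('->')
    return split_types(args), split_types(returns)


simple = parse_signature('bigint,boolean->string,datetime')
no_args = parse_signature('->double,bigint,string')
nested = parse_signature(
    'array<string>,struct<a1:bigint,b1:string>,string'
    '->map<string,bigint>,struct<b1:bigint>')
```

An empty left side yields an empty input list, and a comma inside struct<...> or map<...> stays part of a single type rather than starting a new one.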
Data type mappings
Write UDTF code using the Python types that correspond to MaxCompute SQL types.
| MaxCompute SQL type | Python 2 type |
|---|---|
| BIGINT | int |
| STRING | str |
| DOUBLE | float |
| BOOLEAN | bool |
| DATETIME | int (milliseconds since January 1, 1970, 00:00:00 UTC) |
| FLOAT | float |
| CHAR | str |
| VARCHAR | str |
| BINARY | bytearray |
| DATE | int |
| DECIMAL | decimal.Decimal |
| ARRAY | list |
| MAP | dict |
| STRUCT | collections.namedtuple |
Additional notes on type handling:
NULL in MaxCompute SQL maps to None in Python.
odps.udf.int(value, silent=True) returns None instead of raising an error when the value cannot be converted to int.
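Because DATETIME values reach the UDTF as integer milliseconds since the epoch, code that needs calendar arithmetic usually converts them first. The sketch below shows one way to do that with the standard library only. The helper names datetime_from_millis and millis_from_datetime are hypothetical, and the conversion assumes naive datetime objects that represent UTC.

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)   # DATETIME values count from here (UTC)


def datetime_from_millis(millis):
    """Convert a DATETIME argument (int milliseconds) to a naive UTC datetime."""
    if millis is None:                   # SQL NULL arrives as None
        return None
    return EPOCH + datetime.timedelta(milliseconds=millis)


def millis_from_datetime(value):
    """Convert a naive UTC datetime back to int milliseconds."""
    if value is None:
        return None
    delta = value - EPOCH
    whole_seconds = delta.days * 86400 + delta.seconds
    return whole_seconds * 1000 + delta.microseconds // 1000


one_day = datetime_from_millis(86400000)
round_trip = millis_from_datetime(datetime.datetime(1970, 1, 2, 0, 0, 1, 500000))
```

Both helpers pass None through unchanged, so a NULL input can be forwarded as a NULL output without extra checks.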
Reference file and table resources
Use the odps.distcache module to load file or table resources into your UDTF.
get_cache_file(resource_name): returns a file-like object for the named file resource. Call close() on the object when done. Declare the file resource when registering the UDTF; otherwise, the call fails at runtime.
get_cache_table(resource_name): returns a generator over the named table resource. Each iteration yields a record as a list (ARRAY type).
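The loading logic built on these two calls can be tried outside MaxCompute by swapping in in-memory stand-ins. In this sketch, fake_cache_file and fake_cache_table are hypothetical replacements that only mimic the return shapes described above (a file-like object and a generator of list records), load_lookup is a hypothetical helper, and the resource contents are invented sample data.

```python
import io
import json


def fake_cache_file(resource_name):
    """Hypothetical stand-in for get_cache_file: returns a file-like object."""
    sample = {'test_json.txt': u'{"front_page": [1, 2]}'}    # invented content
    return io.StringIO(sample[resource_name])


def fake_cache_table(resource_name):
    """Hypothetical stand-in for get_cache_table: yields one list per record."""
    sample = {'table_resource1': [['contact_page', 3]]}      # invented content
    for record in sample[resource_name]:
        yield record


def load_lookup(open_file, open_table):
    """Build a page ID -> ad ID list mapping from one file and one table."""
    cache_file = open_file('test_json.txt')
    try:
        lookup = json.load(cache_file)
    finally:
        cache_file.close()               # always release the file-like object
    for record in open_table('table_resource1'):
        lookup[record[0]] = [record[1]]  # first column is the key
    return lookup


lookup = load_lookup(fake_cache_file, fake_cache_table)
```

Passing the loader functions as arguments keeps the merging logic testable. Inside a real UDTF, the same function could presumably be called with get_cache_file and get_cache_table instead.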
The following example loads a JSON file and a table resource, then uses them to look up ad IDs by page ID.
    # -*- coding: utf-8 -*-
    from odps.udf import annotate
    from odps.udf import BaseUDTF
    from odps.distcache import get_cache_file
    from odps.distcache import get_cache_table

    @annotate('string -> string, bigint')
    class UDTFExample(BaseUDTF):
        def __init__(self):
            import json
            # Load the JSON file resource into a dict
            cache_file = get_cache_file('test_json.txt')
            self.my_dict = json.load(cache_file)
            cache_file.close()
            # Merge records from the table resource
            records = list(get_cache_table('table_resource1'))
            for record in records:
                self.my_dict[record[0]] = [record[1]]

        def process(self, pageid):
            # Emit one row per ad ID associated with the page
            for adid in self.my_dict[pageid]:
                self.forward(pageid, adid)

Call the UDTF in MaxCompute SQL
After you develop and register the UDTF, call it from MaxCompute SQL in one of the following ways:
Within a project: Call the UDTF the same way you call built-in functions.
Across projects: To use a UDTF from project B in project A, prefix the function name with the project name:
    SELECT B:udf_in_other_project(arg0, arg1) AS res FROM table_t;

For more information, see Cross-project resource access based on packages.
Limitations
MaxCompute runs Python 2 UDTF code in a sandbox environment. The following operations are not allowed:
Reading from or writing to local files
Starting subprocesses
Starting threads
Opening socket connections
Calling Python 2 UDFs from other systems
Because of these restrictions, upload only code that is built on the Python standard library. Standard library modules, including C extension modules, that depend on any of the restricted operations above are not available.
Available C extension modules
The following C extension modules are available in the sandbox:
array, audioop, binascii, bisect, cmath, _codecs_cn, _codecs_hk, _codecs_iso2022, _codecs_jp, _codecs_kr, _codecs_tw, _collections, cStringIO, datetime, _functools, future_builtins, _heapq, _hashlib, itertools, _json, _locale, _lsprof, math, _md5, _multibytecodec, operator, _random, _sha256, _sha512, _sha, _struct, strop, time, unicodedata, _weakref, cPickle
All modules implemented purely in Python that do not depend on extension modules are also available.
Output size limit
Writing to sys.stdout or sys.stderr is capped at 20 KB. Characters beyond this limit are silently dropped.
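Since output past the cap is dropped without warning, a UDTF that logs for debugging may want to track its own budget. The following is a minimal sketch of that idea. BoundedLog is a hypothetical helper, not part of MaxCompute, and it assumes the 20 KB cap is counted in bytes per stream, which the text above does not specify.

```python
import io
import sys

STD_STREAM_LIMIT = 20 * 1024    # 20 KB cap stated above


class BoundedLog(object):
    """Hypothetical helper: stops writing once a byte budget is used up."""

    def __init__(self, stream=None, budget=STD_STREAM_LIMIT):
        self.stream = stream if stream is not None else sys.stderr
        self.remaining = budget
        self.skipped = 0        # messages refused after the budget ran out

    def write(self, message):
        line = message + '\n'
        # Measure the encoded size; byte strings are counted as they are.
        data = line if isinstance(line, bytes) else line.encode('utf-8')
        if len(data) > self.remaining:
            self.skipped += 1
            return False
        self.stream.write(line)
        self.remaining -= len(data)
        return True


# Small demonstration against an in-memory stream with a 10-byte budget.
demo = io.StringIO()
log = BoundedLog(stream=demo, budget=10)
first = log.write(u'hello')     # 6 bytes with the newline, fits
second = log.write(u'world')    # needs 6 bytes, only 4 remain
```

The skipped counter gives the code a way to notice that messages were withheld, which the silent truncation described above does not.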
Third-party libraries
Third-party libraries, such as NumPy, are pre-installed in the MaxCompute Python 2 environment. Local data access and most network I/O APIs are disabled for third-party libraries—only limited network I/O is available.