MaxCompute: Develop a UDF in Python 2

Last Updated: Mar 26, 2026

MaxCompute uses Python 2.7 to run user-defined functions (UDFs). This topic explains how to write a Python 2 UDF — from code structure and sandbox constraints to data type mappings and resource references.

UDF code structure

A Python 2 UDF consists of five components. One is optional or conditional; four are always required.

| Component | Required | Purpose |
| --- | --- | --- |
| Encoding declaration | Optional (required if code contains Chinese characters) | Declares the file encoding |
| Module import | Required | Imports the function signature and any resource modules |
| Function signature | Required | Defines input and return data types via @annotate |
| Custom Python class | Required | The organizational unit of UDF logic |
| evaluate method | Required | Defines the UDF's input parameters and return value; each class can have only one |

Encoding declaration

Add an encoding declaration at the top of any UDF file that contains Chinese characters. Without it, MaxCompute returns an error at runtime. Both formats below are equivalent:

#coding:utf-8
# -*- coding: utf-8 -*-

Module import

Every UDF must import the function signature module:

from odps.udf import annotate

To reference files or tables in UDF code, also import from odps.distcache:

from odps.distcache import get_cache_file   # for file resources
from odps.distcache import get_cache_table  # for table resources

Function signature

The @annotate decorator defines the data types of input parameters and return value. MaxCompute checks type consistency during semantic parsing and returns an error if types do not match.

@annotate("bigint,bigint->bigint")

For the full signature syntax, see Function signatures and data types.

Minimal example

The following example shows a complete, working UDF that adds two integers. It covers all required components.

#coding:utf-8
# Import the function signature.
from odps.udf import annotate

# Define input types (BIGINT, BIGINT) and return type (BIGINT).
@annotate("bigint,bigint->bigint")
class MyPlus(object):
    def evaluate(self, arg0, arg1):
        if None in (arg0, arg1):
            return None
        return arg0 + arg1

Note: Always handle None inputs explicitly. NULL values in MaxCompute SQL map to None in Python 2, so failing to check for None can cause unexpected errors.
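
To exercise the class before uploading it, a sketch like the following may help. The fallback annotate defined here is a local stand-in (an assumption for running outside MaxCompute), not the real decorator, and it performs no type checking.

```python
# Use the real decorator when the MaxCompute runtime provides it; otherwise
# fall back to a no-op stand-in so the class can be run locally.
try:
    from odps.udf import annotate
except ImportError:
    def annotate(signature):
        # Assumption: returning the class unchanged is enough for local tests.
        return lambda cls: cls


@annotate("bigint,bigint->bigint")
class MyPlus(object):
    def evaluate(self, arg0, arg1):
        # SQL NULL arrives as None, so short-circuit before adding.
        if None in (arg0, arg1):
            return None
        return arg0 + arg1


udf = MyPlus()
print(udf.evaluate(1, 2))     # 3
print(udf.evaluate(1, None))  # None
```

A local run like this only checks the Python logic. The signature itself is validated by MaxCompute, not by the stand-in.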

Limitations

Prohibited operations

MaxCompute runs Python 2 UDF code inside a sandbox. The following operations are not permitted:

  • Reading from or writing to local files

  • Starting subprocesses

  • Starting threads

  • Opening socket connections

  • Calling Python 2 UDFs from external systems

Because of these restrictions, all uploaded code must rely on Python standard libraries. Modules or C extension modules that perform the prohibited operations above cannot be used.

Available standard library modules

All pure-Python modules in the Python standard library (those with no dependency on C extension modules) are available.

The following C extension modules are also available:

  • array, audioop

  • binascii, bisect

  • cmath, _codecs_cn, _codecs_hk, _codecs_iso2022, _codecs_jp, _codecs_kr, _codecs_tw, _collections, cStringIO

  • datetime

  • _functools, future_builtins

  • _heapq, _hashlib

  • itertools

  • _json

  • _locale, _lsprof

  • math, _md5, _multibytecodec

  • operator

  • _random

  • _sha256, _sha512, _sha, _struct, strop

  • time

  • unicodedata

  • _weakref

  • cPickle
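
As a sketch of a UDF that stays within the modules listed above, the following computes an MD5 hex digest with hashlib. It assumes hashlib is usable in the sandbox because its backing modules (_hashlib, _md5) are on the list. The class name Md5Hex and the fallback annotate are illustrative only.

```python
import hashlib

try:
    from odps.udf import annotate
except ImportError:
    def annotate(signature):
        # Local stand-in so the sketch runs outside MaxCompute.
        return lambda cls: cls


@annotate("string->string")
class Md5Hex(object):
    def evaluate(self, value):
        if value is None:
            return None
        # A Python 2 str is already a byte string; the encode step only
        # matters when the value is a Unicode string.
        if not isinstance(value, bytes):
            value = value.encode("utf-8")
        return hashlib.md5(value).hexdigest()
```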

Note: The maximum size of data that can be written to sys.stdout or sys.stderr is 20 KB. Any output beyond this limit is silently discarded.
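
One way to avoid losing output silently is to track how much has been written and stop logging once the budget is spent. The helper below is hypothetical (BoundedLogger is not part of MaxCompute) and assumes that 20 KB means 20 * 1024 bytes.

```python
import sys

LOG_BUDGET = 20 * 1024  # assumption: "20 KB" is 20 * 1024 bytes


class BoundedLogger(object):
    """Hypothetical helper: writes lines to a stream until a budget is used up."""

    def __init__(self, stream=None, budget=LOG_BUDGET):
        self.stream = stream if stream is not None else sys.stderr
        self.remaining = budget

    def log(self, message):
        line = message + "\n"
        # len() counts characters, which equals bytes for ASCII messages.
        if len(line) > self.remaining:
            return False  # skip the line rather than emit a truncated one
        self.stream.write(line)
        self.remaining -= len(line)
        return True
```

A UDF could create one logger in __init__ and call log() from evaluate, so that the first messages of a run are the ones that are kept.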

Third-party libraries

Third-party libraries, such as NumPy, are pre-installed in the MaxCompute Python 2 environment as supplements to the standard library.

Note: Third-party library usage is subject to the same sandbox restrictions. Local data access is not allowed, and network I/O is limited. The related APIs in affected libraries are disabled.

Function signatures and data types

Signature format

@annotate('<arg_type_list>-><return_type>')

arg_type_list specifies input parameter types, separated by commas. It accepts two special forms:

  • * — accepts any number of input parameters

  • '' (empty string) — accepts no input parameters

return_type specifies the type of the single return value.

Supported input and return types: BIGINT, STRING, DOUBLE, BOOLEAN, DATETIME, DECIMAL, FLOAT, BINARY, DATE, DECIMAL(precision,scale), CHAR, VARCHAR, and the complex types ARRAY, MAP, STRUCT (including nested complex types).

Note: The data types available in function signatures depend on the MaxCompute data type edition used by your project. For details, see Data type editions.
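
For the * form, a minimal sketch follows. It assumes the arguments reach evaluate as ordinary positional arguments, so *args collects them. The class name ConcatAll and the fallback annotate are illustrative.

```python
try:
    from odps.udf import annotate
except ImportError:
    def annotate(signature):
        # Local stand-in so the sketch runs outside MaxCompute.
        return lambda cls: cls


@annotate("*->string")
class ConcatAll(object):
    def evaluate(self, *args):
        # Drop NULLs (None) and join the remaining values with commas.
        parts = [str(a) for a in args if a is not None]
        return ",".join(parts)
```

Because the signature places no constraint on the inputs, the method has to handle any mix of types itself.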

Signature examples

| Signature | Description |
| --- | --- |
| 'bigint,double->string' | Takes BIGINT and DOUBLE inputs; returns STRING |
| '*->string' | Takes any number of inputs; returns STRING |
| '->double' | Takes no inputs; returns DOUBLE |
| 'array<bigint>->struct<x:string, y:int>' | Takes ARRAY<BIGINT>; returns STRUCT<x:STRING, y:INT> |
| '->map<bigint, string>' | Takes no inputs; returns MAP<BIGINT, STRING> |
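
A complex-type signature can be sketched as follows, assuming an ARRAY argument arrives as a Python list whose elements may be None. ArraySum and the fallback annotate are illustrative names, not part of MaxCompute.

```python
try:
    from odps.udf import annotate
except ImportError:
    def annotate(signature):
        # Local stand-in so the sketch runs outside MaxCompute.
        return lambda cls: cls


@annotate("array<bigint>->bigint")
class ArraySum(object):
    def evaluate(self, values):
        # A NULL array yields NULL; NULL elements are skipped.
        if values is None:
            return None
        return sum(v for v in values if v is not None)
```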

Data type mappings

Write Python UDF logic using the Python 2 types that correspond to MaxCompute SQL types. Type mismatches cause runtime errors.

| MaxCompute SQL type | Python 2 type | Notes |
| --- | --- | --- |
| BIGINT | int | |
| STRING | str | |
| DOUBLE | float | |
| BOOLEAN | bool | |
| DATETIME | int | Stored as milliseconds since 00:00:00 Thursday, January 1, 1970 (Unix epoch). Use the datetime module to work with these values. |
| FLOAT | float | |
| CHAR | str | |
| VARCHAR | str | |
| BINARY | bytearray | |
| DATE | int | |
| DECIMAL | decimal.Decimal | |
| ARRAY | list | |
| MAP | dict | |
| STRUCT | collections.namedtuple | |
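
Because DATETIME values arrive as integer milliseconds since the epoch, a conversion step is usually needed before doing date arithmetic. The helpers below are a sketch with hypothetical names. They treat the value as UTC, which is an assumption, since this page does not describe time zone handling.

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)


def millis_to_datetime(millis):
    """Convert epoch milliseconds to a naive datetime (assumed UTC)."""
    if millis is None:
        return None
    return EPOCH + datetime.timedelta(milliseconds=millis)


def datetime_to_millis(value):
    """Convert a naive datetime (assumed UTC) back to epoch milliseconds."""
    if value is None:
        return None
    delta = value - EPOCH
    seconds = delta.days * 86400 + delta.seconds
    return seconds * 1000 + delta.microseconds // 1000


print(millis_to_datetime(86400000))  # 1970-01-02 00:00:00
```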

Additional notes:

  • NULL in MaxCompute SQL maps to None in Python 2.

  • odps.udf.int(value) accepts a silent parameter. If silent is set to True and the value cannot be converted to int, the function returns None instead of raising an error.
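
The silent behavior can be pictured with a local stand-in. The function below only mirrors the description above. It is not the library's implementation, and the name to_int and the default value of silent are assumptions.

```python
def to_int(value, silent=False):
    """Local stand-in for the described conversion; not odps.udf.int itself."""
    try:
        return int(value)
    except (TypeError, ValueError):
        if silent:
            return None  # swallow the failure and report NULL
        raise            # otherwise let the conversion error propagate


print(to_int("42"))                # 42
print(to_int("abc", silent=True))  # None
```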

Reference resources

Use the odps.distcache module to load file or table resources into UDF code at initialization time.

Reference a file

get_cache_file(resource_name) returns a file-like object with the content of the specified file resource.

  • resource_name must be the name of an existing file resource in your MaxCompute project. If the name is invalid or the file does not exist, an error is returned.

  • Declare the file resource when you create the UDF. If you do not declare it, calling the UDF returns an error.

  • Call close() on the returned object when you are done with it.

from odps.udf import annotate
from odps.distcache import get_cache_file

@annotate('bigint->string')
class DistCacheExample(object):
    def __init__(self):
        cache_file = get_cache_file('test_distcache.txt')
        kv = {}
        for line in cache_file:
            line = line.strip()
            if not line:
                continue
            k, v = line.split()
            kv[int(k)] = v
        cache_file.close()
        self.kv = kv

    def evaluate(self, arg):
        return self.kv.get(arg)
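
The parsing step can be tested without MaxCompute by moving it into a function that accepts any iterable of lines. In this sketch, load_kv is a hypothetical helper, and io.StringIO stands in for the object that get_cache_file returns.

```python
import io


def load_kv(lines):
    """Build an int -> str mapping from 'key value' lines."""
    kv = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the first run of whitespace so values may contain spaces.
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip malformed lines instead of raising
        key, value = parts
        kv[int(key)] = value
    return kv


sample = io.StringIO(u"1 apple\n\n2 banana split\nbroken\n")
table = load_kv(sample)
sample.close()
```

Unlike the example above, this variant tolerates values that contain spaces and skips lines with no value. Whether to skip or fail on malformed input is a design choice that depends on how much the resource file is trusted.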

Reference a table

get_cache_table(resource_name) returns a generator. Each iteration yields one record as a list.

  • resource_name must be the name of an existing table resource in your MaxCompute project. If the name is invalid or the table does not exist, an error is returned.

from odps.udf import annotate
from odps.distcache import get_cache_table

@annotate('->string')
class DistCacheTableExample(object):
    def __init__(self):
        self.records = list(get_cache_table('udf_test'))
        self.counter = 0
        self.ln = len(self.records)

    def evaluate(self):
        if self.counter > self.ln - 1:
            return None
        ret = self.records[self.counter]
        self.counter += 1
        return str(ret)
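
A table resource is often more useful as a lookup than as a sequence. The sketch below builds a dictionary once in __init__. fake_cache_table is a hypothetical stand-in for get_cache_table, and the two-column layout (key, value) is an assumption about the table.

```python
try:
    from odps.udf import annotate
except ImportError:
    def annotate(signature):
        # Local stand-in so the sketch runs outside MaxCompute.
        return lambda cls: cls


def fake_cache_table(resource_name):
    """Hypothetical stand-in: yields one record per row, each as a list."""
    for row in ([1, "apple"], [2, "banana"], [3, None]):
        yield row


@annotate("bigint->string")
class TableLookup(object):
    def __init__(self, table_source=fake_cache_table):
        # Read the table once and index it by the first column.
        self.lookup = {}
        for record in table_source("udf_test"):
            self.lookup[record[0]] = record[1]

    def evaluate(self, key):
        # Unknown keys and NULL values both come back as None.
        return self.lookup.get(key)
```

In a real UDF, the default for table_source would be get_cache_table, and the table resource would need to be declared when the UDF is created.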

Development process

The development process for Python 2 UDFs — including setup, writing code, uploading the Python program, creating the UDF, debugging, and calling it — is the same as for Python 3 UDFs.

Supported development tools:

  • MaxCompute Studio

  • DataWorks

  • MaxCompute client (odpscmd)

Call a Python 2 UDF

After developing a Python 2 UDF, call it from MaxCompute SQL using one of the following approaches: