MaxCompute uses Python 2.7 to run user-defined functions (UDFs). This topic explains how to write a Python 2 UDF — from code structure and sandbox constraints to data type mappings and resource references.
UDF code structure
A Python 2 UDF consists of five components. The encoding declaration is optional unless the code contains Chinese characters; the other four are always required.
| Component | Required | Purpose |
|---|---|---|
| Encoding declaration | Optional (required if code contains Chinese characters) | Declares the file encoding |
| Module import | Required | Imports the function signature and any resource modules |
| Function signature | Required | Defines input and return data types via @annotate |
| Custom Python class | Required | The organizational unit of UDF logic |
| evaluate method | Required | Defines the UDF's input parameters and return value; each class can have only one |
Encoding declaration
Add an encoding declaration at the top of any UDF file that contains Chinese characters. Without it, MaxCompute returns an error at runtime. Both formats below are equivalent:
#coding:utf-8
# -*- coding: utf-8 -*-
Module import
Every UDF must import the function signature module:
from odps.udf import annotate
To reference files or tables in UDF code, also import from odps.distcache:
from odps.distcache import get_cache_file # for file resources
from odps.distcache import get_cache_table # for table resources
Function signature
The @annotate decorator defines the data types of input parameters and return value. MaxCompute checks type consistency during semantic parsing and returns an error if types do not match.
@annotate("bigint,bigint->bigint")
For the full signature syntax, see Function signatures and data types.
Minimal example
The following example shows a complete, working UDF that adds two integers. It covers all required components.
#coding:utf-8
# Import the function signature.
from odps.udf import annotate

# Define input types (BIGINT, BIGINT) and return type (BIGINT).
@annotate("bigint,bigint->bigint")
class MyPlus(object):
    def evaluate(self, arg0, arg1):
        if None in (arg0, arg1):
            return None
        return arg0 + arg1
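Because the UDF is an ordinary Python class, the MyPlus logic above can be smoke-tested locally before upload. The following is a minimal sketch, assuming the odps package may not be installed on the development machine: in that case it falls back to a no-op stand-in for annotate. The stand-in is a local testing assumption and is not part of MaxCompute.

```python
try:
    from odps.udf import annotate
except ImportError:
    # Hypothetical local stand-in so the class can be defined without odps.
    # It ignores the signature and returns the class unchanged.
    def annotate(signature):
        def decorator(cls):
            return cls
        return decorator


@annotate("bigint,bigint->bigint")
class MyPlus(object):
    def evaluate(self, arg0, arg1):
        # SQL NULL arrives as None; propagate it instead of raising TypeError.
        if None in (arg0, arg1):
            return None
        return arg0 + arg1


udf = MyPlus()
print(udf.evaluate(1, 2))     # 3
print(udf.evaluate(1, None))  # None
```

Calling evaluate directly only checks the Python logic; type checking against the signature still happens in MaxCompute.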
Handle None inputs explicitly. NULL values in MaxCompute SQL map to None in Python 2, so failing to check for None can cause unexpected errors.
Limitations
Prohibited operations
MaxCompute runs Python 2 UDF code inside a sandbox. The following operations are not permitted:
- Reading from or writing to local files
- Starting subprocesses
- Starting threads
- Opening socket connections
- Calling Python 2 UDFs from external systems
Because of these restrictions, all uploaded code must rely on Python standard libraries. Modules or C extension modules that perform the prohibited operations above cannot be used.
Available standard library modules
All pure-Python modules in the Python standard library (those with no dependency on C extension modules) are available.
The following C extension modules are also available:
- array, audioop
- binascii, bisect
- cmath, _codecs_cn, _codecs_hk, _codecs_iso2022, _codecs_jp, _codecs_kr, _codecs_tw, _collections, cStringIO
- datetime
- _functools, future_builtins
- _heapq, _hashlib
- itertools
- _json
- _locale, _lsprof
- math, _md5, _multibytecodec
- operator
- _random
- _sha256, _sha512, _sha, _struct, strop
- time
- unicodedata
- _weakref
- cPickle
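As an illustration of using one of the listed modules, the sketch below wraps math.log in a UDF. The class name SafeLog and the choice to return NULL for non-positive inputs are assumptions made for this example, and the annotate fallback exists only so the code runs outside MaxCompute.

```python
import math

try:
    from odps.udf import annotate
except ImportError:
    # Hypothetical local stand-in for running outside MaxCompute.
    def annotate(signature):
        def decorator(cls):
            return cls
        return decorator


@annotate("double->double")
class SafeLog(object):
    def evaluate(self, x):
        # NULL and non-positive inputs have no real logarithm; return NULL
        # rather than letting math.log raise ValueError.
        if x is None or x <= 0:
            return None
        return math.log(x)
```

Returning None for invalid input keeps one bad row from failing the whole job; raising an error instead is an equally valid design if bad data should stop the query.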
The maximum size of output written to sys.stdout or sys.stderr is 20 KB. Any output beyond this limit is silently discarded.
Third-party libraries
Third-party libraries, such as NumPy, are pre-installed in the MaxCompute Python 2 environment as supplements to the standard library.
Function signatures and data types
Signature format
@annotate('<arg_type_list>-><return_type>')
arg_type_list specifies input parameter types, separated by commas. It accepts two special forms:
- * (asterisk): accepts any number of input parameters
- '' (empty string): accepts no input parameters
return_type specifies the type of the single return value.
Supported input and return types: BIGINT, STRING, DOUBLE, BOOLEAN, DATETIME, DECIMAL, FLOAT, BINARY, DATE, DECIMAL(precision,scale), CHAR, VARCHAR, and the complex types ARRAY, MAP, STRUCT (including nested complex types).
Signature examples
| Signature | Description |
|---|---|
| 'bigint,double->string' | Takes BIGINT and DOUBLE inputs; returns STRING |
| '*->string' | Takes any number of inputs; returns STRING |
| '->double' | Takes no inputs; returns DOUBLE |
| 'array<bigint>->struct<x:string, y:int>' | Takes ARRAY\<BIGINT\>; returns STRUCT\<x:STRING, y:INT\> |
| '->map<bigint, string>' | Takes no inputs; returns MAP\<BIGINT, STRING\> |
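To make the '*->string' form concrete, here is a minimal sketch of a UDF that accepts a variable number of inputs. It assumes that the inputs reach evaluate as ordinary positional arguments, so they can be collected with *args; the class name JoinArgs and the comma separator are illustrative choices, and the annotate fallback is only for running the code outside MaxCompute.

```python
try:
    from odps.udf import annotate
except ImportError:
    # Hypothetical local stand-in for running outside MaxCompute.
    def annotate(signature):
        def decorator(cls):
            return cls
        return decorator


@annotate("*->string")
class JoinArgs(object):
    def evaluate(self, *args):
        # Skip NULL inputs and join the string form of the rest with commas.
        parts = [str(a) for a in args if a is not None]
        if not parts:
            # No usable inputs: return NULL.
            return None
        return ",".join(parts)
```

With a '*' signature the input types are not fixed in advance, so the body converts every value with str() instead of assuming a particular type.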
Data type mappings
Write Python UDF logic using the Python 2 types that correspond to MaxCompute SQL types. Type mismatches cause runtime errors.
| MaxCompute SQL type | Python 2 type | Notes |
|---|---|---|
| BIGINT | int | |
| STRING | str | |
| DOUBLE | float | |
| BOOLEAN | bool | |
| DATETIME | int | Stored as milliseconds since 00:00:00 Thursday, January 1, 1970 (Unix epoch). Use the datetime module to work with these values. |
| FLOAT | float | |
| CHAR | str | |
| VARCHAR | str | |
| BINARY | bytearray | |
| DATE | int | |
| DECIMAL | decimal.Decimal | |
| ARRAY | list | |
| MAP | dict | |
| STRUCT | collections.namedtuple | |
Additional notes:
- NULL in MaxCompute SQL maps to None in Python 2.
- odps.udf.int(value) accepts an additional silent parameter. If silent is set to True and the value cannot be converted to int, the function returns None instead of raising an error.
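Since DATETIME arrives as an integer count of milliseconds since the Unix epoch, a UDF usually converts it before doing calendar arithmetic. The sketch below shows one way to do that with the datetime module. It interprets the value as UTC, which is an assumption; the helper millis_to_datetime and the class ExtractYear are hypothetical names, and the annotate fallback is only for running the code outside MaxCompute.

```python
import datetime

try:
    from odps.udf import annotate
except ImportError:
    # Hypothetical local stand-in for running outside MaxCompute.
    def annotate(signature):
        def decorator(cls):
            return cls
        return decorator

_EPOCH = datetime.datetime(1970, 1, 1)


def millis_to_datetime(ms):
    # Add the millisecond offset to the epoch; the result is a naive
    # datetime that this sketch treats as UTC.
    return _EPOCH + datetime.timedelta(milliseconds=ms)


@annotate("datetime->bigint")
class ExtractYear(object):
    def evaluate(self, ts):
        # ts is an int (milliseconds since the epoch) or None for NULL.
        if ts is None:
            return None
        return millis_to_datetime(ts).year
```

Adding a timedelta to the epoch avoids floating-point division by 1000 and also handles values before 1970.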
Reference resources
Use the odps.distcache module to load file or table resources into UDF code at initialization time.
Reference a file
get_cache_file(resource_name) returns a file-like object with the content of the specified file resource.
- resource_name must be the name of an existing file resource in your MaxCompute project. If the name is invalid or the file does not exist, an error is returned.
- Declare the file resource when you create the UDF. If you do not declare it, calling the UDF returns an error.
- Call close() on the returned object when you are done with it.
from odps.udf import annotate
from odps.distcache import get_cache_file

@annotate('bigint->string')
class DistCacheExample(object):
    def __init__(self):
        cache_file = get_cache_file('test_distcache.txt')
        kv = {}
        for line in cache_file:
            line = line.strip()
            if not line:
                continue
            k, v = line.split()
            kv[int(k)] = v
        cache_file.close()
        self.kv = kv

    def evaluate(self, arg):
        return self.kv.get(arg)
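The parsing logic in the example above can be checked without a MaxCompute project if it is factored into a helper that accepts any file-like object. The following is a sketch under that assumption: parse_kv is a hypothetical helper name, and an in-memory io.StringIO stands in for the object returned by get_cache_file. Unlike the original example, it splits only on the first run of whitespace so that values may contain spaces.

```python
import io


def parse_kv(lines):
    """Build {int key: str value} from an iterable of 'key value' lines."""
    kv = {}
    for line in lines:
        line = line.strip()
        if not line:
            # Skip blank lines.
            continue
        # Split on the first run of whitespace only, so values may contain spaces.
        k, v = line.split(None, 1)
        kv[int(k)] = v
    return kv


# Local check: an in-memory file stands in for get_cache_file(...).
fake_file = io.StringIO(u"1 apple\n\n2 banana\n")
try:
    kv = parse_kv(fake_file)
finally:
    # Close the file-like object even if parsing raises.
    fake_file.close()

print(kv.get(1))  # apple
```

Inside a UDF, the same helper could be called from __init__ with the object returned by get_cache_file, with close() placed in a finally clause in the same way.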
Reference a table
get_cache_table(resource_name) returns a generator. Each iteration yields one record as a list.
- resource_name must be the name of an existing table resource in your MaxCompute project. If the name is invalid or the table does not exist, an error is returned.
from odps.udf import annotate
from odps.distcache import get_cache_table

@annotate('->string')
class DistCacheTableExample(object):
    def __init__(self):
        self.records = list(get_cache_table('udf_test'))
        self.counter = 0
        self.ln = len(self.records)

    def evaluate(self):
        if self.counter > self.ln - 1:
            return None
        ret = self.records[self.counter]
        self.counter += 1
        return str(ret)
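A common use of a table resource is a lookup dictionary built once at initialization. The sketch below assumes that the udf_test table has two columns, a key followed by a value; that layout, the helper build_lookup, the class TableLookup, and the optional records argument used for local testing are all assumptions for illustration. The annotate fallback is only for running the code outside MaxCompute.

```python
try:
    from odps.udf import annotate
except ImportError:
    # Hypothetical local stand-in for running outside MaxCompute.
    def annotate(signature):
        def decorator(cls):
            return cls
        return decorator


def build_lookup(records):
    """Map the first column of each record to its second column."""
    lookup = {}
    for record in records:
        key, value = record[0], record[1]
        if key is not None:
            # Ignore rows whose key is NULL.
            lookup[key] = value
    return lookup


@annotate("bigint->string")
class TableLookup(object):
    def __init__(self, records=None):
        if records is None:
            # In MaxCompute the class is created without arguments,
            # so the records come from the table resource.
            from odps.distcache import get_cache_table
            records = get_cache_table("udf_test")
        self.lookup = build_lookup(records)

    def evaluate(self, key):
        # Returns None (NULL) for keys that are not in the table.
        return self.lookup.get(key)
```

Passing records explicitly, for example TableLookup(records=[[1, 'a'], [2, 'b']]), exercises the lookup locally without touching odps.distcache.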
Development process
The development process for Python 2 UDFs — including setup, writing code, uploading the Python program, creating the UDF, debugging, and calling it — is the same as for Python 3 UDFs.
- For the full development process, see Development process.
- For a step-by-step guide using MaxCompute Studio, see Develop a Python UDF.
Supported development tools:
- MaxCompute Studio
- DataWorks
- MaxCompute client (odpscmd)
Call a Python 2 UDF
After developing a Python 2 UDF, call it from MaxCompute SQL using one of the following approaches:
- Within a project: Call the UDF the same way you call a built-in function.
- Across projects: Call a UDF defined in project B from project A using the following syntax:
    SELECT B:udf_in_other_project(arg0, arg1) AS res FROM table_t;
For setup instructions, see Cross-project resource access based on packages.