Develop and use Python 3 table-valued functions - MaxCompute

Enable Python 3

By default, MaxCompute projects use Python 2 for UDFs. To use Python 3, add the following session-level command before your SQL statement and submit them together:

set odps.sql.python.version=cp37;
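
When SQL text is assembled in a script, the flag can be prepended programmatically so that it is always submitted together with the statement. The following is a minimal sketch: `with_python3`, `my_udtf`, and `my_table` are hypothetical names, and how the resulting text is submitted (client, SDK, or scheduler) is left out.

```python
PYTHON3_FLAG = "set odps.sql.python.version=cp37;"


def with_python3(sql):
    """Return the SQL text with the Python 3 session flag prepended once."""
    statement = sql.strip()
    if statement.lower().startswith(PYTHON3_FLAG.lower()):
        return statement  # The flag is already present; do not add it twice.
    return PYTHON3_FLAG + "\n" + statement


# my_udtf and my_table are placeholder names for illustration only.
script = with_python3("SELECT my_udtf(col) AS (word) FROM my_table;")
print(script)
```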

UDTF code structure

Use MaxCompute Studio to write UDTF code in Python 3. A UDTF has four components:

Component	Required	Description
Module imports	Required	Must include `from odps.udf import annotate` and `from odps.udf import BaseUDTF`. To reference files or tables, also add `from odps.distcache import get_cache_file` or `from odps.distcache import get_cache_table`.
Function signature	Optional	Declared with `@annotate(<signature>)`. Defines the data types of input parameters and return values. Without a signature, any input data type is accepted and all return values default to STRING.
Custom Python class	Required	A derived class of `BaseUDTF`. Defines the variables and methods for your business logic.
Class methods	Required	Implement the required methods described in the table below.

Class methods

Method	Required	When called	Description
`BaseUDTF.init()`	Optional	Once, before the first record	Initialization method. When overriding, call `super(BaseUDTF, self).init()` at the start. Use this to set up internal state that persists across records.
`BaseUDTF.process([args, ...])`	Required	Once per SQL record	Processes each input row. The parameters of the `process` function are the input parameters of the UDTF specified in SQL statements.
`BaseUDTF.forward([args, ...])`	Required	Called by your code	Outputs one row per call. The parameters in the `forward` method are the UDTF output parameters specified in SQL statements. Without a function signature, convert all values to STRING before calling `forward`.
`BaseUDTF.close()`	Optional	Once, after the last record	Cleanup method. Use this to release resources when the UDTF terminates.
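
The call order in the table can be tried out locally before a UDTF is uploaded. The sketch below is an assumption-laden illustration, not the MaxCompute runtime: `LocalUDTF` is a hypothetical stand-in for `BaseUDTF` that collects forwarded rows in a list, and `run_locally` is a hypothetical driver that calls `init` once, `process` once per record, and `close` once at the end.

```python
class LocalUDTF(object):
    """Hypothetical local stand-in for BaseUDTF; not the MaxCompute runtime."""

    def init(self):
        pass

    def forward(self, *values):
        # Collect each output row instead of handing it to the SQL engine.
        self.rows.append(values)

    def close(self):
        pass


class SplitWords(LocalUDTF):
    def init(self):
        # State created here persists across all process() calls.
        self.seen = 0

    def process(self, line):
        # One input record can produce several output rows.
        for word in line.split(","):
            self.seen += 1
            self.forward(word, self.seen)

    def close(self):
        self.closed = True


def run_locally(udtf_class, records):
    """Drive a UDTF in order: init, process per record, then close."""
    udtf = udtf_class()
    udtf.rows = []
    udtf.init()
    for record in records:
        udtf.process(*record)
    udtf.close()
    return udtf


result = run_locally(SplitWords, [("a,b",), ("c",)])
print(result.rows)
```

On MaxCompute, the class would derive from `BaseUDTF` instead of the stand-in, and the platform itself drives the calls.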

The following example shows a minimal UDTF that splits a comma-separated string into individual rows:

# Import the function signature module and the base class.
from odps.udf import annotate
from odps.udf import BaseUDTF

# Function signature: takes a STRING, returns a STRING.
# The decorator applies to the custom Python class derived from BaseUDTF.
@annotate('string -> string')
class Explode(BaseUDTF):

    def process(self, arg):
        props = arg.split(',')
        for p in props:
            self.forward(p)

Python 2 UDTFs and Python 3 UDTFs run on different underlying Python versions. Write each UDTF according to the syntax and capabilities of the Python version it targets.

Limitations

Python 3 is not compatible with Python 2. A single SQL statement cannot mix Python 2 UDTFs and Python 3 UDTFs.

Migrate Python 2 UDTFs

Python 2 has reached end of life (EOL). Migrate your existing Python 2 UDTFs according to the state of your project:

New project or first Python UDTF: Write all Python UDTFs in Python 3 from the start.
Existing project with many Python 2 UDTFs: Migrate gradually to avoid disruption. Choose one of the following approaches:
- Write new UDTFs in Python 3 and enable Python 3 at the session level for jobs that use those new UDTFs. For details, see Enable Python 3.
- Rewrite existing Python 2 UDTFs to be compatible with both Python 2 and Python 3. See Porting Python 2 Code to Python 3 for guidance.

If a UDTF is shared across multiple MaxCompute projects, make it compatible with both Python 2 and Python 3 to avoid breaking projects that still use Python 2.
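
Code that must run on both versions usually avoids version-specific types and operators. The following is a minimal sketch of that style, assuming the helper logic lives in plain functions that a UDTF's `process` method would call; `to_text`, `split_tags`, and `ratio` are hypothetical names.

```python
def to_text(value, encoding="utf-8"):
    """Return text on both Python 2 and 3, decoding byte strings if needed."""
    # On Python 2, `bytes` is an alias of `str`, so this check works on both.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def split_tags(value):
    """Split a comma-separated value into non-empty, stripped tags."""
    return [tag.strip() for tag in to_text(value).split(",") if tag.strip()]


def ratio(part, total):
    # Converting to float avoids Python 2's integer division for int inputs.
    return part / float(total)


# print() with a single argument behaves the same on both versions.
print(split_tags(b"red, green,,blue"))
print(ratio(1, 4))
```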

Third-party libraries

NumPy is not included in the MaxCompute Python 3 runtime environment. To use NumPy in a UDTF, manually upload a NumPy wheel package as a resource. The expected filename of the package downloaded from the Python Package Index (PyPI) or a mirror is:

numpy-<Version>-cp37-cp37m-manylinux1_x86_64.whl

For instructions on uploading the package, see Resource operations or Reference third-party packages in Python UDFs.
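
Checking a downloaded filename against this pattern can catch a wheel built for a different interpreter before it is uploaded. The following is a minimal sketch: `numpy_wheel_version` is a hypothetical helper, and the version string in the usage line is only an example value, not a recommendation.

```python
import re

# Mirrors the expected name: numpy-<Version>-cp37-cp37m-manylinux1_x86_64.whl
WHEEL_PATTERN = re.compile(
    r"^numpy-(?P<version>[0-9][0-9A-Za-z.]*)-cp37-cp37m-manylinux1_x86_64\.whl$"
)


def numpy_wheel_version(filename):
    """Return the version embedded in a matching wheel name, else None."""
    match = WHEEL_PATTERN.match(filename)
    return match.group("version") if match else None


# The version below is an arbitrary example value.
print(numpy_wheel_version("numpy-1.19.2-cp37-cp37m-manylinux1_x86_64.whl"))
```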

Function signatures and data types

A function signature declares the data types of a UDTF's input parameters and return values. MaxCompute validates the signature during semantic analysis and returns an error if the actual types do not match.

Signature format

@annotate('arg_type_list -> type_list')

arg_type_list: comma-separated input parameter types. Set to * to accept any number of parameters, or leave blank to accept no parameters.
type_list: return value types. A UDTF can return multiple columns.

Supported types for `type_list`: BIGINT, STRING, DOUBLE, BOOLEAN, DATETIME, DECIMAL, FLOAT, BINARY, DATE, DECIMAL(precision,scale), and complex types (ARRAY, MAP, STRUCT), including nested complex types.

Supported types for `arg_type_list`: all types listed above, plus CHAR and VARCHAR.

Select data types based on the data type edition of your MaxCompute project.

Signature examples

Signature	Description
`@annotate('bigint,boolean->string,datetime')`	Two input parameters (BIGINT, BOOLEAN); two return values (STRING, DATETIME).
`@annotate('*->string,datetime')`	Any number of input parameters; two return values (STRING, DATETIME).
`@annotate('->double,bigint,string')`	No input parameters; three return values (DOUBLE, BIGINT, STRING).
`@annotate("array<string>,struct<a1:bigint,b1:string>,string->map<string,bigint>,struct<b1:bigint>")`	Complex type inputs and outputs.
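
To see how a signature string breaks down into input and return types, it can help to split it by hand. The sketch below is an illustration only and is not how MaxCompute parses signatures; `split_types` and `parse_signature` are hypothetical helpers. Commas inside `<...>` or `(...)` are treated as part of a complex or parameterized type.

```python
def split_types(type_list):
    """Split on top-level commas, keeping nested <...> and (...) intact."""
    parts, current, depth = [], [], 0
    for ch in type_list:
        if ch in "<(":
            depth += 1
        elif ch in ">)":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


def parse_signature(signature):
    """Return (input_types, return_types) for 'arg_type_list -> type_list'."""
    args, arrow, returns = signature.partition("->")
    if not arrow:
        raise ValueError("signature must contain '->'")
    return split_types(args), split_types(returns)


print(parse_signature("bigint,boolean->string,datetime"))
print(parse_signature("array<string>,struct<a1:bigint,b1:string>,string"
                      "->map<string,bigint>,struct<b1:bigint>"))
```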

Data type mappings

Write Python UDTFs using the Python types that correspond to MaxCompute SQL types:

MaxCompute SQL type	Python 3 type
BIGINT	int
STRING	str (Unicode text)
DOUBLE	float
BOOLEAN	bool
DATETIME	datetime.datetime
FLOAT	float
CHAR	str (Unicode text)
VARCHAR	str (Unicode text)
BINARY	bytes
DATE	datetime.date
DECIMAL	decimal.Decimal
ARRAY	list
MAP	dict
STRUCT	collections.namedtuple
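
The mapping can be written down as a lookup table for local sanity checks on test data. The following is a minimal sketch: `SQL_TO_PYTHON` and `matches_sql_type` are hypothetical names, and `isinstance` is only a rough check (for example, `bool` is a subclass of `int` in Python, so `True` also passes the BIGINT check).

```python
import collections
import datetime
import decimal

# Python 3 types that correspond to MaxCompute SQL types, per the table above.
SQL_TO_PYTHON = {
    "BIGINT": int,
    "STRING": str,
    "DOUBLE": float,
    "BOOLEAN": bool,
    "DATETIME": datetime.datetime,
    "FLOAT": float,
    "CHAR": str,
    "VARCHAR": str,
    "BINARY": bytes,
    "DATE": datetime.date,
    "DECIMAL": decimal.Decimal,
    "ARRAY": list,
    "MAP": dict,
    # namedtuple is a factory; the classes it creates are tuple subclasses.
    "STRUCT": tuple,
}


def matches_sql_type(sql_type, value):
    """Return True if value is an instance of the mapped Python type."""
    return isinstance(value, SQL_TO_PYTHON[sql_type.upper()])


Pair = collections.namedtuple("Pair", ["a1", "b1"])
print(matches_sql_type("struct", Pair(a1=1, b1="x")))
print(matches_sql_type("binary", "text"))
```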

Reference resources

Reference files and tables in a Python UDTF using the odps.distcache module.

`odps.distcache.get_cache_file(resource_name)`

Returns the content of a file resource.

resource_name: the name of an existing file resource in your MaxCompute project. Returns an error if the name is invalid or the file does not exist.
Returns a file-like object. Call close() on the object when done to release the file handle.
Declare the file resource when creating the UDTF. If you omit this declaration, calling the UDTF returns an error.

`odps.distcache.get_cache_table(resource_name)`

Returns the content of a table resource.

resource_name: the name of an existing table resource in your MaxCompute project. Returns an error if the name is invalid or the table does not exist.
Returns a generator. Iterating over it yields one record per row, where each record is an ARRAY.

The following example reads data from a JSON file and a table resource, then outputs rows based on a lookup:

from odps.udf import annotate
from odps.udf import BaseUDTF
from odps.distcache import get_cache_file
from odps.distcache import get_cache_table

@annotate('string -> string, bigint')
class UDTFExample(BaseUDTF):

    def __init__(self):
        import json
        # Load the JSON file resource into a dict.
        cache_file = get_cache_file('test_json.txt')
        self.my_dict = json.load(cache_file)
        cache_file.close()

        # Add records from the table resource to the dict. Each value is
        # wrapped in a list so that process() can iterate over it.
        records = list(get_cache_table('table_resource1'))
        for record in records:
            self.my_dict[record[0]] = [record[1]]

    def process(self, pageid):
        # For each input pageid, forward all associated adid values.
        for adid in self.my_dict[pageid]:
            self.forward(pageid, adid)
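
The lookup-building logic can be exercised locally without MaxCompute resources. The sketch below rests on stated assumptions: `fake_cache_file` and `fake_cache_table` are hypothetical stand-ins that imitate the documented return values of `get_cache_file` (a file-like object) and `get_cache_table` (a generator of records), and the sample data is invented for illustration.

```python
import io
import json


def fake_cache_file(resource_name):
    """Hypothetical stand-in for get_cache_file: returns a file-like object."""
    return io.StringIO(u'{"front_page": [1, 2]}')


def fake_cache_table(resource_name):
    """Hypothetical stand-in for get_cache_table: yields one record per row."""
    for record in [("news", 3), ("news", 4)]:
        yield record


def build_lookup(open_file, open_table):
    """Merge a JSON file resource and a table resource into one dict."""
    cache_file = open_file("test_json.txt")
    try:
        lookup = json.load(cache_file)
    finally:
        cache_file.close()  # Release the file handle even if parsing fails.
    for record in open_table("table_resource1"):
        # Keep every adid seen for a pageid rather than only the last one.
        lookup.setdefault(record[0], []).append(record[1])
    return lookup


lookup = build_lookup(fake_cache_file, fake_cache_table)
print(lookup)
```

Accumulating values with `setdefault` is a choice made for this sketch so that several table rows with the same pageid are all retained.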

Call a Python 3 UDTF

After developing a Python 3 UDTF following the development process, call it from MaxCompute SQL.

Within a project: Call the UDTF the same way as a built-in function.
Across projects: Reference a UDTF from another project using the project name as a prefix:
```
SELECT B:udf_in_other_project(arg0, arg1) AS res FROM table_t;
```
For cross-project resource sharing setup, see Package-based resource sharing across projects.

To develop and test a Python 3 UDTF in MaxCompute Studio, see Develop a Python UDF.