MaxCompute supports Python 3 using CPython 3.7.3. Python 2 has reached End of Life (EOL), so write all new user-defined table-valued functions (UDTFs) in Python 3.
Enable Python 3
By default, MaxCompute projects use Python 2 for UDFs. To use Python 3, add the following session-level command before your SQL statement and submit them together:
set odps.sql.python.version=cp37;
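Because the flag is session-level, it has to be submitted in the same script as the query. The following is a minimal sketch of assembling such a script in Python. The helper name `with_python3_flag` and the names `my_udtf` and `my_table` are hypothetical placeholders, not part of MaxCompute; the helper only builds a string and does not submit anything.

```python
# Session-level flag from this section; it must precede the SQL statement.
PYTHON3_FLAG = "set odps.sql.python.version=cp37;"


def with_python3_flag(sql):
    """Return a script that places the Python 3 flag before the statement.

    Hypothetical helper for illustration; it only builds a string.
    """
    statement = sql.strip()
    if not statement.endswith(";"):
        statement += ";"
    return PYTHON3_FLAG + "\n" + statement


# my_udtf and my_table are placeholder names.
script = with_python3_flag("SELECT my_udtf(col) AS (part) FROM my_table")
print(script)
```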
UDTF code structure
Use MaxCompute Studio to write UDTF code in Python 3. A UDTF has four components:
| Component | Required | Description |
|---|---|---|
| Module imports | Required | Must include `from odps.udf import annotate` and `from odps.udf import BaseUDTF`. To reference files or tables, also add `from odps.distcache import get_cache_file` or `from odps.distcache import get_cache_table`. |
| Function signature | Optional | Declared with `@annotate(<signature>)`. Defines the data types of input parameters and return values. Without a signature, any input data type is accepted and all return values default to STRING. |
| Custom Python class | Required | A derived class of BaseUDTF. Defines the variables and methods for your business logic. |
| Class methods | Required | Implement the required methods described in the table below. |
Class methods
| Method | Required | When called | Description |
|---|---|---|---|
| `BaseUDTF.__init__()` | Optional | Once, before the first record | Initialization method. When overriding, call `super(BaseUDTF, self).__init__()` at the start. Use this to set up internal state that persists across records. |
| `BaseUDTF.process([args, ...])` | Required | Once per SQL record | Processes each input row. The parameters of `process` are the UDTF input parameters specified in the SQL statement. |
| `BaseUDTF.forward([args, ...])` | Required | Called by your code | Outputs one row per call. The parameters of `forward` are the UDTF output columns specified in the SQL statement. Without a function signature, convert all values to STRING before calling `forward`. |
| `BaseUDTF.close()` | Optional | Once, after the last record | Cleanup method. Use this to release resources when the UDTF terminates. |
The following example shows a minimal UDTF that splits a comma-separated string into individual rows:
# Import the function signature module and the base class.
from odps.udf import annotate
from odps.udf import BaseUDTF

# Function signature: takes a STRING, returns a STRING.
@annotate('string -> string')
# Custom Python class derived from BaseUDTF.
class Explode(BaseUDTF):
    def process(self, arg):
        props = arg.split(',')
        for p in props:
            self.forward(p)
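The call order in the methods table can be tried outside MaxCompute. The sketch below assumes local stand-ins for `annotate` and `BaseUDTF` (they are not the real `odps.udf` implementations) and a hypothetical driver, `run_locally`, that mimics the sequence: initialize once, call `process` per record, then call `close`. The class `SplitWithIndex` is likewise invented for illustration.

```python
# Local stand-ins for odps.udf (assumption: for off-cluster testing only).
# Inside MaxCompute, use: from odps.udf import annotate, BaseUDTF
def annotate(signature):
    """Record the signature on the class; performs no type checking."""
    def decorator(cls):
        cls.signature = signature
        return cls
    return decorator


class BaseUDTF(object):
    """Stand-in base class that collects forwarded rows in a list."""

    def __init__(self):
        pass

    def forward(self, *args):
        # Create the row buffer lazily so it does not depend on __init__.
        self.__dict__.setdefault("rows", []).append(args)

    def close(self):
        pass


@annotate("string -> string, bigint")
class SplitWithIndex(BaseUDTF):
    def __init__(self):
        # The methods table asks for this call at the start of __init__.
        super(BaseUDTF, self).__init__()
        self.seen = 0        # state that persists across records
        self.closed = False

    def process(self, arg):
        # Called once per input record; emits one row per comma-separated part.
        for part in arg.split(","):
            self.seen += 1
            self.forward(part, self.seen)

    def close(self):
        # Called once at the end; release resources here.
        self.closed = True


def run_locally(udtf_cls, records):
    """Hypothetical driver: init once, process each record, then close."""
    udtf = udtf_cls()
    for record in records:
        udtf.process(*record)
    udtf.close()
    return udtf


udtf = run_locally(SplitWithIndex, [("a,b",), ("c",)])
print(udtf.rows)
```

The running index keeps counting across the two input records, which shows that state set in `__init__` persists for the lifetime of the instance.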
Python 2 UDTFs and Python 3 UDTFs run on different underlying Python versions. Write each UDTF according to the syntax and capabilities of the Python version it targets.
Limitations
Python 3 is not compatible with Python 2. A single SQL statement cannot mix Python 2 UDTFs and Python 3 UDTFs.
Migrate Python 2 UDTFs
Python 2 has reached EOL. Migrate your existing Python 2 UDTFs based on your project situation:
- New project or first Python UDTF: Write all Python UDTFs in Python 3 from the start.
- Existing project with many Python 2 UDTFs: Migrate gradually to avoid disruption. Choose one of the following approaches:
  - Write new UDTFs in Python 3 and enable Python 3 at the session level for jobs that use those new UDTFs. For details, see Enable Python 3.
  - Rewrite existing Python 2 UDTFs to be compatible with both Python 2 and Python 3. See Porting Python 2 Code to Python 3 for guidance.
- If a UDTF is shared across multiple MaxCompute projects, make it compatible with both Python 2 and Python 3 to avoid breaking projects that still use Python 2.
Third-party libraries
NumPy is not included in the MaxCompute Python 3 runtime environment. To use NumPy in a UDTF, manually upload a NumPy wheel package as a resource. The expected filename, whether downloaded from the Python Package Index (PyPI) or a mirror, is:
numpy-<Version>-cp37-cp37m-manylinux1_x86_64.whl
For instructions on uploading the package, see Resource operations or Reference third-party packages in Python UDFs.
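Before uploading, it can help to confirm that a downloaded wheel carries the tags shown in the filename above. The following is a minimal sketch; the function name `numpy_wheel_version` is hypothetical, and the version string in the usage lines is only a sample value chosen for illustration.

```python
import re

# Pattern for the filename shape given above:
# numpy-<Version>-cp37-cp37m-manylinux1_x86_64.whl
EXPECTED = re.compile(
    r"^numpy-(?P<version>[^-]+)-cp37-cp37m-manylinux1_x86_64\.whl$"
)


def numpy_wheel_version(filename):
    """Return the version if the filename has the expected tags, else None."""
    match = EXPECTED.match(filename)
    return match.group("version") if match else None


# Sample filenames for illustration only.
ok = numpy_wheel_version("numpy-1.19.2-cp37-cp37m-manylinux1_x86_64.whl")
wrong_tag = numpy_wheel_version("numpy-1.19.2-cp38-cp38-manylinux1_x86_64.whl")
print(ok, wrong_tag)
```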
Function signatures and data types
A function signature declares the data types of a UDTF's input parameters and return values. MaxCompute validates the signature during semantics parsing and returns an error if the actual types do not match.
Signature format
@annotate('arg_type_list -> type_list')
- arg_type_list: comma-separated input parameter types. Set to `*` to accept any number of parameters, or leave blank to accept no parameters.
- type_list: return value types. A UDTF can return multiple columns.
Supported types for `type_list`: BIGINT, STRING, DOUBLE, BOOLEAN, DATETIME, DECIMAL, FLOAT, BINARY, DATE, DECIMAL(precision,scale), and complex types (ARRAY, MAP, STRUCT), including nested complex types.
Supported types for `arg_type_list`: all types listed above, plus CHAR and VARCHAR.
Select data types based on the data type edition of your MaxCompute project.
Signature examples
| Signature | Description |
|---|---|
| `@annotate('bigint,boolean->string,datetime')` | Two input parameters (BIGINT, BOOLEAN); two return values (STRING, DATETIME). |
| `@annotate('*->string,datetime')` | Any number of input parameters; two return values (STRING, DATETIME). |
| `@annotate('->double,bigint,string')` | No input parameters; three return values (DOUBLE, BIGINT, STRING). |
| `@annotate("array<string>,struct<a1:bigint,b1:string>,string->map<string,bigint>,struct<b1:bigint>")` | Complex type inputs and outputs. |
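To illustrate how a signature lines up with the class methods, the sketch below declares two inputs and two outputs, so `process` takes two parameters and each `forward` call passes two values. The `annotate` and `BaseUDTF` definitions are local stand-ins assumed for off-cluster testing, and the `Repeat` class is a hypothetical example.

```python
# Local stand-ins for odps.udf (assumption: for off-cluster testing only).
# Inside MaxCompute, use: from odps.udf import annotate, BaseUDTF
def annotate(signature):
    def decorator(cls):
        cls.signature = signature
        return cls
    return decorator


class BaseUDTF(object):
    def __init__(self):
        self.rows = []

    def forward(self, *args):
        # Collect each output row so it can be inspected locally.
        self.rows.append(args)


@annotate("string,bigint->string,bigint")
class Repeat(BaseUDTF):
    def process(self, text, times):
        # Two declared input types -> two process() parameters.
        for i in range(times):
            # Two declared return types -> two forward() arguments.
            self.forward(text, i)


udtf = Repeat()
udtf.process("ab", 3)
print(udtf.rows)
```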
Data type mappings
Write Python UDTFs using the Python types that correspond to MaxCompute SQL types:
| MaxCompute SQL type | Python 3 type |
|---|---|
| BIGINT | int |
| STRING | str (Unicode string) |
| DOUBLE | float |
| BOOLEAN | bool |
| DATETIME | datetime.datetime |
| FLOAT | float |
| CHAR | str (Unicode string) |
| VARCHAR | str (Unicode string) |
| BINARY | bytes |
| DATE | datetime.date |
| DECIMAL | decimal.Decimal |
| ARRAY | list |
| MAP | dict |
| STRUCT | collections.namedtuple |
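As a quick reference, the snippet below builds one illustrative Python 3 value per SQL type. It assumes the native Python 3 spellings of the types in the table (for example, `str` for Unicode text). The `Pair` namedtuple is a hypothetical stand-in for a `STRUCT<a1:BIGINT, b1:STRING>` value, and all sample values are invented.

```python
import collections
import datetime
import decimal

# Hypothetical STRUCT<a1:BIGINT, b1:STRING> represented as a namedtuple.
Pair = collections.namedtuple("Pair", ["a1", "b1"])

# One illustrative Python 3 value per MaxCompute SQL type.
samples = {
    "BIGINT": 42,
    "STRING": "text",
    "DOUBLE": 3.14,
    "BOOLEAN": True,
    "DATETIME": datetime.datetime(2024, 1, 2, 3, 4, 5),
    "BINARY": b"\x00\x01",
    "DATE": datetime.date(2024, 1, 2),
    "DECIMAL": decimal.Decimal("12.34"),
    "ARRAY": ["a", "b"],
    "MAP": {"k": 1},
    "STRUCT": Pair(a1=1, b1="x"),
}

for sql_type, value in samples.items():
    print(sql_type, type(value).__name__)
```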
Reference resources
Reference files and tables in a Python UDTF using the odps.distcache module.
`odps.distcache.get_cache_file(resource_name)`
Returns the content of a file resource.
- resource_name: the name of an existing file resource in your MaxCompute project. Returns an error if the name is invalid or the file does not exist.
- Returns a file-like object. Call `close()` on the object when done to release the file handle.
- Declare the file resource when creating the UDTF. If you omit this declaration, calling the UDTF returns an error.
`odps.distcache.get_cache_table(resource_name)`
Returns the content of a table resource.
- resource_name: the name of an existing table resource in your MaxCompute project. Returns an error if the name is invalid or the table does not exist.
- Returns a generator. Iterating over it yields one record per row, where each record is an ARRAY.
The following example reads data from a JSON file and a table resource, then outputs rows based on a lookup:
from odps.udf import annotate
from odps.udf import BaseUDTF
from odps.distcache import get_cache_file
from odps.distcache import get_cache_table

@annotate('string -> string, bigint')
class UDTFExample(BaseUDTF):
    def __init__(self):
        import json
        # Load the JSON file resource into a dict.
        cache_file = get_cache_file('test_json.txt')
        self.my_dict = json.load(cache_file)
        cache_file.close()
        # Append records from the table resource into the dict.
        records = list(get_cache_table('table_resource1'))
        for record in records:
            self.my_dict[record[0]] = record[1]

    def process(self, pageid):
        # For each input pageid, forward all associated adid values.
        for adid in self.my_dict[pageid]:
            self.forward(pageid, adid)
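The resource-loading step can be exercised outside MaxCompute with in-memory stand-ins. The sketch below assumes hypothetical replacements for `get_cache_file` and `get_cache_table` (the real functions live in `odps.distcache`), invented sample data, and a hypothetical helper named `load_lookup`. It also wraps the read in `try`/`finally` so the file handle is closed even if parsing fails.

```python
import io
import json

# Invented sample resources; real ones are uploaded to the MaxCompute project.
FILES = {"test_json.txt": '{"page1": [1, 2]}'}
TABLES = {"table_resource1": [("page2", [3])]}


def get_cache_file(resource_name):
    """Stand-in for odps.distcache.get_cache_file: returns a file-like object."""
    return io.StringIO(FILES[resource_name])


def get_cache_table(resource_name):
    """Stand-in for odps.distcache.get_cache_table: yields one record per row."""
    for record in TABLES[resource_name]:
        yield record


def load_lookup(file_name, table_name):
    """Build the pageid -> adid list lookup from a file and a table resource."""
    cache_file = get_cache_file(file_name)
    try:
        lookup = json.load(cache_file)
    finally:
        # Release the file handle even if json.load raises.
        cache_file.close()
    for record in get_cache_table(table_name):
        lookup[record[0]] = record[1]
    return lookup


lookup = load_lookup("test_json.txt", "table_resource1")
print(lookup)
```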
Call a Python 3 UDTF
After developing a Python 3 UDTF following the development process, call it from MaxCompute SQL.
- Within a project: Call the UDTF the same way as a built-in function.
- Across projects: Reference a UDTF from another project using the project name as a prefix:
  SELECT B:udf_in_other_project(arg0, arg1) AS res FROM table_t;
  For cross-project resource sharing setup, see Package-based resource sharing across projects.
To develop and test a Python 3 UDTF in MaxCompute Studio, see Develop a Python UDF.