The Python Software Foundation has announced the End of Life (EOL) for Python 2. For this reason, MaxCompute supports Python 3 and uses CPython 3.7.3. This topic describes how to write a user-defined function (UDF) in Python 3.

UDF code structure

You can use MaxCompute Studio to write UDF code in Python 3. The UDF code must contain the following elements:
  • Module import: required.

    UDF code must include from odps.udf import annotate, which is used to import the function signature module. This way, MaxCompute can identify the function signature that is defined in the code. If you want to reference files or tables in UDF code, the UDF code must include from odps.distcache import get_cache_file or from odps.distcache import get_cache_table.

  • Function signature: required.

    The function signature is in the @annotate(<signature>) format. The signature parameter is used to define the data types of the input parameters and return value of the UDF. For more information about function signatures, see Function signatures and data types.

  • Custom Python class: required.

    A custom Python class is the organizational unit of UDF code. This class defines the variables and methods that are used to meet your business requirements. In UDF code, you can also reference third-party libraries that are installed in MaxCompute or reference files or tables. For more information, see Third-party libraries or Reference resources.

  • evaluate method: required.

    The evaluate method is contained in the custom Python class. The evaluate method defines the input parameters and return value of the UDF. Each Python class can contain only one evaluate method.

Sample code:
# Import the function signature module.
from odps.udf import annotate
# The function signature.
@annotate("bigint,bigint->bigint")
# The custom Python class.
class MyPlus(object):
    # The evaluate method.
    def evaluate(self, arg0, arg1):
        if None in (arg0, arg1):
            return None
        return arg0 + arg1
Note: Python 2 UDFs and Python 3 UDFs differ in the underlying Python version. Write each UDF by using only the features of the Python version on which it runs.
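
The sample code can be checked on a local machine before it is uploaded. The following is a minimal sketch, assuming that the odps.udf module is not installed locally. The fallback annotate decorator is a hypothetical stand-in that does nothing. It is not part of MaxCompute and exists only so that the class can be imported and called.

```python
try:
    # Available inside the MaxCompute runtime.
    from odps.udf import annotate
except ImportError:
    # Assumption for local testing only: a no-op stand-in decorator.
    def annotate(signature):
        def decorator(cls):
            return cls
        return decorator


@annotate("bigint,bigint->bigint")
class MyPlus(object):
    def evaluate(self, arg0, arg1):
        # A NULL input produces a NULL result.
        if None in (arg0, arg1):
            return None
        return arg0 + arg1


udf = MyPlus()
result_sum = udf.evaluate(1, 2)
result_null = udf.evaluate(1, None)
```

Calling evaluate directly exercises only the Python logic. It does not check that the signature matches the column types in a real table.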

Limits

Python 3 is incompatible with Python 2. Therefore, you cannot use Python 2 code and Python 3 code in the same SQL statement.

Port Python 2 UDFs

The Python Software Foundation has announced the EOL for Python 2. Therefore, we recommend that you port Python 2 UDFs to Python 3. The porting method depends on the type of MaxCompute project.
  • In a new project or an existing project for which you write UDFs in Python for the first time, we recommend that you use Python 3 to write all Python UDFs.
  • In an existing project where a large number of Python 2 UDFs exist, proceed with caution when you enable Python 3. If you want to replace Python 2 UDFs with Python 3 UDFs, use the following methods:
    • Use Python 3 to write new UDFs and enable Python 3 for new jobs at the session level. For more information about how to enable Python 3, see Enable Python 3.
    • Rewrite Python 2 UDFs so that they are compatible with both Python 2 and Python 3. For more information about how to rewrite UDFs, see Porting Python 2 Code to Python 3.
      Note: If you want to write a public UDF that is shared among multiple projects, we recommend that you make the UDF compatible with both Python 2 and Python 3.
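
A UDF that runs on both Python versions usually has to handle the different string types. The following is a minimal sketch, assuming that a STRING argument may arrive as a byte string on Python 2 and as a Unicode string on Python 3, and that byte strings are UTF-8 encoded. The class name CharLength and the fallback annotate decorator are hypothetical and are used only for illustration.

```python
import sys

try:
    from odps.udf import annotate
except ImportError:
    # Assumption for local testing only: a no-op stand-in decorator.
    def annotate(signature):
        def decorator(cls):
            return cls
        return decorator

PY2 = sys.version_info[0] == 2
# `unicode` exists only in Python 2. On Python 3 the conditional
# expression never evaluates that name, so no NameError is raised.
text_type = unicode if PY2 else str  # noqa: F821


@annotate("string->bigint")
class CharLength(object):
    """Hypothetical UDF: number of characters in a string."""

    def evaluate(self, value):
        if value is None:
            return None
        if not isinstance(value, text_type):
            # Assumption: byte strings are UTF-8 encoded.
            value = value.decode("utf-8")
        return len(value)


length_text = CharLength().evaluate(u"h\u00e9llo")
length_bytes = CharLength().evaluate(u"h\u00e9llo".encode("utf-8"))
```

Normalizing the input to one text type at the start of evaluate keeps the rest of the method identical on both versions.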

Enable Python 3

By default, UDFs in a MaxCompute project run on Python 2. To run UDFs on Python 3, add the following command before the SQL statement that you want to execute. Then, commit and execute the statement.
set odps.sql.python.version=cp37;
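
If SQL text is generated by a script, the flag can be prepended programmatically. The following is a minimal sketch of that idea. The helper with_python3 and the function name my_plus are hypothetical and are not part of MaxCompute. The sketch only builds the text to submit and does not submit it.

```python
# The session-level flag described above.
PYTHON3_FLAG = "set odps.sql.python.version=cp37;"


def with_python3(sql):
    """Return the SQL text with the Python 3 flag placed before it."""
    return PYTHON3_FLAG + "\n" + sql.strip()


# Hypothetical statement that calls a Python 3 UDF named my_plus.
script = with_python3("select my_plus(a, b) as res from table_t;")
```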

Third-party libraries

NumPy is not installed in the Python 3 runtime environment in MaxCompute. To use a NumPy UDF, you must manually upload a NumPy wheel package. If you obtain this package from the Python Package Index (PyPI) or a PyPI mirror, the package is named numpy-<Version>-cp37-cp37m-manylinux1_x86_64.whl. For more information about how to upload a file, see Resource operations or Reference third-party packages in Python UDFs.

For more information about standard libraries that are supported by Python 3, see The Python Standard Library.

Function signatures and data types

Format of function signatures:
@annotate(<signature>)
The signature parameter is a string that specifies the data types of input parameters and return value. When you run a UDF, the data types of the input parameters and return value of the UDF must be consistent with the data types specified in the function signature. The data type consistency is checked during semantic parsing. If the data types are inconsistent, an error is returned. Format of a signature:
'arg_type_list -> type'
Parameter description:
  • arg_type_list: specifies the data types of input parameters. If multiple input parameters are used, their data types are separated by commas (,). The following data types are supported: BIGINT, STRING, DOUBLE, BOOLEAN, DATETIME, DECIMAL, FLOAT, BINARY, DATE, DECIMAL(precision,scale), CHAR, and VARCHAR. Complex data types, such as ARRAY, MAP, and STRUCT, and nested complex data types are also supported.
    arg_type_list can be represented by an asterisk (*) or left empty ('').
    • If arg_type_list is represented by an asterisk (*), any number of input parameters are allowed.
    • If arg_type_list is left empty (''), no input parameters are used.
  • type: specifies the data type of the return value. A UDF returns only one column of values. The following data types are supported: BIGINT, STRING, DOUBLE, BOOLEAN, DATETIME, DECIMAL, FLOAT, BINARY, DATE, and DECIMAL(precision,scale). Complex data types, such as ARRAY, MAP, and STRUCT, and nested complex data types are also supported.
Note: When you write UDF code, select data types that are supported by the data type edition of your MaxCompute project. For more information about MaxCompute data type editions and the data types supported in each edition, see Data type editions.
The following table provides examples of valid function signatures.
Function signature                         Description
'bigint,double->string'                    The data types of the input parameters are BIGINT and DOUBLE, and the data type of the return value is STRING.
'*->string'                                Any number of input parameters are accepted, and the data type of the return value is STRING.
'->double'                                 No input parameters are used, and the data type of the return value is DOUBLE.
'array<bigint>->struct<x:string, y:int>'   The data type of the input parameter is ARRAY<BIGINT>, and the data type of the return value is STRUCT<x:STRING, y:INT>.
'->map<bigint, string>'                    No input parameters are used, and the data type of the return value is MAP<BIGINT, STRING>.
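
The '*' and empty forms of arg_type_list can be illustrated with two small classes. The following is a minimal sketch, assuming that MaxCompute passes each SQL argument to evaluate as a positional argument, so that a '*' signature pairs with *args. The class names ConcatAll and Pi and the fallback annotate decorator are hypothetical.

```python
import math

try:
    from odps.udf import annotate
except ImportError:
    # Assumption for local testing only: a no-op stand-in decorator.
    def annotate(signature):
        def decorator(cls):
            return cls
        return decorator


@annotate("*->string")
class ConcatAll(object):
    """Joins any number of non-NULL arguments with commas."""

    def evaluate(self, *args):
        return ",".join(str(a) for a in args if a is not None)


@annotate("->double")
class Pi(object):
    """Takes no input parameters and returns a DOUBLE."""

    def evaluate(self):
        return math.pi


joined = ConcatAll().evaluate(1, "a", None, 2.5)
pi_value = Pi().evaluate()
```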

The following table describes the mappings between the data types that are supported in MaxCompute SQL and the Python 3 data types. You must write Python UDFs based on the mappings to ensure the consistency of data types.

MaxCompute SQL data type   Python 3 data type
BIGINT                     int
STRING                     str (Unicode)
DOUBLE                     float
BOOLEAN                    bool
DATETIME                   datetime.datetime
FLOAT                      float
CHAR                       str (Unicode)
VARCHAR                    str (Unicode)
BINARY                     bytes
DATE                       datetime.date
DECIMAL                    decimal.Decimal
ARRAY                      list
MAP                        dict
STRUCT                     collections.namedtuple
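
The mappings determine which Python objects the evaluate method receives and returns. The following is a minimal sketch, assuming that an ARRAY<BIGINT> argument arrives as a list of int values (with None for NULL elements) and that a DATETIME argument arrives as a datetime.datetime object. The class names ArrayMean and FormatDay and the fallback annotate decorator are hypothetical.

```python
import datetime

try:
    from odps.udf import annotate
except ImportError:
    # Assumption for local testing only: a no-op stand-in decorator.
    def annotate(signature):
        def decorator(cls):
            return cls
        return decorator


@annotate("array<bigint>->double")
class ArrayMean(object):
    """Mean of the non-NULL elements of an ARRAY<BIGINT>."""

    def evaluate(self, values):
        if values is None:
            return None
        present = [v for v in values if v is not None]
        if not present:
            return None
        # True division returns a float, which maps to DOUBLE.
        return sum(present) / len(present)


@annotate("datetime->string")
class FormatDay(object):
    """Formats a DATETIME value as a yyyy-mm-dd string."""

    def evaluate(self, ts):
        if ts is None:
            return None
        return ts.strftime("%Y-%m-%d")


mean_value = ArrayMean().evaluate([1, 2, None, 3])
day_text = FormatDay().evaluate(datetime.datetime(2024, 1, 5, 10, 30))
```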

Reference resources

You can reference files or tables in Python 3 UDF code by using the odps.distcache module.

  • odps.distcache.get_cache_file(resource_name, mode): returns the content of a specified file based on the value of mode that you specified.
    • resource_name is a string that specifies the name of an existing file in your MaxCompute project. If the file name is invalid or the file does not exist, an error is returned.
    • The value of mode is of the STRING type. Default value: 't'. If the value of mode is 't', the file is opened in text mode. If the value of mode is 'b', the file is opened in binary mode.
    • The return value is a file-like object. When you no longer need this object, you must call its close method to release the open file.
    The following code shows how to reference a file.
    from odps.udf import annotate
    from odps.distcache import get_cache_file
    @annotate('bigint->string')
    class DistCacheExample(object):
        def __init__(self):
            # Load the key-value pairs from the file resource once per instance.
            cache_file = get_cache_file('test_distcache.txt')
            kv = {}
            for line in cache_file:
                line = line.strip()
                if not line:
                    continue
                k, v = line.split()
                kv[int(k)] = v
            # Release the open file after it is read.
            cache_file.close()
            self.kv = kv
        def evaluate(self, arg):
            return self.kv.get(arg)
  • odps.distcache.get_cache_table(resource_name): returns the content of a specified table.
    • resource_name specifies the name of the table in your MaxCompute project. If the table name is invalid or the table does not exist, an error is returned. Data of the following types in the table can be read: BIGINT, STRING, DOUBLE, BOOLEAN, DATETIME, FLOAT, CHAR, VARCHAR, BINARY, DATE, DECIMAL, ARRAY, MAP, and STRUCT.
    • The return value is a generator. The caller iterates over the generator to obtain the table content. Each iteration returns one record of the ARRAY type.
The following code shows how to reference a table.
from odps.udf import annotate
from odps.distcache import get_cache_table
@annotate('->string')
class DistCacheTableExample(object):
    def __init__(self):
        self.records = list(get_cache_table('udf_test'))
        self.counter = 0
        self.ln = len(self.records)
    def evaluate(self):
        if self.counter > self.ln - 1:
            return None
        ret = self.records[self.counter]
        self.counter += 1
        return str(ret)
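
Because the object returned by get_cache_file must be closed, the parsing step in the file example can be wrapped so that the file is released even if a line fails to parse. The following is a minimal sketch, assuming that the file contains one "key value" pair per line, as in the file example above. The helper parse_kv is hypothetical, and io.StringIO stands in for the file-like object that get_cache_file returns.

```python
import io
from contextlib import closing


def parse_kv(cache_file):
    """Parse 'key value' lines from a file-like object, then close it."""
    kv = {}
    # closing() calls cache_file.close() on exit, even after an exception.
    with closing(cache_file) as lines:
        for line in lines:
            line = line.strip()
            if not line:
                continue
            key, value = line.split(None, 1)
            kv[int(key)] = value
    return kv


# Inside a UDF this would be called as, for example:
#     self.kv = parse_kv(get_cache_file('test_distcache.txt'))
sample = io.StringIO(u"1 apple\n\n2 banana\n")
kv = parse_kv(sample)
```

Keeping the parsing logic in a plain function also makes it possible to test the logic without access to a MaxCompute project.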

Instructions

After you develop a Python 3 UDF, you can use MaxCompute SQL to call the UDF. For more information about how to call a Python 3 UDF, see Development process. You can call a Python 3 UDF by using one of the following methods:
  • Use a UDF in a MaxCompute project: Call the UDF in the same way as you call a built-in function.
  • Use a UDF across projects: Use a UDF of Project B in Project A. The following statement shows an example: select B:udf_in_other_project(arg0, arg1) as res from table_t;. For more information about resource sharing across projects, see Cross-project resource access based on packages.

For more information about how to use MaxCompute Studio to develop and call a Python 3 UDF, see Develop a Python UDF.