Python 3 UDAF - MaxCompute - Alibaba Cloud Documentation Center

UDAF code structure

You can use MaxCompute Studio to write UDAF code in Python 3. The UDAF code must contain the following information:

Module import: required.
UDAF code must include at least from odps.udf import annotate and from odps.udf import BaseUDAF. The first statement imports the function signature module so that MaxCompute can identify the function signature that is defined in the code. The second statement imports BaseUDAF, which is the base class for Python UDAFs. You must derive your class from BaseUDAF and implement methods such as iterate, merge, and terminate in the derived class.

If you want to reference file or table resources in UDAF code, the code must also include from odps.distcache import get_cache_file or from odps.distcache import get_cache_table.
Function signature: required.
The function signature is in the @annotate(<signature>) format. The signature parameter is used to define the data types of the input parameters and return value of the UDAF. For more information about function signatures, see Function signatures and data types.
Custom Python class (derived class): required.
A custom Python class is the organizational unit of UDAF code. This class defines the variables and methods that are used to meet your business requirements. In UDAF code, you can also reference third-party libraries that are installed in MaxCompute or reference files or tables. For more information, see Third-party libraries or Reference resources.

Methods to implement Python classes: required.

The following table describes the four methods that a Python UDAF class can implement. Implement the methods that your business requirements call for.


Method	Description
`BaseUDAF.new_buffer()`	Returns the intermediate value buffer of a UDAF. `buffer` must be a marshallable object, such as a list or dict, and the `buffer` size must not grow with the amount of data. Even in extreme cases, the `buffer` size must not exceed 2 MB after marshaling.
`BaseUDAF.iterate(buffer[, args, ...])`	Aggregates `args` into the intermediate value `buffer`.
`BaseUDAF.merge(buffer, pbuffer)`	Stores the merged results of `pbuffer` and the intermediate value `buffer` in the `buffer`.
`BaseUDAF.terminate(buffer)`	Converts `buffer` into a value of a basic data type in MaxCompute SQL.
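The buffer constraints in the table can be checked locally. The following is a minimal sketch, assuming that "marshallable" means what the `marshal` module of the Python standard library accepts and that 2 MB means 2 * 1024 * 1024 bytes. The helper name `marshaled_size` is hypothetical and is not part of the MaxCompute API.

```python
import marshal

# Assumption: the 2 MB limit is interpreted as 2 * 1024 * 1024 bytes.
MAX_BUFFER_BYTES = 2 * 1024 * 1024


def marshaled_size(buffer):
    """Return the number of bytes that marshal produces for buffer."""
    return len(marshal.dumps(buffer))


# A fixed-size [sum, count] buffer stays small no matter how many rows it sees.
fixed_buffer = [0.0, 0]
for value in range(100000):
    fixed_buffer[0] += value
    fixed_buffer[1] += 1

# A buffer that stores every input value grows with the amount of data.
growing_buffer = [float(value) for value in range(100000)]

print(marshaled_size(fixed_buffer) < MAX_BUFFER_BYTES)
print(marshaled_size(growing_buffer) > marshaled_size(fixed_buffer))
```

The second buffer illustrates the pattern to avoid: its marshaled size is proportional to the row count, so it would eventually cross the limit on a large enough input.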

Sample code:

# Import the function signature module and the base class. 
from odps.udf import annotate
from odps.udf import BaseUDAF
# The function signature. 
@annotate('double->double')
# The custom Python class. 
class Average(BaseUDAF):
    # Methods used to implement the custom Python class. 
    def new_buffer(self):
        return [0, 0]
    def iterate(self, buffer, number):
        if number is not None:
            buffer[0] += number
            buffer[1] += 1
    def merge(self, buffer, pbuffer):
        buffer[0] += pbuffer[0]
        buffer[1] += pbuffer[1]
    def terminate(self, buffer):
        if buffer[1] == 0:
            return 0.0
        return buffer[0] / buffer[1]
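
To see how the four methods cooperate, the sample can be exercised outside MaxCompute. This is a local sketch only. The `run_locally` helper and the way rows are split into partitions are assumptions made for illustration; the real engine decides how data is distributed and when merge is called. The fallback definitions of annotate and BaseUDAF exist only so that the script runs where the odps package is not installed.

```python
try:
    from odps.udf import annotate, BaseUDAF
except ImportError:
    # Stand-ins so the sketch runs without the odps package (assumption).
    class BaseUDAF(object):
        pass

    def annotate(signature):
        def decorator(cls):
            return cls
        return decorator


@annotate('double->double')
class Average(BaseUDAF):
    def new_buffer(self):
        return [0, 0]

    def iterate(self, buffer, number):
        if number is not None:
            buffer[0] += number
            buffer[1] += 1

    def merge(self, buffer, pbuffer):
        buffer[0] += pbuffer[0]
        buffer[1] += pbuffer[1]

    def terminate(self, buffer):
        if buffer[1] == 0:
            return 0.0
        return buffer[0] / buffer[1]


def run_locally(udaf, partitions):
    """Hypothetical driver: one buffer per partition, then merge, then terminate."""
    partial_buffers = []
    for rows in partitions:
        buffer = udaf.new_buffer()
        for row in rows:
            udaf.iterate(buffer, row)
        partial_buffers.append(buffer)
    final_buffer = udaf.new_buffer()
    for pbuffer in partial_buffers:
        udaf.merge(final_buffer, pbuffer)
    return udaf.terminate(final_buffer)


result = run_locally(Average(), [[1.0, 2.0, None], [3.0, 4.0]])
print(result)  # 2.5
```

The None value stands for an SQL NULL and is skipped by iterate, so the average is taken over four values.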

Note Python 2 UDAFs and Python 3 UDAFs differ in terms of the underlying Python version. You can write a UDAF based on the capability of the Python version that you use.

Limits

Python 3 is incompatible with Python 2. Therefore, you cannot use Python 2 code and Python 3 code in the same SQL statement.

Port Python 2 UDAFs

The Python Software Foundation has announced the end of life (EOL) of Python 2. Therefore, we recommend that you port Python 2 UDAFs to Python 3. The porting method varies based on the type of MaxCompute project.

New project: If your project is a new MaxCompute project or if it is the first time that you use Python to write a UDAF for a MaxCompute project, we recommend that you use Python 3 to write all Python UDAFs.
Existing project: If your project is an existing project for which a large number of Python 2 UDAFs are created, proceed with caution when you enable Python 3. If you plan to gradually replace Python 2 UDAFs with Python 3 UDAFs, use one of the following methods:
- Use Python 3 to write new UDAFs and enable Python 3 for new jobs at the session level. For more information about how to enable Python 3, see Enable Python 3.
- Rewrite Python 2 UDAFs to make them compatible with both Python 2 and Python 3. For more information about how to rewrite Python 2 UDAFs, see Porting Python 2 Code to Python 3.
  
  Note If you want to write a public UDAF that can be shared among multiple projects, we recommend that the UDAF be compatible with both Python 2 and Python 3.
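
As one illustration of code that behaves the same under both interpreters, the following sketch averages BIGINT values and converts the sum with float() before dividing, because `/` between two integers floors the result in Python 2 but not in Python 3. The class name `IntAverage` and the signature `bigint->double` are hypothetical, and the fallback definitions exist only so that the script runs without the odps package.

```python
try:
    from odps.udf import annotate, BaseUDAF
except ImportError:
    # Stand-ins so the sketch runs without the odps package (assumption).
    class BaseUDAF(object):
        pass

    def annotate(signature):
        def decorator(cls):
            return cls
        return decorator


@annotate('bigint->double')
class IntAverage(BaseUDAF):
    def new_buffer(self):
        return [0, 0]

    def iterate(self, buffer, number):
        if number is not None:
            buffer[0] += number
            buffer[1] += 1

    def merge(self, buffer, pbuffer):
        buffer[0] += pbuffer[0]
        buffer[1] += pbuffer[1]

    def terminate(self, buffer):
        if buffer[1] == 0:
            return None
        # float() forces true division; a bare int / int floors in Python 2.
        return float(buffer[0]) / buffer[1]


udaf = IntAverage()
buffer = udaf.new_buffer()
for value in (3, 4):
    udaf.iterate(buffer, value)
print(udaf.terminate(buffer))  # 3.5
```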

Enable Python 3

By default, Python 2 is used to write UDFs in a MaxCompute project. If you want to write UDFs in Python 3, add the following command before the SQL statement that you want to execute. Then, commit and execute the statement.

set odps.sql.python.version=cp37;

Third-party libraries

The third-party library NumPy is not installed in the Python 3 environment of MaxCompute. To use NumPy in a UDAF, you must manually upload a NumPy wheel package. If you download a NumPy wheel package from Python Package Index (PyPI) or obtain this package from an image, the package is named in the following format: numpy-<Version>-cp37-cp37m-manylinux1_x86_64.whl. For more information about how to upload a package, see Resource operations or Reference third-party packages in Python UDFs.

Function signatures and data types

Format of function signatures:

@annotate(<signature>)

The signature parameter is a string that specifies the data types of the input parameters and return value. When you run a UDAF, the data types of the input parameters and the return value of the UDAF must be consistent with the data types specified in the function signature. Data type consistency is checked during semantic parsing. If the data types are inconsistent, an error is returned. The signature string has the following format:

'arg_type_list -> type'

arg_type_list: indicates the data types of input parameters. If multiple input parameters are used, specify multiple data types and separate them with commas (,). The following data types are supported: BIGINT, STRING, DOUBLE, BOOLEAN, DATETIME, DECIMAL, FLOAT, BINARY, DATE, DECIMAL(precision,scale), CHAR, VARCHAR, complex data types (ARRAY, MAP, and STRUCT), and nested complex data types.
arg_type_list can also be set to an asterisk (*) or left empty.
- If arg_type_list is set to an asterisk (*), any number of input parameters can be used.
- If arg_type_list is left empty, no input parameters are used.
type: specifies the data type of return values. For a UDAF, only one column of values is returned. The following data types are supported: BIGINT, STRING, DOUBLE, BOOLEAN, DATETIME, DECIMAL, FLOAT, BINARY, DATE, and DECIMAL(precision, scale). Complex data types, such as ARRAY, MAP, and STRUCT, and nested complex data types are also supported.

Note When you write UDAF code, you can select a data type based on the data type edition used by your MaxCompute project. For more information about data type editions and data types supported by each edition, see Data type editions.

The following table provides examples of valid function signatures.


Function signature	Description
`@annotate('bigint,double->string')`	The data types of input parameters are BIGINT and DOUBLE and the data type of the return values is STRING.
`@annotate('*->string')`	Any number of input parameters can be used and the data type of the return values is STRING.
`@annotate('->double')`	No input parameters are used and the data type of the return values is DOUBLE.
`@annotate('array<bigint>->struct<x:string, y:int>')`	The data type of input parameters is ARRAY<BIGINT> and the data type of the return value is STRUCT<x:STRING, y:INT>.
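
The signature format can be made concrete with a small parser. The function below is a hypothetical helper written for illustration and is not how MaxCompute parses signatures. It splits a signature string at `->` and then splits the argument list on commas that are not nested inside angle brackets or parentheses, so that types such as DECIMAL(precision,scale) and STRUCT stay in one piece.

```python
def split_signature(signature):
    """Split 'arg_type_list->type' into (list of argument types, return type)."""
    arg_part, _, return_type = signature.partition('->')
    args = []
    current = ''
    depth = 0
    for char in arg_part:
        if char in '<(':
            depth += 1
        elif char in '>)':
            depth -= 1
        if char == ',' and depth == 0:
            # A top-level comma ends the current argument type.
            args.append(current.strip())
            current = ''
        else:
            current += char
    if current.strip():
        args.append(current.strip())
    return args, return_type.strip()


for text in ('bigint,double->string', '*->string', '->double',
             'array<bigint>->struct<x:string, y:int>'):
    print(split_signature(text))
```

An empty argument list yields an empty Python list, and an asterisk is returned as the single entry `'*'`.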

The following table describes the mappings between the data types that are supported in MaxCompute SQL and the Python 3 data types. You must write Python UDAFs based on the mappings to ensure data type consistency.


MaxCompute SQL data type	Python 3 data type
BIGINT	`int`
STRING	`str` (Unicode string)
DOUBLE	`float`
BOOLEAN	`bool`
DATETIME	`datetime.datetime`
FLOAT	`float`
CHAR	`str` (Unicode string)
VARCHAR	`str` (Unicode string)
BINARY	`bytes`
DATE	`datetime.date`
DECIMAL	`decimal.Decimal`
ARRAY	`list`
MAP	`dict`
STRUCT	`collections.namedtuple`
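
The mappings affect how a buffer can be built. The following sketch returns the latest DATETIME value. It assumes that the buffer is serialized with the `marshal` module of the Python standard library, which does not accept `datetime.datetime` objects, so the buffer stores an ISO 8601 string instead. The class name `Latest` is hypothetical, and the fallback definitions exist only so that the script runs without the odps package.

```python
import datetime
import marshal

try:
    from odps.udf import annotate, BaseUDAF
except ImportError:
    # Stand-ins so the sketch runs without the odps package (assumption).
    class BaseUDAF(object):
        pass

    def annotate(signature):
        def decorator(cls):
            return cls
        return decorator


@annotate('datetime->datetime')
class Latest(BaseUDAF):
    def new_buffer(self):
        # One slot that holds an ISO 8601 string, or None if no row was seen.
        return [None]

    def iterate(self, buffer, moment):
        # moment arrives as datetime.datetime, or None for an SQL NULL.
        if moment is None:
            return
        if buffer[0] is None or moment > datetime.datetime.fromisoformat(buffer[0]):
            buffer[0] = moment.isoformat()

    def merge(self, buffer, pbuffer):
        if pbuffer[0] is not None:
            self.iterate(buffer, datetime.datetime.fromisoformat(pbuffer[0]))

    def terminate(self, buffer):
        if buffer[0] is None:
            return None
        return datetime.datetime.fromisoformat(buffer[0])


udaf = Latest()
buffer = udaf.new_buffer()
udaf.iterate(buffer, datetime.datetime(2023, 1, 1, 8, 0))
udaf.iterate(buffer, datetime.datetime(2024, 6, 30, 12, 30))
udaf.iterate(buffer, None)
marshal.dumps(buffer)  # succeeds because the buffer holds only a str
print(udaf.terminate(buffer))  # 2024-06-30 12:30:00
```

`datetime.datetime.fromisoformat` is available in Python 3.7 and later, which matches the cp37 setting described in Enable Python 3.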

Reference resources

You can reference files and tables in Python 3 UDAF code by using the odps.distcache module.

odps.distcache.get_cache_file(resource_name): returns the content of a specific file.
- resource_name is a string that specifies the name of an existing file in your MaxCompute project. If the file name is invalid or the file does not exist, an error is returned.
  
  Note To reference a file in the UDAF code, you must declare the file when you create the UDAF. Otherwise, an error is returned when you call the UDAF.
- The return value is a file-like object. If this object is no longer used, you must call the close method to release the file.
odps.distcache.get_cache_table(resource_name): returns the content of a specific table.
- resource_name is a string that specifies the name of an existing table in your MaxCompute project. If the table name is invalid or the table does not exist, an error is returned.
- The return value is of the GENERATOR type. The caller traverses the table to obtain the table content. A record of the ARRAY type is obtained each time the caller traverses the table.

For more information, see Reference resources (Python 3 UDFs) and Reference resources (Python 3 UDTFs).
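
The following sketch shows one way the two functions might be used in a UDAF. So that it runs outside MaxCompute, it defines local stand-ins named get_cache_file and get_cache_table that return fixed sample data; in real UDAF code these would be imported from odps.distcache. The resource names `allowed_fruits.txt` and `fruit_weights`, the class name `WeightedCount`, and the sample data are all hypothetical. The stand-in file yields text lines; whether the real file-like object yields text or bytes is not stated here and should be verified.

```python
import io


# Stand-ins so the sketch runs outside MaxCompute (assumption). In a real UDAF,
# use: from odps.distcache import get_cache_file, get_cache_table
def get_cache_file(resource_name):
    return io.StringIO(u"apple\nbanana\n")


def get_cache_table(resource_name):
    for record in [("apple", 1.5), ("banana", 0.25)]:
        yield record


class WeightedCount(object):  # derive from BaseUDAF in MaxCompute
    def __init__(self):
        # Read the file resource once, then close it to release the file.
        cache_file = get_cache_file('allowed_fruits.txt')
        try:
            self.allowed = set(line.strip() for line in cache_file if line.strip())
        finally:
            cache_file.close()
        # Traverse the generator; each record is a sequence of column values.
        self.weights = dict(
            (record[0], record[1]) for record in get_cache_table('fruit_weights'))

    def new_buffer(self):
        return [0.0]

    def iterate(self, buffer, name):
        if name in self.allowed:
            buffer[0] += self.weights.get(name, 0.0)

    def merge(self, buffer, pbuffer):
        buffer[0] += pbuffer[0]

    def terminate(self, buffer):
        return buffer[0]


udaf = WeightedCount()
buffer = udaf.new_buffer()
for name in ('apple', 'banana', 'cherry', 'apple'):
    udaf.iterate(buffer, name)
print(udaf.terminate(buffer))  # 3.25
```

Loading the resources in `__init__` keeps the lookup data out of the buffer, so the buffer size stays constant.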

Usage notes

After you develop a Python 3 UDAF by following the instructions in Development process, you can use MaxCompute SQL to call this UDAF in one of the following ways:

Use a UDF in a MaxCompute project: The method is similar to that of using built-in functions.
Use a UDF across projects: Use a UDF of Project B in Project A. The following statement shows an example: select B:udf_in_other_project(arg0, arg1) as res from table_t;. For more information about resource sharing across projects, see Cross-project resource access based on packages.

For more information about how to use MaxCompute Studio to develop and call a Python 3 UDAF, see Develop a Python UDF.

Dynamic parameters of UDAFs

Function signature

For more information about the format of the function signature of Python UDAFs, see Function signatures and data types.

You can use an asterisk (*) in a parameter list to indicate that any number of input parameters of any type can follow. For example, @annotate('double,*->string') indicates a parameter list in which the first parameter is of the DOUBLE data type and is followed by any number of parameters of any type. In this case, you must write code that determines the number and types of the input parameters and handles them accordingly, in the same way as the printf function in the C programming language.
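
The following sketch shows one way to inspect a variable-length parameter list for the signature 'double,*->string'. The class name `ScaledSum`, the helper `to_number`, and the rule that non-numeric values count as zero are assumptions made for illustration. The fallback definitions exist only so that the script runs without the odps package.

```python
try:
    from odps.udf import annotate, BaseUDAF
except ImportError:
    # Stand-ins so the sketch runs without the odps package (assumption).
    class BaseUDAF(object):
        pass

    def annotate(signature):
        def decorator(cls):
            return cls
        return decorator


def to_number(value):
    """Convert one trailing argument to float, dispatching on its Python type."""
    if value is None or isinstance(value, bool):
        return 0.0
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        try:
            return float(value)
        except ValueError:
            return 0.0
    return 0.0


@annotate('double,*->string')
class ScaledSum(BaseUDAF):
    def new_buffer(self):
        return [0.0]

    def iterate(self, buffer, factor, *rest):
        # rest holds the trailing arguments; len(rest) can differ per call site.
        if factor is None:
            return
        for value in rest:
            buffer[0] += factor * to_number(value)

    def merge(self, buffer, pbuffer):
        buffer[0] += pbuffer[0]

    def terminate(self, buffer):
        return str(buffer[0])


udaf = ScaledSum()
buffer = udaf.new_buffer()
udaf.iterate(buffer, 2.0, 1, "2.5", None)  # adds 2.0 * (1 + 2.5 + 0) = 7.0
udaf.iterate(buffer, 0.5, 4)               # adds 0.5 * 4 = 2.0
print(udaf.terminate(buffer))  # 9.0
```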

Note An asterisk (*) in the return value part of a signature has a different meaning.
An asterisk (*) can be used in the return values of a UDTF to indicate that any number of values of the STRING data type can be returned. The number of return values is determined by the number of aliases that are specified when the function is called. For example, a UDTF that is annotated with @annotate("bigint,string->double,*") can be called as UDTF(x, y) as (a, b, c). In this example, the aliases a, b, and c are specified after as. MaxCompute identifies a as a value of the DOUBLE type because the data type of the first return column is given in the annotation, and identifies b and c as values of the STRING type. Three return values are specified in this call. Therefore, the forward method called by the UDTF must forward an array of three elements. Otherwise, an error is returned.

Note This error is not reported during compilation. Therefore, the UDTF caller must specify the number of aliases in SQL based on the rule that is defined in the UDTF. The number of return values of an aggregate function is fixed at one. Therefore, this rule has no effect on UDAFs.

UDAF example

from odps.udf import annotate
from odps.udf import BaseUDAF
@annotate('bigint,*->string')
class MultiColSum(BaseUDAF):
    def new_buffer(self):
        return [0]
    def iterate(self, buffer, *args):
        for arg in args:
            buffer[0] += int(arg)
    def merge(self, buffer, pbuffer):
        buffer[0] += pbuffer[0]
    def terminate(self, buffer):
        return str(buffer[0])

A UDAF always returns exactly one value. In the preceding example, the return value is the sum of the values of all input parameters across all rows. Sample statements:

-- Calculate the sum of values of multiple input parameters.
SELECT my_multi_col_sum(a,b,c,d,e) from values (1,"2","3","4","5"), (6,"7","8","9","10") t(a,b,c,d,e);
-- The return value is 55.