The Python Software Foundation has announced the end of life (EOL) of Python 2. For this reason, MaxCompute supports Python 3 and uses CPython 3.7.3. This topic describes how to write a user-defined function (UDF) in Python 3.
UDF code structure
- Module import: required.
UDF code must include the statement from odps.udf import annotate, which imports the function signature module. This way, MaxCompute can identify the function signature that is defined in the code. If you want to reference files or tables in UDF code, the code must also include from odps.distcache import get_cache_file or from odps.distcache import get_cache_table.
- Function signature: required.
The function signature is in the @annotate(<signature>) format. The signature parameter defines the data types of the input parameters and return value of the UDF. For more information about function signatures, see Function signatures and data types.
- Custom Python class: required.
A custom Python class is the organizational unit of UDF code. This class defines the variables and methods that are used to meet your business requirements. In UDF code, you can also reference third-party libraries that are installed in MaxCompute or reference files or tables. For more information, see Third-party libraries or Reference resources.
- evaluate method: required.
The evaluate method is contained in the custom Python class. It defines the input parameters and return value of the UDF. Each Python class can contain only one evaluate method.
# Import the function signature module.
from odps.udf import annotate

# The function signature.
@annotate("bigint,bigint->bigint")
# The custom Python class.
class MyPlus(object):
    # The evaluate method.
    def evaluate(self, arg0, arg1):
        # Return NULL if either input is NULL.
        if None in (arg0, arg1):
            return None
        return arg0 + arg1
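Because a UDF is an ordinary Python class, its evaluate method can be exercised locally before the code is uploaded. The following is a minimal sketch, not an official testing workflow. The no-op annotate stand-in is an assumption that is used only when the odps package is not installed on the local machine.

```python
try:
    # Available inside the MaxCompute runtime.
    from odps.udf import annotate
except ImportError:
    # Local stand-in (assumption): a no-op decorator so that the class
    # can be imported and called outside MaxCompute.
    def annotate(signature):
        def wrap(cls):
            return cls
        return wrap


@annotate("bigint,bigint->bigint")
class MyPlus(object):
    def evaluate(self, arg0, arg1):
        # A NULL input is passed as None; propagate it as NULL.
        if None in (arg0, arg1):
            return None
        return arg0 + arg1


if __name__ == "__main__":
    udf = MyPlus()
    print(udf.evaluate(1, 2))
    print(udf.evaluate(1, None))
```

A check of this kind covers only the Python logic. The data type consistency that is described in Function signatures and data types is still verified by MaxCompute when the UDF is called.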
Limits
Python 3 is incompatible with Python 2. Therefore, you cannot use Python 2 code and Python 3 code in the same SQL statement.
Port Python 2 UDFs
- In a new project or an existing project for which you write UDFs in Python for the first time, we recommend that you use Python 3 to write all Python UDFs.
- In an existing project that contains a large number of Python 2 UDFs, proceed with caution when you enable Python 3. If you want to replace Python 2 UDFs with Python 3 UDFs, use the following methods:
  - Use Python 3 to write new UDFs and enable Python 3 for new jobs at the session level. For more information about how to enable Python 3, see Enable Python 3.
  - Rewrite Python 2 UDFs so that they are compatible with both Python 2 and Python 3. For more information about how to rewrite UDFs, see Porting Python 2 Code to Python 3.
Note: If you want to write a public UDF that is shared among multiple projects, we recommend that you make the UDF compatible with both Python 2 and Python 3.
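A UDF that runs under both interpreters usually has to normalize text input, because the two versions represent strings differently. The following is a minimal sketch under the assumption that the Python 2 runtime passes STRING values as byte strings while the Python 3 runtime passes Unicode strings. The class name CharCount and the PY2 flag are hypothetical, and the no-op annotate stand-in is used only for local runs without the odps package.

```python
import sys

try:
    from odps.udf import annotate
except ImportError:
    # Local stand-in (assumption) for running outside MaxCompute.
    def annotate(signature):
        def wrap(cls):
            return cls
        return wrap

# True only when the code runs under a Python 2 interpreter.
PY2 = sys.version_info[0] == 2


@annotate("string->bigint")
class CharCount(object):
    """Counts characters (not bytes) in a STRING value."""

    def evaluate(self, text):
        if text is None:
            return None
        # Assumption: Python 2 delivers bytes, so decode before counting.
        if PY2 and isinstance(text, bytes):
            text = text.decode("utf-8")
        return len(text)
```

The same pattern applies to other version-sensitive constructs, such as integer division, where the // operator behaves identically in both versions.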
Enable Python 3
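To enable Python 3 at the session level, add the following command before the SQL statement that you want to run:
set odps.sql.python.version=cp37;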
Third-party libraries
NumPy is not installed in the Python 3 runtime environment of MaxCompute. To use a UDF that depends on NumPy, you must manually upload a NumPy wheel package. If you obtain this package from the Python Package Index (PyPI) or a mirror, the package is named numpy-<Version>-cp37-cp37m-manylinux1_x86_64.whl. For more information about how to upload a file, see Resource operations or Reference third-party packages in Python UDFs.
For more information about standard libraries that are supported by Python 3, see The Python Standard Library.
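Modules from the standard library can be imported in UDF code without uploading any package. The following is a minimal sketch that uses hashlib. The class name Sha256Hex is hypothetical, and the no-op annotate stand-in is an assumption for local runs without the odps package.

```python
import hashlib

try:
    from odps.udf import annotate
except ImportError:
    # Local stand-in (assumption) for running outside MaxCompute.
    def annotate(signature):
        def wrap(cls):
            return cls
        return wrap


@annotate("string->string")
class Sha256Hex(object):
    """Returns the SHA-256 digest of a STRING value as hexadecimal text."""

    def evaluate(self, text):
        if text is None:
            return None
        # hashlib works on bytes, so encode the text first.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()
```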
Function signatures and data types
The function signature is in the following format:
@annotate(<signature>)
The signature parameter is a string that specifies the data types of the input parameters and return value. When you run a UDF, the data types of the input parameters and return value of the UDF must be consistent with the data types specified in the function signature. The data type consistency is checked during semantic parsing. If the data types are inconsistent, an error is returned. Format of a signature: 'arg_type_list -> type'
Parameter description:
- arg_type_list: specifies the data types of the input parameters. If multiple input parameters are used, their data types are separated by commas (,). The following data types are supported: BIGINT, STRING, DOUBLE, BOOLEAN, DATETIME, DECIMAL, FLOAT, BINARY, DATE, DECIMAL(precision,scale), CHAR, and VARCHAR. Complex data types, such as ARRAY, MAP, and STRUCT, and nested complex data types are also supported. arg_type_list can be an asterisk (*) or left empty ('').
  - If arg_type_list is an asterisk (*), any number of input parameters is allowed.
  - If arg_type_list is left empty (''), no input parameters are used.
- type: specifies the data type of the return value. For a UDF, only one column of values is returned. The following data types are supported: BIGINT, STRING, DOUBLE, BOOLEAN, DATETIME, DECIMAL, FLOAT, BINARY, DATE, and DECIMAL(precision,scale). Complex data types, such as ARRAY, MAP, and STRUCT, and nested complex data types are also supported.
The following table provides examples of valid function signatures.
Function signature | Description |
---|---|
'bigint,double->string' | The data types of the input parameters are BIGINT and DOUBLE, and the data type of the return value is STRING. |
'*->string' | Any number of input parameters is allowed, and the data type of the return value is STRING. |
'->double' | No input parameters are used, and the data type of the return value is DOUBLE. |
'array<bigint>->struct<x:string, y:int>' | The data type of the input parameter is ARRAY<BIGINT>, and the data type of the return value is STRUCT<x:STRING, y:INT>. |
'->map<bigint, string>' | No input parameters are used, and the data type of the return value is MAP<BIGINT, STRING>. |
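The shape of the evaluate method follows the signature. The following is a minimal sketch of the '*->string' and '->double' cases. It assumes that a '*' signature is received through Python variable-length arguments (*args). The class names JoinArgs and Pi are hypothetical, and the no-op annotate stand-in is used only for local runs without the odps package.

```python
import math

try:
    from odps.udf import annotate
except ImportError:
    # Local stand-in (assumption) for running outside MaxCompute.
    def annotate(signature):
        def wrap(cls):
            return cls
        return wrap


@annotate("*->string")
class JoinArgs(object):
    """Accepts any number of inputs and joins them with commas."""

    def evaluate(self, *args):
        # NULL inputs arrive as None and are rendered as empty fields.
        return ",".join("" if a is None else str(a) for a in args)


@annotate("->double")
class Pi(object):
    """Takes no input parameters and returns a DOUBLE."""

    def evaluate(self):
        return math.pi
```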
The following table describes the mappings between the data types that are supported in MaxCompute SQL and the Python 3 data types. You must write Python UDFs based on the mappings to ensure the consistency of data types.
MaxCompute SQL data type | Python 3 data type |
---|---|
BIGINT | int |
STRING | str (Unicode) |
DOUBLE | float |
BOOLEAN | bool |
DATETIME | datetime.datetime |
FLOAT | float |
CHAR | str (Unicode) |
VARCHAR | str (Unicode) |
BINARY | bytes |
DATE | datetime.date |
DECIMAL | decimal.Decimal |
ARRAY | list |
MAP | dict |
STRUCT | collections.namedtuple |
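The mappings determine which Python objects the evaluate method receives. The following is a minimal sketch that relies on two rows of the table: an ARRAY value handled as a list and a DATETIME value handled as a datetime.datetime object. The class names ArraySum and IsoWeekday are hypothetical, and the no-op annotate stand-in is an assumption for local runs without the odps package.

```python
import datetime

try:
    from odps.udf import annotate
except ImportError:
    # Local stand-in (assumption) for running outside MaxCompute.
    def annotate(signature):
        def wrap(cls):
            return cls
        return wrap


@annotate("array<bigint>->bigint")
class ArraySum(object):
    """Sums the non-NULL elements of an ARRAY<BIGINT> value."""

    def evaluate(self, values):
        if values is None:
            return None
        # NULL elements inside the array arrive as None and are skipped.
        return sum(v for v in values if v is not None)


@annotate("datetime->bigint")
class IsoWeekday(object):
    """Returns the ISO weekday (Monday is 1, Sunday is 7)."""

    def evaluate(self, value):
        if value is None:
            return None
        return value.isoweekday()
```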
Reference resources
You can reference files or tables in Python 3 UDF code by using the odps.distcache module.
- odps.distcache.get_cache_file(resource_name, mode): returns the content of a specified file, opened in the mode that you specify.
  - resource_name is a string that specifies the name of an existing file resource in your MaxCompute project. If the resource name is invalid or the resource does not exist, an error is returned.
  - mode is a string. Default value: 't'. If the value of mode is 't', the file is opened in text mode. If the value of mode is 'b', the file is opened in binary mode.
  - The return value is a file-like object. When you no longer need this object, you must call the close method to release the open file.
The following code shows how to reference a file.
from odps.udf import annotate
from odps.distcache import get_cache_file

@annotate('bigint->string')
class DistCacheExample(object):
    def __init__(self):
        # Load the file resource once, when the UDF is initialized.
        cache_file = get_cache_file('test_distcache.txt')
        kv = {}
        for line in cache_file:
            line = line.strip()
            if not line:
                continue
            k, v = line.split()
            kv[int(k)] = v
        # Release the open file.
        cache_file.close()
        self.kv = kv

    def evaluate(self, arg):
        return self.kv.get(arg)
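The requirement to call close can be met even when parsing fails by wrapping the loop in try/finally. The following is a minimal sketch, not part of the odps API. The helper name load_kv is hypothetical, and it operates on any file-like object so that it can be tested locally with io.StringIO. Unlike the example above, it skips lines that do not contain exactly two fields instead of raising an error.

```python
import io


def load_kv(file_obj):
    """Parse 'key value' lines into a dict and always close the file."""
    kv = {}
    try:
        for line in file_obj:
            parts = line.split()
            # Skip blank or malformed lines (a choice made in this sketch).
            if len(parts) != 2:
                continue
            kv[int(parts[0])] = parts[1]
    finally:
        # Release the file even if int() raises for a bad key.
        file_obj.close()
    return kv


if __name__ == "__main__":
    sample = io.StringIO(u"1 a\n\n2 b\n")
    print(load_kv(sample))
```

In the __init__ method of a UDF, the helper could be called as self.kv = load_kv(get_cache_file('test_distcache.txt')).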
- odps.distcache.get_cache_table(resource_name): returns the content of a specified table.
  - resource_name specifies the name of an existing table in your MaxCompute project. If the table name is invalid or the table does not exist, an error is returned. Data of the following types in the table can be read: BIGINT, STRING, DOUBLE, BOOLEAN, DATETIME, FLOAT, CHAR, VARCHAR, BINARY, DATE, DECIMAL, ARRAY, MAP, and STRUCT.
  - The return value is a generator. The caller iterates over the generator to obtain the table content. Each iteration yields one record of the ARRAY type.
The following code shows how to reference a table.
from odps.udf import annotate
from odps.distcache import get_cache_table

@annotate('->string')
class DistCacheTableExample(object):
    def __init__(self):
        # Read all records of the table resource into memory.
        self.records = list(get_cache_table('udf_test'))
        self.counter = 0
        self.ln = len(self.records)

    def evaluate(self):
        # Return NULL after all records have been returned.
        if self.counter > self.ln - 1:
            return None
        ret = self.records[self.counter]
        self.counter += 1
        return str(ret)
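A common use of a table resource is a lookup dictionary that is built once in __init__ and queried in evaluate. The following is a minimal sketch under the assumption that each record that the generator yields is an indexable sequence of column values. The helper name build_lookup and its key_index and value_index parameters are hypothetical.

```python
def build_lookup(records, key_index=0, value_index=1):
    """Build a dict from an iterable of records.

    Each record is assumed to be an indexable sequence of column values.
    If a key occurs more than once, the last record wins.
    """
    lookup = {}
    for record in records:
        lookup[record[key_index]] = record[value_index]
    return lookup


if __name__ == "__main__":
    # A plain generator stands in for get_cache_table('udf_test').
    rows = (row for row in [(1, "a"), (2, "b")])
    print(build_lookup(rows))
```

In a UDF, the helper could be called as self.lookup = build_lookup(get_cache_table('udf_test')), and the evaluate method could then return self.lookup.get(arg).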
Instructions
- Use a UDF in a MaxCompute project: The method is similar to that of using built-in functions.
- Use a UDF across projects: For example, use a UDF of Project B in Project A. The following statement shows an example:
  select B:udf_in_other_project(arg0, arg1) as res from table_t;
  For more information about resource sharing across projects, see Cross-project resource access based on packages.
For more information about how to use MaxCompute Studio to develop and call a Python 3 UDF, see Develop a Python UDF.