
MaxCompute: Reference a third-party package in a PyODPS node

Last Updated: Jan 31, 2024

This topic describes how to reference a third-party package in a PyODPS node. For more information about how to generate a third-party package for PyODPS, see Generate a third-party package for PyODPS.

Prerequisites

A third-party package is generated. For more information, see Generate a third-party package for PyODPS.

Upload a third-party package

Before you reference a third-party package, you must make sure that the package has been uploaded to MaxCompute as an archive resource. You can upload a third-party package by using one of the following methods:

  • Use code to upload a third-party package. In the following sample code, replace packages.tar.gz with the path and name of the package that you want to upload.

    import os
    from odps import ODPS
    
    # Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_ID to your AccessKey ID. 
    # Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_SECRET to your AccessKey secret. 
    # We recommend that you do not directly use your AccessKey ID or AccessKey secret.
    o = ODPS(
        os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
        os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
        project='<your-default-project>',
        endpoint='<your-end-point>',
    )
    o.create_resource("test_packed.tar.gz", "archive", fileobj=open("packages.tar.gz", "rb"))
  • Use DataWorks to upload a third-party package. For more information, see Create a resource or upload an existing resource.
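
If you upload the package again after a change, create_resource typically reports an error when a resource with the same name already exists. The following minimal sketch, which reuses the entry object o and the resource name from the preceding sample, checks whether the resource exists and deletes it before the upload:

import os
from odps import ODPS

o = ODPS(
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='<your-default-project>',
    endpoint='<your-end-point>',
)

resource_name = "test_packed.tar.gz"
# create_resource fails if a resource with the same name already exists,
# so delete the old resource before the new package is uploaded.
if o.exist_resource(resource_name):
    o.delete_resource(resource_name)
o.create_resource(resource_name, "archive", fileobj=open("packages.tar.gz", "rb"))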

Reference a third-party package in a Python UDF

Before you reference a third-party package in a Python user-defined function (UDF), you must modify the Python UDF. Procedure:

  1. Add a reference to the third-party package in the __init__ method of the UDF class.

  2. Reference the third-party package in the UDF code, for example, in the evaluate method or the process method.

Example

In this example, a Python UDF references the psi function in the third-party package SciPy.

  1. Run the following command to package SciPy.

    pyodps-pack -o scipy-bundle.tar.gz scipy
  2. Write the following code and save it as a file named test_psi_udf.py.

    import sys
    from odps.udf import annotate
    
    @annotate("double->double")
    class MyPsi(object):
        def __init__(self):
            # Add the package directory inside the archive to the module search path.
            sys.path.insert(0, "work/scipy-bundle.tar.gz/packages")
    
        def evaluate(self, arg0):
            # Put the import statement inside the evaluate method.
            from scipy.special import psi
    
            return float(psi(arg0))

    Code description: The __init__ method adds work/scipy-bundle.tar.gz/packages to sys.path because MaxCompute decompresses every archive resource that the UDF references into a folder under the work directory, and the folder name is the same as the resource name. packages is the subdirectory that pyodps-pack creates inside the generated package. The import statement for SciPy is placed inside the evaluate method because the third-party package is available only at run time. When the UDF is parsed on the MaxCompute server, the parsing environment does not contain the third-party package, and an import statement outside the method body would cause an error.

  3. Upload test_psi_udf.py as a MaxCompute Python resource and upload scipy-bundle.tar.gz as an archive resource.

  4. Create a UDF named test_psi_udf, reference the two uploaded resource files, and specify the class name as test_psi_udf.MyPsi.

    You can perform Step 3 and Step 4 in a PyODPS node or on the MaxCompute client.

    • In a PyODPS node:

      import os
      from odps import ODPS
      
      # Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_ID to your AccessKey ID. 
      # Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_SECRET to your AccessKey secret. 
      # We recommend that you do not directly use your AccessKey ID or AccessKey secret.
      o = ODPS(
          os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
          os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
          project='<your-default-project>',
          endpoint='<your-end-point>',
      )
      
      bundle_res = o.create_resource(
          "scipy-bundle.tar.gz", "archive", fileobj=open("scipy-bundle.tar.gz", "rb")
      )
      udf_res = o.create_resource(
          "test_psi_udf.py", "py", fileobj=open("test_psi_udf.py", "rb")
      )
      o.create_function(
          "test_psi_udf", class_type="test_psi_udf.MyPsi", resources=[bundle_res, udf_res]
      )
    • On the MaxCompute client:

      add archive scipy-bundle.tar.gz;
      add py test_psi_udf.py;
      create function test_psi_udf as test_psi_udf.MyPsi using test_psi_udf.py,scipy-bundle.tar.gz;
  5. After you complete the preceding operations, you can use the UDF in SQL statements.

    set odps.pypy.enabled=false;
    set odps.isolation.session.enable=true;
    select test_psi_udf(sepal_length) from iris;
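
In a PyODPS node, you can run the same query by calling execute_sql and passing the two SET statements as hints. The following minimal sketch assumes the iris table from this example and an entry object created as shown in the preceding samples:

import os
from odps import ODPS

o = ODPS(
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='<your-default-project>',
    endpoint='<your-end-point>',
)

# Pass the flags from the SET statements as hints of the SQL job.
hints = {
    "odps.pypy.enabled": "false",
    "odps.isolation.session.enable": True,
}
instance = o.execute_sql("select test_psi_udf(sepal_length) from iris", hints=hints)
# Read the query result after the job finishes.
with instance.open_reader() as reader:
    for record in reader:
        print(record[0])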

Reference a third-party package in PyODPS DataFrame

You can reference a third-party library in PyODPS DataFrame by specifying the libraries parameter in the execute or persist method. This section describes how to reference a third-party package in PyODPS DataFrame if you use the map method. The procedure is similar if you use the apply or map_reduce method.

  1. Run the following command to package SciPy.

    pyodps-pack -o scipy-bundle.tar.gz scipy
  2. In this example, a table named test_float_col is used. This table contains only one column of the FLOAT type.

       col1
    0  3.75
    1  2.51

    Run the following code to calculate the value of psi(col1):

    import os
    from odps import ODPS, options
    
    def my_psi(v):
        from scipy.special import psi
    
        return float(psi(v))
    
    # If the isolation feature is enabled for your project, you are not required to configure the following option:
    options.sql.settings = {"odps.isolation.session.enable": True}
    
    # Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_ID to your AccessKey ID. 
    # Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_SECRET to your AccessKey secret. 
    # We recommend that you do not directly use your AccessKey ID or AccessKey secret.
    o = ODPS(
        os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
        os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
        project='<your-default-project>',
        endpoint='<your-end-point>',
    )
    df = o.get_table("test_float_col").to_df()
    # Execute the expression and obtain the result.
    df.col1.map(my_psi).execute(libraries=["scipy-bundle.tar.gz"])
    # Save the result to another table.
    df.col1.map(my_psi).persist("result_table", libraries=["scipy-bundle.tar.gz"])
  3. Optional. If you want to use the same third-party package for all subsequent executions, you can configure the global option.

    from odps import options
    options.df.libraries = ["scipy-bundle.tar.gz"]

After you perform the preceding operations, the third-party package is referenced each time the DataFrame code is executed, and you no longer need to specify the libraries parameter in each call.
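
The apply method follows the same pattern. The following minimal sketch, which assumes the test_float_col table, the scipy-bundle.tar.gz archive resource, and an entry object o from the preceding steps, computes psi row by row and outputs it as an additional column:

def my_psi_row(row):
    # Import SciPy inside the function so that the archive is resolved at run time.
    from scipy.special import psi

    return row.col1, float(psi(row.col1))

df = o.get_table("test_float_col").to_df()
# axis=1 applies the function to every row; names and types describe the output columns.
df.apply(
    my_psi_row,
    axis=1,
    names=["col1", "psi_col1"],
    types=["float64", "float64"],
).execute(libraries=["scipy-bundle.tar.gz"])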

Reference a third-party package in DataWorks

A DataWorks PyODPS node provides built-in third-party packages and also provides the load_resource_package method for you to reference other packages. For more information, see Use a third-party package.

Manually upload and reference a third-party package

Note

This section provides a reference for maintaining an existing project or environment. For newly created projects, we recommend that you use pyodps-pack.

In some existing projects, you may manually upload all WHL dependencies and reference them in the code, or use MaxCompute of an early version that does not support binary packages. This section describes how to manually upload and reference a third-party package in these existing projects. In this example, python_dateutil is used in the map method.

  1. Run the pip download command in Linux Bash to download the third-party package and its dependencies to a directory. Two packages are downloaded: six-1.10.0-py2.py3-none-any.whl and python_dateutil-2.5.3-py2.py3-none-any.whl.

    pip download python-dateutil -d /to/path/
    Note

    The downloaded packages must support Linux. We recommend that you run this command on Linux.

  2. Upload the downloaded packages to MaxCompute as resources.

    • Method 1: Use code.

      # Make sure that the resource names keep the correct file name extensions.
      odps.create_resource('six.whl', 'file', file_obj=open('six-1.10.0-py2.py3-none-any.whl', 'rb'))
      odps.create_resource('python_dateutil.whl', 'file', file_obj=open('python_dateutil-2.5.3-py2.py3-none-any.whl', 'rb'))
    • Method 2: Use DataWorks.

      You can upload and submit the destination resources by following the steps in Create a resource or upload an existing resource.

  3. Reference the third-party package.

    In this example, a DataFrame object contains only one field of the STRING type. The data is shown below.

                   datestr
    0  2016-08-26 14:03:29
    1  2015-08-26 14:03:29
    • Specify the third-party libraries in the global configuration:

      from odps import options
      
      def get_year(t):
          from dateutil.parser import parse
          return parse(t).strftime('%Y')
      
      options.df.libraries = ['six.whl', 'python_dateutil.whl']
      df.datestr.map(get_year).execute()
         datestr
      0     2016
      1     2015

    • Specify the libraries by using the libraries parameter of a method that is executed immediately, such as execute:

      def get_year(t):
          from dateutil.parser import parse
          return parse(t).strftime('%Y')
      
      df.datestr.map(get_year).execute(libraries=['six.whl', 'python_dateutil.whl'])
         datestr
      0     2016
      1     2015

By default, PyODPS supports Python libraries that contain only Python code and do not involve file operations. In later versions of MaxCompute, PyODPS also supports Python libraries that include binary code or involve file operations. The names of such libraries must end with specific platform suffixes. The following table lists the suffixes that are supported on each platform and Python version.

Platform         Python version    Supported suffixes
RHEL 5 x86_64    Python 2.7        cp27-cp27m-manylinux1_x86_64
RHEL 5 x86_64    Python 3.7        cp37-cp37m-manylinux1_x86_64
RHEL 7 x86_64    Python 2.7        cp27-cp27m-manylinux1_x86_64, cp27-cp27m-manylinux2010_x86_64, cp27-cp27m-manylinux2014_x86_64
RHEL 7 x86_64    Python 3.7        cp37-cp37m-manylinux1_x86_64, cp37-cp37m-manylinux2010_x86_64, cp37-cp37m-manylinux2014_x86_64
RHEL 7 Arm64     Python 3.7        cp37-cp37m-manylinux2014_aarch64

All WHL packages must be uploaded to MaxCompute as archive resources. Before you upload the packages, you must convert the packages into ZIP files by changing file name extensions. You also need to set the odps.isolation.session.enable parameter to True for the job or your project. The following example demonstrates how to upload and use special functions in SciPy:

# Packages that contain binary code must be uploaded as archive resources. You must convert WHL packages into ZIP files before you upload the packages.
odps.create_resource('scipy.zip', 'archive', file_obj=open('scipy-0.19.0-cp27-cp27m-manylinux1_x86_64.whl', 'rb'))

# If the isolation feature is enabled for your project, you are not required to configure the following option:
options.sql.settings = { 'odps.isolation.session.enable': True }

def my_psi(value):
    # We recommend that you put the IMPORT statement inside a function to import third-party libraries. This prevents runtime errors caused by structural differences of binary packages in different operating systems.
    from scipy.special import psi
    return float(psi(value))

df.float_col.map(my_psi).execute(libraries=['scipy.zip'])

If a binary package that you want to use is distributed only as source code, you can build it into a WHL file by running the following shell command in Linux and then upload it. WHL files that are generated in macOS or Windows cannot be used in MaxCompute.

python setup.py bdist_wheel