Use third-party Python packages—such as SciPy or python-dateutil—in PyODPS by uploading them as MaxCompute resources and referencing them in your code. For instructions on generating a package with pyodps-pack, see Generate a third-party package for PyODPS.
Prerequisites
Before you begin, make sure that you have:
Choose a method
Select the method that fits your scenario:
| Scenario | Recommended method |
|---|---|
| New project, Python UDF or DataFrame | Use pyodps-pack to package and upload, then reference via sys.path or the libraries parameter |
| DataWorks PyODPS node with built-in packages | Use the DataWorks built-in method or load_resource_package |
| Existing project with manually uploaded WHL files | Manual upload (legacy maintenance only; use pyodps-pack for new projects) |
Upload a third-party package
Before referencing a third-party package, upload it to MaxCompute as an archive resource. Use one of the following methods:
Upload with code. Replace packages.tar.gz with the path and name of your package file.

    import os
    from odps import ODPS

    # Load credentials from environment variables.
    # Avoid hardcoding your AccessKey ID or AccessKey secret in code.
    o = ODPS(
        os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
        os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
        project='<your-default-project>',
        endpoint='<your-end-point>',
    )
    o.create_resource("test_packed.tar.gz", "archive", fileobj=open("packages.tar.gz", "rb"))

Upload with DataWorks. See Step 1: Create a resource or upload an existing resource.
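The inline open(...) call in the upload code leaves the file handle open until it is garbage-collected. A minimal sketch of a wrapper that closes the handle deterministically follows. upload_archive is a hypothetical helper name, not part of PyODPS, and the sketch assumes that create_resource finishes reading the file before it returns.

```python
import os


def upload_archive(o, resource_name, local_path):
    """Upload a local package file as a MaxCompute archive resource.

    `o` is an ODPS entry object. upload_archive is a hypothetical helper,
    not a PyODPS API. Assumption: create_resource finishes reading the
    file before it returns, so the handle can be closed right afterwards.
    """
    if not os.path.isfile(local_path):
        raise FileNotFoundError("package file not found: %s" % local_path)
    # The with block closes the handle even if the upload raises.
    with open(local_path, "rb") as package_file:
        return o.create_resource(resource_name, "archive", fileobj=package_file)
```

With an entry object o created as shown earlier, the call would be upload_archive(o, "test_packed.tar.gz", "packages.tar.gz").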
Reference a third-party package in a Python UDF
To use a third-party package in a Python user-defined function (UDF), modify the UDF class:
Add the package path to sys.path in the __init__ method.
Place the import statement inside the function body (the evaluate function or process method).
The import statement must go inside the function body, not at the top of the file. Third-party packages are available only at runtime. When MaxCompute parses the UDF, the parsing environment does not include third-party packages, so a top-level import causes an error.
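The difference between parse time and runtime can be imitated locally. The sketch below is only a simulation on a local file system, not a reproduction of how MaxCompute loads UDFs: it creates a throwaway folder laid out as work/<resource name>/packages, shows that importing before the path is registered fails, and shows that the deferred import inside evaluate succeeds. The names demo_bundle.tar.gz, demo_pkg, and MyDouble are made up for this illustration.

```python
import importlib
import os
import sys
import tempfile

# Build a stand-in for an extracted archive resource. The folder layout
# work/<resource name>/packages follows the description in this topic;
# demo_bundle.tar.gz and demo_pkg are made-up names for this sketch.
root = tempfile.mkdtemp()
package_dir = os.path.join(root, "work", "demo_bundle.tar.gz", "packages")
os.makedirs(package_dir)
with open(os.path.join(package_dir, "demo_pkg.py"), "w") as f:
    f.write("def double(x):\n    return 2 * x\n")


def top_level_import_works():
    """Mimic parse time: the package path is not on sys.path yet."""
    try:
        importlib.import_module("demo_pkg")
        return True
    except ImportError:
        return False


class MyDouble(object):
    def __init__(self):
        # Runtime: the resource has been extracted, so register its path.
        sys.path.insert(0, package_dir)

    def evaluate(self, arg0):
        # The import is resolved only now, after __init__ extended sys.path.
        from demo_pkg import double
        return double(arg0)


parse_time_ok = top_level_import_works()
runtime_result = MyDouble().evaluate(21)
print(parse_time_ok, runtime_result)
```

Run as written, the script prints False 42: the early import fails, and the import deferred to evaluate works.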
Example: use SciPy in a UDF
This example uses the psi function from SciPy in a UDF.
Package SciPy.

    pyodps-pack -o scipy-bundle.tar.gz scipy

Write the UDF code and save it as test_psi_udf.py.

    import sys
    from odps.udf import annotate

    @annotate("double->double")
    class MyPsi(object):
        def __init__(self):
            # Add the package path to sys.path.
            # MaxCompute decompresses archive resources into folders under the work/ directory.
            # The folder name matches the resource name.
            # packages/ is the subdirectory created by pyodps-pack.
            sys.path.insert(0, "work/scipy-bundle.tar.gz/packages")

        def evaluate(self, arg0):
            # Place the import statement inside the function body.
            from scipy.special import psi
            return float(psi(arg0))

Upload test_psi_udf.py as a Python resource and scipy-bundle.tar.gz as an archive resource.

Create the UDF, reference both resources, and set the class name to test_psi_udf.MyPsi. Do this in a PyODPS node or on the MaxCompute client.

In a PyODPS node:

    import os
    from odps import ODPS

    # Load credentials from environment variables.
    # Avoid hardcoding your AccessKey ID or AccessKey secret in code.
    o = ODPS(
        os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
        os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
        project='<your-default-project>',
        endpoint='<your-end-point>',
    )
    bundle_res = o.create_resource(
        "scipy-bundle.tar.gz", "archive", fileobj=open("scipy-bundle.tar.gz", "rb")
    )
    udf_res = o.create_resource(
        "test_psi_udf.py", "py", fileobj=open("test_psi_udf.py", "rb")
    )
    o.create_function(
        "test_psi_udf", class_type="test_psi_udf.MyPsi", resources=[bundle_res, udf_res]
    )

On the MaxCompute client:

    add archive scipy-bundle.tar.gz;
    add py test_psi_udf.py;
    create function test_psi_udf as test_psi_udf.MyPsi using test_psi_udf.py,scipy-bundle.tar.gz;
Run the UDF in a SQL statement.

    set odps.pypy.enabled=false;
    set odps.isolation.session.enable=true;
    select test_psi_udf(sepal_length) from iris;
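A wrong sys.path entry is an easy mistake in this workflow, because the path depends on both the resource name and the folder layout inside the bundle. The sketch below derives the path from the resource name and lists the top-level folders of a local bundle so that the packages folder can be confirmed before uploading. bundle_sys_path and top_level_dirs are hypothetical helper names, and the work/<resource name>/packages layout is taken from the comments in the UDF example.

```python
import tarfile


def bundle_sys_path(resource_name, subdir="packages"):
    """Build the sys.path entry for an archive resource used in a UDF.

    Hypothetical helper. Assumes the layout described in this topic:
    work/<resource name>/<subdirectory created by pyodps-pack>.
    """
    return "work/%s/%s" % (resource_name, subdir)


def top_level_dirs(bundle_path):
    """Return the sorted top-level entry names inside a local .tar.gz bundle."""
    names = set()
    with tarfile.open(bundle_path, "r:gz") as tar:
        for member in tar.getmembers():
            # Ignore empty segments and a leading "./" prefix.
            parts = [p for p in member.name.split("/") if p not in ("", ".")]
            if parts:
                names.add(parts[0])
    return sorted(names)
```

For the bundle in this example, bundle_sys_path("scipy-bundle.tar.gz") returns the string passed to sys.path.insert in the UDF, and top_level_dirs("scipy-bundle.tar.gz") is expected to include packages.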
Reference a third-party package in PyODPS DataFrame
Pass the libraries parameter to the execute or persist method. The following example uses the map method; the procedure is the same for the apply and map_reduce methods.
Package SciPy.

    pyodps-pack -o scipy-bundle.tar.gz scipy

Run the following code to apply the package to a DataFrame operation. This example calculates psi(col1) on a table named test_float_col, which has a single column of the FLOAT type.

    import os
    from odps import ODPS, options

    def my_psi(v):
        from scipy.special import psi
        return float(psi(v))

    # Skip this setting if isolation is already enabled for your project.
    options.sql.settings = {"odps.isolation.session.enable": True}

    # Load credentials from environment variables.
    # Avoid hardcoding your AccessKey ID or AccessKey secret in code.
    o = ODPS(
        os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
        os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
        project='<your-default-project>',
        endpoint='<your-end-point>',
    )

    df = o.get_table("test_float_col").to_df()
    # Execute and return the result.
    df.col1.map(my_psi).execute(libraries=["scipy-bundle.tar.gz"])
    # Save the result to another table.
    df.col1.map(my_psi).persist("result_table", libraries=["scipy-bundle.tar.gz"])

The input data looks like this:

       col1
    0  3.75
    1  2.51

(Optional) To use the same package across all DataFrame operations in the session, set the global parameter.

    from odps import options

    options.df.libraries = ["scipy-bundle.tar.gz"]
Reference a third-party package in DataWorks
A DataWorks PyODPS node provides built-in third-party packages. To use a package that is not built in, call the load_resource_package method. For details, see Use a third-party package.
Manually upload and reference a third-party package
Follow these instructions only for existing projects that already use manually uploaded WHL dependencies, or for environments running an early MaxCompute version that does not support binary packages. For new projects, use pyodps-pack instead.

This example uses python-dateutil in the map method.
Download python-dateutil and its dependencies to a local directory. Run this command in Linux to make sure the packages are compatible with the Linux operating system.

    pip download python-dateutil -d /to/path/

Two packages are downloaded: six-1.10.0-py2.py3-none-any.whl and python_dateutil-2.5.3-py2.py3-none-any.whl.

Upload the packages to MaxCompute.
Method 1: Use code.

    # o is the ODPS entry object created as in the earlier examples.
    # Make sure the file name extensions are valid.
    o.create_resource('six.whl', 'file', file_obj=open('six-1.10.0-py2.py3-none-any.whl', 'rb'))
    o.create_resource('python_dateutil.whl', 'file', file_obj=open('python_dateutil-2.5.3-py2.py3-none-any.whl', 'rb'))

Method 2: Use DataWorks. See Step 1: Create a resource or upload an existing resource.
Reference the packages in your code. This example parses date strings from a DataFrame column.
Set libraries globally:

    from odps import options

    def get_year(t):
        from dateutil.parser import parse
        return parse(t).strftime('%Y')

    options.df.libraries = ['six.whl', 'python_dateutil.whl']
    df.datestr.map(get_year).execute()

Output:

      datestr
    0    2016
    1    2015

Pass libraries per call:

    def get_year(t):
        from dateutil.parser import parse
        return parse(t).strftime('%Y')

    df.datestr.map(get_year).execute(libraries=['six.whl', 'python_dateutil.whl'])

Output:

      datestr
    0    2016
    1    2015
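The resource names used above (six.whl, python_dateutil.whl) are the downloaded file names with the version and tag segments dropped. A minimal sketch of that mapping follows. short_resource_name is a hypothetical helper, and it relies on the wheel naming convention in which the distribution name is the segment before the first hyphen.

```python
def short_resource_name(wheel_filename):
    """Map a wheel file name to a short resource name such as 'six.whl'.

    Hypothetical helper. Wheel file names start with the distribution
    name, followed by a hyphen and the version; hyphens inside the
    distribution name itself are written as underscores.
    """
    if not wheel_filename.endswith(".whl"):
        raise ValueError("not a wheel file: %s" % wheel_filename)
    distribution = wheel_filename.split("-", 1)[0]
    return distribution + ".whl"


downloaded = [
    "six-1.10.0-py2.py3-none-any.whl",
    "python_dateutil-2.5.3-py2.py3-none-any.whl",
]
# The same names are used for the uploaded resources and for the libraries parameter.
libraries = [short_resource_name(name) for name in downloaded]
```

Deriving the list once keeps the resource names and the libraries value from drifting apart when a package version changes.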
Binary package compatibility
By default, PyODPS supports only libraries that contain pure Python code and perform no file operations. Later versions of MaxCompute also support libraries that contain binary code or perform file operations. The file name of such a library must end with a suffix that matches the platform and Python version.
The following table lists the supported suffixes by platform and Python version.
| Platform | Python version | Supported suffix |
|---|---|---|
| RHEL 5 x86_64 | Python 2.7 | cp27-cp27m-manylinux1_x86_64 |
| RHEL 5 x86_64 | Python 3.7 | cp37-cp37m-manylinux1_x86_64 |
| RHEL 7 x86_64 | Python 2.7 | cp27-cp27m-manylinux1_x86_64, cp27-cp27m-manylinux2010_x86_64, cp27-cp27m-manylinux2014_x86_64 |
| RHEL 7 x86_64 | Python 3.7 | cp37-cp37m-manylinux1_x86_64, cp37-cp37m-manylinux2010_x86_64, cp37-cp37m-manylinux2014_x86_64 |
| RHEL 7 Arm64 | Python 3.7 | cp37-cp37m-manylinux2014_aarch64 |
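The table can be turned into a quick local check that runs before a wheel is uploaded. The sketch below copies the suffixes from the table; is_supported_wheel and SUPPORTED_SUFFIXES are hypothetical names, and the check handles only file names that carry a single platform tag. Pure-Python wheels (ending in none-any) are outside the scope of this table and are reported as unsupported by this check.

```python
# Suffixes copied from the table above, keyed by (platform, Python version).
SUPPORTED_SUFFIXES = {
    ("RHEL 5 x86_64", "2.7"): ["cp27-cp27m-manylinux1_x86_64"],
    ("RHEL 5 x86_64", "3.7"): ["cp37-cp37m-manylinux1_x86_64"],
    ("RHEL 7 x86_64", "2.7"): [
        "cp27-cp27m-manylinux1_x86_64",
        "cp27-cp27m-manylinux2010_x86_64",
        "cp27-cp27m-manylinux2014_x86_64",
    ],
    ("RHEL 7 x86_64", "3.7"): [
        "cp37-cp37m-manylinux1_x86_64",
        "cp37-cp37m-manylinux2010_x86_64",
        "cp37-cp37m-manylinux2014_x86_64",
    ],
    ("RHEL 7 Arm64", "3.7"): ["cp37-cp37m-manylinux2014_aarch64"],
}


def is_supported_wheel(wheel_filename, platform, python_version):
    """Return True if the wheel's suffix is listed for the platform and version."""
    if not wheel_filename.endswith(".whl"):
        return False
    stem = wheel_filename[: -len(".whl")]
    suffixes = SUPPORTED_SUFFIXES.get((platform, python_version), [])
    # The suffix is the trailing part of the file name, after a hyphen.
    return any(stem.endswith("-" + suffix) for suffix in suffixes)
```

For example, is_supported_wheel("scipy-0.19.0-cp27-cp27m-manylinux1_x86_64.whl", "RHEL 5 x86_64", "2.7") returns True, while a manylinux2014 wheel checked against RHEL 5 returns False.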
All WHL packages must be uploaded to MaxCompute as archive resources. Before uploading, rename each WHL file to a ZIP file by changing its extension. Also set odps.isolation.session.enable to True for the job or your project.
The following example uploads and uses SciPy as a binary package.

    from odps import options

    # o is the ODPS entry object created as in the earlier examples.
    # Binary packages must be uploaded as archive resources.
    # Rename the .whl file to .zip before uploading.
    o.create_resource('scipy.zip', 'archive', file_obj=open('scipy-0.19.0-cp27-cp27m-manylinux1_x86_64.whl', 'rb'))

    # Skip this setting if isolation is already enabled for your project.
    options.sql.settings = {'odps.isolation.session.enable': True}

    def my_psi(value):
        # Place the import statement inside the function to avoid runtime errors
        # caused by structural differences in binary packages across operating systems.
        from scipy.special import psi
        return float(psi(value))

    df.float_col.map(my_psi).execute(libraries=['scipy.zip'])

To build WHL files for binary packages that are distributed only as source code, run the following command in Linux. WHL files built on macOS or Windows cannot be used in MaxCompute.

    python setup.py bdist_wheel