MaxCompute allows Python user-defined functions (UDFs) to reference third-party packages, including NumPy packages, third-party packages that need to be compiled, and third-party packages that depend on dynamic-link libraries (DLLs). This topic describes how to reference third-party packages in Python UDFs.

Background information

You can use one of the following methods to reference third-party packages in a Python UDF:
  • Reference NumPy packages in Python 3 UDFs

    You must change the file name extension of the NumPy package to .zip, use the MaxCompute client to upload the package as a resource, and then create a UDF that references the package. After the UDF is created, you can call it in SQL statements.

  • Reference third-party packages that need to be compiled

    You must run the setup.py script of the third-party package in an environment that is compatible with MaxCompute to generate a wheel package, and then change the file name extension of the wheel package to .zip. Then, use the MaxCompute client to upload the package as a resource and create a UDF that references the package. After the UDF is created, you can call it in SQL statements. We recommend that you use a Linux operating system. If you use a Windows operating system, we recommend that you compile the package in Docker.

  • Reference third-party packages that are dependent on DLLs

    You must compile the .so library file from the source code of the third-party package, generate a wheel package, and then change the file name extension of the wheel package to .zip. Then, use the MaxCompute client to upload the wheel package and the .so library file as resources and create a UDF that references them. After the UDF is created, you can call it in SQL statements.

Prerequisites

Make sure that the following prerequisites are met:
  • Python is installed. We recommend that you install Python 3.
  • The MaxCompute client is installed and configured. For more information, see Install and configure the MaxCompute client.
  • pip, setuptools, and wheel are installed if you want to use Python UDFs to reference third-party packages that need to be compiled. You can run the pip install setuptools command to install setuptools and run the pip install wheel command to install wheel.
  • PROJ 6 is installed if the third-party package that you want to compile is GDAL 3.0 or later.
  • Docker is installed if you use Docker to compile third-party packages. For more information, see Docker documentation.

Reference NumPy packages in Python 3 UDFs

NumPy is installed by default in the Python 2 runtime of MaxCompute, so you do not need to upload a NumPy package for Python 2 UDFs. For Python 3 UDFs, you must upload the NumPy package yourself. To reference a NumPy package in a Python 3 UDF, perform the following steps:

  1. In the Download files section of the PyPI page, click the package whose name ends with cp37-cp37m-manylinux1_x86_64.whl to download the package. In this example, NumPy 1.19.2 is used.
    Figure: Download the NumPy package
    Note If you download a package whose name ends with a different suffix, the package may not match the Python 3 runtime of MaxCompute and the import may fail. To select another version of the NumPy package, click Release history in the Navigation section in the upper-left corner of the PyPI page to view the historical versions.
  2. Change the file name extension of the downloaded NumPy package to .zip.
    Example: numpy-1.19.2-cp37-cp37m-manylinux1_x86_64.zip.
  3. Use the MaxCompute client to upload the NumPy package to your MaxCompute project. For more information about how to upload the package, see Resource operations.
    Sample command:
    ADD ARCHIVE D:\Downloads\numpy-1.19.2-cp37-cp37m-manylinux1_x86_64.zip -f;
  4. Write a Python UDF script and save it as a PY file.
    In this example, the saved file is named import_numpy.py. The following code shows the Python UDF script:
    from odps.udf import annotate
    
    @annotate("->string")
    class TryImport(object): # The class name is TryImport. 
        def __init__(self):
            import sys
            sys.path.insert(0, 'work/numpy-1.19.2-cp37-cp37m-manylinux1_x86_64.zip') # The NumPy package. You only need to change the package name after work/. 
    
        def evaluate(self):
            import numpy
            return "import succeed"
  5. Use the MaxCompute client to upload the import_numpy.py script to your MaxCompute project as a resource.
    Sample command:
    ADD PY D:\Desktop\import_numpy.py -f;
  6. Use the uploaded import_numpy.py script and NumPy package to create a UDF on the MaxCompute client. For more information about how to create a UDF, see Function operations.
    In this example, the created UDF is named numpy. Sample command:
    CREATE FUNCTION numpy AS 'import_numpy.TryImport' USING 'doc_test_dev/resources/import_numpy.py,numpy-1.19.2-cp37-cp37m-manylinux1_x86_64.zip';
    Note When you create a UDF, you must add the uploaded NumPy package, such as numpy-1.19.2-cp37-cp37m-manylinux1_x86_64.zip, to the resource list.
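  7. After you create the UDF, you can call it in SQL statements. Make sure that Python 3 is enabled for the session that executes the SQL statements, for example, by running set odps.sql.python.version=cp37; before the statement that calls the UDF. For more information, see Python 3 UDFs.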
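
Step 2 can also be scripted. The following is a minimal sketch, not part of the MaxCompute tooling: the helper name wheel_to_zip is hypothetical, and the only assumption is that a wheel is an ordinary ZIP archive, which the helper checks before it copies the file.

```python
import os
import shutil
import zipfile


def wheel_to_zip(wheel_path):
    """Copy a .whl file to a sibling file with the .zip extension.

    Hypothetical helper for illustration. Returns the path of the copy.
    """
    root, ext = os.path.splitext(wheel_path)
    if ext.lower() != ".whl":
        raise ValueError("expected a .whl file: %s" % wheel_path)
    # A wheel is a ZIP archive. Reject truncated or corrupted downloads.
    if not zipfile.is_zipfile(wheel_path):
        raise ValueError("not a valid ZIP archive: %s" % wheel_path)
    zip_path = root + ".zip"
    # Copy instead of renaming so that the original wheel is kept.
    shutil.copyfile(wheel_path, zip_path)
    return zip_path
```

For example, calling wheel_to_zip on numpy-1.19.2-cp37-cp37m-manylinux1_x86_64.whl would write numpy-1.19.2-cp37-cp37m-manylinux1_x86_64.zip next to it, which is the file name used in the ADD ARCHIVE command in Step 3.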

Reference third-party packages that need to be compiled

If a third-party package is a TAR.GZ package that is downloaded from PyPI or a source code package that is downloaded from GitHub, the setup.py file may be stored in the root directory of the decompressed package. To use this type of third-party package, you must run the setup.py file to build a wheel package in an environment that is compatible with MaxCompute. Then, upload the wheel package as a resource and create a UDF that references it. After the UDF is created, you can call the third-party package in the Python UDF. For more information about how to upload a resource and create a UDF, see Reference NumPy packages in Python 3 UDFs.

Notice
  • Third-party packages run in a Linux operating system. We recommend that you compile third-party packages in a Linux operating system. If you compile third-party packages in a Windows operating system, compatibility issues may occur.
  • If you use a Windows operating system, we recommend that you run the setup.py file and generate the wheel package in a Docker container created from the quay.io/pypa/manylinux2010_x86_64 image, and that you use the Python interpreter of the required version in that container. The interpreters are stored in /opt/python/cp27-cp27m/bin/python (Python 2.7) and /opt/python/cp37-cp37m/bin/python3 (Python 3.7).
If you use a Linux operating system, make sure that the following requirements are met:
  • A Python version that is compatible with MaxCompute is used. You can run the following command in the command-line interface (CLI) of your system to check the Python version:
    python -c "import wheel.pep425tags; print(wheel.pep425tags.get_abi_tag())"
    • If cp27m or cp37m is returned, the version meets the compatibility requirements.
    • If cp27mu or cp37mu is returned, the version does not meet the compatibility requirements. In this case, you must rebuild Python from source and pass the --enable-unicode=ucs2 option to ./configure so that Python uses the UCS-2 encoding format.
  • If code in C or C++ is required, your Linux operating system must be compatible with the GNU Compiler Collection (GCC) version in use.
    Note We recommend that you use GCC 4.9.2 or earlier. If the GCC version is later than 4.9.2, the .so file in the generated wheel package may be incompatible with MaxCompute.
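
If the wheel.pep425tags module is not available in the wheel version that you installed, the ABI tag can be approximated with the standard library alone. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the Python 2 branch assumes a default (pymalloc) build, and the result is an approximation of the tag rather than the output of the wheel tooling.

```python
import sys


def current_abi_tag():
    """Approximate the ABI tag of the running CPython interpreter."""
    version = "cp%d%d" % sys.version_info[:2]
    if sys.version_info[0] >= 3:
        # sys.abiflags exists on POSIX builds of Python 3 ("m" on 3.7).
        return version + getattr(sys, "abiflags", "")
    # Python 2: narrow (UCS-2) builds report sys.maxunicode == 0xFFFF.
    # Assumption: a default pymalloc build, hence the "m" prefix.
    flags = "m" if sys.maxunicode == 0xFFFF else "mu"
    return version + flags


def is_maxcompute_compatible(abi_tag):
    """Return True for the tags that the list above names as compatible."""
    return abi_tag in ("cp27m", "cp37m")


if __name__ == "__main__":
    tag = current_abi_tag()
    print("%s compatible=%s" % (tag, is_maxcompute_compatible(tag)))
```

Run the script with the same interpreter that you plan to use for the build. A tag other than cp27m or cp37m suggests that the interpreter differs from the ones described above.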

If all requirements are met, perform the following steps to compile the setup.py file and generate a wheel package:

  1. Decompress a third-party package to your on-premises machine and run the required command in the CLI to go to the path where the setup.py file is stored.
    For example, after you download and decompress the GDAL-3.2.0.zip package, the setup.py file is stored in D:\Downloads\GDAL-3.2.0. Sample command:
    cd D:\Downloads\GDAL-3.2.0
    Figure: File path after decompression
  2. Run the following command in the CLI and check whether bdist_wheel appears in the list of commands that is returned:
    python setup.py --help-commands
    • If bdist_wheel is listed, go to Step 3.
    • If bdist_wheel is not listed, change from distutils.core import setup to from setuptools import setup in the setup.py file. Then, go to Step 3.
  3. Run the following command in the CLI to compile the setup.py file and generate a wheel package:
    python setup.py bdist_wheel 
    Note The wheel package is stored in the dist folder.
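
The import change described in Step 2 can be applied to the text of setup.py programmatically. The following is a minimal sketch: the function name use_setuptools is hypothetical, and the pattern assumes that the import is written on a single line in the form shown in Step 2.

```python
import re

# Matches "from distutils.core import setup" at the start of a line,
# keeping any indentation and any names imported after "setup".
_DISTUTILS_IMPORT = re.compile(
    r"^([ \t]*)from\s+distutils\.core\s+import\s+setup\b", re.MULTILINE
)


def use_setuptools(setup_source):
    """Return setup.py source with the distutils import swapped for setuptools."""
    return _DISTUTILS_IMPORT.sub(r"\1from setuptools import setup", setup_source)
```

The function only returns the modified text. Reading setup.py and writing the result back is left to the caller, so the original file can be backed up first.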

Reference third-party packages that are dependent on DLLs

Some third-party packages for Python depend on other Python libraries and on DLLs. This section describes how to use a Docker container to compile the .so library files and generate a wheel package that can be used in MaxCompute. The container is created from the quay.io/pypa/manylinux2010_x86_64 image. GDAL 3.0.4 is used in this example. You must upload the generated .so library files, the wheel package, and the NumPy package as resources and create a UDF that references them. After the UDF is created, you can call the third-party package in the Python UDF. For more information about how to upload a resource and create a UDF, see Reference NumPy packages in Python 3 UDFs.
Note Make sure that Docker is installed before you perform the steps in this section. For more information, see Docker documentation.

To reference third-party packages that are dependent on DLLs in Python UDFs, perform the following steps:

  1. View the dependencies in the Dependencies section of the PyPI page.
    The following figure shows the dependencies of GDAL 3.0.4.
    Figure: View dependencies
    Note The dependencies of GDAL 3.0.4 include libgdal and numpy. To obtain libgdal, compile the GDAL source code in the Docker container. To obtain numpy, download the NumPy package from the PyPI page or by using pip in the Docker container.
  2. Obtain the NumPy package.
    You can use one of the following methods to obtain the NumPy package:
    • In the Download files section of the PyPI page, click the package whose name ends with cp37-cp37m-manylinux1_x86_64.whl to download the package.
      Note If Python 2 is used, perform the following operations to download the NumPy package: In the Navigation section of the PyPI page, click Release history, select 1.16.6 or an earlier version, and then click the package whose name ends with cp27-cp27m-manylinux1_x86_64.whl.
    • In the Docker container, run the /opt/python/cp37-cp37m/bin/pip download numpy -d ./ command to download the NumPy package to the current directory.
  3. Compile the .so library file.
    1. Download the GDAL 3.0.4 source code file and decompress it to your on-premises machine.
    2. Pull the quay.io/pypa/manylinux2010_x86_64 image, start a container from the image, and open an interactive shell in the container.
      Sample commands:
      docker pull quay.io/pypa/manylinux2010_x86_64
      docker run -it quay.io/pypa/manylinux2010_x86_64 /bin/bash
    3. Upload the GDAL 3.0.4 source code to the Docker container.
      Sample command:
      docker cp ./gdal-3.0.4 <CONTAINER ID>:/opt/source/  
      For more information about how to obtain CONTAINER ID, see docker ps.
  4. Compile GDAL 3.0.4 source code in the container. For more information, see BuildingOnUnix.
    Sample commands:
    # Use --with-proj to specify the directory in which PROJ 6 is installed. 
    ./configure --prefix=/path/to/install/prefix --with-proj=/path/to/install/proj6/prefix
    make
    make install
    export PATH=/path/to/install/prefix/bin:$PATH
    export LD_LIBRARY_PATH=/path/to/install/prefix/lib:$LD_LIBRARY_PATH
    export GDAL_DATA=/path/to/install/prefix/share/gdal
    # Test
    gdalinfo --version
    The following errors may occur during compilation:
    • configure: error: PROJ 6 symbols not found: If this error occurs, install PROJ 6. GDAL 3.0 and later require PROJ 6.
    • fatal error: zlib.h: No such file or directory: If this error occurs, run the yum install zlib-devel command to install the zlib header files, and then compile again.
  5. Run the docker cp command to copy two .so library files (the actual files, not symbolic links) from the container to your on-premises machine. Obtain libgdal.so from the lib folder in the installation directory of GDAL and libproj.so from the lib folder in the installation directory of PROJ 6.
  6. Generate a GDAL wheel package in the Docker container. For more information, see BuildingOnUnix.
    Sample commands:
    # If NumPy is required, install NumPy first. 
    /opt/python/cp37-cp37m/bin/pip install numpy
    # Switch to the swig/python directory under the directory in which the GDAL source code is saved. 
    cd swig/python
    # Generate a wheel package and save it in the dist folder. Example: GDAL-3.0.4-cp37-cp37m-linux_x86_64.whl
    /opt/python/cp37-cp37m/bin/python setup.py bdist_wheel
  7. Upload the generated .so library files, the wheel package, and the NumPy package as resources and create a UDF that references them. After the UDF is created, you can call the third-party package in the Python UDF. For more information about how to upload a resource and create a UDF, see Reference NumPy packages in Python 3 UDFs.
    Take note of the following items when you upload a resource and create a UDF:
    • When you upload resources, you must upload libgdal.so and libproj.so as file resources and numpy-1.19.2-cp37-cp37m-manylinux1_x86_64.zip and GDAL-3.0.4-cp37-cp37m-linux_x86_64.zip as archive resources.
    • When you create functions, you must add libgdal.so, libproj.so, numpy-1.19.2-cp37-cp37m-manylinux1_x86_64.zip, and GDAL-3.0.4-cp37-cp37m-linux_x86_64.zip to the resource list of the functions.
    Sample code for a Python UDF:
    # coding: utf-8
    from odps.udf import annotate
    from odps.distcache import get_cache_file
    
    def include_file(file_name):
        # Copy a file resource from the distributed cache into the current
        # working directory under the same name so that it can be loaded.
        so_file = get_cache_file(file_name, 'b')
        with open(so_file.name, 'rb') as src, open(file_name, 'wb') as dst:
            dst.write(src.read())
    
    @annotate("->string")
    class TryImport(object):
        def __init__(self):
            import sys
            include_file('libgdal.so.26')
            include_file('libproj.so.15')
            sys.path.insert(0, 'work/GDAL-3.0.4-cp37-cp37m-linux_x86_64.zip') # The compiled GDAL package. You only need to change the package name after work/. 
            sys.path.insert(0, 'work/numpy-1.19.2-cp37-cp37m-manylinux1_x86_64.zip') # The NumPy package. You only need to change the package name after work/. 
    
        def evaluate(self):
            from osgeo import gdal
            from osgeo import ogr
            from osgeo import osr
            from osgeo import gdal_array
            from osgeo import gdalconst
            return "import succeed"
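    Note If an error occurs indicating that libgdal.so.26 or libproj.so.15 cannot be found, the names of the uploaded file resources do not match the names that are passed to include_file. In this case, rename libgdal.so to libgdal.so.26 and libproj.so to libproj.so.15, and then upload the files and add them to the resource list of the function again.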
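
Step 5 asks for the actual .so library files rather than symbolic links. In an installation directory, names such as libgdal.so are often symbolic links that point to a versioned file. The following is a minimal sketch of how the real file behind such a link could be located before it is copied out of the container. The function name resolve_library is hypothetical, and the directory in the usage note is the placeholder path from Step 4.

```python
import os


def resolve_library(lib_dir, link_name):
    """Return the path of the real file behind lib_dir/link_name.

    Hypothetical helper for illustration. Follows every symbolic link
    in the chain and fails if the final target is not a regular file.
    """
    path = os.path.realpath(os.path.join(lib_dir, link_name))
    if not os.path.isfile(path):
        raise IOError("library not found: %s" % path)
    return path
```

For example, resolve_library("/path/to/install/prefix/lib", "libgdal.so") returns the path of the file that the link chain ends at, which is the file to copy out of the container.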