This topic describes multiple packaging methods for PySpark packages.
Overview
A PySpark job typically depends on two types of Python resources: third-party libraries (such as other Python libraries, plug-ins, or projects) and user-defined modules. Because you cannot install Python libraries directly on MaxCompute clusters, package them locally and upload them using spark-submit.
Manage third-party libraries: For specific dependencies, your local packaging environment must match the production environment. Use one of the following methods:
Method 1: Use public resources without packaging
No extra resources are required. However, you can only use the default Python environment.
Method 2: Upload a single wheel package
Use this method when you need only a few simple Python dependencies.
Method 3: Use pyodps-pack to package quickly
Use this method to batch-package many dependencies with Docker, either from the command line or from a requirements.txt file.
Method 4: Generate a Python environment with a script
This method uses Docker to provide several Python versions. It reads your requirements file and generates a complete Python package in one step.
Method 5: Package a Python environment with Docker
You can choose any Python version. Docker provides only a Linux environment. You must upload the final Python environment to MaxCompute resources.
Manage user-defined modules: This means referencing user-defined Python packages. Package the folder that contains all your custom modules (for example, a Python package with an __init__.py file) into a .zip file. This avoids uploading and referencing files one by one.
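As a sketch of the packaging rule above (zip with relative paths so the archive root is the package itself, not its parent directory), the following standalone Python snippet builds such a .zip; the package name my_module is a hypothetical example:

```python
import os
import zipfile

def pack_module(module_dir: str, zip_path: str) -> None:
    """Zip a package directory using paths relative to its parent,
    so entries look like 'my_module/__init__.py' with no parent_dir prefix."""
    parent = os.path.dirname(os.path.abspath(module_dir))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(module_dir):
            for name in files:
                full = os.path.join(root, name)
                # Store the path relative to the parent directory.
                zf.write(full, os.path.relpath(full, parent))

# Example: build a tiny package and zip it.
os.makedirs("my_module", exist_ok=True)
open("my_module/__init__.py", "w").close()
pack_module("my_module", "my_module.zip")
print(zipfile.ZipFile("my_module.zip").namelist())  # ['my_module/__init__.py']
```

If the archive listing starts with the parent directory name instead of the package name, the cluster-side import path will not match.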
Method 1: Use public resources without packaging
Update the configuration in spark-defaults.conf or in DataWorks. The following examples show the default environment configurations for different Python versions.
Python 2.7.13

Default Python 2.7.13 environment configuration. View the list of third-party libraries.

```
spark.hadoop.odps.cupid.resources = public.python-2.7.13-ucs4.tar.gz
spark.pyspark.python = ./public.python-2.7.13-ucs4.tar.gz/python-2.7.13-ucs4/bin/python
```

Python 3.6.12

Default Python 3.6.12 environment configuration.

```
spark.hadoop.odps.cupid.resources = public.python-3.6.12.tar.gz
spark.pyspark.python = ./public.python-3.6.12.tar.gz/python-3.6.12/bin/python3
```

Python 3.7.9

Default Python 3.7.9 environment configuration. View the list of third-party libraries.

```
spark.hadoop.odps.cupid.resources = public.python-3.7.9-ucs4.tar.gz
spark.pyspark.python = ./public.python-3.7.9-ucs4.tar.gz/python-3.7.9-ucs4/bin/python3
```

Python 3.11

```
spark.hadoop.odps.spark.alinux3.enabled = true
```

Method 2: Upload a single wheel package
If your dependencies are simple, upload a single wheel package. Use a manylinux build so that the package runs on the cluster. First, download the wheel package you need.
1. Rename the wheel package as a zip file. For example, rename the downloaded PyMySQL wheel package as pymysql.zip.
2. Log on to the DataWorks console and select a region in the upper-left corner.
3. In the Select Workspace section, click Go To DataStudio.
4. In the left-side navigation pane, create a MaxCompute Archive resource and upload the packaged pymysql.zip. You can then use the Spark node in DataWorks.
5. Add the following configuration to the spark-defaults.conf file or in DataWorks:

   ```
   spark.executorEnv.PYTHONPATH=pymysql
   spark.yarn.appMasterEnv.PYTHONPATH=pymysql
   ```

6. In your code, import the package:

   ```
   import pymysql
   ```
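Because only manylinux (or pure-Python) wheels run on the cluster, it can help to check the platform tag in a wheel filename before uploading. The helper below is an illustrative sketch, not part of any official tool, and the filenames are examples:

```python
def wheel_platform_tag(filename: str) -> str:
    """Return the platform tag from a wheel filename:
    {dist}-{version}(-{build})?-{python}-{abi}-{platform}.whl"""
    return filename[:-len(".whl")].split("-")[-1]

def is_cluster_compatible(filename: str) -> bool:
    tag = wheel_platform_tag(filename)
    # Pure-Python ("any") and manylinux wheels are safe; macOS/Windows builds are not.
    return tag == "any" or tag.startswith("manylinux")

print(is_cluster_compatible("PyMySQL-1.1.0-py3-none-any.whl"))                    # True
print(is_cluster_compatible("numpy-1.24.4-cp38-cp38-manylinux_2_17_x86_64.whl"))  # True
print(is_cluster_compatible("numpy-1.24.4-cp38-cp38-win_amd64.whl"))              # False
```

A wheel whose platform tag is neither "any" nor manylinux will fail at import time on the Linux cluster even if the upload succeeds.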
Method 3: Use pyodps-pack to package quickly
If you need to package many third-party Python libraries and do not depend heavily on a specific Python version, use the pyodps-pack tool. This tool uses Docker to provide a packaging environment. It supports batch processing of dependencies from a requirements.txt file. The following example shows how to package the pandas library using Python 3.11. For full details, see Create and use third-party packages – PyODPS 0.12.4.
Python version support: pyodps-pack supports Python versions 3.8 through 3.14.
Package with pyodps-pack:

```
# Run in your development environment.
pip install pyodps
# Use the same Python version as your Spark job.
# Install and run Docker if it is not already installed.
pyodps-pack pandas --python-version=3.11 -o pyodps-pandas.tar.gz
# Or use a requirements.txt file.
pyodps-pack -r requirements.txt --python-version=3.11 -o pyodps-pandas.tar.gz
```

Upload the package with odpscmd:

```
add archive PATH/pyodps-pandas.tar.gz -f;
```

Configure job startup settings

Update the spark-defaults.conf file:

```
spark.hadoop.odps.cupid.resources = {your_project}.pyodps-pandas.tar.gz
spark.executorEnv.PYTHONPATH = ./{your_project}.pyodps-pandas.tar.gz/packages
spark.yarn.appMasterEnv.PYTHONPATH = ./{your_project}.pyodps-pandas.tar.gz/packages
# Skip the following lines if you use notebooks.
# Set the Spark version to 3.4 or 3.5.
spark.hadoop.odps.spark.version = spark-3.4.2-odps0.48.0
# or: spark.hadoop.odps.spark.version = spark-3.5.2-odps0.48.0
# Switch to Python 3.11 (optional). Make sure this matches the Python version used with pyodps-pack.
spark.hadoop.odps.spark.alinux3.enabled = true
```
Method 4: Generate a Python environment with a script
If you need to package many third-party libraries, this automated packaging script saves you the repetitive work of uploading individual wheel packages (Method 2). Prepare a requirements file (see the requirements file guide) and run the script to build a complete Python environment with all required dependencies. You can use this environment directly in PySpark jobs. This method supports Python 2.7, 3.5, 3.6, and 3.7. If your project does not require a specific Python version, use Python 3.7.
Follow these steps:
Download the automated packaging script.
The script runs on Mac and Linux. Install Docker first. See the Docker documentation.
In the command line, change the file permissions and view usage help:

```
$ chmod +x generate_env_pyspark.sh
$ generate_env_pyspark.sh -h
Usage: generate_env_pyspark.sh [-p] [-r] [-t] [-c] [-h]
Description:
  -p ARG, the version of Python. Supported versions: 2.7, 3.5, 3.6, and 3.7.
  -r ARG, the local path to your requirements file.
  -t ARG, the output directory for the gz package.
  -c, clean mode. Packages only your specified dependencies. Does not include pre-installed dependencies.
  -h, display help for this script.
```

The -c option enables clean mode. In clean mode, pre-installed dependencies cannot be used, but the resulting Python packages are smaller. MaxCompute currently has a 500 MB limit on uploaded resources. Therefore, if you do not need most pre-installed dependencies, we strongly recommend using the -c option to package your code. Click the links to view pre-installed dependencies for different Python versions: Python 2.7 pre-installed dependencies, Python 3.5 pre-installed dependencies, Python 3.6 pre-installed dependencies, Python 3.7 pre-installed dependencies.

Example packaging commands:

```
# Package with pre-installed dependencies.
$ generate_env_pyspark.sh -p 3.7 -r your_path_to_requirements -t your_output_directory
# Package without pre-installed dependencies (clean mode).
$ generate_env_pyspark.sh -p 3.7 -r your_path_to_requirements -t your_output_directory -c
```

Usage in Spark

generate_env_pyspark.sh generates a gz package for the specified Python version (the -p option) in the specified directory (the -t option). For example, Python 3.7 generates py37.tar.gz. You can then upload this package as an archive resource, for example with odpscmd or odps-sdk. For more information about uploading resources, see Resource Operations.

```
# Run in odpscmd.
add archive /your/path/to/py37.tar.gz -f;
```

Then add the following configuration to the spark-defaults.conf file or in DataWorks:

```
spark.hadoop.odps.cupid.resources = your_project.py37.tar.gz
spark.pyspark.python = your_project.py37.tar.gz/bin/python
```

If the preceding two parameters do not take effect (for example, when debugging PySpark in Zeppelin notebooks), add these two configuration items to your Spark job:

```
spark.yarn.appMasterEnv.PYTHONPATH = ./your_project.py37.tar.gz/bin/python
spark.executorEnv.PYTHONPATH = ./your_project.py37.tar.gz/bin/python
```
Method 5: Package a Python environment with Docker
Use this method if either of the following applies:
Your dependencies include non-Python code: Your library includes binary files such as .so files. These cannot be distributed using simple pip install commands or zip uploads.

You need a custom Python version: Your project requires a specific Python version (for example, Python 3.8), not the versions provided by the platform (2.7, 3.5, 3.6, or 3.7).
This method uses Docker to package your environment. The following example shows how to build a Python 3.8 environment.
Prepare a Dockerfile.
On a local development machine with Docker installed, create a file named Dockerfile. Choose one of the templates below based on your target environment.

Python 3.8 + Alibaba Cloud Linux 2

```
FROM alibaba-cloud-linux-2-registry.cn-hangzhou.cr.aliyuncs.com/alinux2/alinux2:latest
RUN curl -o /etc/yum.repos.d/CentOS-Base.repo https://mirrors.aliyun.com/repo/Centos-7.repo
RUN curl -o /etc/yum.repos.d/epel.repo https://mirrors.aliyun.com/repo/epel-7.repo
RUN set -ex \
    # Pre-install required components.
    && yum clean all \
    && yum makecache \
    && yum install -y wget tar libffi-devel zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gcc make initscripts zip \
    && wget https://www.python.org/ftp/python/3.8.20/Python-3.8.20.tgz \
    && tar -zxvf Python-3.8.20.tgz \
    && cd Python-3.8.20 \
    && ./configure prefix=/usr/local/python3 \
    && make \
    && make install \
    && make clean \
    && rm -rf /Python-3.8.20* \
    && yum install -y epel-release \
    && yum install -y python-pip
# Set Python 3 as the default.
RUN set -ex \
    # Back up the old Python version.
    && mv /usr/bin/python /usr/bin/python27 \
    && mv /usr/bin/pip /usr/bin/pip-python27 \
    # Configure Python 3 as the default.
    && ln -s /usr/local/python3/bin/python3.8 /usr/bin/python \
    && ln -s /usr/local/python3/bin/pip3 /usr/bin/pip
# Fix yum failures caused by changing the Python version.
RUN set -ex \
    && sed -i "s#/usr/bin/python#/usr/bin/python27#" /usr/bin/yum \
    && sed -i "s#/usr/bin/python#/usr/bin/python27#" /usr/libexec/urlgrabber-ext-down \
    && yum install -y deltarpm
# Upgrade pip.
RUN pip install --upgrade pip
```

Python 3.7 + Alibaba Cloud Linux 3

```
FROM alibaba-cloud-linux-3-registry.cn-hangzhou.cr.aliyuncs.com/alinux3/alinux3:latest
RUN set -ex \
    && yum clean all \
    && yum makecache \
    && yum install -y \
        wget \
        tar \
        libffi-devel \
        zlib-devel \
        bzip2-devel \
        openssl-devel \
        ncurses-devel \
        sqlite-devel \
        readline-devel \
        tk-devel \
        xz-devel \
        gcc \
        make \
        initscripts \
        zip \
        which
RUN set -ex \
    && wget https://www.python.org/ftp/python/3.7.17/Python-3.7.17.tgz \
    && tar -zxvf Python-3.7.17.tgz \
    && rm -f Python-3.7.17.tgz \
    && cd Python-3.7.17 \
    && ./configure --prefix=/usr/local/python3 \
        --enable-optimizations \
    && make -j$(nproc) \
    && make install \
    && make clean \
    && cd / \
    && rm -rf Python-3.7.17 \
    && yum clean all
RUN ln -sf /usr/local/python3/bin/python3.7 /usr/bin/python \
    && ln -sf /usr/local/python3/bin/pip3 /usr/bin/pip
# Upgrade pip.
RUN pip install --no-cache-dir --upgrade pip
```

Build the image and run the container.
In the directory that contains the Dockerfile, run these commands:
```
# 1. Build the Docker image (example for Python 3.8).
# -t names the image in the format <image_name>:<tag>.
docker build --platform linux/amd64 -t python-centos:3.8 .
# 2. Start a background container.
# --name names the container for easy reference.
docker run --platform linux/amd64 -itd --name python3.8 python-centos:3.8 bash
```

Enter the container and install Python dependencies.

```
docker attach python3.8
pip install [required dependencies]
```

Package the Python environment.

```
cd /usr/local/
zip -r python3.8.zip python3/
```

Copy the Python environment from the container to the host.

```
# Press Ctrl+P+Q to detach from the container first.
docker cp python3.8:/usr/local/python3.8.zip .
```

Upload python3.8.zip to MaxCompute resources.
Upload the file using the local client (odpscmd). Set the upload type to archive. For full command details, see Resource operations.
```
add archive /path/to/python3.8.zip -f;
```

Update the spark-defaults.conf file or DataWorks configuration.

Add the following configuration to the spark-defaults.conf file or in DataWorks when submitting your Spark job:

```
spark.hadoop.odps.cupid.resources=[project_name].python3.8.zip
spark.pyspark.python=./[project_name].python3.8.zip/python3/bin/python3.8
```
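Before running add archive, you can sanity-check that the interpreter path referenced by spark.pyspark.python actually exists inside the archive. A minimal sketch follows; the archive demo_env.zip and its single entry are synthesized here purely for the demonstration:

```python
import zipfile

def archive_has(zip_path: str, member_prefix: str) -> bool:
    """Check whether any entry in the archive starts with member_prefix,
    e.g. 'python3/bin/python3.8' for the environment built above."""
    with zipfile.ZipFile(zip_path) as zf:
        return any(n.startswith(member_prefix) for n in zf.namelist())

# Demonstration with a tiny stand-in archive.
with zipfile.ZipFile("demo_env.zip", "w") as zf:
    zf.writestr("python3/bin/python3.8", "")
print(archive_has("demo_env.zip", "python3/bin/python3.8"))  # True
print(archive_has("demo_env.zip", "python3/bin/python2.7"))  # False
```

Checking locally is cheaper than discovering a wrong spark.pyspark.python path from a failed job on the cluster.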
If shared object (.so) files are missing, manually copy them into the Python environment. Then add these environment variables to your Spark configuration. You can usually find the .so files inside the container:
```
spark.executorEnv.LD_LIBRARY_PATH=$LD_LIBRARY_PATH:./[project_name].python3.8.zip/python3/[so_file_directory]
spark.yarn.appMasterEnv.LD_LIBRARY_PATH=$LD_LIBRARY_PATH:./[project_name].python3.8.zip/python3/[so_file_directory]
```

Reference user-defined Python packages
Create a zip package.
Create an empty __init__.py file in the target directory. Then create a zip package.

```
cd /path/to/parent_dir
touch target_dir/__init__.py
zip -r target_dir.zip target_dir/
```

Note: Use relative paths. Do not include the parent directory in the zip file. When extracted, target_dir.zip must not contain the parent_dir path.

Upload the zip package.
Option 1: Upload as an archive resource in DataWorks.
Option 2: Upload as an archive resource with odpscmd.
```
add archive /path/to/target_dir.zip -f;
```
Reference the zip package in your task.
Option 1: Add the zip package in the archive resources section of a DataWorks task node.
Option 2: Update the spark-defaults.conf file. Use the spark.hadoop.odps.cupid.resources parameter to reference the zip package and assign an alias. Separate multiple resources with commas. Example:

```
spark.hadoop.odps.cupid.resources=[project_name].target_dir.zip:target_dir
```
Configure task parameters.
Update the spark-defaults.conf file. Set task parameters and import the Python package. If you are unsure about file locations, see Data interoperability configuration. Make sure you understand the internal directory structure of the uploaded package. In the examples below, ./ means the current working directory /workdir/. Set the PYTHONPATH environment variable and the corresponding import path correctly.

```
## 1. First case: Assume the internal structure of target_dir.zip is target_dir.zip/sub/target_module.
spark.executorEnv.PYTHONPATH=./target_dir
spark.yarn.appMasterEnv.PYTHONPATH=./target_dir
# Import the Python package.
from sub.target_module import xxx

## 2. Second case: Assume the internal structure of target_dir.zip is target_dir.zip/target_module.
spark.executorEnv.PYTHONPATH=./
spark.yarn.appMasterEnv.PYTHONPATH=./
# Import the Python package.
from target_module import xxx

## Other cases follow the same pattern.
```
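The PYTHONPATH cases above can be simulated locally with sys.path, which is what PYTHONPATH populates at interpreter startup. A hedged sketch for the first case; the package layout is synthesized on the fly, and target_module with its GREETING constant are stand-in names:

```python
import os
import sys

# Recreate the first layout: target_dir/sub/target_module/__init__.py
os.makedirs("target_dir/sub/target_module", exist_ok=True)
open("target_dir/sub/__init__.py", "w").close()
with open("target_dir/sub/target_module/__init__.py", "w") as f:
    f.write("GREETING = 'hello from target_module'\n")

# Local equivalent of spark.executorEnv.PYTHONPATH=./target_dir
sys.path.insert(0, "target_dir")

from sub.target_module import GREETING
print(GREETING)  # hello from target_module
```

Rehearsing the import locally like this is a quick way to confirm that the PYTHONPATH value and import statement agree with the archive's internal structure before submitting the job.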