This topic describes multiple packaging methods for PySpark packages.
Overview
A PySpark job typically depends on two types of Python resources: third-party libraries (such as other Python libraries, plug-ins, or projects) and user-defined modules. Because you cannot install Python libraries directly on MaxCompute clusters, package them locally and upload them using spark-submit.
Manage third-party libraries: For specific dependencies, your local packaging environment must match the production environment. Use one of the following methods:
Method 1: Use public resources without packaging
No extra resources are required. However, you can only use the default Python environment.
Method 2: Upload a single wheel package
Use this method when you need only a few simple Python dependencies.
Method 3: Use pyodps-pack to package quickly
Use this method to batch-package many dependencies with Docker, either from the command line or from a requirements.txt file.
Method 4: Generate a Python environment with a script
This method uses Docker to provide several Python versions. It reads your requirements file and generates a complete Python package in one step.
Method 5: Package a Python environment with Docker
You can choose any Python version. Docker provides only a Linux environment. You must upload the final Python environment to MaxCompute resources.
Manage user-defined modules: This means referencing user-defined Python packages. Package the folder that contains all your custom modules (for example, a Python package with an __init__.py file) into a .zip file. This avoids uploading and referencing files one by one.
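As a sketch of the packaging rule above (zip with relative paths so the archive root is the package itself, not its parent directory), the following standalone Python snippet builds such a .zip; the package name my_module is a hypothetical example:

```python
import os
import zipfile

def pack_module(module_dir: str, zip_path: str) -> None:
    """Zip a package directory using paths relative to its parent,
    so entries look like 'my_module/__init__.py' with no parent_dir prefix."""
    parent = os.path.dirname(os.path.abspath(module_dir))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(module_dir):
            for name in files:
                full = os.path.join(root, name)
                # Store the path relative to the parent directory.
                zf.write(full, os.path.relpath(full, parent))

# Example: build a tiny package and zip it.
os.makedirs("my_module", exist_ok=True)
open("my_module/__init__.py", "w").close()
pack_module("my_module", "my_module.zip")
print(zipfile.ZipFile("my_module.zip").namelist())  # ['my_module/__init__.py']
```

If the archive listing starts with the parent directory name instead of the package name, the cluster-side import path will not match.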
Method 1: Use public resources without packaging
Update the configuration in spark-defaults.conf or in DataWorks. The following examples show the default environment configurations for different Python versions.
Python 2.7.13

Default Python 2.7.13 environment configuration. View the list of third-party libraries.

```
spark.hadoop.odps.cupid.resources = public.python-2.7.13-ucs4.tar.gz
spark.pyspark.python = ./public.python-2.7.13-ucs4.tar.gz/python-2.7.13-ucs4/bin/python
```

Python 3.6.12

Default Python 3.6.12 environment configuration.

```
spark.hadoop.odps.cupid.resources = public.python-3.6.12.tar.gz
spark.pyspark.python = ./public.python-3.6.12.tar.gz/python-3.6.12/bin/python3
```

Python 3.7.9

Default Python 3.7.9 environment configuration. View the list of third-party libraries.

```
spark.hadoop.odps.cupid.resources = public.python-3.7.9-ucs4.tar.gz
spark.pyspark.python = ./public.python-3.7.9-ucs4.tar.gz/python-3.7.9-ucs4/bin/python3
```

Python 3.11

```
spark.hadoop.odps.spark.alinux3.enabled = true
```

Method 2: Upload a single wheel package
If your dependencies are simple, upload a single wheel package. Use a manylinux build so that the package runs on the cluster. First, download the wheel package you need.
1. Rename the wheel package as a zip file. For example, rename the downloaded PyMySQL wheel package as pymysql.zip.
2. Log on to the DataWorks console and select a region in the upper-left corner.
3. In the Select Workspace section, click Go To DataStudio.
4. In the left-side navigation pane, create a MaxCompute Archive resource and upload the packaged pymysql.zip. You can then use the Spark node in DataWorks.
5. Add the following configuration to the spark-defaults.conf file or in DataWorks:

   ```
   spark.executorEnv.PYTHONPATH=pymysql
   spark.yarn.appMasterEnv.PYTHONPATH=pymysql
   ```

6. In your code, import the package:

   ```
   import pymysql
   ```
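Because only manylinux (or pure-Python) wheels run on the cluster, it can help to check the platform tag in a wheel filename before uploading. The helper below is an illustrative sketch, not part of any official tool, and the filenames are examples:

```python
def wheel_platform_tag(filename: str) -> str:
    """Return the platform tag from a wheel filename:
    {dist}-{version}(-{build})?-{python}-{abi}-{platform}.whl"""
    return filename[:-len(".whl")].split("-")[-1]

def is_cluster_compatible(filename: str) -> bool:
    tag = wheel_platform_tag(filename)
    # Pure-Python ("any") and manylinux wheels are safe; macOS/Windows builds are not.
    return tag == "any" or tag.startswith("manylinux")

print(is_cluster_compatible("PyMySQL-1.1.0-py3-none-any.whl"))                    # True
print(is_cluster_compatible("numpy-1.24.4-cp38-cp38-manylinux_2_17_x86_64.whl"))  # True
print(is_cluster_compatible("numpy-1.24.4-cp38-cp38-win_amd64.whl"))              # False
```

A wheel whose platform tag is neither "any" nor manylinux will fail at import time on the Linux cluster even if the upload succeeds.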
Method 3: Use pyodps-pack to package quickly
If you need to package many third-party Python libraries and do not depend heavily on a specific Python version, use the pyodps-pack tool. This tool uses Docker to provide a packaging environment. It supports batch processing of dependencies from a requirements.txt file. The following example shows how to package the pandas library using Python 3.11. For full details, see Create and use third-party packages – PyODPS 0.12.4.
Python version support: pyodps-pack supports Python versions 3.8 through 3.14.
Package with pyodps-pack:

```
# Run in your development environment.
pip install pyodps
# Use the same Python version as your Spark job.
# Install and run Docker if it is not already installed.
pyodps-pack pandas --python-version=3.11 -o pyodps-pandas.tar.gz
# Or use a requirements.txt file.
pyodps-pack -r requirements.txt --python-version=3.11 -o pyodps-pandas.tar.gz
```

Upload the package with odpscmd:

```
add archive PATH/pyodps-pandas.tar.gz -f;
```

Configure job startup settings

Update the spark-defaults.conf file:

```
spark.hadoop.odps.cupid.resources = {your_project}.pyodps-pandas.tar.gz
spark.executorEnv.PYTHONPATH = ./{your_project}.pyodps-pandas.tar.gz/packages
spark.yarn.appMasterEnv.PYTHONPATH = ./{your_project}.pyodps-pandas.tar.gz/packages
# Skip the following lines if you use notebooks.
# Set the Spark version to 3.4 or 3.5.
spark.hadoop.odps.spark.version = spark-3.4.2-odps0.48.0
# or: spark.hadoop.odps.spark.version = spark-3.5.2-odps0.48.0
# Switch to Python 3.11 (optional). Make sure this matches the Python version used with pyodps-pack.
spark.hadoop.odps.spark.alinux3.enabled = true
```
Method 4: Generate a Python environment with a script
If you need to package many third-party libraries, this automated packaging script saves you the repetitive work of uploading individual wheel packages (Method 2). Prepare a requirements file (see the requirements file guide) and run the script to build a complete Python environment with all required dependencies. You can use this environment directly in PySpark jobs. This method supports Python 2.7, 3.5, 3.6, and 3.7. If your project does not require a specific Python version, use Python 3.7.
Follow these steps:
Download the automated packaging script.
The script runs on Mac and Linux. Install Docker first. See the Docker documentation.
In the command line, change the file permissions and view usage help:

```
$ chmod +x generate_env_pyspark.sh
$ generate_env_pyspark.sh -h
Usage: generate_env_pyspark.sh [-p] [-r] [-t] [-c] [-h]
Description:
  -p ARG, the version of Python. Supported versions: 2.7, 3.5, 3.6, and 3.7.
  -r ARG, the local path to your requirements file.
  -t ARG, the output directory for the gz package.
  -c, clean mode. Packages only your specified dependencies. Does not include pre-installed dependencies.
  -h, display help for this script.
```

The -c option enables clean mode. In clean mode, pre-installed dependencies cannot be used, but the resulting Python packages are smaller. MaxCompute currently has a 500 MB limit on uploaded resources. Therefore, if you do not need most pre-installed dependencies, we strongly recommend using the -c option to package your code. Click the links to view pre-installed dependencies for different Python versions: Python 2.7 pre-installed dependencies, Python 3.5 pre-installed dependencies, Python 3.6 pre-installed dependencies, Python 3.7 pre-installed dependencies.

Example packaging commands:

```
# Package with pre-installed dependencies.
$ generate_env_pyspark.sh -p 3.7 -r your_path_to_requirements -t your_output_directory
# Package without pre-installed dependencies (clean mode).
$ generate_env_pyspark.sh -p 3.7 -r your_path_to_requirements -t your_output_directory -c
```

Usage in Spark

generate_env_pyspark.sh generates a gz package for the specified Python version (the -p option) in the specified directory (the -t option). For example, Python 3.7 generates py37.tar.gz. You can then upload this package as an archive resource, for example with odpscmd or odps-sdk. For more information about uploading resources, see Resource Operations.

```
# Run in odpscmd.
add archive /your/path/to/py37.tar.gz -f;
```

Then add the following configuration to the spark-defaults.conf file or in DataWorks:

```
spark.hadoop.odps.cupid.resources = your_project.py37.tar.gz
spark.pyspark.python = your_project.py37.tar.gz/bin/python
```

If the preceding two parameters do not take effect (for example, when debugging PySpark in Zeppelin notebooks), add these two configuration items to your Spark job:

```
spark.yarn.appMasterEnv.PYTHONPATH = ./your_project.py37.tar.gz/bin/python
spark.executorEnv.PYTHONPATH = ./your_project.py37.tar.gz/bin/python
```
Method 5: Package a Python environment with Docker
Use this method if either of the following applies:
Your dependencies include non-Python code: Your library includes binary files such as .so files. These cannot be distributed using simple pip install commands or zip uploads.

You need a custom Python version: Your project requires a specific Python version (for example, Python 3.8), not the versions provided by the platform (2.7, 3.5, 3.6, or 3.7).
This method uses Docker to package your environment. The following example shows how to build a Python 3.8 environment.
Prepare a Dockerfile.
On a local development machine with Docker installed, create a file named Dockerfile. Choose one of the templates below based on your target environment.

Python 3.8 + Alibaba Cloud Linux 2

```
FROM alibaba-cloud-linux-2-registry.cn-hangzhou.cr.aliyuncs.com/alinux2/alinux2:latest
RUN curl -o /etc/yum.repos.d/CentOS-Base.repo https://mirrors.aliyun.com/repo/Centos-7.repo
RUN curl -o /etc/yum.repos.d/epel.repo https://mirrors.aliyun.com/repo/epel-7.repo
RUN set -ex \
    # Pre-install required components.
    && yum clean all \
    && yum makecache \
    && yum install -y wget tar libffi-devel zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gcc make initscripts zip \
    && wget https://www.python.org/ftp/python/3.8.20/Python-3.8.20.tgz \
    && tar -zxvf Python-3.8.20.tgz \
    && cd Python-3.8.20 \
    && ./configure prefix=/usr/local/python3 \
    && make \
    && make install \
    && make clean \
    && rm -rf /Python-3.8.20* \
    && yum install -y epel-release \
    && yum install -y python-pip
# Set Python 3 as the default.
RUN set -ex \
    # Back up the old Python version.
    && mv /usr/bin/python /usr/bin/python27 \
    && mv /usr/bin/pip /usr/bin/pip-python27 \
    # Configure Python 3 as the default.
    && ln -s /usr/local/python3/bin/python3.8 /usr/bin/python \
    && ln -s /usr/local/python3/bin/pip3 /usr/bin/pip
# Fix yum failures caused by changing the Python version.
RUN set -ex \
    && sed -i "s#/usr/bin/python#/usr/bin/python27#" /usr/bin/yum \
    && sed -i "s#/usr/bin/python#/usr/bin/python27#" /usr/libexec/urlgrabber-ext-down \
    && yum install -y deltarpm
# Upgrade pip.
RUN pip install --upgrade pip
```

Python 3.7 + Alibaba Cloud Linux 3

```
FROM alibaba-cloud-linux-3-registry.cn-hangzhou.cr.aliyuncs.com/alinux3/alinux3:latest
RUN set -ex \
    && yum clean all \
    && yum makecache \
    && yum install -y \
        wget \
        tar \
        libffi-devel \
        zlib-devel \
        bzip2-devel \
        openssl-devel \
        ncurses-devel \
        sqlite-devel \
        readline-devel \
        tk-devel \
        xz-devel \
        gcc \
        make \
        initscripts \
        zip \
        which
RUN set -ex \
    && wget https://www.python.org/ftp/python/3.7.17/Python-3.7.17.tgz \
    && tar -zxvf Python-3.7.17.tgz \
    && rm -f Python-3.7.17.tgz \
    && cd Python-3.7.17 \
    && ./configure --prefix=/usr/local/python3 \
        --enable-optimizations \
    && make -j$(nproc) \
    && make install \
    && make clean \
    && cd / \
    && rm -rf Python-3.7.17 \
    && yum clean all
RUN ln -sf /usr/local/python3/bin/python3.7 /usr/bin/python \
    && ln -sf /usr/local/python3/bin/pip3 /usr/bin/pip
# Upgrade pip.
RUN pip install --no-cache-dir --upgrade pip
```

Build the image and run the container.
In the directory that contains the Dockerfile, run these commands:
```
# 1. Build the Docker image (example for Python 3.8).
# -t names the image in the format <image_name>:<tag>.
docker build --platform linux/amd64 -t python-centos:3.8 .
# 2. Start a background container.
# --name names the container for easy reference.
docker run --platform linux/amd64 -itd --name python3.8 python-centos:3.8 bash
```

Enter the container and install Python dependencies.

```
docker attach python3.8
pip install [required dependencies]
```

Package the Python environment.

```
cd /usr/local/
zip -r python3.8.zip python3/
```

Copy the Python environment from the container to the host.

```
# Press Ctrl+P+Q to detach from the container first.
docker cp python3.8:/usr/local/python3.8.zip .
```

Upload python3.8.zip to MaxCompute resources.
Upload the file using the local client (odpscmd). Set the upload type to archive. For full command details, see Resource operations.
```
add archive /path/to/python3.8.zip -f;
```

Update the spark-defaults.conf file or DataWorks configuration.

Add the following configuration to the spark-defaults.conf file or in DataWorks when submitting your Spark job:

```
spark.hadoop.odps.cupid.resources=[project_name].python3.8.zip
spark.pyspark.python=./[project_name].python3.8.zip/python3/bin/python3.8
```
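Before running add archive, you can sanity-check that the interpreter path referenced by spark.pyspark.python actually exists inside the archive. A minimal sketch follows; the archive demo_env.zip and its single entry are synthesized here purely for the demonstration:

```python
import zipfile

def archive_has(zip_path: str, member_prefix: str) -> bool:
    """Check whether any entry in the archive starts with member_prefix,
    e.g. 'python3/bin/python3.8' for the environment built above."""
    with zipfile.ZipFile(zip_path) as zf:
        return any(n.startswith(member_prefix) for n in zf.namelist())

# Demonstration with a tiny stand-in archive.
with zipfile.ZipFile("demo_env.zip", "w") as zf:
    zf.writestr("python3/bin/python3.8", "")
print(archive_has("demo_env.zip", "python3/bin/python3.8"))  # True
print(archive_has("demo_env.zip", "python3/bin/python2.7"))  # False
```

Checking locally is cheaper than discovering a wrong spark.pyspark.python path from a failed job on the cluster.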
If shared object (.so) files are missing, manually copy them into the Python environment. Then add these environment variables to your Spark configuration. You can usually find the .so files inside the container:
```
spark.executorEnv.LD_LIBRARY_PATH=$LD_LIBRARY_PATH:./[project_name].python3.8.zip/python3/[so_file_directory]
spark.yarn.appMasterEnv.LD_LIBRARY_PATH=$LD_LIBRARY_PATH:./[project_name].python3.8.zip/python3/[so_file_directory]
```

Reference user-defined Python packages
Create a zip package.
Create an empty __init__.py file in the target directory. Then create a zip package.

```
cd /path/to/parent_dir
touch target_dir/__init__.py
zip -r target_dir.zip target_dir/
```

Note: Use relative paths. Do not include the parent directory in the zip file. When extracted, target_dir.zip must not contain the parent_dir path.

Upload the zip package.
Option 1: Upload as an archive resource in DataWorks.
Option 2: Upload as an archive resource with odpscmd.
```
add archive /path/to/target_dir.zip -f;
```
Reference the zip package in your task.
Option 1: Add the zip package in the archive resources section of a DataWorks task node.
Option 2: Update the spark-defaults.conf file. Use the spark.hadoop.odps.cupid.resources parameter to reference the zip package and assign an alias. Separate multiple resources with commas. Example:

```
spark.hadoop.odps.cupid.resources=[project_name].target_dir.zip:target_dir
```
Configure task parameters.
Update the spark-defaults.conf file. Set task parameters and import the Python package. If you are unsure about file locations, see Data interoperability configuration. Make sure you understand the internal directory structure of the uploaded package. In the examples below, ./ means the current working directory /workdir/. Set the PYTHONPATH environment variable and the corresponding import path correctly.

```
## 1. First case: Assume the internal structure of target_dir.zip is target_dir.zip/sub/target_module.
spark.executorEnv.PYTHONPATH=./target_dir
spark.yarn.appMasterEnv.PYTHONPATH=./target_dir
# Import the Python package.
from sub.target_module import xxx

## 2. Second case: Assume the internal structure of target_dir.zip is target_dir.zip/target_module.
spark.executorEnv.PYTHONPATH=./
spark.yarn.appMasterEnv.PYTHONPATH=./
# Import the Python package.
from target_module import xxx

## Other cases follow the same pattern.
```
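The PYTHONPATH cases above can be simulated locally with sys.path, which is what PYTHONPATH populates at interpreter startup. A hedged sketch for the first case; the package layout is synthesized on the fly, and target_module with its GREETING constant are stand-in names:

```python
import os
import sys

# Recreate the first layout: target_dir/sub/target_module/__init__.py
os.makedirs("target_dir/sub/target_module", exist_ok=True)
open("target_dir/sub/__init__.py", "w").close()
with open("target_dir/sub/target_module/__init__.py", "w") as f:
    f.write("GREETING = 'hello from target_module'\n")

# Local equivalent of spark.executorEnv.PYTHONPATH=./target_dir
sys.path.insert(0, "target_dir")

from sub.target_module import GREETING
print(GREETING)  # hello from target_module
```

Rehearsing the import locally like this is a quick way to confirm that the PYTHONPATH value and import statement agree with the archive's internal structure before submitting the job.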