When you run PyFlink jobs on an E-MapReduce (EMR) Dataflow cluster, you often need to package Python environments, third-party libraries, JAR connectors, or data files alongside your job. The Flink Python API in an EMR Dataflow cluster supports all features of the Apache Flink Python API. For background on the API, see Python API.
This topic explains how to package and use each of the four supported Python dependency types in a PyFlink job.
Python dependency types
| Dependency type | When to use |
|---|---|
| Custom Python virtual environment | Your job requires a specific Python version or a set of pre-installed packages |
| Third-party Python package | Your job imports an external Python library that is not in the default environment |
| JAR package | Your job uses Java classes, such as a connector or a user-defined function written in Java |
| Data file | Your job reads a static file at runtime, such as a model file |
Use a custom Python virtual environment
Build the virtual environment using one of the following methods, then reference it in your PyFlink job.
Method 1: Build on a cluster node
Use this method when you have direct access to a node in the Dataflow cluster.
-
On a cluster node, create a file named
setup-pyflink-virtual-env.shwith the following content:set -e # Create a Python virtual environment. python3.6 -m venv venv # Activate the Python virtual environment. source venv/bin/activate # Upgrade pip. pip install --upgrade pip # Install PyFlink. pip install "apache-flink==1.13.0" # Exit the virtual environment. deactivate -
Run the script:
./setup-pyflink-virtual-env.sh
A venv folder is created in the current directory. It contains the Python 3.6 virtual environment. To use a different Python version or include additional packages, edit the script before running it.
To use the virtual environment in a PyFlink job, distribute it to all nodes in the Dataflow cluster, or reference it when submitting the job via the command-line interface (CLI). For details, see Submitting PyFlink jobs.
Method 2: Build on a local machine
Use this method when you need to build a cross-platform virtual environment that runs on most Linux distributions. This approach uses the quay.io/pypa/manylinux2014_x86_64 Docker image to ensure compatibility. For more information about the image, see manylinux.
-
On your local machine, create a file named
setup-pyflink-virtual-env.shwith the following content:set -e # Download the Miniconda installer for Python 3.7. wget "https://repo.continuum.io/miniconda/Miniconda3-py37_4.9.2-Linux-x86_64.sh" -O "miniconda.sh" # Make the installer executable. chmod +x miniconda.sh # Create a Python virtual environment. ./miniconda.sh -b -p venv # Activate the Conda virtual environment. source venv/bin/activate "" # Install PyFlink. pip install "apache-flink==1.13.0" # Exit the Conda virtual environment. conda deactivate # Remove cached packages to reduce size. rm -rf venv/pkgs # Archive the virtual environment. zip -r venv.zip venv -
Create a file named
build.shwith the following content:#!/bin/bash set -e -x yum install -y zip wget cd /root/ bash /build/setup-pyflink-virtual-env.sh mv venv.zip /build/ -
Run the build inside the manylinux Docker container:
docker run -it --rm -v $PWD:/build -w /build quay.io/pypa/manylinux2014_x86_64 ./build.sh
A file named venv.zip is created in the current directory. It contains the Python 3.7 virtual environment packaged for Linux. To use a different Python version or include additional packages, edit setup-pyflink-virtual-env.sh before running the build.
Use a third-party Python package
How you use a third-party Python package depends on whether the package is marked with the zip_safe flag:
-
`zip_safe` packages: Use the package directly in a PyFlink job. See Python libraries for details.
-
Source code packages (with a
setup.pyin the root directory): Compile the package first, then use it.
To compile a source code package, use one of these methods:
-
Compile it on a node in the Dataflow cluster.
-
Compile it using the
quay.io/pypa/manylinux2014_x86_64Docker image. Packages compiled with this image are compatible with most Linux operating systems.
The following steps use opencv-python-headless as an example.
Step 1: Compile the package
-
On your local machine, create a file named
requirements.txtand list the packages to install:opencv-python-headlessTo install multiple packages, add one package per line.
-
Create a file named
build.shwith the following content:#!/bin/bash set -e -x yum install -y zip PYBIN=/opt/python/cp37-cp37m/bin "${PYBIN}/pip" install --target __pypackages__ -r requirements.txt --no-deps cd __pypackages__ && zip -r deps.zip . && mv deps.zip ../ && cd .. rm -rf __pypackages__ -
Run the build inside the manylinux Docker container:
docker run -it --rm -v $PWD:/build -w /build quay.io/pypa/manylinux2014_x86_64 /bin/bash build.sh
A file named deps.zip is created in the current directory. It contains the compiled package. To install additional packages, add them to requirements.txt before running the build.
Step 2: Use the package in a PyFlink job
For the full list of options and additional usage details, see Python libraries.
Use a JAR package
If your PyFlink job uses Java classes—such as a connector or a user-defined function written in Java—specify the JAR package that contains those classes. For details, see JAR dependencies.
Use a data file
If your PyFlink job needs to read a static file at runtime, such as a model file, package it as an archive file and reference it in the job. For details, see Archives.