Develop with Flink Python and manage dependencies - E-MapReduce

When you run PyFlink jobs on an E-MapReduce (EMR) Dataflow cluster, you often need to package Python environments, third-party libraries, JAR connectors, or data files alongside your job. The Flink Python API in an EMR Dataflow cluster supports all features of the Apache Flink Python API. For background on the API, see Python API.

This topic explains how to package and use each of the four supported Python dependency types in a PyFlink job.

Python dependency types

Dependency type	When to use
Custom Python virtual environment	Your job requires a specific Python version or a set of pre-installed packages
Third-party Python package	Your job imports an external Python library that is not in the default environment
JAR package	Your job uses Java classes, such as a connector or a user-defined function written in Java
Data file	Your job reads a static file at runtime, such as a model file

Use a custom Python virtual environment

Build the virtual environment using one of the following methods, then reference it in your PyFlink job.

Method 1: Build on a cluster node

Use this method when you have direct access to a node in the Dataflow cluster.

On a cluster node, create a file named setup-pyflink-virtual-env.sh with the following content:

set -e
# Create a Python virtual environment.
python3.6 -m venv venv

# Activate the Python virtual environment.
source venv/bin/activate

# Upgrade pip.
pip install --upgrade pip

# Install PyFlink.
pip install "apache-flink==1.13.0"

# Exit the virtual environment.
deactivate

Run the script:
```
./setup-pyflink-virtual-env.sh
```

A venv folder is created in the current directory. It contains the Python 3.6 virtual environment. To use a different Python version or include additional packages, edit the script before running it.

To use the virtual environment in a PyFlink job, distribute it to all nodes in the Dataflow cluster, or reference it when submitting the job via the command-line interface (CLI). For details, see Submitting PyFlink jobs.

Method 2: Build on a local machine

Use this method when you need to build a cross-platform virtual environment that runs on most Linux distributions. This approach uses the quay.io/pypa/manylinux2014_x86_64 Docker image to ensure compatibility. For more information about the image, see manylinux.

On your local machine, create a file named setup-pyflink-virtual-env.sh with the following content:

set -e
# Download the Miniconda installer for Python 3.7.
wget "https://repo.continuum.io/miniconda/Miniconda3-py37_4.9.2-Linux-x86_64.sh" -O "miniconda.sh"

# Make the installer executable.
chmod +x miniconda.sh

# Create a Python virtual environment.
./miniconda.sh -b -p venv

# Activate the Conda virtual environment.
source venv/bin/activate ""

# Install PyFlink.
pip install "apache-flink==1.13.0"

# Exit the Conda virtual environment.
conda deactivate

# Remove cached packages to reduce size.
rm -rf venv/pkgs

# Archive the virtual environment.
zip -r venv.zip venv

Create a file named build.sh with the following content:

#!/bin/bash
set -e -x
yum install -y zip wget

cd /root/
bash /build/setup-pyflink-virtual-env.sh
mv venv.zip /build/

Run the build inside the manylinux Docker container:

docker run -it --rm -v $PWD:/build -w /build quay.io/pypa/manylinux2014_x86_64 ./build.sh

A file named venv.zip is created in the current directory. It contains the Python 3.7 virtual environment packaged for Linux. To use a different Python version or include additional packages, edit setup-pyflink-virtual-env.sh before running the build.

Use a third-party Python package

How you use a third-party Python package depends on whether the package is marked with the zip_safe flag:

`zip_safe` packages: Use the package directly in a PyFlink job. See Python libraries for details.
Source code packages (with a setup.py in the root directory): Compile the package first, then use it.

To compile a source code package, use one of these methods:

Compile it on a node in the Dataflow cluster.
Compile it using the quay.io/pypa/manylinux2014_x86_64 Docker image. Packages compiled with this image are compatible with most Linux operating systems.

The following steps use opencv-python-headless as an example.

Step 1: Compile the package

On your local machine, create a file named requirements.txt and list the packages to install:
```
opencv-python-headless
```
To install multiple packages, add one package per line.

Create a file named build.sh with the following content:

#!/bin/bash
set -e -x
yum install -y zip

PYBIN=/opt/python/cp37-cp37m/bin

"${PYBIN}/pip" install --target __pypackages__ -r requirements.txt --no-deps
cd __pypackages__ && zip -r deps.zip . && mv deps.zip ../ && cd ..
rm -rf __pypackages__

Run the build inside the manylinux Docker container:

docker run -it --rm -v $PWD:/build -w /build quay.io/pypa/manylinux2014_x86_64 /bin/bash build.sh

A file named deps.zip is created in the current directory. It contains the compiled package. To install additional packages, add them to requirements.txt before running the build.

Step 2: Use the package in a PyFlink job

For the full list of options and additional usage details, see Python libraries.

Use a JAR package

If your PyFlink job uses Java classes—such as a connector or a user-defined function written in Java—specify the JAR package that contains those classes. For details, see JAR dependencies.

Use a data file

If your PyFlink job needs to read a static file at runtime, such as a model file, package it as an archive file and reference it in the job. For details, see Archives.