
Zeppelin Notebook: An Important Tool for PyFlink Development Environment

This article introduces a PyFlink development environment tool that can help users solve various problems.

PyFlink is the Python entry point of Flink. The Python language itself is simple and easy to learn, but the PyFlink development environment is not easy to set up. If you are not careful, the PyFlink environment can get messed up, and it is difficult to find the cause. This article introduces a PyFlink development environment tool that can help users solve these problems: Zeppelin Notebook. The main contents are listed below:

  1. Preparations
  2. Build the PyFlink Environment
  3. Summary and Future Outlook


You may have heard of Zeppelin for a long time, but previous articles mainly focused on how to develop Flink SQL in Zeppelin. Today, we will introduce how to develop PyFlink jobs in Zeppelin efficiently and solve the environment problems of PyFlink.

To summarize the theme of this article: we use Conda in a Zeppelin notebook to create Python environments and deploy them to a Yarn cluster automatically. You do not need to install any PyFlink packages on the cluster manually, and multiple versions of PyFlink can run isolated from each other in the same Yarn cluster at the same time. Eventually, you will see:

1.  A third-party Python library, such as matplotlib, can be used on the PyFlink client.

2.  Users can use third-party Python libraries in PyFlink UDFs, such as pandas.

Next, let's learn how to implement it:

1. Preparations

Step 1

Build the latest version of Zeppelin. We will not describe the details here. If you have any questions, you can join the Flink on Zeppelin DingTalk group (34517043) for consultation.

Note: Zeppelin must be deployed on a Linux machine. If you use macOS, the Conda environment created on macOS cannot be used in the Yarn cluster because Conda packages are not compatible across operating systems.

Step 2

Download Flink 1.13.

Note: The features introduced in this article can only be used in Flink 1.13 or above. Then:

  • Copy the flink-python-*.jar package to Flink's lib folder
  • Copy the opt/python folder to Flink's lib folder

Step 3

Install the following software, which is used to create the Conda environments:

  • miniconda: https://docs.conda.io/en/latest/miniconda.html
  • mamba: https://github.com/mamba-org/mamba

2. Build the PyFlink Environment

Next, users can build and use PyFlink in Zeppelin.

Step 1 – Create a PyFlink Conda Environment on the JobManager

Since Zeppelin natively supports Shell, you can create the PyFlink environment from a Shell paragraph in Zeppelin. Note: The third-party Python packages here are the ones required on the PyFlink client (JobManager) side, such as matplotlib. Please ensure that at least the following packages are installed:

  • Python (Version 3.7 is used here.)
  • apache-flink (Version 1.13.1 is used here.)
  • jupyter, grpcio, and protobuf (These three packages are required by Zeppelin.)

The remaining packages can be specified as needed:

%sh

# make sure you have conda and mamba installed.
# install miniconda: https://docs.conda.io/en/latest/miniconda.html
# install mamba: https://github.com/mamba-org/mamba

echo "name: pyflink_env
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.7
  - pip
  - pip:
    - apache-flink==1.13.1
  - jupyter
  - grpcio
  - protobuf
  - matplotlib
  - pandasql
  - pandas
  - scipy
  - seaborn
  - plotnine
 " > pyflink_env.yml
    
mamba env remove -n pyflink_env
mamba env create -f pyflink_env.yml

Run the following code to package the Conda environment of PyFlink and upload it to HDFS. Note: The file format packaged here is tar.gz:

%sh

rm -rf pyflink_env.tar.gz
conda pack --ignore-missing-files -n pyflink_env -o pyflink_env.tar.gz

hadoop fs -rmr /tmp/pyflink_env.tar.gz
hadoop fs -put pyflink_env.tar.gz /tmp
# The Python Conda archive must be publicly accessible, so change its permission here.
hadoop fs -chmod 644 /tmp/pyflink_env.tar.gz

Step 2 – Set up the PyFlink Conda Environment on the TaskManager

Run the following code to create the PyFlink Conda environment for the TaskManager. The TaskManager-side PyFlink environment must contain at least the following two packages:

  • A specific version of Python (Version 3.7 is used here.)
  • apache-flink (Version 1.13.1 is used here.)

The remaining packages are those that your Python UDFs depend on. For example, the pandas package is specified here:

echo "name: pyflink_tm_env
channels:
  - conda-forge
  - defaults
dependencies:
  - Python=3.7
  - pip
  - pip:
    - apache-flink==1.13.1
  - pandas
 " > pyflink_tm_env.yml
    
mamba env remove -n pyflink_tm_env
mamba env create -f pyflink_tm_env.yml

Run the following code to package the Conda environment of PyFlink and upload it to HDFS. Note: Zip format is used here.

%sh

rm -rf pyflink_tm_env.zip
conda pack --ignore-missing-files --zip-symlinks -n pyflink_tm_env -o pyflink_tm_env.zip

hadoop fs -rmr /tmp/pyflink_tm_env.zip
hadoop fs -put pyflink_tm_env.zip /tmp
# The Python Conda archive must be publicly accessible, so change its permission here.
hadoop fs -chmod 644 /tmp/pyflink_tm_env.zip

Step 3 – Use the Conda Environment in PyFlink

Now, users can use the Conda environment created above in Zeppelin. First, users need to configure Flink in Zeppelin. The main configuration options are:

  • Set flink.execution.mode to yarn-application; the approach described in this article only works in yarn-application mode.
  • Specify yarn.ship-archives, zeppelin.pyflink.python, and zeppelin.interpreter.conda.env.name to configure the PyFlink Conda environment on the JobManager side.
  • Specify python.archives and python.executable to configure the PyFlink Conda environment on the TaskManager side.
  • Specify other optional Flink configurations as needed, such as flink.jm.memory and flink.tm.memory.

%flink.conf


flink.execution.mode yarn-application

yarn.ship-archives /mnt/disk1/jzhang/zeppelin/pyflink_env.tar.gz
zeppelin.pyflink.python pyflink_env.tar.gz/bin/python
zeppelin.interpreter.conda.env.name pyflink_env.tar.gz

python.archives hdfs:///tmp/pyflink_tm_env.zip
python.executable pyflink_tm_env.zip/bin/python3.7

flink.jm.memory 2048
flink.tm.memory 2048

Next, you can use PyFlink with the Conda environments specified above in Zeppelin, as mentioned at the beginning. There are two scenarios:

1.  The following example uses the JobManager-side Conda environment created above on the PyFlink client (JobManager side), for example, to plot with matplotlib:

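Below is a minimal sketch of what such a paragraph might look like. It is an illustration rather than the original screenshot: it assumes the %flink.ipyflink interpreter and simply plots an arbitrary sine curve with matplotlib from the JobManager-side Conda environment.

%flink.ipyflink

# Minimal illustration: matplotlib runs on the PyFlink client (JobManager side),
# inside the pyflink_env Conda environment shipped via yarn.ship-archives.
import matplotlib.pyplot as plt
import numpy as np

# Any client-side plotting works the same way; the data here is arbitrary.
x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x))
plt.title("matplotlib inside the JobManager-side Conda env")
plt.show()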

2.  The following example uses a library from the TaskManager-side Conda environment created above inside a PyFlink UDF, for example, pandas:

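Below is a minimal sketch of such a UDF. It is an illustration rather than the original screenshot: it assumes the %flink.ipyflink interpreter, the bt_env batch TableEnvironment that Zeppelin's Flink interpreter exposes to PyFlink paragraphs, and an illustrative plus_one function.

%flink.ipyflink

from pyflink.table import DataTypes
from pyflink.table.expressions import col
from pyflink.table.udf import udf

# Minimal illustration of a Python UDF that depends on pandas. pandas is imported
# inside the function body, so it is resolved on the TaskManager, i.e. inside
# pyflink_tm_env.zip shipped via python.archives.
@udf(result_type=DataTypes.BIGINT())
def plus_one(v):
    import pandas as pd
    return int(pd.Series([v]).add(1).iloc[0])

# bt_env is assumed to be the batch TableEnvironment provided by Zeppelin.
t = bt_env.from_elements([(1,), (2,), (3,)], ['v'])
t.select(plus_one(col('v'))).execute().print()

When this runs, the UDF executes on the TaskManagers inside pyflink_tm_env, so any package from that environment (pandas here) is available to the UDF.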

3. Summary and Future Outlook

This article showed how to use Conda in a Zeppelin notebook to create Python environments and deploy them to a Yarn cluster automatically. You do not need to install any PyFlink packages on the cluster manually, and multiple versions of PyFlink can be used in the same Yarn cluster at the same time.

Each PyFlink environment is isolated, and the Conda environment can be customized and changed at any time. You can download the following note and import it into Zeppelin to review the content introduced today: http://23.254.161.240/#/notebook/2G8N1WTTS

In addition, there are many areas for improvement:

  • Currently, we need to create two Conda environments because Zeppelin supports the tar.gz format while Flink only supports the zip format. Once the two formats are unified, a single Conda environment will be enough.
  • The apache-flink package currently includes Flink's jar files, which makes the Conda environment very large and slows down Yarn container initialization. This requires the Flink community to provide a lightweight Python package (excluding the Flink jars), which would reduce the size of the Conda environment significantly.