Realtime Compute for Apache Flink: Use Python dependencies

Last Updated: Aug 26, 2024

You can use custom Python virtual environments, third-party Python packages, JAR packages, and data files in Flink Python deployments. This topic describes how to use these dependencies in Python deployments.

Background information

You can use Python dependencies based on the instructions in the following sections:

Use a custom Python virtual environment

Note

In Ververica Runtime (VVR) 4.X, you can use only virtual environments of Python 3.7. In VVR 6.X or later, you can use virtual environments of later Python versions.

Each Python virtual environment provides a complete Python runtime environment in which you can install the Python dependency packages that your deployment requires. The following steps describe how to prepare a Python virtual environment and use it in a Python deployment.

  1. Prepare a Python virtual environment.

    1. Prepare the setup-pyflink-virtual-env.sh script on your on-premises machine. The following code shows the content of the script.

      set -e
      # Download the miniconda.sh script of Python 3.10.
      wget "https://repo.continuum.io/miniconda/Miniconda3-py310_24.7.1-0-Linux-x86_64.sh" -O "miniconda.sh"

      # Add the execution permissions to the miniconda.sh script of Python 3.10.
      chmod +x miniconda.sh

      # Create a Python virtual environment.
      ./miniconda.sh -b -p venv

      # Activate the Conda Python virtual environment.
      source venv/bin/activate ""

      # Install the PyFlink dependency.
      # Update the PyFlink version if needed.
      pip install "apache-flink==1.17.2"

      # Deactivate the Conda Python virtual environment.
      conda deactivate

      # Delete the cached packages.
      rm -rf venv/pkgs

      # Package the prepared Conda Python virtual environment.
      zip -r venv.zip venv

      Note

      In this topic, the deployment uses VVR 8.X and runs in a virtual environment of Python 3.10. If you want to use a different VVR version or install a virtual environment of another Python version, you must modify the following items in the code based on your business requirements:

      • URL for downloading the miniconda.sh script: Change the URL to the URL for downloading the miniconda.sh script of your desired version.

      • apache-flink: Change the value of this parameter to the Flink version that corresponds to the VVR version of your deployment. For more information about how to view the Flink version, see How do I query the engine version of Flink that is used by a deployment?

    2. Prepare the build.sh script on your on-premises machine. The following code shows the content of the script.

      #!/bin/bash
      set -e -x

      sed -i 's/mirrorlist/#mirrorlist/g' /etc/yum.repos.d/CentOS-*
      sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' /etc/yum.repos.d/CentOS-*

      yum install -y zip wget

      cd /root/
      bash /build/setup-pyflink-virtual-env.sh
      mv venv.zip /build/

    3. In the command-line interface (CLI), run the following command to build the Python virtual environment:

      docker run -it --rm -v $PWD:/build  -w /build quay.io/pypa/manylinux2014_x86_64 ./build.sh

      After you run the command, a file named venv.zip is generated. In this example, the virtual environment of Python 3.10 is used.

      You can also modify the preceding script to install the required third-party Python package in the virtual environment.

  2. Use the Python virtual environment in Python deployments.

    1. Log on to the Realtime Compute for Apache Flink console.

    2. On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.

    3. In the left-side navigation pane, click Artifacts. On the Artifacts page, click Upload Artifact. In the dialog box that appears, select the venv.zip package.

    4. On the Deployments page, click the name of the desired deployment.

    5. On the Configuration tab, click Edit in the upper-right corner of the Basic section and select the venv.zip package from the Python Archives drop-down list.

      If the deployment is an SQL deployment that needs to use Python user-defined functions (UDFs), click Edit in the upper-right corner of the Parameters section and add the following configuration to the Other Configuration field:

      python.archives: oss://.../venv.zip
    6. In the Parameters section, add the configuration information about the path for installing the specified Python virtual environment based on the VVR version of your deployment to the Other Configuration field.

      • VVR 6.X or later

        python.executable: venv.zip/venv/bin/python
        python.client.executable: venv.zip/venv/bin/python
      • An engine version earlier than VVR 6.X

        python.executable: venv.zip/venv/bin/python
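
The settings in the last step depend on the layout of the archive and on the VVR version of the deployment. The following is a minimal sketch, not part of the product, that checks whether an archive contains the interpreter at venv/bin/python and builds the matching entries for the Other Configuration field. The helper names `has_interpreter` and `python_settings`, and passing the VVR major version as an integer, are assumptions made for illustration.

```python
import zipfile

# Path of the interpreter inside venv.zip, as used in the settings above.
INTERPRETER = "venv/bin/python"


def has_interpreter(archive_path, member=INTERPRETER):
    """Return True if the ZIP archive contains the given interpreter entry."""
    with zipfile.ZipFile(archive_path) as archive:
        return member in archive.namelist()


def python_settings(archive_name, vvr_major):
    """Build the configuration entries for a VVR major version (hypothetical helper)."""
    executable = f"{archive_name}/{INTERPRETER}"
    settings = {"python.executable": executable}
    # VVR 6.X or later also needs the client-side interpreter setting.
    if vvr_major >= 6:
        settings["python.client.executable"] = executable
    return settings


if __name__ == "__main__":
    for key, value in python_settings("venv.zip", 8).items():
        print(f"{key}: {value}")
```

Running `has_interpreter("venv.zip")` on your on-premises machine before you upload the package can catch an archive that was zipped from the wrong directory level.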

Use a third-party Python package

Note

Zip Safe, PyPI, and manylinux in the following description are hosted on third-party websites. These websites may be slow to respond or temporarily inaccessible.

The following two scenarios show you how to use a third-party Python package:

  • Use a third-party Python package that can be directly imported

    If your third-party Python package is a Zip Safe package, you can perform the following steps to directly use the package in Python deployments without installation:

    1. Download a third-party Python package that can be directly imported.

      1. Visit PyPI on your web browser.

      2. Enter the name of a third-party Python package, such as apache-flink 1.12.2, in the search box.

      3. In the search results, click the name of the package that you want to use.

      4. In the left-side navigation pane of the page that appears, click Download files.

      5. Click the file whose name contains cp37-cp37m-manylinux1 to download the package.

    2. Log on to the Realtime Compute for Apache Flink console.

    3. On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.

    4. In the left-side navigation pane, click Artifacts. On the Artifacts page, click Upload Artifact. In the dialog box that appears, select the required third-party Python package.

    5. In the left-side navigation pane, click Deployments. On the Deployments page, click Create Deployment. In the Create Deployment dialog box, select PYTHON for Deployment Type and select the third-party Python package that you uploaded for Python Libraries.

    6. Click Save.

  • Use a third-party Python package that requires compilation

    A third-party Python package must be compiled before it can be used if both of the following conditions are met: The package is a compressed package in the tar.gz format or a source package that you downloaded from another location, and a setup.py file exists in the root directory of the package. Compile the package in an environment that is compatible with Flink before you call it in a Python deployment.

    We recommend that you use Python 3.7 in the quay.io/pypa/manylinux2014_x86_64 image to compile third-party Python packages. The packages generated by the image are compatible with most Linux operating systems. For more information about the image, see manylinux.

    Note

    In the image, the Python 3.7 interpreter is located at /opt/python/cp37-cp37m/bin/python3.

    The following example shows how to compile and use the third-party Python package opencv-python-headless.

    1. Compile a third-party Python package.

      1. Prepare the requirements.txt file on your on-premises machine. The following code shows the content of the file:

        opencv-python-headless
      2. Prepare the build.sh script on your on-premises machine. The following code shows the content of the script:

        #!/bin/bash
        set -e -x
        
        sed -i 's/mirrorlist/#mirrorlist/g' /etc/yum.repos.d/CentOS-*
        sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' /etc/yum.repos.d/CentOS-*
        
        yum install -y zip
        
        PYBIN=/opt/python/cp37-cp37m/bin
        #PYBIN=/opt/python/cp38-cp38/bin
        #PYBIN=/opt/python/cp39-cp39/bin
        #PYBIN=/opt/python/cp310-cp310/bin
        
        "${PYBIN}/pip" install --target __pypackages__ -r requirements.txt
        cd __pypackages__ && zip -r deps.zip . && mv deps.zip ../ && cd ..
        rm -rf __pypackages__
      3. In the CLI, run the following command:

        docker run -it --rm -v $PWD:/build  -w /build quay.io/pypa/manylinux2014_x86_64 /bin/bash build.sh

        After you run the command, a file named deps.zip is generated. This file is the compiled third-party Python package.

        You can also modify the content of the requirements.txt file to install other required third-party Python packages. In addition, multiple Python dependencies can be specified in the requirements.txt file.

    2. Use the third-party Python package deps.zip in Python deployments.

      1. Log on to the Realtime Compute for Apache Flink console.

      2. On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.

      3. In the left-side navigation pane, click Artifacts. On the Artifacts page, click Upload Artifact. In the dialog box that appears, select deps.zip.

      4. On the Deployments page, click the name of the desired deployment. On the Configuration tab, click Edit in the upper-right corner of the Basic section and select the deps.zip package from the Python Libraries drop-down list.

    3. Click Save.
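
The condition that decides whether a package requires compilation can be checked with a short script before you choose a scenario. The following is a minimal sketch that reports whether a tar.gz archive contains a setup.py file in its root directory. The function name `has_root_setup_py` is hypothetical, and treating a single wrapping directory such as pkg-1.0/ as the root is an assumption about how source archives are usually laid out.

```python
import tarfile


def has_root_setup_py(sdist_path):
    """Return True if a tar.gz archive has setup.py in its root directory."""
    with tarfile.open(sdist_path, "r:gz") as archive:
        for name in archive.getnames():
            # Drop empty and "." components so "./pkg/setup.py" is handled.
            parts = [p for p in name.split("/") if p not in ("", ".")]
            # Assumption: accept setup.py at the archive root or directly
            # inside one wrapping directory, for example pkg-1.0/setup.py.
            if parts and parts[-1] == "setup.py" and len(parts) <= 2:
                return True
    return False
```

If the function returns True for a downloaded archive, the package falls into the scenario that requires compilation.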

Use a JAR package

If you use Java classes, such as a connector or a Java user-defined function (UDF), in your Flink Python deployment, you can perform the following operations to specify the JAR package of the connector or Java UDF.

  1. Log on to the Realtime Compute for Apache Flink console.

  2. On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.

  3. In the left-side navigation pane, click Artifacts. On the Artifacts page, click Upload Artifact. In the dialog box that appears, select the JAR package that you want to use.

  4. On the Deployments page, click the name of the desired deployment. On the Configuration tab, click Edit in the upper-right corner of the Basic section and select the required JAR package from the Additional Dependencies drop-down list.

  5. On the Deployments page, click the name of the desired deployment. On the Configuration tab, click Edit in the upper-right corner of the Parameters section and add the following configuration to the Other Configuration field.

    For example, if the deployment depends on two JAR packages that are named jar1.jar and jar2.jar, add the following configuration information:

    pipeline.classpaths: 'file:///flink/usrlib/jar1.jar;file:///flink/usrlib/jar2.jar'
  6. Click Save.
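
When a deployment depends on many JAR packages, the value of pipeline.classpaths can be generated instead of typed by hand. The following is a minimal sketch that follows the pattern of the example above: each file name is prefixed with file:///flink/usrlib/ and the entries are joined with semicolons. The helper name `build_classpaths` is hypothetical.

```python
# Prefix taken from the configuration example above.
USRLIB_URL = "file:///flink/usrlib/"


def build_classpaths(jar_names):
    """Join JAR file names into a pipeline.classpaths value (hypothetical helper)."""
    return ";".join(USRLIB_URL + name for name in jar_names)


if __name__ == "__main__":
    value = build_classpaths(["jar1.jar", "jar2.jar"])
    # Prints the same line as the configuration example above.
    print(f"pipeline.classpaths: '{value}'")
```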

Use data files

Note

Fully managed Flink does not allow you to debug Python deployments by uploading data files.

The following scenarios show you how to use data files:

  • Select a package from the Python Archives drop-down list

    If you have a large number of data files, you can package the data files into a ZIP file and perform the following operations to use them in Python deployments:

    1. Log on to the Realtime Compute for Apache Flink console.

    2. On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.

    3. In the left-side navigation pane, click Artifacts. On the Artifacts page, click Upload Artifact. In the dialog box that appears, select the ZIP package of the desired data files.

    4. On the Deployments page, click the name of the desired deployment. On the Configuration tab, click Edit in the upper-right corner of the Basic section and select the required ZIP package from the Python Archives drop-down list.

    5. In Python UDFs, use the following code to access a data file. In this example, the name of the package that contains the data files is mydata.zip.

      def map():
          # Open the data file through the mydata.zip archive path.
          with open("mydata.zip/mydata/data.txt") as f:
              ...
  • Select a data file from the Additional Dependencies drop-down list

    If you have a small number of data files, you can perform the following operations to access these files in Python deployments:

    1. Log on to the Realtime Compute for Apache Flink console.

    2. On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.

    3. In the left-side navigation pane, click Artifacts. On the Artifacts page, click Upload Artifact. In the dialog box that appears, select the desired data file.

    4. On the Deployments page, click the name of the desired deployment. On the Configuration tab, click Edit in the upper-right corner of the Basic section and select the required data file from the Additional Dependencies drop-down list.

    5. In Python UDFs, use the following code to access a data file. In this example, the data file is named data.txt.

      def map():
          # Files selected for Additional Dependencies are read from /flink/usrlib.
          with open("/flink/usrlib/data.txt") as f:
              ...
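
A UDF is typically called once per record, so reopening the data file on every call can be wasteful. The following is a minimal sketch of one way to read a data file once and reuse its content. It assumes that the file is UTF-8 text with one entry per line. The helper names `load_lines` and `is_known` are hypothetical, and the default path is the one from the preceding example.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def load_lines(path):
    """Read a text data file once and return its non-empty lines as a tuple."""
    with open(path, encoding="utf-8") as f:
        return tuple(line.strip() for line in f if line.strip())


def is_known(value, path="/flink/usrlib/data.txt"):
    """Return True if value appears as a line in the data file (hypothetical helper)."""
    return value in load_lines(path)
```

For a file packaged in a ZIP archive, the same helper can be called with a path such as mydata.zip/mydata/data.txt.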

References