Apache Flink ML 2.2.0 Release Announcement

This short article highlights the release of Apache Flink ML 2.2.0.

By Dong Lin

The Apache Flink community is excited to announce the release of Flink ML 2.2.0! This release focuses on enriching Flink ML's feature engineering algorithms. The library now includes 33 feature engineering algorithms, making it a more comprehensive library for feature engineering tasks.

With the addition of these algorithms, we believe the Flink ML library is ready for use in production jobs that require feature engineering capabilities, whose output can then be consumed by both offline and online machine learning tasks.

We encourage you to download the release and share your feedback with the community through the Flink mailing lists or JIRA! We hope you like the new release. We are eager to learn about your experience with it!

Notable Features

Introduced API and Infrastructure for Online Serving

In machine learning, one of the main goals of model training is to deploy the trained model for online inference, where the model server must respond to incoming requests with millisecond-level latency. However, prior releases of Flink ML only supported nearline inference using the Flink runtime, which may not meet the latency requirements of online inference use cases.

With FLIP-289, Flink ML now provides an API and infrastructure for users to load a ModelServable from model data generated by an Estimator. This ModelServable can be replicated across multiple model servers to process online inference requests in parallel. Because the ModelServable is effectively a UDF that does not rely on the Flink runtime, it can be integrated into other serving or processing frameworks to serve models trained by Flink ML.

As a first step, the LogisticRegressionModelServable has been added to serve the logistic regression model online. More servables will be added in the future. This new feature enables Flink ML to be used for both offline and online machine learning tasks, making it more versatile for a wider range of use cases.
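To make the idea of a runtime-independent servable more concrete, here is a minimal Python sketch. It is not the Flink ML API: the class name `LogisticRegressionServableSketch`, the JSON model-data layout, and the `load` and `predict` method names are all hypothetical, chosen only to illustrate loading model data once and then serving predictions as a plain function call with no streaming runtime involved.

```python
import io
import json
import math


def _sigmoid(z):
    # Numerically stable logistic function (avoids overflow for large |z|).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class LogisticRegressionServableSketch:
    """Hypothetical stand-in for a servable; not the Flink ML class."""

    def __init__(self, coefficient):
        self.coefficient = list(coefficient)

    @classmethod
    def load(cls, stream):
        # Assumed model-data format: a JSON object with a "coefficient" list.
        return cls(json.load(stream)["coefficient"])

    def predict(self, features):
        if len(features) != len(self.coefficient):
            raise ValueError("feature length does not match coefficient length")
        dot = sum(w * x for w, x in zip(self.coefficient, features))
        probability = _sigmoid(dot)
        label = 1 if probability >= 0.5 else 0
        return label, probability


# Model data would normally come from a trained Estimator; here it is inlined.
model_data = io.StringIO('{"coefficient": [0.5, -1.0, 2.0]}')
servable = LogisticRegressionServableSketch.load(model_data)
label, probability = servable.predict([1.0, 2.0, 1.0])
```

Because the sketch holds only the coefficients and a pure `predict` function, several copies could be loaded from the same model data on different servers, which is the replication property described above.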

Added 27 Feature Engineering Algorithms

Flink ML 2.2.0 significantly expanded the coverage of feature engineering algorithms, increasing the number from 6 to 33. Now, Flink ML covers 28 of the 33 feature engineering algorithms provided in Spark ML, making it a more comprehensive library for feature engineering tasks.

Feature engineering is a critical step in modern AI infrastructure, as it preprocesses data for traditional machine learning algorithms (such as GBT) as well as for deep learning models and large language models (such as Transformers), which are increasingly popular. With the addition of these algorithms, we hope Flink ML will be more useful to Flink users in their machine learning tasks.

All feature engineering algorithms can be accessed through the drop-down list on the left side of the Flink ML documentation page. We have provided Python and Java examples for each algorithm to demonstrate how to use it.

Added Two Production-Validated Online Learning Algorithms

Flink ML has a significant advantage over other machine learning libraries: it can perform online learning using Flink's streaming runtime. To leverage this strength, we implemented two online algorithms in Flink ML and successfully used them in a production machine learning job at Alibaba.

This job dynamically clusters similar logs and detects errors in them to help site reliability engineers. By using OnlineStandardScaler and AgglomerativeClustering to standardize and cluster logs in real time, the job can update models more frequently with a much simpler infrastructure setup. We presented this work at Flink Forward Asia in 2022. It will be integrated into the open-source project SREWorks soon.

With these online algorithms, Flink ML allows users to continuously update models using new data in real time, resulting in more accurate and up-to-date predictions. This can be particularly useful in use cases where data is constantly streaming in and it's important to make quick decisions based on the latest available information.
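As a rough illustration of what "continuously updating a model" can mean for a scaler, the following Python sketch keeps a running mean and standard deviation using Welford's incremental update. It is not Flink ML's OnlineStandardScaler implementation: the class `RunningStandardScaler`, its methods, and the sample values are hypothetical and only show the general technique of updating statistics one record at a time.

```python
import math


class RunningStandardScaler:
    """Hypothetical sketch; not Flink ML's OnlineStandardScaler."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        # Welford's update: adjust mean and squared deviations in O(1).
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def std(self):
        # Sample standard deviation; 0.0 until at least two values are seen.
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))

    def transform(self, value):
        # Standardize with the statistics seen so far.
        std = self.std
        return (value - self.mean) / std if std > 0 else 0.0


scaler = RunningStandardScaler()
# Made-up stream of numeric values, e.g. a feature extracted from logs.
for observed in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    scaler.update(observed)
scaled = scaler.transform(9.0)
```

Each incoming record changes the statistics immediately, so later calls to `transform` reflect the latest data without retraining over the full history.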

Upgrade Notes

This release is fully backward compatible with Flink ML 2.1. Users should be able to upgrade to Flink ML 2.2.0 without worrying about any incompatibilities or breaking changes.

Release Notes and Resources

Please take a look at the release notes for a detailed list of changes and new features.

The source artifacts are now available on the updated Downloads page of the Flink website. The most recent distribution of the Flink ML Python package is available on PyPI.

List of Contributors

The Apache Flink community would like to thank each of the contributors below who made this release possible:

Zhipeng Zhang, Dong Lin, Fan Hong, JiangXin, Zsombor Chikan, huangxingbo, taosiyuan163, vacaly, weibozhao, and yunfengzhou-hub
