By Dong Lin
The Apache Flink community is excited to announce the release of Flink ML 2.2.0! This release focuses on enriching Flink ML's feature engineering algorithms. The library now includes 33 feature engineering algorithms, making it a more comprehensive library for feature engineering tasks.
With the addition of these algorithms, we believe the Flink ML library is ready for use in production jobs that require feature engineering capabilities, whose output can then be consumed by both offline and online machine learning tasks.
We encourage you to download the release and share your feedback with the community through the Flink mailing lists or JIRA! We hope you like the new release. We are eager to learn about your experience with it!
In machine learning, one of the main goals of model training is to deploy the trained model for online inference, where the model server must respond to incoming requests with millisecond-level latency. However, prior releases of Flink ML only supported nearline inference using the Flink runtime, which may not meet the requirements of online inference use cases.
With FLIP-289, Flink ML now provides an API and infrastructure for users to load a ModelServable from model data generated by an Estimator. This ModelServable can be replicated across multiple model servers to process online inference requests in parallel. Because the ModelServable does not rely on the Flink runtime, it can be integrated as a UDF into other serving or processing frameworks to serve the model trained by Flink ML.
As a first step, the LogisticRegressionModelServable has been added to serve the logistic regression model online. More servables will be added in the future. This new feature enables Flink ML to be used for both offline and online machine learning tasks, making it more versatile for a wider range of use cases.
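To make the idea of a runtime-independent servable concrete, here is a minimal sketch in plain Python. It is not the Flink ML API: the class `LogisticRegressionServableSketch`, its `load` method, and the JSON model-data layout are all hypothetical and chosen only for illustration. The point it shows is that once the model data has been exported, scoring needs nothing but the coefficients, so the same object can be replicated across any number of model servers.

```python
import json
import math
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class LogisticRegressionServableSketch:
    """Hypothetical stand-in for a servable: no dependency on a Flink job."""

    coefficients: Tuple[float, ...]

    @classmethod
    def load(cls, model_data: bytes) -> "LogisticRegressionServableSketch":
        # Assumed layout: UTF-8 JSON with a "coefficients" array.
        # Flink ML's real model data encoding is different.
        payload = json.loads(model_data.decode("utf-8"))
        return cls(tuple(float(c) for c in payload["coefficients"]))

    def predict_proba(self, features: Sequence[float]) -> float:
        if len(features) != len(self.coefficients):
            raise ValueError("feature vector length does not match the model")
        dot = sum(w * x for w, x in zip(self.coefficients, features))
        # Numerically stable sigmoid: never exponentiate a large positive value.
        if dot >= 0:
            return 1.0 / (1.0 + math.exp(-dot))
        e = math.exp(dot)
        return e / (1.0 + e)

    def predict(self, features: Sequence[float]) -> int:
        return 1 if self.predict_proba(features) >= 0.5 else 0


# Model data as it might be exported after training (made-up values).
model_bytes = json.dumps({"coefficients": [0.5, -0.25]}).encode("utf-8")

# Each "model server" loads its own replica from the same bytes.
replicas = [LogisticRegressionServableSketch.load(model_bytes) for _ in range(2)]
```

Any replica returns the same prediction for the same request, which is what allows requests to be spread across servers without coordination.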
Flink ML 2.2.0 significantly expanded the coverage of feature engineering algorithms, increasing the number from 6 to 33. Now, Flink ML covers 28 of the 33 feature engineering algorithms provided in Spark ML, making it a more comprehensive library for feature engineering tasks.
Feature engineering is a critical step in modern AI infrastructures, as it preprocesses data both for traditional machine learning algorithms (like GBT) and for the increasingly popular deep learning algorithms and large language models (like Transformer). With the addition of these algorithms, we hope Flink ML can be more useful in machine learning tasks for Flink users.
All feature engineering algorithms can be accessed through the drop-down list on the left side of the Flink ML documentation page. We have provided Python and Java examples for each algorithm to demonstrate how to use it.
A significant advantage of Flink ML over other machine learning libraries is its ability to perform online learning using Flink's streaming runtime. To leverage this strength, we implemented two online algorithms in Flink ML and successfully used them in a production machine learning job at Alibaba.
This job dynamically clusters similar logs and detects errors in the logs to help site reliability engineers. By using OnlineStandardScaler and AgglomerativeClustering to standardize and cluster logs in real time, the job can update models more frequently with a much simpler infrastructure setup. We presented this work at Flink Forward Asia in 2022. It will be integrated into the open-source project SREWorks soon.
With these online algorithms, Flink ML allows users to continuously update models using new data in real time, resulting in more accurate and up-to-date predictions. This is particularly useful in use cases where data is constantly streaming in and it is important to make quick decisions based on the latest available information.
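As an illustration of what "continuously updating a model" means for standardization, the following sketch keeps a running mean and standard deviation using Welford's algorithm. It is not Flink ML's OnlineStandardScaler implementation; the class `OnlineScalerSketch` and its methods are hypothetical, and the use of the sample (n - 1) variance is an assumption made for this example.

```python
import math


class OnlineScalerSketch:
    """Hypothetical single-feature scaler updated one record at a time."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Welford's update: one pass, no need to store past records.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        # Sample standard deviation; defined as 0.0 until two records arrive.
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))

    def transform(self, x: float) -> float:
        # Standardize with the statistics seen so far.
        s = self.std
        return 0.0 if s == 0.0 else (x - self.mean) / s


scaler = OnlineScalerSketch()
for value in [1.0, 2.0, 3.0]:  # records arriving from a stream
    scaler.update(value)

scaled = scaler.transform(3.0)
```

Each new record adjusts the statistics in constant time, so the transformation applied to the next record always reflects the latest data.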
This release is fully backward compatible with Flink ML 2.1. Users should be able to upgrade to Flink ML 2.2.0 without worrying about any incompatibilities or breaking changes.
Please take a look at the release notes for a detailed list of changes and new features.
The source artifacts are now available on the updated Downloads page of the Flink website. The most recent distribution of the Flink ML Python package is available on PyPI.
The Apache Flink community would like to thank each of the contributors below who have made this release possible:
Zhipeng Zhang, Dong Lin, Fan Hong, JiangXin, Zsombor Chikan, huangxingbo, taosiyuan163, vacaly, weibozhao, and yunfengzhou-hub