
Analysis and Application of New Features of Flink ML

This article covers an overview of Flink ML and discusses the design and application of online learning, online inference, and feature engineering algorithms.

This article is based on Weibo Zhao's presentation during the AI feature engineering session of Flink Forward Asia (FFA) 2023. The content is divided into the following four parts:

  1. Overview of Flink ML
  2. Design and application of online learning
  3. Design and application of online inference
  4. Feature engineering algorithm and application

1. Overview of Flink ML


Flink ML is a sub-project of Apache Flink that follows the Apache community specifications with the vision of becoming the standard for real-time traditional machine learning.

The Flink ML API was released in January 2022, followed by the launch of a complete, high-performance Flink ML infrastructure in July 2022. In April 2023, feature engineering algorithms were introduced, and support for multiple Flink versions was added in June 2023.

2. Design and Application of Online Learning

2.1 Online Machine Learning Workflow Sample


There are two models, A and B, which are trained by online learning and used for online inference. During inference, the models arrive in the form of a stream: the model stream continuously flows new models into the inference chain so that predictions are made with more up-to-date models. After inference, the predicted samples are recommended to end users, who then give feedback on the results. The feedback is spliced with the samples and fed back into the training data stream, forming a closed loop. That is the workflow sample.


Next, the design of online learning is introduced with this workflow sample. After splicing, the training data is cut into different windows. Each time a window passes through the Estimator, the model is updated, and the updated model flows to the inference chain below. As the data keeps flowing, models flow into the inference chain one by one. This is the model stream: by turning the model into a queue of versions, inference can be served with better real-time performance.

Problems:

  1. How do you split the data reasonably? Different businesses have different requirements: some want to split by time, others by size, so configurable strategies are needed.
  2. Since both the data and the models are flowing, and both flow to the same place, how do you decide which model a newly arrived sample should be inferred with?
  3. How do you ensure the consistency of the models? Since there are two models in the chain, problems occur if the two models are trained on inconsistent data.
  4. Which model was a given piece of data inferred with? Each sample is inferred by some model version, and the prediction needs to be traceable back to that model.

2.2 Design of Online Machine Learning

These four questions lead to four design requirements:

  1. You can divide the input data into multiple windows for training to generate a model stream.
  2. You can use the input model stream to predict data.
  3. You can specify the maximum time difference between the inference data and the model used for it. Ideally each sample would be inferred with the latest model, but the latest model may not have been trained yet, so a time difference is set to allow inference with a slightly older model.
  4. You can expose, in the output data, the model version used when predicting each piece of data, so that a prediction result can be traced back to the model that produced it.

To fulfill these requirements, our design plan includes:

1.  Add the HasWindows interface.

Users are allowed to declare different policies for partitioning data.

2.  Add the model version and timestamp for ModelData. The value of the model version starts from 0 and increases by 1 each time. The timestamp of the model data is the maximum timestamp of the data from which the model is trained.

3.  Add the HasMaxAllowedModelDelayMs interface.

Users are allowed to specify that, when predicting a piece of data D, the model data M used is older than D by no more than the configured threshold.

4.  Add the HasModelVersionCol interface. During inference, users are allowed to output the model version used when predicting each piece of data.
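To make these interfaces concrete, here is a minimal sketch of how they are typically used on an online algorithm, taking OnlineStandardScaler as the example. The window size, delay threshold, and column name are illustrative values, and trainTable/predictTable are assumed to be prepared Flink Tables with event-time attributes; verify the exact class and setter names against the Flink ML version you use.

```java
import org.apache.flink.api.common.time.Time;
import org.apache.flink.ml.common.window.EventTimeTumblingWindows;
import org.apache.flink.ml.feature.standardscaler.OnlineStandardScaler;
import org.apache.flink.ml.feature.standardscaler.OnlineStandardScalerModel;
import org.apache.flink.table.api.Table;

public class OnlineLearningSketch {
    public static Table trainAndPredict(Table trainTable, Table predictTable) {
        // HasWindows: cut the training stream into 1-hour event-time windows;
        // each window produces one new model version in the model stream.
        OnlineStandardScaler estimator = new OnlineStandardScaler()
                .setWindows(EventTimeTumblingWindows.of(Time.hours(1)));
        OnlineStandardScalerModel model = estimator.fit(trainTable);

        // HasMaxAllowedModelDelayMs: the model used for a sample may be at most
        // 10 minutes older than the sample.
        // HasModelVersionCol: expose the model version used for each prediction.
        model.setMaxAllowedModelDelayMs(10 * 60 * 1000L)
             .setModelVersionCol("model_version");
        return model.transform(predictTable)[0];
    }
}
```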


Let's come back and look at the problems now that we have the plan:

  1. Way to split the window: with the window policies provided, users can split the data according to the needs of their own business scenarios.
  2. Selection of a model to infer the current data: the threshold parameter controls how much older than the current data the model used for inference may be. In theory, the latest model could always be used, but that may cause problems such as waiting.
  3. Consistency of the model: each sample takes a model version when it is predicted. The version number obtained when the sample is predicted by the first model is reused when it is then inferred by the second model, so both models use the same version, and the final output carries that version number.

With this, the four problems listed above are solved.

2.3 Application of Online Learning in Alibaba Cloud Real-time Log Clustering


Alibaba Cloud ABM O&M Middle Platform collects logs from all Alibaba platforms, clusters the error logs, and sends the error logs to the corresponding department for subsequent processing.


In the traditional algorithm engineering chain, the input data is first processed by a Flink job and written to disk. The clustering algorithm is then launched by scheduled jobs, and the model it writes out is loaded by another Flink job for data prediction. The whole chain has limitations: a complex process, high O&M cost, poor real-time performance, and performance that cannot be guaranteed.


The log clustering algorithm pre-processes and encodes the system logs, then splits them into words. It extracts keywords through feature selection, performs feature representation and standardization of the logs, and then performs hierarchical clustering to classify the logs. Finally, the results are written to the database to guide subsequent word splitting.


For this process, we can use Flink ML to build a streaming log clustering chain. A single Flink job joins the full data between SLS and the database, cleanses and encodes the log data, performs word splitting and standardization, computes the clustering results, and finally selects typical representative logs in each cluster.


The operators in this case, such as SLS streaming reading, word splitting, log vectorization, feature selection, and feature standardization, can be extracted. These operators are not unique to this business but are required by many online learning scenarios. If they are extracted into independent components, customers can reuse them whenever they build online learning processes, as sketched below.
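For instance, the word splitting, vectorization, and standardization steps can be chained as ordinary Flink ML operators. The sketch below assumes the logs have already been read from SLS into a Table with a log column; the column names and the choice of HashingTF for vectorization are illustrative, and feature selection and clustering are omitted.

```java
import org.apache.flink.ml.feature.hashingtf.HashingTF;
import org.apache.flink.ml.feature.standardscaler.StandardScaler;
import org.apache.flink.ml.feature.standardscaler.StandardScalerModel;
import org.apache.flink.ml.feature.tokenizer.Tokenizer;
import org.apache.flink.table.api.Table;

public class LogFeatureChain {
    public static Table buildFeatures(Table logs) {
        // Word splitting: "log" column -> array of tokens.
        Table tokenized = new Tokenizer()
                .setInputCol("log").setOutputCol("words")
                .transform(logs)[0];

        // Log vectorization: tokens -> sparse feature vector.
        Table vectorized = new HashingTF()
                .setInputCol("words").setOutputCol("rawFeatures")
                .transform(tokenized)[0];

        // Feature standardization: fit a scaler and apply it.
        StandardScalerModel scaler = new StandardScaler()
                .setInputCol("rawFeatures").setOutputCol("features")
                .fit(vectorized);
        return scaler.transform(vectorized)[0];
    }
}
```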


Benefits of log clustering algorithm chain upgrade:

• In terms of chain delay, the original delay of 5 minutes is reduced to 30 seconds.

• Operating costs are reduced, and now only one Flink job needs to be maintained.

• The analysis costs are reduced.

• The algorithm performance is improved.

3. Design and Application of Online Inference


Inference is mainly divided into the following categories:

  1. Batch inference: for example, 1 million pieces of data are written to disk, and then a batch task is started to run inference on those 1 million pieces of data, with the results written back to disk.
  2. Near-line inference: a Flink-based task reads Kafka data and performs inference on the streaming data by using Transformers. The major problem with this kind of inference is its relatively high latency, generally on the order of hundreds of milliseconds. Actual business scenarios often require very low latency, generally tens of milliseconds or even a few milliseconds, which requires us to build an inference framework to adapt to such demanding business scenarios.


Before doing this, we surveyed inference in Spark ML and found that Spark ML itself does not have an inference module. Instead, there is MLeap, which packages Spark inference into a standalone inference framework. This framework is independent of the engine runtime, which reduces dependency conflicts and makes it lighter. In addition, such a framework can rewrite the computing logic for inference, which leaves more room for optimization.

3.1 Design Requirements

The design requirements draw on MLeap's approach:

1.  Data representation (independent of Flink Runtime)

• Representation of a single piece of data: Row

• Batch data representation: DataFrame

• Data type representation: providing support for Vector and Matrix

2.  Inference logic representation

3.  Model loading

• It supports loading models from files written by Model/Transformer#save

• It supports dynamic loading of model data without the need to restart

4.  Utils

• It supports checking whether a Transformer/PipelineModel supports online inference

• It supports concatenating multiple pieces of inference logic into a single one


Under this design requirement, the inference data structure is a DataFrame, which contains column names, column types, and rows. The inference logic takes a DataFrame as input and produces a DataFrame of the same structure as output, so that the entire inference chain can be strung together without data structure conversion.
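As a rough illustration of that structure, the sketch below builds a small DataFrame by hand. It assumes the DataFrame, Row, DataType, and DataTypes classes from the flink-ml-servable module with the constructors shown; treat the exact signatures as assumptions to verify against the version you use.

```java
import java.util.Arrays;

import org.apache.flink.ml.servable.api.DataFrame;
import org.apache.flink.ml.servable.api.Row;
import org.apache.flink.ml.servable.types.DataType;
import org.apache.flink.ml.servable.types.DataTypes;

public class DataFrameSketch {
    public static DataFrame buildInput() {
        // A DataFrame is just column names + column types + rows,
        // with no dependency on the Flink runtime.
        return new DataFrame(
                Arrays.asList("id", "score"),
                Arrays.<DataType>asList(DataTypes.INT, DataTypes.DOUBLE),
                Arrays.asList(new Row(Arrays.<Object>asList(1, 0.5))));
    }
}
```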


On the model loading side, the model is written to disk through the save() function, which is what Flink ML does, while loadServable(), which is what the inference framework does, loads it back. Through these two functions, the model is saved, loaded, and used for inference.


Next, let's take logistic regression as an example to see the implementation. The save() function writes the model to the specified directory, and the load on the inference-framework side reads the model files back for inference.
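Here is a hedged sketch of that save/load flow. It assumes trainTable is a prepared training Table, that save() is materialized by executing the Flink job, and that the inference framework exposes a LogisticRegressionModelServable obtained through a static loadServable() on the model class; treat these exact class names and signatures as assumptions to check against your Flink ML version.

```java
import org.apache.flink.ml.classification.logisticregression.LogisticRegression;
import org.apache.flink.ml.classification.logisticregression.LogisticRegressionModel;
import org.apache.flink.ml.classification.logisticregression.LogisticRegressionModelServable;
import org.apache.flink.ml.servable.api.DataFrame;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;

public class LogisticRegressionServingSketch {

    /** Training side (Flink ML): fit a model and write it to a directory. */
    public static void trainAndSave(StreamExecutionEnvironment env, Table trainTable) throws Exception {
        LogisticRegressionModel model = new LogisticRegression()
                .setFeaturesCol("features")
                .setLabelCol("label")
                .fit(trainTable);
        model.save("/path/to/model");   // illustrative path
        env.execute();                  // save() takes effect when the Flink job runs
    }

    /** Serving side (inference framework): load the model files, no Flink runtime involved. */
    public static DataFrame predict(DataFrame input) throws Exception {
        // loadServable() is assumed here, following the description above.
        LogisticRegressionModelServable servable =
                LogisticRegressionModel.loadServable("/path/to/model");
        return servable.transform(input);
    }
}
```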


To update the model data dynamically, the model is written into Kafka, and Kafka is set as the model data source of the model Servable. Whenever a new model is written into Kafka, it naturally flows into the Servable, so dynamic model updates are achieved.


The input of setModelData() is an InputStream, which can be read from Kafka. When the data in Kafka is updated, the model in the Servable is updated accordingly.
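A minimal sketch of this dynamic update, under the assumption that the servable exposes setModelData(InputStream...) as described and that the bytes of each new model record are obtained from your own Kafka consumer (the consumer code itself is omitted):

```java
import java.io.ByteArrayInputStream;

import org.apache.flink.ml.classification.logisticregression.LogisticRegressionModelServable;

public class ModelDataUpdater {
    /** Called whenever a new model record is consumed from the Kafka topic. */
    public static void onNewModel(byte[] modelBytes, LogisticRegressionModelServable servable)
            throws Exception {
        // Feed the new model bytes into the servable; its internal model data
        // is replaced without restarting the serving process.
        servable.setModelData(new ByteArrayInputStream(modelBytes));
    }
}
```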


In addition, we also support PipelineModel inference. You can build a Servable from the model data of a PipelineModel and check whether the PipelineModel supports online inference, without having to execute a training job.

3.2 Usage Scenarios


Finally, let's look at the usage scenarios. This is a simplified process for ML model training, prediction, and deployment. The initial steps involve data ingestion and feature engineering, followed by evaluation and deployment. In this instance, PipelineModel is used to encapsulate the two models of standardization and GBT classification into a single Pipeline, enabling the execution of an online inference service.


The standardization and GBT models are packaged into a Pipeline, and inference on the Pipeline is implemented in the inference module. The inference side supports both writing the model out and loading it dynamically.
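A hedged sketch of this end-to-end usage follows. It assumes the StandardScaler and GBTClassifier estimators, a PipelineModel saved via save() plus env.execute(), and a static PipelineModel.loadServable() returning a PipelineModelServable as described above; column names, paths, and package locations are illustrative assumptions to verify.

```java
import java.util.Arrays;

import org.apache.flink.ml.api.Stage;
import org.apache.flink.ml.builder.Pipeline;
import org.apache.flink.ml.builder.PipelineModel;
import org.apache.flink.ml.classification.gbtclassifier.GBTClassifier;
import org.apache.flink.ml.feature.standardscaler.StandardScaler;
import org.apache.flink.ml.servable.api.DataFrame;
import org.apache.flink.ml.servable.builder.PipelineModelServable;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;

public class PipelineServingSketch {

    /** Training side: standardization + GBT classification packaged as one Pipeline. */
    public static void trainAndSave(StreamExecutionEnvironment env, Table trainTable) throws Exception {
        Pipeline pipeline = new Pipeline(Arrays.<Stage<?>>asList(
                new StandardScaler().setInputCol("rawFeatures").setOutputCol("features"),
                new GBTClassifier().setFeaturesCol("features").setLabelCol("label")));
        PipelineModel pipelineModel = pipeline.fit(trainTable);
        pipelineModel.save("/path/to/pipeline_model");
        env.execute();
    }

    /** Serving side: load the whole pipeline as a single servable for online inference. */
    public static DataFrame predict(DataFrame input) throws Exception {
        PipelineModelServable servable = PipelineModel.loadServable("/path/to/pipeline_model");
        return servable.transform(input);
    }
}
```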

4. Feature Engineering Algorithm and Application

4.1 Feature Engineering Algorithm


27 new algorithms have been added, bringing the total to 33 and covering the commonly used feature engineering algorithms.

4.2 Application of Feature Engineering


The first application scenario involves recommendation and advertising evaluation, both of which require feature processing. The second application scenario is implementing complex algorithms; for example, GBT needs to process both numerical features and categorical features. Additionally, Flink ML has incorporated some designs for large language models.


Next, let's use the large language model as an example of a feature engineering business. Higher-quality text input results in a better large language model, and approximate deduplication of texts can enhance text quality. In Internet data, the text duplication ratio typically falls between 20% and 60%, and the larger the text corpus, the higher the duplication ratio tends to be.


To solve this problem, we have designed an approximate deduplication process:

• Unlike exact deduplication, it does not require exact equality or a substring relationship between texts.

• It is based on locality-sensitive hashing (LSH): similar samples are more likely to be hashed into the same buckets.

• For text data, MinHashLSH is usually used to find similar texts based on the Jaccard distance after the texts are vectorized.


You can use the following components to complete the text deduplication process:

• Tokenizer: used to split words

• HashingTF: used to transform the text into binary features

• MinHash: used to calculate the text signature

• MinHashLSH: used to perform SimilarityJoin to find similar pairs
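Putting these components together, here is a hedged sketch of the deduplication chain. It assumes a MinHashLSH estimator whose fitted model provides approxSimilarityJoin() over the Jaccard distance (in this API the MinHash signature computation is part of MinHashLSH rather than a separate operator); the input Table is assumed to have id and text columns, and the threshold and column names are illustrative.

```java
import org.apache.flink.ml.feature.hashingtf.HashingTF;
import org.apache.flink.ml.feature.lsh.MinHashLSH;
import org.apache.flink.ml.feature.lsh.MinHashLSHModel;
import org.apache.flink.ml.feature.tokenizer.Tokenizer;
import org.apache.flink.table.api.Table;

public class TextDedupSketch {
    public static Table findNearDuplicates(Table texts) {
        // 1) Split words and turn each text into binary features.
        Table words = new Tokenizer()
                .setInputCol("text").setOutputCol("words")
                .transform(texts)[0];
        Table features = new HashingTF()
                .setInputCol("words").setOutputCol("features").setBinary(true)
                .transform(words)[0];

        // 2) Fit MinHashLSH so that similar texts tend to fall into the same buckets.
        MinHashLSHModel lsh = new MinHashLSH()
                .setInputCol("features").setOutputCol("hashes")
                .fit(features);

        // 3) Self-join to find pairs whose Jaccard distance is below the threshold;
        //    one element of each near-duplicate pair can then be dropped downstream.
        return lsh.approxSimilarityJoin(features, features, 0.1, "id");
    }
}
```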


Finally, the performance test involved a manually constructed benchmark dataset obtained directly through copy and delete operations. For a dataset containing 500 million records with a 50% duplication rate, the deduplication process took approximately 1.5 hours.
