Why Flink + AI is worth looking forward to

At Flink Forward Asia 2019 last November, the Flink community proposed several main directions for future development, one of which is to embrace AI [1]. AI has stayed hot in recent years, with new computing frameworks, models, and algorithms emerging one after another; from a certain perspective, the track is already a bit crowded. Given that, how should Flink embrace AI, and what new value can it bring to users? What are the strengths and weaknesses of Flink + AI? This article explores the development direction of Flink + AI by discussing these questions.

Lambda architecture, stream-batch unification, and real-time AI
The value of Flink for AI is closely related to two concepts from big data: the Lambda architecture [2] and stream-batch unification. The value that Flink brings to real-time big data will benefit AI as well.

Let us briefly review the development of big data. For a long period after Google published its groundbreaking "troika" of papers [3][4][5], batch computing was the only thing on the main line of big data development. Later, as people realized the importance of data timeliness, Twitter's open-source stream computing engine Storm [6] rose to prominence, and various stream computing engines appeared one after another, Flink among them. Out of considerations of cost, computing accuracy, and fault tolerance, companies adopted a solution called the Lambda architecture, which integrates batch computing and stream computing under one architecture so as to strike a balance among cost, fault tolerance, and data timeliness.

While the Lambda architecture addresses data timeliness, it has problems of its own, the most criticized being system complexity and maintainability: users must maintain one engine and codebase for the Batch Layer and another for the Speed Layer, and must also ensure that the computing logic of the two remains exactly consistent (Figure 1).

To solve this problem, the various computing engines began to unify stream and batch, trying to run both stream and batch jobs on the same engine (Figure 2). After several years of effort, Spark [7] and Flink have become the two mainstream engines in the first echelon. Flink moved gradually from stream computing into batch computing; a very typical success story is using the same standard SQL statement to query both streams and batches while guaranteeing consistent final results [8]. Spark moved from batch computing toward stream computing with the micro-batch model of Spark Streaming, but micro-batching has remained at a disadvantage in latency.
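To make this concrete, here is a minimal sketch of the idea, hedged: it assumes PyFlink 1.13+, and the table name, schema, and datagen source are illustrative rather than taken from the article. The identical SQL text is submitted once to a batch environment and once to a streaming environment.

```python
# A minimal sketch of stream-batch unified SQL with PyFlink (assumes 1.13+).
# The table name, schema, and datagen source below are illustrative only.
from pyflink.table import EnvironmentSettings, TableEnvironment

QUERY = "SELECT user_id, COUNT(*) AS clicks FROM clicks GROUP BY user_id"

def run(settings):
    t_env = TableEnvironment.create(settings)
    # In practice the batch job would read a bounded table (e.g. files) and
    # the stream job an unbounded one (e.g. Kafka) with the same schema;
    # a bounded datagen source stands in for both here.
    t_env.execute_sql("""
        CREATE TEMPORARY TABLE clicks (
            user_id STRING,
            url STRING
        ) WITH ('connector' = 'datagen', 'number-of-rows' = '100')
    """)
    # The same query text runs unchanged in both execution modes.
    t_env.execute_sql(QUERY).print()

run(EnvironmentSettings.in_batch_mode())
run(EnvironmentSettings.in_streaming_mode())
```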

As this history shows, the original driving force behind both the Lambda architecture and stream-batch unification is data timeliness, that is, extracting more value from data by acting on it sooner. AI's requirements for data timeliness are the same as big data's, so real-time AI will also be an important direction of development. Looking at today's mainstream AI scenarios and technical architectures, we find that they have many connections and similarities with big data platforms.

Current AI can be roughly divided into three main stages: data preprocessing (also called data preparation or feature engineering), model training, and inference and prediction. Let us look at the real-time requirements in each stage one by one, and at the problems that need to be solved. To draw the analogy with big data architecture, let us assume that stream computing and batch computing, as two computation types, exhaustively divide all data-based computing between them, so that each AI stage can be classified as one or the other depending on the scenario.

Data preprocessing (data preparation/feature engineering)

The data preprocessing stage is the upstream stage of model training and of inference and prediction, and in many cases it is more of a big data problem. Depending on what comes after it, data preprocessing may be batch computing or stream computing; its computation type follows the downstream. In the typical scenario of offline training (batch computing) plus online prediction (stream computing), training and prediction need the same preprocessing logic to generate their input data (for example, the same sample-joining logic). The requirement here is the same as in the Lambda architecture, so a stream-batch unified engine is particularly advantageous: it avoids using two different engines for batch jobs and stream jobs, and spares users the trouble of maintaining two codebases whose logic must stay consistent.
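As a small illustration of what "one set of preprocessing code" can look like in practice (a hedged sketch assuming PyFlink 1.13+; the UDF, table, and column names are hypothetical), the same feature function is registered in a batch environment for training and in a streaming environment for prediction:

```python
# A hedged sketch of sharing one preprocessing function between the offline
# training pipeline (batch) and the online prediction pipeline (stream).
# Function, table, and column names are illustrative, not from the article.
from pyflink.table import EnvironmentSettings, TableEnvironment, DataTypes
from pyflink.table.udf import udf

# Single source of truth for the feature logic: defined once, registered in
# both environments, so training and serving preprocessing cannot drift apart.
@udf(result_type=DataTypes.DOUBLE())
def normalize_price(price, max_price):
    return price / max_price if max_price else 0.0

def build_features(t_env: TableEnvironment):
    t_env.create_temporary_function("normalize_price", normalize_price)
    # Assumes a table named raw_events has been registered in t_env.
    return t_env.sql_query(
        "SELECT user_id, normalize_price(price, 1000.0) AS price_feature "
        "FROM raw_events"
    )

batch_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())
stream_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
# build_features(batch_env)   -> feeds offline training
# build_features(stream_env)  -> feeds online prediction
```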

Model training

At present, the training stage of AI is basically a process in which batch computing (offline training) produces a static model. This is because most current models are built on the statistical assumption of independent and identically distributed (IID) data: they find statistical correlations between features and labels from a large number of training samples. Because these correlations usually do not change suddenly, a model trained on one batch of samples remains applicable to another batch with the same feature distribution. However, a static model produced by such offline training can still run into problems.

First, the distribution of sample data may drift over time. When that happens, the distribution of the samples seen in online prediction deviates from the distribution of the training samples, degrading the model's prediction quality. Static models therefore usually need to be retrained, either periodically or triggered by monitoring the samples and the model's prediction quality (note that such monitoring is itself a typical stream computing workload).

In addition, in some scenarios the sample distribution at prediction time cannot be known at training time. In scenarios where the distribution may change unpredictably, such as Alibaba's Double Eleven, trending searches on Weibo, or high-frequency trading, being able to update the model quickly to obtain better predictions is extremely valuable.

Therefore, an ideal AI computing architecture should take timely model updates into account, and stream computing has some unique advantages here. In fact, Alibaba already uses online machine learning in its search and recommendation systems, with good results in scenarios like Double Eleven.
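The following framework-agnostic sketch (plain NumPy; all names and numbers are illustrative) shows the core idea of online learning that makes stream computing attractive here: each arriving sample is first scored, and then, once its label is known, immediately updates the model instead of waiting for a full offline retrain.

```python
# A framework-agnostic sketch of online learning: the model is updated
# incrementally per sample rather than retrained offline on a full batch.
import numpy as np

class OnlineLinearModel:
    def __init__(self, n_features, lr=0.01):
        self.w = np.zeros(n_features)  # model weights, updated continuously
        self.lr = lr

    def predict(self, x):
        return float(self.w @ x)

    def update(self, x, label):
        # One stochastic-gradient step on the squared error of this sample.
        error = self.predict(x) - label
        self.w -= self.lr * error * x

model = OnlineLinearModel(n_features=3)
# In a real-time link this loop is the stream job: each event is first used
# for prediction, then (once its label arrives) for a model update.
for x, y in [(np.array([1.0, 0.5, 0.2]), 1.0),
             (np.array([0.3, 0.8, 0.1]), 0.0)]:
    model.predict(x)
    model.update(x, y)
```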

Inference and prediction

The inference and prediction stage runs in a richer variety of environments and computation types, including batch processing (offline prediction) and stream processing. Streaming prediction can be roughly divided into online prediction and nearline prediction. Online prediction usually sits on the critical path of a user request, so its latency requirement is extreme, on the order of milliseconds. Nearline prediction is slightly less demanding, usually sub-second to seconds. Most native stream processing engines today can meet the needs of nearline data preprocessing and prediction, whereas online preprocessing and prediction usually have to be coded directly into the application to meet the extremely low latency requirement, which is why big data engines are rarely seen in online prediction scenarios. In this regard, Flink's Stateful Functions [9] is a unique innovation: its original intent is to build an online application out of a set of stateful functions running on Flink, and with it an online prediction service with ultra-low latency can be achieved. Users can then use the same code and the same engine for data preprocessing and prediction across offline, nearline, and online scenarios.
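For a rough feel for the programming model, here is a minimal hedged sketch assuming the Stateful Functions 3.x Python SDK; the typename, state spec, and scoring step are hypothetical, not from the article.

```python
# A minimal Stateful Functions sketch (assumes the statefun 3.x Python SDK).
# The typename, state spec, and toy per-key state below are hypothetical.
from statefun import StatefulFunctions, ValueSpec, IntType, Context, Message

functions = StatefulFunctions()

@functions.bind(typename="example/predictor",
                specs=[ValueSpec(name="seen", type=IntType)])
async def predictor(context: Context, message: Message):
    # Per-key state lives inside the function itself, so a prediction can
    # use it without a round trip to an external store - the source of the
    # ultra-low latency mentioned above.
    seen = (context.storage.seen or 0) + 1
    context.storage.seen = seen
    # A real function would score features from `message` against a model
    # and send the result onward with context.send(...).

# Serving this via RequestReplyHandler behind an HTTP endpoint is omitted.
```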

To sum up, every major stage of machine learning has an important requirement for real-time capability. So what kind of system architecture can meet these requirements effectively?

Flink and the real-time AI architecture
The most typical AI architecture today is offline training combined with online inference and prediction (Figure 3).

As mentioned before, there are two problems with this architecture:

The model update cycle is usually long.

Offline and online preprocessing may require maintaining two sets of code.

To solve the first problem, we need to introduce a real-time training link (Figure 4).

In this link, besides being used for inference and prediction, online data is also turned into samples in real time and used for online model training. The model is updated dynamically throughout this process and can therefore better fit changes in the samples.

Neither a purely online nor a purely offline link suits every AI scenario. Similar to the Lambda idea, we can combine the two (Figure 5).

Likewise, to address system complexity and maintainability (the second problem mentioned above), we want to use a stream-batch unified engine in the data preprocessing part to avoid maintaining two codebases (Figure 6). Beyond that, we also need data preprocessing and inference prediction to support offline, nearline, and online latency requirements, which makes Flink a very suitable choice. For data preprocessing in particular, Flink's comprehensive SQL support on both streams and batches can greatly improve development efficiency.

In addition, to further reduce system complexity, Flink has also made a series of efforts in the model training stage (Figure 7).

Stream-batch unified algorithm library Alink

At FFA 2019 last year, Alibaba announced that it had open-sourced Alink [10], a machine learning algorithm library built on Flink, and planned to gradually contribute it back to Apache Flink and release it as Flink ML Lib alongside Apache Flink. Besides offline learning algorithms, a major feature of Alink is that it provides users with online learning algorithms, helping Flink play a greater role in real-time AI.

Deep Learning on Flink (flink-ai-extended [11])

flink-ai-extended helps users integrate the currently popular deep learning frameworks (TensorFlow, PyTorch) into Flink, enabling even users who are not deep learning algorithm developers to implement an entire AI architecture on Flink.

Stream-batch unified iteration semantics and a high-performance implementation

Iterating to convergence is the core computation in AI training, and Flink has used native iteration from the beginning to keep iterative computation efficient. To help users develop algorithms more easily, simplify their code, and further improve execution efficiency, the Flink community is unifying the semantics of stream and batch iteration while further optimizing iteration performance. The new optimizations will avoid synchronization overhead between iteration rounds as much as possible, allowing different batches of data in different rounds of iteration to be processed concurrently.

Of course, a complete AI architecture involves much more than the three main stages above, including integration with various data sources, integration with the existing AI ecosystem, monitoring of online models and samples, and all kinds of peripheral supporting systems. A picture (Figure 8) from the FFA 2019 keynote by Wang Feng (known as Mowen), head of real-time computing at Alibaba, sums up much of this work well.

The Flink community is working on this too. Broadly, this AI-related work falls into three categories: supplement, improvement, and innovation. Some of the ongoing work is listed below; a few items may not be directly about AI, but they affect how well Flink can serve real-time AI.

Supplement: catching up on what others have and we lack

Flink ML Pipeline [12]: helps users conveniently store and reuse the complete computing logic of a machine learning workflow.

Flink Python API (PyFlink [13]): Python is the native language of AI, and PyFlink provides users with the programming interface that matters most for AI.

Notebook Integration [14] (Zeppelin): provides a friendly interactive environment for users' AI experiments.

Native Kubernetes support [15]: integrates with Kubernetes to support cloud-native development, deployment, and operations.

Improvement: doing better what others already do

Connector redesign and optimization [16]: simplifies connector implementation and expands the connector ecosystem.

Innovation: offering what no one else has

AI Flow: a top-level workflow abstraction and supporting services for big data + AI that treats stream computing as a first-class concern (to be open-sourced soon).

Stateful Functions [9]: provides ultra-low-latency data preprocessing and inference prediction comparable to purpose-built online applications.

Some of this work strengthens Flink itself as a popular big data engine, such as enriching the connector ecosystem to connect to more external data sources. Other work depends on ecosystem projects beyond Flink, the most important of which is AI Flow. Although AI Flow grew out of the real-time AI architecture, it is not bound to Flink at the engine layer; it focuses on the top-level stream-batch unified workflow abstraction, aiming to provide environment support for real-time AI architectures on different platforms, engines, and systems. Space does not allow a full treatment here; it will be introduced in a separate article.

Final thoughts

Apache Flink started from a simple idea about stream computing and has grown into today's popular real-time computing open-source project, benefiting everyone. This would not have been possible without the hundreds of code contributors and tens of thousands of users in the Flink community. We believe Flink can make a difference in AI as well, and we welcome more people to join the Flink community to create and share the value of real-time AI with us.
