Getting Started with Apache Flink

1. What is Apache Flink

Apache Flink is an open-source, stream-based, stateful computing framework. It runs in a distributed manner, delivers low latency and high throughput, and excels at scenarios with stateful, complex computing logic.

1. The origin of Flink

Apache Flink is a top-level project of the Apache Software Foundation. Like many other top-level Apache projects, such as Spark, which originated in a laboratory at UC Berkeley, Flink also grew out of the laboratory of a famous university: the Technical University of Berlin.

The project was originally named Stratosphere, and its goal was to make big data processing more concise. Many of the project's original code contributors are still active on the Apache project management committee today and continue to contribute to the community.

The Stratosphere project was launched in 2010. Its Git commit log shows that the first line of code was written on December 15, 2010.

In May 2014, the Stratosphere project was contributed to the Apache Software Foundation, entered the Apache Incubator, and was renamed Flink.

2. Development of Flink

The Flink project is very active. The first incubator version, v0.6-incubating, was released on August 27, 2014.

Because the project attracted many contributors and remained highly active, Flink became a top-level Apache project in December 2014.
One month after becoming a top-level project, it released its first release version, Flink 0.8.0. Since then, Flink has largely kept a cadence of one version every four months, and it has continued to develop to this day.

3. Current status of Flink – the most active project in the Apache community

So far, Flink has become the most active big data project in the Apache community. Its user and developer mailing lists ranked #1 in the 2020 Apache annual report.

As shown in the figure above, even compared with the very active Spark project, Flink's user mailing list activity is higher. In addition, in terms of developer code commits and GitHub user visits, Flink ranks second among all Apache projects and first among big data projects.

From April 2019 to the present, the Flink community has released five versions, each with more commits and contributors than the last.

2. Why learn Apache Flink

1. Real-time trend of big data processing

With the rapid development of the Internet, big data processing shows a very obvious real-time trend.

As shown in the figure above, there are many common real-life scenarios. For example, the live broadcast of the Spring Festival Gala has a real-time big screen, and the Double 11 Shopping Festival has real-time turnover statistics and media reports.

City brains can monitor traffic in real time, and banks can run risk control in real time. When we open applications such as Taobao or Tmall, they make real-time personalized recommendations based on each user's habits. From these examples, we can see that big data processing shows an obvious real-time trend.

2. Flink has become the de facto standard for real-time computing at home and abroad

Amid this general trend toward real-time processing, Flink has become the de facto standard for real-time computing at home and abroad.

As shown in the figure above, many companies at home and abroad are currently using Flink.

3. Evolution of stream computing engine

Stream computing engines have evolved over several generations. The first-generation engine, Apache Storm, is a pure streaming design with very low latency, but its problem is also obvious: it cannot avoid processing messages repeatedly, which compromises data correctness.

Spark Streaming is the second-generation stream computing engine, and it solves the semantic-correctness problem of stream computing. However, its design is based on batches, so its biggest problem is relatively high latency: it can only reach latency on the order of ten seconds and cannot achieve sub-second end-to-end latency.

Flink is the third and latest generation of stream computing engines. It guarantees both low latency and consistent message semantics, and its built-in state management also greatly reduces application complexity.

3. Typical application scenarios of Apache Flink

1. Event-driven applications

The first type of application scenario is an event-driven application.

Event-driven means that one event triggers one or many subsequent events; this series of events then forms some information, based on which certain processing needs to be done.

In social scenarios, take Weibo as an example: when we follow someone, that person's follower count changes. Later, if that person posts on Weibo, the fans following them receive a message notification. This is a typical event-driven pattern.

In online shopping, when users comment on products, the comments affect the store's star rating on one hand and trigger detection of malicious bad reviews on the other. Users can also see product delivery and other status updates by clicking on the information flow, which may trigger a series of subsequent events.

There is also financial anti-fraud, where a scammer defrauds victims through text messages and then withdraws the stolen money at an ATM. In this scenario, the camera footage is analyzed, the fraud is quickly identified, and the criminal behavior is dealt with accordingly. This is also a typical event-driven application.

To summarize, an event-driven application is a class of stateful applications that trigger computations, update state, or perform external system operations based on events in an event stream. Event-driven applications are common in real-time computing services, such as real-time recommendation, financial anti-fraud, and real-time rule alerting.
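The event-driven pattern can be sketched in a few lines of plain Python (a minimal illustration, not the Flink API): each incoming event updates per-key state, and a rule fires an external action when the state crosses a threshold. The event names and the failed-login rule here are hypothetical examples.

```python
from collections import defaultdict

def process_events(events, max_failures=3):
    """Event-driven sketch: update per-account state and trigger alerts."""
    failures = defaultdict(int)   # state: failed-login count per account
    alerts = []
    for account, kind in events:
        if kind == "login_failed":
            failures[account] += 1
            if failures[account] >= max_failures:
                alerts.append(account)   # trigger an external action
        elif kind == "login_ok":
            failures[account] = 0        # reset state on success
    return alerts

events = [("bob", "login_failed")] * 3 + [("alice", "login_ok")]
print(process_events(events))  # ['bob']
```

The key point is that the computation is driven by the events themselves, and the decision depends on accumulated state, not on any single event in isolation.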

2. Data analysis type application

The second type of typical application scenario is data analysis, such as the real-time summary of Double 11 turnover, including PV and UV statistics.

The figure above shows downloads of Apache open-source software across different regions of the world, which is essentially an aggregation of information.

It also includes large marketing screens, the rise and fall of sales, and quarter-on-quarter and year-on-year comparisons of marketing-strategy results. All of these involve real-time analysis and aggregation of large amounts of information, and they are very typical Flink usage scenarios.

As shown in the figure above, take Double 11 as an example: during the 2020 Tmall Double 11 shopping festival, Alibaba's Flink-based real-time computing platform processed 4 billion messages per second, with a data volume of 7 TB; order creation reached 580,000 per second, and the computing scale exceeded 1.5 million cores.

These application scenarios are large in scale and have very high real-time requirements, which is exactly what Apache Flink is good at.

3. Data Pipeline Application (ETL)

The third type of scenario that Apache Flink is good at is the data pipeline application, that is, ETL.

ETL (Extract-Transform-Load) is the process of extracting, transforming, and loading data from a source to a destination.

Traditional ETL uses offline processing, often at the hour or day level.

However, with the real-time trend in big data processing, there is also demand for real-time data warehouses, which require data to be updated at the minute or even second level so that queries can be performed in time, real-time indicators can be seen, and more real-time judgment and analysis can be done.

Under the above scenarios, Flink can meet the real-time requirements to the greatest extent.

The main reasons are as follows. On one hand, Flink has very rich connectors, supporting many data sources and sinks and covering all mainstream storage systems. On the other hand, it has very general built-in aggregation functions that simplify writing ETL programs. ETL-type applications are therefore a very suitable scenario for Flink.
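The extract-transform-load flow can be sketched as a record-by-record pipeline in plain Python (a hedged illustration, not Flink's connector API; the CSV field names are invented for the example). Each record streams through the three stages instead of waiting for an hourly or daily batch:

```python
def extract(lines):
    """Extract: read raw records from a source system."""
    for line in lines:
        yield line

def transform(records):
    """Transform: parse each record and filter out invalid rows."""
    for rec in records:
        user, amount = rec.split(",")
        amount = float(amount)
        if amount > 0:               # drop invalid rows
            yield {"user": user, "amount": amount}

def load(rows, sink):
    """Load: write each transformed row to the destination store."""
    for row in rows:
        sink.append(row)

sink = []
load(transform(extract(["alice,10.5", "bob,-1", "carol,3"])), sink)
print(sink)  # [{'user': 'alice', 'amount': 10.5}, {'user': 'carol', 'amount': 3.0}]
```

In Flink the same shape would be expressed with a source connector, transformation operators, and a sink connector, running continuously over an unbounded stream.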

4. Basic concepts of Apache Flink

1. The core concepts of Flink

Flink has four core concepts: Event Streams, State, (Event) Time, and Snapshots.

1.1 Event Streams

An event stream can be real-time or historical. Flink is stream-based, but it can process both streams and batches; the input in both cases is an event stream, and the difference is whether it is processed in real time or as finite historical data.

1.2 State

Flink excels at stateful computation. Complex business logic is usually stateful: it not only processes a single event but also records a series of historical information for later computation or judgment.

1.3 (Event) Time

(Event) Time mainly addresses how to guarantee consistent results when data arrives out of order.

1.4 Snapshots

Snapshots implement data snapshots and fault recovery, guaranteeing data consistency and enabling job upgrades and migration.

2. Flink job description and logical topology

Next, let's take a closer look at Flink's job description and logical topology.

As shown above, the code is a simple Flink job description. It first defines a Kafka Source, indicating that the data comes from a Kafka message queue, and then parses each record from Kafka. After parsing, the data is grouped by event ID with KeyBy, and each group performs a window aggregation every 10 seconds. After aggregation, the result is written to a custom Sink. This simple job description maps to an intuitive logical topology.

The logical topology contains four units, called operators: Source, Map, KeyBy/Window/Apply, and Sink. We call this logical topology a Streaming Dataflow.
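The KeyBy-plus-tumbling-window step of this topology can be simulated in plain Python (a sketch of the logic only; a real job would use the Flink DataStream API). Events are grouped by their ID and counted inside 10-second windows:

```python
from collections import defaultdict

def windowed_counts(events, window_size=10):
    """Group (timestamp_seconds, event_id) pairs by key inside tumbling windows."""
    windows = defaultdict(int)                  # state per (window_start, key)
    for ts, event_id in events:
        window_start = (ts // window_size) * window_size
        windows[(window_start, event_id)] += 1  # the per-window aggregation
    return dict(windows)

events = [(1, "a"), (4, "a"), (7, "b"), (12, "a")]
print(windowed_counts(events))
# {(0, 'a'): 2, (0, 'b'): 1, (10, 'a'): 1}
```

In Flink, the grouping corresponds to KeyBy, the window assignment to the 10-second window, and the counting to the Apply/aggregate function.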

3. Flink physical topology

The logical topology corresponds to a physical topology, in which each operator can run concurrently for load balancing and faster processing.

Big data processing is basically distributed, and each operator can have a different degree of parallelism. With KeyBy, the data is grouped by key, so after the operator before KeyBy finishes processing, the data is shuffled and sent to the next operator. The figure above shows the physical topology corresponding to this example.

4. Flink state management and snapshot

Next, let's take a look at state management and snapshots in Flink.

When executing the window aggregation logic, the aggregation function processes the data every 10 seconds. The data within those 10 seconds must be stored first and processed when the time window fires. This state data is stored locally in embedded storage, which can be either the process's memory or a persistent KV store such as RocksDB; the main differences between the two are processing speed and capacity.

In addition, each parallel instance of these stateful operators has its own local storage, so the state can scale in and out with the operator's parallelism, allowing large volumes of data to be handled by increasing parallelism.

On the other hand, jobs can fail in many situations. How do we ensure data consistency when re-running after a failure?

Based on the Chandy-Lamport algorithm, Flink saves the state of each distributed node to a distributed file system as a checkpoint. The process is roughly as follows: first, a Checkpoint Barrier, a special kind of message, is injected at the data source.

The barrier then flows along with the data stream like an ordinary event. When the barrier reaches an operator, the operator takes a snapshot of its current local state; when the barrier reaches the sinks, all states have been saved, forming a global snapshot.

This way, when a job fails, it can be rolled back using the checkpoint saved in the remote file system: first roll the Source back to the offset recorded in the checkpoint, then roll the state of each stateful operator back to the same point in time, and resume computation. It is thus unnecessary to start the computation from scratch, while the consistency of data semantics is still guaranteed.
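The recovery mechanism can be illustrated with a toy simulation in plain Python (a hedged sketch, not Flink's actual Chandy-Lamport implementation): a checkpoint records the source offset together with the operator state, and after a failure the job restores both and replays from that point.

```python
durable = {}  # stands in for the remote checkpoint store (e.g. a distributed FS)

def run(stream, fail_at=None):
    """Sum a stream, checkpointing (offset, state) every 2 events."""
    ckpt = durable.get("ckpt", {"offset": 0, "state": 0})
    offset, total = ckpt["offset"], ckpt["state"]
    for i in range(offset, len(stream)):
        if i == fail_at:
            raise RuntimeError("job failed")           # crash mid-stream
        total += stream[i]
        if (i + 1) % 2 == 0:                           # periodic checkpoint
            durable["ckpt"] = {"offset": i + 1, "state": total}
    return total

stream = [1, 2, 3, 4, 5]
try:
    run(stream, fail_at=3)      # crashes after the checkpoint at offset 2
except RuntimeError:
    pass
print(run(stream))              # 15, the same result as a failure-free run
```

Because both the source offset and the operator state are restored from the same checkpoint, events between the checkpoint and the crash are replayed exactly once from the job's point of view, and the final result stays consistent.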

5. Time definition in Flink

Another important concept in Flink is time, in particular Event Time.

There are three notions of time in Flink. Event Time is the time when the event occurred. Ingestion Time is the time when the event arrives at the Flink data source, that is, when it enters the Flink framework. Processing Time is the time when the event is processed, i.e., when it arrives at an operator. What are the differences between the three?

In the real world, the interval between when an event occurs and when it is written into the system can be long. For example, when the subway signal is weak and we forward, comment on, or like a Weibo post, those operations may not complete until we leave the subway, so events that happened earlier can arrive at the system late. Event Time more truthfully reflects when an event occurred, so in many scenarios we use Event Time as the event's time.

In that case, however, delayed events may take a long time to arrive in their window, so end-to-end latency can be large, and we also have to deal with out-of-order data. If Processing Time is used instead, processing is faster and latency lower, but it does not reflect when events actually occurred. When developing an application, we therefore need to make trade-offs according to its characteristics.
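The difference the time notion makes can be shown with a small Python sketch (an illustration only, with arrival order used as a stand-in for processing time): the same out-of-order stream is bucketed into 10-second windows by event time or by arrival time, and the results disagree.

```python
def bucket(events, use_event_time, window=10):
    """Count (arrival_ts, event_ts) pairs per window under either time notion."""
    counts = {}
    for arrival_ts, event_ts in events:
        ts = event_ts if use_event_time else arrival_ts
        start = (ts // window) * window
        counts[start] = counts.get(start, 0) + 1
    return counts

# (arrival_time, event_time): the second event was delayed in transit
events = [(2, 1), (13, 4), (15, 14)]
print(bucket(events, use_event_time=True))   # {0: 2, 10: 1}, reflects reality
print(bucket(events, use_event_time=False))  # {0: 1, 10: 2}, skewed by the delay
```

The event-time result matches when the events actually happened; the arrival-time result assigns the delayed event to the wrong window, which is exactly the trade-off described above.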

6. Flink API

Flink offers four levels of API. The lowest level is the customizable Process Function, which gives fine-grained control over the most basic elements, such as time and state, so that users can implement their own logic.

The next layer up is the DataStream API, which supports both stream and batch processing and offers a higher-level logical expression, with many built-in functions that make it convenient to write programs.

The top-level APIs are the Table API and Stream SQL, a very high-level and very concise form of expression. Let us illustrate with examples.

6.1 Process Function

As can be seen, in processElement we can apply custom logic to the event and the state. We can also register a timer and define what to do when the timer fires, which provides very fine-grained low-level control.
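The processElement/onTimer pattern can be sketched in plain Python (a hedged imitation of the idea, not the real Flink ProcessFunction interface; the class name and 10-second timer are invented for the example):

```python
class MyProcessFunction:
    """Sketch of the ProcessFunction idea: keyed state plus registered timers."""
    def __init__(self):
        self.state = {}      # keyed state
        self.timers = []     # (fire_time, key) pairs for registered timers

    def process_element(self, key, value, current_time):
        self.state[key] = self.state.get(key, 0) + value   # custom state logic
        if key not in [k for _, k in self.timers]:
            self.timers.append((current_time + 10, key))   # fire 10s later

    def on_timer(self, fire_time):
        out = []
        for t, key in [p for p in self.timers if p[0] <= fire_time]:
            out.append((key, self.state.pop(key)))         # emit and clear state
            self.timers.remove((t, key))
        return out

fn = MyProcessFunction()
fn.process_element("a", 5, current_time=0)
fn.process_element("a", 3, current_time=2)
print(fn.on_timer(fire_time=12))  # [('a', 8)]
```

The real API additionally provides a runtime context for state access and timer registration, but the division of labor is the same: process_element reacts to each event, and on_timer reacts to the passage of time.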

6.2 DataStream API

The DataStream API describes the job. As can be seen, it has many built-in functions, such as Map, keyBy, timeWindow, and sum, as well as custom ProcessFunctions like the MyAggregationFunction we just defined.

6.3 Table API & Stream SQL

The same logic is more intuitive when described with the Table API and Stream SQL. Data analysts do not need to understand the underlying details and can write logic in a declarative language. The Table API and Stream SQL will be introduced in detail in Lesson 5.

7. Flink runtime architecture

The Flink runtime architecture has three main roles.

The first is the client, which submits the application. For a SQL program, the SQL optimizer also optimizes it and generates the corresponding JobGraph. The client submits the JobGraph to the JobManager, which can be considered the master node of the entire job.

The JobManager launches a series of TaskManagers as worker nodes. The worker nodes are connected according to the job topology and carry out the actual computation, while the JobManager mainly handles control-flow processing.

8. Flink physical deployment

Finally, let's take a look at the environments in which Flink can be deployed.

First, jobs can be submitted manually to YARN, Mesos, and Standalone clusters. They can also be submitted to a Kubernetes (K8s) cloud-native environment via container images.

Currently, Flink can be deployed in many physical environments.
