Developer Content

As one of the most active big data projects, Flink has been a top project of the Apache Software Foundation for eight years.

Apache Flink is a real-time big data analysis engine, which supports the streaming batch execution mode and can be seamlessly connected with the Hadoop ecosystem. In 2014, it was accepted as an Apache incubator project. Only a few months later, it became the top project of Apache.

For Flink, Alibaba has a very suitable streaming scenario. As the leading force of Flink, Alibaba began to research Flink in 2015, and launched Flink in the search scene for the first time in 2016. At the same time, Alibaba has made a lot of modifications and improvements to Flink to adapt it to large-scale business scenarios. In 2017, Alibaba has become the largest user of Flink community, and the Flink team has reached hundreds of people. Some of these early improvements have been explained in Alibaba's 2018 article "Is Flink Enough Strong? Alibaba said: Not Enough.".

In 2019, Ali announced that it had acquired the enterprises behind Flink, and officially opened the internal Flink version of Blink, contributing more than a million lines of code, greatly promoting the healthy development of the community. In the Double 11 in 2021, the peak real-time computing carried by Flink reached 4 billion records per second, and the data volume reached 7 TB per second, which is equivalent to reading 5 million Xinhua Dictionaries in one second.

In recent years, Flink community has been continuously promoting at domestic and international technical conferences, which has enabled Flink to be widely adopted, and various application scenarios have become more extensive, with rapid ecological development. Flink has become powerful, and its design goal is no longer just a stream computing engine, but to enable most data analysts to build real-time data integration, analysis, risk control and online machine learning scenario solutions using Flink's streaming and batching API.

On November 26-27, 2022, Flink Forward Asia 2022 will be held online, which is an opportunity to summarize the important functions recently released. This time, Flink streaming data warehouse function is more mature, and CDC can also access multiple databases InfoQ took this opportunity to interview Wang Feng, the founder of Apache Flink Chinese community and the head of Alibaba's open source big data platform (don't ask), to understand the progress of Flink's core technology and understand Flink's future plans.

From stream computing to stream batch computing

After defeating Storm and Spark Streaming, Flink has become the only standard for stream computing, and there are no competitors in technology.

At the beginning of its birth, Flink was able to quickly defeat the previous generation stream computing engine Storm, relying on the core concept and characteristics of "stateful stream computing". Flink not only provides high-performance pure stream computing, but also provides users with accurate data consistency assurance through distributed consistency snapshot technology on the framework layer through two technologies: converged computing and state management. In Mo Wen's opinion, this is the key reason why Flink quickly became the new mainstream in the field of stream computing after its debut.

Although Spark Streaming can also be a choice for some stream computing scenarios through its powerful Spark ecosystem, its essence is still based on the Spark Batch engine. The non pure stream execution mode will still limit its execution performance and stream semantic expression.

In terms of batch computing, Flink has completed most of its work and is increasingly mature. "At present, Flink has been able to run through the batch processing standard test set TPC-DS completely, and its performance is also very good. It has reached the level of the mainstream batch processing engine. Next, Flink will continue to improve and polish the maturity of batch processing, and strive to bring users the best streaming and batch computing experience in the industry by combining the natural advantages of its own stream processing."

Why do we need streaming batch integration? Why is Flink based streaming batch technology more advantageous?

Let's first look at this issue from a business perspective. Early enterprises were basically offline businesses, running reports once a day based on batch processing. However, the digital world is evolving, and the demand for real-time will become more and more. Real time risk control, real-time BI statistics, real-time recommendations, and real-time monitoring cannot be carried out at night (the goods may have been sold out at night, and users may have left). Real time data analysis can bring value to users. Gradually offline and real-time will become two parallel split links. As the proportion of real-time data traffic continues to increase, more and more tasks will need to be developed twice, and developers will begin to face the problem of development efficiency.

In addition, the separation of real-time and offline links also has the problem of consistency of business caliber. Under the previous technical scheme, real-time and offline are equivalent to working with two sets of tools. Different languages and engines are used, and the data caliber cannot be consistent. Such analysis results will interfere with business decisions and even mislead decision-making errors.

At this time, streaming batch integration naturally becomes a "new means" to solve the real-time offline fragmentation. The two real-time offline business processes developed with a set of computing engines are naturally consistent without error. Especially in some highly efficient business scenarios, such as search, recommendation, advertising, and marketing analysis in the data platform, the demand for streaming and batching will naturally be high. In addition, in the search recommendation scenario, Flink streaming batch tasks and online tasks can be mixed together to share a resource pool for unified scheduling, so as to maximize the use of server resources, which is also an advanced practice in the industry.

The benefits brought by the new architecture of streaming batch integration are obvious, but it does not mean that it is a "one size fits all" technology architecture. Mo Wen thinks, "If the current data business is basically offline digital warehouse, and there is no real time business of a certain scale, there is no need to do the streaming batch integration transformation too early, because the benefits are not large. When the real-time business volume is becoming the mainstream, the proportion of offline business is increasing, or there are increasingly strong and consistent requirements for data consistency, then the streaming batch integration architecture is an inevitable choice for the future."

Streaming digital warehouse: a new digital warehouse architecture based on streaming batch integration

Stream batch integration is a technical concept.

Flink provides the semantic expression capability of streaming and batching in the SQL layer, that is, users can write a set of SQL, which can be used in both real-time and offline scenarios, so as to get a full incremental and integrated data development experience.

Is this the end of the concept of streaming and approving? Obviously not enough. Because on the data storage chain, there is still a lot of complexity. For example, on the real-time link, Flink needs to write data to Kafka and other streaming storage. On the offline link, Flink often writes data to Hive/Iceberg/Hudi and other batch storage. The two storage links are separated, and users still have to maintain two data links at the same time, causing greater management difficulty.

However, the main reason why we need to maintain two sets of storage at the same time is that the industry currently does not have a more productive and available streaming and batch storage. At the same time, it supports efficient streaming read, streaming write, batch read and batch write capabilities. In order to meet different business needs (timeliness, analyticity, etc.), users can only combine multiple links, and even synchronize data between different stores. This will inevitably make the entire link increasingly complex.

Is there any streaming batch storage available in the industry to solve this problem? You may think of the mainstream lake storage project of Apache Hudi. Hudi is indeed the most perfect technology in the industry in terms of streaming batch storage capacity, but the design of Hudi's storage structure is not suitable for large-scale updates. Therefore, the key direction of Flink community in the next stage is to solve this user pain point, further improve the concept of streaming and batch integration, and provide a truly available streaming and batch integrated storage technology, so as to launch a complete new streaming digital warehouse architecture based on streaming and batch integrated computing and storage. This is also the background for Flink community to launch an independent sub project of Flink Table Store at the end of 2021.

In 2022, Flink Table Store has completed the incubation from 0 to 1, and released two release versions. In addition to Alibaba, many companies, including ByteDance, have participated in the contribution of this project, and many companies have begun to try it out. The next key evolution direction of Flink community is the new streaming digital warehouse architecture, which provides users with a more concise and real-time digital warehouse architecture and a more integrated experience. This is also a complete implementation scenario of the concept of streaming and batch integration advocated by Flink for many years, and a perfect combination of streaming and batch integration computing and storage.

On today's Flink Forward Asia 2022, Mo Wen showed you a complete production demo, which is based on Alibaba's real-time computing platform and runs through a complete flow batch data processing and analysis process under the TPC-H business background, including Flink CDC data entering the lake (written into the Table Store) from the source of the database Flink SQL real-time streaming analysis (subscribing to the Table Store), as well as batch data revision and real-time interactive query, presents a complete new streaming data warehouse architecture. In addition, Flink streaming digital warehouse architecture is also an open system, which supports docking with all other storage systems with streaming and batch capabilities, such as Alibaba Cloud Hologres. Alibaba has also internally completed Flink SQL+Hologres, an enterprise level self-developed streaming digital warehouse product, which will be officially released soon.

Full incremental integrated data integration based on Flink

Data integration is a very important application scenario in the real-time stream processing platform, which can also be confirmed in the market guidance report of the stream processing platform released by Gartner in January 2022. From the global market perspective, about 1/3 of the stream processing scenarios are related to real-time data integration, that is, synchronizing data from various constantly changing data sources to the analysis database, data warehouse and data lake through the stream processing capability, This ensures that users can analyze the latest digital world in real time.

With the popularization of real-time data analysis technology, users' data synchronization needs are further upgraded. It is expected to use an integrated full data synchronization tool to achieve data synchronization with one click. However, under the traditional data integration technology system, full and real-time data synchronization often requires two sets of tools (batch and stream based), and users need to collaborate between the two sets of tools. Therefore, it is very difficult and challenging to truly realize the seamless connection of full incremental synchronization processes and ensure data consistency. However, if Flink's streaming batch integration feature can be used, it will become feasible to realize full incremental integration of real-time data.

In addition, Flink itself has a rich connector ecosystem, which can connect various mainstream storage in the industry, as well as an excellent distributed integration framework, including fault tolerance and distributed consistent snapshots. Therefore, doing full incremental integrated data integration on the basis of Flink is equivalent to "standing on the shoulders of giants", which will be faster and easier.

This is the background of the birth of Flink CDC project. With a large number of advantages of Flink, it uses the streaming batch integrated execution mode to achieve full incremental synchronization automatic switching. Based on Flink Checkpointing capabilities, it realizes the data synchronization breakpoint continuation feature, and based on the incremental snapshot consistency reading algorithm, it ensures that the data synchronization will not have any impact on the production business in the whole process of online database lockless operation.

As another innovative application scenario integrating streaming and batching, the CDC project has also developed rapidly. Netease, Tencent, Oceanbase, Bilibili, Xtransfer and other companies have all participated in community contributions. GitHub Star has now exceeded 3000. It supports many mainstream databases ecologically, including MySQL, PostgreSQL, MongoDB, TiDB, PolarDB and OceanBase. Mo Wen said that Flink CDC will further use the innovative achievements of the Flink community to access more data sources and become a new generation of fully incremental integrated data integration engine.

Flink in the cloud native era

With the popularity of cloud native, more and more enterprise applications have been migrated in containers and managed through K8s. In recent years, Spark, Kafka, etc. in the big data field have begun to support K8s, which has transformed big data applications from the traditional Yarn era to the cloud native era.

Flink community has long been designed based on cloud origin, including Flink resource scheduling and streaming shuffle, which are naturally suitable for cloud origin. As a streaming computing engine, Flink does not need to download the data, but streams the data. The data flow between distributed computing is through the network and memory, and does not rely on the local disk. Therefore, it is a natural architecture of storage and computing separation. In addition, Flink comes with a state store. The computing operator and state access are integrated, and state access is supported within the operator. In fact, this is also evolving towards the separation of storage and computing. That is to say, Flink can turn off RocksDB services at any time and snap state data to persistent HDFS or cloud storage.

Flink, as a product of the cloud native architecture, has always been designed towards the cloud native architecture. The community started Flink on K8s five or six years ago. After supporting K8s, it is very helpful for Flink. For example, deployment does not depend on Hadoop: as long as K8s are available, Flink can be deployed without any dependency. The operation and maintenance scheme is also very standardized, and the K8s operation and maintenance system will also operate and maintain Flink. At the same time, Flink can also be deployed based on containers. Containers bring better isolation to Flink, including task isolation, multi tenant management, and even serverless in the next step.

In the development trend of cloud native, adaptability is very important. Better resource elasticity makes business fluctuations more flexible, and the resources on the cloud are also massive. Users can constantly adjust the resource scale flexibly according to business needs. Especially in the Serverless environment, users do not even need to consider machine resources. Flink itself will also increase more adaptive capabilities to achieve automatic task concurrency management and state data management, so that Flink can better use the elastic mechanism on the cloud.

Apache Flink is booming and becoming indispensable in the vast big data analysis ecosystem, becoming a key pillar of enterprise data strategy. However, for some traditional enterprises, it is difficult to build a data analysis platform with open source software without a strong big data technology team. Therefore, Alibaba Cloud Flink's technical team is also doing something to provide product based services and lower the technical threshold.

Alibaba Cloud has launched a cloud native real-time computing Flink product, which provides a development, operation and maintenance platform with Flink SQL as the core. It has made the Flink production, operation and maintenance experience and enterprise level capabilities accumulated in Alibaba available to small and medium-sized enterprises in the form of productization, providing solutions such as real-time data warehouse, real-time data integration, real-time risk control and real-time feature engineering, Help digital enterprises accelerate the real-time upgrading of big data technology.

In addition, Flink products provided by Alibaba Cloud also use the most advanced Serverless architecture. Users can use Flink simply by purchasing computing resources on demand, making real-time computing more affordable. Mo Wen said that in the next few months, Flink based multi cloud PaaS Serverless service will also be publicly tested worldwide. As the core R&D team promoting continuous technological innovation in the Flink community, Alibaba Cloud hopes to further promote the Flink technology ecology globally.

Is Flink strong enough?

Related Articles

A detailed explanation of Hadoop core architecture HDFS

What Does IOT Mean

6 Optional Technologies for Data Storage

What Is Blockchain Technology

Explore More Special Offers

Short Message Service(SMS) & Mail Service

Sales Support

Technical Support

Connect & Report Abuse