Apache Flink is more than computing

At the beginning of 2021, in the annual technology trend outlook published by the InfoQ editorial team, we noted that the big data field would accelerate its embrace of "convergence" (or "integration") as a new direction of evolution. The essence is to reduce the technical complexity and cost of big data analysis while meeting ever higher requirements for performance and ease of use. Today, we see the popular stream processing engine Apache Flink (hereinafter referred to as Flink) taking a new step along this trend.

On the morning of January 8, Flink Forward Asia 2021 kicked off as an online conference. This is the fourth year that Flink Forward Asia (hereinafter referred to as FFA) has been held in China, and the seventh year since Flink became a top-level project of the Apache Software Foundation. As the real-time wave has developed and deepened, Flink has gradually evolved into a leading player and the de facto standard for stream processing. Looking back on its evolution, Flink has, on the one hand, continuously optimized its core stream computing capabilities and raised the stream processing standards of the entire industry; on the other hand, it has gradually advanced its architecture transformation and application scenarios along the idea of stream-batch integration. Beyond these, however, Flink's long-term development still needs a new breakthrough.

In the keynote speech of Flink Forward Asia 2021, Wang Feng (alias Mo Wen), founder of the Apache Flink Chinese community and head of Alibaba's open source big data platform, highlighted Flink's latest progress in the evolution and implementation of the stream-batch integrated architecture, and proposed Flink's next development direction: the Streaming Warehouse (Streamhouse for short). As the keynote's title, "Flink Next, Beyond Stream Processing", suggests, Flink will move from stream processing to the streaming warehouse to cover larger scenarios and help developers solve more problems. To reach the goal of a streaming data warehouse, the Flink community needs to build out data storage suited to stream-batch integration. This is Flink's technical innovation for this year; the related community work started in October and will be a key direction for the Flink community in the coming year.

So how should we understand the streaming data warehouse? What problems does it try to solve in existing data architectures? Why did Flink choose this direction, and what will the implementation path of the streaming data warehouse look like? With these questions in mind, InfoQ interviewed Mo Wen exclusively to further understand the thinking behind the streaming data warehouse.

Flink has repeatedly emphasized stream-batch integration over the past few years: using the same set of APIs and the same development paradigm to implement both stream computing and batch computing over big data, thereby guaranteeing the consistency of processing and results. Mo Wen said that stream-batch integration is more of a technical concept and capability; by itself it does not solve any user problem. Only when it lands in real business scenarios can it demonstrate its value in development efficiency and operating efficiency. The streaming data warehouse can be understood as the concrete landing solution under the general direction of stream-batch integration.

Two application scenarios of stream-batch integration

At last year's FFA, we already saw Flink's stream-batch integration applied to Tmall's Double Eleven. A year later, Flink's stream-batch integration has made new progress in both the evolution of the technical architecture and its landing applications.

At the level of technological evolution, Flink's stream-batch integrated API and architectural transformation have been completed. On the basis of the original stream-batch unified SQL, the DataStream and DataSet APIs have been further merged to realize a complete stream-batch integrated API at the Java semantic level: a single codebase can handle both stream processing and batch processing.

In Flink 1.14, released in October this year, it is already possible to mix bounded and unbounded streams in the same application: Flink now supports taking checkpoints for applications that are partially running and partially finished (where some operators have reached the end of their bounded input). In addition, Flink triggers a final checkpoint when a bounded stream reaches its end, ensuring that all computation results are successfully committed to the sink.
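Concretely, this behavior is opt-in in Flink 1.14 and enabled through a configuration flag. Below is a minimal sketch in Java, assuming the 1.14 configuration key (it may evolve in later releases):

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FinalCheckpointExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Flink 1.14 opt-in: allow checkpoints to continue after some
        // operators have finished processing their bounded input.
        conf.setString(
                "execution.checkpointing.checkpoints-after-tasks-finish.enabled",
                "true");

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment(conf);
        env.enableCheckpointing(10_000); // a checkpoint every 10 seconds

        // A job that mixes a bounded source (e.g. a file) with an
        // unbounded one (e.g. Kafka) can now keep checkpointing after the
        // bounded part finishes; a final checkpoint at end-of-input makes
        // sure all results are committed to the sink.
    }
}
```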

Batch execution mode now supports mixing the DataStream API and the SQL/Table API in the same application (previously only one of the two could be used on its own).
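As an illustration, the following sketch (with made-up sample data) runs a DataStream-to-Table round trip under batch execution mode, something that required separate programs before 1.14:

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class MixedBatchExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // The same program could also run with RuntimeExecutionMode.STREAMING.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        // Since 1.14, StreamTableEnvironment also works in batch mode,
        // so DataStream and SQL/Table API can be mixed in one job.
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        DataStream<String> words = env.fromElements("flink", "stream", "batch");
        Table table = tEnv.fromDataStream(words);         // DataStream -> Table
        Table upper = tEnv.sqlQuery(
                "SELECT UPPER(f0) AS word FROM " + table); // SQL on top of it
        DataStream<Row> back = tEnv.toDataStream(upper);  // Table -> DataStream

        back.print();
        env.execute("mixed-batch-example");
    }
}
```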

In addition, Flink updated the unified Source and Sink APIs and began consolidating the connector ecosystem around them. The new hybrid source can transition across multiple storage systems, enabling operations such as reading old data from Amazon S3 and then seamlessly switching to Apache Kafka.
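The sketch below shows what such a hybrid source looks like with Flink 1.14's connector APIs; the bucket, topic, and broker address are placeholders, and some class names (such as TextLineFormat) were renamed in later releases:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.source.hybrid.HybridSource;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineFormat;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class HybridSourceExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Bounded part: historical data, e.g. files on S3.
        FileSource<String> historical = FileSource
                .forRecordStreamFormat(new TextLineFormat(),
                        new Path("s3://bucket/events/"))
                .build();

        // Unbounded part: fresh data from Kafka.
        KafkaSource<String> live = KafkaSource.<String>builder()
                .setBootstrapServers("broker:9092")
                .setTopics("events")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Read the files first, then switch seamlessly to Kafka.
        HybridSource<String> hybrid =
                HybridSource.builder(historical).addSource(live).build();

        env.fromSource(hybrid, WatermarkStrategy.noWatermarks(), "s3-then-kafka")
           .print();
        env.execute("hybrid-source-example");
    }
}
```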

At the level of landing applications, there are also two important scenarios.

The first is full-plus-incremental integrated data integration based on Flink CDC.

Data integration and data synchronization between different data sources are hard requirements for many teams, but traditional solutions tend to be complex and poorly timed. Traditional data integration solutions usually use two separate technology stacks for offline and real-time data integration, involving many synchronization tools such as Sqoop and DataX. Each of these tools handles either full or incremental synchronization only, so developers have to manage the switch from full to incremental themselves, which is complicated to coordinate.

Based on Flink's stream-batch integration capability and Flink CDC, you only need to write one SQL statement: historical data is first synchronized in full, and the job then automatically resumes incremental transfer from the breakpoint, achieving one-stop data integration. The whole process requires no user judgment or intervention; Flink automatically completes the switch between batch and stream while guaranteeing data consistency.
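As a sketch of what this looks like in practice, the following Java program declares a MySQL table with the Flink CDC connector (which first takes a consistent full snapshot and then switches automatically to reading the binlog) and synchronizes it into an upsert-kafka sink with a single INSERT statement; hostnames, credentials, and table names are placeholders:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class CdcOneSqlSync {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Source: a MySQL table captured with the Flink CDC connector.
        // Full snapshot first, then automatic switch to the binlog.
        tEnv.executeSql(
                "CREATE TABLE orders_src ("
              + "  order_id BIGINT,"
              + "  amount DECIMAL(10, 2),"
              + "  PRIMARY KEY (order_id) NOT ENFORCED"
              + ") WITH ("
              + "  'connector' = 'mysql-cdc',"
              + "  'hostname'  = 'mysql-host',"
              + "  'port'      = '3306',"
              + "  'username'  = 'flink',"
              + "  'password'  = '***',"
              + "  'database-name' = 'shop',"
              + "  'table-name'    = 'orders'"
              + ")");

        // Sink: any upsert-capable table, here an upsert-kafka topic.
        tEnv.executeSql(
                "CREATE TABLE orders_sink ("
              + "  order_id BIGINT,"
              + "  amount DECIMAL(10, 2),"
              + "  PRIMARY KEY (order_id) NOT ENFORCED"
              + ") WITH ("
              + "  'connector' = 'upsert-kafka',"
              + "  'topic' = 'orders',"
              + "  'properties.bootstrap.servers' = 'broker:9092',"
              + "  'key.format' = 'json',"
              + "  'value.format' = 'json'"
              + ")");

        // The single statement that does the one-stop integration.
        tEnv.executeSql("INSERT INTO orders_sink SELECT * FROM orders_src");
    }
}
```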

As an independent open source project, Flink CDC Connectors has kept up a rapid pace of development since it was open-sourced in July last year, with a release roughly every two months. More and more companies are using Flink CDC in their own business scenarios; XTransfer, which InfoQ interviewed not long ago, is one of them.

The second application scenario is the core data warehouse scenario in the field of big data.

Most scenarios use Flink plus Kafka to process real-time data streams, that is, a real-time data warehouse, and write the final analysis results to an online serving layer for display or further analysis. At the same time, there is usually an asynchronous offline data warehouse in the background to supplement the real-time data, running large-scale batch or even full-volume analysis daily, or performing regular corrections to historical data.

However, this classic architecture has some obvious problems. First, the real-time link and the offline link use different technology stacks and require two sets of APIs, hence two development processes, which increases development cost. Second, because the real-time and offline stacks differ, the consistency of data semantics cannot be guaranteed. Third, the intermediate queue data of the real-time link is not amenable to analysis: if users want to analyze the data of a detail layer in the real-time link, it is very inconvenient. Many users resort to exporting that detail layer first, for example importing it into Hive for offline analysis, at a great cost to timeliness; or, to speed up queries, they import the data into another OLAP engine, which increases system complexity and again makes data consistency hard to guarantee.

Flink's stream-batch integration concept can be fully applied in the above scenario. In Mo Wen's view, Flink can take the industry's current mainstream data warehouse architecture to the next level and realize true end-to-end, full-link real-time analysis: when data changes at the source, the change is captured and analyzed layer by layer, all data flows in real time, and all flowing data can be queried in real time. With Flink's complete stream-batch integration capabilities, the same set of APIs can also support flexible offline analysis. In this way, real-time analysis, offline analysis, and interactive short-query analysis can be unified into one complete solution: the ideal "Streaming Warehouse".

Understanding the Streaming Data Warehouse

More precisely, the Streaming Warehouse is about "making the data warehouse streaming": letting the data of the entire warehouse flow in real time, and in a pure streaming rather than micro-batch (mini-batch) manner. The goal is an end-to-end, real-time, pure Streaming Service that analyzes all flowing data with one set of APIs. When the source data changes, for example when the log of an online service or the binlog of a database is captured, the data is analyzed according to predefined query or processing logic; the results land in some layer of the data warehouse, then flow from that layer to the next, and so on until all layers of the warehouse are flowing, finally reaching an online system where users can see the fully real-time flowing effect of the entire warehouse. In this process, data is active while queries are passive: analysis is driven by changes in the data. At the same time, in the vertical direction, users can execute queries against each detail layer and obtain results in real time. The design is also compatible with offline analysis scenarios, and the API remains the same, achieving true unification.
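A minimal sketch of such layer-by-layer flow, with hypothetical layer names (ods_logs, dwd_events, dws_stats) assumed to be declared elsewhere, might look like this in Flink SQL via the Table API; today these layers would typically be Kafka topics, while in the streaming-warehouse vision they become dynamic tables:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.StatementSet;
import org.apache.flink.table.api.TableEnvironment;

public class LayeredStreamingPipeline {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        StatementSet pipeline = tEnv.createStatementSet();

        // ODS -> DWD: cleanse raw logs. This is a continuous query that
        // never terminates; every source change flows into the next layer.
        pipeline.addInsertSql(
                "INSERT INTO dwd_events "
              + "SELECT user_id, event_type, ts "
              + "FROM ods_logs WHERE user_id IS NOT NULL");

        // DWD -> DWS: per-minute aggregation, also continuous, so a change
        // at the source propagates through every layer in real time.
        pipeline.addInsertSql(
                "INSERT INTO dws_stats "
              + "SELECT event_type, TUMBLE_START(ts, INTERVAL '1' MINUTE), COUNT(*) "
              + "FROM dwd_events "
              + "GROUP BY event_type, TUMBLE(ts, INTERVAL '1' MINUTE)");

        pipeline.execute(); // submit both layers as one streaming job
    }
}
```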

At present, the industry has no mature end-to-end, full-link streaming solution of this kind. There are pure streaming solutions and pure interactive query solutions, but users have to stitch the two together themselves, which inevitably increases system complexity; adding an offline data warehouse solution on top makes the complexity problem even worse. What the streaming data warehouse aims to do is deliver high timeliness without further increasing system complexity, keeping the whole architecture simple for both developers and operations staff.

Of course, the streaming data warehouse is the end state; to achieve it, Flink needs matching stream-batch unified storage. Flink itself has a built-in distributed RocksDB as state storage, but that only solves the problem of storing stream state within a single task. The streaming data warehouse needs a table storage service between computing tasks: one task writes data into it, a second task can read it in real time, and a third task can run user queries and analysis against it. Therefore, Flink needs to extend a storage layer that matches its own concept, stepping out from state storage and continuing outward. To this end, the Flink community proposed a new Dynamic Table Storage, a storage solution with stream-table duality.


Flink Dynamic Table (see FLIP-188 for the community discussion) can be understood as a set of stream-batch unified storage that seamlessly plugs into Flink SQL. Originally, Flink could only read and write external tables such as Kafka or HBase; now, with the same Flink SQL syntax, a Dynamic Table can be created just like a source or target table. All the layered data of the streaming data warehouse can be placed in Flink Dynamic Tables, the layers of the entire warehouse can be connected in real time through Flink SQL, real-time queries and analysis can be run against the data of different detail layers, and batch ETL processing can be performed on different layers as well.
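Based on the FLIP-188 discussion, creating and using a Dynamic Table could look roughly like the sketch below. The syntax is illustrative and may change as the proposal evolves; orders_src is a placeholder for any existing source table, such as the CDC table above:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class DynamicTableSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // FLIP-188 direction: a table declared WITHOUT a 'connector'
        // option becomes a managed dynamic table whose storage is
        // provided by Flink itself. Exact syntax may still change.
        tEnv.executeSql(
                "CREATE TABLE dwd_orders ("
              + "  order_id BIGINT,"
              + "  amount   DECIMAL(10, 2),"
              + "  PRIMARY KEY (order_id) NOT ENFORCED"
              + ")");

        // Connect a warehouse layer with an ordinary continuous query...
        tEnv.executeSql(
                "INSERT INTO dwd_orders SELECT order_id, amount FROM orders_src");

        // ...and analyze the same dynamic table with plain Flink SQL.
        tEnv.executeSql("SELECT SUM(amount) FROM dwd_orders").print();
    }
}
```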

In terms of data structure, a Dynamic Table has two core storage components: the File Store and the Log Store. As the names imply, the File Store stores the table's files, using a classic LSM architecture that supports streaming updates, deletes, inserts, and so on; it adopts an open columnar storage structure with compression and other optimizations, corresponds to Flink SQL's batch mode, and supports full batch reads. The Log Store stores the table's operation records as an immutable sequence; it corresponds to Flink SQL's stream mode, allowing the incremental changes of the Dynamic Table to be subscribed to through Flink SQL for real-time analysis, and it currently supports pluggable implementations.
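The stream-table duality can be pictured as the same table answering two kinds of queries, sketched below under the assumption that the dwd_orders table from the previous example is registered in a catalog shared by both environments:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class StreamTableDualitySketch {
    public static void main(String[] args) {
        // Streaming mode: the query subscribes to the table's changelog
        // (served by the Log Store) and keeps emitting incremental updates.
        TableEnvironment streaming = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());
        streaming.executeSql("SELECT order_id, amount FROM dwd_orders").print();

        // Batch mode: the very same SQL performs a one-shot scan over the
        // table's current snapshot (served by the File Store's LSM files).
        TableEnvironment batch = TableEnvironment.create(
                EnvironmentSettings.newInstance().inBatchMode().build());
        batch.executeSql(
                "SELECT order_id, SUM(amount) AS total "
              + "FROM dwd_orders GROUP BY order_id").print();
    }
}
```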

Writes to the File Store are encapsulated in a built-in sink that shields the complexity of writing, while Flink's checkpoint mechanism and exactly-once semantics ensure data consistency.

At present, the first phase of the Dynamic Table implementation has been completed, and the community is discussing further work in this direction. According to the community's plan, the future end state is to turn Dynamic Table into a service, truly forming a Dynamic Table Service with real-time stream-batch unified storage. The Flink community is also discussing operating and releasing Dynamic Table as an independent sub-project of Flink, and it may eventually become fully independent as a general-purpose stream-batch unified storage project. Finally, combining Flink CDC, Flink SQL, and Flink Dynamic Table builds a complete streaming data warehouse with a unified real-time and offline experience. See the demo video below for the entire process and effect.

Although the whole pipeline already works end to end, truly achieving a fully real-time link with sufficient stability requires the community to gradually improve the quality of the implementation, including optimizing Flink SQL for interactive OLAP scenarios, optimizing the performance and consistency of dynamic table storage, and building Dynamic Table Service capabilities, among many other tasks. The streaming data warehouse direction has only just started, and there have been preliminary attempts; in Mo Wen's view, the design is sound, but a series of engineering problems still need to be solved. It is like designing an advanced-process chip or an ARM architecture: many people can produce a design, but actually manufacturing the chip with an acceptable yield is very difficult. The streaming data warehouse will be Flink's most important direction in big data analysis scenarios, and the community will invest heavily in it.

Flink goes beyond computing

Under the general trend of real-time transformation of big data, Flink can do more than one thing.

The industry originally positioned Flink as a stream processor or stream computing engine, but that is not the whole picture. Mo Wen said that Flink is not just computing: you may think of Flink as computing in the narrow sense, but in a broad sense Flink already has storage. "Flink broke through the competition in stream computing by relying on stateful storage, which was its greater advantage over Storm."

Now Flink hopes to go a step further and deliver a solution covering a wider range of real-time problems, and its original storage is no longer sufficient. External storage systems and other engines, however, are not fully aligned with Flink's goals and characteristics and cannot integrate well with it. For example, Flink has been integrated with data lakes such as Hudi and Iceberg, supporting real-time ingestion into the lake and real-time incremental analysis on it. But these scenarios still cannot give full play to Flink's fully real-time advantages, because data lake storage formats are essentially mini-batch, and Flink degrades to mini-batch mode on top of them. This is not the architecture Flink most wants to see or is best suited to, so it naturally needs to develop a storage system matching its stream-batch integration concept.

In Mo Wen's view, a big data computing and analysis engine cannot deliver a data analysis solution with the ultimate experience without the support of a storage technology system built around the same concept. This is similar to how any excellent algorithm needs a matching data structure in order to solve the problem with the best efficiency.

Why is Flink particularly suited to building the streaming data warehouse? It is determined by Flink's concept: Flink's core principle is to solve data processing streaming-first. Streaming is essential for making the data of the entire warehouse flow in real time. Once the data is flowing, the stream-table duality of the accumulated data and Flink's stream-batch analysis capabilities make it possible to analyze data at any point in the flow; whether for second-level short queries or offline ETL analysis, Flink has the corresponding capability. Mo Wen said that the biggest limitation of Flink's stream-batch integration has been the lack of matching storage and data structures in the middle, which makes the scenarios hard to land; once the storage and data structures are filled in, many "chemical reactions" of stream-batch integration will naturally emerge.

Will Flink's self-built data storage system affect existing data storage projects in the big data ecosystem? Mo Wen explained that the Flink community is launching the new stream-batch unified storage technology to better meet its own stream-batch computing needs; it will keep the storage and data protocols open, provide open APIs and SDKs, and there are plans to develop the project independently in the future. In addition, Flink will continue to actively connect with the industry's mainstream storage projects to maintain compatibility and openness toward the external ecosystem.

The boundaries between components of the big data ecosystem are becoming more and more blurred, and Mo Wen believes the current trend is moving from single-component capabilities to integrated solutions. "Everyone is actually following this trend. For example, you can see many database projects that used to be OLTP, later added OLAP, and finally called themselves HTAP; in essence they combine row storage and column storage, supporting both transactions and analytics, in order to give users a complete data analysis experience." Mo Wen further added: "Many systems today are starting to expand their boundaries, from real-time to offline or from offline to real-time, penetrating each other. Otherwise, users have to manually combine various technical components themselves and face all kinds of complexity, and the threshold keeps getting higher. The integration trend is therefore very obvious. There is no right or wrong about who integrates whom; the key is whether you can use a good integration approach to give users the best experience. Whoever achieves that wins the users. For a community to stay vital and keep developing, it is not enough to maximize the field you are already best at; you must keep innovating and breaking boundaries based on user needs and scenarios. The needs of most users are not necessarily in closing the gap of a single capability from 95 to 100 points."

By Mo Wen's estimate, it will take about a year to form a relatively mature streaming data warehouse solution. For users who have already adopted Flink as their real-time computing engine, trying the new streaming data warehouse solution is a natural fit, since the user interface is fully compatible with Flink SQL. Reportedly, the upcoming Flink 1.15 release will include the first preview version, which current Flink users can try first. Mo Wen said that the Flink-based streaming data warehouse has just started, the technical solution needs further iteration, and it will take some time to polish before it matures. He hopes more companies and developers will join the effort with their own needs, for that is the value of an open source community.

Epilogue

Big data has been criticized for years for its sprawling open source components and high architectural complexity. The industry now seems to have reached a consensus of sorts: to push data architectures toward simplification through convergence and integration, although different enterprises use different terms and take different implementation paths.

In Mo Wen's view, a flourishing open source ecosystem is normal. Every technical community has its own areas of expertise, but truly solving business-scenario problems requires a one-stop solution that provides users with an easy-to-use experience. He therefore agrees that the overall trend is toward convergence and integration, but the path is not unique: there may eventually be a dedicated system responsible for integrating all components, or each system may gradually evolve into an integrated one. Which outcome prevails, perhaps only time can tell.
