More Than Computing: A New Era Led by the Warehouse Architecture of Apache Flink

Mowen discusses the future of Apache Flink regarding its core capabilities of stream computing and improving the processing standards of the entire industry.

By Cai Fangfang

Interviewee: Wang Feng (Mowen)

Wikipedia describes Apache Flink as follows: "Flink does not provide its own data storage system but provides data-source and sink connectors to systems, such as Amazon Kinesis, Apache Kafka, Alluxio, HDFS, Apache Cassandra, and Elasticsearch." Soon, the first half of this sentence may no longer apply.

At the beginning of 2021, in its annual technology trend outlook, the InfoQ editorial department predicted that big data would accelerate toward convergence (or integration), with the aim of reducing the technical complexity and cost of big data analysis while meeting higher requirements for performance and ease of use. The popular stream processing engine Apache Flink (Flink) is now taking another step along this trend.

Flink Forward Asia 2021 was held online on the morning of January 8, 2022. This was the fourth year that Flink Forward Asia (FFA) was held in China and the seventh year since Flink became a top-level project of the Apache Software Foundation. As the real-time trend has developed and deepened, Flink has become the leading player and de facto standard in stream processing. Looking back on its evolution, Flink has, on the one hand, continued to optimize its core stream computing capabilities and raise the processing standards of the entire industry. On the other hand, it has promoted architectural transformation and new application scenarios around the idea of stream-batch integration. Beyond these, however, Flink also needs a new breakthrough for its long-term development.

In the keynote speech of Flink Forward Asia 2021, Wang Feng (Mowen), the founder of the Apache Flink Chinese community and the head of Alibaba's open-source big data platform, highlighted the latest progress in the evolution and adoption of Flink's integrated stream and batch architecture. He also proposed the next development direction for Flink: the Streaming Warehouse (Streamhouse). As the keynote title "Flink Next, Beyond Stream Processing" suggests, Flink will move from Stream Processing to Streaming Warehouse to cover more scenarios and help developers solve more problems. To achieve the goal of a streaming data warehouse, the Flink community needs to add data storage that suits stream-batch integration. This was a technical innovation of 2021. The related community work started in October 2021, and it will be the key direction in the coming year.

How should we understand the streaming data warehouse? What problems in the existing data architecture does it solve? Why did Flink choose this direction? What will the implementation path of a streaming data warehouse look like? With these questions in mind, InfoQ conducted an exclusive interview with Mowen to understand the thinking behind the streaming data warehouse.

In recent years, Flink has consistently emphasized the integration of stream and batch, which means using the same set of APIs and the same development paradigm to implement both stream computing and batch computing of big data, thus ensuring consistent processing and consistent results. Mowen said the integration of stream and batch is more of a technical concept and capability. By itself, it does not solve any user problem. Only when it is applied to actual business scenarios does its value in development efficiency and operational efficiency become visible. The streaming data warehouse, by contrast, can be understood as the overall approach to putting stream-batch integration into practice.
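The consistency argument above can be illustrated with a minimal Python sketch. It does not use any Flink API; `running_totals`, `run_batch`, and `run_stream` are hypothetical names, and the `(category, amount)` record shape is an assumption. The point is only that when the business logic is written once, a batch run and a streaming run over the same input must end with the same result.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Order = Tuple[str, int]  # (category, amount); an assumed record shape


def running_totals(orders: Iterable[Order]) -> Iterator[Dict[str, int]]:
    """The single piece of business logic: yield the totals after each order."""
    totals: Dict[str, int] = {}
    for category, amount in orders:
        totals[category] = totals.get(category, 0) + amount
        yield dict(totals)


def run_batch(orders: Iterable[Order]) -> Dict[str, int]:
    """Batch execution: consume the bounded input, keep only the final result."""
    result: Dict[str, int] = {}
    for result in running_totals(orders):
        pass
    return result


def run_stream(orders: Iterable[Order]) -> List[Dict[str, int]]:
    """Streaming execution: emit an updated result after every record."""
    return list(running_totals(orders))


orders = [("books", 10), ("toys", 5), ("books", 7)]
batch_result = run_batch(orders)
stream_updates = run_stream(orders)

# The last streaming update equals the batch result because the logic is shared.
print(batch_result == stream_updates[-1])
```

With two separate implementations (one per technology stack), nothing would enforce that equality, which is the consistency problem the shared API is meant to avoid.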

Two Application Scenarios of Stream-Batch Integration

At FFA 2020, we saw stream-batch integration applied in Tmall Double 11. It was the first time Alibaba had applied stream-batch integration in its core data business. One year later, Flink has made new progress in both the evolution of its technical architecture and its practical application.

In terms of technical evolution, the transformation of Flink's stream-batch integrated API and architecture has been completed. Building on the existing stream-batch integrated SQL, the DataStream and DataSet APIs have been further unified into a complete stream-batch integrated API at the Java semantics level. At the architecture level, a single set of code can work with both stream storage and batch storage.

Flink 1.14 (released in October 2021) supports mixing bounded and unbounded streams in the same application. Flink now supports checkpoints for applications that are partly running and partly finished, that is, applications in which some operators have already reached the end of their bounded input stream. In addition, Flink triggers a final checkpoint when it reaches the end of a bounded data stream to ensure that all computing results are committed to the sink.

The batch execution mode now supports combining the DataStream API and the SQL/Table API in the same application. Previously, an application in batch mode could use only the DataStream API or only the SQL/Table API.

In addition, Flink has updated the unified source and sink APIs and started to consolidate the connector ecosystem around them. The new hybrid source can transition between multiple storage systems, enabling operations such as reading old data from Amazon S3 and then switching to Apache Kafka.
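To make the hybrid source behavior concrete, here is a minimal Python sketch. It is not Flink's hybrid source API: `hybrid_source` is a hypothetical function, a list stands in for archived files on Amazon S3, a generator stands in for a Kafka topic, and the `(offset, payload)` record shape is an assumption. It shows only the property described above: the bounded source is drained first, and reading then continues from the unbounded source at a known position.

```python
from itertools import count, islice
from typing import Iterable, Iterator, Tuple

Record = Tuple[int, str]  # (offset, payload); an assumed record shape


def hybrid_source(historical: Iterable[Record],
                  live: Iterable[Record]) -> Iterator[Record]:
    """Yield every historical record, then continue with live records
    whose offset is greater than the last historical offset."""
    last_offset = -1
    for offset, payload in historical:      # bounded phase
        last_offset = offset
        yield offset, payload
    for offset, payload in live:            # unbounded phase
        if offset > last_offset:            # skip what history already covered
            yield offset, payload


# Hypothetical data: archived offsets 0..2, live stream starting at offset 2.
archive = [(0, "a"), (1, "b"), (2, "c")]
live_stream = ((i, f"live-{i}") for i in count(2))   # never ends

# Take only the first five records, since the live part is unbounded.
first_five = list(islice(hybrid_source(archive, live_stream), 5))
print(first_five)
```

The overlap at offset 2 is deliberate: the switch position has to be derived from where the historical data ended, otherwise records would be duplicated or lost at the boundary.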

There are two important application scenarios at the application level:

The first one is integrated full and incremental data integration based on Flink CDC.

Many teams need data integration, that is, data synchronization between different data sources. However, traditional solutions are complex and not timely. They usually use two separate technology stacks, one for offline data integration and one for real-time data integration, and involve many data synchronization tools, such as Sqoop and DataX. Each of these tools can handle only full or only incremental synchronization. Developers have to control the switch between the full and incremental tasks themselves, which is complicated to coordinate.

With the stream-batch integration capability and Flink CDC, a single SQL statement can synchronize the historical data and then automatically continue with the incremental data, achieving one-stop data integration. Flink switches between batch and stream and ensures data consistency without any user judgment or intervention.
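The switch from full to incremental synchronization can be modeled in a few lines of Python. This is a deliberately simplified sketch and not how Flink CDC is implemented: `sync` and `apply_change` are hypothetical names, a dictionary stands in for a table, a list stands in for the database binlog, and the snapshot is assumed to be taken at one known binlog position.

```python
from typing import Dict, List, Tuple

# A change event: (binlog position, operation, key, value); shape is assumed.
Change = Tuple[int, str, int, str]


def apply_change(table: Dict[int, str], change: Change) -> None:
    """Apply one insert/update/delete event to a key-value table."""
    _, op, key, value = change
    if op == "delete":
        table.pop(key, None)
    else:                       # "insert" and "update" are both upserts
        table[key] = value


def sync(snapshot: Dict[int, str], snapshot_pos: int,
         binlog: List[Change]) -> Dict[int, str]:
    """Full phase: copy the snapshot. Incremental phase: replay only the
    changes recorded after the position at which the snapshot was taken."""
    target = dict(snapshot)                    # full (batch) phase
    for change in binlog:                      # incremental (stream) phase
        if change[0] > snapshot_pos:
            apply_change(target, change)
    return target


# Hypothetical source: the snapshot already reflects binlog positions 1..2.
binlog = [
    (1, "insert", 1, "alice"),
    (2, "insert", 2, "bob"),
    (3, "update", 2, "bobby"),
    (4, "delete", 1, ""),
    (5, "insert", 3, "carol"),
]
snapshot_at_2 = {1: "alice", 2: "bob"}

target = sync(snapshot_at_2, snapshot_pos=2, binlog=binlog)
print(target)
```

The consistency guarantee comes from tying the two phases to the same position: replaying the whole binlog from an empty table yields the same target, so nothing is applied twice and nothing is skipped at the handover.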

As an independent open-source project, Flink CDC Connectors has developed rapidly since it was open-sourced in July 2021, releasing a new version every two months on average. Flink CDC has now reached version 2.1. Many mainstream databases are already supported, such as MySQL, PostgreSQL, MongoDB, and Oracle, and support for more databases, such as TiDB and DB2, is in progress. More and more enterprises are using Flink CDC in their business scenarios. XTransfer, which InfoQ interviewed recently, is one of them.

The second application scenario is the core data warehouse scenario of big data.

The mainstream data warehouse architecture that combines real-time and offline processing is shown in the following figure:

In most scenarios, Flink+Kafka is used to process real-time data streams, which form the real-time data warehouse. The final analysis results are written to an online service layer for display or further analysis. At the same time, an asynchronous offline data warehouse runs in the background to supplement the real-time data, either by running large-scale batch or full analysis every day or by periodically correcting historical data.

However, this classic architecture has some problems. First, the real-time pipeline and the offline pipeline use different technology stacks with two sets of APIs. Two development processes are therefore required, which increases the development cost. Second, because the real-time and offline technology stacks differ, consistent data definitions and metrics cannot be guaranteed. Third, the data in the intermediate queues of the real-time pipeline is hard to analyze. If users want to analyze the data of a detail layer in the real-time pipeline, it is inconvenient. Many users first export the data of that layer, for example, by importing it into Hive for offline analysis, but this reduces timeliness. Others import the data into another OLAP engine to accelerate queries, but this increases system complexity and makes data consistency difficult to ensure.

Flink's stream-batch integration can be fully applied in these scenarios. In Mowen's view, Flink can advance the mainstream data warehouse architecture in the industry to achieve real-time analysis across the entire end-to-end pipeline. It can capture changes as soon as data changes at the source and support layer-by-layer analysis, so that all data flows and all data can be queried in real-time. With the complete stream-batch integration capability of Flink, the same set of APIs can also support flexible offline analysis. As a result, real-time analysis, offline analysis, interactive query and analysis, and short query analysis can be unified into one complete solution: an ideal Streaming Warehouse.

Understanding Streaming Warehouses

To be precise, a streaming warehouse makes the data warehouse streaming: the data of the entire warehouse flows in real-time in a true streaming mode rather than a mini-batch mode. The goal is a Streaming Service with end-to-end real-time performance, in which one set of APIs is used to analyze all flowing data. When the source data changes (for example, when a log entry of an online service or a Binlog entry of a database is captured), the data is analyzed according to the query logic or data processing logic defined in advance, and the result lands in a certain layer of the data warehouse. It then flows from that layer to the next, so that every layer of the data warehouse flows, and finally reaches an online system. Users can see the fully real-time flow of the entire data warehouse. In this process, the data is active, the query is passive, and the analysis is driven by changes in the data. At the same time, in the vertical direction, users can query each detail layer and obtain results in real-time. The architecture is also compatible with offline analysis scenarios using the same set of APIs, which achieves true integration.
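The "data is active, query is passive" idea can be sketched as a chain of layers in which every write is pushed downstream immediately. The Python below is only an illustrative model under stated assumptions, not a Flink program: the `Layer` and `TotalsLayer` classes, the `clean` function, the layer names `ods`, `dwd`, and `dws`, and the `(user, amount)` record shape are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

Event = Tuple[str, int]  # (user, amount); an assumed record shape
Transform = Callable[[Event], Optional[Event]]


class Layer:
    """One warehouse layer: stores rows and pushes each new row downstream."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.rows: List[Event] = []
        self.downstream: List[Tuple[Transform, "Layer"]] = []

    def connect(self, transform: Transform, target: "Layer") -> None:
        self.downstream.append((transform, target))

    def write(self, event: Event) -> None:
        self.rows.append(event)                 # queryable immediately
        for transform, target in self.downstream:
            out = transform(event)              # logic defined in advance
            if out is not None:
                target.write(out)               # the change keeps flowing


class TotalsLayer(Layer):
    """Summary layer: also keeps a running total per user."""

    def __init__(self, name: str) -> None:
        super().__init__(name)
        self.totals: Dict[str, int] = defaultdict(int)

    def write(self, event: Event) -> None:
        user, amount = event
        self.totals[user] += amount
        super().write(event)


def clean(event: Event) -> Optional[Event]:
    """Normalize the user name and drop non-positive amounts."""
    user, amount = event
    return (user.strip().lower(), amount) if amount > 0 else None


ods, dwd, dws = Layer("ods"), Layer("dwd"), TotalsLayer("dws")
ods.connect(clean, dwd)
dwd.connect(lambda e: e, dws)

for event in [(" Ann", 5), ("bob", -1), ("ann ", 7)]:
    ods.write(event)            # one source change drives every layer

print(len(ods.rows), dwd.rows, dict(dws.totals))
```

No scheduler or periodic job appears anywhere in the sketch: each layer is up to date as soon as the source write returns, and any layer can be inspected at any time, which mirrors the horizontal flow and vertical query described above.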

Currently, the industry has no mature solution for such an end-to-end, fully streaming pipeline. There are solutions for streaming and solutions for interactive queries, but users have to combine the two themselves, which increases system complexity. If an offline data warehouse solution is added, the complexity becomes even greater. The streaming warehouse needs to achieve high timeliness without increasing system complexity, so that the entire architecture stays simple for development and O&M personnel.

The streaming warehouse is the end state. To reach it, Flink needs matching stream-batch integrated storage. Flink already has built-in distributed state storage based on RocksDB. However, this storage can only store the state of streaming data within a task. A streaming warehouse needs a table storage service between computing tasks: when the first task writes data, the second task can read it in real-time, and a third task can run user queries and analysis on it. Flink therefore needs to add storage that matches its concept, stepping beyond state storage and continuing outward. To this end, the Flink community has proposed a new Dynamic Table Storage, a storage scheme with stream-table duality.

Integrated Stream-Batch Storage: Flink Dynamic Table

Flink Dynamic Table (see FLIP-188 for the community discussion) can be understood as stream-batch integrated storage that is seamlessly connected to Flink SQL. Previously, Flink could only read and write external tables, such as those in Kafka and HBase. Now, the same Flink SQL syntax can be used to create a dynamic table, just like creating source tables and destination tables. The layered data of a streaming warehouse can be stored in Flink Dynamic Tables, and Flink SQL can connect the layers of the entire data warehouse in real-time. You can query and analyze the data of different detail layers in Dynamic Table in real-time and also perform batch ETL processing on different layers.

Dynamic Table has two core storage components: File Store and Log Store. File Store stores table data using the classic LSM architecture and supports streaming updates, deletions, and insertions. It uses an open columnar storage format and supports optimizations such as compression. It corresponds to the batch mode of Flink SQL and supports full batch reads. Log Store stores the change records of a table as an immutable sequence, corresponding to the streaming mode of Flink SQL. You can use Flink SQL to subscribe to the incremental changes of a dynamic table for real-time analysis. Currently, pluggable implementations are supported.
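The relationship between the two components can be illustrated with a small Python model of stream-table duality. It is a sketch under stated assumptions and not the FLIP-188 design: `DynamicTableSketch`, `replay`, and the `+U`/`-D` operation codes are hypothetical, a dictionary stands in for the LSM-based File Store, and a list stands in for the Log Store.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

# Change record: (sequence number, op, key, value); "+U" upsert, "-D" delete.
LogEntry = Tuple[int, str, str, Optional[int]]


class DynamicTableSketch:
    """Every write updates the table view and appends to the changelog."""

    def __init__(self) -> None:
        self.file_store: Dict[str, int] = {}     # latest value per key
        self.log_store: List[LogEntry] = []      # append-only change records

    def upsert(self, key: str, value: int) -> None:
        self.file_store[key] = value
        self.log_store.append((len(self.log_store), "+U", key, value))

    def delete(self, key: str) -> None:
        if key in self.file_store:
            del self.file_store[key]
            self.log_store.append((len(self.log_store), "-D", key, None))

    def scan(self) -> Dict[str, int]:
        """Batch-style read: a full copy of the current table."""
        return dict(self.file_store)

    def subscribe(self, from_seq: int = 0) -> Iterator[LogEntry]:
        """Stream-style read: change records starting at a sequence number."""
        yield from self.log_store[from_seq:]


def replay(entries: Iterable[LogEntry]) -> Dict[str, int]:
    """Rebuild the table from its changelog."""
    state: Dict[str, int] = {}
    for _, op, key, value in entries:
        if op == "+U" and value is not None:
            state[key] = value
        else:
            state.pop(key, None)
    return state


table = DynamicTableSketch()
table.upsert("a", 1)
table.upsert("b", 2)
table.upsert("a", 3)
table.delete("b")

full_read = table.scan()
incremental = list(table.subscribe(from_seq=2))
print(full_read, incremental)
```

Replaying the complete changelog reproduces the full read, which is the duality in miniature: the table is the materialized result of the log, and the log is the history of the table. A batch job would use `scan`, while a downstream streaming job would use `subscribe` from the position it last processed.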

Writes to the File Store are encapsulated in the built-in sink, which hides the complexity of writing. Flink's checkpoint mechanism and exactly-once mechanism ensure data consistency.

The first phase of the Dynamic Table implementation has been completed, and the community is continuing to discuss this direction. According to the community plan, the end state is to turn Dynamic Table into a service, forming a true Dynamic Table service with fully real-time stream-batch integrated storage. The Flink community is also discussing operating and releasing Dynamic Table as an independent sub-project of Flink, which does not rule out later developing Dynamic Table into a general-purpose stream-batch integrated storage project. Ultimately, Flink CDC, Flink SQL, and Flink Dynamic Table can be combined to build a complete streaming data warehouse that delivers an integrated real-time and offline experience.

Although the end-to-end flow now works in a preliminary form, the community still needs to improve the quality of the implementation before the pipeline is fully real-time and stable enough. This includes optimizing Flink SQL for OLAP interactive scenarios, optimizing the performance and consistency of dynamic table storage, and building dynamic table service capabilities. The streaming data warehouse project has just started, and only a preliminary attempt has been made. In Mowen's view, there is no problem with the design, but a series of engineering problems remain to be solved. This is like designing an advanced process chip or ARM architecture: many people can design it, but it is difficult to manufacture the chip while ensuring quality. The streaming warehouse will be the most important direction for Flink in big data analysis scenarios, and the community will invest heavily in it.

Flink: More Than Computing

Flink can do more things under the general trend of real-time transformation of big data.

The industry used to position Flink as a stream processor or stream computing engine, but that is not the whole picture. Mowen said that Flink is not just computing. People may think of Flink as computing in a narrow sense, but in a broad sense, Flink already has storage. "Flink broke through thanks to the stateful storage behind its stream computing, which is a major advantage over Storm."

Now, Flink hopes to go further and deliver a solution that covers a wider range of real-time problems, and the original storage is not enough for that. At the same time, external storage systems and other engines are not fully aligned with Flink's goals and features and cannot be integrated well with Flink. For example, Flink is integrated with data lakes, including Hudi and Iceberg, and supports real-time incremental ingestion into the lake and analysis of it. However, these scenarios still cannot fully realize the real-time advantages of Flink, because the data lake storage format is still mini-batch and Flink degenerates into that mode as well. This is not the architecture that Flink hopes to see, nor the one best suited to it, so Flink needs to develop a storage system that matches its stream-batch integration concept.

In Mowen's view, a big data computing and analysis engine cannot deliver the best possible data analysis experience without a matching storage technology behind it. This is similar to the fact that any good algorithm needs a matching data structure to solve a problem with the best efficiency.

Why is Flink more suitable for streaming warehouses? This is determined by Flink's core concept, which is to solve data processing problems with streaming first. If you want the data of the entire data warehouse to flow in real-time, streaming is essential. Once all the data flows, Flink can analyze the data at any stage of the flow, whether through second-level short queries or offline ETL analysis. Mowen said that the biggest limitation of stream-batch integration has been the lack of a matching storage data structure in between, which makes scenarios hard to put into practice. Once the storage and data structure are added, stream-batch integration will produce many new "chemical reactions."

Will Flink's self-built data storage system affect existing data storage projects in the big data ecosystem? Mowen explained that the Flink community has introduced the new stream-batch integrated storage technology to meet Flink's own requirements. The project will keep its storage and data protocols open and will provide open APIs and SDKs. There are also plans to develop it as an independent project in the future. Flink will continue to connect with mainstream storage projects in the industry to remain compatible with and open to external ecosystems.

The boundaries between different components of the big data ecosystem are becoming blurred. Mowen believes that the current trend is to move from single-component capabilities to integrated solutions. "Everyone is following this trend. For example, look at the many database projects in which OLTP systems added OLAP and ended up being called HTAP. They combine row storage and column storage and support both serving and analysis to provide users with a complete data analysis experience." Mowen added, "At present, many systems are beginning to expand their boundaries, from real-time to offline or from offline to real-time, and are moving into each other's territory. Otherwise, users would need to manually combine various technical components and face all kinds of complexity issues, and the barrier to entry would keep rising. Therefore, the trend toward integration is obvious. There are no right or wrong combinations. The key is whether a good integration can provide users with the best experience. Whoever achieves that will win the users. To stay vital and develop sustainably, it is not enough for a community to achieve the ultimate in the areas where it is best. It must innovate and break through boundaries based on user needs and scenarios. What most users need is not for a single capability to improve from 95 to 100."

Mowen estimates that it will take about a year to form a mature streaming warehouse solution. Users who have already adopted Flink as their real-time computing engine are well placed to try the new streaming warehouse solution, because the user interface is fully compatible with Flink SQL. It was revealed that the first preview version will be released with Flink 1.15, and Flink users can try it first. Mowen said that the Flink-based streaming warehouse has just started, and the technical solution needs further iteration and polishing before it matures. He hopes that more enterprises and developers will take part in building it together. This is the value of the open-source community.

Summary

The open-source big data ecosystem has been criticized for many years for having too many components and high architectural complexity. Although different enterprises have different opinions and implementation paths, the industry seems to have reached a certain consensus: data architecture should evolve toward simplification through integration.

In Mowen's view, it is normal for an open-source ecosystem to be diverse and flourishing. Each technical community has its areas of expertise. However, to solve problems in business scenarios, you still need a comprehensive solution that gives users a simple and easy-to-use experience. Therefore, he agrees that the overall trend is toward integration, but there is more than one possible outcome. In the future, there may be a dedicated system that integrates all components, or each system may evolve into an integrated system. Perhaps only time can give us the final answer.
