Flink + Iceberg: How to Construct a Whole-scenario Real-time Data Warehouse

By Su Shu, Senior Engineer of Tencent's Data Platform department; edited by Lu Peijie (Flink community volunteer)

Apache Flink is a widely used unified stream and batch computing engine in the big data field. The data lake is a new technical architecture trending in the cloud era, which has led to the rise of solutions based on Iceberg, Hudi, and Delta. Iceberg currently supports writing data from Flink into Iceberg tables through the DataStream API or Table API and provides integration support for Apache Flink 1.11.x.

This article mainly introduces the real-time data warehouse construction by Tencent's Big Data department based on Apache Flink and Apache Iceberg, as follows:

1) Background and pain points
2) Introduction to Apache Iceberg
3) Real-time data warehouse construction with Flink and Iceberg
4) Future plan

1) Background and Pain Points

Figure 1 shows some of the internal applications our platform supports. Among them, WeChat Applets and WeChat Video generate data at or above the PB or EB level per day or month.

Figure 1

Users of these applications often use the architecture in figure 2 to build their own data analysis platforms.

1.1) Data Platform Architecture

The business side, such as Tencent Kandian or WeChat Video users, usually collects data such as front-end event-tracking data and service log data. The data is then delivered to the data warehouse or a real-time computing engine through message-oriented middleware (Kafka/RocketMQ) or a data synchronization service (Flume/NiFi/DataX).

A data warehouse includes various big data components, such as Hive, HBase, HDFS, S3, and computing engines such as MapReduce, Spark, and Flink. Users build the big data storage and processing platform based on their needs. After processing and analyzing data on the platform, the result data is saved to relational and non-relational databases that support fast queries, such as MySQL and Elasticsearch. Then, users can perform BI report development and user profiling based on the data at the application layer and interactive queries based on OLAP tools such as Presto.

Figure 2

1.2) Lambda Architecture: Pain Points

Some offline scheduling systems are often used in the whole process to perform some Spark analysis tasks and data input, data output, or ETL jobs regularly (T+1 or every few hours). Data latency will inevitably exist throughout the whole offline data processing. The data latency can be relatively large for both data access and intermediate analysis, ranging from hours to days. In other scenarios, real-time processing is also built for real-time requests. For example, Flink and Kafka are often used to build a real-time stream processing system.

Generally, the data warehouse architecture has many components, which significantly increases the complexity of the entire architecture and the cost of O&M.

The following figure shows the lambda architecture that many companies have used or are still using. The lambda architecture divides the data warehouse into an offline layer and a real-time layer. Accordingly, there are two independent data processing procedures: batch processing and stream processing. The same data is processed at least twice, and the same business logic has to be developed twice. Since the lambda architecture is well known, the following focuses on the pain points we encountered when using it to build a data warehouse.

Figure 3

For example, in real-time computing of some user-related indexes, the data is put into the real-time layer for computing in order to see the current PV and UV, and these index values are displayed in real time. But to know the user growth trend, the data for the past day needs to be calculated, which requires batch scheduling tasks. For example, a Spark task is started in the scheduling system at two or three the next morning to run through all of the previous day's data again.
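
As a minimal illustration of the two indexes mentioned above, the sketch below computes PV (page views, the total number of events) and UV (unique visitors, the number of distinct users) over a small in-memory list of events. The event layout and the function name pv_uv are assumptions made for illustration and are not part of the platform described in this article.

```python
from typing import Iterable, Tuple


def pv_uv(events: Iterable[Tuple[str, str]]) -> Tuple[int, int]:
    """Return (PV, UV) for an iterable of (user_id, page) events."""
    pv = 0
    users = set()
    for user_id, _page in events:
        pv += 1             # every event counts as one page view
        users.add(user_id)  # each user counts only once toward UV
    return pv, len(users)


# Hypothetical sample events: (user_id, page)
events = [("u1", "/home"), ("u2", "/home"), ("u1", "/video"), ("u3", "/home")]
pv, uv = pv_uv(events)
```

In a lambda architecture, logic like this ends up implemented twice: once in the real-time layer, which updates the values as events arrive, and once in the batch layer, which recomputes them over the full day.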

Because the two procedures process the same data at different times, their results may be inconsistent. When one or more pieces of data are updated, the entire offline analysis procedure needs to be rerun, which makes data updates costly. Furthermore, maintaining both the offline and real-time computing platforms results in high development and O&M costs.

Therefore, the kappa architecture is developed to solve various problems related to the lambda architecture.

1.3) Kappa Architecture: Pain Points

As figure 4 shows, message queues are used in the middle of the kappa architecture, connecting the entire link through Flink. The kappa architecture solves the problem of high O&M and development costs caused by different engines used at the offline processing layer and real-time processing layer in the lambda architecture. However, there are also some pain points in the kappa architecture.

First, the kappa architecture is typically used to build near-real-time business scenarios. However, to run simple OLAP analysis on a middle layer of the data warehouse, such as the ODS layer, or to process the data further, such as writing it to Kafka in the DWD layer, a separate Flink job has to be connected. Moreover, the complexity of the entire architecture increases when data needs to be exported from Kafka in the DWD layer to ClickHouse, Elasticsearch, MySQL, or Hive for further analysis.
Second, the kappa architecture relies heavily on message queues. The correctness of computation along the entire link depends strictly on the order of the upstream data in the message queues, and the more message queues are chained together, the greater the probability of disorder. Data in the ODS layer is generally accurate, but it may arrive out of order when it is sent to the next Kafka. The same happens when data in the DWD layer is sent to DWS, which can result in severe data inconsistency.
Third, because Kafka is a sequential storage system, users cannot directly apply some OLAP optimization policies to it. For example, predicate push-down is difficult to implement on sequential storage.
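
To make the second pain point concrete, the following minimal sketch shows how a computation that trusts arrival order returns a wrong result once events are reordered between queues. The event layout and both function names are hypothetical. The second function is one common mitigation (comparing event times), not something the kappa architecture provides by itself.

```python
from typing import Dict, Iterable, Tuple

Event = Tuple[str, int, str]  # (key, event_time, value)


def last_write_wins(events: Iterable[Event]) -> Dict[str, str]:
    """Naive: keep whichever value arrives last, trusting arrival order."""
    state: Dict[str, str] = {}
    for key, _ts, value in events:
        state[key] = value
    return state


def latest_by_event_time(events: Iterable[Event]) -> Dict[str, str]:
    """Order-insensitive: keep the value with the largest event time."""
    state: Dict[str, Tuple[int, str]] = {}
    for key, ts, value in events:
        if key not in state or ts > state[key][0]:
            state[key] = (ts, value)
    return {key: value for key, (_ts, value) in state.items()}


in_order = [("user1", 1, "registered"), ("user1", 2, "active")]
reordered = list(reversed(in_order))  # same events, delivered out of order
```

With the reordered input, last_write_wins reports the stale value, while latest_by_event_time still returns the latest one.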

Is there an architecture that can meet both real-time and offline computing requirements and reduce O&M and development costs to resolve some of the pain points in building a kappa architecture through message queues? The answer is yes, as the following sections describe.

Figure 4

1.4) Summary: Pain Points

Traditional T+1 Task

Massive TB-Level T+1 task latency leads to unstable downstream data output time.
The recovery is costly when the task fails.
The data architecture has difficulty in dealing with deduplication and exactly-once semantics.
The architecture is complex, involving the coordination of multiple systems. The scheduling system constructs the dependencies between tasks.

Lambda: Pain Points

Both real-time and offline platform engines are maintained, resulting in high O&M costs.
Two sets of code with different frameworks but the same business logic need to be maintained on the real-time and offline platforms, which is costly.
Data is generated in two different procedures, which may easily cause data inconsistency.
Data update is costly, and the procedure needs to be re-run.

Kappa: Pain Points

The architecture places high demands on message queue storage, and message queues are inferior to offline storage in terms of backfill capability.
Message queues retain data for only a limited time, and users currently cannot analyze the data in message queues directly with an OLAP engine.
End-to-end real-time computing that relies on message queues may return incorrect results because of the time ordering of the data.

Figure 5

1.5) Real-time Data Warehouse Construction: Requirements

Is there a storage technology that can support efficient data backfill and data updating, stream-batch data read and write, and realize data access from minute-level to second-level?

This is also urgently required in the construction of real-time data warehouses (figure 6). In fact, some problems of the kappa architecture can be solved by upgrading it. The following focuses on a popular data lake technology: Iceberg.

Figure 6

2) Apache Iceberg

2.1) What is Iceberg?

Officially, Iceberg is described as follows:

Apache Iceberg is an open table format for huge analytic datasets. Iceberg adds tables to Presto and Spark that use a high-performance format that works just like a SQL table.

Officially, Iceberg is defined as a table format. It can be simply understood as a middle layer between the computing layer (Flink, Spark) and the storage layer (ORC, Parquet, and Avro). Data is written into Iceberg through Flink or Spark, and then the table is accessed through Spark, Flink, Presto, etc.

Figure 7

2.2) Iceberg Table Format

Iceberg, designed to analyze massive data, is defined as a table format. The table format is between the computing and storage layers.

The table format manages the files in the storage system below it and provides corresponding interfaces to the computing layer above it. Files in the storage system are laid out in a specific organizational form. A Hive table, for example, is organized on HDFS into partitions, together with information such as the data storage format, the data compression format, and the HDFS directory where the data is stored. This information is kept in the Metastore, which can therefore also be regarded as a file organization format.

A well-designed file organization format, such as Iceberg, lets the upper computing layer access files on disk more efficiently and perform list, rename, or search operations.

2.3) Summary: Iceberg's Capabilities

Iceberg currently supports three file formats: Parquet, Avro, and ORC. As figure 7 shows, files in HDFS and S3 can be stored in row-based or columnar form, which we will discuss in detail later. The features of Iceberg itself are summarized in figure 8 and listed below. These capabilities are essential for building a real-time data warehouse with Iceberg.

Figure 8

Snapshot-based read-write separation and backfill
Stream-batch unified write and read
No forced binding between the computing and storage engines
ACID semantics and multi-version data
Table, schema, and partition changes

2.4) Iceberg's File Organization Format

The following figure shows the entire file organization format of Iceberg. Examine from top to bottom:

The first is the snapshot. A snapshot in Iceberg is the basic unit of data that a user can read. Accordingly, all the data a user reads from a table at one time belongs to the same snapshot.
The second is the manifest. A snapshot contains multiple manifest files. Figure 9 shows that Snapshot-0 contains two manifest files, and Snapshot-1 contains three manifest files. Each manifest manages one or multiple DataFiles.
The third is the DataFiles. The meta-information of the data is stored in the manifest file, which lists the file paths of the DataFiles line by line.

The figure also shows that Snapshot-1 contains all of the Snapshot-0 data, while the data newly written in Snapshot-1 is contained only in manifest2. This layout provides good support for incremental data reads.
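
The three levels described above can be modeled in a few lines. This is a simplified sketch rather than Iceberg's actual metadata layout: the class names, the manifest names, and the file paths are all hypothetical, and real manifests carry much more information per file.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Manifest:
    name: str
    data_files: Tuple[str, ...]  # paths of the DataFiles this manifest tracks


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    manifests: Tuple[Manifest, ...]

    def data_files(self) -> List[str]:
        """All DataFiles visible in this snapshot."""
        return [path for m in self.manifests for path in m.data_files]


m0 = Manifest("manifest0", ("a.parquet",))
m1 = Manifest("manifest1", ("b.parquet", "c.parquet"))
m2 = Manifest("manifest2", ("d.parquet",))

snapshot0 = Snapshot(0, (m0, m1))      # two manifests, as in figure 9
snapshot1 = Snapshot(1, (m0, m1, m2))  # reuses m0 and m1, adds m2


def added_manifests(old: Snapshot, new: Snapshot) -> List[Manifest]:
    """Manifests present in `new` but not in `old`."""
    old_names = {m.name for m in old.manifests}
    return [m for m in new.manifests if m.name not in old_names]
```

Here added_manifests(snapshot0, snapshot1) returns only manifest2, which is why an incremental reader does not need to rescan the files already covered by Snapshot-0.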

Figure 9

2.5) Iceberg Read-Write Process

Apache Iceberg Read and Write

First, consider a write operation. The S1 dashed box in figure 10 indicates that Snapshot-1 has not been committed yet, which means Snapshot-1 is unreadable: a snapshot can be read only after it is committed. The same applies to Snapshot-2 and Snapshot-3.

Read-write separation is an important feature of Iceberg. Snapshot-4 writing does not affect Snapshot-2 and Snapshot-3 reading at all. This is one of the most important capabilities in building real-time data warehouses.

Figure 10

Similarly, reads can be concurrent. Snapshot S1, S2, and S3 data can be read simultaneously, which makes it possible to go back and read the Snapshot-2 or Snapshot-3 data. A commit operation is performed when the write of Snapshot-4 finishes. Snapshot-4, as the solid box in figure 10 indicates, then becomes readable. By default, a user reading a table reads the snapshot that the Current Snapshot pointer points to, but reads of earlier snapshots remain unaffected.
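
The read and write behavior described above can be sketched with a toy in-memory table. The Table class and its methods are hypothetical and only illustrate the visibility rules: data becomes readable on commit, the default read follows the current-snapshot pointer, and earlier snapshots stay readable.

```python
from typing import Dict, List, Optional


class Table:
    """Toy table: a snapshot is readable only after commit()."""

    def __init__(self) -> None:
        self._committed: Dict[int, List[str]] = {}
        self._current: Optional[int] = None  # the current-snapshot pointer

    def commit(self, snapshot_id: int, new_rows: List[str]) -> None:
        # Each snapshot holds all rows of the previous one plus the new rows.
        previous = [] if self._current is None else self._committed[self._current]
        self._committed[snapshot_id] = previous + new_rows
        self._current = snapshot_id

    def read(self, snapshot_id: Optional[int] = None) -> List[str]:
        sid = self._current if snapshot_id is None else snapshot_id
        if sid not in self._committed:
            raise KeyError(f"snapshot {sid} is not committed")
        return list(self._committed[sid])


table = Table()
table.commit(1, ["r1"])
table.commit(2, ["r2"])
pending_rows = ["r3"]           # being written for snapshot 3, not yet committed
before = table.read()           # readers still see snapshot 2
table.commit(3, pending_rows)   # the commit makes snapshot 3 visible
after = table.read()            # default read now follows the pointer
old = table.read(2)             # earlier snapshots remain readable
```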

Apache Iceberg Incremental Read

This section describes the incremental read operation of Iceberg. At first, a read can be based only on Snapshot-1, which has been committed. Then Snapshot-2 is committed. Each snapshot contains all the data of the previous snapshot, so if the full data is read each time, the cost of reading is very high for the computing engines along the entire pipeline.

If only the data newly added up to the current time is needed, the incremental data from Snapshot-1 to Snapshot-2 can be selected and read based on Iceberg's snapshot backfill mechanism, as the purple parts in figure 11 show.

Figure 11

Similarly, at Snapshot-3 (S3), users can read only the data colored in yellow, or the incremental data from S1 to S3. This incremental read function is already available internally in the Streaming Reader built on the Flink source and is running online. Now, here is the point: since Iceberg already provides read-write separation, concurrent read, and incremental read, what remains for connecting Iceberg with Flink is to implement the Iceberg Sink.
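
A minimal sketch of the incremental read idea, assuming an append-only table in which each snapshot's file set is a superset of the previous one. The snapshot ids and file names are hypothetical, and here S1 to S3 refer to snapshots, not to the S3 object store.

```python
from typing import Dict, List, Set

# snapshot id -> data files visible in that snapshot (cumulative)
snapshots: Dict[int, Set[str]] = {
    1: {"f1"},
    2: {"f1", "f2"},
    3: {"f1", "f2", "f3", "f4"},
}


def incremental_files(from_id: int, to_id: int) -> List[str]:
    """Files added after snapshot `from_id`, up to and including `to_id`."""
    return sorted(snapshots[to_id] - snapshots[from_id])


s1_to_s2 = incremental_files(1, 2)  # only what Snapshot-2 added
s1_to_s3 = incremental_files(1, 3)  # everything added since Snapshot-1
```

Reading s1_to_s2 instead of the full Snapshot-2 avoids rescanning f1, which is the saving the incremental read provides.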

Small File Issues in Real-time Processing

The community has now restructured the Flink Iceberg Sink and implemented the global committer function. Our architecture is consistent with that of the community. The dashed box of figure 12 shows the Flink Iceberg Sink.

The sink consists of multiple IcebergStreamWriters and one IcebergFileCommitter. When upstream data arrives at an IcebergStreamWriter, that writer writes it out as DataFiles.

Figure 12

When each writer completes writing the current batch of small DataFiles, a message will be sent to IcebergFileCommitter to indicate that the DataFiles are ready for commit. When IcebergFileCommitter receives the message, it commits the DataFiles at one time.

The commit operation itself is only a modification of some meta-information, which turns the data already written on disk from invisible to visible. Therefore, Iceberg needs only a single commit operation to complete this transition.
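
The writer and committer interaction can be sketched as follows. This is a single-process simplification: the Committer class, its method names, and the file names are hypothetical, and the "every writer has reported" trigger is an assumption standing in for the coordination that the real sink performs through Flink.

```python
from typing import Dict, List


class Committer:
    """Single committer: collects files from all writers, then commits once."""

    def __init__(self, num_writers: int) -> None:
        self.num_writers = num_writers
        self.pending: Dict[int, List[str]] = {}  # writer id -> files ready
        self.visible: List[str] = []             # files readers can see
        self.commits = 0

    def files_ready(self, writer_id: int, files: List[str]) -> None:
        self.pending[writer_id] = files
        if len(self.pending) == self.num_writers:  # every writer has reported
            self._commit()

    def _commit(self) -> None:
        # One metadata change makes all pending files visible together.
        for writer_id in sorted(self.pending):
            self.visible.extend(self.pending[writer_id])
        self.pending.clear()
        self.commits += 1


committer = Committer(num_writers=3)
committer.files_ready(0, ["w0-1.parquet"])
committer.files_ready(1, ["w1-1.parquet", "w1-2.parquet"])
visible_before = list(committer.visible)  # nothing is visible yet
committer.files_ready(2, ["w2-1.parquet"])
```

Until the last writer reports, the files exist on disk but readers see none of them. After the single commit, all four become visible at once.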

Merging Real-time Small Files

Flink real-time tasks often run in clusters on a long-term basis. Usually, the Iceberg sink is set to commit every 30 seconds or every minute to ensure the timeliness of the data. If a commit is performed every minute, a Flink task that runs for one day produces 1,440 commits, and the number grows much larger if it runs for a month. The shorter the commit interval, the more snapshots are generated, so a running streaming task produces a large number of small files.
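
The arithmetic above is easy to reproduce. In the sketch below, the number of writers and the number of files each writer produces per commit are assumed values chosen only to show how quickly small files accumulate.

```python
def commits_per_day(interval_seconds: int) -> int:
    """Number of commits in 24 hours for a fixed commit interval."""
    return 24 * 60 * 60 // interval_seconds


def files_per_day(interval_seconds: int, writers: int, files_per_writer: int = 1) -> int:
    """Data files produced per day if every writer emits files at every commit."""
    return commits_per_day(interval_seconds) * writers * files_per_writer


one_minute = commits_per_day(60)       # 1,440 commits, as stated above
thirty_seconds = commits_per_day(30)   # twice as many
with_ten_writers = files_per_day(60, writers=10)
```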

If this problem is left unsolved, the Iceberg Sink on the Flink processing engine becomes unusable in practice. We internally implemented an operator called DataCompactionOperator that runs along with the Flink Sink. Each time a commit is completed, it sends a message to the downstream FileScanTaskGen to signal that the commit has been completed.

Figure 13

FileScanTaskGen contains the logic for generating file merging tasks based on the user's configuration or the characteristics of the current disk. It sends the list of files that need to be merged to DataFileRewrite. Because merging files takes a certain amount of time, the files are distributed asynchronously to different rewrite operators.

As mentioned above, the rewritten files require a new snapshot, which means another commit operation for Iceberg. This commit therefore also runs with a parallelism of one, like the commit operation of the sink.

In the current end-to-end procedure, small files are merged through this commit operation. If the commit operation is blocked, the preceding write operation is affected. This problem will be solved later. A design doc has been created in the Iceberg community to discuss the merging-related work.
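
One possible way to generate merge tasks is a greedy grouping of small files up to a target size. The sketch below is an assumption about what such planning could look like, not the actual logic of FileScanTaskGen. The function name, the file names, and the sizes are hypothetical.

```python
from typing import Dict, List


def plan_merge_tasks(files: Dict[str, int], target_size: int) -> List[List[str]]:
    """Group small files (name -> size) into merge tasks of at most target_size.

    Files at or above target_size are left alone. Greedy, not optimal.
    """
    small = sorted((size, name) for name, size in files.items() if size < target_size)
    tasks: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for size, name in small:
        if current and current_size + size > target_size:
            if len(current) > 1:  # rewriting a single file gains nothing
                tasks.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if len(current) > 1:
        tasks.append(current)
    return tasks


# Hypothetical file sizes in MB with a 64 MB target
files = {"a": 10, "b": 20, "c": 30, "d": 50, "e": 200}
tasks = plan_merge_tasks(files, target_size=64)
```

Each returned group can be handed to a separate rewrite operator, and the rewritten files are then committed as a new snapshot.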

3) Real-time Data Warehouse Construction with Flink and Iceberg

3.1) Near-real-time Data Access

As described above, Iceberg supports read-write separation, concurrent read, incremental read, small file merging, and latency from seconds to minutes. Based on these advantages, we try to use Iceberg to build a Flink-based real-time data warehouse architecture that is real-time end to end and unifies stream and batch processing.

As the following figure shows, each commit operation of Iceberg changes the visibility of data, turning it from invisible to visible. In this process, users can implement near-real-time data recording.

Figure 14

3.2) Real-time Data Warehouse: Data Lake Analysis System

Previously, users first needed to ingest the data. For example, they would use a Spark offline scheduling task to pull and extract some data and then write it to a Hive table. The latency of this process is relatively high. With the Iceberg table format, near-real-time data ingestion can be achieved using Flink or Spark streaming.

Based on the above functions, we review the kappa architecture discussed earlier. Despite the pain points of the kappa architecture, is it possible to replace Kafka with Iceberg since Iceberg is an excellent table format supporting Streaming Reader and Streaming Sink?

Iceberg's bottom layer relies on cheap storage like HDFS or S3. In addition, Iceberg supports the Parquet, ORC, and Avro file formats, of which Parquet and ORC are columnar. Columnar storage supports basic OLAP optimizations, so computing can be performed directly in the middle layer. For example, with predicate push-down, the most basic OLAP optimization policy, combined with the Streaming Reader built on Iceberg snapshots, the day-level or hour-level latency of offline tasks can be greatly reduced, turning the system into a near-real-time data lake analysis system.
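
The effect of predicate push-down can be illustrated with per-file minimum and maximum statistics: files whose value range cannot match the filter are skipped without being opened. The FileStats class, the column name ts, and the file names below are hypothetical, and the sketch covers only a simple range predicate.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileStats:
    path: str
    min_ts: int  # smallest value of the filtered column in this file
    max_ts: int  # largest value of the filtered column in this file


def prune(files: List[FileStats], lower: int, upper: int) -> List[str]:
    """Keep only files whose [min_ts, max_ts] range can overlap [lower, upper]."""
    return [f.path for f in files if f.max_ts >= lower and f.min_ts <= upper]


files = [
    FileStats("f1", 0, 99),
    FileStats("f2", 100, 199),
    FileStats("f3", 200, 299),
]
to_scan = prune(files, lower=150, upper=250)  # f1 is skipped entirely
```

This kind of file skipping depends on file-level statistics, which a purely sequential log such as Kafka does not offer.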

Figure 15

In the intermediate processing layer, users can use Presto for some simple queries. As Iceberg supports Streaming Read, Flink can also be directly connected in the middle layer to perform batch processing or stream computing tasks. The intermediate results are further calculated and then output to the downstream.

Replacing Kafka: Advantages and Disadvantages

In general, the main advantages of replacing Kafka with Iceberg are as follows:

Unify stream-batch in the storage layer
Support OLAP in the middle layer
Support efficient data backfill
Lower storage costs

Of course, there are also some disadvantages, such as:

Near real-time rather than completely real-time
Extra work needed to connect with other data systems

Figure 16

Second-level Analysis: Data Lake Acceleration

Iceberg stores all data files in HDFS, but HDFS reads and writes cannot fully meet the requirements of second-level analysis scenarios. Next, we plan to support Alluxio as a cache at the bottom layer of Iceberg and use it to accelerate the data lake. This architecture is part of our future plans and is under construction.

Figure 17

3.3) Best Practices

Real-time Small Files Merging

As figure 18 shows, Tencent has implemented full SQL support for Iceberg. Users can set parameters for merging small files in the table properties, such as the number of snapshots after which merging is triggered. The Flink task that ingests data into the lake can then be started directly with an insert statement. The task keeps running, and the DataFiles are merged automatically in the background.

Figure 18

The following figure shows the data files and the corresponding metadata files in Iceberg. Because the open-source Iceberg Flink Sink in the community does not have the file merging function, the number of files in the related directory surges after even a small streaming task has run for a while.

Figure 19

We can conclude that the number of files can be controlled at a stable level through Iceberg's real-time small file merging.

Flink Real-time Incremental Read

To achieve real-time incremental reads, users can configure them through Iceberg's table properties and specify the snapshot from which consumption starts. Then, each time the Flink task is started, it reads only the data added after that snapshot.
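
The behavior of such a reader can be sketched as a small stateful consumer that remembers the last snapshot it consumed. The IncrementalReader class and its poll method are hypothetical. The sketch also assumes that snapshot ids increase over time, which is a simplification made here to keep the ordering obvious.

```python
from typing import Dict, List, Optional


class IncrementalReader:
    """Remembers the last consumed snapshot and returns only newer rows."""

    def __init__(self, start_snapshot_id: Optional[int] = None) -> None:
        self.last_consumed = start_snapshot_id

    def poll(self, log: Dict[int, List[str]]) -> List[str]:
        # `log` maps snapshot id -> rows added by that snapshot.
        new_ids = sorted(
            sid for sid in log
            if self.last_consumed is None or sid > self.last_consumed
        )
        rows = [row for sid in new_ids for row in log[sid]]
        if new_ids:
            self.last_consumed = new_ids[-1]
        return rows


log = {1: ["a"], 2: ["b", "c"]}
reader = IncrementalReader(start_snapshot_id=1)  # start consuming after snapshot 1
first = reader.poll(log)    # rows added by snapshot 2
log[3] = ["d"]              # a new commit arrives
second = reader.poll(log)   # only the rows of snapshot 3
third = reader.poll(log)    # nothing new
```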

Figure 20

In this example, the file merging feature is enabled. Finally, a Flink Sink task that ingests data into the lake is started with SQL.

File Management with SQL Extension

Currently, users tend to want to solve every task with SQL. Online small file merging is suited only to running Flink tasks. For offline tasks, the number and size of files generated in each commit cycle are not very large.

However, when a user's task has run for a long time, there may be thousands of files in the bottom layer. In this case, merging the files directly online within the real-time task is inappropriate, because it may affect the timeliness of the task. Instead, users can use the SQL extension to merge small files and delete leftover files or expired snapshots.
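
What snapshot expiration has to compute can be sketched as follows: keep the newest snapshots and treat a file as deletable only if no retained snapshot still references it. The function name, the keep_last parameter, and the file names are hypothetical and do not correspond to the syntax of the SQL extension mentioned above.

```python
from typing import Dict, List, Set, Tuple


def expire_snapshots(
    snapshots: Dict[int, Set[str]], keep_last: int
) -> Tuple[Dict[int, Set[str]], List[str]]:
    """Keep the newest `keep_last` snapshots.

    Returns the retained snapshots and the files that no retained
    snapshot references, which are therefore safe to delete.
    """
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    retained_ids = sorted(snapshots)[-keep_last:]
    retained = {sid: snapshots[sid] for sid in retained_ids}
    still_referenced: Set[str] = set().union(*retained.values())
    all_files: Set[str] = set().union(*snapshots.values())
    return retained, sorted(all_files - still_referenced)


# Snapshot 3 is the result of merging the small files s1, s2, and s3.
snaps = {1: {"s1", "s2"}, 2: {"s1", "s2", "s3"}, 3: {"merged"}}
retained, deletable = expire_snapshots(snaps, keep_last=1)
```

Keeping two snapshots instead of one would leave nothing deletable in this example, because Snapshot-2 still references all three small files.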

Internally, we have been managing Iceberg's data and its meta-information on the disk using the SQL extension for a long time. We will continue to add more functions to the SQL extension to improve Iceberg's usability and user experience.

Figure 21

4) Future Plan

Figure 22

4.1) Improve Iceberg's Kernel

Row-level delete: How should data updates be handled when Iceberg is used to build the entire data link? Iceberg currently supports only copy on write for updates, which causes some write amplification. An efficient merge on read for updates is still required to build real-time data processing on the entire link, and this is very important. We will continue to cooperate with the community on this, and Tencent will also develop some practices internally to improve row-level delete.
Improve SQL Extension: We will further improve the SQL extension.
Establish a unified index to accelerate data retrieval: For now, there are no unified indexes available in Iceberg to speed up data retrieval. We are working with the community and have proposed a Bloom filter index. Building a unified index can accelerate file retrieval in Iceberg.
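
For readers unfamiliar with Bloom filters, the sketch below shows the basic idea behind such an index: a compact bit set per file that can answer "definitely not present" or "possibly present" for a key, so files that cannot contain the key are skipped. The class, its parameters, and the hashing scheme are illustrative assumptions and do not reflect the design proposed to the community.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


# One filter per data file, built from the keys stored in that file.
file_filter = BloomFilter()
for key in ["user-1", "user-2"]:
    file_filter.add(key)
```

A query for a key first checks might_contain on each file's filter and opens only the files that return True.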

Given the need to improve the kernel of Iceberg, we prioritize improving the above features first.

4.2) Platform Construction

In terms of platform construction, we will:

First of all, extract and create tables based on automatic schema identification. We hope that the platform can automatically generate the table format based on the data schema in the frontend, which will help users enter the data lake.
Second, facilitate data meta-information management. The current meta-information in Iceberg is stored directly in the Hive metastore. To view the meta-information of the data, users need to execute SQL statements. Thus, we expect to improve the information management in the platform construction.
Third, build a data acceleration layer based on Alluxio. We expect to use Alluxio to build a data lake acceleration layer that enables the upper layer to perform second-level analysis.
Fourth, connect with internal systems. We have many internal systems for real-time offline analysis. Therefore, it is necessary to connect the entire platform and the internal systems.