Data Lake for Stream Computing: The Evolution of Apache Paimon

Uncover the advancements from Apache Hive to Hudi and Iceberg in stream computing, as our expert navigates the transformative landscape of real-time data lakes.

Author: Jingsong Lee, Alibaba Cloud, Staff Engineer, PMC Chair of Apache Paimon, PMC member of Apache Flink

Introduction

I have long worked in the field of distributed computing and storage, with contributions to multiple open-source projects. In this article, I will review how the scenarios of stream computing gradually expanded over the years and introduce the evolution of Apache Paimon.

(The closer to vertexes, the better)
I referenced this figure, taken from Napa: Powering Scalable Data Warehousing with Robust Query Performance at Google, in an article published long ago. The figure illustrates the observation that big data systems are always products of trade-offs (not to mention the factor of how much development work is needed). There is no silver bullet, only solutions tailored to specific scenarios. This also holds true for the solutions described in this article.
(The article is based on past experience of the author, and therefore may not be free from personal opinions and errors.)

Scenario 1 for Stream Computing and Storage: Real-time Data Preprocessing

About a decade ago, Apache Storm was open-sourced. This ingenious stream computing system consists of only 8,000 lines of Clojure code and cleverly implements at-least-once delivery through its ACK mechanism. With Storm, users can create a functional data stream pipeline without worrying about data loss. However, as businesses expanded, users started paying more attention to exactly-once guarantees and SQL API support.
Before distributed stream computing was devised, people tried to implement real-time computing by scheduling batch tasks. One example was Apache Hive, which enabled minute-level scheduling but could not achieve shorter latency because of the high cost of scheduling and of starting and stopping processes. Another example was Apache Spark. Building on its outstanding design of resilient distributed datasets (RDDs), Spark provides Spark Streaming, which uses persistent processes and mini-batch processing to deliver stream computing capabilities on top of batch processing with an exactly-once guarantee. However, because of its high scheduling overhead, it is not truly a stream processing engine, and its latency cannot go below the minute level.
In 2014, Apache Flink was launched. With built-in state storage and global, consistent snapshots, it can offer an exactly-once guarantee in early-fire mode and strike a balance between data consistency and low latency.
Also in 2014, I joined Alibaba's Galaxy team (Galaxy is a stream computing architecture similar to Apache Flink), which specialized in the stream computing business within the company. Specifically, I handled the push of data streams to Tair, Alibaba's in-house KV store engine. Our stream computing solution back then was a combination of a stream computing engine and a real-time KV store.

This solution is centered around multi-dimensional data preprocessing. After data is preprocessed, it is written to a simple KV store like HBase or Redis or a small-sized relational database. What the stream computing engine does in the solution is process stream data, manage the data states, and push the data out to the KV store.
The solution has its pros and cons:

Pros: fast response. All the data needed by the application is prepared in advance and can be queried from a KV store or a relational database at millisecond-level latency.
Cons: low flexibility and high development cost. The solution is limited by dataset size when data is stored in a relational database or by the data storage format when a KV store is used. What's more, a completely new process must be developed each time a new business scenario arises.
In summary, this solution, with a KV store or a relational database in place, can achieve semi-exactly-once delivery while ensuring real-time data update in seconds. It is currently a common choice for real-time data warehouses, and the go-to choice for organizations' pivotal business.

Nevertheless, is there a solution that is more flexible, relieving users of the need to create a new process for every new business scenario?

Scenario 2 for Stream Processing and Storage: Real-time Data Warehousing

Developed by Yandex, ClickHouse was open-sourced in 2016. Though designed as a standalone OLAP engine, it can also serve as the building block of a distributed OLAP system. One of its major advantages is that it delivers outstanding query performance thanks to the use of vectorized computing. This extends its use cases beyond predefined computing, since users can put their business data in ClickHouse databases and perform queries based on their needs.
The popularity of ClickHouse inspired many excellent OLAP projects in China, such as Doris, StarRocks, and Hologres (by Alibaba Cloud). At Alibaba Group, the teams for JStorm, Galaxy, and Blink were merged to concentrate efforts on Blink. Developed on top of Flink, Blink has an advanced architecture, high-quality code, and a vibrant community.
The combination of Flink and Hologres gave birth to a new pipeline: Flink preprocesses the data without finalizing the business data and then passes the preprocessed data to Hologres, which in turn stores the data and serves high-performance queries to data consumers. This solution delivers higher flexibility than the KV store solution.

It has the following advantages:

Fast query speed: Vectorized computing together with SSDs delivers millisecond-level responses.
High flexibility: OLAP systems store data in schemas, allowing users to choose the optimal query mode based on actual business scenarios. Since OLAP databases are commonly better at aggregation operations than joins in distributed systems, wide tables are preferred in this case to ensure query performance.
However, this architecture is not without flaws. The hefty price tag of OLAP storage is one of them. Though OLAP providers have been working to reduce the cost through technologies like storage-compute decoupling, it is still a costly solution since it is expected to ensure real-time responses of queries.

The trade-off in this solution is as follows:

Flink does more than perform joins. It also performs certain ETL processing to control the amount of data stored in the OLAP system, which incurs cost.
OLAP systems cannot keep all the historical data. Storing data from the past 7 days and removing earlier data based on TTL rules is common practice.
Thus, users need to choose between putting heavier preprocessing workloads on the stream engine and storing more data in the OLAP database. The right decision depends on the specific business scenario.

Is there a solution that lets us store all data at affordable cost?

Scenario 3 for Stream Processing and Storage: Real-time data lakehousing

Real-time with Apache Hive

People have been looking for a storage system that is less costly than OLAP systems, so that they can keep all the data in the storage and perform ad-hoc queries, even though the responses may not be as fast.
I started my work in the Blink team focusing on its batch processing capabilities. As time went by, I started to look beyond integrating the compute modes for stream and batch processing, so as to deliver higher business value.
Therefore, my colleagues and I started the work on a Flink + Hive solution.

The Hive streaming sink supports the Parquet, ORC, and CSV formats, provides an exactly-once guarantee, and supports partition commit, allowing log data to be written into data warehouses in streams. Here are the pros and cons of this solution:

Pros: Data is queryable at near-real-time latency. The actual latency depends on the checkpoint interval and is mostly a matter of minutes. This is also a cost-effective solution, so most raw data can be retained, which delivers higher flexibility.
Cons: Data is stored in columnar format on economical servers and disks, leading to slow data reads and lackluster query performance.

Real-time with Apache Iceberg

As Snowflake and Databricks gained popularity, data lakes have been taking the place of traditional Hive-based data warehouses.
A typical data lake used in stream computing scenarios is Apache Iceberg. It provides the following advantages over Hive:

ACID guarantee: Iceberg delivers better data integrity, giving users more freedom to modify data with commands like INSERT INTO, DELETE, UPDATE, and MERGE INTO. With Hive, they have to perform INSERT OVERWRITE on the entire partition if they wish to adjust the data without taking risks.
Scalable metadata management
1. Listing files on object storage is slow. With Iceberg's metadata management, however, the performance bottleneck caused by file listing can be circumvented, ensuring a smooth experience while retaining the cost-efficiency of tiered object storage.
2. Hive Metastore (HMS) is subject to availability issues caused by large numbers of partitions. This is not an issue in Iceberg.
3. Data skipping based on metadata improves data freshness and query speeds (thanks to filtering based on order key) in batch-processing data warehouses. This feature also brings more possibilities for index acceleration.
Smoother and easier read/write in streams.
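To make the data-skipping idea above concrete, here is a minimal Python sketch. It assumes that the table metadata records the minimum and maximum value of a sort column for every data file; the names DataFile and files_to_scan and the file paths are hypothetical and do not reflect Iceberg's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical metadata entry: a file path plus min/max of one sort column."""
    path: str
    min_key: int
    max_key: int


def files_to_scan(manifest, lower, upper):
    """Keep only files whose [min_key, max_key] range overlaps [lower, upper]."""
    return [f.path for f in manifest if f.max_key >= lower and f.min_key <= upper]


manifest = [
    DataFile("part-0.parquet", 0, 99),
    DataFile("part-1.parquet", 100, 199),
    DataFile("part-2.parquet", 200, 299),
]

# A filter such as "key BETWEEN 150 AND 210" never has to open part-0.
selected = files_to_scan(manifest, 150, 210)
print(selected)
```

Because the decision is made from metadata alone, no file listing or file open is needed for the skipped files, which is where the savings on object storage come from.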
In light of the preceding advantages, my colleagues and I contributed to the Iceberg community by integrating Iceberg into a Flink pipeline:

In this solution, data can be ingested into the data lake in real time, and data managed by Iceberg can be read in real time. This allows for higher flexibility than Hive-based data warehouses.
However, the solution has the following drawbacks in UPSERT operations:

Moving change data to data warehouses has been a major headache for batch-processing data warehouses. The traditional solution of using full table partitions and incremental table partitions is costly, compute-heavy, and complex.
Change data generated during stream computing cannot be properly handled.

Change Data Capture to Data Warehouses: Full Data and Incremental data

Traditional Hive-based data warehouses use a combination of full tables and incremental tables to handle incoming change data, as shown in the following figure.

Procedure:

After 24:00, the system creates a partition of the incremental data based on the binary logs of the past day.
The incremental partition of the past day is merged with the full table partition of the day before yesterday, generating the full table partition of the past day.
The full table partition is used for queries on all data, and the incremental table partition is used for queries only on data of that particular day.
It's not hard to notice that this is a very costly solution, especially in the case where there is a full table but only small amounts of incremental data. Its drawbacks are as follows:
High storage cost: A full table partition and an incremental table partition are generated each day, without reusing any data.
High computing cost: Data reads, data writes, and a table merge are performed on all data each day.
Data cannot be queried until the next day.
The merging of full table partitions and incremental table partitions takes a long time.
Considering the limitations, we would prefer a data lake into which data can be directly upserted based on primary keys. In this case, the table in the data lake mirrors the one in the business database, so that both real-time and batch processing can be handled.
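The daily full-plus-incremental procedure described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: partitions are modeled as in-memory dictionaries keyed by primary key, and merge_full_and_incremental is a hypothetical helper, not part of Hive.

```python
def merge_full_and_incremental(previous_full, incremental):
    """Build today's full partition from yesterday's full partition plus one day of changes.

    previous_full: dict mapping primary key -> row
    incremental:   ordered list of (op, primary_key, row), op is "upsert" or "delete"
    """
    new_full = dict(previous_full)  # every row is copied, whether it changed or not
    for op, key, row in incremental:
        if op == "delete":
            new_full.pop(key, None)
        else:
            new_full[key] = row
    return new_full


full_day1 = {1: {"name": "a"}, 2: {"name": "b"}, 3: {"name": "c"}}
changes_day2 = [
    ("upsert", 2, {"name": "b2"}),
    ("delete", 3, None),
    ("upsert", 4, {"name": "d"}),
]

full_day2 = merge_full_and_incremental(full_day1, changes_day2)
print(full_day2)
```

Note that the cost of the merge is proportional to the size of the full table, not to the number of changes, and that full_day1 and full_day2 share no storage. These are exactly the storage and computing costs listed above.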

Attempts at Upsertable Data Lakes

In 2020, our team at Alibaba Cloud delved into three major lakehouse solutions: Iceberg, Hudi, and Delta. HU Zheng, my then colleague and currently an Iceberg PMC member, wrote an article comparing the three solutions. In that article, he drew an interesting house-building analogy to summarize their differences.

The Delta house has a fairly solid foundation and many stories (features). But the house is still proprietary to Databricks. The open-source version is much inferior to the proprietary version.
The Iceberg house has a very solid foundation and can be easily expanded (which means it supports a wide array of computing engines or file systems), but it doesn't have that many stories yet (falling behind in some major features). It's worth mentioning that Iceberg has been rapidly gaining popularity in North America, where there is a growing interest for SaaS providers to choose Iceberg for their data warehouse solutions.
The foundation of the Hudi house is somewhat flimsy, which limits its extendibility. For example, if you want to use it as an Apache Flink Sink, you'll need to completely refurbish the house while worrying about the impact on the other functionalities. But Hudi has its strengths. Its full-fledged feature set, especially upsert support, helped it set foot in the Chinese market.
At first glance, Hive-ACID looks like a luxurious villa, equipped with the most comprehensive set of features. But with a closer look, you will find small issues here and there. As a side note, Hive-ACID has not been widely used in China.
A solid foundation is what we value most, and we believed a lakehouse dedicated to data streams could be built on top of Iceberg, so we chose Iceberg as our R&D focus.

After the community's efforts, we have developed a preliminary CDC (Change Data Capture) data writing solution with Flink + Iceberg, supporting near real-time data writing and reading.

For now, the combination of Flink and Iceberg can be put into production environments: data processed by Flink can be stored in data lakes, and the streaming of change data into data lakes is basically functional.
However, the streaming solution for change data is still far from ideal in terms of large-scale updates and near-real-time latency, not to mention reading data in streams. This discrepancy is caused by the following factors:

The majority of Iceberg users use it as part of a batch processing solution in place of Hive. Facing a formidable competitor in Delta, Iceberg may not refactor its architecture for CDC streams.
Iceberg is already working well with a wide array of compute engines, which may lead to complex compatibility work in future updates for existing projects.
But its advantages as a highly extendible batch-processing data lake have won Iceberg a host of organizational users in the international market.

Another upsertable solution involves using Flink and Hudi. CHEN Yuzhao, my then colleague and currently a Hudi PMC member, created Flink + Hudi Connector with the Hudi community.

By default, Hudi uses a Flink state backend to store the indexes that map keys to file groups. This method is starkly different from that of Spark.

On the good side, this is a fully automated method. You can simply adjust the concurrency if you want a scale-up.
On the bad side, this method uses data lakes to handle real-time point queries, which delivers poor performance, especially when massive data, such as more than 500 million records, is stored. In addition, it incurs high storage cost by using RocksDB as a state backend to store all indexes. What's more, data inconsistency can easily occur, and the data cannot be read or written by other engines, because doing so would corrupt the indexes stored in the state backend.
To address the issues innate to the Flink State Index solution, engineers at ByteDance put forward the Bucket Index solution in the Hudi community.

This is a simple but on-point solution. It stores data partitions in buckets based on hash algorithms. Each bucket corresponds to a file group.

Pros: It eliminates the performance issues caused by indexes.
Cons: An appropriate bucket number needs to be determined by the user. Excessive operations on small-sized files will be needed if too many buckets are involved, and performance will be undermined if there aren't enough buckets.
The solution has been the choice of most Hudi users.
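The bucket index idea can be illustrated with a short Python sketch: the target file group is computed from the key itself, so no key-to-file index has to be stored or looked up. The CRC32 hash, the function names, and the path layout are assumptions made for illustration; they are not Hudi's actual implementation.

```python
import zlib


def bucket_for(primary_key: str, num_buckets: int) -> int:
    """Map a primary key to a bucket with a stable hash (CRC32 is an arbitrary choice here)."""
    return zlib.crc32(primary_key.encode("utf-8")) % num_buckets


def file_group_for(partition: str, primary_key: str, num_buckets: int) -> str:
    """Each bucket corresponds to exactly one file group inside the partition."""
    return f"{partition}/bucket-{bucket_for(primary_key, num_buckets)}"


NUM_BUCKETS = 4  # has to be chosen by the user up front

for key in ["order-1", "order-2", "order-1"]:
    print(key, "->", file_group_for("dt=2023-01-01", key, NUM_BUCKETS))
```

The same key always lands in the same file group, which is what removes the index lookup. The price is that NUM_BUCKETS is fixed by the user, which leads to the trade-off between many small files and overloaded buckets described above.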

There are also a lot of Alibaba Cloud users using Hudi. As the userbase continues to expand, issues start to emerge:

Users are lost in the myriad of choices for Hudi.
(1) State index or bucket index? One is easier to use but less performant, while the other performs well but is difficult to use.

(2) Copy-on-Write or Merge-on-Read? One causes poor write throughput, and the other leads to poor read performance.

Low update efficiency. At a checkpoint interval of one to three minutes, backpressure can easily build up. By default, data is merged after five checkpoints, and only after a merge can the business side query the latest data. Therefore, the unification of full and incremental data is not truly achieved.
The system design is complicated, which makes it hard to perform troubleshooting. I used to be head of Alibaba Cloud's Hudi team for half a year, and we had to deal with endless tickets from users of our managed services. Also, there are a large number of parameters that need to be configured to make Hudi work with other engines, which highlights its poor compatibility.
Hudi is designed for Spark-based batch processing and thus is not able to fully fit data stream scenarios. Attempts at adding stream computing capabilities on top of the batch processing architecture make the system more complex and harder to maintain.

Though Hudi has become much more stable in recent updates, if you take a look at the latest Hudi Roadmap (https://hudi.apache.org/roadmap), you may notice that little is planned for Flink or data streams in general. Commonly, new Hudi features are supported only by Spark, not Flink. For Flink to support a new feature, heavy refactoring is required, and new bugs can easily be introduced in the process, not to mention the possibly substandard end result. Unquestionably, Hudi is an outstanding system, but it is not meant for real-time data lakes.
Then what are the characteristics of an ideal storage system?

A solid foundation like that of Iceberg and a feature set that can meet the basic requirements of data lake storage.
Powerful upsert capabilities with support of the log-structured merge (LSM) that is commonly used by OLAP systems, stream state backends, and KV stores.
Designed with streaming scenarios, particularly Flink, in mind, instead of starting with a complex system and progressively adding support for data streams.
Unable to find an ideal solution, we decided to create one ourselves.

The Birth of Apache Paimon

Start of Flink Table Store (FTS): Database or Data Lake
In 2021, I started a discussion in the Flink community: FLIP-188: Introduce Built-in Dynamic Table Storage (https://cwiki.apache.org/confluence/display/FLINK/FLIP-188%3A+Introduce+Built-in+Dynamic+Table+Storage), raising the idea of a Flink-oriented storage system named Flink Table Store (FTS). It was meant to address issues in real-time processing, near-real-time processing, queueing, and table formats. The combination of Flink and FTS would be able to achieve automated stream processing of materialized views and serve queries. It would be a full-fledged "StreamingDB".

The project sounds rosy, but it's easier said than done.

Though an all-in-one solution is easier to use, it may not deliver the high availability eagerly needed in the field of big data analytics.
In real-time preprocessing scenarios, the data storage system needs to provide high SLA guarantees. But this is tough for a complex system like FTS that encompasses computing, storage, and pipeline management. Users drawn to this solution can enjoy better usability but have to compromise on stability guarantees.
In real-time data warehousing scenarios, FTS focuses on stream processing, while OLAP can provide high-performance queries. Users that opt for FTS benefit from better usability at the expense of fast response speed.
The all-in-one concept of FTS is far from feasible. It requires heavy investment but yields trivial returns. In this context, a step-by-step approach targeted at existing pain points in production environments would be a more viable choice.

Preliminary Stage of FTS: Data lake + LSM

After several months of development, we formulated a roadmap. We decided to start from a data lake architecture and, together with tools from the ecosystem, build a full-fledged data pipeline that addresses pain points in real-world scenarios. In May 2022, we released version 0.1. It was not a usable version, just a preliminary demo of this from-scratch project.
In September, version 0.2 was released. It provided better functionalities. Some community members took notice and put FTS in their production environments.

(Compared with the current architecture of Paimon, log systems are no longer a preferred choice. With the exception of response latency, a data lake provides much better capabilities than a log system.)
FTS is a data lake system, used for the real-time writes of change logs in streams (such as those from Flink CDC) and high-performance queries. It combines lake storage and the LSM structure, boasts high compatibility with Flink, and supports real-time data updates. It can handle influx of change data with high throughput, while delivering excellent query performance.
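For readers unfamiliar with the LSM structure mentioned above, the following toy Python sketch shows why it suits high-throughput upserts: writes only append new sorted runs, reads merge runs from newest to oldest, and a background compaction merges runs together. The class TinyLsm and its parameters are hypothetical and heavily simplified; this is not how FTS or Paimon is implemented.

```python
class TinyLsm:
    """Toy log-structured merge store: a write buffer plus immutable sorted runs."""

    def __init__(self, flush_threshold=2):
        self.flush_threshold = flush_threshold
        self.memtable = {}  # in-memory write buffer
        self.runs = []      # oldest first; each run is a sorted list of (key, value)

    def upsert(self, key, value):
        # An upsert never rewrites existing runs; it only touches the buffer.
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # Newest data wins: check the buffer, then runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            for k, v in run:
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all runs into one, keeping only the latest value per key.
        merged = {}
        for run in self.runs:
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []


store = TinyLsm(flush_threshold=2)
store.upsert("k1", "v1")
store.upsert("k2", "v2")      # buffer is full, flushed as run 1
store.upsert("k1", "v1-new")
store.upsert("k3", "v3")      # buffer is full, flushed as run 2
print(store.get("k1"))
```

The write path stays cheap because old files are never modified in place, while compaction keeps the number of runs, and therefore the read cost, under control.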
Then, FTS 0.3 was released.

By version 0.3, a lakehouse solution designed for streaming had taken shape, and we could comfortably recommend FTS for production use.
In addition to supporting writing data to data lakes, FTS also supports features like partial-update, which offers users more flexibility in finding the balance between latency and cost.

Incubation of Paimon

After three versions, FTS had gained decent maturity, but as part of the Apache Flink community, it could not work with other ecosystems, such as Spark. To put it on a bigger stage, the Flink PMC decided to donate it to the Apache Software Foundation as an independent project. On March 12, 2023, FTS was officially adopted by the Apache Incubator and renamed Apache Paimon.
After entering the Apache Incubator, Paimon attracted widespread attention, including from renowned companies like Alibaba Cloud, ByteDance, BiliBili, Autohome, and Ant Group.

Meanwhile, we are constantly improving Paimon's compatibility with Flink. It comes with Flink CDC and supports fully automated synchronization of data, schemas, and databases, thereby improving the performance, cost-efficiency, and user experience. The following image illustrates its architecture.

What exactly are the benefits of Apache Paimon?
Apache Paimon: Real-time Streaming of Change Data to Data Lakes
As mentioned above, when handling the transfer of change data to data lakes, a traditional data warehouse solution involves both full data tables and incremental data tables. In contrast, a solution with Paimon is much simpler, as shown in the following figure.

Overview:

A non-partitioned primary-key table is created, into which change data is written in streams.
The latest data can be queried in real-time.
Materialized views can be created through CREATE TAG commands. A tag is a reference to a snapshot.
Incremental views can be queried through INCREMENTAL commands. For example, you can query the difference between two tags.
This architecture has the following advantages:
Low storage cost: The LSM data structure allows data to be reused between tags. In scenarios without substantial incremental data, storage costs can be reduced by hundreds of times.
Low computational overhead: The performance of real-time upsertion can be ten times higher compared with Hive.
High data freshness: Data update latency is cut from hours to minutes, reaching a near-real-time level.
Fast creation of batch query tables: The time to complete CREATE TAG operations is reduced from hours to minutes.
The following capabilities allow Paimon to deliver its benefits:
Paimon 0.4 and 0.5 added the support of change data from Flink CDC and even Kafka CDC. Unlike the data transfer approach using Flink SQL, Paimon enables synchronization of not only the data changes but also schema modifications and the entirety of databases. This capability can significantly reduce maintenance efforts and lower resource consumption.
Excellent primary key-based updates: Thanks to the LSM-based design and high-quality coding, Paimon can provide decent throughput and query performance even if the update latency is as low as 1 minute.
The LSM structure prevents data bloats, thereby significantly reducing storage cost.
The CREATE TAG command is supported, and incremental views can be queried.
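The relationship between snapshots, tags, and incremental views can be sketched as follows. This Python model assumes a snapshot is simply a set of immutable data files and a tag is a named pointer to a snapshot; SnapshotStore and its methods are hypothetical, and a real incremental query returns changed rows rather than file names.

```python
class SnapshotStore:
    """Toy model: snapshots are sets of immutable files, tags are named snapshot references."""

    def __init__(self):
        self.snapshots = {}  # snapshot id -> frozenset of data file names
        self.tags = {}       # tag name -> snapshot id
        self.latest = 0

    def commit(self, added, removed=()):
        current = self.snapshots.get(self.latest, frozenset())
        self.latest += 1
        self.snapshots[self.latest] = (current - frozenset(removed)) | frozenset(added)
        return self.latest

    def create_tag(self, name, snapshot_id=None):
        # Creating a tag copies no data; it only records a reference.
        self.tags[name] = self.latest if snapshot_id is None else snapshot_id

    def files_of(self, tag):
        return self.snapshots[self.tags[tag]]

    def incremental_files(self, start_tag, end_tag):
        # Files present at end_tag that did not exist at start_tag.
        return self.files_of(end_tag) - self.files_of(start_tag)


store = SnapshotStore()
store.commit(added=["f1", "f2", "f3"])  # initial full load
store.create_tag("2023-01-01")
store.commit(added=["f4"])              # one day of small changes
store.create_tag("2023-01-02")

shared = store.files_of("2023-01-01") & store.files_of("2023-01-02")
increment = store.incremental_files("2023-01-01", "2023-01-02")
print(sorted(shared), sorted(increment))
```

Both tags share f1 to f3 and only f4 is new, which illustrates why tagging a table with few daily changes costs far less storage than writing a new full partition every day.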

Apache Paimon: Partial-update and Changelog-based Reads

Paimon is not only a technology for writing change data to data lakes, but a data lake solution that is designed for data stream scenarios, delivering features like near-real-time partial-update and changelog-based data reads.
Paimon supports the partial-update table type. Multiple stream jobs can write to the same partial update table and update different columns. The column updates can even be performed on a version-specific basis. The consumers of data can query all columns of the table.
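The partial-update behavior can be illustrated with a minimal Python sketch. It assumes the simplest merge rule, in which a non-null incoming field overwrites the stored field and a null field leaves it untouched; the version-specific updates mentioned above are not modeled, and the function and column names are hypothetical.

```python
def partial_update(current, incoming):
    """Merge an incoming row into the stored row: only non-null fields overwrite."""
    merged = dict(current or {})
    for column, value in incoming.items():
        if value is not None:
            merged[column] = value
    return merged


table = {}  # primary key -> merged row


def write(key, row):
    table[key] = partial_update(table.get(key), row)


# Job A only knows the price; job B only knows the address.
write(1, {"order_id": 1, "price": 9.5, "address": None})
write(1, {"order_id": 1, "price": None, "address": "Hangzhou"})

print(table[1])
```

A consumer reading key 1 sees one wide row with both columns filled, even though no single job ever produced that row and no streaming join was needed.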
In future updates, Paimon is expected to have better computational capabilities in near-real-time scenarios by providing addition of columns to tables based on foreign keys. Stream engines and OLAP systems have never been good at handling joins. Hopefully, streaming data lakes can serve as a solution on this front.

When data is updated, downstream consumers often need changelog-based reads. If downstream consumers require complete rows after the table is updated, the change logs need to be generated by the storage system.
The full-compaction changelog-producer may cause rewrite of all data, and therefore is a very costly change log production mode. Sometimes, even Avro, the more performant format for full data read/write, cannot meet the requirements.
Thanks to the LSM file structure, Paimon also provides the lookup changelog-producer, which generates complete change logs based on original values.

The lookup mode allows you to build a real-time data stream pipeline while ensuring the integrity of stream processing.
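The idea behind generating a complete changelog by lookup can be sketched in Python as follows. The sketch assumes the storage can fetch the previous row for a key at write time and emits Flink-style row kinds (+I, -U, +U); apply_with_changelog is a hypothetical helper and the in-memory dict stands in for the LSM lookup.

```python
def apply_with_changelog(table, key, new_row):
    """Upsert new_row and return the complete change records it implies."""
    old_row = table.get(key)  # the "lookup" of the previous value
    table[key] = new_row
    if old_row is None:
        return [("+I", new_row)]
    if old_row == new_row:
        return []  # nothing changed, nothing to emit
    return [("-U", old_row), ("+U", new_row)]


table = {}
changelog = []
changelog += apply_with_changelog(table, 1, {"amount": 10})
changelog += apply_with_changelog(table, 1, {"amount": 15})

# A downstream SUM(amount) stays correct because it can retract the old value.
total = 0
for kind, row in changelog:
    total += -row["amount"] if kind == "-U" else row["amount"]

print(changelog, total)
```

Without the -U record, the downstream aggregate would have added 10 and 15 and reported 25, which is why complete change logs matter for the integrity of stream processing.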
You can also specify consumer IDs in Paimon, which is similar to group-id in Kafka:

SELECT * FROM t /*+ OPTIONS('consumer-id' = 'myid') */;

When tables in Paimon are read in streams, the ID of the involved snapshot is recorded. This is beneficial in the following ways:

After a job stops, a new job can pick up from where it left off, without the need to restore the status. The new job continues the reading progress based on the snapshot ID.
When determining whether a snapshot has expired, Paimon checks all consumers of the file system. If there are still consumers that use this snapshot, the snapshot will not be deleted when expired.
Based on consumer IDs, watermarks are automatically passed downstream, so that the computational progress can be determined based on the watermarks in snapshots.
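The first two benefits can be modeled with a small Python sketch. It assumes each consumer ID stores the ID of the next snapshot it will read, and that expiration never removes a snapshot at or beyond the slowest consumer's position; ConsumerRegistry and its methods are hypothetical and only approximate the behavior described above.

```python
class ConsumerRegistry:
    """Toy model of per-consumer reading progress kept next to the table."""

    def __init__(self):
        self.next_snapshot = {}  # consumer id -> next snapshot id to read

    def record(self, consumer_id, next_snapshot_id):
        self.next_snapshot[consumer_id] = next_snapshot_id

    def resume_from(self, consumer_id, earliest_snapshot):
        # A restarted job continues from the recorded position without restoring state.
        return self.next_snapshot.get(consumer_id, earliest_snapshot)

    def expirable(self, snapshots, retain_latest):
        # Candidates are all snapshots except the newest `retain_latest` ones.
        ordered = sorted(snapshots)
        candidates = ordered[:-retain_latest] if retain_latest else ordered
        # Snapshots that some consumer still has to read are protected.
        floor = min(self.next_snapshot.values(), default=None)
        return [s for s in candidates if floor is None or s < floor]


registry = ConsumerRegistry()
before = registry.expirable([1, 2, 3, 4, 5], retain_latest=1)

registry.record("myid", 3)  # the job with consumer-id 'myid' has finished snapshots 1 and 2
resume_at = registry.resume_from("myid", earliest_snapshot=1)
after = registry.expirable([1, 2, 3, 4, 5], retain_latest=1)

print(before, resume_at, after)
```

With no consumers registered, snapshots 1 to 4 may expire; once 'myid' is registered at snapshot 3, only snapshots 1 and 2 may be removed.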

Advantages of Apache Paimon

Apache Paimon is based on data lake storage plus LSM, with strong upsert capabilities and natural data skipping.
Apache Paimon emerged from Flink, supporting all features of Flink SQL, including Flink CDC. Apache Spark is also an essential part of the ecosystem, so Apache Paimon is designed from the beginning to be compatible with multiple computing engines.
Paimon is natively designed for real-time data lake scenarios, significantly improving the data freshness across the entire data lake pipelines and enabling rapid iteration and development.

The greatest benefit is actually the lack of burden. Moving forward from scratch, there are still many problems to be solved in designing a streaming data lake today. If you are dragging a heavy cart forward, progress is slow and difficult. However, Paimon has only one mission: streaming data lake.

Summary and Future Considerations
This article roughly outlines the history and development of stream computing + lake storage through my experiences.

Storm: inaccurate real-time preprocessing
Spark: mini-batch preprocessing
Flink + HBase/Redis/MySQL: accurate real-time preprocessing
Flink + OLAP database: real-time data warehouses (balanced choice for performance and cost)
Flink + data lake: partial real-time capabilities of batch data warehouses
Flink + streaming data lake: CDC real-time streaming (suitable for more use cases)
Next step: Streaming Lakehouse, an all-in-one solution

These are what we want to achieve with the streaming lakehouse:

Data moves across the pipeline in real-time. All data is stored for ad-hoc queries.
A data warehouse that suits both batch and stream scenarios.
There is still a long way to go, but we are moving at full speed to get there.

If you are interested in learning more about Apache Paimon, please follow Apache Paimon on GitHub.
To experience Apache Paimon firsthand on Alibaba Cloud, visit Realtime Compute for Apache Flink and start your free trial here.
