Building a streaming data lake with Flink + Hudi

1. Background

Near real-time

Since 2016, the Apache Hudi community has been exploring near real-time use cases through Hudi's UPSERT capability [1]. With the batch processing model of MR/Spark, users can ingest data into HDFS/OSS at hour-level latency. In purely real-time scenarios, an architecture of a stream computing engine (Flink) plus KV/OLAP storage delivers end-to-end analysis at second-level latency. However, a large number of use cases still fall between the second/5-minute level and the hour level, a band we call NEAR-REAL-TIME (near real-time).

In practice, there are a large number of cases that belong to the near real-time category:

Minute-level dashboards ("large screens");
Various kinds of BI analysis (OLAP);
Minute-level feature extraction for machine learning.

Incremental calculation

There is currently no settled solution for the near real-time band:

Stream processing has low latency, but its SQL patterns are relatively fixed, and query-side capabilities (indexes, ad hoc queries) are limited;
Batch data warehouses have rich capabilities, but their data latency is high.
The Hudi community therefore proposed an incremental computing model based on mini-batches:

incremental data set => incremental calculation => merge into stored result => external storage

This model pulls an incremental data set (the data between two commits) from the snapshots stored in the lake, computes an incremental result (such as a simple count) with a batch framework such as Spark or Hive, and then merges that result into the stored results.

Key problems
The core problems the incremental model needs to solve are:

UPSERT capability: like KUDU and Hive ACID, Hudi provides minute-level update capability;
Incremental consumption: Hudi provides incremental pull across the multiple snapshots stored in the lake.
The mini-batch-based incremental computing model can reduce latency in some scenarios and save computing cost, but it has a major limitation: it constrains the SQL patterns. Because the computation runs in batches and batch computation itself keeps no state, the computed metrics must be easy to merge. A simple count or sum works, but avg and count distinct still require pulling the full data set and recomputing.
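To make the merge requirement concrete, here is a sketch in SQL (the tables `stored_result` and `incremental_batch` and the column names are hypothetical): a per-key count over an incremental batch folds into the stored result with a simple join, while avg does not.

```sql
-- Merge an incremental per-key count into the stored result.
-- (A FULL OUTER JOIN would be needed to also pick up brand-new keys;
-- omitted here for brevity.)
SELECT s.item_id, s.cnt + i.cnt AS cnt
FROM stored_result s
JOIN (
  SELECT item_id, COUNT(*) AS cnt
  FROM incremental_batch
  GROUP BY item_id
) i ON s.item_id = i.item_id;

-- avg and COUNT(DISTINCT ...) cannot be merged this way:
-- two averages cannot be combined without also keeping the underlying
-- counts, so such metrics need extra state or a full recomputation.
```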

With the rise of stream computing and real-time data warehouses, the Hudi community has been actively embracing change, continuously optimizing and evolving the original mini-batch-based incremental computing model with stream computing: version 0.7 introduced streaming ingestion into the lake, and version 0.9 adds native support for the CDC format.

2. Incremental ETL

DB data into the lake

As CDC technology matures and tools such as Debezium become increasingly popular, the Hudi community has integrated stream-write and stream-read capabilities. Users can write CDC data to Hudi storage in real time through Flink SQL:

Users can import DB data directly into Hudi through the Flink CDC connector;
Alternatively, CDC data can be imported into Kafka first and then into Hudi through the Kafka connector.
The second scheme offers better fault tolerance and scalability.
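A minimal sketch of the first scheme in Flink SQL, assuming the Flink CDC `mysql-cdc` connector and the hudi-flink bundle are on the classpath; the schema, connection details, and OSS path are placeholders:

```sql
-- MySQL binlog source via the Flink CDC connector
CREATE TABLE orders_src (
  order_id BIGINT,
  user_id  BIGINT,
  amount   DECIMAL(10, 2),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'localhost',
  'port' = '3306',
  'username' = 'flink',
  'password' = '******',
  'database-name' = 'shop',
  'table-name' = 'orders'
);

-- Hudi sink table on HDFS/OSS
CREATE TABLE orders_hudi (
  order_id BIGINT,
  user_id  BIGINT,
  amount   DECIMAL(10, 2),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'oss://my-bucket/warehouse/orders_hudi',
  'table.type' = 'MERGE_ON_READ'
);

-- Continuously stream the CDC changes into the lake
INSERT INTO orders_hudi SELECT * FROM orders_src;
```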

Data Lake CDC
In the upcoming version 0.9, Hudi natively supports the CDC format, preserving all change records of each row. This makes the combination of Hudi and stream computing systems more complete, as CDC data can be read as a stream [2]:

All change messages of the source CDC stream are preserved after entering the lake and can be consumed as a stream. Flink's stateful computation accumulates results (state) in real time and writes the changes back to Hudi lake storage as a stream; Flink can then consume the changelog stored in Hudi for the next level of stateful computation. The result is a near real-time end-to-end ETL pipeline:

This architecture shortens end-to-end ETL latency to the minute level, and each layer's storage can be compacted into a columnar format (Parquet, ORC) to provide OLAP analysis capability. Thanks to the openness of the data lake, the compacted format can be queried by various engines: Flink, Spark, Presto, Hive, etc.
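Such a two-layer pipeline can be sketched as follows. Table names and paths are illustrative, and the `changelog.enabled` / `read.streaming.*` option names follow the Hudi 0.9 Flink options; check the documentation for your version:

```sql
-- Layer 1: Hudi table keeping the full changelog, readable as a stream
CREATE TABLE dwd_orders (
  order_id BIGINT,
  user_id  BIGINT,
  amount   DECIMAL(10, 2),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'oss://my-bucket/warehouse/dwd_orders',
  'table.type' = 'MERGE_ON_READ',
  'changelog.enabled' = 'true',
  'read.streaming.enabled' = 'true'
);

-- Layer 2: target table for the next level of stateful computation
CREATE TABLE dws_user_amount (
  user_id BIGINT,
  total   DECIMAL(20, 2),
  PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'oss://my-bucket/warehouse/dws_user_amount',
  'table.type' = 'MERGE_ON_READ',
  'changelog.enabled' = 'true'
);

-- Stateful aggregation over the streamed changelog of layer 1
INSERT INTO dws_user_amount
SELECT user_id, SUM(amount) FROM dwd_orders GROUP BY user_id;
```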

A Hudi data lake table has two forms:

Table form: query the latest snapshot while benefiting from an efficient columnar storage format;
Streaming form: consume changes as a stream, starting from any specified point in the changelog.
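The two forms can be selected per query with Flink's dynamic table options (this requires `table.dynamic-table-options.enabled = true`; the table name and commit timestamp below are placeholders):

```sql
-- Table form: batch query of the latest snapshot
SELECT * FROM hudi_table;

-- Streaming form: consume the changelog from a given commit onwards
SELECT * FROM hudi_table /*+ OPTIONS(
  'read.streaming.enabled' = 'true',
  'read.streaming.start-commit' = '20210801000000'
) */;
```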

3. Demonstration

We demonstrate the two forms of a Hudi table through a demo.

Environment preparation
Flink SQL Client
hudi-flink-bundle jar built from Hudi master
Flink 1.13.1

Prepare some CDC data in debezium-json format in advance.

Create a table through the Flink SQL Client to read the CDC data file.
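The source table might look like this, assuming the sample records sit in a local file (the table name, schema, and path are placeholders):

```sql
-- Read the prepared Debezium changelog file as a CDC source
CREATE TABLE debezium_source (
  id   BIGINT,
  name STRING,
  ts   TIMESTAMP(3)
) WITH (
  'connector' = 'filesystem',
  'path' = 'file:///tmp/debezium-data.json',
  'format' = 'debezium-json'
);
```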

Execute a SELECT and observe the result: there are 20 records in total, with some UPDATEs in the middle, and the last message is a DELETE.

Create a Hudi table; here we set the table type to MERGE_ON_READ and enable changelog mode via the changelog.enabled attribute.
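For example (table name, schema, and path are placeholders; the option names follow Hudi 0.9's Flink options):

```sql
CREATE TABLE hudi_sink (
  id   BIGINT,
  name STRING,
  ts   TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/hudi_sink',
  'table.type' = 'MERGE_ON_READ',
  'changelog.enabled' = 'true'
);
```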

Import data into Hudi with an INSERT statement, turn on streaming read mode, and execute a query to observe the result.
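A sketch of this step, reusing the placeholder table names from above:

```sql
-- Stream the Debezium changelog into the Hudi table
INSERT INTO hudi_sink SELECT * FROM debezium_source;

-- Streaming read, enabled per query via a dynamic table hint
SELECT * FROM hudi_sink /*+ OPTIONS('read.streaming.enabled' = 'true') */;
```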

We can see that Hudi keeps the change record of each row, including the operation type of each changelog entry. Here we use the table hints feature to set table options dynamically.

Switch to batch read mode and execute the query again: the intermediate changes have been merged away.
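In batch mode a plain query returns the merged snapshot (same placeholder table as above):

```sql
-- Batch (snapshot) read: intermediate UPDATEs are merged,
-- DELETEd rows no longer appear
SELECT * FROM hudi_sink;
```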


Compute count(*) in bounded (batch) read mode;
Compute count(*) in streaming read mode.
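The two queries differ only in the read mode (placeholder table name as above):

```sql
-- Bounded (batch) read
SELECT COUNT(*) FROM hudi_sink;

-- Streaming read
SELECT COUNT(*) FROM hudi_sink /*+ OPTIONS('read.streaming.enabled' = 'true') */;
```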

It can be seen that the calculation results in batch and streaming modes are consistent.

The data lake CDC format is still in a period of rapid iteration, and the community is actively promoting it in production scenarios. Readers interested in Hudi scenarios and cases are welcome to join the community discussion.
