By Yanliu, Hanfeng, and Fengze
Lakehouse, an emerging concept in big data, integrates stream and batch processing and merges data lakes with data warehouses. Alibaba Cloud AnalyticDB for MySQL has developed a next-generation Lakehouse platform using Apache Hudi, facilitating one-click data ingestion from sources such as logs and CDC into data lakes, along with integrated analysis capabilities spanning online and offline computing engines. This article mainly explores how AnalyticDB for MySQL uses Apache Hudi to ingest full and incremental data from multiple CDC tables into data lakes.
Users of data lakes and traditional data warehouses often encounter several challenges:
• Full data warehousing or analysis that connects directly to source databases places a heavy load on those databases, so the load must be reduced to avoid failures.
• Data warehousing is usually delayed by T+1 day, while data lake ingestion can be quicker at just T+10 minutes.
• Storing vast amounts of data in transactional databases or traditional data warehouses is costly, necessitating more affordable data archiving solutions.
• Traditional data lakes lack support for updates and tend to have many small files.
• High operational and maintenance costs for self-managed big data platforms drive the demand for productized, cloud-native, end-to-end solutions.
• Storage in common data warehouses is not openly accessible, leading to a need for self-managed, open-source, and controllable storage systems.
• There are various other pain points and needs.
To address these issues, AnalyticDB Pipeline Service uses Apache Hudi for the full and incremental ingestion of multiple CDC tables. It offers efficient full data loading during ingestion and analysis, real-time incremental data writing, ACID transactions, multi-versioning, automated small file merging and optimization, automatic metadata validation and evolution, efficient columnar analysis format, optimized indexing, and storage for very large partition tables, effectively resolving the aforementioned customer pain points.
AnalyticDB for MySQL uses Apache Hudi as the storage foundation for ingesting CDC and logs into data lakes. Hudi was originally developed by Uber to address several pain points in their big data system. These pain points include:
• Limited scalability of HDFS: A large number of small files put significant strain on the NameNode, causing it to become a bottleneck in the HDFS system.
• Faster data processing on HDFS: Uber required real-time data processing and was no longer satisfied with a T+1 data delay.
• Support for updates and deletions: Uber's data was partitioned by day, and they needed efficient ways to update and delete old data to improve import efficiency.
• Faster ETL and data modeling: Uber wanted downstream data processing tasks to only read relevant incremental data instead of the full dataset from the data lake.
To address these challenges, Uber developed Hudi (Hadoop Upserts Deletes and Incrementals) and contributed it to the Apache Foundation. As the name suggests, Hudi's core capabilities initially focused on efficient updates, deletions, and incremental reading APIs. Hudi, along with Iceberg and Delta Lake, shares a similar overall feature set and architecture, consisting of three main components:
In terms of data objects, Lakehouse typically employs open-source columnar storage formats like Parquet and ORC, with little difference in this respect. For auxiliary data, Hudi offers efficient write indexes, such as Bloom filters and bucket indexes, making it better suited for scenarios with extensive CDC data updates.
Before constructing the data lake ingestion for multiple CDC tables using Hudi, the Alibaba Cloud AnalyticDB for MySQL team researched various industry implementations for reference. Here's a brief overview of some industry solutions:
The following figure shows the architecture of end-to-end ingestion of CDC data to data lakes by using Hudi.
The first component in the diagram is the Debezium deployment, comprising the Kafka cluster, the Schema Registry (Confluent or Apicurio), and the Debezium connector. It continuously reads binary log data from the database and writes it into Kafka.
Downstream in the diagram is the Hudi consumer. In this instance, Hudi's DeltaStreamer tool is utilized to consume data from Kafka and write data into the Hudi data lake. In practice, for similar single-table CDC ingestion scenarios, one could replace the binary log source from Debezium + Kafka with Flink CDC + Kafka or choose Spark or Flink as the computing engine according to business needs.
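To make the single-table variant concrete, the following sketch (ours, for illustration only) wires a mysql-cdc source table to a Hudi sink table through Flink SQL submitted from a Java TableEnvironment. Host names, credentials, paths, columns, and the bucket count are placeholder assumptions, and option names may differ slightly across connector versions.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class SingleTableCdcToHudi {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // Hudi commits are driven by Flink checkpoints
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Source: the binary log of a single MySQL table, read by the mysql-cdc connector.
        tEnv.executeSql(
            "CREATE TABLE orders_src (" +
            "  id BIGINT, amount DECIMAL(10, 2), updated_at TIMESTAMP(3)," +
            "  PRIMARY KEY (id) NOT ENFORCED" +
            ") WITH (" +
            "  'connector' = 'mysql-cdc'," +
            "  'hostname' = 'mysql.example.internal'," +  // placeholder
            "  'port' = '3306'," +
            "  'username' = 'cdc_user'," +                // placeholder
            "  'password' = '******'," +
            "  'database-name' = 'demo_db'," +
            "  'table-name' = 'orders'" +
            ")");

        // Sink: a MERGE_ON_READ Hudi table on OSS, using a bucket index for upserts.
        tEnv.executeSql(
            "CREATE TABLE orders_hudi (" +
            "  id BIGINT, amount DECIMAL(10, 2), updated_at TIMESTAMP(3)," +
            "  PRIMARY KEY (id) NOT ENFORCED" +
            ") WITH (" +
            "  'connector' = 'hudi'," +
            "  'path' = 'oss://my-bucket/lake/demo_db/orders'," +  // placeholder
            "  'table.type' = 'MERGE_ON_READ'," +
            "  'write.operation' = 'upsert'," +
            "  'index.type' = 'BUCKET'," +
            "  'hoodie.bucket.index.num.buckets' = '16'" +
            ")");

        // One ingestion link per table: each additional table needs its own DDL pair and INSERT.
        tEnv.executeSql("INSERT INTO orders_hudi SELECT * FROM orders_src");
    }
}
```

Note that every table needs its own pair of table definitions and its own INSERT statement, which is exactly the management burden discussed next.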
While this method effectively synchronizes CDC data, there is a challenge: every table requires an independent ingestion link to the data lake. This approach poses several issues when synchronizing multiple tables in a database:
Currently, Hudi supports multi-table data lake ingestion through a single link, but this solution is not yet mature enough for production use.
Alibaba Cloud Realtime Compute for Apache Flink (Flink VVP) is a fully managed serverless service built on Apache Flink. It is readily available and supports various billing methods. It provides a comprehensive platform for development, operation, maintenance, and management, offering powerful capabilities for the entire project lifecycle, including draft development, data debugging, operation and monitoring, Autopilot, and intelligent diagnostics.
Alibaba Cloud Flink supports the data lake ingestion of multiple tables (binlog > Flink CDC > downstream consumer). You can consume the binary logs of multiple tables in a Flink task and write them to the downstream consumer.
The following figure shows the topology after the task starts. A binary log source operator distributes data to all downstream Hudi sink operators.
You can use Flink VVP to implement the data lake ingestion of multiple CDC tables. However, this solution has the following problems:
After comprehensive consideration, we decided to adopt an approach similar to the Flink VVP-based data lake ingestion of multiple CDC tables and offer it as a service-oriented feature for full and incremental ingestion of multiple CDC tables on AnalyticDB for MySQL.
The main design objectives for the data lake ingestion of multiple CDC tables in AnalyticDB for MySQL are as follows:
• Support one-click ingestion of data from multiple tables to Hudi, reducing consumer management costs.
• Provide a service-oriented management interface where users can start, stop, and edit data lake ingestion tasks, and support features such as unified prefixes for database and table names and primary key mapping.
• Minimize costs and reduce the components required for data lake ingestion.
Based on these design objectives, we initially chose the technical solution of using Flink CDC as the binary log and full data source, directly writing to Hudi without any intermediate cache.
Flink CDC is a source connector for Apache Flink, capable of reading both snapshot and incremental data from databases like MySQL. With Flink CDC 2.0, enhancements such as full lock-free reading, concurrent full data reading, and resumable checkpointing have been implemented to better achieve unified stream and batch processing.
By using Flink CDC, there's no need to worry about switching between full and incremental data. It allows the use of Hudi's universal upsert interface for data consumption, with Flink CDC handling the switchover and offset management for multiple tables, thereby reducing task management overhead. Since Hudi does not natively support multi-table data consumption, it is necessary to develop code to write data from Flink CDC to multiple downstream Hudi tables.
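The sketch below illustrates, under our assumptions, what such multi-table consumption could look like with the Flink CDC DataStream API: a single MySqlSource reads the binary logs of all tables in a database as Debezium-style JSON, and the stream is then partitioned by the originating table so that each table can feed its own Hudi writer. Connection parameters, the table pattern, and the downstream wiring are illustrative placeholders rather than the actual AnalyticDB implementation.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.ververica.cdc.connectors.mysql.source.MySqlSource;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MultiTableCdcIngestion {
    public static void main(String[] args) throws Exception {
        // One source consumes the full and incremental data of every table in demo_db.
        MySqlSource<String> source = MySqlSource.<String>builder()
                .hostname("mysql.example.internal")  // placeholder
                .port(3306)
                .databaseList("demo_db")
                .tableList("demo_db.*")              // regex covering all tables in the database
                .username("cdc_user")                // placeholder
                .password("******")
                .deserializer(new JsonDebeziumDeserializationSchema())
                .build();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);

        DataStream<String> changes =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "mysql-cdc-multi-table");

        // Partition records by the originating table (Debezium JSON carries it under source.table),
        // so that each table's changes can be handed to a dedicated Hudi write pipeline.
        changes.keyBy(new KeySelector<String, String>() {
                    private final ObjectMapper mapper = new ObjectMapper();
                    @Override
                    public String getKey(String record) throws Exception {
                        return mapper.readTree(record).at("/source/table").asText();
                    }
                })
                .print(); // stand-in: the real link wires each table's partition to a Hudi sink

        env.execute("multi-table-cdc-ingestion");
    }
}
```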
The benefits of this implementation are:
• Shorter data ingestion pathways, fewer components to maintain, and lower costs as there's no reliance on independently deployed binary log source components like Kafka or Alibaba Cloud DTS.
• Existing industry solutions that can be referenced, where the Flink CDC + Hudi combination for single-table data lake ingestion is a well-established solution, and Alibaba Cloud Flink VVP also supports writing multiple tables to Hudi.
The following section describes the practical experience of AnalyticDB MySQL based on this architecture selection.
Currently, the process of writing CDC data to Hudi through Flink is as follows:
Steps 2 to 4 require the write schema, which in the current implementation is fixed before the task is deployed; there is no capability to change the schema dynamically while the task is running.
To solve this problem, we have designed and implemented a solution that can dynamically and non-intrusively update the schema in the data lake ingestion link based on Flink + Hudi. The overall idea is to identify DDL binary log events in Flink CDC: when a DDL event occurs, consumption of incremental data is stopped, and the task is restarted with the new schema after a savepoint is completed.
The advantage of this implementation is that the schemas in the link can be dynamically updated without manual intervention. The disadvantage is that the consumption of all databases and tables needs to stop and then restart. Frequent DDL operations have a great impact on link performance.
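The detection half of this idea can be sketched as follows. With includeSchemaChanges(true) enabled on the MySqlSource, schema-change records appear in the stream alongside row changes; a routing function can recognize them and notify an external controller, which then performs the stop-with-savepoint and redeploys the job with the evolved schema. The exact JSON layout of schema-change records depends on the Flink CDC and Debezium versions, and the controller hook shown here is a hypothetical placeholder.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

/** Separates schema-change (DDL) records from ordinary row-change records. */
public class DdlAwareRouter extends ProcessFunction<String, String> {

    private transient ObjectMapper mapper;

    @Override
    public void open(Configuration parameters) {
        mapper = new ObjectMapper();
    }

    @Override
    public void processElement(String record, Context ctx, Collector<String> out) throws Exception {
        JsonNode node = mapper.readTree(record);
        // Schema-change events carry the DDL text; the exact field layout ("ddl" or a nested
        // "historyRecord") depends on the Flink CDC / Debezium version.
        boolean isSchemaChange =
                !node.at("/ddl").isMissingNode() || !node.at("/historyRecord").isMissingNode();
        if (isSchemaChange) {
            // Ask an external controller (hypothetical hook) to stop the job with a savepoint
            // and redeploy it with the evolved Hudi write schema.
            notifyController(node.toString());
        } else {
            // Ordinary insert/update/delete record: keep flowing to the Hudi writers.
            out.collect(record);
        }
    }

    private void notifyController(String schemaChangeEvent) {
        // Placeholder: in practice this could publish to a control topic or call a management API.
        System.err.println("DDL detected, requesting stop-with-savepoint: " + schemaChangeEvent);
    }
}
```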
First, let's briefly introduce the process of full read and full/incremental switchover in Flink CDC 2.0. In the full data stage, Flink CDC divides a single table into multiple chunks based on the degree of parallelism and distributes them to TaskManagers for parallel reading. After the full reading is completed, lock-free switching to incremental data can be implemented while ensuring consistency. This achieves a true integration of streaming and batch processing.
During the import process, we encountered two problems:
1. The data written in the full data stage is in log files, but for faster querying, it needs to be compacted into Parquet files, which leads to write amplification.
Since Hudi is not aware of the full and incremental switchover, we must use Hudi's upsert interface in the full data stage to deduplicate data. However, the upsert operation on Hudi bucket indexes generates log files, and compaction is required to obtain Parquet files, resulting in write amplification. Moreover, if compaction is performed multiple times during the full import, the write amplification becomes more severe.
Can we compromise the read performance and only write log files? The answer is also no. Increasing the number of log files not only degrades read performance but also affects the performance of listing OSS files, making the writing process slower. During writing, both the log and base files in the current file slice will be listed.
Solution: Increase the checkpoint interval or use different compaction policies in the full data stage and incremental data stage (or perform no compaction in the full data stage).
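As an illustration of this solution, the snippet below sketches two sets of Flink-Hudi write options: one for the full (bootstrap) stage that disables compaction so bulk data is not repeatedly rewritten, and one for the incremental stage that compacts asynchronously every few delta commits. The values are assumptions chosen for illustration; the option keys follow the open-source Hudi Flink connector and may vary between versions.

```java
import java.util.HashMap;
import java.util.Map;

public class StageCompactionOptions {

    /** Full-load stage: skip compaction while bulk data is flowing in, to limit write amplification. */
    public static Map<String, String> fullStage() {
        Map<String, String> opts = new HashMap<>();
        opts.put("table.type", "MERGE_ON_READ");
        opts.put("write.operation", "upsert");
        opts.put("compaction.async.enabled", "false"); // compact once, after the full load finishes
        return opts;
    }

    /** Incremental stage: compact periodically so queries read Parquet instead of long log chains. */
    public static Map<String, String> incrementalStage() {
        Map<String, String> opts = new HashMap<>();
        opts.put("table.type", "MERGE_ON_READ");
        opts.put("write.operation", "upsert");
        opts.put("compaction.async.enabled", "true");
        opts.put("compaction.trigger.strategy", "num_commits");
        opts.put("compaction.delta_commits", "5");     // compact after every 5 delta commits
        return opts;
    }
}
```

A longer checkpoint interval in the full stage has a similar effect, since each checkpoint produces one delta commit.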
2. In Flink CDC, fully imported data is parallel within tables but serial between tables, and the maximum concurrency for writing to Hudi is determined by the number of buckets. Sometimes, the available concurrency resources in the cluster cannot be fully utilized.
When a single table is imported in Flink CDC, if its read/write parallelism is lower than the parallelism available in the cluster, resources sit idle; if plenty of resources are available, the number of Hudi buckets can be increased to raise write concurrency. However, because multiple small tables are imported serially, their import cannot use the spare parallelism either, which also wastes resources. Supporting concurrent import of small tables would improve full-import performance.
Solution: Increase the number of Hudi buckets to improve import performance.
1) Checkpoint backpressure tuning
During full and incremental import, we found that Hudi checkpointing on the link caused backpressure, which led to jitter in the writes:
As shown, the write traffic fluctuated greatly.
We examined the write link in detail and found that the backpressure is mainly caused by data flushing during the Hudi checkpoint. When traffic is relatively high, about 3 GB of data may be flushed within one checkpoint interval, causing a write pause.
The idea to solve this problem is to reduce the buffer size of the Hudi stream write (write task max size) to spread the load of flush data during the checkpoint window to other times.
As shown in the preceding figure, after the buffer size was adjusted, the change in write traffic caused by backpressure resulting from checkpoints was well mitigated.
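In configuration terms, the adjustment boils down to lowering the stream-write buffer thresholds, roughly as sketched below. The numbers are illustrative; the right values depend on traffic and the checkpoint interval.

```java
import java.util.HashMap;
import java.util.Map;

public class StreamWriteBufferTuning {
    public static Map<String, String> options() {
        Map<String, String> opts = new HashMap<>();
        // A smaller per-task buffer makes flushes happen continuously between checkpoints
        // instead of piling up into one large flush when the checkpoint barrier arrives.
        opts.put("write.task.max.size", "512"); // MB per write task; illustrative value
        opts.put("write.batch.size", "64");     // MB flushed per batch; illustrative value
        return opts;
    }
}
```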
To alleviate the backpressure of checkpoints, we also made some other optimizations:
• Reducing the number of Hudi buckets to reduce the number of files that need to be flushed during the checkpoint. This conflicts with increasing the number of buckets in the full data stage, so the two must be weighed against each other.
• Using a Spark job outside the write link to run compaction promptly, avoiding the excessive file-listing overhead caused by log files accumulating during writes.
2) Providing appropriate write metrics to help troubleshoot performance issues
While tuning the Flink link, we found that Flink-Hudi write metrics were severely lacking. When troubleshooting, we had to analyze performance through cumbersome means such as inspecting logs on the machines, dumping memory, and CPU profiling. As a result, we developed a set of in-house Flink stream-write metrics to help quickly locate performance problems.
The main indicators include:
• The size of the buffer occupied by the current stream write operators
• Time consumption of flush buffer
• Time consumption of requesting OSS to create an object
• Number of active write files
Size of the in-heap memory buffer occupied by stream write or append write:
Time consumption of flushing Parquet/Avro log to disk:
Changes in these indicator values help us quickly locate problems. For example, when the Hudi flush time in the preceding figure trends upward, we can quickly trace it to log-file listing slowing down because compaction is not keeping up, resulting in a backlog of log files. After increasing compaction resources, the flush time stays stable.
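The sketch below shows one way such indicators could be exposed through Flink's standard metric API inside a custom write operator. The operator and metric names are illustrative stand-ins, not the actual in-house implementation.

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Gauge;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

/** Illustrative write operator exposing stream-write health through Flink's metric API. */
public class InstrumentedHudiWrite extends RichSinkFunction<String> {

    private long bufferBytes;       // bytes currently buffered by this write task
    private long lastFlushMillis;   // duration of the most recent flush to OSS
    private int activeWriteFiles;   // file handles currently open for writing

    @Override
    public void open(Configuration parameters) {
        // Register gauges so the values appear alongside Flink's built-in operator metrics.
        getRuntimeContext().getMetricGroup().gauge("hudiWriteBufferBytes", (Gauge<Long>) () -> bufferBytes);
        getRuntimeContext().getMetricGroup().gauge("hudiLastFlushMillis", (Gauge<Long>) () -> lastFlushMillis);
        getRuntimeContext().getMetricGroup().gauge("hudiActiveWriteFiles", (Gauge<Integer>) () -> activeWriteFiles);
    }

    @Override
    public void invoke(String record, Context context) {
        bufferBytes += record.length(); // stand-in for the real buffer accounting
        // On flush, the real operator would record the elapsed time of writing Parquet/Avro
        // log blocks and of the OSS create-object requests, for example:
        //   long start = System.currentTimeMillis();
        //   flushBufferToHudi();
        //   lastFlushMillis = System.currentTimeMillis() - start;
    }
}
```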
We are also continuously contributing Flink-Hudi metrics-related code to the community. For details, please refer to HUDI-2141.
3) Compaction tuning
To simplify the configuration, we initially adopted the in-link compaction solution, but we soon found that compaction severely preempted write resources and made the load unstable, which greatly affected the performance and stability of the write link. As shown in the following figure, compaction and GC occupied almost all CPU resources of the TaskManager.
Therefore, we now deploy the TableService separately from the write link and run it as offline Spark tasks, so the two do not affect each other. The TableService consumes serverless resources and is billed on demand, while the write link, no longer responsible for compaction, can stably run on a relatively small amount of resources. Overall, both resource usage and performance stability have improved.
To make it easier to manage the TableServices of many tables across databases, we developed a utility that runs the TableServices of multiple tables in a single Spark task, which has been contributed to the community. For more information, see the PR.
After several rounds of development and tuning, writing multiple Flink CDC tables into Hudi is basically usable. We believe the critical stability and performance optimizations are:
• Separating the compaction from the write link to improve the resource usage for writing and compaction.
• Developing a set of Flink-Hudi metrics, and combining source code and logs to fine-tune the Hudi stream write.
However, this architecture scheme still has the following problems that cannot be easily solved:
If we continue to implement multi-table CDC ingestion into the lake based on this solution, we can try the following directions:
However, based on internal discussions and verifications, we believe that it is challenging to continue implementing multi-table full and incremental CDC ingestion based on the Flink + Hudi framework. In this scenario, we should consider replacing it with the Spark engine. The main considerations are as follows:
Subsequent practices have confirmed the accuracy of our judgment. After transitioning to the Spark engine, the feature set, extensibility, performance, and stability of full and incremental multi-table CDC ingestion have all seen improvements. In future articles, we will share our experiences with implementing full and incremental multi-table CDC ingestion based on Spark + Hudi. Stay tuned!