Flink CDC realizes real-time synchronization and conversion of massive data

1. Flink CDC technology

CDC is the abbreviation of Change Data Capture. It is a technology for capturing changed data. CDC technology has existed for a long time. Up to now, there are many CDC technology solutions in the industry, which can be divided into two categories in principle:

One is query-based CDC technology, such as DataX. As the current scene has higher and higher requirements for real-time performance, the defects of this type of technology have gradually become prominent. Offline scheduling and batch processing modes lead to high latency; slicing based on offline scheduling cannot guarantee data consistency; in addition, real-time performance cannot be guaranteed.
One is log-based CDC technologies, such as Debezium, Canal, and Flink CDC. This CDC technology can consume database logs in real time, and the stream processing mode can ensure data consistency and provide real-time data, which can meet the current increasingly real-time business needs.

The figure above shows a comparison of common open source CDC solutions. It can be seen that the mechanism of Flink CDC and its performance in incremental synchronization, breakpoint resume, and full synchronization are very good, and it also supports full incremental integrated synchronization, while many other open source solutions cannot support full incremental integrated synchronization. Flink CDC is a distributed architecture that can meet the business scenarios of massive data synchronization. Relying on Flink's ecological advantages, it provides DataStream API and SQL API, which provide very powerful transformation capabilities. In addition, the open source ecology of the Flink CDC community and the Flink community is very complete, attracting many community users and companies to develop and build together in the community.

Flink CDC supports full incremental integrated synchronization, providing users with real-time consistent snapshots. For example, a table contains full historical data and new real-time change data. The incremental data is continuously written to the Binlog log file. Flink CDC will first synchronize the full historical data, and then seamlessly switch to synchronizing incremental data. During incremental synchronization, if it is newly inserted data (the blue block in the figure above), it will be appended to the real-time consistent snapshot; if it is updated data (the yellow block in the figure above), it will be added to the existing Update historical data.

Flink CDC is equivalent to providing a real-time materialized view, providing users with real-time consistent snapshots of the tables in the database, which can be used to further process these data, such as cleaning, aggregation, filtering, etc., and then write them downstream.

2. Pain points of traditional data integration solutions

The picture above shows the traditional data storage architecture 1.0, which mainly uses DataX or Sqoop to fully synchronize to HDFS, and then builds a data warehouse around Hive.

There are many defects in this solution: it is easy to affect the stability of the business, because the data needs to be queried from the business table every day; the daily output leads to poor timeliness and high delay; if the scheduling interval is adjusted to every few minutes, the source database It causes a lot of pressure; the scalability is poor, and performance bottlenecks are prone to occur after the business scale expands.

The picture above shows the traditional data warehouse 2.0 architecture. It is divided into two links: real-time and offline. The real-time link performs incremental synchronization, such as synchronizing to Kafka through Canal and then performing real-time reflow. Generally, full synchronization is only done once, and the daily increment is regularly merged on HDFS, and finally Import it into the Hive data warehouse.

This method only performs full synchronization once, so it basically does not affect business stability. However, incremental synchronization has a timed backflow, which can only be maintained at the hour and day level, so its timeliness is relatively low. At the same time, the two links of full and incremental are separated, which means that there are many links and components that need to be maintained, and the maintainability of the system will be relatively poor.

The figure above shows the traditional CDC ETL analysis architecture. After CDC data is collected through tools such as Debezium and Canal, it is written into the message queue, and then the calculation engine is used for calculation and cleaning, and finally transmitted to the downstream storage to complete the construction of real-time data warehouse and data lake.

Traditional CDC ETL analysis introduces many components such as Debezium and Canal, which need to be deployed and maintained, and the Kafka message queue cluster also needs to be maintained. The defect of Debezium is that although it supports full and incremental, its single concurrency model cannot cope with massive data scenarios well. Canal, on the other hand, can only read incrementally, and requires the cooperation of DataX and Sqoop to read the full amount, which is equivalent to requiring two links, and the number of components to be maintained also increases. Therefore, the pain points of traditional CDC ETL analysis are poor single-concurrency performance, full-increment splitting, and many dependent components.

3. Real-time synchronization and conversion of massive data based on Flink CDC

What improvement can the Flink CDC solution bring to the real-time synchronization and conversion of massive data?

Flink CDC 2.0 implements the incremental snapshot reading algorithm on MySQL CDC. In the latest version 2.2, the Flink CDC community abstracts the incremental snapshot algorithm into a framework, so that other data sources can also reuse the incremental snapshot algorithm.

The incremental snapshot algorithm solves some pain points in the full incremental integrated synchronization. For example, the early version of Debezium used locks when implementing full-incremental integrated synchronization, and it was a single-concurrency model with a redo mechanism on failure, so it was unable to implement resumable uploads at the full-volume stage. The incremental snapshot algorithm uses a lock-free algorithm, which is very friendly to the business library; it supports concurrent reading, which solves the problem of processing massive data; it supports resuming uploads from breakpoints, avoiding redo on failure, and can greatly improve the efficiency and efficiency of data synchronization. user experience.

The above picture shows the framework of full incremental integration. Simply speaking, the whole framework is to divide the tables in the database into chunks according to PK or UK, and then assign them to multiple tasks for parallel reading, that is, parallel reading is realized in the full stage. Full and incremental can be switched automatically, and the lock-free algorithm is used to switch between lock-free and consistent. After switching to the incremental phase, only a separate task is required to be responsible for the data analysis of the incremental part, thus realizing the integrated reading of all incrementals. After entering the incremental stage, the user can modify the job and release the resources that are no longer needed by the job.

We compared the full incremental integration framework with the Debezium 1.6 version for a simple TPC-DS read test. The customer single table data volume is 65 million. When Flink CDC uses 8 concurrency, the throughput has increased by 6.8 times, and the time-consuming Only 13 minutes, thanks to the support of concurrent reading, if users need faster reading speed, users can increase the concurrent implementation.

When Flink CDC is designed, it also considers storage-friendly write design. In the Flink CDC 1.x version, if you want to achieve exactly-once synchronization, you need to cooperate with the checkpoint mechanism provided by Flink. If you do not slice in the full phase, you can only complete it in one checkpoint, which will lead to a problem: each checkpoint If you want to spit out the full amount of data in this table to the downstream writer, the writer will mix the full amount of data in this table in memory, which will put a lot of pressure on its memory, and the job stability will be particularly poor.

After Flink CDC 2.0 proposes the incremental snapshot algorithm, the checkpoint granularity can be reduced to chunk by slicing, and the chunk size is configurable by the user, the default is 8096, and the user can adjust it to be smaller, reducing the pressure on the writer and reducing The use of memory resources improves the stability of downstream writing to storage.

After full incremental integration, Flink CDC's lake-entry architecture becomes very simple without affecting business stability; being able to achieve minute-level output means that near-real-time or real-time analysis can be achieved; concurrent reading It achieves higher throughput and has a good performance in massive data scenarios; the link is short, the components are few, and the operation and maintenance is friendly.

With Flink CDC, the pain points of traditional CDC ETL analysis have also been greatly improved. Canal, Kafka message queue and other components are no longer needed, and only need to rely on Flink to realize the ability of full incremental integrated synchronization and real-time ETL processing. And it supports concurrent reading, the entire architecture has short links, fewer components, and is easy to maintain.

Relying on the Flink DataStream API and easy-to-use SQL API, Flink CDC also provides a very powerful and complete transformation capability, and can guarantee the changelog semantics during the transformation process. In the traditional solution, it is very difficult to perform transformation on the changelog and guarantee the semantics of the changelog.

Real-time synchronization and conversion of massive data Example 1: Flink CDC realizes the integration of heterogeneous data sources

This business scenario is that business tables such as product tables and order tables are stored in the MySQL database, and logistics tables are stored in the PG database. It is necessary to realize the integration of heterogeneous data sources and widen them during the integration process. It is necessary to perform Streaming Join on the product table, order table, and logistics table before writing the result table into the library. With Flink CDC, the whole process can be realized with only 5 lines of Flink SQL. The downstream storage used here is Hudi, and the entire link can get output at the minute level or even lower, making it possible to do near real-time analysis around Hudi.

Real-time synchronization and conversion of massive data Example 2: Flink CDC realizes sub-database and sub-table integration

Flink CDC has very complete support for sub-databases and sub-tables. When declaring a CDC table, it supports the use of regular expressions to match library names and table names. Regular expressions mean that multiple libraries and multiple tables under these multiple libraries can be matched . At the same time, it provides the support of metadata column. You can know which database and table the data comes from. When writing to the downstream Hudi, you can bring the two columns declared by metadata, and use database_name, table_name and the primary key in the original table (example id column) as the new primary key, only three lines of Flink SQL are needed to realize the real-time integration of sub-database and sub-table data, which is very simple.

Relying on Flink's rich ecology, it can achieve many upstream and downstream expansions. Flink itself has a rich connector ecology. After the addition of Flink CDC, the upstream has richer sources to ingest, and the downstream also has richer destinations to write.

Real-time synchronization and conversion of massive data Example 3: Three lines of SQL to achieve real-time ranking list of cumulative sales of single products

This demo demonstrates the real-time leaderboard of products through 3 lines of SQL without any dependencies. First, add MySQL and ElasticSearch images in Docker, and ElasticSearch is the destination. With Docker up, download the Flink package and the two SQL Connector jars for MySQL CDC and ElasticSearch. Pull up the Flink cluster and SQL Client. Create a table in the MySQL built-in library, fill in the data, and after updating, use Flink SQL to do some real-time processing and analysis, and write it into ES. Construct an order table in the MySQL database and insert data.

The first line of SQL in the above figure is to create the order table, the second line is to create the result table, and the third line is to perform group by query to realize the real-time ranking function, and then write it into the ElasticSearch table created by the second line of SQL.

We made a visualization in ElasticSearch, and we can see that as the orders in MySQL are continuously updated, the ranking list of ElasticSearch will be refreshed in real time.

4. Flink CDC community development

In the past year or so, the community has released 4 major versions, the number of contributors and commits has continued to grow, and the community has become more and more active. We have always insisted on providing all the core features to the community edition, such as MySQL's tens of billions of large tables, incremental snapshot framework, MySQL dynamic table addition and other advanced functions.

The latest 2.2 version also adds a lot of new features. First of all, in terms of data sources, OceanBase, PolarDB-X, SqlServer, and TiDB are supported. In addition, the ecology of Flink CDC has been continuously enriched, compatible with Flink 1.13 and 1.14 clusters, and an incremental snapshot reading framework is provided. In addition, it supports MySQL CDC dynamic table addition and improves MongoDB, such as supporting specified collections, making it more flexible and friendly through regular expressions.

Beyond that, documentation is a particularly important part of the community. We provide an independent versioned community website, in which different versions correspond to different versions of documents, and provide rich demos and FAQs in Chinese and English to help novices get started quickly.

In many key indicators of the community, such as the number of issues created, the number of PRs merged, and the number of Github Stars, the Flink CDC community performed very well.

The future planning of the Flink CDC community mainly includes the following three aspects:

I hope that all databases can be connected to a better framework in the future; some exploratory work has been done on Schema Evolution and whole database synchronization , which will be made available to the community when mature.

Ecological integration: provide more DBs and more versions; data lake integration expects smoother links; provide some end-to-end solutions, users do not need to care about the parameters of Hudi and Flink CDC.

Ease of use: Provide more out-of-the-box experience and improve documentation and teaching

Related Articles

Explore More Special Offers

  1. Short Message Service(SMS) & Mail Service

    50,000 email package starts as low as USD 1.99, 120 short messages start at only USD 1.00

phone Contact Us