Flink CDC officially released new support for Oracle and MongoDB
Foreword
CDC (Change Data Capture) is a technology for capturing database change data. Flink has natively supported processing CDC data (changelogs) since version 1.11 and is by now a very mature solution for processing change data.
Flink CDC Connectors is a set of source connectors for Flink and the core component of Flink CDC. These connectors are responsible for reading full historical data and incremental change data from databases such as MySQL, PostgreSQL, Oracle, and MongoDB. Flink CDC Connectors is an independent open source project. Since it was open sourced in July last year, the community has kept up a fast pace of development, releasing a new version roughly every two months. Attention from the open source community keeps growing, and more and more users are adopting Flink CDC to quickly build real-time data warehouses and data lakes.
In July this year, Flink CDC maintainer Xu Bangjiang (Xuejin) shared the design of Flink CDC 2.0 for the first time at the Flink Meetup in Beijing. In August, the Flink CDC community released version 2.0, which solved many pain points in production practice, and the user base of the Flink CDC community grew rapidly.
In addition to the rapid expansion of its user base, the number of developers in the community is also growing quickly. At present, developers from many companies at home and abroad have joined the open source co-construction of the Flink CDC community, including developers from Cloudera in North America, Vinted in Europe, and Ververica. Developers in China are even more active, coming from Internet companies such as Tencent, Alibaba, and ByteDance, from start-ups such as XTransfer, and from traditional enterprises such as Xinhua Winshare. In addition, many cloud vendors at home and abroad have integrated Flink CDC into their stream computing products, allowing more users to experience the power and convenience of Flink CDC.
1. Overview of Flink CDC 2.1
With the joint efforts of community developers, today the Flink CDC community is happy to announce the official release of Flink CDC 2.1: https://github.com/ververica/flink-cdc-connectors/releases/tag/release-2.1.0
This article takes you through the major improvements and core features of Flink CDC 2.1 in 10 minutes. Version 2.1 contains 100+ PRs contributed by 23 contributors, focusing on improving the performance and production stability of the MySQL CDC connector and launching the Oracle CDC and MongoDB CDC connectors.
The MySQL CDC connector supports very large tables with tens of billions of rows, supports all MySQL data types, and greatly improves stability through optimizations such as connection pool reuse. It also provides a DataStream API that supports the lock-free algorithm and concurrent reading, which users can use to build whole-database synchronization pipelines;
A new Oracle CDC connector supports reading full historical data and incremental change data from Oracle databases;
A new MongoDB CDC connector supports reading full historical data and incremental change data from MongoDB databases;
All connectors support metadata columns, so users can access meta information such as the database name, table name, and change timestamp through SQL, which is very practical for data integration in sharded database and table scenarios;
The getting-started documentation of Flink CDC has been enriched with end-to-end hands-on tutorials for various scenarios.
2. Detailed Explanation of MySQL CDC Connector Improvements
The MySQL CDC connector in Flink CDC 2.0 provided advanced features such as the lock-free algorithm, concurrent reading, and resuming from checkpoints, which solved many pain points in production practice, and a large number of users then adopted it at scale. During these rollouts we worked with users to resolve many production issues and, at the same time, developed several features that users urgently needed. The improvements to the MySQL CDC connector in Flink CDC 2.1 fall into two categories: stability improvements and feature enhancements.
1. Stability improvements
A dynamic chunking algorithm for different primary key distributions
For scenarios such as non-numeric primary keys, Snowflake IDs, sparse primary keys, and composite primary keys, the connector dynamically analyzes how uniformly the primary keys of the source table are distributed and automatically computes the chunk size accordingly, making chunking more reasonable and chunk computation faster. The dynamic chunking algorithm solves both the problem of too many chunks with sparse primary keys and that of oversized chunks with composite primary keys: the size and number of chunks are controlled through a single chunk size setting, regardless of the primary key type.
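As an illustration, here is a minimal sketch of pinning the chunk size through the scan.incremental.snapshot.chunk.size option of the MySQL CDC table source. The host, credentials, and table names are placeholders, and the value shown is simply the connector's documented default, so treat the block as a template rather than a recommendation.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ChunkSizeExample {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());
        // Placeholder connection settings and schema.
        tEnv.executeSql(
            "CREATE TABLE orders (\n" +
            "  id BIGINT,\n" +
            "  order_no STRING,\n" +
            "  amount DECIMAL(10, 2),\n" +
            "  PRIMARY KEY (id) NOT ENFORCED\n" +
            ") WITH (\n" +
            "  'connector' = 'mysql-cdc',\n" +
            "  'hostname' = 'localhost',\n" +
            "  'port' = '3306',\n" +
            "  'username' = 'flinkuser',\n" +
            "  'password' = 'flinkpw',\n" +
            "  'database-name' = 'app_db',\n" +
            "  'table-name' = 'orders',\n" +
            // Target rows per snapshot chunk; the dynamic algorithm sizes chunks
            // around this number no matter what the primary key type is.
            "  'scan.incremental.snapshot.chunk.size' = '8096'\n" +
            ")");
        tEnv.executeSql("SELECT * FROM orders").print();
    }
}
```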
Support for very large tables with tens of billions of rows
Previously, when a table was extremely large, dispatching the binlog split could fail with an error. The reason is that an extremely large table corresponds to a great number of snapshot splits, and the binlog split must contain the information of all of them; when the SourceCoordinator dispatches the binlog split to a SourceReader node, the dispatch fails if the split size exceeds the maximum message size supported by the RPC framework. Although tuning the RPC framework's parameters can alleviate the oversized-split problem, it cannot solve it completely. In version 2.1, the snapshot split information is divided into groups, and the binlog split is sent to the reader group by group, which solves the problem completely.
A connection pool for managing database connections, improving stability
Introducing a connection pool to manage database connections reduces the number of database connections and also avoids connection leaks caused by extreme scenarios.
When the schemas of sharded databases and tables are inconsistent, missing fields are automatically filled with NULL values
2. Feature enhancements
Support for all MySQL data types
Including complex types such as enumeration types, array types, and geospatial types.
Support for metadata columns
Users can access meta information such as the database name (database_name), table name (table_name), and change time (op_ts) through metadata columns in Flink DDL, for example db_name STRING METADATA FROM 'database_name'. This is very useful for data integration in sharded database and table scenarios.
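As an illustration, a minimal sketch of a table source that reads a set of sharded MySQL tables and exposes all three metadata columns; the connection settings and the shard-matching regular expressions are hypothetical placeholders.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class MetadataColumnExample {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());
        // Placeholder connection settings; the regular expressions merge all
        // shards of user_db_*.users_* into one logical table.
        tEnv.executeSql(
            "CREATE TABLE all_users (\n" +
            "  db_name STRING METADATA FROM 'database_name' VIRTUAL,\n" +
            "  tbl_name STRING METADATA FROM 'table_name' VIRTUAL,\n" +
            "  op_ts TIMESTAMP_LTZ(3) METADATA FROM 'op_ts' VIRTUAL,\n" +
            "  id BIGINT,\n" +
            "  name STRING,\n" +
            "  PRIMARY KEY (id) NOT ENFORCED\n" +
            ") WITH (\n" +
            "  'connector' = 'mysql-cdc',\n" +
            "  'hostname' = 'localhost',\n" +
            "  'port' = '3306',\n" +
            "  'username' = 'flinkuser',\n" +
            "  'password' = 'flinkpw',\n" +
            "  'database-name' = 'user_db_[0-9]+',\n" +
            "  'table-name' = 'users_[0-9]+'\n" +
            ")");
        // The metadata columns identify which physical shard each row came from.
        tEnv.executeSql("SELECT db_name, tbl_name, op_ts, id, name FROM all_users").print();
    }
}
```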
A DataStream API supporting concurrent reading
In version 2.0, features such as the lock-free algorithm and concurrent reading were exposed only through the SQL API, not through the DataStream API. Version 2.1 adds DataStream API support: sources can be created through MySqlSourceBuilder, users can capture data from multiple tables at once to build whole-database synchronization pipelines, and schema changes can be captured as well via MySqlSourceBuilder#includeSchemaChanges. A sketch follows below.
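This minimal sketch assumes the Flink CDC 2.1 MySQL connector is on the classpath; the connection settings are placeholders and the JSON deserializer is just one of the shipped options.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import com.ververica.cdc.connectors.mysql.source.MySqlSource;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;

public class WholeDatabaseSync {
    public static void main(String[] args) throws Exception {
        // Placeholder host and credentials; 'app_db.*' captures every table in app_db.
        MySqlSource<String> source = MySqlSource.<String>builder()
                .hostname("localhost")
                .port(3306)
                .databaseList("app_db")
                .tableList("app_db.*")
                .username("flinkuser")
                .password("flinkpw")
                .includeSchemaChanges(true) // also emit schema change events
                .deserializer(new JsonDebeziumDeserializationSchema())
                .build();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(3000); // checkpoints drive resumable, exactly-once reading

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "MySQL CDC Source")
           .setParallelism(4) // snapshot chunks are read concurrently by 4 readers
           .print();

        env.execute("whole-database-sync");
    }
}
```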
Support for the currentFetchEventTimeLag, currentEmitEventTimeLag, and sourceIdleTime monitoring metrics
These metrics follow the connector metrics specification defined in FLIP-33 [1]; see FLIP-33 for the meaning of each metric. Among them, currentEmitEventTimeLag records the difference between the time a record leaves the source and the time the record was generated in the database, and thus measures the delay between a change being produced in the database and leaving the source node. Users can use this metric to judge whether the source has entered the binlog reading phase:
when the metric is 0, the source is still reading the full historical data;
when it is greater than 0, the source has entered the binlog reading phase.
3. Detailed Explanation of the New Oracle CDC Connector
Oracle is also a very widely used database. The Oracle CDC connector supports capturing and recording row-level changes that occur in an Oracle database server. The principle is to use the LogMiner [2] tool provided by Oracle, or the native XStream API [3], to obtain change data from Oracle.
LogMiner is an analysis tool provided by Oracle Database that can parse Oracle redo log files and turn the database's change logs into change events. When using LogMiner, the Oracle server imposes strict resource limits on the log-parsing process, so parsing is slower for particularly large tables; the advantage is that LogMiner can be used free of charge.
The XStream API is an internal interface that Oracle Database provides for Oracle GoldenGate (OGG). Clients can obtain change events through the XStream API very efficiently, because the change data is read not from redo log files but directly from a block of memory in the Oracle server, saving the overhead of writing data to log files and parsing them again. The downside is that an Oracle GoldenGate (OGG) license must be purchased.
The Oracle CDC connector supports both LogMiner and the XStream API for capturing change events, and in theory can support all Oracle versions; Oracle 11, 12, and 19 are currently covered by tests in the Flink CDC project. Using the Oracle CDC connector, users only need to declare Flink SQL like the following to capture changed data in an Oracle database in real time:
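The following is a minimal sketch, not the release's verbatim example; the host, credentials, and the INVENTORY.PRODUCTS schema/table names are hypothetical placeholders.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class OracleCdcExample {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());
        // Placeholder connection settings, database SID, and schema/table names.
        tEnv.executeSql(
            "CREATE TABLE products (\n" +
            "  ID INT,\n" +
            "  NAME STRING,\n" +
            "  DESCRIPTION STRING,\n" +
            "  PRIMARY KEY (ID) NOT ENFORCED\n" +
            ") WITH (\n" +
            "  'connector' = 'oracle-cdc',\n" +
            "  'hostname' = 'localhost',\n" +
            "  'port' = '1521',\n" +
            "  'username' = 'flinkuser',\n" +
            "  'password' = 'flinkpw',\n" +
            "  'database-name' = 'XE',\n" +
            "  'schema-name' = 'INVENTORY',\n" +
            "  'table-name' = 'PRODUCTS',\n" +
            // 'initial' reads full data first and then incremental changes;
            // 'latest-offset' reads only incremental changes.
            "  'scan.startup.mode' = 'initial'\n" +
            ")");
        tEnv.executeSql("SELECT * FROM products").print();
    }
}
```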
Leveraging Flink's rich surrounding ecosystem, users can then easily write the changes to various downstream storages such as message queues, data warehouses, and data lakes.
The Oracle CDC connector hides the underlying CDC details. For the entire real-time synchronization pipeline, users only need a few lines of Flink SQL to capture and deliver Oracle data changes in real time, without developing any Java code.
In addition, the Oracle CDC connector provides two working modes: reading full data plus incremental change data, or reading only incremental change data (toggled via the scan.startup.mode option shown in the sketch above). In both modes, the Flink CDC framework guarantees exactly-once semantics, with no duplicates and no loss.
4. Detailed Explanation of the New MongoDB CDC Connector
The MongoDB CDC connector does not depend on Debezium; it was developed independently within the Flink CDC project. The MongoDB CDC connector supports capturing and recording real-time change data in a MongoDB database. Its principle is to disguise itself as a replica in the MongoDB replica set [4]: by leveraging MongoDB's high-availability mechanism, a replica can obtain the complete oplog (operation log) event stream from the primary node, and the Change Streams API provides the ability to subscribe to these oplog event streams in real time and push them to subscribing applications.
Update events obtained from the Change Streams API do not carry the pre-image of the update, i.e. the row's state before the change, so the MongoDB CDC source can only act as an upsert source. However, the Flink framework automatically adds a Changelog Normalize node for MongoDB CDC to complete the pre-image of update events (that is, to produce the UPDATE_BEFORE events), thereby guaranteeing the semantic correctness of the CDC data.
Using the MongoDB CDC connector, users only need to declare Flink SQL like the following to capture full and incremental change data in a MongoDB database in real time. With Flink's powerful integration capabilities, users can then easily synchronize the data in MongoDB to any downstream storage that Flink supports.
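Again a minimal sketch rather than the release's verbatim example; the replica set address, credentials, and the inventory.products collection are hypothetical placeholders. Note that _id is declared as the primary key.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class MongoDbCdcExample {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());
        // Placeholder replica set address, credentials, and collection name.
        tEnv.executeSql(
            "CREATE TABLE products (\n" +
            "  _id STRING,\n" +
            "  name STRING,\n" +
            "  price DOUBLE,\n" +
            "  PRIMARY KEY (_id) NOT ENFORCED\n" +
            ") WITH (\n" +
            "  'connector' = 'mongodb-cdc',\n" +
            "  'hosts' = 'localhost:27017',\n" +
            "  'username' = 'flinkuser',\n" +
            "  'password' = 'flinkpw',\n" +
            "  'database' = 'inventory',\n" +
            "  'collection' = 'products'\n" +
            ")");
        tEnv.executeSql("SELECT * FROM products").print();
    }
}
```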
Throughout the data capture process, users do not need to learn MongoDB's replica set mechanism and principles, which greatly simplifies the process and lowers the barrier to use. MongoDB CDC also supports two startup modes (see the sketch after this list):
The default initial mode first synchronizes the existing data in the collection and then synchronizes the incremental data;
the latest-offset mode synchronizes only the incremental data from the current point in time onward.
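A sketch of switching between the two modes. One assumption to flag loudly: to our understanding, in this generation of the connector the switch is the copy.existing option ('true' behaves like the initial mode, 'false' like the latest-offset mode); verify the option name against the connector documentation for the version you run.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class MongoStartupModeExample {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());
        tEnv.executeSql(
            "CREATE TABLE products_incremental_only (\n" +
            "  _id STRING,\n" +
            "  name STRING,\n" +
            "  PRIMARY KEY (_id) NOT ENFORCED\n" +
            ") WITH (\n" +
            "  'connector' = 'mongodb-cdc',\n" +
            "  'hosts' = 'localhost:27017',\n" +
            "  'username' = 'flinkuser',\n" +
            "  'password' = 'flinkpw',\n" +
            "  'database' = 'inventory',\n" +
            "  'collection' = 'products',\n" +
            // Assumption: 'copy.existing' = 'true' (the default) corresponds to the
            // initial mode; 'false' skips the snapshot, i.e. latest-offset mode.
            "  'copy.existing' = 'false'\n" +
            ")");
        tEnv.executeSql("SELECT * FROM products_incremental_only").print();
    }
}
```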
In addition, MongoDB CDC provides rich configuration and tuning options; in production environments, these can greatly improve the performance and stability of the real-time pipeline.