
Flink CDC Series – Part 5: Implement Real-Time Writing of MySQL Data to Apache Doris

Part 5 of this 5-part series explains how to use Flink CDC and the Doris Flink Connector to monitor data from MySQL databases and write it to Apache Doris tables in real time.

This article uses an example to demonstrate how to use Flink CDC and the Doris Flink Connector to monitor data from MySQL databases and store it in the corresponding tables of the Apache Doris data warehouse in real time. The main contents include:

  1. What Is CDC?
  2. Flink CDC
  3. What Is Flink Doris Connector?
  4. Usage Examples

1. What Is CDC?

Change Data Capture (CDC) synchronizes incremental change records from source databases to one or more data destinations (sinks). During synchronization, you can also process the data, for example with GROUP BY aggregations or multi-table JOINs.

For example, on an e-commerce platform, user orders are written to a source database in real time. Department A needs the real-time data aggregated per minute and saved to Redis for queries. Department B needs the current day's data temporarily stored in Elasticsearch for report display. Department C needs the data stored in ClickHouse for a real-time data warehouse. Over time, departments D and E will have data analysis requirements of their own. In this scenario, the traditional approach of copying and distributing the data as multiple replicas is inflexible, whereas CDC captures each change record once and then processes and delivers it to multiple destinations in real time.

Application Scenarios for CDC

  • Data Synchronization: used for backup and disaster recovery
  • Data Distribution: one data source is distributed to multiple downstream systems.
  • Data Collection: ETL data integration for data warehouses and databases, for which CDC is an important data source

CDC has many technical solutions. Currently, the mainstream implementation mechanisms in the industry can be divided into two types:

CDC Based on Queries

  • Query jobs are scheduled offline and processed in batches: a table is synchronized to other systems by querying it for the latest data on each run (a minimal sketch follows this list).
  • Data consistency cannot be guaranteed; the data may have changed several times between two queries.
  • Real-time performance is not guaranteed; offline scheduling has an inherent delay.
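
The following is a minimal sketch of query-based CDC, assuming a hypothetical orders table with an update_time column; each scheduled run pulls the rows changed since the previous run's watermark:

-- Query-based CDC sketch: poll on a schedule for rows changed since the
-- last run; changes between two polls are collapsed or missed entirely.
SELECT id, name, update_time
FROM orders
WHERE update_time > '2021-11-01 00:00:00'; -- watermark from the previous run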

CDC Based on Logs

  • Logs are consumed in real time with stream processing. For example, MySQL's binlog records every change in the database, so the binlog file can be used as the data source of a stream.
  • Data consistency is ensured because the binlog file contains the details of all historical changes.
  • Real-time performance is ensured because log files like the binlog can be consumed as a stream and provide real-time data.

2. Flink CDC

Flink added CDC support in version 1.11. CDC stands for Change Data Capture, but the name alone does not explain much, so let's look at what CDC does starting from the previous data architecture.


The previous MySQL binlog processing pipeline worked as follows: Canal listens to the binlog and writes the log records to Kafka, and Apache Flink consumes the Kafka data in real time to synchronize the MySQL data elsewhere. It can be divided into the following phases:

  1. MySQL enables binlog.
  2. Canal synchronizes binlog data and writes it to Kafka.
  3. Flink reads binlog data in Kafka for related business processing.

The overall processing link is long and requires many components. With Apache Flink CDC, Flink can read the binlog from the database directly for downstream business computing and analysis.

Characteristics of Flink Connector MySQL CDC 2.0

It provides MySQL CDC 2.0. The core features include:

  • Concurrent Read: the read performance for full data can be scaled horizontally.
  • Lock-Free: it poses no risk of locking the online business.
  • Resumable Reads: checkpoints are supported during the full read phase.

Publicly shared test results for the customer table in the TPC-DS dataset (Flink version 1.13.1, 65 million rows in the customer table, source parallelism of 8) show that in the full read phase:

  • MySQL CDC 2.0 takes 13 minutes.
  • MySQL CDC 1.4 takes 89 minutes.
  • Read performance is improved 6.8 times.

3. What Is Flink Doris Connector?

Flink Doris Connector is an extension from the Doris community that allows Flink to read and write Doris data tables. Doris currently supports Flink 1.11.x, 1.12.x, and 1.13.x, with Scala 2.12.x.

Currently, the Flink Doris Connector controls warehousing through two parameters:

  1. sink.batch.size: write after this many rows have been buffered. The default value is 100.
  2. sink.batch.interval: write after this many seconds have elapsed. The default value is 1 second.

Both parameters are in effect at the same time, and whichever condition is met first triggers a write to the Doris table (both parameters appear in the sink table definition in Section 4.6.2).

Note: HTTP v2 must be enabled by configuring enable_http_server_v2=true in fe.conf. Also, because the BE list is obtained through the FE HTTP REST API, the configured user needs admin permissions.
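
For reference, the relevant line in fe.conf looks like the following (restarting the FE afterward is assumed so the change takes effect):

# fe.conf
enable_http_server_v2 = true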

4. Usage Examples

4.1 Flink Doris Connector Compilation

First of all, we need to compile Doris's Flink connector. You can download it here.

Note: Since Doris's Flink Connector is developed based on Scala 2.12.x, select the Flink version that corresponds to Scala 2.12. If you download the corresponding jar from the preceding address, you can skip the compilation steps below.

Compile inside Doris's Docker build environment apache/incubator-doris:build-env-1.2. Do not use build-env-1.3, since its JDK is version 11, which causes compilation problems.

Execute the following in the source directory extension/flink-doris-connector/:

sh build.sh

After the compilation succeeds, the file doris-flink-1.0.0-SNAPSHOT.jar is generated in the output/ directory. Copy this file to the Flink classpath to use the Flink Doris Connector. For example, if Flink runs in Local mode, put this file in the lib/ folder. If Flink runs in YARN cluster mode, put this file in the pre-deployment package.

Adaptation Issues for Flink 1.13.x

   <properties>
        <scala.version>2.12</scala.version>
        <flink.version>1.11.2</flink.version>
        <libthrift.version>0.9.3</libthrift.version>
        <arrow.version>0.15.1</arrow.version>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <doris.home>${basedir}/../../</doris.home>
        <doris.thirdparty>${basedir}/../../thirdparty</doris.thirdparty>
    </properties>

You need to change flink.version here to the same version as your Flink cluster and then compile again.
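
For example, to match the Flink 1.13.3 cluster used later in this article, the property would become:

    <flink.version>1.13.3</flink.version>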

4.2 Configure Flink

Here, we use the Flink SQL Client approach.

The software versions used:

  1. MySQL 8.x
  2. Apache Flink: 1.13.3
  3. Apache Doris: 0.14.13.1

4.2.1 Install Flink

First, download and install Flink.

The local standalone mode is used here:

# wget https://dlcdn.apache.org/flink/flink-1.13.3/flink-1.13.3-bin-scala_2.12.tgz
# tar zxvf flink-1.13.3-bin-scala_2.12.tgz

Download the Flink CDC related JAR package

Note: Pay attention to the correspondence between the Flink CDC version and the Flink version.


  • Copy the downloaded or compiled Flink Doris Connector jar package to the lib directory under the Flink root directory.
  • Copy the Flink CDC jar package to the lib directory under the Flink root directory as well (example commands below).
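
As an illustration only (the Flink CDC jar name and version below are assumptions; choose the version that matches your Flink release), the download and copies might look like this:

# Hypothetical versions; adjust to match your Flink release
wget https://repo1.maven.org/maven2/com/ververica/flink-sql-connector-mysql-cdc/2.0.2/flink-sql-connector-mysql-cdc-2.0.2.jar
cp flink-sql-connector-mysql-cdc-2.0.2.jar flink-1.13.3/lib/
cp doris-flink-1.0.0-SNAPSHOT.jar flink-1.13.3/lib/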


4.2.2 Start Flink

Here, we use the local standalone mode:

# bin/start-cluster.sh
Starting cluster.
Starting standalonesession daemon on host doris01.
Starting taskexecutor daemon on host doris01.

Open the Flink web UI in a browser (the default port is 8081) to confirm the cluster has started normally.


4.3 Install Apache Doris

Please refer to this link (article in Chinese) for specific methods of installing and deploying Doris.

4.4 Install and Configure MySQL

1.  Install and configure MySQL using Docker. Please refer to this link (article in Chinese) for details.

2.  Enable the MySQL binlog: enter the Docker container, modify the /etc/my.cnf file, and add the following content under [mysqld]:

log_bin=mysql_bin
binlog-format=Row
server-id=1

Then, restart MySQL:

systemctl restart mysqld
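
To confirm the settings took effect, you can check from the MySQL client:

-- Both queries should show log_bin = ON and binlog_format = ROW
SHOW VARIABLES LIKE 'log_bin';
SHOW VARIABLES LIKE 'binlog_format';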

3.  Create a MySQL database table:

CREATE TABLE `test_cdc` (
  `id` int NOT NULL AUTO_INCREMENT,
  `name` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
 ) ENGINE=InnoDB;

4.5 Create Doris Table

CREATE TABLE `doris_test` (
  `id` int NULL COMMENT "",
  `name` varchar(100) NULL COMMENT ""
 ) ENGINE=OLAP
 UNIQUE KEY(`id`)
 COMMENT "OLAP"
 DISTRIBUTED BY HASH(`id`) BUCKETS 1
 PROPERTIES (
 "replication_num" = "3",
 "in_memory" = "false",
 "storage_format" = "V2"
 );

4.6 Start Flink SQL Client

./bin/sql-client.sh embedded
> set execution.result-mode=tableau;


4.6.1 Create a Flink CDC MySQL Mapping Table

CREATE TABLE test_flink_cdc ( 
  id INT, 
  name STRING,
  primary key(id)  NOT ENFORCED
) WITH ( 
  'connector' = 'mysql-cdc', 
  'hostname' = 'localhost', 
  'port' = '3306', 
  'username' = 'root', 
  'password' = 'password', 
  'database-name' = 'demo', 
  'table-name' = 'test_cdc' 
);

Execute a query against the newly created MySQL mapping table to verify that the data is displayed normally:

select * from test_flink_cdc;


4.6.2 Create a Flink Doris Table Mapping Table

Use the Doris Flink Connector to create a Doris mapping table:

CREATE TABLE doris_test_sink (
   id INT,
   name STRING
) 
WITH (
  'connector' = 'doris',
  'fenodes' = 'localhost:8030',
  'table.identifier' = 'db_audit.doris_test',
  'sink.batch.size' = '2',
  'sink.batch.interval'='1',
  'username' = 'root',
  'password' = ''
);

Execute the preceding statement on the command line, and you can see that the table is created. Then, execute a query statement to verify whether it works normally:

select * from doris_test_sink;


Execute the insert operation to insert the MySQL data into Doris using Flink CDC and Doris Flink Connector:

INSERT INTO doris_test_sink SELECT id, name FROM test_flink_cdc;


After submission, we can see the relevant job information on the web interface of Flink:


4.6.3 Insert Data into the MySQL Table

INSERT INTO test_cdc VALUES (123, 'this is a update');
INSERT INTO test_cdc VALUES (1212, ' test flink CDC');
INSERT INTO test_cdc VALUES (1234, 'this is a test');
INSERT INTO test_cdc VALUES (11233, 'zhangfeng_1');
INSERT INTO test_cdc VALUES (21233, 'zhangfeng_2');
INSERT INTO test_cdc VALUES (31233, 'zhangfeng_3');
INSERT INTO test_cdc VALUES (41233, 'zhangfeng_4');
INSERT INTO test_cdc VALUES (51233, 'zhangfeng_5');
INSERT INTO test_cdc VALUES (61233, 'zhangfeng_6');
INSERT INTO test_cdc VALUES (71233, 'zhangfeng_7');
INSERT INTO test_cdc VALUES (81233, 'zhangfeng_8');
INSERT INTO test_cdc VALUES (91233, 'zhangfeng_9');

4.6.4 Observe the Data of the Doris Table

First, stop the INSERT INTO job. In the local standalone mode there is only one task slot, so the running job must be stopped before another query can run. After that, the data can be seen by executing a query statement on the command line.
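
You can also verify the data on the Doris side directly. Doris speaks the MySQL protocol, and 9030 is the default FE query port (the connection details below assume this article's local setup):

# Connect to Doris through its MySQL-protocol port (default 9030)
mysql -h 127.0.0.1 -P 9030 -uroot
mysql> select * from db_audit.doris_test;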


4.6.5 Modify MySQL Data

Restart the INSERT INTO job:


Modify the data in the MySQL table:

update test_cdc set name='This is an operation to verify modification' where id = 123;

If you look at the data in the Doris table again, you will find that it has been modified.

Note: When you modify data in the MySQL table, the data in Doris is modified as well. This works because the Doris table uses the Unique Key data model; the other data models (Aggregate Key and Duplicate Key) cannot update data this way.


4.6.6 Delete Data Operations

Currently, the Doris Flink Connector does not support the delete operation, but support is planned to be added later.
