One-Click Database Synchronization from MongoDB to Paimon Using Flink CDC

This article explores the process of achieving one-click database synchronization from MongoDB to Paimon using Flink CDC.

By Jinsong Li (Zhixin)

Introduction

MongoDB is a mature document database commonly used in business scenarios. Data from MongoDB is often collected and stored in a data warehouse or data lake for analysis purposes.

Flink MongoDB CDC is a connector provided by the Flink CDC community for capturing change data. It enables connecting to MongoDB databases and collections to capture changes such as document additions, updates, replacements, and deletions.

Apache Paimon (incubating) is a streaming data lake storage technology that offers high-throughput, low-latency data ingestion, streaming subscription, and real-time query capabilities.

Paimon CDC

Paimon CDC is a tool that integrates Flink CDC, Kafka, and Paimon to simplify the process of writing data into a data lake with just one click.

You can write Flink CDC data into Paimon with Flink SQL or the Flink DataStream API, or you can use the CDC tool that Paimon provides. What are the differences between these two methods?

[Figure 1]

The above figure illustrates how to write data into the data lake using Flink SQL. This method is straightforward. However, when a new column is added to the source table, the change is not synchronized: the downstream Paimon table keeps its original schema and never receives the new column.

[Figure 2]

The above figure illustrates how data is synchronized using Paimon CDC. As shown, when a new column is added to the source table, the streaming job automatically detects the new column and propagates it to the downstream Paimon table, so the schema evolves together with the data.
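The difference between the two methods can be sketched with a toy model. This is not Paimon's implementation; the function names (writeFixed, writeEvolving, readAll) are hypothetical, and a schema is reduced to a plain list of column names.

```javascript
// Toy model only: this is NOT Paimon's implementation.
// A "schema" is an ordered list of column names.

// Fixed-schema sink: fields unknown to the schema are silently dropped.
function writeFixed(schema, rows, doc) {
  const row = {};
  for (const col of schema) row[col] = col in doc ? doc[col] : null;
  rows.push(row);
}

// Evolving sink: unknown fields are appended to the schema before the write.
function writeEvolving(schema, rows, doc) {
  for (const col of Object.keys(doc)) {
    if (!schema.includes(col)) schema.push(col);
  }
  rows.push({ ...doc });
}

// Reading projects every stored row onto the current schema;
// rows written before a column existed yield null for that column.
function readAll(schema, rows) {
  return rows.map((row) => schema.map((col) => (col in row ? row[col] : null)));
}

const fixedSchema = ["id", "price"];
const fixedRows = [];
const evolvingSchema = ["id", "price"];
const evolvingRows = [];

for (const doc of [{ id: 1, price: 5 }, { id: 2, price: 6, desc: "haha" }]) {
  writeFixed(fixedSchema, fixedRows, doc);
  writeEvolving(evolvingSchema, evolvingRows, doc);
}

console.log(fixedSchema, readAll(fixedSchema, fixedRows));
console.log(evolvingSchema, readAll(evolvingSchema, evolvingRows));
```

In the fixed variant the desc field never reaches the table, while the evolving variant gains a third column and reports null for the row that was written before the column existed.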

In addition, Paimon CDC also provides whole database synchronization.

[Figure 3]

Whole database synchronization offers the following benefits:

  1. Synchronizes multiple tables within a single job, enabling low-cost synchronization of a large number of small tables.
  2. Automatically performs schema evolution within the job.
  3. Automatically synchronizes new tables without the need to restart the job, allowing for seamless synchronization.
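The first and third benefits can be pictured as a single loop that routes each change record by its source collection and creates the target table the first time a collection is seen. The sketch below is a toy model under that assumption; routeChanges and the event shape are hypothetical and do not reflect Paimon's internal classes.

```javascript
// Toy model only: a Map from table name to an array of rows stands in for the lake.
function routeChanges(events) {
  const tables = new Map();
  for (const { collection, doc } of events) {
    // The first record from an unseen collection creates its table on the fly,
    // so no restart is needed when a new collection appears.
    if (!tables.has(collection)) tables.set(collection, []);
    tables.get(collection).push(doc);
  }
  return tables;
}

// One stream of events feeds several tables within the same "job".
const tables = routeChanges([
  { collection: "orders", doc: { id: 1, price: 5 } },
  { collection: "orders", doc: { id: 2, price: 6 } },
  { collection: "brands", doc: { id: 1, brand: "NBA" } },
]);

for (const [name, rows] of tables) console.log(name, rows.length);
```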

Demo Description

You can follow the demo steps to experience the fully automatic synchronization capabilities of Paimon CDC. The demo shows how to synchronize data from MongoDB to Paimon, as shown in the following figure.

[Figure 4]

The following demo uses Flink to write data into the data lake and Spark SQL for querying. You can also query with Flink SQL or with other computing engines such as Trino, Presto, StarRocks, Doris, and Hive.

Demo Preparation

Step 1:

Download the free version of MongoDB Community Server.

https://www.mongodb.com/try/download/community

Start MongoDB Server.

mkdir /tmp/mongodata 
./mongod --replSet rs0 --dbpath /tmp/mongodata

Note: In this example, replSet is enabled. MongoDB generates a changelog only when replSet is enabled, and Flink MongoDB CDC relies on that changelog to read CDC data incrementally.

Step 2:

Download MongoDB Shell.

https://www.mongodb.com/try/download/shell

Start MongoDB Shell.

./mongosh

In addition, you need to initialize the replSet. Otherwise, the MongoDB server will keep reporting an error.

rs.initiate()

Step 3:

Go to the official website to download Flink. This demo uses Flink 1.18.0.

https://www.apache.org/dyn/closer.lua/flink/flink-1.18.0/flink-1.18.0-bin-scala_2.12.tgz

Download the following jar files in sequence to the lib directory of Flink.

paimon-flink-1.18-0.6-*.jar, the Paimon-Flink integrated jar file:

https://repository.apache.org/snapshots/org/apache/paimon/paimon-flink-1.18/0.6-SNAPSHOT/

flink-shaded-hadoop-*.jar. Paimon requires Hadoop-related dependencies:

https://repo.maven.apache.org/maven2/org/apache/flink/flink-shaded-hadoop-2-uber/2.8.3-10.0/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar

flink-sql-connector-mongodb-cdc-*.jar:

https://repo1.maven.org/maven2/com/ververica/flink-sql-connector-mongodb-cdc/2.4.2/flink-sql-connector-mongodb-cdc-2.4.2.jar

Set the checkpoint interval in the flink/conf/flink-conf.yaml file.

execution.checkpointing.interval: 10 s

This short interval is for the demo only and is not recommended in production. If the interval is too short, a large number of files is generated, which increases costs. Generally, the recommended checkpoint interval is 1-5 minutes.
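A quick calculation shows why the interval matters. The sketch assumes that every checkpoint leads to one commit that writes at least one new set of files; the actual number of files per commit depends on the tables and the data volume, so only the relative difference is meaningful.

```javascript
const SECONDS_PER_DAY = 24 * 60 * 60;

// Number of checkpoints (and, under the stated assumption, commits) per day.
function checkpointsPerDay(intervalSeconds) {
  return Math.floor(SECONDS_PER_DAY / intervalSeconds);
}

for (const [label, seconds] of [["10 s", 10], ["1 min", 60], ["5 min", 300]]) {
  console.log(`${label}: ${checkpointsPerDay(seconds)} checkpoints per day`);
}
```

Moving from 10 seconds to 1 minute cuts the number of checkpoints per day by a factor of 6, and moving to 5 minutes cuts it by a factor of 30.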

Start a Flink cluster.

./bin/start-cluster.sh

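Start a Flink synchronization task. The command runs the Paimon action jar (paimon-flink-action), so that jar must also be present in the lib directory.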

./bin/flink run lib/paimon-flink-action-0.6-*.jar \
    mongodb-sync-database \
    --warehouse /tmp/warehouse1 \
    --database test \
    --mongodb-conf hosts=127.0.0.1:27017 \
    --mongodb-conf database=test \
    --table-conf bucket=1

Parameter description:

  1. --warehouse specifies the file system directory where Paimon stores its data. If you have an HDFS cluster or OSS, you can replace it with a directory there.
  2. --mongodb-conf passes the MongoDB-related configurations. Add the password if one is required.
  3. --table-conf bucket specifies the number of buckets. Currently, whole database synchronization only supports tables with fixed buckets. For special requirements, you can modify the number of buckets for some tables.
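The command follows a regular pattern: every MongoDB setting becomes one --mongodb-conf key=value pair and every table option becomes one --table-conf key=value pair. The helper below is a hypothetical sketch (buildSyncArgs is not part of Paimon) that assembles the argument list from an options object, using only the flags that appear in the command above.

```javascript
// Hypothetical helper: turns an options object into the argument list
// that follows "flink run <action jar>" in the command above.
function buildSyncArgs({ warehouse, database, mongodbConf = {}, tableConf = {} }) {
  const args = ["mongodb-sync-database", "--warehouse", warehouse, "--database", database];
  // Each entry is passed as a separate, repeated flag.
  for (const [key, value] of Object.entries(mongodbConf)) {
    args.push("--mongodb-conf", `${key}=${value}`);
  }
  for (const [key, value] of Object.entries(tableConf)) {
    args.push("--table-conf", `${key}=${value}`);
  }
  return args;
}

const args = buildSyncArgs({
  warehouse: "/tmp/warehouse1",
  database: "test",
  mongodbConf: { hosts: "127.0.0.1:27017", database: "test" },
  tableConf: { bucket: 1 },
});

console.log(args.join(" "));
```

Credentials, if needed, would be further mongodbConf entries; their exact key names should be taken from the Flink MongoDB CDC connector documentation rather than from this sketch.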

[Figure 5]

As you can see, the job has been started. The topology contains three main nodes:

  1. Source: the Flink MongoDB CDC source. It handles schema evolution and automatically picks up new tables.
  2. CDC MultiplexWriter: the Paimon table writer that serves multiple tables. It adds tables dynamically and automatically.
  3. Multiplex Global Committer: the file submission node of the two-phase commit protocol.

Both the Writer and the Committer may become bottlenecks. Their concurrency depends on the Flink configuration.

To keep compaction from becoming the Writer's bottleneck, you can turn on the fully asynchronous mode:

https://paimon.apache.org/docs/master/maintenance/write-performance/#asynchronous-compaction
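The division of labor between the writer and the committer nodes in the topology above can be sketched as a toy two-phase commit. The class and method names (ToyWriter, ToyGlobalCommitter, prepareCommit, commit) are hypothetical and are not Paimon's or Flink's API; the sketch only shows that written data stays invisible until the committer publishes it.

```javascript
// Toy model only; names are hypothetical, not Paimon's or Flink's API.
class ToyWriter {
  constructor(table) {
    this.table = table;
    this.buffer = [];
    this.fileSeq = 0;
  }
  write(record) {
    this.buffer.push(record);
  }
  // Phase 1: flush buffered records into a file that readers cannot see yet.
  prepareCommit() {
    if (this.buffer.length === 0) return [];
    const file = {
      table: this.table,
      name: `${this.table}-data-${this.fileSeq++}`,
      records: this.buffer,
    };
    this.buffer = [];
    return [file];
  }
}

class ToyGlobalCommitter {
  constructor() {
    this.snapshots = [];
  }
  // Phase 2: publish all files of one checkpoint, across all tables, in one step.
  commit(checkpointId, files) {
    if (files.length > 0) this.snapshots.push({ checkpointId, files });
  }
  // Readers only see records that belong to a committed snapshot.
  visibleRecords(table) {
    return this.snapshots
      .flatMap((snapshot) => snapshot.files)
      .filter((file) => file.table === table)
      .flatMap((file) => file.records);
  }
}

const ordersWriter = new ToyWriter("orders");
const brandsWriter = new ToyWriter("brands");
const committer = new ToyGlobalCommitter();

ordersWriter.write({ id: 1, price: 5 });
brandsWriter.write({ id: 1, brand: "NBA" });
const beforeCommit = committer.visibleRecords("orders").length;

const pending = [...ordersWriter.prepareCommit(), ...brandsWriter.prepareCommit()];
committer.commit(1, pending);
const afterCommit = committer.visibleRecords("orders").length;

console.log(beforeCommit, afterCommit);
```

Because a single committer publishes the files of every table, it is easy to see how it can become a bottleneck when many tables are written at once.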

Step 4:

Go to the official website to download Spark. The Paimon jar below targets Spark 3.5.

https://spark.apache.org/downloads.html

Download the Paimon-Spark integrated jar files.

https://repository.apache.org/content/groups/snapshots/org/apache/paimon/paimon-spark-3.5/0.6-SNAPSHOT/

Start Spark SQL.

./bin/spark-sql \
    --conf spark.sql.catalog.paimon=org.apache.paimon.spark.SparkCatalog \
    --conf spark.sql.catalog.paimon.warehouse=file:/tmp/warehouse1

Switch to the Paimon catalog and select the database that the synchronization job writes to (test, as set by --database).

USE paimon;
USE test;

Demo Procedure

Step 1:

First, test that the written data can be read.

Insert a piece of data into MongoDB.

db.orders.insertOne({id: 1, price: 5})

Query in Spark SQL.

[Figure 6]

As shown, the document is synchronized to Paimon, and the orders table has an additional column "_id", which is the implicit primary key automatically generated by MongoDB.

Step 2:

This step shows how updates are synchronized.

Update the document in Mongo Shell.

db.orders.updateOne({id: 1}, {$set: {price: 8}})

Query in Spark.

[Figure 7]

The price of the data is updated to 8.
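Steps 1 and 2 behave like an upsert on a table keyed by "_id". The toy model below illustrates that idea; applyChange and the event shape are hypothetical simplifications, not the MongoDB change stream format or Paimon's merge logic, and the key "a1" stands in for a generated ObjectId.

```javascript
// Toy model only: a Map keyed by _id stands in for a primary-key table.
function applyChange(table, event) {
  switch (event.op) {
    case "insert":
    case "replace":
      // The whole row for this key is written (or overwritten).
      table.set(event._id, { ...event.doc });
      break;
    case "update":
      // Changed fields are merged into the existing row with the same key.
      table.set(event._id, { ...table.get(event._id), ...event.fields });
      break;
    case "delete":
      table.delete(event._id);
      break;
    default:
      throw new Error(`unknown op: ${event.op}`);
  }
}

const ordersTable = new Map();
applyChange(ordersTable, { op: "insert", _id: "a1", doc: { _id: "a1", id: 1, price: 5 } });
applyChange(ordersTable, { op: "update", _id: "a1", fields: { price: 8 } });

console.log([...ordersTable.values()]);
```

The table still holds a single row for the key after the update, with the new price, which matches what the Spark query shows.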

Step 3:

This step checks the synchronization of added fields.

In Mongo Shell, insert a new document that contains a new field.

db.orders.insertOne({id: 2, price: 6, desc: "haha"})

Query in Spark.

[Figure 8]

As shown, a column is added to the corresponding Paimon table. The query result shows that the new column is NULL for the old data.

Step 4:

This step checks the synchronization of added tables.

In Mongo Shell, insert data into a new collection.

db.brands.insertOne({id: 1, brand: "NBA"})

Query in Spark.

[Figure 9]

A table is automatically added to Paimon, and the data is synchronized.

Summary

The data ingestion program of Paimon CDC allows you to fully synchronize your business databases to Paimon automatically, including data, schema evolution, and new tables. All you need to do is to manage the Flink job. This program has been deployed to various industries and companies, bringing the ability to easily mirror business data to the lake storage.

There are more data sources for you to explore: MySQL, Kafka, MongoDB, Pulsar, and PostgreSQL.

Paimon’s long-term goals include:

• Extreme ease of use, high-performance data ingestion, convenient data lake storage management, and rich ecological queries.

• Easy data stream reading, integration with the Flink ecosystem, and ability to bring fresh data generated one minute ago to the business.

• Enhanced Append data processing, time travel and data sorting which bring efficient queries, and upgraded Hive data warehouses.

About Paimon

Official website: https://paimon.apache.org/.
