
From Storage Engines to Binlogs: Why MySQL’s DuckDB Integration is More Elegant than PostgreSQL’s


By Baotiao

Currently there are three mainstream solutions for integrating PostgreSQL with DuckDB: pg_duckdb, pg_mooncake, and pg_lake.

pg_duckdb is the official DuckDB plugin. It can only migrate existing row-oriented tables to DuckDB and cannot synchronize incremental data to DuckDB tables, so its applicable scenarios are relatively limited.

pg_mooncake, acquired by Databricks, supports synchronizing both existing and incremental data. It copies PostgreSQL data through logical replication via an additional pg_moonlink process and writes the data in Iceberg format. Queries against the replicated data then follow the path PostgreSQL => pg_moonlink => S3 (Iceberg).

pg_lake, acquired by Snowflake, also does not support incremental data synchronization; it only supports full data import and export. It feels more like a solution for data archiving scenarios.

We can see several problems here. First, PostgreSQL's logical replication capability is not mature enough, far inferior to its native physical replication, and it cannot connect a PostgreSQL primary and DuckDB read-only instances through logical replication. Second, PostgreSQL does not offer a truly pluggable storage engine. Although it provides the table access method API as a storage engine interface, that API does not cover capabilities such as primary-standby replication or crash recovery, so in many scenarios data consistency cannot be guaranteed.

These problems are well resolved in MySQL.

First, MySQL was designed from the start around pluggable storage engines. In the early days its default engine was MyISAM; later, because InnoDB supported row-level multi-version concurrency control (MVCC), MySQL switched its default engine to InnoDB. MySQL has also had column store engines, such as Infobright, though they never became popular. Supporting the DuckDB column store in MySQL by adding a column store engine is therefore a very natural step. It does not require a PostgreSQL-style workaround where written data first lands in the row store and is then transformed into the column store.
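To make the pluggable-engine idea concrete, here is a minimal conceptual sketch in Python, not MySQL's actual handler API: one table-creation front end dispatches to interchangeable engines, a row store (InnoDB-like) and a column store (DuckDB-like). All class and function names are illustrative inventions.

```python
# Conceptual sketch of a pluggable storage engine layer.
# Names and methods are illustrative, not MySQL's real handler interface.

from abc import ABC, abstractmethod


class StorageEngine(ABC):
    @abstractmethod
    def insert(self, row: dict) -> None: ...

    @abstractmethod
    def scan_column(self, column: str) -> list: ...


class RowStore(StorageEngine):
    """InnoDB-like: rows kept together, good for point reads and writes."""
    def __init__(self):
        self.rows = []

    def insert(self, row):
        self.rows.append(dict(row))

    def scan_column(self, column):
        # Must touch every full row just to read one column.
        return [r[column] for r in self.rows]


class ColumnStore(StorageEngine):
    """DuckDB-like: each column stored contiguously, good for analytics."""
    def __init__(self):
        self.columns = {}

    def insert(self, row):
        for k, v in row.items():
            self.columns.setdefault(k, []).append(v)

    def scan_column(self, column):
        # Reads only the requested column's contiguous array.
        return self.columns.get(column, [])


ENGINES = {"innodb_like": RowStore, "duckdb_like": ColumnStore}


def create_table(engine_name: str) -> StorageEngine:
    # Analogous to CREATE TABLE t (...) ENGINE=<name> in MySQL.
    return ENGINES[engine_name]()
```

The point of the sketch is the dispatch in `create_table`: the SQL layer stays the same while the engine behind a table changes, which is why adding a DuckDB engine fits MySQL's architecture naturally.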

Another aspect is MySQL's binary logging (binlog) mechanism. MySQL's double-log design has both disadvantages and advantages: maintaining the binlog alongside the redo log certainly affects write performance, but the binlog provides excellent support for upstream and downstream systems in the MySQL ecosystem. In statement format the binlog carries complete SQL statements (and in row format, full row images), which makes replicating changes to downstream systems very convenient. This is also why online analytical processing (OLAP) systems in the MySQL ecosystem, such as ClickHouse, StarRocks, and SelectDB, are so popular.
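For reference, enabling the binlog for downstream consumers is a small configuration change on the primary. The fragment below uses standard MySQL server options; the exact values (server ID, log base name) are placeholders.

```ini
# my.cnf fragment: enable the binlog so downstream consumers
# (replicas, OLAP systems) can subscribe to the change stream.
[mysqld]
server_id        = 1          # unique ID, required for replication
log_bin          = mysql-bin  # turn on binary logging
binlog_format    = ROW        # ROW logs changed rows; STATEMENT logs SQL text
binlog_row_image = FULL       # log all columns of each changed row
```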

When MySQL uses DuckDB as a storage engine, the MySQL binlog ecosystem remains fully compatible and intact, so such a node can serve as a data warehouse node: data written to it can still be streamed back out through the binlog. In a hybrid transactional/analytical processing (HTAP) scenario, a primary MySQL node running the InnoDB engine sends its binlog to a downstream MySQL node running the DuckDB engine, achieving fully compatible streaming.
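The HTAP flow above can be sketched as a toy pipeline: a primary emits row-based binlog-style events, and an analytic replica with a column-oriented engine applies the same stream. This is an illustration only; the event schema and class names are invented, not MySQL's binlog wire format.

```python
# Toy sketch of binlog streaming from a transactional primary to a
# column-store replica. Event dicts stand in for real binlog row events.

from typing import Iterator


def binlog_stream() -> Iterator[dict]:
    # Stand-in for row events the primary's InnoDB engine would emit.
    yield {"op": "insert", "table": "orders", "row": {"id": 1, "amount": 30}}
    yield {"op": "insert", "table": "orders", "row": {"id": 2, "amount": 70}}


class ColumnarReplica:
    """DuckDB-engine-like replica: applies row events into column arrays."""
    def __init__(self):
        self.tables: dict[str, dict[str, list]] = {}

    def apply(self, event: dict) -> None:
        if event["op"] == "insert":
            cols = self.tables.setdefault(event["table"], {})
            for k, v in event["row"].items():
                cols.setdefault(k, []).append(v)

    def sum_column(self, table: str, column: str) -> int:
        # Analytic aggregate served directly from the column array.
        return sum(self.tables[table][column])


# Replay the primary's change stream on the analytic replica.
replica = ColumnarReplica()
for ev in binlog_stream():
    replica.apply(ev)
```

Because the replica consumes the same change stream any MySQL replica would, nothing upstream has to know that the downstream engine is columnar; that is the compatibility argument the paragraph above makes.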

