×
Community Blog When DuckDB Embraces MySQL in the AI Era

When DuckDB Embraces MySQL in the AI Era

AliSQL enhances MySQL with DuckDB-powered analytical instances for high-performance HTAP capabilities while maintaining full MySQL compatibility.

By Simian and Yuanqi

1

Following the previous article, when MySQL meets the DuckDB columnar storage engine in the AI era

Product Form Evolution of the MySQL DuckDB Engine

RDS for MySQL embeds DuckDB into the MySQL kernel. This uses the performance advantages of DuckDB in analytical query scenarios and resolves the shortcomings of MySQL in analytical processing (AP) capabilities.

For product positioning, we defined a policy of "MySQL first, DuckDB second" from the start. We use DuckDB to enhance the analytical capabilities of MySQL. We focus on the user ecosystem of MySQL. We remain as compatible as possible with existing customer habits. These include syntax, protocols, and third-party tools.

For technical implementation, we integrate DuckDB into MySQL as a built-in storage engine. This gives it the capabilities of the MySQL Server layer, such as data dictionary management and binary log replication. It can also independently handle all operation requests from the MySQL Server layer, similar to native storage engines such as InnoDB. We import the DuckDB computing engine into the query SQL execution path. This ensures the query efficiency of DuckDB nodes and high compatibility with MySQL syntax.

Based on the basic capabilities of the DuckDB engine, we launched DuckDB read-only instances. A DuckDB node acts as a data replica. It replays binary logs from the primary database. It can independently provide read request services. Combined with the automatic request distribution among row store and column store nodes at the agent layer, the MySQL primary node and the DuckDB node provide a complete hybrid transactional and analytical processing (HTAP) solution.

In addition to DuckDB-based analytical read-only instances, many customers provided feedback that they want to deploy DuckDB nodes as independent database instances. This lets users directly ingest data using SQL. It can also serve as an aggregation instance for multiple source instances. Therefore, we launched the DuckDB-based analytical primary instance.

2_
Two product forms of MySQL DuckDB

DuckDB-based Analytical Primary Instances

The primary instance form greatly expands the scenarios of DuckDB. In theory, you can replace MySQL with DuckDB primary instances in all scenarios. The primary instance form also has higher requirements for the MySQL DuckDB engine. These include data security guarantees, data ingestion performance, and storage management for larger storage spaces.

The following sections introduce the core capabilities enhanced by the DuckDB engine in the primary instance form. These focus on four aspects: binary logging adaptation, high availability and data security, data ingestion capabilities, and compatibility.

The analytical performance of DuckDB is fully verified in multiple benchmark tests, such as TPC-H and ClickBench. DuckDB primary instances have the same query capabilities as read-only instances. They also have the same compression efficiency. This topic does not describe these basic capabilities.

Binary Logging Capability Adaptation

When DuckDB-based read-only instances were released, we turned off the binary logging toggle for read-only instances by default. First, read-only instances typically do not need to generate additional binary logs in user scenarios. Second, the DuckDB engine was not fully adapted to the binary logging enabled scenarios at that time. However, binary logging is an essential part of the DuckDB primary instance form. During the development of primary instances, we fully adapted binary logging. We provide the following technical capabilities:

  1. Provide a one-primary-multiple-replica cluster deployment structure. Data synchronization between the primary node and replica nodes is performed using binary logging. This supports real-time backup of binary logs and downstream binary logging subscription requirements. Note that the binary logging format of DuckDB-based analytical primary instances is currently fixed to the Mixed format (binlog_format=MIXED).
  2. Ensure consistency between binary logs and engine data. During the transaction commit phase, the DuckDB engine atomically coordinates binary log writes with DuckDB transaction commits. Even if the instance restarts abnormally during the transaction commit process, it ensures consistency between binary logs and engine data (see the next section for details).
  3. DuckDB currently adapts Binlog Cache Free Flush to eliminate the cost of Binlog Flush during large transaction commits. We will continue to adapt optimizations such as real-time binary logging transmission and real-time binary logging application. This resolves the replication delay issue in large transaction and large DDL scenarios.
  4. Support binary logging (Binlog) cascading replication and multi-source replication: The DuckDB primary instance can act as an ancestor node in a multi-level replication task. It also supports Binlog multi-source replication. The DuckDB primary instance can act as a replication downstream for multiple source instances at the same time. You can configure multiple independent replication channels. (For more information about Common scenarios of multi-source replication, see the "Customer practices" section in this article.)

High Availability and Data Security

High availability (HA) means that a database can continuously provide services. The DuckDB-based analytical primary instance uses a primary-secondary cluster architecture. The primary database performs one-way data synchronization to secondary databases through Binlog. If the primary database becomes abnormal, the system automatically triggers an HA switch to ensure continuous service availability.

3_
DuckDB primary instance deployment diagram

Data security specifically refers to the Durability attribute in the atomicity, consistency, isolation, and durability (ACID) attributes of a database. This means that after a transaction commit, the Result requires permanent retention. The Result is not lost even if the system crashes.

For a DuckDB-based analytical read-only instance, the persistence requirements can be appropriately lowered because it acts as a data replica of the primary instance. Even if some data lacks persistence when the instance crashes, you can recover the data. Simply pull the Binlog from the primary instance and replay it after a restart. Therefore, the read-only instance uses an idempotent Binlog replay mechanism. This ensures data consistency between the primary database and DuckDB-based read-only nodes.

However, persistence is a rigid requirement for the DuckDB-based analytical primary instance. In this state, data is stored only in a columnar format. You cannot rely on the primary node to rebuild data. Therefore, the DuckDB engine must guarantee data persistence in any failure scenario. The data security of the DuckDB primary instance is guaranteed mainly in the following ways:

1. Backup and restore capabilities: The backup and restore capabilities of the DuckDB-based analytical primary instance are identical to those of a standard MySQL instance. The primary instance supports regular full backups and real-time incremental backups based on Binlog. It also provides a point-in-time recovery (PITR) feature accurate to the second. You can use this feature to recover the instance to any historical Time point.

2. Kernel persistence guarantee: The DuckDB engine ensures the persistence of transactional data through the write-ahead logging (WAL) mechanism. We also modified the transaction commit path of the DuckDB engine at the MySQL layer. This ensures the atomicity of Binlog writes and DuckDB local transaction commits. During the instance crash recovery phase, the system can correctly replay Submitted transactions. This ensures the consistency among DuckDB engine data, binlog content, and binlog GTID offsets. This provides data security guarantees that align with MySQL standards.

In summary, the DuckDB-based analytical primary instance is fully comparable to MySQL high availability instances in terms of high availability and data security. It effectively guarantees the continuous availability of customer core systems and the security of key data assets.

Data Ingestion Performance

Unlike a read-only instance that only supports primary database Binlog log replay, the DuckDB-based analytical primary instance supports multiple data Ingestion methods. These methods include Binlog log replay, direct SQL writing, and Data Transmission Service (DTS) full and incremental data synchronization. The Binlog replay method is identical to that of a read-only instance. The log replay performance can reach about 300,000 rows per second. The DuckDB-based analytical primary instance also supports more flexible replication topologies, such as multi-source replication and cascading replication.

In scenarios involving direct business writes or data synchronization through DTS, the DuckDB-based analytical primary instance maintains high ingestion performance. This relies mainly on the following kernel optimization capabilities:

  1. Batch import mode: The DuckDB engine provides the duckdb_data_import_mode configuration Toggle. When this toggle is enabled, conditional INSERT and DELETE requests automatically enter the batch commit path. Multiple DML operations are cached in a temporary table. They then merge into a single transaction commit and are written to DuckDB. This significantly reduces the execution overhead of DML statements in DuckDB.
  2. DuckDB optimistic import path optimization: During the full data migration phase, import transactions enter the optimistic import path of DuckDB. In this mode, Data copying is not required during a transaction commit. Instead, the generated RowGroups are directly Linked to the table schema to complete persistence. The actual data writing to WAL logs is also skipped. Optimistic import reduces I/O amplification during the write phase and improves write throughput.
  3. Large transaction Binlog optimization: This relies on the Binlog Free Flush capability in RDS for MySQL. When a large transaction commits, you do not need to copy content from the Binlog cache to Binlog files. This eliminates performance bottlenecks of writing to Binlog during large transaction commits.

4_

We conducted benchmark testing on the write performance of the DuckDB-based analytical primary instance as a synchronization Target based on the Sysbench dataset.

• When full data is imported using LOAD DATA, the peak throughput can reach 2,000,000 rows per second.

• The peak throughput of DTS incremental synchronization can reach 300,000 rows per second.

Note that the actual synchronization speed depends on multiple factors. These include the query efficiency of the source instance, table schema complexity, and Data Transmission Service (DTS) instance type. The preceding values are the maximum end-to-end replication performance in an ideal staging environment.

RDS DuckDB-based analytical instances are deeply integrated with Alibaba Cloud DTS. When you create a DTS job, simply set the target database type to DuckDB. This automatically delivers high ingestion performance. No complex parameter settings or performance tuning are required.

Compatibility Enhancements

We continuously optimize compatibility. Recently, we made the following enhancements. To improve SQL syntax compatibility, we reverse-parse the abstract syntax tree obtained in the MySQL Prepare phase. This achieves SQL regularization. It resolves over 90% of incompatible SQL syntax and significantly improves compatibility. For functions, we use rapid iterations to support nearly 20 frequently used Json Functions and regular functions. These cover almost all common functions for online users. We are also gradually adding support for less frequently used functions. Ultimately, our functions will be fully compatible with MySQL.

Recently, we added support for character sets such as ascii and latin1 to the DuckDB engine. We also added support for generated columns. We improved the execution efficiency of COPY DDL operations. For more information about these kernel updates, see our release notes.

Customer Practices

A travel platform uses a DuckDB primary instance for data convergence

The customer needs to perform global queries on recent business orders. This requires Real-time Sync of data from 1024 sharded tables across 32 source MySQL instances to a single convergence instance for unified queries. The convergence instance must retain data for one month. The total data volume is about 10 TB. The peak synchronization Transaction Per Second (TPS) reaches 300,000. All traffic consists of order-related transactional DML operations. These include high-frequency INSERT, UPDATE, and DELETE operations.

Data convergence scenarios usually require joint analysis of multiple sharded tables. This requires data synchronization to converge data from multiple instances to a single target instance. These jobs are typically handled by AnalyticDB or distributed databases. However, we used a single DuckDB primary instance to successfully meet the customer requirements. This mainly depends on the following capabilities:

Multi-source and multi-table convergence capability: We extended the MySQL binary logging (binlog) multi-source replication feature. We support configuring table-level routing mapping rules. This lets you uniformly write the binlog streams of multiple source sharded tables to a single target table in DuckDB. To do this, adapt the table schema. For example, the source MySQL table uses an auto increment ID as the primary key. In DuckDB, use the business order number as the primary key. This avoids auto increment ID conflicts across different source instances.

Cross-transaction DML batch optimization: DuckDB merges the DML operations of multiple transactions in the binlog to achieve batch writing. This converts multiple small operations into a single batch commit to the engine. The system supports configuring the batching time window, transaction batch size, and merge policy during low traffic using parameters. This ensures low latency and significantly improves synchronization throughput.

5

The DuckDB-based analytical primary instance convergence solution successfully met the multi-source synchronization requirements of the customer. It saved the cost of deploying 32 independent synchronization links. It also smoothly supported the service traffic peaks during the New Year holiday. The DuckDB-based analytical primary instance efficiently completes all analysis queries. The duration of most SQL queries is less than 1 second. For queries involving complex operations such as multi-table JOINs and Json Functions, the performance improvement is significant compared to equivalent queries executed on the original MySQL sharded tables. In addition, columnar storage significantly reduces the storage space occupied by DuckDB. Actual tests show that under the same data volume, the disk usage of DuckDB is about 1/4 of that of InnoDB.

Not Just a MySQL Storage Engine

The DuckDB-based analytical primary instance format greatly expands the scenarios. In theory, DuckDB-based analytical primary instances can support all environments that use MySQL. Furthermore, DuckDB is not just a storage engine within MySQL. It also provides an open window for MySQL. It serves as a bridge that connects MySQL with the open data ecosystem.

Data lake integration is our key future direction. By integrating plugins such as Iceberg natively supported by DuckDB, MySQL seamlessly connects with the data lake. Applications still write data using standard MySQL APIs. The underlying data is automatically saved to Iceberg tables on S3. For the business side, the write method remains completely unchanged. For the Data Platform, the data naturally exists in the data lake in the standard Iceberg format. Businesses can also directly query tables in data lakes such as Iceberg using the SQL syntax of MySQL. This achieves a unified experience of one API and dual-mode access.

0 1 0
Share on

ApsaraDB

621 posts | 184 followers

You may also like

Comments