By Simian and Yuanqi

Following the previous article, when MySQL meets the DuckDB columnar storage engine in the AI era
RDS for MySQL embeds DuckDB into the MySQL kernel. This uses the performance advantages of DuckDB in analytical query scenarios and resolves the shortcomings of MySQL in analytical processing (AP) capabilities.
For product positioning, we defined a policy of "MySQL first, DuckDB second" from the start. We use DuckDB to enhance the analytical capabilities of MySQL. We focus on the user ecosystem of MySQL. We remain as compatible as possible with existing customer habits. These include syntax, protocols, and third-party tools.
For technical implementation, we integrate DuckDB into MySQL as a built-in storage engine. This gives it the capabilities of the MySQL Server layer, such as data dictionary management and binary log replication. It can also independently handle all operation requests from the MySQL Server layer, similar to native storage engines such as InnoDB. We import the DuckDB computing engine into the query SQL execution path. This ensures the query efficiency of DuckDB nodes and high compatibility with MySQL syntax.
Based on the basic capabilities of the DuckDB engine, we launched DuckDB read-only instances. A DuckDB node acts as a data replica. It replays binary logs from the primary database. It can independently provide read request services. Combined with the automatic request distribution among row store and column store nodes at the agent layer, the MySQL primary node and the DuckDB node provide a complete hybrid transactional and analytical processing (HTAP) solution.
In addition to DuckDB-based analytical read-only instances, many customers provided feedback that they want to deploy DuckDB nodes as independent database instances. This lets users directly ingest data using SQL. It can also serve as an aggregation instance for multiple source instances. Therefore, we launched the DuckDB-based analytical primary instance.

Two product forms of MySQL DuckDB
The primary instance form greatly expands the scenarios of DuckDB. In theory, you can replace MySQL with DuckDB primary instances in all scenarios. The primary instance form also has higher requirements for the MySQL DuckDB engine. These include data security guarantees, data ingestion performance, and storage management for larger storage spaces.
The following sections introduce the core capabilities enhanced by the DuckDB engine in the primary instance form. These focus on four aspects: binary logging adaptation, high availability and data security, data ingestion capabilities, and compatibility.
The analytical performance of DuckDB is fully verified in multiple benchmark tests, such as TPC-H and ClickBench. DuckDB primary instances have the same query capabilities as read-only instances. They also have the same compression efficiency. This topic does not describe these basic capabilities.
When DuckDB-based read-only instances were released, we turned off the binary logging toggle for read-only instances by default. First, read-only instances typically do not need to generate additional binary logs in user scenarios. Second, the DuckDB engine was not fully adapted to the binary logging enabled scenarios at that time. However, binary logging is an essential part of the DuckDB primary instance form. During the development of primary instances, we fully adapted binary logging. We provide the following technical capabilities:
binlog_format=MIXED).High availability (HA) means that a database can continuously provide services. The DuckDB-based analytical primary instance uses a primary-secondary cluster architecture. The primary database performs one-way data synchronization to secondary databases through Binlog. If the primary database becomes abnormal, the system automatically triggers an HA switch to ensure continuous service availability.

DuckDB primary instance deployment diagram
Data security specifically refers to the Durability attribute in the atomicity, consistency, isolation, and durability (ACID) attributes of a database. This means that after a transaction commit, the Result requires permanent retention. The Result is not lost even if the system crashes.
For a DuckDB-based analytical read-only instance, the persistence requirements can be appropriately lowered because it acts as a data replica of the primary instance. Even if some data lacks persistence when the instance crashes, you can recover the data. Simply pull the Binlog from the primary instance and replay it after a restart. Therefore, the read-only instance uses an idempotent Binlog replay mechanism. This ensures data consistency between the primary database and DuckDB-based read-only nodes.
However, persistence is a rigid requirement for the DuckDB-based analytical primary instance. In this state, data is stored only in a columnar format. You cannot rely on the primary node to rebuild data. Therefore, the DuckDB engine must guarantee data persistence in any failure scenario. The data security of the DuckDB primary instance is guaranteed mainly in the following ways:
1. Backup and restore capabilities: The backup and restore capabilities of the DuckDB-based analytical primary instance are identical to those of a standard MySQL instance. The primary instance supports regular full backups and real-time incremental backups based on Binlog. It also provides a point-in-time recovery (PITR) feature accurate to the second. You can use this feature to recover the instance to any historical Time point.
2. Kernel persistence guarantee: The DuckDB engine ensures the persistence of transactional data through the write-ahead logging (WAL) mechanism. We also modified the transaction commit path of the DuckDB engine at the MySQL layer. This ensures the atomicity of Binlog writes and DuckDB local transaction commits. During the instance crash recovery phase, the system can correctly replay Submitted transactions. This ensures the consistency among DuckDB engine data, binlog content, and binlog GTID offsets. This provides data security guarantees that align with MySQL standards.
In summary, the DuckDB-based analytical primary instance is fully comparable to MySQL high availability instances in terms of high availability and data security. It effectively guarantees the continuous availability of customer core systems and the security of key data assets.
Unlike a read-only instance that only supports primary database Binlog log replay, the DuckDB-based analytical primary instance supports multiple data Ingestion methods. These methods include Binlog log replay, direct SQL writing, and Data Transmission Service (DTS) full and incremental data synchronization. The Binlog replay method is identical to that of a read-only instance. The log replay performance can reach about 300,000 rows per second. The DuckDB-based analytical primary instance also supports more flexible replication topologies, such as multi-source replication and cascading replication.
In scenarios involving direct business writes or data synchronization through DTS, the DuckDB-based analytical primary instance maintains high ingestion performance. This relies mainly on the following kernel optimization capabilities:

We conducted benchmark testing on the write performance of the DuckDB-based analytical primary instance as a synchronization Target based on the Sysbench dataset.
• When full data is imported using LOAD DATA, the peak throughput can reach 2,000,000 rows per second.
• The peak throughput of DTS incremental synchronization can reach 300,000 rows per second.
Note that the actual synchronization speed depends on multiple factors. These include the query efficiency of the source instance, table schema complexity, and Data Transmission Service (DTS) instance type. The preceding values are the maximum end-to-end replication performance in an ideal staging environment.
RDS DuckDB-based analytical instances are deeply integrated with Alibaba Cloud DTS. When you create a DTS job, simply set the target database type to DuckDB. This automatically delivers high ingestion performance. No complex parameter settings or performance tuning are required.
We continuously optimize compatibility. Recently, we made the following enhancements. To improve SQL syntax compatibility, we reverse-parse the abstract syntax tree obtained in the MySQL Prepare phase. This achieves SQL regularization. It resolves over 90% of incompatible SQL syntax and significantly improves compatibility. For functions, we use rapid iterations to support nearly 20 frequently used Json Functions and regular functions. These cover almost all common functions for online users. We are also gradually adding support for less frequently used functions. Ultimately, our functions will be fully compatible with MySQL.
Recently, we added support for character sets such as ascii and latin1 to the DuckDB engine. We also added support for generated columns. We improved the execution efficiency of COPY DDL operations. For more information about these kernel updates, see our release notes.
The customer needs to perform global queries on recent business orders. This requires Real-time Sync of data from 1024 sharded tables across 32 source MySQL instances to a single convergence instance for unified queries. The convergence instance must retain data for one month. The total data volume is about 10 TB. The peak synchronization Transaction Per Second (TPS) reaches 300,000. All traffic consists of order-related transactional DML operations. These include high-frequency INSERT, UPDATE, and DELETE operations.
Data convergence scenarios usually require joint analysis of multiple sharded tables. This requires data synchronization to converge data from multiple instances to a single target instance. These jobs are typically handled by AnalyticDB or distributed databases. However, we used a single DuckDB primary instance to successfully meet the customer requirements. This mainly depends on the following capabilities:
Multi-source and multi-table convergence capability: We extended the MySQL binary logging (binlog) multi-source replication feature. We support configuring table-level routing mapping rules. This lets you uniformly write the binlog streams of multiple source sharded tables to a single target table in DuckDB. To do this, adapt the table schema. For example, the source MySQL table uses an auto increment ID as the primary key. In DuckDB, use the business order number as the primary key. This avoids auto increment ID conflicts across different source instances.
Cross-transaction DML batch optimization: DuckDB merges the DML operations of multiple transactions in the binlog to achieve batch writing. This converts multiple small operations into a single batch commit to the engine. The system supports configuring the batching time window, transaction batch size, and merge policy during low traffic using parameters. This ensures low latency and significantly improves synchronization throughput.

The DuckDB-based analytical primary instance convergence solution successfully met the multi-source synchronization requirements of the customer. It saved the cost of deploying 32 independent synchronization links. It also smoothly supported the service traffic peaks during the New Year holiday. The DuckDB-based analytical primary instance efficiently completes all analysis queries. The duration of most SQL queries is less than 1 second. For queries involving complex operations such as multi-table JOINs and Json Functions, the performance improvement is significant compared to equivalent queries executed on the original MySQL sharded tables. In addition, columnar storage significantly reduces the storage space occupied by DuckDB. Actual tests show that under the same data volume, the disk usage of DuckDB is about 1/4 of that of InnoDB.
The DuckDB-based analytical primary instance format greatly expands the scenarios. In theory, DuckDB-based analytical primary instances can support all environments that use MySQL. Furthermore, DuckDB is not just a storage engine within MySQL. It also provides an open window for MySQL. It serves as a bridge that connects MySQL with the open data ecosystem.
Data lake integration is our key future direction. By integrating plugins such as Iceberg natively supported by DuckDB, MySQL seamlessly connects with the data lake. Applications still write data using standard MySQL APIs. The underlying data is automatically saved to Iceberg tables on S3. For the business side, the write method remains completely unchanged. For the Data Platform, the data naturally exists in the data lake in the standard Iceberg format. Businesses can also directly query tables in data lakes such as Iceberg using the SQL syntax of MySQL. This achieves a unified experience of one API and dual-mode access.
Deep Dive into Source Code: How Hermes Agent Achieves Self-Improving
ApsaraDB - May 13, 2026
ApsaraDB - January 15, 2026
ApsaraDB - May 13, 2026
ApsaraDB - March 20, 2026
ApsaraDB - November 26, 2025
ApsaraDB - September 10, 2025
Alibaba Cloud Model Studio
A one-stop generative AI platform to build intelligent applications that understand your business, based on Qwen model series such as Qwen-Max and other popular models
Learn More
PolarDB for MySQL
Alibaba Cloud PolarDB for MySQL is a cloud-native relational database service 100% compatible with MySQL.
Learn More
Qwen
Full-range, open-source, multimodal, and multi-functional
Learn More
AnalyticDB for MySQL
AnalyticDB for MySQL is a real-time data warehousing service that can process petabytes of data with high concurrency and low latency.
Learn MoreMore Posts by ApsaraDB