
DuckDB on ApsaraDB RDS for Lightning-Fast Analytics

This article describes how Alibaba Cloud integrated DuckDB into MySQL as a new storage engine to dramatically improve MySQL's analytical query performance.

By Zongzhi Chen, Manager of Alibaba Cloud RDS Team

MySQL's Pluggable Storage Engine Architecture

One of MySQL's core innovations is its Pluggable Storage Engine Architecture, which enables MySQL to extend its capabilities through various storage engines, supporting diverse business scenarios. MySQL's pluggable architecture is illustrated below:

[Figure: MySQL's pluggable storage engine architecture]

MySQL's pluggable storage engine architecture can be divided into four main parts:

  • Runtime Layer: Responsible for MySQL runtime-related tasks such as communication, access control, system configuration, monitoring, etc.
  • Binlog Layer: Responsible for Binlog generation, replication, and application.
  • SQL Layer: Responsible for SQL parsing, optimization, and SQL execution.
  • Storage Engine Layer: Responsible for data storage and access.

MySQL defines a standardized data access interface (the Pluggable Engine Interface) between SQL computation and data storage. The SQL layer performs data updates, queries, and management through this standard interface, enabling storage engines to be "hot-swapped" as independent components.
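This separation can be illustrated with a minimal sketch (in Python, purely for illustration; MySQL's real interface is the C++ `handler` class, with methods such as `write_row` and `rnd_next`): the SQL layer talks only to an abstract interface, so the engine underneath can be swapped without changing anything above it.

```python
from abc import ABC, abstractmethod

class StorageEngine(ABC):
    """Stand-in for MySQL's pluggable engine interface."""
    @abstractmethod
    def write_row(self, table, row): ...
    @abstractmethod
    def read_rows(self, table): ...

class RowEngine(StorageEngine):
    """Row-oriented engine (InnoDB-like in spirit)."""
    def __init__(self):
        self._tables = {}
    def write_row(self, table, row):
        self._tables.setdefault(table, []).append(row)
    def read_rows(self, table):
        return list(self._tables.get(table, []))

class ColumnEngine(StorageEngine):
    """Column-oriented engine (DuckDB-like in spirit) -- different layout
    inside, but the SQL layer cannot tell the difference."""
    def __init__(self):
        self._columns = {}
    def write_row(self, table, row):
        cols = self._columns.setdefault(table, [[] for _ in row])
        for col, value in zip(cols, row):
            col.append(value)
    def read_rows(self, table):
        return [tuple(vals) for vals in zip(*self._columns.get(table, []))]

def sql_layer_insert_and_scan(engine: StorageEngine):
    # The SQL layer only ever calls the standard interface.
    engine.write_row("t", (1, "a"))
    engine.write_row("t", (2, "b"))
    return engine.read_rows("t")

for engine in (RowEngine(), ColumnEngine()):
    print(sql_layer_insert_and_scan(engine))  # same result from both engines
```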

Currently commonly used storage engines in MySQL include:

  • MyISAM: MySQL's earliest engine. Due to its lack of transaction support, it has been replaced by InnoDB, though it remained the storage engine for system tables until MySQL-5.7.
  • InnoDB: MySQL's default engine. Due to its transaction support and excellent performance, it gradually replaced MyISAM as MySQL's most widely used engine.
  • CSV: Stores tables as CSV files; used as the storage engine for MySQL's slow log and general log.
  • Memory: In-memory table storage engine, also used as storage engine for internal temporary tables during SQL execution.
  • TempTable: Engine introduced in MySQL-8.0 for storing internal temporary tables.

The introduction of InnoDB was a very important milestone in MySQL's pluggable engine architecture. In the early days of the internet, MyISAM gained favor in internet businesses thanks to its simple, efficient access, and MySQL, together with Linux, Apache, and PHP, formed the well-known LAMP stack. With the rise of e-commerce and social networking, MyISAM's shortcomings became increasingly apparent. InnoDB greatly expanded MySQL's capabilities with its support for ACID transactions and its advantages in concurrent access and performance. With InnoDB, MySQL became the most popular open-source OLTP database.

With the widespread use of MySQL, we see more and more analytical queries run against TP data. InnoDB's architecture is naturally designed for OLTP: while it delivers excellent performance in TP scenarios, its query efficiency in analytical scenarios is very low, which greatly limits MySQL's range of use. To this day, MySQL has lacked an analytical query engine. The emergence of DuckDB shows us a possibility.

Introduction to DuckDB

DuckDB is an open-source database designed for online analytical processing (OLAP) and data analytics workloads. Due to its lightweight, high-performance, zero-configuration, and easy-to-integrate characteristics, it is rapidly becoming a popular choice in data science, BI tools, and embedded analytics scenarios. DuckDB mainly has the following features:

  • Exceptional query performance: Single-node DuckDB performance is not only far higher than InnoDB, but even better than ClickHouse and SelectDB.
  • Excellent compression ratio: DuckDB uses columnar storage, automatically selects appropriate compression algorithms based on type, and has a very high compression rate.
  • Embedded design: DuckDB is an embedded database system, naturally suitable for integration into MySQL.
  • Pluggable design: DuckDB adopts a pluggable design, making third-party development and functional extensions very convenient.
  • Friendly License: DuckDB's license allows any form of use of DuckDB's source code, including commercial activities.

Based on the above reasons, we believe DuckDB is very suitable to become MySQL's AP storage engine. Therefore, we have integrated DuckDB into AliSQL.

[Figure: DuckDB engine integrated into AliSQL]

The DuckDB engine is positioned to provide lightweight single-node analytical capabilities. The RDS MySQL DuckDB read-only instance based on the DuckDB engine has now been launched; you are welcome to try it. In the future, we will also launch highly available RDS MySQL DuckDB primary instances with a primary-standby setup. Users can aggregate heterogeneous data into RDS MySQL DuckDB instances through tools such as DTS to run analytical queries.

Architecture of RDS MySQL DuckDB Read-Only Instance

[Figure: Architecture of the RDS MySQL DuckDB read-only instance]

The DuckDB analytical read-only instance adopts a read-write separation architecture. Analytical services and primary instance services are separated without affecting each other. Like ordinary read-only instances, it replicates data from the primary instance through the Binlog replication mechanism. The DuckDB analytical read-only node has the following advantages:

  • High-performance analytical queries: Based on DuckDB's query capability, analytical query performance is up to 200 times higher than InnoDB (see performance section).
  • Low storage costs: Based on DuckDB's high compression rate, the storage space of DuckDB read-only instances is usually only 20% of the primary instance's storage space.
  • 100% compatible with MySQL syntax, eliminating learning costs. Since DuckDB is integrated into MySQL as an engine, users still use MySQL syntax for queries without any learning curve.
  • No additional management costs: The DuckDB read-only instance is still an RDS MySQL instance. Compared with ordinary read-only instances, only some MySQL parameters are added. Therefore, DuckDB is managed, operated, and monitored like ordinary RDS MySQL instances. Monitoring information, slow logs, audit logs, RDS APIs, etc. are identical.
  • One-click creation of DuckDB read-only instances, automatic conversion of data from InnoDB to DuckDB, no additional operations required.
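The storage saving comes from columnar layout: all values of one column sit together, so type-aware codecs such as run-length or dictionary encoding apply cleanly. A toy run-length encoder (illustrative only; DuckDB's real codecs are far more sophisticated) shows why a sorted, low-cardinality column shrinks dramatically:

```python
def rle_encode(values):
    """Run-length encode a sequence into (value, count) pairs."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)
        else:
            out.append((v, 1))
    return out

# A sorted, low-cardinality column of 30,000 entries...
column = ["clothing"] * 10000 + ["electronics"] * 10000 + ["grocery"] * 10000
encoded = rle_encode(column)
print(encoded)  # -> [('clothing', 10000), ('electronics', 10000), ('grocery', 10000)]
# ...collapses to 3 pairs. A row store interleaves these values with the
# other columns of each row, so no such run exists to exploit.
```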

Implementation of DuckDB Engine

[Figure: Overview of the DuckDB engine implementation]

The DuckDB read-only instance can be divided into a query path and a Binlog replication path in usage. The query path accepts user query requests and executes data queries. The Binlog replication path connects to the primary instance for Binlog replication. The technical principles will be introduced separately from these two aspects below.

Query Path

[Figure: Query execution process]

The query execution process is as shown above. InnoDB is only used to store metadata and system information, such as accounts, configurations, etc. All user data is stored in the DuckDB engine.

Users connect to the instance through the MySQL client. After the query arrives, MySQL first performs parsing and necessary processing. Then it sends the SQL to the DuckDB engine for execution. After DuckDB completes execution, it returns the results to the Server layer, which converts the result set into a MySQL result set and returns it to the client.

The most important work on the query path is compatibility. DuckDB's and MySQL's data types are largely compatible, but there are significant differences in syntax and function support. To address this, we extended DuckDB's syntax parser to accept MySQL-specific syntax, rewrote a large number of DuckDB functions, and added many MySQL functions so that common MySQL functions run correctly. Our automated compatibility testing platform, which runs approximately 170,000 SQL test cases, shows a compatibility rate of 99%. For detailed compatibility information, see the linked documentation.
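As an illustration of the kind of function rewriting involved (a hypothetical, simplified shim, not the actual implementation), MySQL's `DATE_FORMAT` specifiers differ from the strftime-style specifiers DuckDB's date functions understand, so format strings must be translated:

```python
# Hypothetical sketch: translate a few MySQL DATE_FORMAT specifiers into
# strftime-style ones. The real engine rewrites functions inside DuckDB itself.
MYSQL_TO_STRFTIME = {
    "%Y": "%Y",  # 4-digit year (same in both)
    "%m": "%m",  # 2-digit month (same)
    "%d": "%d",  # 2-digit day (same)
    "%i": "%M",  # MySQL minutes -> strftime minutes
    "%s": "%S",  # MySQL seconds -> strftime seconds
}

def translate_date_format(fmt: str) -> str:
    out, i = [], 0
    while i < len(fmt):
        if fmt[i] == "%" and i + 1 < len(fmt):
            token = fmt[i:i + 2]
            out.append(MYSQL_TO_STRFTIME.get(token, token))  # pass through unknowns
            i += 2
        else:
            out.append(fmt[i])
            i += 1
    return "".join(out)

print(translate_date_format("%Y-%m-%d %H:%i:%s"))  # -> "%Y-%m-%d %H:%M:%S"
```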

Binlog Replication Path

[Figure: Binlog replication path]

Idempotent Replay

Since DuckDB does not support two-phase commit, it cannot rely on two-phase commit to keep the Binlog GTID consistent with the data, nor to keep InnoDB metadata consistent with DuckDB during DDL operations. We therefore modified the transaction commit process and the Binlog replay process to guarantee data consistency after an abnormal crash and restart of the instance.
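The idea can be sketched as follows (a simplified model, not the actual AliSQL code): the last applied GTID is persisted atomically with the data in the same engine transaction, so after a crash, replay can detect and skip events that were already applied instead of duplicating them.

```python
class Engine:
    """Toy store where data and the last applied GTID commit together,
    so the two can never diverge after a crash."""
    def __init__(self):
        self.rows = []
        self.applied_gtid = 0

    def commit(self, gtid, new_rows):
        # Atomic in this sketch: both fields change in one step.
        self.rows += new_rows
        self.applied_gtid = gtid

def replay(engine, binlog):
    for gtid, rows in binlog:
        if gtid <= engine.applied_gtid:
            continue  # already applied before the crash -- skip, don't duplicate
        engine.commit(gtid, rows)

binlog = [(1, ["a"]), (2, ["b"]), (3, ["c"])]
engine = Engine()
replay(engine, binlog)  # first pass applies everything
replay(engine, binlog)  # replay after a "crash" is idempotent
print(engine.rows)      # -> ['a', 'b', 'c'], no duplicates
```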

DML Replay Optimization

Because of its implementation, DuckDB favors large transactions; frequent small transactions execute very inefficiently, causing severe replication delay. We therefore optimized Binlog replay to apply transactions in batches. After optimization, it achieves a replay capability of 30 rows/s. In Sysbench stress tests, it showed no replication delay, with even higher replay performance than InnoDB.
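The batching can be sketched like this (illustrative only): instead of committing each small replicated transaction individually, the applier groups many of them into one engine commit, amortizing the per-commit cost.

```python
def replay_batched(events, commit, batch_size=1000):
    """Group many small binlog transactions into one engine commit.
    `commit` is invoked once per batch instead of once per event."""
    batch, commits = [], 0
    for event in events:
        batch.append(event)
        if len(batch) >= batch_size:
            commit(batch)
            commits += 1
            batch = []
    if batch:  # flush the final partial batch
        commit(batch)
        commits += 1
    return commits

applied = []
n = replay_batched(range(2500), applied.extend, batch_size=1000)
print(n)  # -> 3 engine commits instead of 2500
```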


Parallel Copy DDL

A small number of MySQL DDL operations, such as changing column order, are not supported by DuckDB. To keep replication working, we implemented a Copy DDL mechanism: DDLs natively supported by DuckDB are executed via the Inplace/Instant paths, while unsupported DDLs are handled by creating a new table through Copy DDL to replace the original table.


Copy DDL uses multi-threaded parallel execution, making execution roughly 7 times faster.
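A minimal sketch of the mechanism (illustrative, not the actual implementation): the new table is built by copying the old table's rows under the new schema in parallel chunks, then swapped in place of the original.

```python
from concurrent.futures import ThreadPoolExecutor

def copy_ddl(tables, name, transform, workers=4, chunk=1000):
    """Rebuild `tables[name]` under a new schema via `transform`, copying
    rows in parallel chunks, then swap the new table in."""
    old = tables[name]
    chunks = [old[i:i + chunk] for i in range(0, len(old), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        copied = pool.map(lambda rows: [transform(r) for r in rows], chunks)
    new_table = [row for part in copied for row in part]  # map preserves order
    tables[name] = new_table  # swap replaces the original table

tables = {"t": [(i, str(i)) for i in range(5000)]}
# A DDL DuckDB cannot do in place, e.g. reordering the two columns:
copy_ddl(tables, "t", lambda row: (row[1], row[0]))
print(tables["t"][0])  # -> ('0', 0)
```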


Performance of DuckDB Read-Only Instance

Test Environment

ECS instance: 32 CPU cores, 128 GB memory, 500 GB ESSD PL1 cloud disk

Test Type

TPC-H SF100

[Figure: TPC-H SF100 benchmark results]

You can purchase an RDS MySQL DuckDB instance online and try it directly:

https://www.alibabacloud.com/help/en/rds/apsaradb-rds-for-mysql/duckdb-based-analytical-instances-are-released-for-apsaradb-rds-for-mysql
