
PolarDB's New Open Source Project: DuckDB-paimon, Helping Enterprises Build an AI-Oriented Multi-Model Data Foundation

This article introduces DuckDB-paimon, an open-source PolarDB extension that lets DuckDB query Apache Paimon data lakes natively without external clusters.

Enable DuckDB to Natively Read Apache Paimon Data Lakes

Project URL: https://github.com/polardb/duckdb-paimon
If this project is helpful to you, you are welcome to support it with a ⭐ Star! If you have any questions or suggestions, you are welcome to submit an Issue or PR.
DingTalk Group: 164165020808

DuckDB-paimon is a DuckDB extension developed by the PolarDB team. It allows DuckDB to directly read and query data lake tables in the Apache Paimon format without any extract, transform, and load (ETL) transfers or Flink/Spark clusters. You can run SQL analysis on Paimon tables simply by opening the DuckDB shell.

It plays the same role as DuckDB's official duckdb-iceberg plugin—extending DuckDB's powerful local analysis capabilities to the Paimon data lake ecosystem.

Why Do You Need DuckDB-paimon?

Apache Paimon is a new-generation real-time lakehouse format. It combines a lake table format with Log-Structured Merge-tree (LSM) structures, natively supports unified stream and batch ingestion, and is widely used in real-time data warehousing scenarios.

However, the Paimon ecosystem lacks a sufficiently lightweight and low-latency query entry point:

Querying with Flink or Spark: You need to deploy a cluster in advance, which is resource-intensive, and query latency is high, making this path better suited to batch queries.

Querying with Trino or Presto: Latency is lower than with Flink or Spark, and these engines handle queries in the hundreds of milliseconds well, but the complexity of cluster deployment is still unavoidable.

DuckDB is an embedded Online Analytical Processing (OLAP) database that requires no server, starts in seconds, supports single-file deployment, and is extremely efficient for data analysis in local Notebook environments. For example, PolarDB for PostgreSQL deeply integrates the PostgreSQL-based row store engine and the DuckDB-based column store engine into the PolarDB unified cloud-native database architecture (that is, In-Memory Column Index (IMCI)), achieving a 100× improvement in complex query analysis performance.

DuckDB plays a key role in today's AI and data lake analysis, but it previously lacked a Paimon data source. DuckDB-paimon fills that gap and connects the two, allowing data engineers and analysts to query Paimon data lakes directly in the most lightweight way.

Core Capabilities

Directly Read Paimon Tables

You can directly query Paimon table data through SQL without any intermediate transformation:

-- Read via full path
SELECT * FROM paimon_scan('./data/testdb.db/testtbl');

-- Read via the warehouse / database / table three-part format
SELECT * FROM paimon_scan('/warehouse', 'mydb', 'orders');

Remote OSS Storage Support

You can safely configure Alibaba Cloud OSS access credentials through DuckDB's Secret mechanism to directly query Paimon data lakes stored on OSS:

-- Configure OSS access credentials
CREATE SECRET paimon_oss (
    TYPE paimon,
    key_id 'your-access-key-id',
    secret 'your-access-key-secret',
    endpoint 'oss-cn-hangzhou.aliyuncs.com'
);

-- Directly query Paimon tables on OSS
SELECT COUNT(*), region
FROM paimon_scan('oss://my-bucket/warehouse', 'sales_db', 'orders')
GROUP BY region
ORDER BY COUNT(*) DESC;

Projection Pushdown (Column Pruning)

When querying, only the columns actually used in the SQL are read, significantly reducing I/O overhead. This is particularly effective in wide table scenarios:

-- Only the data files for the order_id and amount columns are read; other columns are skipped
SELECT order_id, amount FROM paimon_scan('oss://...', 'db', 'orders');
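To check which columns a query actually touches, you can inspect the plan with DuckDB's standard EXPLAIN statement (a sketch; the exact plan output depends on the DuckDB version and the extension's operator naming):

```sql
-- The scan operator in the plan lists the projected columns,
-- confirming that only order_id and amount are read
EXPLAIN SELECT order_id, amount
FROM paimon_scan('./data/testdb.db/testtbl');
```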

Mount as a Catalog (ATTACH)

Mount a Paimon warehouse into DuckDB as a complete catalog, and browse its schemas and table schemas just like a local database:

ATTACH 'oss://my-bucket/warehouse' AS paimon_lake (TYPE paimon);

SHOW ALL TABLES IN paimon_lake;
DESCRIBE paimon_lake.sales_db.orders;

Common Scenarios

Lightweight Ad Hoc Queries for the Real-Time Lakehouse

Data is written to Paimon in real time by Flink. Analysts use DuckDB + DuckDB-paimon to run ad hoc queries directly against OSS without starting any compute cluster, reducing query latency from minutes to seconds.
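For example, an analyst might check data freshness right after a Flink job writes. This is a sketch with hypothetical bucket, table, and column names:

```sql
-- How many orders arrived in the last hour?
SELECT COUNT(*) AS recent_orders
FROM paimon_scan('oss://my-bucket/warehouse', 'sales_db', 'orders')
WHERE order_time >= NOW() - INTERVAL 1 HOUR;
```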

Data Validation and Quality Checks

In CI/CD pipelines, use DuckDB to run data quality assertions on Paimon tables and verify that the output of Flink jobs meets expectations. It is lightweight, fast, and dependency-free.
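A quality check can be a plain SQL assertion whose result gates the pipeline. This sketch uses hypothetical table and column names; a CI script would fail the job when the count is non-zero:

```sql
-- Fail the pipeline if any order has a NULL or negative amount
SELECT COUNT(*) AS bad_rows
FROM paimon_scan('oss://my-bucket/warehouse', 'sales_db', 'orders')
WHERE amount IS NULL OR amount < 0;
```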

Data Exploration and Debugging

When developing Flink jobs, data engineers can open the DuckDB shell at any time to view the current state of Paimon tables and quickly locate data issues, far faster than starting the Flink SQL Client.

Cross-format data federated query

DuckDB natively supports querying multiple formats such as Parquet, CSV, JSON, and Iceberg. By combining with Duckdb-paimon, you can perform federated JOINs between Paimon tables and other data sources without data movement:

-- Paimon order table JOIN local CSV dimension table
SELECT o.order_id, o.amount, c.customer_name
FROM paimon_scan('oss://...', 'db', 'orders') o
JOIN read_csv('customers.csv') c ON o.customer_id = c.id;

Technical Highlights

Native C++ implementation based on paimon-cpp

DuckDB-paimon is built on paimon-cpp, Alibaba's open-source native C++ read/write library for Paimon. Compared to JVM-based solutions:

Zero JVM dependency: No Java runtime needs to be installed, making deployment extremely simple.

Low memory footprint: The native C++ implementation is far more memory-efficient than a JVM.

Fast startup: There is no JVM warm-up overhead; the DuckDB process is ready in seconds.

Arrow format data exchange

Data is exchanged between paimon-cpp and DuckDB via the Apache Arrow format. This is the industry standard for passing columnar data in memory, avoiding serialization/deserialization overheads and achieving efficient data flow.

Multi-threaded parallel scan

The scan plan divides the Paimon table into multiple independent splits, and DuckDB's multi-threaded execution engine reads the splits in parallel, fully utilizing multi-core CPU resources.
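The degree of parallelism follows DuckDB's standard thread setting, so you can tune it with the built-in `threads` option (standard DuckDB configuration, not specific to this extension):

```sql
-- Use 8 threads so splits are read in parallel
SET threads = 8;
SELECT COUNT(*) FROM paimon_scan('./data/testdb.db/testtbl');
```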

DuckDB Secret Security Credential Management

OSS access credentials are managed by DuckDB's native Secret Manager, which supports scoping credentials to specific paths; keys are automatically masked and never appear in query logs.
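Secrets can also be restricted to a bucket prefix. This is a sketch assuming DuckDB's standard SCOPE clause is honored for the paimon secret type, with placeholder credentials:

```sql
CREATE SECRET paimon_oss_scoped (
    TYPE paimon,
    key_id 'your-access-key-id',
    secret 'your-access-key-secret',
    endpoint 'oss-cn-hangzhou.aliyuncs.com',
    SCOPE 'oss://my-bucket/warehouse'  -- credentials apply only under this prefix
);
```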

Roadmap

Capability                                      | Status
Read Paimon tables (local / OSS)                | Completed
Projection pushdown (column pruning)            | Completed
Parquet + ORC format support                    | Completed
Catalog ATTACH                                  | Completed
Filter pushdown (predicate pushdown)            | Planned
Snapshot query (time travel)                    | Planned
Write to Paimon tables                          | Planned
Metadata management (DDL)                       | Planned
Remote catalog services (e.g., REST Catalog, DLF) | Planned
More file systems (e.g., Hadoop Distributed File System (HDFS)) | Planned

Quick Start

Build

git clone --recurse-submodules https://github.com/polardb/duckdb-paimon.git
cd duckdb-paimon
GEN=ninja make

--recurse-submodules pulls both the DuckDB and paimon-cpp submodules and is required for the build.
Ensure ninja is installed, or simply run make.

Run

./build/release/duckdb

Query local Paimon tables

SELECT * FROM paimon_scan('./data/testdb.db/testtbl');

Query Paimon tables on OSS

CREATE SECRET my_oss (
    TYPE paimon,
    key_id 'your-ak',
    secret 'your-sk',
    endpoint 'oss-cn-hangzhou.aliyuncs.com'
);

SELECT * FROM paimon_scan('oss://your-bucket/warehouse', 'your_db', 'your_table');

About PolarDB for PostgreSQL

PolarDB for PostgreSQL is a cloud-native database developed by Alibaba Cloud. Based on a storage-compute decoupled architecture, it leverages the integration of software and hardware to provide database services with second-level elasticity, high performance, massive storage, security, and reliability. PolarDB for PostgreSQL is 100% compatible with multiple native PostgreSQL versions, including PostgreSQL 11/14/15/16/17/18.

PolarDB for PostgreSQL delivers an HTAP solution—the In-Memory Column Index (IMCI)—powered by DuckDB. Users can seamlessly enable this feature via an extension, which automatically synchronizes row-based tables into a columnar format in real-time. This significantly accelerates analytical queries without requiring any modifications to business DML, as the system intelligently routes queries to either the row or column index based on cost.

Furthermore, deep integration with DuckDB enriches PolarDB's plugin ecosystem. For instance, leveraging capabilities like DuckDB-paimon, it supports a wider range of data sources—including open formats like Parquet and CSV, as well as data lake systems such as Iceberg, Paimon, and LanceDB—thereby greatly extending PolarDB's data processing boundaries.

Moreover, to become fully "AI-Ready," PolarDB is accelerating the realization of its "Lakehouse" vision. The upcoming "PolarDB Lakebase"—a cornerstone feature of this vision—is purpose-built for the massive volumes of semi-structured and unstructured data in the AI era. It offers end-to-end support ranging from storage, processing, and management to multi-modal retrieval and application development. This empowers PolarDB to seamlessly handle both database transactions and data lake analytics on a unified platform, with the ultimate goal of becoming a one-stop solution for enterprise multi-modal data fusion in the age of AI.
