Project URL: https://github.com/polardb/duckdb-paimon
If this project is helpful to you, please support it with a ⭐ Star! Questions and suggestions are welcome via Issues and PRs.
DingTalk Group: 164165020808
Duckdb-paimon is a DuckDB extension developed by the PolarDB team. It lets DuckDB directly read and query data lake tables in the Apache Paimon format, with no extract-transform-load (ETL) pipeline and no Flink or Spark cluster: you can run SQL analysis on Paimon tables simply by opening the DuckDB Shell.
It plays the same role as DuckDB's official duckdb-iceberg plugin—extending DuckDB's powerful local analysis capabilities to the Paimon data lake ecosystem.
Apache Paimon is a new-generation real-time lakehouse format. It integrates Lake Format with Log-Structured Merge-tree (LSM) structures, natively supports unified stream and batch ingestion, and is widely used in real-time data warehousing scenarios.
However, the Paimon ecosystem lacks a sufficiently lightweight and low-latency query entry point:
Querying with Flink or Spark: You must deploy a cluster in advance, which is resource-intensive, and query latency is high, so this path suits batch workloads.
Querying with Trino or Presto: Latency is lower, and queries in the hundreds of milliseconds are feasible, but the complexity of deploying and operating a cluster remains.
DuckDB is an embedded Online Analytical Processing (OLAP) database that requires no server, starts in seconds, supports single-file deployment, and is extremely efficient for data analysis in local Notebook environments. For example, PolarDB for PostgreSQL deeply integrates the PostgreSQL-based row store engine and the DuckDB-based column store engine into the PolarDB unified cloud-native database architecture (that is, In-Memory Column Index (IMCI)), achieving a 100× improvement in complex query analysis performance.
DuckDB plays a key role in today's AI and data lake analysis, and the newly released Duckdb-paimon plugin fills the Paimon gap among its data sources. Duckdb-paimon connects the two, allowing data engineers and analysts to query Paimon data lakes directly in the most lightweight way.
You can directly query Paimon table data through SQL without any intermediate transformation:
-- Read via full path
SELECT * FROM paimon_scan('./data/testdb.db/testtbl');
-- Read via the warehouse / database / table three-part format
SELECT * FROM paimon_scan('/warehouse', 'mydb', 'orders');
You can safely configure Alibaba Cloud OSS access credentials through DuckDB's Secret mechanism to directly query Paimon data lakes stored on OSS:
-- Configure OSS access credentials
CREATE SECRET paimon_oss (
TYPE paimon,
key_id 'your-access-key-id',
secret 'your-access-key-secret',
endpoint 'oss-cn-hangzhou.aliyuncs.com'
);
-- Directly query Paimon tables on OSS
SELECT COUNT(*), region
FROM paimon_scan('oss://my-bucket/warehouse', 'sales_db', 'orders')
GROUP BY region
ORDER BY COUNT(*) DESC;
When querying, only the columns actually used in the SQL are read, significantly reducing I/O overhead. This is particularly effective in wide table scenarios:
-- Only the data files for the order_id and amount columns will be read; the other columns will not be read
SELECT order_id, amount FROM paimon_scan('oss://...', 'db', 'orders');
Mount the Paimon warehouse as a complete Catalog to DuckDB, and browse Paimon's schemas and table schemas just like operating a local database:
ATTACH 'oss://my-bucket/warehouse' AS paimon_lake (TYPE paimon);
SHOW ALL TABLES IN paimon_lake;
DESCRIBE paimon_lake.sales_db.orders;
Data is written to Paimon in real-time by Flink. Analysts use DuckDB + Duckdb-paimon to perform ad hoc queries directly on OSS without starting any compute clusters. Query latency is reduced from minutes to seconds.
In CI/CD pipelines, use DuckDB to perform Data Quality assertions on Paimon tables to verify whether the output results of Flink jobs meet expectations. It is lightweight, fast, and dependency-free.
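As an illustrative sketch (the bucket, database, and column names here are hypothetical), such an assertion can be a single DuckDB statement that fails the pipeline step when a check does not hold, for example via DuckDB's `error()` function:

```sql
-- Hypothetical data-quality gate: fail the CI step if the Flink job
-- produced rows with NULL order IDs or non-positive amounts.
SELECT CASE
         WHEN bad_rows = 0 THEN 'quality check passed'
         ELSE error('quality check failed: ' || bad_rows::VARCHAR || ' bad rows')
       END AS result
FROM (
  SELECT count(*) AS bad_rows
  FROM paimon_scan('oss://my-bucket/warehouse', 'sales_db', 'orders')
  WHERE order_id IS NULL OR amount <= 0
);
```

Because `error()` aborts the query with a non-zero exit status in the DuckDB CLI, the surrounding CI job fails automatically when the assertion is violated.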
When developing Flink jobs, data engineers can use the DuckDB Shell at any time to view the current state of Paimon tables and quickly locate data issues, far more efficiently than launching the Flink SQL Client.
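For instance (the warehouse path, database, and column names below are placeholders), a developer can sanity-check a job's output in one statement:

```sql
-- Hypothetical spot check: row count and latest event time of the table
-- a Flink job is currently writing to.
SELECT count(*) AS row_count, max(event_time) AS latest_event
FROM paimon_scan('./warehouse', 'dev_db', 'orders_sink');
```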
DuckDB natively supports querying multiple formats such as Parquet, CSV, JSON, and Iceberg. By combining with Duckdb-paimon, you can perform federated JOINs between Paimon tables and other data sources without data movement:
-- Paimon order table JOIN local CSV dimension table
SELECT o.order_id, o.amount, c.customer_name
FROM paimon_scan('oss://...', 'db', 'orders') o
JOIN read_csv('customers.csv') c ON o.customer_id = c.id;
Duckdb-paimon is built upon paimon-cpp, Alibaba's open-source Paimon C++ native read/write library. Compared to JVM solutions:
Zero JVM dependency: No Java runtime needs to be installed, making deployment extremely simple.
Low memory footprint: The native C++ implementation is far more memory-efficient than a JVM.
Fast startup: With no JVM warm-up overhead, the DuckDB process is ready in seconds.
Data is exchanged between paimon-cpp and DuckDB via the Apache Arrow format. This is the industry standard for passing columnar data in memory, avoiding serialization/deserialization overheads and achieving efficient data flow.
The Scan Plan splits the Paimon table into multiple independent splits. DuckDB's multi-threaded execution engine can read each split in parallel, fully utilizing multi-core CPU resources.
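The degree of parallelism is governed by DuckDB's standard `threads` setting, which by default matches the number of CPU cores; a minimal sketch:

```sql
-- Standard DuckDB setting: cap the number of worker threads
-- available to read splits in parallel.
SET threads = 8;
SELECT current_setting('threads');
```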
OSS access credentials are managed by DuckDB's native Secret Manager, which supports scope fencing, and secret values are automatically masked so they never appear in query logs.
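Scope fencing and redaction are standard DuckDB secret features; assuming the extension follows the same conventions, a secret can be restricted to a single bucket prefix, and introspection shows only redacted values (the key values below are placeholders):

```sql
-- Hypothetical: restrict the secret to one warehouse path on OSS.
CREATE SECRET scoped_oss (
    TYPE paimon,
    key_id 'your-access-key-id',
    secret 'your-access-key-secret',
    endpoint 'oss-cn-hangzhou.aliyuncs.com',
    SCOPE 'oss://my-bucket/warehouse'
);
-- Listing secrets never reveals the plain-text key material.
SELECT name, type, scope FROM duckdb_secrets();
```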
| Capability | Status |
|---|---|
| Read Paimon tables (local / OSS) | Completed |
| Projection pushdown | Completed |
| Parquet and ORC format support | Completed |
| Catalog ATTACH | Completed |
| Filter (predicate) pushdown | Planned |
| Snapshot queries (time travel) | Planned |
| Write to Paimon tables | Planned |
| Metadata management (DDL) | Planned |
| Remote catalog services such as REST Catalog and DLF | Planned |
| More file systems, such as Hadoop Distributed File System (HDFS) | Planned |
git clone --recurse-submodules https://github.com/polardb/duckdb-paimon.git
cd duckdb-paimon
GEN=ninja make
`--recurse-submodules` pulls both the DuckDB and paimon-cpp submodules and is required for the build.
Ensure ninja is installed, or simply run make.
./build/release/duckdb
SELECT * FROM paimon_scan('./data/testdb.db/testtbl');
CREATE SECRET my_oss (
TYPE paimon,
key_id 'your-ak',
secret 'your-sk',
endpoint 'oss-cn-hangzhou.aliyuncs.com'
);
SELECT * FROM paimon_scan('oss://your-bucket/warehouse', 'your_db', 'your_table');
PolarDB for PostgreSQL is a cloud-native database developed by Alibaba Cloud. Based on a storage-compute decoupled architecture, it leverages software-hardware co-design to provide second-level elasticity, high performance, massive storage, and secure, reliable database services. PolarDB for PostgreSQL is 100% compatible with multiple native PostgreSQL versions, including PostgreSQL 11/14/15/16/17/18.
PolarDB for PostgreSQL delivers an HTAP solution—the In-Memory Column Index (IMCI)—powered by DuckDB. Users can seamlessly enable this feature via an extension, which automatically synchronizes row-based tables into a columnar format in real-time. This significantly accelerates analytical queries without requiring any modifications to business DML, as the system intelligently routes queries to either the row or column index based on cost.
Furthermore, deep integration with DuckDB enriches PolarDB's plugin ecosystem. For instance, leveraging capabilities like DuckDB-paimon, it supports a wider range of data sources—including open formats like Parquet and CSV, as well as data lake systems such as Iceberg, Paimon, and LanceDB—thereby greatly extending PolarDB's data processing boundaries.
Moreover, to become fully "AI-Ready," PolarDB is accelerating the realization of its "Lakehouse" vision. The upcoming "PolarDB Lakebase"—a cornerstone feature of this vision—is purpose-built for the massive volumes of semi-structured and unstructured data in the AI era. It offers end-to-end support ranging from storage, processing, and management to multi-modal retrieval and application development. This empowers PolarDB to seamlessly handle both database transactions and data lake analytics on a unified platform, with the ultimate goal of becoming a one-stop solution for enterprise multi-modal data fusion in the age of AI.