Use Apache Paimon to build a streaming lakehouse solution - Realtime Compute for Apache Flink

Apache Paimon is a lakehouse storage format that unifies streaming and batch data processing. Built on a log-structured merge-tree (LSM tree), Paimon brings real-time update semantics directly into the lake layer, so readers see consistent snapshots while writers sustain high throughput. Use Paimon tables in Realtime Compute for Apache Flink to build a streaming lakehouse on cloud storage such as Object Storage Service (OSS).
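To make the LSM idea concrete, the following is a minimal pure-Python sketch of how several sorted runs can be merged so that the newest version of each key wins. It is a toy model under stated assumptions, not Paimon's implementation: the `(key, sequence, value)` tuple layout and the `compact` function are hypothetical names chosen for illustration, and a value of `None` stands in for a delete marker.

```python
import heapq


def compact(sorted_runs):
    """Merge sorted runs of (key, sequence, value) tuples.

    Each run must already be sorted by (key, sequence). For every key the
    entry with the highest sequence number wins; a value of None is treated
    as a delete marker and the key is dropped from the result.
    """
    merged = heapq.merge(*sorted_runs, key=lambda entry: (entry[0], entry[1]))
    latest = {}
    for key, sequence, value in merged:
        # Entries arrive ordered by (key, sequence), so a later entry for the
        # same key always carries a newer sequence number.
        latest[key] = value
    return [(key, value) for key, value in latest.items() if value is not None]


older_run = [(1, 1, "a"), (2, 2, "b"), (3, 3, "c")]
newer_run = [(1, 4, "a2"), (3, 5, None)]  # update key 1, delete key 3
print(compact([older_run, newer_run]))
```

Because writers only add new sorted runs and merging happens later, updates stay cheap at write time, which is the property that makes streaming upserts practical on object storage.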

Paimon integrates with Apache Flink for stream processing and Apache Spark for batch processing through a single storage format. Key capabilities include:

Real-time data ingestion: Ingest tens of millions of records from database change streams (such as MySQL CDC) with automatic schema change synchronization and low latency.
Unified stream and batch processing: Read the same Paimon table as a bounded batch source in Spark or as an unbounded changelog stream in Flink, with no format conversion required.
Broad ecosystem integration: Connect Paimon tables to Realtime Compute for Apache Flink, E-MapReduce (Spark, StarRocks, Hive, and Trino), and MaxCompute without data duplication.
Low-latency OLAP queries: Deletion vectors and primary key indexes keep streaming, batch, and online analytical processing (OLAP) query latency at the minute level.

For the full Apache Paimon specification, see Apache Paimon.

Usage

Get started with Paimon

Start with core concepts and basic operations before building production pipelines. For more information, see Get started with Paimon catalogs.
Choose the right table type for your workload: use primary key tables when data requires streaming inserts, updates, or deletes, and append-only tables (without primary keys) for insert-only workloads such as log synchronization.
To understand how Paimon maintains data freshness and consistency across snapshots, see Data latency and consistency.
For a step-by-step guide to building a streaming lakehouse end to end, see Build a streaming lakehouse with Paimon and StarRocks.
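The difference between the two table types mentioned above can be sketched with a small in-memory model. This is an illustration only: `apply_changes` and the `"+I"` / `"-D"` change kinds are hypothetical names for this sketch, and rejecting deletes on an append-only table is a simplification of the insert-only usage described above.

```python
def apply_changes(changes, primary_key=None):
    """Return the rows visible after applying (kind, row) changes in order.

    kind is "+I" (insert or upsert) or "-D" (delete).
    With a primary key, rows are upserted and deleted by key.
    Without one, the table is append-only: every inserted row is kept.
    """
    if primary_key is None:
        rows = []
        for kind, row in changes:
            if kind != "+I":
                raise ValueError("this toy append-only table accepts inserts only")
            rows.append(row)
        return rows

    rows_by_key = {}
    for kind, row in changes:
        key = row[primary_key]
        if kind == "+I":
            rows_by_key[key] = row          # a later row replaces the earlier one
        elif kind == "-D":
            rows_by_key.pop(key, None)      # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown change kind: {kind}")
    return list(rows_by_key.values())


changes = [
    ("+I", {"id": 1, "status": "created"}),
    ("+I", {"id": 1, "status": "paid"}),
    ("+I", {"id": 2, "status": "created"}),
]
print(apply_changes(changes, primary_key="id"))  # one row per id
print(apply_changes(changes))                    # all three rows kept
```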

Create a Paimon catalog

A Paimon catalog is a centralized registry for Paimon tables stored in external systems such as OSS. Other Alibaba Cloud services can access tables through the same catalog. Set up a catalog in any of the following ways:

Create and use a Paimon catalog. For more information, see Manage Paimon catalogs.
Synchronize Paimon table metadata to Data Lake Formation (DLF). For more information, see Create a DLF catalog.
Create a Paimon external table in MaxCompute to query Paimon data from MaxCompute. For more information, see Create a MaxCompute catalog.
Synchronize metadata to DLF and create a Paimon external table in MaxCompute simultaneously. For more information, see Create a Paimon sync catalog.
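As a rough sketch of the first option, the snippet below renders a Flink SQL `CREATE CATALOG` statement as a Python string. The option names (`type`, `metastore`, `warehouse`) follow the open-source Paimon Flink catalog and are assumed to apply here; the catalog name and OSS path are hypothetical placeholders, and `create_catalog_sql` is a helper invented for this example. Check Manage Paimon catalogs for the options your version requires, such as OSS credentials.

```python
def create_catalog_sql(name, warehouse):
    """Render a CREATE CATALOG statement for a filesystem-backed Paimon catalog."""
    return (
        f"CREATE CATALOG `{name}` WITH (\n"
        "  'type' = 'paimon',\n"
        "  'metastore' = 'filesystem',\n"   # assumption: metadata kept alongside the data
        f"  'warehouse' = '{warehouse}'\n"  # root directory that holds all tables
        ")"
    )


# Hypothetical catalog name and bucket path, for illustration only.
statement = create_catalog_sql("my_paimon", "oss://my-bucket/warehouse")
print(statement)
```

The printed statement can be pasted into a SQL editor, or passed to `TableEnvironment.execute_sql` in a PyFlink program.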

Create a Paimon table

Create a table directly inside a Paimon catalog. For more information, see Manage tables.
Create a Paimon table by synchronizing data from external sources such as MySQL and Apache Kafka using the CREATE TABLE AS (CTAS) statement or the CREATE DATABASE AS (CDAS) statement. For more information, see Create a table using CTAS or CDAS.
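For the first option, a table definition might look like the output of the sketch below, which renders a Flink SQL `CREATE TABLE` statement. The helper `render_create_table`, the table and column names, and the `bucket` value are all hypothetical. The check that partition keys belong to the primary key reflects a rule from the open-source Paimon documentation and is assumed to hold here as well.

```python
def render_create_table(table, columns, primary_key=(), partition_keys=(), options=None):
    """Render a Flink SQL CREATE TABLE statement for a Paimon table.

    Omitting primary_key yields an append-only table definition.
    """
    if primary_key:
        missing = [key for key in partition_keys if key not in primary_key]
        if missing:
            raise ValueError(f"partition keys {missing} must be part of the primary key")

    lines = [f"  `{name}` {sql_type}" for name, sql_type in columns]
    if primary_key:
        key_list = ", ".join(f"`{column}`" for column in primary_key)
        lines.append(f"  PRIMARY KEY ({key_list}) NOT ENFORCED")

    ddl = f"CREATE TABLE {table} (\n" + ",\n".join(lines) + "\n)"
    if partition_keys:
        ddl += " PARTITIONED BY (" + ", ".join(f"`{key}`" for key in partition_keys) + ")"
    if options:
        rendered = ",\n".join(f"  '{key}' = '{value}'" for key, value in options.items())
        ddl += f" WITH (\n{rendered}\n)"
    return ddl


orders_ddl = render_create_table(
    "`my_paimon`.`my_db`.`orders`",          # hypothetical catalog, database, table
    [("order_id", "BIGINT"), ("dt", "STRING"), ("amount", "DECIMAL(10, 2)")],
    primary_key=("dt", "order_id"),
    partition_keys=("dt",),
    options={"bucket": "4"},                  # fixed bucket mode with 4 buckets
)
print(orders_ddl)
```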

Write data to a Paimon table

Insert or update records in a Paimon table. For more information, see Write data.
Widen a Paimon table with partial updates or pre-aggregate records that share a primary key by choosing a merge engine. For more information, see Merge engines.
Partially or fully overwrite a Paimon table. For more information, see Overwrite data (INSERT OVERWRITE).
Delete rows from a Paimon table. For more information, see Delete data (DELETE).
Drop partitions from a Paimon table. For more information, see Modify a table schema.
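A merge engine decides what happens when several records with the same primary key are written. The toy model below illustrates three behaviors using the engine names from open-source Paimon (`deduplicate`, `partial-update`, `aggregation`). The `merge` function and its `sum_fields` parameter are hypothetical, and real merge engines support more options than shown here.

```python
def merge(rows, key, engine="deduplicate", sum_fields=()):
    """Combine rows that share a primary key, in arrival order.

    deduplicate:    the last row for a key replaces earlier ones entirely.
    partial-update: non-null fields of later rows overwrite earlier values.
    aggregation:    fields in sum_fields are summed; other fields keep the
                    last non-null value.
    """
    if engine not in ("deduplicate", "partial-update", "aggregation"):
        raise ValueError(f"unknown merge engine: {engine}")

    merged = {}
    for row in rows:
        current = merged.get(row[key])
        if current is None or engine == "deduplicate":
            merged[row[key]] = dict(row)
            continue
        for field, value in row.items():
            if value is None:
                continue  # nulls never overwrite an existing value
            if engine == "aggregation" and field in sum_fields and current.get(field) is not None:
                current[field] = current[field] + value
            else:
                current[field] = value
    return list(merged.values())


profile_rows = [
    {"id": 1, "name": "alice", "amount": None},
    {"id": 1, "name": None, "amount": 5},
]
print(merge(profile_rows, "id", engine="deduplicate"))
print(merge(profile_rows, "id", engine="partial-update"))

payments = [{"id": 1, "amount": 5}, {"id": 1, "amount": 7}]
print(merge(payments, "id", engine="aggregation", sum_fields={"amount"}))
```

The partial-update case is the one that lets several source streams each fill in part of a wide row.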

Consume data from a Paimon table

Query or consume data from a Paimon table in batch or streaming mode. For more information, see Consume data. To consume data from a primary key table in streaming mode, configure the changelog producer first.
Configure the consumer offset of a Paimon table. For more information, see Consume data from a specified offset.
Save the consumer offset of a Paimon table so that a restarted job resumes where it left off, and prevent snapshots that a consumer has not yet read from being expired. For more information, see Save consumption progress with consumer ID.
Run a batch job to read the historical state of a Paimon table from a specific snapshot. For more information, see Time travel.
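The relationship between snapshots, time travel, and consumer IDs can be sketched with an in-memory model. `ToyPaimonTable` and its methods are hypothetical and are not the Paimon API. In open-source Paimon the corresponding table options are `scan.snapshot-id` for time travel and `consumer-id` for saved progress, which are assumed to apply here too.

```python
class ToyPaimonTable:
    """In-memory stand-in for a table's snapshot log (illustration only)."""

    def __init__(self):
        self.snapshots = {}   # snapshot id -> table contents at that commit
        self.consumers = {}   # consumer id -> next snapshot id to read
        self._next_id = 1

    def commit(self, rows):
        """Record a new snapshot and return its id."""
        snapshot_id = self._next_id
        self.snapshots[snapshot_id] = list(rows)
        self._next_id += 1
        return snapshot_id

    def read_batch(self, snapshot_id=None):
        """Batch read: the latest snapshot, or an older one (time travel)."""
        if snapshot_id is None:
            snapshot_id = max(self.snapshots)
        return self.snapshots[snapshot_id]

    def poll(self, consumer_id, start_from=1):
        """Streaming read: return the next unread snapshot id, or None."""
        next_id = self.consumers.get(consumer_id, start_from)
        if next_id not in self.snapshots:
            return None
        self.consumers[consumer_id] = next_id + 1   # save progress
        return next_id

    def expire(self, keep_latest):
        """Drop old snapshots, but never one a consumer still has to read."""
        if not self.snapshots:
            return
        cutoff = max(self.snapshots) - keep_latest + 1
        if self.consumers:
            cutoff = min(cutoff, min(self.consumers.values()))
        for snapshot_id in [s for s in self.snapshots if s < cutoff]:
            del self.snapshots[snapshot_id]


table = ToyPaimonTable()
for day in range(1, 6):
    table.commit([f"row-{day}"])

table.poll("job-a")            # job-a reads snapshot 1 and will read 2 next
table.expire(keep_latest=2)    # snapshots 2 and 3 survive because job-a needs them
print(sorted(table.snapshots))
```

In this model, an abandoned consumer holds back expiration indefinitely, so stale consumer IDs are worth cleaning up.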

Maintain a Paimon table

Troubleshoot common issues with Paimon. For more information, see Connectors.
Tune read and write performance for Paimon tables. For more information, see Performance tuning for Paimon tables.
Inspect table metadata such as partition list and per-partition file size. For more information, see Paimon system tables.
Modify the schema of a table in a Paimon catalog. For more information, see Modify a table schema.
Drop a table from a Paimon catalog. For more information, see Drop a table.
Rescale the number of buckets for a Paimon table in fixed bucket mode. For more information, see Change the number of buckets in fixed bucket mode.
Clean up obsolete files from a Paimon table directory. For more information, see Clean up expired data.
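To see why rescaling buckets in fixed bucket mode is a maintenance task rather than a simple setting change, consider the sketch below. It assumes records are routed by hashing the bucket key modulo the bucket count, as described in the open-source Paimon documentation. `zlib.crc32` is a stand-in for Paimon's actual hash function, and `bucket_of` and `keys_that_move` are hypothetical helpers.

```python
import zlib


def bucket_of(key, num_buckets):
    """Assign a key to a bucket. crc32 is a stand-in for the real hash."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def keys_that_move(keys, old_buckets, new_buckets):
    """Return the keys whose bucket changes when the bucket count changes."""
    return [
        key for key in keys
        if bucket_of(key, old_buckets) != bucket_of(key, new_buckets)
    ]


keys = [f"order-{i}" for i in range(1000)]
moved = keys_that_move(keys, old_buckets=4, new_buckets=8)
print(f"{len(moved)} of {len(keys)} keys change bucket when going from 4 to 8 buckets")
```

Any key that lands in a different bucket under the new count would no longer be found where existing files placed it, which is why existing data generally has to be rewritten when the bucket count changes.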