This topic describes how to manage Paimon tables in Data Lake Formation (DLF).
Introduction
| Item | Description |
| --- | --- |
| Features | Integrated real-time and batch storage. Efficient read and write through compute engines and Paimon APIs. |
| Use cases | Stream processing, real-time updates, and high-performance OLAP. |
| Management | Fully managed by DLF (metadata and data). Deleting a table removes both. |
| Storage | Auto-generated, UUID-based storage path. No manual path configuration is required. |
| Deletion | Data is retained for 1 day by default after table deletion to prevent accidental loss, and is permanently deleted thereafter. |
Paimon table capabilities:
Managed compaction: Isolated from writes for stable operation.
Concurrent writes: Multiple jobs can write to the same partition simultaneously.
Real-time metrics: Partition-specific stats (rows, files, size).
Multi-version support: Enables time travel and granular insert, update, and delete operations.
Create a table
1. Log on to the DLF console.
2. In the left navigation menu, select Catalogs, and click your catalog name.
3. In the Database section, click your database name.
4. Click Create Table.
5. Configure the table as described in the following table and click OK.
| Configuration item | Description |
| --- | --- |
| Table Format | Select Paimon Table. |
| Table Name | Enter a table name. The name must be unique within the database. |
| Table Description | Enter a description of the table. |
| Columns | Define the columns of the table. |
| User-defined Table Properties | Define custom properties as needed. During table creation, these properties override the default Paimon table properties. For more information, see Configuration in the Apache Paimon documentation. |
Note: DLF Paimon tables default to write-only mode. Background optimization (compaction and cleanup) is automatically handled by DLF.
Create a table using SQL
DLF Paimon tables come in two types: primary key and append-only. When a DLF catalog is registered on a platform such as EMR Serverless Spark or Realtime Compute for Apache Flink, you can create databases and tables by using SQL, and the metadata is written directly to DLF. For more information, see Engine integration.
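For example, the following minimal Flink SQL sketch creates a database through a registered DLF catalog. The catalog name dlf_catalog and database name sales are hypothetical placeholders.

-- Switch to the registered DLF catalog; subsequent DDL writes metadata to DLF
USE CATALOG dlf_catalog;
-- Create a database and make it the current database
CREATE DATABASE IF NOT EXISTS sales;
USE sales;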
Primary key tables
Designed for stream processing, these tables use a primary key to uniquely identify each row. They support real-time updates, inserts, and deletes, and automatically generate precise change logs for downstream consumption. They also support efficient queries on primary key conditions.
Flink SQL

CREATE TABLE orders (
    order_id BIGINT,
    price BIGINT,
    customer STRING,
    PRIMARY KEY (order_id) NOT ENFORCED
);

Spark SQL

CREATE TABLE orders (
    order_id BIGINT,
    price BIGINT,
    customer STRING
) TBLPROPERTIES (
    'primary-key' = 'order_id'
);
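To illustrate the update semantics, here is a short Spark SQL sketch against the orders table above. With Paimon's default deduplicate merge engine, a later write with the same primary key replaces the earlier row.

INSERT INTO orders VALUES (1, 100, 'alice');
-- Writing the same primary key again replaces the existing row
INSERT INTO orders VALUES (1, 120, 'alice');
-- Returns a single row: (1, 120, 'alice')
SELECT * FROM orders;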
By default, DLF uses the postpone bucket mode, which adaptively allocates buckets based on data volume. This avoids having too many or too few buckets and eliminates manual configuration, optimizing read and write performance. However, newly written data is not visible until compaction completes.
To eliminate this latency, do one of the following:
Use Ververica Runtime (VVR) 11.4 or later for Flink jobs, or esr-4.5 or later for Spark jobs. In these versions, the postpone bucket mode writes batches directly to buckets, which eliminates the latency.
For latency-sensitive tables, explicitly set the number of buckets, for example, 'bucket' = '5' (see the sketch after this list). We recommend one bucket for approximately every 1 GB of partition data.
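A minimal Flink SQL sketch of a table with a fixed bucket count. The table name orders_fixed is hypothetical; in Spark SQL, the property would go into TBLPROPERTIES instead of the WITH clause.

CREATE TABLE orders_fixed (
    order_id BIGINT,
    price BIGINT,
    customer STRING,
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    -- Fixed bucket count; aim for roughly 1 GB of partition data per bucket
    'bucket' = '5'
);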
For other business-related needs, you can configure the following options (the sketch after this list combines several of them):
Merge engine: Defines how records that share the same primary key are merged, for example, through partial updates or aggregation.
Deletion vectors: Significantly improve query performance by letting readers skip deleted rows.
Note: Enabling deletion vectors means new data is only visible after compaction, regardless of the bucket mode used.
Changelog producer: A lookup changelog producer generates changelogs for downstream stream reads.
Sequence field: Handles out-of-order data and ensures the correct update sequence.
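A hedged Flink SQL sketch combining several of these options. The table and the update_time sequence field are hypothetical; the property keys follow the Apache Paimon configuration documentation.

CREATE TABLE orders_configured (
    order_id BIGINT,
    price BIGINT,
    customer STRING,
    update_time TIMESTAMP(3),
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    -- Merge rows with the same key by filling in non-null fields
    'merge-engine' = 'partial-update',
    -- Generate complete changelogs for downstream stream reads
    'changelog-producer' = 'lookup',
    -- Resolve out-of-order writes by this field
    'sequence.field' = 'update_time',
    -- Faster queries at the cost of extra compaction resources
    'deletion-vectors.enabled' = 'true'
);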
Use Flink CDC or data integration products to write CDC data to the lake. These solutions provide full-database synchronization, automatic table creation, and schema synchronization.
For optimal OLAP query performance on primary key tables, we strongly recommend enabling deletion vectors. This feature improves stability and query performance at the cost of increased compaction resource usage.
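Deletion vectors can also be enabled on an existing table. A minimal Spark SQL sketch, assuming the orders table created above:

ALTER TABLE orders SET TBLPROPERTIES (
    'deletion-vectors.enabled' = 'true'
);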
Append-only tables
Append-only tables do not have a primary key. Although they do not support direct streaming updates, they offer superior batch processing performance. Append-only tables:
Support stream writes and reads. DLF automatically merges small files for better data availability and lower cost.
Support fine-grained DELETE, UPDATE, and MERGE INTO operations, as well as version management and time travel (see the example after the CREATE TABLE statements below).
Are optimized for OLAP engines through sorting and bitmaps, ensuring excellent direct-read performance.
Append-only tables are ideal for most batch processing scenarios and for stream processing that does not require a primary key. They are simpler to use than primary key tables and offer efficient writes and queries.
Flink SQL

CREATE TABLE orders (
    order_id BIGINT,
    price BIGINT,
    customer STRING
);

Spark SQL

CREATE TABLE orders (
    order_id BIGINT,
    price BIGINT,
    customer STRING
);
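A Spark SQL sketch of the fine-grained operations and time travel mentioned above, assuming the orders table already contains data and that snapshot 1 exists:

-- Fine-grained row-level changes
DELETE FROM orders WHERE price = 0;
UPDATE orders SET customer = 'anonymous' WHERE order_id = 100;
-- Time travel: query the table as of snapshot 1
SELECT * FROM orders VERSION AS OF 1;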
View a table
1. In the Database section, click your database name.
2. On the Tables tab, click your table name.
3. On the Table Details tab, view the table's basic information and columns.
Note: On the Table Details tab, you can modify the storage class for partitioned and non-partitioned tables. For more information, see Manually change the storage class.
4. On the Permissions tab, grant table permissions to DLF users or roles. For more information, see Manage data permissions.
Delete a table
Deleted table data is retained for 1 day to prevent accidental deletion. Data is permanently deleted afterward.
1. In the Database section, click your database name.
2. On the Tables tab, click Delete in the Actions column of your target table.
3. In the dialog box, click OK to confirm the deletion.