This topic describes how to manage Paimon tables in Data Lake Formation (DLF).
Introduction
| Item | Description |
| --- | --- |
| Features | Integrated real-time and batch storage. Efficient read and write through compute engines and Paimon APIs. |
| Use cases | Stream processing, real-time updates, and high-performance OLAP. |
| Management | Fully managed by DLF (metadata and data). Deleting a table removes both. |
| Storage | Auto-generated, UUID-based storage path. No manual path configuration is required. |
| Deletion | Data is retained for 1 day by default after table deletion to prevent accidental loss, and is permanently deleted thereafter. |
Paimon table capabilities:
Managed compaction: Isolated from writes for stable operation.
Concurrent writes: Multiple jobs can write to the same partition simultaneously.
Real-time metrics: Partition-specific stats (rows, files, size).
Multi-version support: Enables time travel and granular insert, update, and delete operations.
Create a table
1. Log on to the DLF console.
2. In the left navigation menu, select Catalogs, and click your catalog name.
3. In the Database section, click your database name.
4. Click Create Table.
5. Configure the table as described in the following table and click OK.
| Configuration item | Description |
| --- | --- |
| Table Format | Select Paimon Table. |
| Table Name | Enter a table name. The name must be unique within the database. |
| Table Description | Enter a description of the table. |
| Columns | Define the columns of the table. |
| User-defined Table Properties | Define custom properties as needed. During table creation, these properties override the default Paimon table properties. For more information, see Configuration in the Apache Paimon documentation. |
Note: DLF Paimon tables default to write-only mode. Background optimization (compaction and cleanup) is automatically handled by DLF.
Create a table using SQL
DLF Paimon tables come in two types: primary key and append-only. When a DLF catalog is registered on a platform such as EMR Serverless Spark or Realtime Compute for Apache Flink, you can create databases and tables by using SQL, and the metadata is written directly to DLF. For more information, see Engine integration.
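For example, the following minimal Flink SQL sketch creates a database through a registered DLF catalog. The catalog name dlf_catalog and database name sales are hypothetical placeholders.

-- Switch to the registered DLF catalog; subsequent DDL writes metadata to DLF
USE CATALOG dlf_catalog;
-- Create a database and make it the current database
CREATE DATABASE IF NOT EXISTS sales;
USE sales;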
Primary key tables
Designed for stream processing, these tables use a primary key to uniquely identify each row. They support real-time updates, inserts, and deletes, and automatically generate precise change logs for downstream consumption. They also support efficient queries on primary key conditions.
Flink SQL

CREATE TABLE orders (
    order_id BIGINT,
    price BIGINT,
    customer STRING,
    PRIMARY KEY (order_id) NOT ENFORCED
);

Spark SQL

CREATE TABLE orders (
    order_id BIGINT,
    price BIGINT,
    customer STRING
) TBLPROPERTIES (
    'primary-key' = 'order_id'
);
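To illustrate the update semantics, here is a short Spark SQL sketch against the orders table above. With Paimon's default deduplicate merge engine, a later write with the same primary key replaces the earlier row.

INSERT INTO orders VALUES (1, 100, 'alice');
-- Writing the same primary key again replaces the existing row
INSERT INTO orders VALUES (1, 120, 'alice');
-- Returns a single row: (1, 120, 'alice')
SELECT * FROM orders;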
By default, DLF uses the postpone bucket mode, which adaptively allocates buckets based on data volume. This avoids having too many or too few buckets and eliminates manual configuration, optimizing read and write performance. However, newly written data is not visible until compaction completes.
To eliminate this latency, do one of the following:
Use Ververica Runtime (VVR) 11.4 or later for Flink jobs, or esr-4.5 or later for Spark jobs. In these versions, the postpone bucket mode writes batches directly to buckets, which eliminates the latency.
For latency-sensitive tables, explicitly set the number of buckets, for example, 'bucket' = '5' (see the sketch after this list). We recommend one bucket for approximately every 1 GB of partition data.
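A minimal Flink SQL sketch of a table with a fixed bucket count. The table name orders_fixed is hypothetical; in Spark SQL, the property would go into TBLPROPERTIES instead of the WITH clause.

CREATE TABLE orders_fixed (
    order_id BIGINT,
    price BIGINT,
    customer STRING,
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    -- Fixed bucket count; aim for roughly 1 GB of partition data per bucket
    'bucket' = '5'
);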
For other business-related needs, you can configure the following options (the sketch after this list combines several of them):
Merge engine: Defines how records that share the same primary key are merged, for example, through partial updates or aggregation.
Deletion vectors: Significantly improve query performance by letting readers skip deleted rows.
Note: Enabling deletion vectors means new data is only visible after compaction, regardless of the bucket mode used.
Changelog producer: A lookup changelog producer generates changelogs for downstream stream reads.
Sequence field: Handles out-of-order data and ensures the correct update sequence.
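A hedged Flink SQL sketch combining several of these options. The table and the update_time sequence field are hypothetical; the property keys follow the Apache Paimon configuration documentation.

CREATE TABLE orders_configured (
    order_id BIGINT,
    price BIGINT,
    customer STRING,
    update_time TIMESTAMP(3),
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    -- Merge rows with the same key by filling in non-null fields
    'merge-engine' = 'partial-update',
    -- Generate complete changelogs for downstream stream reads
    'changelog-producer' = 'lookup',
    -- Resolve out-of-order writes by this field
    'sequence.field' = 'update_time',
    -- Faster queries at the cost of extra compaction resources
    'deletion-vectors.enabled' = 'true'
);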
Use Flink CDC or data integration products to write CDC data to the lake. These solutions provide full-database synchronization, automatic table creation, and schema synchronization.
For optimal OLAP query performance on primary key tables, we strongly recommend enabling deletion vectors. This feature improves stability and query performance at the cost of increased compaction resource usage.
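Deletion vectors can also be enabled on an existing table. A minimal Spark SQL sketch, assuming the orders table created above:

ALTER TABLE orders SET TBLPROPERTIES (
    'deletion-vectors.enabled' = 'true'
);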
Append-only tables
Append-only tables do not have a primary key. Although they do not support direct streaming updates, they offer superior batch processing performance. Append-only tables:
Support stream writes and reads. DLF automatically merges small files for better data availability and lower cost.
Support fine-grained DELETE, UPDATE, and MERGE INTO operations, as well as version management and time travel (see the example after the CREATE TABLE statements below).
Are optimized for OLAP engines through sorting and bitmaps, ensuring excellent direct-read performance.
Append-only tables are ideal for most batch processing scenarios and for stream processing that does not require a primary key. They are simpler to use than primary key tables and offer efficient writes and queries.
Flink SQL

CREATE TABLE orders (
    order_id BIGINT,
    price BIGINT,
    customer STRING
);

Spark SQL

CREATE TABLE orders (
    order_id BIGINT,
    price BIGINT,
    customer STRING
);
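A Spark SQL sketch of the fine-grained operations and time travel mentioned above, assuming the orders table already contains data and that snapshot 1 exists:

-- Fine-grained row-level changes
DELETE FROM orders WHERE price = 0;
UPDATE orders SET customer = 'anonymous' WHERE order_id = 100;
-- Time travel: query the table as of snapshot 1
SELECT * FROM orders VERSION AS OF 1;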
View a table
1. In the Database section, click your database name.
2. On the Tables tab, click your table name.
3. On the Table Details tab, view the table's basic information and columns.
Note: On the Table Details tab, you can modify the storage class for partitioned and non-partitioned tables. For more information, see Manually change the storage class.
4. On the Permissions tab, grant table permissions to DLF users or roles. For more information, see Manage data permissions.
Delete a table
Deleted table data is retained for 1 day to prevent accidental deletion. Data is permanently deleted afterward.
1. In the Database section, click your database name.
2. On the Tables tab, click Delete in the Actions column of your target table.
3. In the dialog box, click OK to confirm the deletion.