DLF Paimon Primary Key Table Performance Evaluation - Data Lake Formation

This report compares Data Lake Formation (DLF) Compaction with self-managed Paimon Compaction on primary key tables, measured by compute resource (CU) consumption. The evaluation focuses on two dimensions: resource utilization and elastic scheduling.

Test Scenarios

We systematically validate DLF’s technical advantages using the following three test scenarios:

- Adaptive bucketing strategy: Dynamically adjusts the number of buckets based on partition data volume, enabling fine-grained resource allocation.
- Deletion Vectors (DV) optimization: Optimizes lookup and merge efficiency when DV is enabled to improve Compaction performance.
- Dynamic resource elasticity optimization: Automatically scales compute resources (CU) based on real-time load, avoiding both waste from over-provisioning and shortages during peak demand.

Adaptive Bucketing Strategy

Test Motivation

In scenarios with highly skewed data distribution, traditional fixed-bucket strategies struggle to balance performance and resource efficiency. DLF implements a partition-level adaptive bucketing mechanism. This mechanism dynamically calculates the optimal number of buckets per partition based on actual data volume. This eliminates the need for manual bucket count configuration and enables precise resource allocation.
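The partition-level sizing idea can be sketched as follows. This is a minimal illustration and not DLF's actual algorithm, which this report does not describe: the function `adaptive_bucket_count`, the 1 GB target bucket size, and the partition names are all assumptions made for the example. The partition sizes follow the 70% and 3% shares of the 500 GB baseline used in the test plan.

```python
GB = 10 ** 9

def adaptive_bucket_count(partition_bytes: int,
                          target_bucket_bytes: int = GB,
                          min_buckets: int = 1) -> int:
    """Hypothetical rule: enough buckets to keep each one near the target size."""
    # Integer ceiling division avoids floating-point rounding on large sizes.
    needed = -(-partition_bytes // target_bucket_bytes)
    return max(min_buckets, needed)

# One primary partition with 70% of 500 GB, nine secondary partitions with 3% each.
partition_sizes_gb = {"p0": 350}
partition_sizes_gb.update({f"p{i}": 15 for i in range(1, 10)})

adaptive = {name: adaptive_bucket_count(size * GB)
            for name, size in partition_sizes_gb.items()}
adaptive_total = sum(adaptive.values())

# With a fixed 'bucket' = '500' setting, every partition gets 500 buckets.
fixed_total = 500 * len(partition_sizes_gb)

print(adaptive["p0"], adaptive["p1"])  # 350 15
print(adaptive_total, fixed_total)     # 485 5000
```

Under these assumptions the adaptive rule allocates 485 buckets in total, whereas the fixed setting creates 5,000 buckets across ten partitions and leaves each 15 GB partition with about 30 MB per bucket.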

Test Plan

Table schema design

Create a DLF-managed Paimon table whose bucket count is managed by the system, and compare it with a self-managed Paimon table that uses a fixed bucket count.

-- DLF table (adaptive bucketing)
-- Enable Deletion Vectors; system manages buckets automatically
CREATE TABLE perf_rest.pk_partitions_db.t (
    -- Primary key fields and extended attributes
) 
PARTITIONED BY (`partition_id`)
WITH (
    'deletion-vectors.enabled' = 'true'
);

-- Self-managed Paimon table (fixed bucket configuration)
-- Bucket count must be specified upfront
CREATE TABLE perf_filesystem.pk_partitions_db.t (
    -- Primary key fields and extended attributes
) 
PARTITIONED BY (`partition_id`)
WITH (
    'bucket' = '500',
    'write-only' = 'true',
    'deletion-vectors.enabled' = 'true'
);

Data injection strategy
- Initial data layer: Inject 500 GB of baseline data to create a non-uniform distribution. Assign 70% of the data to the primary partition and 3% each to nine secondary partitions, which creates significant data skew.
- Incremental stream: Simulate real-time data writes under production workloads.

Compaction execution
- DLF: Trigger intelligent Compaction. The system automatically adapts to partition characteristics.
- Self-managed Paimon: Run Compaction jobs using Flink Action with fixed configurations.

Performance Evaluation

Metric Dimension	DLF Compaction	Self-managed Paimon Compaction
Compaction CU Consumption	237 CU	482 CU
Resource Allocation Method	Dynamic optimized allocation	Static fixed allocation

DLF consumed 237 CU compared with 482 CU for the self-managed job, a 50.8% reduction in resource consumption. With skewed data distribution, this advantage stems from two key mechanisms:

- Dynamic bucketing algorithm: Calculates the optimal bucket count per partition in real time to avoid resource misallocation.
- Partition-level resource isolation: Eliminates the “long-tail effect” and prevents large partitions from degrading overall job performance.
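The 50.8% figure can be reproduced from the two CU values in the table. The helper name `cu_reduction_percent` is hypothetical and used only for this worked calculation.

```python
def cu_reduction_percent(baseline_cu: float, optimized_cu: float) -> float:
    """Relative reduction of optimized_cu versus baseline_cu, in percent."""
    if baseline_cu <= 0:
        raise ValueError("baseline_cu must be positive")
    return (baseline_cu - optimized_cu) / baseline_cu * 100

# Values from the table: self-managed Paimon 482 CU, DLF 237 CU.
reduction = cu_reduction_percent(482, 237)
print(f"{reduction:.1f}%")  # 50.8%
```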

Deletion Vectors Optimization

Test Motivation

In Partial-Update scenarios with Deletion Vectors (DV) enabled, data merging often encounters performance bottlenecks. DLF applies kernel-level optimizations for this scenario to improve Compaction efficiency during high-frequency updates.

Test Plan

Table schema design

Create a Paimon table that supports high-frequency updates, emphasizing Lookup File processing efficiency.

-- DLF table
CREATE TABLE ...
WITH (
  'deletion-vectors.enabled' = 'true',
  'merge-engine' = 'partial-update'
);

-- Self-managed Paimon table
CREATE TABLE ... 
WITH (
  'bucket' = '1024',
  'write-only' = 'true',
  'deletion-vectors.enabled' = 'true',
  'merge-engine' = 'partial-update'
);

Test workflow
- Load injection: Simulate 100,000 mixed read-write operations per second.
- Continuous monitoring: Record memory usage, garbage collection (GC) frequency, and system latency during each Compaction epoch.

Performance Evaluation

Metric Dimension	DLF Compaction	Self-managed Paimon Compaction
Compaction CU Consumption	41 CU	102 CU

DLF optimizes Lookup File processing at the kernel level for Deletion Vectors mode. Under the same throughput, DLF consumed 41 CU compared with 102 CU for the self-managed cluster, which is about 40% of the compute resources.

Dynamic Resource Elasticity Optimization

Test Motivation

Service traffic follows a peak-and-trough pattern. Self-managed Compaction jobs run with a fixed resource allocation: sizing it for peak load wastes resources during low-traffic periods, while sizing it any lower causes shortages during peaks. DLF automatically adjusts CU consumption based on real-time data volume, so resources follow the load.

Test Plan

Table schema design

-- DLF table
CREATE TABLE perf_rest.pk_elastic.t (
 ...
 PRIMARY KEY (`id`,`item_id`) NOT ENFORCED
) WITH (
 'deletion-vectors.enabled' = 'true'
);

-- Self-managed Paimon table
CREATE TABLE perf_fs.pk_elastic.t (
 ...
 PRIMARY KEY (`id`,`item_id`) NOT ENFORCED
) WITH (
 'bucket' = '500',
 'write-only' = 'true',
 'deletion-vectors.enabled' = 'true'
);

Test workflow
- Baseline data: Preload 500 GB of data.
- Dynamic traffic simulation: Write 10 million rows per minute for 20 minutes (peak), then 250,000 rows per minute for 40 minutes (off-peak).
- Resource configuration: Self-managed jobs use a fixed CU count sized for peak demand. DLF uses automatic elastic scaling.

Performance Evaluation

Metric Dimension	DLF Compaction	Self-managed Paimon Compaction
Average Compaction CU Consumption	135 CU	400 CU

DLF averaged 135 CU compared with the fixed 400 CU allocation of the self-managed job, about 66% lower. This improvement in overall resource utilization comes from adaptive resource adjustment:

- Elastic scaling: Automatically releases compute resources when data traffic drops, greatly reducing average CU consumption.
- Fully managed: Eliminates operational overhead from manually adjusting job parallelism in response to traffic fluctuations.
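As a rough sanity check, the average CU ratio can be set against the traffic profile from the test workflow. The sketch below uses only the figures stated in this report, and the variable names are illustrative. It does not assume or show that CU consumption scales linearly with write volume; it only places the two ratios side by side.

```python
# Traffic profile from the test workflow.
peak_rows_per_min, peak_minutes = 10_000_000, 20
off_rows_per_min, off_minutes = 250_000, 40

total_rows = peak_rows_per_min * peak_minutes + off_rows_per_min * off_minutes
avg_rows_per_min = total_rows / (peak_minutes + off_minutes)
load_ratio = avg_rows_per_min / peak_rows_per_min  # average load vs. peak load

# Average CU consumption from the table.
dlf_cu, self_managed_cu = 135, 400
cu_ratio = dlf_cu / self_managed_cu
cu_reduction = (1 - cu_ratio) * 100

print(f"total rows written: {total_rows:,}")             # 210,000,000
print(f"average load / peak load: {load_ratio:.2%}")     # 35.00%
print(f"DLF CU / self-managed CU: {cu_ratio:.2%}")       # 33.75%
print(f"average CU reduction: {cu_reduction:.2f}%")      # 66.25%
```

The average write rate is 35% of the peak rate, and DLF's average CU consumption is about 34% of the peak-sized fixed allocation, which is consistent with resources being released during the off-peak window.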