Data Lake Formation: DLF Paimon Primary Key Table Performance Evaluation

Last Updated: Mar 02, 2026

This report compares the compute resources consumed by DLF Compaction and self-managed Paimon Compaction. The evaluation covers several test scenarios and focuses on resource utilization and elastic scheduling capabilities.

Test Scenarios

We systematically validate DLF’s technical advantages using the following three test scenarios:

  • Adaptive bucketing strategy: Dynamically adjusts the number of buckets based on partition data volume, enabling fine-grained resource allocation.

  • Deletion Vectors (DV) optimization: Optimizes lookup and merge efficiency when Deletion Vectors are enabled, improving Compaction performance.

  • Dynamic resource elasticity optimization: Automatically scales compute resources (CU) based on real-time load, avoiding both waste from over-provisioning and resource shortages during peak demand.

Adaptive Bucketing Strategy

Test Motivation

In scenarios with highly skewed data distribution, traditional fixed-bucket strategies struggle to balance performance and resource efficiency. DLF implements a partition-level adaptive bucketing mechanism. This mechanism dynamically calculates the optimal number of buckets per partition based on actual data volume. This eliminates the need for manual bucket count configuration and enables precise resource allocation.
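
The idea of sizing buckets per partition can be sketched as follows. This is a minimal illustration, not DLF's actual algorithm, which the report does not disclose. The target bucket size, the function name `buckets_for_partition`, and the partition sizes are all assumptions made for the example.

```python
# Hypothetical target size per bucket; this is NOT a documented DLF setting.
GIB = 1024 ** 3
TARGET_BUCKET_BYTES = 1 * GIB


def buckets_for_partition(partition_bytes: int,
                          target_bucket_bytes: int = TARGET_BUCKET_BYTES) -> int:
    """Return a bucket count proportional to the partition size (at least 1)."""
    if partition_bytes < 0:
        raise ValueError("partition_bytes must be non-negative")
    # Integer ceiling division avoids floating-point rounding.
    return max(1, (partition_bytes + target_bucket_bytes - 1) // target_bucket_bytes)


# Illustrative partition sizes: one large and one small partition.
partition_sizes = {"p_large": 350 * GIB, "p_small": 15 * GIB}
plan = {name: buckets_for_partition(size) for name, size in partition_sizes.items()}
print(plan)
```

Under these assumptions the large partition gets many buckets and the small one gets few, instead of both receiving the same fixed count.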

Test Plan

  1. Table schema design

    Create a DLF Paimon table whose bucket count is managed by the system, and compare it with a self-managed Paimon table that uses a fixed bucket count.

    -- DLF table (adaptive bucketing)
    -- Enable Deletion Vectors; system manages buckets automatically
    CREATE TABLE perf_rest.pk_partitions_db.t (
        -- Primary key fields and extended attributes
    ) 
    PARTITIONED BY (`partition_id`)
    WITH (
        'deletion-vectors.enabled' = 'true'
    );
    
    -- Self-managed Paimon table (fixed bucket configuration)
    -- Bucket count must be specified upfront
    CREATE TABLE perf_filesystem.pk_partitions_db.t (
        -- Primary key fields and extended attributes
    ) 
    PARTITIONED BY (`partition_id`)
    WITH (
        'bucket' = '500',
        'write-only' = 'true',
        'deletion-vectors.enabled' = 'true'
    );
  2. Data injection strategy

    • Initial data layer: Inject 500 GB of baseline data to create a non-uniform distribution. Assign 70% of the data to the primary partition and 3% each to nine secondary partitions, which creates significant data skew.

    • Incremental stream: Simulate real-time data writes under production workloads.

  3. Compaction execution

    • DLF: Trigger intelligent Compaction. The system automatically adapts to partition characteristics.

    • Self-managed Paimon: Run Compaction jobs using Flink Action with fixed configurations.
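
To see why a fixed bucket count fits skewed data poorly, the sketch below computes the average data per bucket in each partition of the test plan. It assumes the fixed count of 500 applies to every partition, as in Paimon's fixed-bucket mode, and uses the shares exactly as stated in the plan. The helper `per_bucket_mb` is a name introduced only for this example.

```python
TOTAL_GB = 500        # baseline data volume from the test plan
FIXED_BUCKETS = 500   # 'bucket' = '500' in the self-managed table

# Shares in percent, as stated in the test plan: one primary partition at 70%
# and nine secondary partitions at 3% each (97% in total as stated).
share_pct = {"primary": 70}
share_pct.update({f"secondary_{i}": 3 for i in range(1, 10)})


def per_bucket_mb(total_gb: float, pct: float, buckets: int) -> float:
    """Average data per bucket, in MB, for one partition (1 GB = 1024 MB)."""
    return total_gb * pct / 100 * 1024 / buckets


for name, pct in share_pct.items():
    size_gb = TOTAL_GB * pct / 100
    mb = per_bucket_mb(TOTAL_GB, pct, FIXED_BUCKETS)
    print(f"{name:<12} {size_gb:6.1f} GB  {mb:7.2f} MB/bucket")
```

With these numbers the primary partition averages roughly 717 MB per bucket, while each secondary partition averages roughly 31 MB per bucket, so one fixed setting is too coarse for one partition or too fine for the others.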

Performance Evaluation

| Metric Dimension | DLF Compaction | Self-managed Paimon Compaction |
| --- | --- | --- |
| Compaction CU Consumption | 237 CU | 482 CU |
| Resource Allocation Method | Dynamic optimized allocation | Static fixed allocation |

DLF consumes 237 CU against 482 CU for self-managed Paimon, a 50.8% reduction achieved through its intelligent bucketing strategy. In skewed data distribution scenarios, this advantage stems from two key mechanisms:

  • Dynamic bucketing algorithm: Calculates the optimal bucket count per partition in real time to avoid resource misallocation.

  • Partition-level resource isolation: Eliminates the “long-tail effect” and prevents large partitions from degrading overall job performance.
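
The 50.8% figure follows directly from the two measured CU values. The short check below reproduces it; the function name `reduction_pct` is introduced only for this example.

```python
def reduction_pct(dlf_cu: float, self_managed_cu: float) -> float:
    """Percentage of CU saved relative to the self-managed baseline."""
    if self_managed_cu <= 0:
        raise ValueError("self_managed_cu must be positive")
    return (self_managed_cu - dlf_cu) / self_managed_cu * 100


# Values from the adaptive bucketing test: 237 CU (DLF) vs 482 CU (self-managed).
print(f"{reduction_pct(237, 482):.1f}%")  # prints "50.8%"
```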

Deletion Vectors Optimization

Test Motivation

In Partial-Update scenarios with Deletion Vectors (DV) enabled, data merging often encounters performance bottlenecks. DLF applies kernel-level optimizations for this scenario to improve Compaction efficiency during high-frequency updates.
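
As background, the general idea behind a deletion vector can be sketched as a per-file set of row positions that readers skip. The toy model below only illustrates that concept. It does not reflect Paimon's or DLF's on-disk format or lookup implementation, and the `DataFile` class and `upsert` helper are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    """A toy immutable data file plus its deletion vector."""
    rows: list
    deletion_vector: set = field(default_factory=set)  # stands in for a bitmap

    def mark_deleted(self, position: int) -> None:
        if not 0 <= position < len(self.rows):
            raise IndexError("position outside the file")
        self.deletion_vector.add(position)

    def read(self) -> list:
        # Readers skip masked positions; the file itself is never rewritten.
        return [row for pos, row in enumerate(self.rows)
                if pos not in self.deletion_vector]


def upsert(old_file: DataFile, new_rows: list, key: str = "id") -> DataFile:
    """Write new rows to a new file and mask the old rows they supersede."""
    position_by_key = {row[key]: pos for pos, row in enumerate(old_file.rows)}
    for row in new_rows:
        pos = position_by_key.get(row[key])  # the lookup step
        if pos is not None:
            old_file.mark_deleted(pos)
    return DataFile(rows=list(new_rows))


old = DataFile(rows=[{"id": 1, "v": "a"}, {"id": 2, "v": "b"}])
new = upsert(old, [{"id": 2, "v": "B"}])
print(old.read() + new.read())
```

In this model the cost of each update is dominated by the lookup that finds the superseded row, which is why lookup efficiency matters under high-frequency updates.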

Test Plan

  1. Table schema design

    Create a Paimon table that supports high-frequency updates, emphasizing Lookup File processing efficiency.

    -- DLF table
    CREATE TABLE ...
    WITH (
      'deletion-vectors.enabled' = 'true',
      'merge-engine' = 'partial-update'
    );
    
    -- Self-managed Paimon table
    CREATE TABLE ... 
    WITH (
      'bucket' = '1024',
      'write-only' = 'true',
      'deletion-vectors.enabled' = 'true',
      'merge-engine' = 'partial-update'
    );
  2. Test workflow

    • Load injection: Simulate 100,000 mixed read-write operations per second.

    • Continuous monitoring: Record memory usage, garbage collection (GC) frequency, and system latency during each Compaction epoch.

Performance Evaluation

| Metric Dimension | DLF Compaction | Self-managed Paimon Compaction |
| --- | --- | --- |
| Compaction CU Consumption | 41 CU | 102 CU |

DLF optimizes Lookup File processing efficiency at the kernel level for Deletion Vectors mode. Test results show that under equal throughput pressure, DLF consumes 41 CU against 102 CU for the self-managed cluster, about 40% of its compute resources.

Dynamic Resource Elasticity Optimization

Test Motivation

Service traffic follows a peak-and-trough pattern. Self-managed Compaction jobs typically require over-provisioned compute resources to handle peak loads, which causes waste during low-traffic periods or shortages during peaks. DLF automatically adjusts CU consumption based on real-time data volume, delivering true cloud-native elasticity.

Test Plan

  1. Table schema design

    -- DLF table
    CREATE TABLE perf_rest.pk_elastic.t (
     ...
     PRIMARY KEY (`id`,`item_id`) NOT ENFORCED
    ) WITH (
     'deletion-vectors.enabled' = 'true'
    );
    -- Self-managed Paimon table
    CREATE TABLE perf_fs.pk_elastic.t (
     ...
     PRIMARY KEY (`id`,`item_id`) NOT ENFORCED
    ) WITH (
     'bucket' = '500',
     'write-only' = 'true',
     'deletion-vectors.enabled' = 'true'
    );
  2. Test workflow

    • Baseline data: Preload 500 GB of data.

    • Dynamic traffic simulation: Write 10 million rows per minute for 20 minutes (peak), then 250,000 rows per minute for 40 minutes (off-peak).

    • Resource configuration: Self-managed jobs use a fixed CU count sized for peak demand. DLF uses automatic elastic scaling.

Performance Evaluation

| Metric Dimension | DLF Compaction | Self-managed Paimon Compaction |
| --- | --- | --- |
| Average Compaction CU Consumption | 135 CU | 400 CU |

DLF averages 135 CU against the fixed 400 CU of the self-managed job, about 66% lower, and improves overall resource utilization through adaptive resource adjustment:

  • Elastic scaling: Automatically releases compute resources when data traffic drops, greatly reducing average CU consumption.

  • Fully managed: Eliminates operational overhead from manually adjusting job parallelism in response to traffic fluctuations.
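
As a rough sanity check, the sketch below computes the time-weighted average write rate of the simulated traffic and compares it with the ratio of the measured CU values. It does not assume that CU consumption scales linearly with write rate. It only shows that both ratios fall in the same range. The function name `time_weighted_average` is introduced for this example.

```python
def time_weighted_average(phases):
    """Average rate over phases given as (rate_per_minute, minutes) pairs."""
    total_minutes = sum(minutes for _, minutes in phases)
    if total_minutes <= 0:
        raise ValueError("total duration must be positive")
    return sum(rate * minutes for rate, minutes in phases) / total_minutes


PEAK_RATE = 10_000_000    # rows per minute for 20 minutes
OFF_PEAK_RATE = 250_000   # rows per minute for 40 minutes

avg_rate = time_weighted_average([(PEAK_RATE, 20), (OFF_PEAK_RATE, 40)])
load_fraction = avg_rate / PEAK_RATE   # average load relative to peak
cu_fraction = 135 / 400                # measured DLF CU relative to fixed CU

print(f"average write rate: {avg_rate:,.0f} rows/min")
print(f"average load vs peak: {load_fraction:.1%}")
print(f"DLF CU vs fixed CU:   {cu_fraction:.1%}")
```

The average load is 35% of the peak, and DLF's average CU consumption is about 34% of the peak-sized allocation, which is consistent with resources being released during the off-peak window.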