All Products
Search
Document Center

Data Lake Formation:Incremental Clustering

Last Updated:Mar 26, 2026

Incremental Clustering is a data layout optimization feature for Paimon append-only tables in Data Lake Formation (DLF). When enabled, DLF automatically schedules periodic runs that rewrite data files to narrow the statistical range of clustering columns—letting the query engine skip irrelevant files and scan less data overall. This continuously optimizes the physical data structure with low resource overhead.

Use cases

Incremental Clustering delivers the most benefit in the following situations:

  • Continuous data ingestion with filter queries: The table receives a steady stream of new data, and queries frequently filter on specific columns such as event_time or user_id. Each clustering run tightens the data layout so filters become more selective over time.

  • High-cardinality filter columns: Queries filter on columns with many distinct values (for example, user IDs or device IDs). Without clustering, the query engine must scan many files to find matching rows. After clustering, files are physically sorted so the engine can skip most of them.

  • Accumulating small files: The ingestion pipeline writes many small files. Incremental Clustering merges these into files close to the target-file-size in a single pass, reducing file count and improving read throughput.

  • Historical partitions with stale layout: Partitions that stopped receiving new data still have a poorly sorted layout from early ingestion. Enabling full clustering for historical partitions reorganizes them when resources allow, without affecting active ingestion.

How it works

Once enabled, DLF automatically schedules Incremental Clustering runs. Each run does the following in one pass:

  1. Selects and rewrites data files. During each run, the system selects some data files for rewriting. The rewritten files have a more compact statistical range for the specified clustering columns, such as a smaller range between minimum and maximum values. This allows the query engine to skip irrelevant data files and significantly reduces the amount of data scanned. The performance improvement is more significant for larger datasets.

  2. Merges small files. The rewrite follows the target-file-size setting, so multiple small files are merged into larger files as part of the same operation.

Partitions that have been idle longer than a configurable threshold are promoted to full clustering, which reorganizes all files in the partition rather than just new arrivals. Full clustering runs at lower priority than incremental runs and only when sufficient resources are available.

Prerequisites

Before you begin, ensure that:

  • The table is a Paimon append-only table

  • bucket = -1 (non-bucketed table)

  • row-tracking.enabled is not set to true

  • data-evolution.enabled is not set to true

Important

If a single batch of incremental data exceeds 256 GB, resource limits may affect clustering performance. Contact Alibaba Cloud technical support for configuration recommendations.

Enable Incremental Clustering

The DLF console does not currently provide a UI for this feature. Configure it by running ALTER TABLE with the following table properties:

ParameterRequiredTypeDefaultDescription
clustering.incrementalYesBooleanfalseEnables Incremental Clustering.
clustering.columnsYesStringComma-separated list of clustering columns, for example 'event_time,user_id'. Select the leftmost filter columns that are frequently used in queries. Do not include partition columns.
clustering.strategyNoStringAutoThe sorting algorithm. If not set, DLF selects one based on column count: order for 1 column, zorder for 2–4 columns, and hilbert (Hilbert curve) for 5 or more columns.

Example:

ALTER TABLE catalog.db.my_table SET (
  'clustering.incremental' = 'true',
  'clustering.columns' = 'event_time,user_id',
  'clustering.strategy' = 'zorder'
);

Adjust the scheduling interval

By default, DLF schedules an Incremental Clustering run every 1 hour. Shorten the interval to optimize freshly ingested data faster; increase it to reduce system load.

ParameterRequiredTypeDefaultDescription
morax.compact.check-intervalNoDuration1hHow often the scheduler checks for partitions to cluster. Minimum value: 5min.

Example:

ALTER TABLE catalog.db.my_table SET ('morax.compact.check-interval' = '30min');

Enable full clustering for historical partitions (optional)

A partition is considered historical when it has not received new data for longer than the threshold you configure. Full clustering reorganizes all files in historical partitions, not just those added since the last run.

ParameterRequiredTypeDefaultDescription
clustering.history-partition.idle-to-full-sortYesDurationHow long a partition must be idle before it is promoted to full clustering. Example: 3d.
clustering.history-partition.limitNoInteger5Maximum number of historical partitions processed in a single task run.
Important

Full clustering of historical partitions runs only when sufficient system resources are available and at lower priority than incremental clustering of active partitions.

Example:

ALTER TABLE catalog.db.my_table SET (
  'clustering.history-partition.idle-to-full-sort' = '3d',
  'clustering.history-partition.limit' = '3'
);

Verify clustering status

Paimon data files have a level property:

  • Newly written files start at level = 0.

  • Files that have been through a clustering run are promoted to level 1 through level 5.

Query the $files metadata table to confirm clustering has run on a partition:

SELECT *
FROM `catalog_name`.`database_name`.`table_name$files`
WHERE `partition` = 'your_partition_value';

If the result set contains rows where level > 0, clustering has been applied to that partition. To measure the performance improvement, compare query execution times or inspect query plans before and after enabling clustering.