Frequent writes to delta tables generate many small files, which reduce I/O throughput and increase storage pressure. Compaction merges all data files into a consolidated set of base files, keeping only the latest version of each row, so queries scan fewer files and run faster.
Overview
Delta tables support two data merging methods:
| Method | What it does | Trigger |
|---|---|---|
| Clustering | Merges only delta files into a larger file without changing data content. Delta files contain data from committed operations. | Automatic — the system runs clustering periodically based on file size and count. No manual action required. |
| Compaction | Merges all data files into a batch of base files. For rows with the same primary key, only the latest row is kept. Historical row versions and system column data are not retained. | Manual — run ALTER TABLE ... COMPACT MAJOR. |
Prerequisites
Before you begin, ensure that you have:
- A delta table, created with `tblproperties ("transactional"="true")`
- The session property `odps.merge.task.mode=service` set in your session
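Taken together, a minimal setup might look like the following sketch (the table name `mf_orders` and its columns are illustrative, not part of any existing schema):

```sql
-- Create a delta (transactional) table with a primary key.
CREATE TABLE mf_orders (
  id  BIGINT NOT NULL PRIMARY KEY,
  amt BIGINT NOT NULL
) TBLPROPERTIES ("transactional"="true");

-- Enable service-mode merge tasks for this session.
SET odps.merge.task.mode=service;
```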
Syntax
ALTER TABLE <table_name>
[PARTITION (<partition_key> = '<partition_value>' [, ...])]
COMPACT MAJOR;
Parameters
| Parameter | Required | Description |
|---|---|---|
| table_name | Yes | Name of the delta table to compact. |
| partition_key | No | Partition key name. Specify to compact a single partition. |
| partition_value | No | Value of the partition key. |
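For example, compaction can target a whole table or a single partition (the table and partition names below are illustrative):

```sql
-- Compact an entire delta table.
ALTER TABLE mf_orders COMPACT MAJOR;

-- Compact only one partition of a partitioned delta table.
ALTER TABLE mf_events PARTITION (dd = '01', hh = '01') COMPACT MAJOR;
```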
Usage notes
Session property required: Set odps.merge.task.mode=service before running compaction:
SET odps.merge.task.mode=service;
Storage cost: Compaction generates new base files before removing old ones, so storage usage increases temporarily. Schedule compaction at a frequency that suits your business requirements.
Write amplification and conflicts: The compaction policy is designed to prevent severe write amplification and conflict-caused failures when compaction runs frequently.
Billing
| Billing method | How compaction is charged |
|---|---|
| Pay-as-you-go | Amount of data scanned × 1 × unit price |
| Subscription | Deducted from the subscription computing quota |
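As a rough illustration of the pay-as-you-go formula, assuming a hypothetical unit price of $0.03 per GB (actual prices depend on your region and contract):

```
cost = data scanned × 1 × unit price
     = 100 GB × 1 × $0.03/GB
     = $3.00
```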
Example
The following example creates a partitioned delta table, inserts data across three batches, runs compaction, and then queries historical data using time travel.
-- Create a delta table with a primary key and two partition columns.
CREATE TABLE mf_dt (pk BIGINT NOT NULL PRIMARY KEY, val BIGINT NOT NULL)
PARTITIONED BY (dd STRING, hh STRING)
TBLPROPERTIES ("transactional"="true");
-- Insert data in three batches to simulate incremental writes.
INSERT INTO TABLE mf_dt PARTITION(dd='01', hh='01') VALUES (1, 1), (2, 2);
INSERT INTO TABLE mf_dt PARTITION(dd='01', hh='01') VALUES (2, 20), (3, 3);
INSERT INTO TABLE mf_dt PARTITION(dd='01', hh='01') VALUES (3, 30), (4, 4);
-- Run compaction on the partition.
SET odps.merge.task.mode=service;
ALTER TABLE mf_dt PARTITION(dd='01', hh='01') COMPACT MAJOR;
-- Query the latest snapshot after compaction.
SELECT * FROM mf_dt TIMESTAMP AS OF get_latest_timestamp('mf_dt')
WHERE dd='01' AND hh='01';
-- Query the second-to-latest snapshot (time travel still works after compaction).
SELECT * FROM mf_dt TIMESTAMP AS OF get_latest_timestamp('mf_dt', 2)
WHERE dd='01' AND hh='01';