Manage storage growth in Apache Paimon tables by controlling how long snapshots are retained, automatically expiring time-partitioned data, and removing orphan files left by failed jobs.
Apache Paimon tables are supported only in Realtime Compute for Apache Flink that uses Ververica Runtime (VVR) 8.0.5 or later.
How it works
Apache Paimon tables accumulate three categories of storage that require periodic cleanup:
Snapshots: Every write operation creates a new snapshot. Without expiration rules, snapshots accumulate indefinitely, increasing storage costs and the time needed to read table metadata.
Partitions: For tables partitioned by date or time, historical partitions that are no longer needed continue to occupy storage unless you configure automatic expiration.
Orphan files: When a job fails or restarts unexpectedly, uncommitted temporary files may remain in the table directory. These files are not tracked in any snapshot and cannot be removed through normal snapshot expiration.
Each category requires a different cleanup approach, described in the sections below.
Expire snapshots
How snapshot expiration works
Three parameters control when snapshots are expired:
| Parameter | Type | Default | Description |
|---|---|---|---|
| snapshot.num-retained.min | Integer | 10 | Minimum number of snapshots to keep |
| snapshot.num-retained.max | Integer | 2,147,483,647 | Maximum number of snapshots to keep |
| snapshot.time-retained | Duration | 1h | Maximum age of the oldest retained snapshot |
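For instance, these retention parameters can be set in the WITH clause when the table is created. The following sketch is illustrative; the table name and columns are assumptions, not from this document:

```sql
-- Illustrative sketch: a Paimon table that keeps at least 5 and at most 100
-- snapshots, and never expires a snapshot younger than 2 hours.
-- The table name `ods_orders` and its columns are hypothetical.
CREATE TABLE ods_orders (
  order_id BIGINT,
  amount DECIMAL(10, 2),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'snapshot.num-retained.min' = '5',
  'snapshot.num-retained.max' = '100',
  'snapshot.time-retained' = '2h'
);
```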
Expiration is triggered when either of these conditions is met:
- The snapshot count exceeds snapshot.num-retained.min and the oldest snapshot is older than snapshot.time-retained.
- The snapshot count exceeds snapshot.num-retained.max.
The following example shows how these rules interact, using snapshot.num-retained.min = 2 and snapshot.time-retained = 1h:
| New snapshot | All snapshots after expiration | Why |
|---|---|---|
| (snapshot-1, 10:00) | (snapshot-1, 10:00) | Only 1 snapshot — count ≤ min, no expiration |
| (snapshot-2, 10:20) | (snapshot-1, 10:00) (snapshot-2, 10:20) | 2 snapshots — count = min, no expiration |
| (snapshot-3, 10:40) | (snapshot-1, 10:00) (snapshot-2, 10:20) (snapshot-3, 10:40) | 3 snapshots > min, but snapshot-1 is only 40 min old — not yet expired |
| (snapshot-4, 11:20) | (snapshot-3, 10:40) (snapshot-4, 11:20) | snapshot-1 is now 1h20min old — count > min AND age > time-retained, so snapshot-1 and snapshot-2 are expired |
Snapshots are used to restore historical data. Deleting a snapshot too early can cause two types of failures:
- Batch queries fail mid-read. A long-running batch query reads the snapshot taken at query start. If that snapshot expires before the query finishes, the query fails with a `File xxx not found, Possible causes` error.
- Streaming jobs fail to restart. When a streaming job restarts, it resumes from the last recorded snapshot. If that snapshot has expired, the job cannot restart.
Set snapshot.time-retained to at least the duration of your longest-running batch query or streaming job restart window. For streaming jobs with tight retention settings, use Consumer Id to prevent the snapshot a consumer depends on from being expired.
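A consumer ID can be attached to a streaming read with a dynamic-options SQL hint so that the snapshots the consumer still needs are protected from expiration. This is a sketch; the table name and consumer ID below are hypothetical:

```sql
-- Sketch: register a consumer ID for a streaming reader. Snapshots not yet
-- consumed by this consumer are kept even if retention rules would otherwise
-- expire them. `my_paimon_table` and 'my-streaming-job' are assumptions.
SELECT * FROM my_paimon_table /*+ OPTIONS('consumer-id' = 'my-streaming-job') */;
```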
Set snapshot retention parameters
To modify snapshot retention, run an ALTER TABLE statement or use SQL hints in a draft for data writing. For detailed steps, see the "Modify the schema of an Apache Paimon table" section in Manage Apache Paimon catalogs.
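As a minimal sketch, assuming a table named `my_paimon_table`, the two approaches look like this:

```sql
-- Change the table's retention settings persistently with ALTER TABLE ...
ALTER TABLE my_paimon_table SET (
  'snapshot.time-retained' = '2h',
  'snapshot.num-retained.min' = '5'
);

-- ... or override them for a single job with a dynamic-options hint
-- on the sink table. `source_table` is hypothetical.
INSERT INTO my_paimon_table /*+ OPTIONS('snapshot.time-retained' = '2h') */
SELECT * FROM source_table;
```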
Expire partitions
For tables partitioned by time, configure partition expiration to automatically delete historical partitions and release storage.
Data files in a partition are completely deleted only after the snapshots that contain the related partition expiration events expire.
How partition expiration works
Three parameters control partition expiration:
| Parameter | Description | Remarks |
|---|---|---|
| partition.expiration-time | Validity period of a partition | Duration format, for example, 12h or 7d |
| partition.timestamp-pattern | Pattern for converting a partition value to a time string | Reference each partition key column with a $ prefix |
| partition.timestamp-formatter | Pattern for converting a time string to a timestamp | Defaults to yyyy-MM-dd HH:mm:ss or yyyy-MM-dd; accepts any Java DateTimeFormatter-compatible pattern |
A partition is deleted when the difference between the current system time and the timestamp converted from the partition value exceeds the value of partition.expiration-time.
Convert partition values to timestamps
The system converts partition values to timestamps using a two-step process:
1. partition.timestamp-pattern converts a partition value to a time string, with each partition key column referenced as $<column-name>.
2. partition.timestamp-formatter converts the time string to a timestamp.
Examples:
Single partition key dt with values like dt=20240308:

```sql
'partition.timestamp-pattern' = '$dt',
'partition.timestamp-formatter' = 'yyyyMMdd'
```

Three partition keys year, month, day with values like year=2023,month=04,day=21:

```sql
'partition.timestamp-pattern' = '$year-$month-$day'
-- No formatter needed: the default pattern yyyy-MM-dd matches the resulting string 2023-04-21
```

Four partition keys year, month, day, hour with values like year=2023,month=04,day=21,hour=17:

```sql
'partition.timestamp-pattern' = '$year-$month-$day $hour:00:00'
-- No formatter needed: the default pattern yyyy-MM-dd HH:mm:ss matches the resulting string 2023-04-21 17:00:00
```

Set partition expiration parameters
To configure partition expiration, run an ALTER TABLE statement or use SQL hints in a draft for data writing. For detailed steps, see the "Modify the schema of an Apache Paimon table" section in Manage Apache Paimon catalogs.
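As a minimal sketch, assuming a table named `my_paimon_table` partitioned by a single `dt` column with values such as 20240308, the three parameters can be set together:

```sql
-- Sketch: expire partitions older than 7 days. The table name is
-- hypothetical; the pattern and formatter match a dt=yyyyMMdd partition key.
ALTER TABLE my_paimon_table SET (
  'partition.expiration-time' = '7d',
  'partition.timestamp-pattern' = '$dt',
  'partition.timestamp-formatter' = 'yyyyMMdd'
);
```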
Remove orphan files
When a job fails or restarts unexpectedly, uncommitted temporary files may remain in the table directory. These orphan files are not associated with any snapshot and are not removed by snapshot expiration. Use the remove_orphan_files system procedure to delete them.
By default, only orphan files older than one day are deleted. Specify an optional timestamp to delete orphan files created before a specific point in time.
Delete orphan files
Log on to the Realtime Compute for Apache Flink console and create a script. For more information, see Create a script.
In the script editor, enter the following SQL statement. Replace the placeholders with your actual values.
| Placeholder | Description |
|---|---|
| <catalog-name> | Name of the Apache Paimon catalog |
| <database-name> | Name of the database where the table resides |
| <table-name> | Name of the Apache Paimon table |

```sql
CALL `<catalog-name>`.sys.remove_orphan_files('<database-name>.<table-name>');
```

To delete orphan files created before a specific time, add a timestamp as the second argument:

```sql
CALL `mycat`.sys.remove_orphan_files('mydb.mytbl', '2023-10-31 12:00:00');
```

This example deletes orphan files in mycat.mydb.mytbl that were created no later than 12:00:00 on October 31, 2023.

Select the SQL statement you entered, then click Run in the upper-left corner of the script editor. After the operation completes, the Results tab shows the total number of deleted files.
Next steps
For common optimization methods for Apache Paimon primary key tables, see Performance optimization.
If a streaming job fails with the `File xxx not found, Possible causes` error when reading from an Apache Paimon table, the snapshot the job depends on may have expired. For troubleshooting steps, see the "What do I do if the 'File xxx not found, Possible causes' error message appears when deployments are read from an Apache Paimon table?" section in FAQ about upstream and downstream storage.