Manage storage growth in Apache Paimon tables by controlling how long snapshots are retained, automatically expiring time-partitioned data, and removing orphan files left by failed jobs.
Apache Paimon tables are supported only in Realtime Compute for Apache Flink that uses Ververica Runtime (VVR) 8.0.5 or later.
How it works
Apache Paimon tables accumulate three categories of storage that require periodic cleanup:
Snapshots: Every write operation creates a new snapshot. Without expiration rules, snapshots accumulate indefinitely, increasing storage costs and the time needed to read table metadata.
Partitions: For tables partitioned by date or time, historical partitions that are no longer needed continue to occupy storage unless you configure automatic expiration.
Orphan files: When a job fails or restarts unexpectedly, uncommitted temporary files may remain in the table directory. These files are not tracked in any snapshot and cannot be removed through normal snapshot expiration.
Each category requires a different cleanup approach, described in the sections below.
Expire snapshots
How snapshot expiration works
Three parameters control when snapshots are expired:
| Parameter | Type | Default | Description |
|---|---|---|---|
| snapshot.num-retained.min | Integer | 10 | Minimum number of snapshots to keep |
| snapshot.num-retained.max | Integer | 2,147,483,647 | Maximum number of snapshots to keep |
| snapshot.time-retained | Duration | 1h | Maximum age of the oldest retained snapshot |
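For instance, these retention parameters can be set in the WITH clause when the table is created. The following sketch is illustrative; the table name and columns are assumptions, not from this document:

```sql
-- Illustrative sketch: a Paimon table that keeps at least 5 and at most 100
-- snapshots, and never expires a snapshot younger than 2 hours.
-- The table name `ods_orders` and its columns are hypothetical.
CREATE TABLE ods_orders (
  order_id BIGINT,
  amount DECIMAL(10, 2),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'snapshot.num-retained.min' = '5',
  'snapshot.num-retained.max' = '100',
  'snapshot.time-retained' = '2h'
);
```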
Expiration is triggered when either of these conditions is met:
- The snapshot count exceeds snapshot.num-retained.min and the oldest snapshot is older than snapshot.time-retained.
- The snapshot count exceeds snapshot.num-retained.max.
The following example shows how these rules interact, using snapshot.num-retained.min = 2 and snapshot.time-retained = 1h:
| New snapshot | All snapshots after expiration | Why |
|---|---|---|
| (snapshot-1, 10:00) | (snapshot-1, 10:00) | Only 1 snapshot — count ≤ min, no expiration |
| (snapshot-2, 10:20) | (snapshot-1, 10:00) (snapshot-2, 10:20) | 2 snapshots — count = min, no expiration |
| (snapshot-3, 10:40) | (snapshot-1, 10:00) (snapshot-2, 10:20) (snapshot-3, 10:40) | 3 snapshots > min, but snapshot-1 is only 40 min old — not yet expired |
| (snapshot-4, 11:20) | (snapshot-3, 10:40) (snapshot-4, 11:20) | snapshot-1 is now 1h20min old — count > min AND age > time-retained, so snapshot-1 and snapshot-2 are expired |
Snapshots are used to restore historical data. Deleting a snapshot too early can cause two types of failures:
- Batch queries fail mid-read. A long-running batch query reads the snapshot taken at query start. If that snapshot expires before the query finishes, the query fails with a `File xxx not found, Possible causes` error.
- Streaming jobs fail to restart. When a streaming job restarts, it resumes from the last recorded snapshot. If that snapshot has expired, the job cannot restart.
Set snapshot.time-retained to at least the duration of your longest-running batch query or streaming job restart window. For streaming jobs with tight retention settings, use Consumer Id to prevent the snapshot a consumer depends on from being expired.
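A consumer ID can be attached to a streaming read with a dynamic-options SQL hint so that the snapshots the consumer still needs are protected from expiration. This is a sketch; the table name and consumer ID below are hypothetical:

```sql
-- Sketch: register a consumer ID for a streaming reader. Snapshots not yet
-- consumed by this consumer are kept even if retention rules would otherwise
-- expire them. `my_paimon_table` and 'my-streaming-job' are assumptions.
SELECT * FROM my_paimon_table /*+ OPTIONS('consumer-id' = 'my-streaming-job') */;
```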
Set snapshot retention parameters
To modify snapshot retention, run an ALTER TABLE statement or use SQL hints in a draft for data writing. For detailed steps, see the "Modify the schema of an Apache Paimon table" section in Manage Apache Paimon catalogs.
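As a minimal sketch, assuming a table named `my_paimon_table`, the two approaches look like this:

```sql
-- Change the table's retention settings persistently with ALTER TABLE ...
ALTER TABLE my_paimon_table SET (
  'snapshot.time-retained' = '2h',
  'snapshot.num-retained.min' = '5'
);

-- ... or override them for a single job with a dynamic-options hint
-- on the sink table. `source_table` is hypothetical.
INSERT INTO my_paimon_table /*+ OPTIONS('snapshot.time-retained' = '2h') */
SELECT * FROM source_table;
```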
Expire partitions
For tables partitioned by time, configure partition expiration to automatically delete historical partitions and release storage.
Data files in a partition are completely deleted only after the snapshots that contain the related partition expiration events expire.
How partition expiration works
Three parameters control partition expiration:
| Parameter | Description | Remarks |
|---|---|---|
| partition.expiration-time | Validity period of a partition | Duration format, for example, 12h or 7d |
| partition.timestamp-pattern | Pattern for converting a partition value to a time string | Reference each partition key column with a $ prefix |
| partition.timestamp-formatter | Pattern for converting a time string to a timestamp | Defaults to yyyy-MM-dd HH:mm:ss or yyyy-MM-dd; accepts any Java DateTimeFormatter-compatible pattern |
A partition is deleted when the difference between the current system time and the timestamp converted from the partition value exceeds the value of partition.expiration-time.
Convert partition values to timestamps
The system converts partition values to timestamps using a two-step process:
1. partition.timestamp-pattern converts a partition value to a time string, with each partition key column referenced as $<column-name>.
2. partition.timestamp-formatter converts the time string to a timestamp.
Examples:
Single partition key dt with values like dt=20240308:

```sql
'partition.timestamp-pattern' = '$dt',
'partition.timestamp-formatter' = 'yyyyMMdd'
```

Three partition keys year, month, day with values like year=2023,month=04,day=21:

```sql
'partition.timestamp-pattern' = '$year-$month-$day'
-- No formatter needed: the default pattern yyyy-MM-dd matches the resulting string 2023-04-21
```

Four partition keys year, month, day, hour with values like year=2023,month=04,day=21,hour=17:

```sql
'partition.timestamp-pattern' = '$year-$month-$day $hour:00:00'
-- No formatter needed: the default pattern yyyy-MM-dd HH:mm:ss matches the resulting string 2023-04-21 17:00:00
```

Set partition expiration parameters
To configure partition expiration, run an ALTER TABLE statement or use SQL hints in a draft for data writing. For detailed steps, see the "Modify the schema of an Apache Paimon table" section in Manage Apache Paimon catalogs.
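As a minimal sketch, assuming a table named `my_paimon_table` partitioned by a single `dt` column with values such as 20240308, the three parameters can be set together:

```sql
-- Sketch: expire partitions older than 7 days. The table name is
-- hypothetical; the pattern and formatter match a dt=yyyyMMdd partition key.
ALTER TABLE my_paimon_table SET (
  'partition.expiration-time' = '7d',
  'partition.timestamp-pattern' = '$dt',
  'partition.timestamp-formatter' = 'yyyyMMdd'
);
```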
Remove orphan files
When a job fails or restarts unexpectedly, uncommitted temporary files may remain in the table directory. These orphan files are not associated with any snapshot and are not removed by snapshot expiration. Use the remove_orphan_files system procedure to delete them.
By default, only orphan files older than one day are deleted. Specify an optional timestamp to delete orphan files created before a specific point in time.
Delete orphan files
Log on to the Realtime Compute for Apache Flink console and create a script. For more information, see Create a script.
In the script editor, enter the following SQL statement. Replace the placeholders with your actual values.
| Placeholder | Description |
|---|---|
| <catalog-name> | Name of the Apache Paimon catalog |
| <database-name> | Name of the database where the table resides |
| <table-name> | Name of the Apache Paimon table |

```sql
CALL `<catalog-name>`.sys.remove_orphan_files('<database-name>.<table-name>');
```

To delete orphan files created before a specific time, add a timestamp as the second argument:

```sql
CALL `mycat`.sys.remove_orphan_files('mydb.mytbl', '2023-10-31 12:00:00');
```

This example deletes orphan files in mycat.mydb.mytbl that were created no later than 12:00:00 on October 31, 2023.

Select the SQL statement you entered, then click Run in the upper-left corner of the script editor. After the operation completes, the Results tab shows the total number of deleted files.
Next steps
For common optimization methods for Apache Paimon primary key tables, see Performance optimization.
If a streaming job fails with the `File xxx not found, Possible causes` error when reading from an Apache Paimon table, the snapshot the job depends on may have expired. For troubleshooting steps, see the "What do I do if the 'File xxx not found, Possible causes' error message appears when deployments are read from an Apache Paimon table?" section in FAQ about upstream and downstream storage.