Data Lake Formation (DLF) storage optimization provides features such as table-level adaptive compaction, expired snapshot cleanup, partition lifecycle management, and orphan file cleanup. These features simplify the use and maintenance of Paimon tables and improve computing and storage efficiency. This topic describes the intelligent storage optimization policies that DLF runs in the background and their execution mechanisms.
Computing resources used for storage optimization are free of charge during the invitational preview period.
Storage optimization strategies
Policy type | Description | DLF execution mechanism
Compaction | Compaction merges small files into larger ones. Fewer files mean lower metadata management overhead and lower file lookup costs during queries, which improves the query performance of Paimon tables. | DLF automatically triggers compaction when you commit data writes.
Expired snapshot cleanup | As long as a snapshot exists, the data files it references cannot be deleted, so the historical state of the data remains readable. As new snapshots are created, the storage space consumed by historical data grows. Old snapshots must be removed to release the space occupied by inactive historical data. | DLF automatically triggers snapshot cleanup when a DLF storage optimization job runs. The default expiration time for snapshots is 1 hour. You can adjust the expiration time using Paimon table parameters. For more information, see Clean up expired data.
Partition lifecycle management | Many business scenarios require access only to recent data. In these cases, you can partition data by time and set a partition expiration time to automatically delete old partitions and free up storage space. You can also configure storage tiering to move partition data that is accessed less frequently from high-performance storage, such as Standard, to low-cost storage, such as Infrequent Access, Archive, or Cold Archive. This reduces storage costs while still meeting business needs. | You can configure the expiration time using Paimon table parameters. For more information, see Set partition expiration time. After you configure the parameters, expiration is automatically triggered when a DLF storage optimization job runs. You can also use the Intelligent storage tiering feature to automatically move partition data that meets the specified criteria to a storage class, such as Standard, Infrequent Access, Archive, or Cold Archive, or manually change the storage class on the table details page. On the Storage overview page, you can view the storage tiering distribution for data catalogs, databases, and tables.
Orphan file cleanup | Because of job errors, restarts, or other issues, some uncommitted temporary files may remain in the Paimon table directory. These orphan files are not referenced by any snapshot, so the snapshot expiration mechanism cannot delete them. They must be cleaned up periodically. | The default retention period for orphan files is 7 days. Orphan files older than this period are considered expired and are automatically cleaned up. DLF triggers a cleanup task every 7 days.
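The idea behind small file compaction can be illustrated with a simplified sketch. This is not how DLF or Paimon actually schedules compaction; it only shows why grouping small files toward a target size reduces the file count. The function name, the 128 MB target, and the 32 MB "small file" threshold are hypothetical values chosen for illustration.

```python
def plan_compaction(file_sizes_mb, target_mb=128, small_mb=32):
    """Group files smaller than small_mb into batches of at most about target_mb."""
    small = sorted(size for size in file_sizes_mb if size < small_mb)
    batches, current, total = [], [], 0
    for size in small:
        # Close the current batch when adding this file would exceed the target.
        if current and total + size > target_mb:
            batches.append(current)
            current, total = [], 0
        current.append(size)
        total += size
    # A single leftover file gains nothing from being rewritten alone.
    if len(current) > 1:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Four small files become one merged file; the 200 MB file is left alone.
    print(plan_compaction([10, 200, 20, 30, 5]))
```

Each returned batch stands for one merged output file, so the number of files a query has to open drops from the number of small inputs to the number of batches.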
Enable or disable intelligent storage optimization
The Storage Optimization tab is displayed only for Paimon tables.
Log on to the DLF console.
In the left navigation menu, click Catalogs, and click the name of your catalog.
In the Database section, click your database name.
On the Tables tab, click your table name.
Select the Storage Optimization tab. The Intelligent Storage Optimization switch is enabled by default. To disable it, click the switch.
View and configure storage optimization strategies
Compaction
On the Storage Optimization tab, select the Compaction subtab. You can view the execution details of small file compaction, such as the execution status and the data latency for pending compaction tasks.
Expired snapshot cleanup
On the Storage Optimization tab, select the Snapshot Expiration subtab. You can configure snapshot cleanup rules and view the results.
Configure snapshot cleanup rules.
Click Modify and set Snapshot Retention Period (default: 1 hour), and save the setting.
View snapshot cleanup results.
Total Snapshots: Displays the total number of snapshots that the table currently retains.
Earliest Snapshot: Displays the details of the earliest table snapshot, including the snapshot ID, commit time, commit type, total rows, and added rows.
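The effect of the Snapshot Retention Period can be sketched as a simple time-based filter. The following is an illustration under stated assumptions, not the DLF implementation: the function name and the input shape (a mapping of snapshot ID to commit time) are hypothetical, and always keeping the latest snapshot is an assumption of this sketch. In open-source Paimon the corresponding table option is snapshot.time-retained; check Clean up expired data for the keys that DLF supports.

```python
from datetime import datetime, timedelta


def expired_snapshots(snapshots, now, retention=timedelta(hours=1)):
    """Return the IDs of snapshots committed before now - retention.

    snapshots maps snapshot ID -> commit time. The latest snapshot is
    never returned, so the table always stays readable (sketch assumption).
    """
    if not snapshots:
        return []
    latest_id = max(snapshots)
    cutoff = now - retention
    return sorted(
        sid for sid, committed in snapshots.items()
        if committed < cutoff and sid != latest_id
    )


if __name__ == "__main__":
    history = {
        1: datetime(2024, 1, 1, 10, 0),
        2: datetime(2024, 1, 1, 11, 30),
        3: datetime(2024, 1, 1, 11, 50),
    }
    print(expired_snapshots(history, now=datetime(2024, 1, 1, 12, 0)))
```

With a longer retention period, fewer snapshots fall before the cutoff, so more history stays readable at the cost of more storage.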
Partition lifecycle management
On the Storage Optimization tab, select the Partition LifeCycle subtab. You can configure partition cleanup rules, view partition cleanup results, and configure storage tiering.
Configure partition cleanup rules
Click the Expired Partition Cleanup switch to enable partition cleanup. After you enable partition cleanup, configure the following rules as needed. Click OK to complete the configuration.
Configuration item
Description
Expiration Policy
You can select one of the following expiration policies:
Last Access-based Partition Expiration: Determines expiration based on the last access time of the partition data.
Value-based Partition Expiration: Determines expiration based on the partition value (values-time). You can configure the partition timestamp format and partition field pattern.
Timestamp Format: Corresponds to the table configuration item partition.timestamp-formatter. You can configure formats such as yyyy-MM-dd, yyyyMMdd, dd/MM/yyyy, and dd.MM.yyyy.
Timestamp Pattern: Corresponds to the table configuration item partition.timestamp-pattern. By default, the first partition field is used. You can configure patterns such as $dt or $year-$month-$day.
Modified: Determines expiration based on the last update time of the partition data at the finest granularity.
Partition Retention Period
The maximum value is 999,999 days. The start time for the retention period is determined by the selected expiration policy.
(Optional) Click Modify next to Cleanup Rule Settings to change the settings.
Note: If you want to retain partitions permanently, do not configure partition expiration rules. By default, the system does not automatically clean up partition data.
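The interaction between Timestamp Pattern, Timestamp Format, and Partition Retention Period in value-based expiration can be sketched as follows. This is an illustration only: the helper names are hypothetical, the token mapping covers only the year, month, and day tokens mentioned in this topic, and the strict "older than the retention period" comparison is an assumption rather than documented DLF behavior.

```python
from datetime import datetime, timedelta
from string import Template

# Minimal mapping from the Java-style tokens used above to strptime codes.
_TOKENS = (("yyyy", "%Y"), ("MM", "%m"), ("dd", "%d"))


def to_strptime(java_format):
    """Translate a format such as yyyy-MM-dd into %Y-%m-%d."""
    for token, code in _TOKENS:
        java_format = java_format.replace(token, code)
    return java_format


def partition_time(partition, pattern, java_format):
    """Fill the pattern ($dt, $year-$month-$day, ...) with partition values and parse it."""
    text = Template(pattern).substitute(partition)
    return datetime.strptime(text, to_strptime(java_format))


def is_expired(partition, pattern, java_format, retention_days, now):
    """True if the partition's own timestamp is older than the retention period."""
    age = now - partition_time(partition, pattern, java_format)
    return age > timedelta(days=retention_days)


if __name__ == "__main__":
    now = datetime(2024, 3, 1)
    print(is_expired({"dt": "2024-01-01"}, "$dt", "yyyy-MM-dd", 30, now))
    print(is_expired({"year": "2024", "month": "02", "day": "20"},
                     "$year-$month-$day", "yyyy-MM-dd", 30, now))
```

The pattern decides which partition fields form the timestamp string, and the format decides how that string is parsed, so the two settings must agree with each other.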
View partition cleanup results
Click View Partitions to view the partition list for the current table. The list includes the partition name, number of rows (physical), number of referenced files, total file size, creator, storage class, last updater, creation time, last update time, and operations.
Configure storage tiering
Configuration item
Description
Intelligent Tiering
When enabled, the system automatically tiers the storage for all tables in the catalog based on the lifecycle rules you configure. Specify the tiering policy and rules as needed.
Note: If intelligent tiering is enabled at the catalog level, it is also enabled by default for tables (inherited from the catalog). You can modify the configuration at the table level. If you modify the rules at the table level, the "inherited from Catalog" status is no longer displayed.
If intelligent tiering is not enabled at the catalog level, you can still enable and modify it at the table level.
Tiering Strategy
Last Access Time: The rule is evaluated based on the last access time of the table or partition data.
Last Update Time: The rule is evaluated based on the last update time of the table or partition data.
Tiering Rule
The minimum storage duration requirements vary for different storage classes.
You can configure the following tiering rules:
Transition to Infrequent Access
Inactivity Threshold: Custom. The default is 30 days.
Data is automatically transitioned to Infrequent Access (IA) storage when the time since its last access exceeds this number of days. Data in IA storage can still be accessed by the compute engine, but with reduced performance.
Convert to Standard Storage Upon Access: If you select this option, the system automatically transitions the partition or non-partitioned table to the Standard storage class when it is accessed.
Note: This feature is supported only when the tiering strategy is based on "Last Access Time".
Transition to Archive
Inactivity Threshold: Custom. The default is 60 days.
Data is automatically transitioned to Archive storage when the time since its last access exceeds this number of days. Data in Archive storage cannot be accessed by the compute engine.
Convert to Standard Storage Upon Access: If you select this option, the system automatically transitions the partition or non-partitioned table to the Standard storage class when it is accessed.
Note: This feature is supported only when the tiering strategy is based on "Last Access Time".
Transition to Cold Archive
Inactivity Threshold: Custom. The default is 180 days.
Data is automatically transitioned to Cold Archive storage when the time since its last access exceeds this number of days. Data in Cold Archive storage cannot be accessed by the compute engine.
Note: In addition to using the Intelligent storage tiering feature, you can manually change the storage class on the table details page. You can also view the storage tiering distribution for data catalogs, databases, and tables on the Storage overview page.
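How the three inactivity thresholds combine can be sketched as a lookup that picks the coldest storage class whose threshold has been exceeded. This is a simplified model, not the DLF implementation: the function name is hypothetical, the defaults are the ones listed above, and the sketch ignores the minimum storage duration of each class and the Convert to Standard Storage Upon Access option.

```python
def target_storage_class(idle_days, ia_days=30, archive_days=60, cold_archive_days=180):
    """Return the storage class for data that has been inactive for idle_days.

    Pass None for a threshold to model a tiering rule that is not configured.
    """
    rules = [
        ("Cold Archive", cold_archive_days),
        ("Archive", archive_days),
        ("Infrequent Access", ia_days),
    ]
    # Check the coldest class first so that the longest exceeded threshold wins.
    for storage_class, threshold in rules:
        if threshold is not None and idle_days > threshold:
            return storage_class
    return "Standard"


if __name__ == "__main__":
    for days in (10, 31, 61, 181):
        print(days, target_storage_class(days))
```

Here idle_days stands for the time since the last access or the last update, depending on the selected tiering strategy.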
Orphan file cleanup
On the Storage Optimization tab, select the Orphan File Remove subtab. You can view the orphan file cleanup rules. For example, the default retention period for orphan files is 7 days, measured from the file write time. The system automatically cleans up orphan files that are older than this period.
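The orphan file rule combines two conditions: the file is not referenced by any snapshot, and its write time is older than the retention period. The following sketch illustrates that rule only. The function name and the input shapes (a mapping of file path to write time, and a set of referenced paths) are hypothetical, and the file paths in the example are made up.

```python
from datetime import datetime, timedelta


def expired_orphans(files, referenced, now, retention=timedelta(days=7)):
    """Return paths that no snapshot references and that were written before the cutoff.

    files maps path -> write time; referenced is the set of paths that
    at least one snapshot still uses.
    """
    cutoff = now - retention
    return sorted(
        path for path, written in files.items()
        if path not in referenced and written < cutoff
    )


if __name__ == "__main__":
    files = {
        "bucket-0/data-a.parquet": datetime(2024, 1, 1),   # referenced
        "bucket-0/data-tmp1.parquet": datetime(2024, 1, 2),  # old orphan
        "bucket-0/data-tmp2.parquet": datetime(2024, 1, 14),  # recent, maybe in flight
    }
    print(expired_orphans(files, {"bucket-0/data-a.parquet"}, datetime(2024, 1, 15)))
```

The retention period protects files that belong to writes still in progress: a recently written file may simply not have been committed yet.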
Change the storage type
On your catalog details page, click a database name.
On the Tables subtab, click a table name.
Click the Table Details tab. You can manually change the storage class for partitioned and non-partitioned tables.
Partitioned tables
Select the Partitions subtab. The procedure for changing the storage class of a partition depends on the partition's current storage class.
Partitions in Standard, Infrequent Access, or Archive storage classes:
In the Actions column, click Modify Storage Class. You can change the storage class to any class other than the current one.
Partitions in the Cold Archive storage class:
You must first restore the data. After the restoration is complete and the status changes to restored, you can change the storage class. Perform the following steps:
Click Restore. Configure the Restored Copy Availability Duration. You can select partitions for batch restoration.
Value range: 1 to 365. The unit is days.
Default value: 1 day.
When the data enters the restored state, click Modify Storage Class in the Actions column to change the storage class.
Non-partitioned tables
In the Basic Information section of the table details page, you can change the storage class.
Standard, Infrequent Access, or Archive storage classes:
Click Edit next to Storage Type. You can change the storage class to any class other than the current one.
Cold Archive storage class:
You must first restore the data. After the restoration is complete and the status changes to restored, you can change the storage class. Perform the following steps:
Click Restore next to Storage Type. Configure the Restored Copy Availability Duration.
Value range: 1 to 365. The unit is days.
Default value: 1 day.
If the Storage Type is Cold Archive (Restored), click Edit next to Storage Type. You can then change it to any other storage class.
Note:
Restoration time: The time required to restore an object. For the Cold Archive storage class, only the Standard restoration priority is supported. The restoration process takes 2 to 5 hours.
Restored state start time: The time when the first object of the Cold Archive storage class in a partition enters the restored state after the restoration operation is complete.
Restored state duration: The validity period during which the data is in the restored state after the first Cold Archive object in the partition is restored. After all objects in the partition are restored, you can read, write, or change the storage class of the partition. When the restored state duration ends, the data in the partition returns to the Cold Archive state and cannot be accessed directly. To perform operations on the data, you must restore it again.
Restoration procedure
Initially, the object is in the frozen state.
After you submit a restore request, the object enters the restoring state. The actual restoration time may vary.
After the server completes the restoration task, the object enters the restored state. For table-level storage tiering, the partition can be accessed normally after all objects within it are restored.
You can extend the duration of the restored state by adjusting the partition's restored state duration. However, the total duration cannot exceed the maximum limit for that storage class.
After the restored state duration ends, the object returns to the frozen state without changing its original storage class. To access the data again, you must submit a new restore request and wait for the restoration to complete.
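The restoration procedure above is a small state machine: frozen, restoring, restored, and back to frozen. The following sketch models it for a single object. The class and method names are hypothetical and this is not a DLF or OSS API; the only values taken from this topic are the state names and the 1 to 365 day range for the restored copy availability duration.

```python
from datetime import datetime, timedelta


class ColdArchiveObject:
    """Toy model of one Cold Archive object moving through the restore states."""

    def __init__(self):
        self.state = "frozen"
        self.restored_until = None
        self._days = None

    def request_restore(self, days=1):
        """Submit a restore request: frozen -> restoring."""
        if not 1 <= days <= 365:
            raise ValueError("availability duration must be 1 to 365 days")
        if self.state != "frozen":
            raise RuntimeError("only a frozen object can be restored")
        self._days = days
        self.state = "restoring"

    def complete_restore(self, now):
        """The server finished the restore task: restoring -> restored."""
        if self.state != "restoring":
            raise RuntimeError("no restore request is in progress")
        self.restored_until = now + timedelta(days=self._days)
        self.state = "restored"

    def refresh(self, now):
        """Return the current state; restored -> frozen once the duration ends."""
        if self.state == "restored" and now >= self.restored_until:
            self.state = "frozen"
            self.restored_until = None
        return self.state


if __name__ == "__main__":
    obj = ColdArchiveObject()
    obj.request_restore(days=2)
    obj.complete_restore(datetime(2024, 1, 1))
    print(obj.refresh(datetime(2024, 1, 2)))  # still inside the 2-day window
    print(obj.refresh(datetime(2024, 1, 3)))  # window has ended
```

The storage class itself never changes in this model, which mirrors the statement that an object returns to the frozen state without changing its original storage class.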