Row-level cold data archiving automatically archives the cold data of online tables to Object Storage Service (OSS) and then removes that data from the online tables by using data manipulation language (DML) DELETE operations. This solution is also called TTL 2.0.
Background information
Many enterprises want to retain only the most recent and frequently accessed data (hot data). For old data (cold data) that is infrequently accessed and continuously accumulates, enterprises require a solution that can meet the following requirements:
Remove cold data at regular intervals.
Store cold data at lower costs.
Allow the use of archived cold data for analytical and statistical purposes.
PolarDB-X provides the cold data archiving (TTL) capability to address the preceding challenges.
CCI-based cold data archiving
Benefits
PolarDB-X provides the row-level cold data archiving solution based on the clustered columnar index (CCI) feature. This solution provides the following benefits:
High compression ratios: The CCI uses the optimal data compression algorithm based on the data type of each column in a table to achieve a high overall compression ratio for the entire table. For more information, see CCI overview.
High real-time performance and strong consistency: The CCI subscribes to the incremental binary logs of the primary instance to stay synchronized with the primary instance in real time.
Low storage costs: CCI data can be archived to OSS, which can significantly reduce storage costs. For more information, see What is OSS?
All-in-one transparent HTAP capability: The solution provides a transparent hybrid transactional/analytical processing (HTAP) capability that is fully compatible with MySQL. This enables enterprises to perform transaction processing and data analysis in the same system at the same time without complex changes to the existing architecture. For more information, see HTAP.
Solution overview
PolarDB-X uses a cold data archiving solution that differs from those used by other cloud database products. Instead of the common "removing while archiving" approach, PolarDB-X uses a "pre-archiving and periodical removing" method.
Pre-archiving: After TTL settings are configured for an online table, the columnstore node generates a CCI based on the full data of the table and transfers the CCI to OSS. The columnstore node also subscribes to the binary logs of the online table and stores incremental data to OSS in real time based on the binary logs. The following figure shows the process.
Periodical removing: After all data in an online table is archived, the archived data in the online table is removed based on the TTL settings. For more information, see Remove archived data from a TTL table. The removal operation does not affect the data that is archived to OSS.
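The two phases can be sketched as a minimal simulation. This is a conceptual sketch only; the function and field names below are illustrative and are not PolarDB-X APIs:

```python
def pre_archive(online_rows, oss_store):
    # Pre-archiving: copy the table's rows (full data plus binlog increments)
    # to the OSS-backed archive before anything is removed online.
    for row in online_rows:
        if row not in oss_store:
            oss_store.append(row)

def periodic_remove(online_rows, cleanup_upper_bound):
    # Periodical removing: DELETE rows whose TTL-column value falls below the
    # cleanup boundary. The copy already archived to OSS is not affected.
    return [r for r in online_rows if r["ts"] >= cleanup_upper_bound]

online = [{"id": 1, "ts": 1}, {"id": 2, "ts": 8}]
oss = []
pre_archive(online, oss)             # archive first ...
online = periodic_remove(online, 5)  # ... then remove expired rows online
```

After the removal step, the expired row is gone from the online table while both rows remain in the simulated OSS store, which mirrors why the removal operation does not affect archived data.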
Solution details
Cold data archiving
In the TTL 2.0 solution, when you create an archive table for a TTL table, a CCI is created. During this process, the columnstore node transfers the existing data of the TTL table to the OSS storage of the archive table based on snapshots. It also subscribes to the binlogs of the TTL table in real time, converts incremental data from rows to columns, and uploads the incremental updates to the OSS storage of the archive table. The following figure shows the process.
Note
The preceding figure includes the following steps:
Check whether TTL settings are configured for the online table.
You can use DDL statements to configure TTL settings for an online table. The TTL settings define that a specific column expires after a specific period of time. For more information, see Definition and creation of TTL tables.
Create an archive table to store the cold data of the TTL table, which is the online table for which the TTL settings are configured.
Create a dedicated CCI for the TTL table.
Transfer the existing data of the TTL table to OSS.
Subscribe to binlogs of the TTL table, convert incremental data from rows to columns in real time, and transfer incremental updates to the OSS storage of the archive table.
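The row-to-column conversion in the last step can be pictured as pivoting a batch of row-format binlog records into one array per column before upload. This is a conceptual sketch, not the actual columnstore implementation:

```python
def rows_to_columns(binlog_rows):
    # Pivot a batch of row-format binlog records (dicts that share a schema)
    # into one array per column, the layout that a columnar store ingests.
    if not binlog_rows:
        return {}
    columns = {name: [] for name in binlog_rows[0]}
    for row in binlog_rows:
        for name, value in row.items():
            columns[name].append(value)
    return columns

batch = [{"order_id": 1, "amount": 25}, {"order_id": 2, "amount": 40}]
columnar = rows_to_columns(batch)
```

Grouping values by column is what enables the per-column compression described in the Benefits section.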
Cold data cleanup algorithm
A TTL table removes expired data in batches, from the earliest data to the most recent: each cleanup cycle removes the data that falls within a fixed time interval, starting from the earliest remaining data. For more information, see Clean up expired data of TTL tables.
For example, the current time is 2023-10-01 and an online table needs to retain only the data of the most recent month. The cleanup process is as shown in the following figure.
Note
Day 1:
In this example, the minimum time value of the TTL-defined time column is 2022-10-05 (MinValue), and the data cleanup time interval (CleanupDataInterval) is 3 months. For more information about the data cleanup time interval, see Modify the data cleanup time interval. Based on the minimum time value and the data cleanup time interval, the first cleanup time range is from October 5, 2022 (inclusive) to January 1, 2023 (exclusive). The subsequent cleanup time ranges are from January 1, 2023 (inclusive) to April 1, 2023 (exclusive), from April 1, 2023 (inclusive) to July 1, 2023 (exclusive), and from July 1, 2023 (inclusive) to September 1, 2023 (exclusive).
Day 2:
Cleans up data in the time range of January 1, 2023 (inclusive) to April 1, 2023 (exclusive). April 1, 2023 is the upper boundary (CleanupUpperBound) of the cleanup job.
Day 3:
Cleans up data in the time range of April 1, 2023 (inclusive) to July 1, 2023 (exclusive). July 1, 2023 is the upper boundary (CleanupUpperBound) of the cleanup job.
Day 4:
Cleans up data in the time range of July 1, 2023 (inclusive) to September 1, 2023 (exclusive). The time span is 2 months. In this example, the current date is October 1, 2023, and the TTL table is configured to retain data for the most recent month (ExpiredDataInterval). Therefore, the cleanup job cleans up data earlier than September 1, 2023, which is the upper boundary (CleanupUpperBound) of the cleanup job.
The upper boundary (CleanupUpperBound) of each cleanup job can be calculated by using the following formula: CleanupUpperBound = Min(MinValue + CleanupDataInterval, CurrentTime - ExpiredDataInterval).
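The boundary computation can be reproduced with the dates from the example. This sketch assumes that boundaries are aligned to the start of a month, as in the figure; add_months and cleanup_upper_bound are helper names defined here for illustration, not PolarDB-X functions:

```python
from datetime import date

def add_months(d, n):
    # Shift a date by n calendar months, clamping to the first day of the
    # month (the example's cleanup boundaries are month-aligned).
    total = d.year * 12 + (d.month - 1) + n
    return date(total // 12, total % 12 + 1, 1)

def cleanup_upper_bound(min_value, cleanup_months, now, expired_months):
    # CleanupUpperBound = Min(MinValue + CleanupDataInterval,
    #                         CurrentTime - ExpiredDataInterval)
    return min(add_months(min_value, cleanup_months),
               add_months(now, -expired_months))

now = date(2023, 10, 1)
# Day 1: MinValue = 2022-10-05, so the job cleans up to 2023-01-01.
day1 = cleanup_upper_bound(date(2022, 10, 5), 3, now, 1)
# Day 4: MinValue = 2023-07-01; the boundary is capped at 2023-09-01 by
# the one-month retention period (ExpiredDataInterval).
day4 = cleanup_upper_bound(date(2023, 7, 1), 3, now, 1)
```

On Day 4 the Min() cap takes effect: MinValue + 3 months would be October 1, 2023, but the retention period limits the boundary to September 1, 2023, which matches the figure.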
(Optional) Clean up cold data in the archive table
As cold data accumulates in the archive table, you may need to regularly clean up cold data in the archive table to reduce the storage load. Cold data archived to the OSS storage of the archive table is automatically range-partitioned based on the TTL column of the TTL table. The following figure shows how cold data in the archive table is cleaned up.
Note
The cold data in the archive table is range-partitioned based on the specified TTL column, so you can directly delete the archived data by partition. For more information, see Delete a partition.
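Because the archived data is range-partitioned on the TTL column, cleanup reduces to dropping whole partitions whose range lies entirely before the retention cutoff. A minimal sketch of that partition selection follows; the partition names and date encoding are illustrative:

```python
def droppable_partitions(partitions, cutoff):
    # partitions maps a partition name to the exclusive upper bound of its
    # TTL-column range (encoded here as YYYYMMDD integers). A partition is
    # safe to drop once that bound is at or before the cutoff, meaning every
    # row it can hold has already expired.
    return [name for name, upper in sorted(partitions.items())
            if upper <= cutoff]

parts = {"p202301": 20230401, "p202304": 20230701, "p202307": 20231001}
old = droppable_partitions(parts, 20230701)
```

Dropping a whole partition avoids row-by-row DELETE operations on archived data, which is why the range partitioning on the TTL column makes this cleanup cheap.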