This topic describes the characteristics and usage scenarios of cold data. It also provides an example to demonstrate how to use Tablestore and Delta Lake to separate cold and hot data. The separation of cold and hot data maximizes the utilization of computing and storage resources and ensures high performance at low costs.
Background information
As business and data grow continuously, the trade-off between performance and cost presents a serious challenge to the design of big data systems.
Delta Lake is a new data lake solution. It provides a series of features, such as data ingestion, data structure management, data query, and data outflow. It also supports ACID transactions and CRUD operations on data. ACID is short for atomicity, consistency, isolation, and durability. CRUD is short for create, read, update, and delete. You can use Delta Lake and its upstream and downstream components to build an easy-to-use and secure data lake architecture. You can use the hybrid transaction/analytical processing (HTAP) technology to select tiered storage components and computing engines. With this approach, you can analyze large amounts of data, update transactional data quickly, and reduce the cost of hot and cold data separation.
Data classification based on access frequency
Data can be classified into hot data, warm data, and cold data based on the access frequency. Cold data refers to data that is infrequently accessed or even not accessed during the entire data lifecycle. The volume of cold data is large in most cases. You can distinguish hot data from cold data by using the following methods:
- Data creation time: Newly written data is hot data because it is accessed frequently. The access frequency decreases over time. When the data is rarely accessed or is no longer queried at all, it becomes cold data.
This method applies to most data, such as transaction data, monitoring data on time series metrics, and instant messaging (IM) data.
- Data access popularity: You can add related tags to business data based on access popularity. Systems can also automatically distinguish hot and cold data based on access popularity.
For example, if an old blog post suddenly receives many visits, it is classified as hot data based on the business and data distribution status, even though it was created a long time ago.
Note: This topic describes only hot and cold data separation based on data creation time.
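The creation-time method can be illustrated with a short sketch. It is only an illustration: the function name `classify_by_age` and the 7-day and 90-day thresholds are assumptions chosen for this example, not values defined by Tablestore or Delta Lake.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds; tune them to the access pattern of your workload.
HOT_WINDOW = timedelta(days=7)
WARM_WINDOW = timedelta(days=90)


def classify_by_age(created_at: datetime, now: datetime) -> str:
    """Classify a record as hot, warm, or cold based only on its age."""
    age = now - created_at
    if age <= HOT_WINDOW:
        return "hot"
    if age <= WARM_WINDOW:
        return "warm"
    return "cold"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for days in (1, 30, 365):
        print(days, classify_by_age(now - timedelta(days=days), now))
```

A classifier based on access popularity would need per-record access statistics instead of a timestamp, which is why the creation-time method is simpler to apply to append-only data.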
Features of cold data
- Large volume: Compared with hot data, cold data needs to be stored for a long time or even permanently, so it accumulates to a large volume.
- Low management cost: Cold data is infrequently accessed. Therefore, users expect to manage cold data at low costs.
- Low performance requirement: Unlike common queries on terabytes of data, queries on cold data do not require responses in milliseconds. Queries on cold data may require tens of seconds or even longer to return results. Asynchronous processing is also supported.
- Simple operations: In most scenarios, cold data is written or deleted in batches and is not updated.
When you query cold data, the system reads only the data that meets query conditions. Query conditions are not complex.
Scenarios
- Time series data: This type of data naturally has a time attribute. The volume of data is large, and only the append operation is performed on the data. Time series data is used in the following scenarios:
- IM: Most of the time, users query only recent messages. Historical data is queried occasionally to meet special requirements. Example: DingTalk.
- Monitoring: Most of the time, users view only recent monitoring data. Historical data is queried only when users need to investigate issues or make reports. Example: Cloud Monitor.
- Billing: Most of the time, users view only bills generated in recent days or the latest month. Bills generated more than one year ago are rarely queried. Example: Alipay.
- Internet of things (IoT): Data recently reported by devices is frequently analyzed. Historical data is infrequently analyzed.
- Archived data: For data that is easy to read and write but complicated to query, you can regularly archive the data to storage components whose storage costs are low or to storage media with a high compression ratio. This helps reduce storage costs.
Example
This example demonstrates how to use Tablestore and Delta Lake to separate cold and hot data.
- Synchronize streaming data in real time.
- Query hot and cold data.
We recommend that you store hot data in the Tablestore table for efficient queries on terabytes of data, and store cold data or all data in the Delta Lake sink. In this example, the Tablestore table is the source table order_source, and the Delta Lake sink is the destination table delta_orders. Configure time to live (TTL) for the source table to flexibly control the volume of hot data.
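One way to picture how TTL separates the two tiers is a small routing helper that decides which table can answer a time-range query. This is a minimal sketch under stated assumptions: the table names come from this example, but the function `choose_table`, the 30-day TTL, and the assumption that delta_orders holds all data are hypothetical. It also assumes that a row's business timestamp matches the write time against which TTL is measured.

```python
from datetime import datetime, timedelta, timezone

HOT_TABLE = "order_source"   # Tablestore source table from this example
COLD_TABLE = "delta_orders"  # Delta Lake destination table from this example

# Hypothetical TTL; use the value actually configured on the source table.
HOT_TTL = timedelta(days=30)


def choose_table(query_start: datetime, now: datetime,
                 ttl: timedelta = HOT_TTL) -> str:
    """Return the table that can fully answer a query over [query_start, now]."""
    oldest_hot_row = now - ttl
    # Rows older than the TTL have expired from the hot table, so only the
    # Delta Lake table (assumed to hold all data) can still serve them.
    return HOT_TABLE if query_start >= oldest_hot_row else COLD_TABLE


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(choose_table(now - timedelta(days=1), now))    # recent range
    print(choose_table(now - timedelta(days=365), now))  # historical range
```

Shortening the TTL shrinks the hot tier and sends more queries to the Delta Lake table; lengthening it does the opposite, at a higher storage cost in Tablestore.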