This topic describes the characteristics and usage scenarios of cold data. It also provides an example to demonstrate how to use Tablestore and Delta Lake to separate cold and hot data. The separation of cold and hot data maximizes the utilization of computing and storage resources and ensures high performance at low costs.

Background information

As business and data grow continuously, the trade-off between performance and cost presents a serious challenge to the design of big data systems.

Delta Lake is a new data lake solution. It provides features such as data ingestion, data structure management, data query, and data outflow, and it supports ACID (atomicity, consistency, isolation, and durability) and CRUD (create, read, update, and delete) operations on data. You can use Delta Lake together with its upstream and downstream components to build an easy-to-use and secure data lake architecture. With the hybrid transaction/analytical processing (HTAP) approach, you can select tiered storage components and computing engines that analyze large amounts of data, update transactional data quickly, and reduce the cost of separating hot and cold data.

Data classification based on access frequency

Data can be classified into hot data, warm data, and cold data based on the access frequency. Cold data refers to data that is infrequently accessed or even not accessed during the entire data lifecycle. The volume of cold data is large in most cases.

You can distinguish hot and cold data based on the data creation time or access popularity:
  • Data creation time: Newly written data is hot data because it is accessed frequently. The access frequency decreases over time. When the data is rarely accessed or not queried at all, it becomes cold data.

    This method applies to most data, such as transaction data, monitoring data on time series metrics, and instant messaging (IM) data.

  • Data access popularity: You can add related tags to business data based on access popularity. Systems can also automatically distinguish hot and cold data based on access popularity.
    For example, if an old blog post suddenly receives frequent visits, it is classified as hot data based on the business and data distribution status, even though it was created a long time ago.
    Note This topic describes only hot and cold data separation based on data creation time.

Features of cold data

  • Large volume: Cold data needs to be stored for a long time or even permanently, so its volume is much larger than that of hot data.
  • Low management cost: Cold data is infrequently accessed. Therefore, users expect to manage cold data at low costs.
  • Low performance requirements: Unlike queries on hot data, queries on cold data do not need to return results within milliseconds. They may take tens of seconds or even longer, and they can also be processed asynchronously.
  • Easy operation: In most scenarios, cold data is batch written or batch deleted and is not updated.

    When you query cold data, the system reads only the data that meets query conditions. Query conditions are not complex.

Scenarios

  • Time series data: This type of data naturally has a time attribute. The volume of data is large, and only the append operation is performed on the data. Time series data is used in the following scenarios:
    • IM: Most of the time, users query only recent messages. Historical messages are queried only occasionally to meet special requirements. Example: DingTalk.
    • Monitoring: Most of the time, users view only recent monitoring data. Historical data is queried only when users need to investigate issues or create reports. Example: Cloud Monitor.
    • Billing: Most of the time, users view only the bills generated in recent days or the latest month. Bills generated a year ago are rarely queried. Example: Alipay.
    • Internet of things (IoT): Data recently reported by devices is frequently analyzed. Historical data is infrequently analyzed.
  • Archived data: For data that is easy to read and write but complicated to query, you can regularly archive the data to storage components whose storage costs are low or to storage media with a high compression ratio. This helps reduce storage costs.

Example

This example demonstrates how to use Tablestore and Delta Lake to separate cold and hot data.

  1. Synchronize streaming data in real time.
    1. Create a data source table.
      In this example, the data source table is an order table named OrderSource. The table has two primary key columns (UserId and OrderId) and two attribute columns (price and timestamp). Call the BatchWriteRow operation provided by Tablestore SDK for Java to write order data to the table. The values in the timestamp column cover the latest 90 days (2020-02-26 to 2020-05-26). A total of 3,112,400 records are written to the table.

      When you simulate the order writes, use the order timestamp as the version number of the attribute columns in the Tablestore table, and configure a time to live (TTL) attribute for the table. When the retention period of data in the table exceeds the TTL, the system automatically deletes the data based on the version numbers.
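      The following minimal sketch shows what such a write might look like with Tablestore SDK for Java. The access credentials and the order values are placeholders, and the order timestamp is passed as the version number of each attribute column:
      import com.alicloud.openservices.tablestore.SyncClient;
      import com.alicloud.openservices.tablestore.model.*;

      public class OrderWriter {
          public static void main(String[] args) {
              // Placeholders: replace with your own credentials.
              SyncClient client = new SyncClient(
                  "http://vehicle-test.cn-hangzhou.vpc.tablestore.aliyuncs.com",
                  "<yourAccessKeyId>", "<yourAccessKeySecret>", "vehicle-test");
              try {
                  long orderTime = System.currentTimeMillis(); // order timestamp in milliseconds

                  // Primary key: UserId + OrderId.
                  PrimaryKeyBuilder pkBuilder = PrimaryKeyBuilder.createPrimaryKeyBuilder();
                  pkBuilder.addPrimaryKeyColumn("UserId", PrimaryKeyValue.fromString("user_0001"));
                  pkBuilder.addPrimaryKeyColumn("OrderId", PrimaryKeyValue.fromString("order_0001"));

                  // Use the order timestamp as the version number of the attribute columns
                  // so that TTL-based deletion follows the order time.
                  RowPutChange row = new RowPutChange("OrderSource", pkBuilder.build());
                  row.addColumn(new Column("price", ColumnValue.fromDouble(59.9), orderTime));
                  row.addColumn(new Column("timestamp", ColumnValue.fromLong(orderTime), orderTime));

                  BatchWriteRowRequest request = new BatchWriteRowRequest();
                  request.addRowChange(row);
                  BatchWriteRowResponse response = client.batchWriteRow(request);
                  System.out.println("All rows written: " + response.isAllSucceed());
              } finally {
                  client.shutdown();
              }
          }
      }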

    2. Create a tunnel for incremental synchronization in the Tablestore console.
      Use the CDC technology supported by the created tunnel to synchronize incremental data in the source table to Delta Lake. The tunnel ID is used in subsequent SQL statement configuration.
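      The topic creates the tunnel in the console. If you prefer to create it programmatically, a minimal sketch that assumes the Tunnel APIs of Tablestore SDK for Java might look like the following. The tunnel name is a placeholder, and the returned tunnel ID is the value used in the SQL statements below:
      import com.alicloud.openservices.tablestore.TunnelClient;
      import com.alicloud.openservices.tablestore.model.tunnel.CreateTunnelRequest;
      import com.alicloud.openservices.tablestore.model.tunnel.CreateTunnelResponse;
      import com.alicloud.openservices.tablestore.model.tunnel.TunnelType;

      public class TunnelCreator {
          public static void main(String[] args) {
              // Placeholders: replace with your own credentials.
              TunnelClient tunnelClient = new TunnelClient(
                  "http://vehicle-test.cn-hangzhou.vpc.tablestore.aliyuncs.com",
                  "<yourAccessKeyId>", "<yourAccessKeySecret>", "vehicle-test");
              try {
                  // TunnelType.STREAM captures only incremental (CDC) data.
                  CreateTunnelRequest request =
                      new CreateTunnelRequest("OrderSource", "order-incremental-tunnel", TunnelType.STREAM);
                  CreateTunnelResponse response = tunnelClient.createTunnel(request);
                  // Use this ID in the tunnel.id option of the CREATE SCAN statement.
                  System.out.println("Tunnel ID: " + response.getTunnelId());
              } finally {
                  tunnelClient.shutdown();
              }
          }
      }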
    3. On the Header node of your EMR cluster, start the Streaming SQL interactive command line:
      streaming-sql --master yarn --use-emr-datasource --num-executors 16 --executor-memory 4g --executor-cores 4
      Run the following statements to create a source table and a destination table, create an incremental scan view, and start a streaming job.
      // 1. Create a Tablestore source table.
      DROP TABLE IF EXISTS order_source;
      CREATE TABLE order_source
      USING tablestore
      OPTIONS(
      endpoint="http://vehicle-test.cn-hangzhou.vpc.tablestore.aliyuncs.com",
      access.key.id="",
      access.key.secret="",
      instance.name="vehicle-test",
      table.name="OrderSource",
      catalog='{"columns": {"UserId": {"col": "UserId", "type": "string"}, "OrderId": {"col": "OrderId", "type": "string"}, "price": {"col": "price", "type": "double"}, "timestamp": {"col": "timestamp", "type": "long"}}}'
      );
      
      
      // 2. Create a Delta Lake sink named delta_orders as the destination table.
      DROP TABLE IF EXISTS delta_orders;
      CREATE TABLE delta_orders(
      UserId string,
      OrderId string,
      price double,
      timestamp long
      )
      USING delta
      LOCATION '/delta/orders';
      
      // 3. Create an incremental scan view for the source table.
      CREATE SCAN incremental_orders ON order_source USING STREAM 
      OPTIONS(
      tunnel.id="324c6bee-b10d-4265-9858-b829a1b71b4b", 
      maxoffsetsperchannel="10000");
      
      // 4. Run a streaming job to synchronize data from the source table to the destination table based on the CDC technology.
      CREATE STREAM orders_job
      OPTIONS (
      checkpointLocation='/delta/orders_checkpoint',
      triggerIntervalMs='3000'
      )
      MERGE INTO delta_orders
      USING incremental_orders AS delta_source
      ON delta_orders.UserId=delta_source.UserId AND delta_orders.OrderId=delta_source.OrderId
      WHEN MATCHED AND delta_source.__ots_record_type__='DELETE' THEN
      DELETE
      WHEN MATCHED AND delta_source.__ots_record_type__='UPDATE' THEN
      UPDATE SET UserId=delta_source.UserId, OrderId=delta_source.OrderId, price=delta_source.price, timestamp=delta_source.timestamp
      WHEN NOT MATCHED AND delta_source.__ots_record_type__='PUT' THEN
      INSERT (UserId, OrderId, price, timestamp) values (delta_source.UserId, delta_source.OrderId, delta_source.price, delta_source.timestamp);
      The following list describes these operations in detail.
      • Create a Tablestore source table: creates a source table named order_source. The catalog parameter in OPTIONS defines the table schema. In this example, the table contains the UserId, OrderId, price, and timestamp columns.
      • Create a Delta Lake sink: creates a destination table named delta_orders. The LOCATION parameter specifies the directory in which the Delta files are stored.
      • Create an incremental-data scan view for the source table: creates a streaming view named incremental_orders.
        • tunnel.id: the ID of the tunnel created in Step 1.2.
        • maxoffsetsperchannel: the maximum amount of data that can be written to each partition of the tunnel.
      • Run a streaming job to synchronize data in real time: aggregates the data in the source table based on the primary key columns (UserId and OrderId), converts the aggregated data based on the operation types (PUT, UPDATE, and DELETE) in the CDC logs, and synchronizes the converted data to the Delta Lake table. __ots_record_type__ is a predefined column provided by the Tablestore streaming source and indicates the row-level operation type.

  2. Query hot and cold data.
    We recommend that you store hot data in the Tablestore table for efficient queries on terabytes of data and store cold data or all data in the Delta Lake sink. In this example, the Tablestore table is the source table order_source, and the Delta Lake sink is the destination table delta_orders. Configure TTL for the source table. This way, you can flexibly control the volume of hot data.
    1. Before you configure the TTL, query the number of data entries in the source table and in the destination table.
      The query results of the two tables are consistent.
    2. Set the TTL of the Tablestore table to 30 days.
      After the TTL takes effect, the source table contains only the data generated in the latest 30 days, whereas the destination table still contains all of the data. Hot data and cold data are separated.
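      You can set the TTL in the Tablestore console or by calling the UpdateTable operation. A minimal sketch with Tablestore SDK for Java, using the same placeholder credentials as in the write example, might look like the following:
      import com.alicloud.openservices.tablestore.SyncClient;
      import com.alicloud.openservices.tablestore.model.TableOptions;
      import com.alicloud.openservices.tablestore.model.UpdateTableRequest;

      public class TtlUpdater {
          public static void main(String[] args) {
              // Placeholders: replace with your own credentials.
              SyncClient client = new SyncClient(
                  "http://vehicle-test.cn-hangzhou.vpc.tablestore.aliyuncs.com",
                  "<yourAccessKeyId>", "<yourAccessKeySecret>", "vehicle-test");
              try {
                  // Keep only the data written in the last 30 days (TTL is in seconds).
                  TableOptions options = new TableOptions();
                  options.setTimeToLive(30 * 24 * 3600);
                  UpdateTableRequest request = new UpdateTableRequest("OrderSource");
                  request.setTableOptionsForUpdate(options);
                  client.updateTable(request);
              } finally {
                  client.shutdown();
              }
          }
      }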
    3. Query the number of data entries in the source table and in the destination table again.
      The number of data entries in the source table is 1,017,004, and the number of data entries in the destination table is still 3,112,400.