All Products
Search
Document Center

Lindorm:Separately store hot data and cold data based on timestamps

Last Updated:Apr 09, 2024

This topic describes how to separately store hot data and cold data in Lindorm based on timestamps.

Background information

Lindorm can archive data to cold storage based on the timestamp and the specified hot and cold data boundary. If you do not specify a timestamp, the time when the data is written to the table is used to determine whether to archive data to cold storage.

Prerequisites

Procedure

Method 1: Use Apache HBase Shell to implement hot and cold data separation

  1. Create a table and specify a hot data and cold data boundary for the table.

    HBase(main):002:0> create 'chsTable', {NAME=>'f', COLD_BOUNDARY=>'86400'}

    Parameters

    • NAME: The name of the column family for which you want to implement hot and cold data separation.

    • COLD_BOUNDARY: The boundary based on which hot and cold data separation is performed. The unit of the value of this parameter is seconds. For example, if the value of the COLD_BOUNDARY parameter is set to 86400, data is archived to cold storage after the data has been stored in hot storage for 86,400 seconds, which is equal to one day.

  2. (Optional) Change the hot and cold data boundary.

    HBase(main):005:0> alter 'chsTable', {NAME=>'f', COLD_BOUNDARY=>'42300'}
  3. (Optional) Disable cold and hot data separation.

    HBase(main):004:0> alter 'chsTable', {NAME=>'f', COLD_BOUNDARY=>""}
    Note

    After you change the hot and cold data boundary or disable cold and hot data boundary, the data is transferred from cold storage to hot storage after Lindorm performs the compaction operation. To immediately transfer the data from cold storage to hot storage, manually run the major_compact command.

Method 2: Use ApsaraDB for HBase API for Java to implement hot and cold data separation

  1. Create a table and specify a hot data and cold data boundary for the table.

    Admin admin = connection.getAdmin();
    TableName tableName = TableName.valueOf("chsTable");
    HTableDescriptor descriptor = new HTableDescriptor(tableName);
    HColumnDescriptor cf = new HColumnDescriptor("f");
    cf.setValue(AliHBaseConstants.COLD_BOUNDARY, "86400");
    descriptor.addFamily(cf);
    admin.createTable(descriptor);                        
  2. (Optional) Change the hot and cold data boundary.

    HTableDescriptor descriptor = admin
        .getTableDescriptor(tableName);
    HColumnDescriptor cf = descriptor.getFamily("f".getBytes());
    cf.setValue(AliHBaseConstants.COLD_BOUNDARY, "86400");
    admin.modifyTable(tableName, descriptor);
  3. (Optional) Disable cold and hot data separation.

    HTableDescriptor descriptor = admin
        .getTableDescriptor(tableName);
    HColumnDescriptor cf = descriptor.getFamily("f".getBytes());
    cf.setValue(AliHBaseConstants.COLD_BOUNDARY, null);
    admin.modifyTable(tableName, descriptor);
    Note

    After you change the hot and cold data boundary or disable cold and hot data boundary, the data is transferred from cold storage to hot storage after Lindorm performs the compaction operation.

Method 3: Use Lindorm-cli to implement hot and cold data separation

  1. Create a table and specify a hot and cold data boundary for the table.

    USE test; 
    CREATE TABLE dt (  
      p1 integer, p2 integer, c1 varchar, c2 bigint,  constraint pk primary key(p1 desc)) WITH
    (COMPRESSION = 'ZSTD', CHS = '86400', CHS_L2 = 'storagetype=COLD');

    Parameters

    • CHS: The hot and cold data boundary for the table. Data that has been stored in the table for a period longer than this period is archived in cold storage. The unit of the value of this parameter is seconds. For example, if the value of this parameter is 86400, data is archived to cold storage after the data is stored in written to the table for 86,400 seconds, which is equal to one day.

    • COMPRESSION: The algorithm used to compress the data. This parameter applies to data in the entire table. The value of this parameter is not case-sensitive. Default value: NONE.

    • CHS_L2: The layer-2 attribute. In general, this parameter is configured to specify the storage type. Set this parameter to storagetype=COLD.

  2. (Optional) Change the hot and cold data boundary.

    ALTER TABLE dt SET 'CHS'='1000';
  3. (Optional) Disable cold and hot data separation.

    ALTER TABLE dt SET 'CHS'='', 'CHS_L2' = '';
    Note

    After you change the hot and cold data boundary or disable cold and hot data boundary, the data is transferred from cold storage to hot storage after Lindorm performs the compaction operation. To immediately transfer the data from cold storage to hot storage, manually run the major_compact command. For more information, see ALTER TABLE.

Write data

You can write data to a wide table that separately stores hot data and cold data in a manner that is the same as the manner in which you write data to a standard table. By default, the timestamp of data is the current system time when the data is written. If you use the ApsaraDB for HBase API to write data, you can specify a custom timestamp based on your requirement. When data is written to a table that separately stores hot data and cold data, it is first stored in hot storage that is of the Standard of Performance type. When the data has been stored for a period that is longer than the value specified by COLD_BOUNDARY or CHS, the data is archived to cold storage when Lindorm performs the compaction operation.

Note

The timestamp of data is used by Lindorm to determine whether the data needs to be archived to cold storage. Lindorm compares the current time and the timestamp of data. If the timestamp of data is a point in time three days after the current system time, the data is archived to cold storage three days later than data whose timestamp is the current system time. If the timestamp of data is a point in time three days before the current system time and the hot and cold data boundary is set to three days, the data is asynchronously archived to cold storage after it is written to the table.

Query data

Lindorm uses one table to store hot data and cold data. This way, you can query all data within one table. When you query data, you can configure the TimeRange parameter to specify a time range for the query. Lindorm determines whether to query only hot storage or cold storage, or query both hot and cold storage based on the time range that you specify. If you do not specify a time range in a query, cold data may be queried. In this case, the throughput of the query is throttled by the specification of cold storage. For more information, see Overview.

If you want to query data in hot storage, you can use the _l_hot_only_ hint in the query or configure the HOT_ONLY parameter to query only data in hot storage.

Important
  • Data in cold storage is intended for archiving and should be rarely accessed. If data in cold storage is frequently accessed by a large number of requests, check whether the hot and cold data boundary specified by the COLD_BOUNDARY is properly configured. If large amounts of data that needs to be frequently queried is archived to cold storage based on the boundary, the performance of the queries may be throttled.

  • If a field in a row of data stored in cold storage is updated, the updated field is stored in hot storage. If you specify the HOT_ONLY or TimeRange parameter to query only data in hot storage, only the updated field is returned. If you want LindormTable to return the data of the entire row, you must perform the query without specifying the HOT_ONLY or TimeRange parameter, or make sure that the time range specified by TimeRange covers the time period from the point in time when the row was inserted to the point in time when the row was last updated. Therefore, we recommend that you do not update data stored in cold storage.

Currently, the GET and SCAN methods are supported to query data.

Use the GET method to query data

  • Method 1: Use Apache HBase Shell to query data

    • Query data without specifying the HOT_ONLY parameter. In this case, data in cold storage may be queried.

      HBase(main):013:0> get 'chsTable', 'row1'
    • Query data by specifying the HOT_ONLY parameter. In this case, only data in hot storage is queried.

      HBase(main):015:0> get 'chsTable', 'row1', {HOT_ONLY=>true}
    • Query data by specifying the TimeRange parameter. Lindorm compares the values of TimeRange and COLD_BOUNDARY to determine whether to query only hot storage or cold storage, or query both hot and cold storage.

      HBase(main):016:0> get 'chsTable', 'row1', {TIMERANGE => [0, 1568203111265]}

      Parameters

      TimeRange: the time range within which the query is performed. The time in the range is a UNIX timestamp representing the number of milliseconds that have elapsed since the epoch time January 1, 1970, 00:00:00 UTC.

  • Method 2: Use ApsaraDB for HBase API for Java to query data

    • Query data without specifying the HOT_ONLY parameter. In this case, data in cold storage may be queried.

      Get get = new Get("row1".getBytes());
      System.out.println("result: " + table.get(get));
    • Query data by specifying the HOT_ONLY parameter. In this case, only data in hot storage is queried.

      get = new Get("row1".getBytes());
      get.setAttribute(AliHBaseConstants.HOT_ONLY, Bytes.toBytes(true));
    • Query data by specifying the TimeRange parameter. Lindorm compares the values of TimeRange and COLD_BOUNDARY to determine whether to query only hot storage or cold storage, or query both hot and cold storage.

      get = new Get("row1".getBytes());
      get.setTimeRange(0, 1568203111265)
  • Method 3: Use Lindorm SQL to query data

    SELECT /*+ _l_hot_only_ */ * FROM dt WHERE pk IN (1, 2, 3);

Use the SCAN method to query data within a specified range

Note
  • If you do not configure the HOT_ONLY and TimeRange parameters, or the value of TimeRange covers the timestamps of data in cold storage, Lindorm queries data in both cold storage and hot storage and merge the query results.

  • Only Apache HBase Shell and ApsaraDB for HBase API for Java support queries within specified ranges.

  • Method 1: Use Apache HBase Shell to query data

    • Query data without specifying the HOT_ONLY parameter. In this case, data in cold storage is queried.

      Lindorm(main):017:0> scan 'chsTable', {STARTROW =>'row1', STOPROW=>'row9'}
    • Query data by specifying the HOT_ONLY parameter. In this case, only data in hot storage is queried.

      Lindorm(main):018:0> scan 'chsTable', {STARTROW =>'row1', STOPROW=>'row9', HOT_ONLY=>true}
    • Query data by specifying the TimeRange parameter. Lindorm compares the values of TimeRange and COLD_BOUNDARY to determine whether to query only hot storage or cold storage, or query both hot and cold storage.

      Lindorm(main):019:0> scan 'chsTable', {STARTROW =>'row1', STOPROW=>'row9', TIMERANGE => [0, 1568203111265]}
  • Method 2: Use ApsaraDB for HBase API for Java to query data

    TableName tableName = TableName.valueOf("chsTable");
    Table table = connection.getTable(tableName);
    // In this example, the HOT_ONLY parameter is not configured. Data in cold storage is queried.
    Scan scan = new Scan();
    ResultScanner scanner = table.getScanner(scan);
    for (Result result : scanner) {
        System.out.println("scan result:" + result);
    }
    // In this example, the HOT_ONLY parameter is configured. Only data in hot storage is queried.
    scan = new Scan();
    scan.setAttribute(AliLindormConstants.HOT_ONLY, Bytes.toBytes(true));
    // In this example, the TimeRange parameter is configured. Lindorm compares the values of TimeRange and COLD_BOUNDARY to determine whether to query only hot storage or cold storage, or query both hot and cold storage.
    scan = new Scan();
    scan.setTimeRange(0, 1568203111265);

Preferentially query hot data

In a SCAN query that is performed to query information such as all orders or chat records of a customer, LindormTable may scan data in both hot storage and cold storage. The query results are paginated based on the timestamps when the data rows were written to the table in descending order. In most cases, hot data appears before cold data. If you do not configure the HOT_ONLY parameter in a SCAN query, LindormTable scans hot data and cold data. As a result, the query response time increases. If you enable the hot data prioritization feature, LindormTable preferentially queries data in hot storage. Data in cold storage is queried only when the number of rows in hot storage is less than the minimum number of data rows to query. This way, access to data in cold storage is reduced and the response speed is improved.

Note

You can use only Apache HBase Shell or ApsaraDB for HBase API for Java to preferentially query hot data.

  • Method 1: Use Apache HBase Shell to query data

    Lindorm(main):002:0> scan 'chsTable', {COLD_HOT_MERGE=>true}

    Parameters

    COLD_HOT_MERGE: specifies whether to enable the hot data prioritization feature.

    • true: enable the hot data prioritization feature. When the COLD_HOT_MERGE parameter is set to true, LindormTable preferentially scans hot data. LindormTable scans cold data only when the number of data rows in hot storage is less than the number of rows that you want to query.

    • false: disable the hot data prioritization feature.

  • Method 2: Use ApsaraDB for HBase API for Java to query data

    scan = new Scan();
    scan.setAttribute(AliHBaseConstants.COLD_HOT_MERGE, Bytes.toBytes(true));
    scanner = table.getScanner(scan);

Usage notes

  • In scenarios in which data in specific fields of a row is updated, the row stores hot data and cold data at the same time. If you enable the hot data prioritization feature, the query results are returned in two batches. You can find two results of the same row key in the result set.

  • After you enable the hot data prioritization feature, hot data is returned before cold data. Therefore, the row key values of specific returned rows of cold data may be smaller than the row key values of specific returned rows of hot data because LindormTable returns hot data before cold data. The results that are returned for a SCAN query are not sequentially sorted. The rows of hot data and the rows of cold data are separately sorted based on row key values. For more information about how returned rows are sorted, see the following sample results. In some scenarios, you can specify a row key to ensure the order of the results of a SCAN query. For example, you use a table to store the information about orders that your customers make. You can specify a row key that consists of the column that store customer IDs and the column that stores the order creation time. This way, when you query the orders that were made by a customer, the returned orders are sorted based on the points in time when the orders were created.

    // In the following example, the row whose row key value is coldRow stores cold data and the row whose row key value is hotRow stores hot data.
    // In most cases, the coldRow row is returned before the hotRow row because rows in Lindorm are sorted in lexicographic order. 
    HBase(main):001:0> scan 'chsTable'
    ROW                                                                COLUMN+CELL
     coldRow                                                              column=f:value, timestamp=1560578400000, value=cold_value
     hotRow                                                               column=f:value, timestamp=1565848800000, value=hot_value
    2 row(s)
    
    // If you set the value of the COLD_HOT_MERGE parameter to true, LindormTable scans the row whose row key value is hotRow first. As a result, the hotRow row is returned before the coldRow row.
    HBase(main):002:0> scan 'chsTable', {COLD_HOT_MERGE=>true}
    ROW                                                                COLUMN+CELL
     hotRow                                                               column=f:value, timestamp=1565848800000, value=hot_value
     coldRow                                                              column=f:value, timestamp=1560578400000, value=cold_value
    2 row(s)

FAQ

  • Q: Is a row of cold data still cold data after it is updated?

    A: No. After a row of cold data is updated, the timestamp of this row is updated. Therefore, the cold data becomes hot data.

  • Q: Why cold data is returned for my query even if I want to query only hot data?

    A: You can configure the HOT_ONLY parameter or the _l_hot_only_ hint to query only hot data. Data is periodically archived to cold storage based on its timestamps. Therefore, some cold data may have not been archived to cold storage yet when you query the data. In this case, cold data is returned. To resolve this issue, you can specify a time range for the hot data that you want to query. Example:

    // You must use the _l_ts_min_ and _l_ts_max_ hints to specify a time range. The _l_ts_min_ hint indicates the difference between the current system time and the hot and cold data boundary. The _l_ts_max_ hint indicates the current system time. The unit of the hints must be the same.
    SELECT /*+ _l_hot_only_(true), _l_ts_min_(1000), _l_ts_max_(2001) */ * FROM test WHERE p1>1;
  • Q: Why does my query time out even if I specify a time range and use the HOT_ONLY hint in my query?

    A: This issue generally occurs after you migrate data to the table or enable hot and cold separation for the table. In this case, cold data is completely archived to cold storage and large mounts of cold data is still stored in hot storage. Therefore, the query may time out. To resolve this issue, you must perform the major compaction operation on the table. For more information about the syntax used to perform this operation, see ALTER TABLE.