Lindorm provides the wide table engine (LindormTable) that can automatically separate hot data and cold data based on your requirements. LindormTable archives cold data in cold storage and store hot data in hot storage based on the specified boundary.

Background information

In big data scenarios, business data such as order data and monitoring data grows over time and requires a large storage space. At the same time, large amounts of historical data is archived and rarely used. Enterprises require a cost-effective storage method to store this type of data to reduce costs. To help enterprises reduce storage costs for cold data and separate hot data and cold data in a simple manner, LindormTable provides the hot data and cold data separation feature. This way, hot data and cold data can be stored in different storage media. The unit price of cold storage is only 20% of the unit price of standard storage that is used to store hot data.

LindormTable can automatically separate hot data and cold data in the same wide table based on the boundary that you specify, and archive cold data in cold storage. You can write data to or query data from a wide table that separately stores hot data and cold data in a manner that is almost the same as the manner in which you write data to or query data from a standard table. When you query data from a table that separately stores hot data and cold data, you can specify a hint or a time range. Lindorm determines whether to scan hot data or cold data based on the hint or time range that you specify. This is an automated process that is transparent to you.

How it works

LindormTable determines whether the data that is queried is cold data based on the timestamp when the data is written to the table and the hot data and cold data boundary that you specify. LindormTable stores new data in hot storage and then transfers the data to cold storage after the age of the data exceeds the hot data and cold data boundary. You can change the hot data and cold data boundary based on your business requirements. Data can be transferred from cold storage to hot storage or from hot storage to cold storage.

Prerequisites

Specify a hot data and cold data boundary for a table

You can change the value of the COLD_BOUNDARY parameter to change the boundary for hot data and cold data separation. The unit of the value of the COLD_BOUNDARY parameter is seconds. For example, if the value of the COLD_BOUNDARY parameter is 86400, data is archived to cold storage after the data is stored in hot storage for 86,400 seconds, which is equal to one day.

Note You do not need to set the storagetype property of the table or column family to COLD for hot data and cold data separation. If the storagetype property of the table or column family is set to COLD, remove the property. For more information, see Introduction to the cold storage feature.

The following list describes the methods that can be used to configure hot data and cold data separation:

  • Method 1: Use Apache HBase Shell
    // Create a table that separately stores hot data and cold data.
    HBase(main):002:0> create 'chsTable', {NAME=>'f', COLD_BOUNDARY=>'86400'}
    // Disable the hot data and cold data separation feature.
    HBase(main):004:0> alter 'chsTable', {NAME=>'f', COLD_BOUNDARY=>""}
    // Enable the hot data and cold data separation feature or change the hot data and cold data boundary for an existing table. The unit of the value of the hot data and cold data separation boundary is seconds.
    HBase(main):005:0> alter 'chsTable', {NAME=>'f', COLD_BOUNDARY=>'86400'}
  • Method 2: Use ApsaraDB for HBase API for Java
    // Create a table that separately stores hot data and cold data.
    Admin admin = connection.getAdmin();
    TableName tableName = TableName.valueOf("chsTable");
    HTableDescriptor descriptor = new HTableDescriptor(tableName);
    HColumnDescriptor cf = new HColumnDescriptor("f");
    // The COLD_BOUNDARY parameter specifies the boundary for hot data and cold data separation. Unit: seconds. In this example, data is archived to cold storage after the data is stored for one day.
    cf.setValue(AliHBaseConstants.COLD_BOUNDARY, "86400");
    descriptor.addFamily(cf);
    admin.createTable(descriptor);
    
    // Disable the hot data and cold data separation feature.
    // You must perform a major compaction operation before you move data from cold storage to hot storage.
    HTableDescriptor descriptor = admin
        .getTableDescriptor(tableName);
    HColumnDescriptor cf = descriptor.getFamily("f".getBytes());
    // Disable the hot data and cold data separation feature.
    cf.setValue(AliHBaseConstants.COLD_BOUNDARY, null);
    admin.modifyTable(tableName, descriptor);
    
    // Enable the hot data and cold data separation feature or change the hot data and cold data boundary for an existing table.
    HTableDescriptor descriptor = admin
        .getTableDescriptor(tableName);
    HColumnDescriptor cf = descriptor.getFamily("f".getBytes());
    // The COLD_BOUNDARY parameter specifies the boundary for hot data and cold data separation. Unit: seconds. In this example, data is archived to cold storage after the data is stored for one day.
    cf.setValue(AliHBaseConstants.COLD_BOUNDARY, "86400");
    admin.modifyTable(tableName, descriptor);
  • Method 3: Use the Lindorm CQL shell

    You can use the Lindorm CQL shell to specify a hot data and cold data separation policy for a wide table in LindormTable.

    // The COLD_BOUNDARY parameter specifies the boundary for hot data and cold data separation. Unit: seconds. In this example, data is archived to cold storage after the data is stored for one day.
    create table tb (pk text PRIMARY KEY, c1 text) WITH extensions = {'COLD_BOUNDARY':'86400'}
    // Modify the storage policy and change the hot data and cold data boundary of an existing table.
    alter table tb  with extensions = {'STORAGE_POLICY' : 'DEFAULT', 'COLD_BOUNDARY':'1000'}
                        
  • Method 4: Use the Lindorm CLI
    USE test;
    // The CHS parameter specifies the period of time that data can be stored as hot data. The unit of the value of this parameter is seconds. In this example, data is archived to cold storage after the data is stored for one day. You need to set the CHS_L2 parameter to storagetype=COLD.  
    CREATE TABLE dt (  
      p1 integer, p2 integer, c1 varchar, c2 bigint,  constraint pk primary key(p1 desc)) WITH
    (COMPRESSION = 'ZSTD', CHS = '86400', CHS_L2 = 'storagetype=COLD');
    // Change the hot data and cold data boundary.
    ALTER TABLE dt SET 'CHS'='1000';
    
    // Disable the hot data and cold data separation feature.
    ALTER TABLE dt SET 'CHS'='', 'CHS_L2' = '';
                        

Write data

You can write data to a wide table that separately stores hot data and cold data in a manner that is the same as the manner in which you write data to a standard table. By default, the timestamp of data is the current system time when the data is written. If the storage type of your Lindorm instance is the Standard type or Performance type, data is first stored in hot storage. When the age of a row of data exceeds the value of the COLD_BOUNDARY parameter, LindormTable automatically transfers the row of data to cold storage during a major compaction operation. The process is completely transparent to you.

Query data

Lindorm allows you to use one table to store hot data and cold data. This way, you can query hot data and cold data from the same table. You can specify the HOT_ONLY hint to query only hot data if the age of the required data does not exceed the value of the COLD_BOUNDARY parameter. You can also configure the TimeRange parameter to specify the time range in which the required data was written to the table. LindormTable automatically determines whether to scan only hot data, only cold data, or both hot data and cold data based on the time range that you specify. LindormTable takes a longer period of time to query cold data than querying hot data. The throughput of querying cold data is lower than that of querying hot data. For more information, see Introduction to the cold storage feature.

Examples:

  • Use the GET method to query data
    • Method 1: Use Apache HBase Shell
      // In this example, the HOT_ONLY hint is not used. LindormTable may scan cold data.
      HBase(main):013:0> get 'chsTable', 'row1'
      // In this example, the HOT_ONLY hint is used. LindormTable scans only hot data. If row1 that is required in the query is stored in cold storage, no query result is returned.
      HBase(main):015:0> get 'chsTable', 'row1', {HOT_ONLY=>true}
      // In this example, the TIMERANGE parameter is specified. LindormTable determines the scope of data that needs to scanned based on the values of the TIMERANGE parameter and COLD_BOUNDARY parameter. The value of the TIMERANGE parameter is measured in milliseconds.
      HBase(main):016:0> get 'chsTable', 'row1', {TIMERANGE => [0, 1568203111265]}
    • Method 2: Use ApsaraDB for HBase API for Java
      Table table = connection.getTable("chsTable");
      // In this example, the HOT_ONLY hint is not used. LindormTable may scan cold data.
      Get get = new Get("row1".getBytes());
      System.out.println("result: " + table.get(get));
      // In this example, the HOT_ONLY hint is used. LindormTable scans only hot data. If row1 that is required in the query is stored in cold storage, no query result is returned.
      get = new Get("row1".getBytes());
      get.setAttribute(AliHBaseConstants.HOT_ONLY, Bytes.toBytes(true));
      // In this example, the TIMERANGE parameter is specified. LindormTable determines the scope of data that needs to scanned based on the values of the TIMERANGE parameter and COLD_BOUNDARY parameter. The value of the TIMERANGE parameter is measured in milliseconds.
      get = new Get("row1".getBytes());
      get.setTimeRange(0, 1568203111265)
    • Method 3: Use the Lindorm CQL shell
      // The hotdata keyword specifies hot data. You can use the Lindorm CQL shell to query all data or only hot data.
      select hotdata * from tb;
    • Method 4: Use the Lindorm SQL statement
      SELECT /*+ _l_hot_only_ */ * FROM dt WHERE pk IN (1, 2, 3);
      Note You can use a hint in the statement to configure the value of the _l_hot_only_ parameter. For more information, see Use hints to query hot data.
  • Use the SCAN method to query data in a specified range
    Note If you do not specify the HOT_ONLY hint or a time range in the SCAN statement, LindormTable scans both hot data and cold data, merges the results that are obtained from hot storage and cold storage, and then returns the merged result.
    • Method 1: Use Apache HBase Shell
      // In this example, the HOT_ONLY hint is not used. LindormTable scans both hot data and cold data.
      Lindorm(main):017:0> scan 'chsTable', {STARTROW =>'row1', STOPROW=>'row9'}
      // In this example, the HOT_ONLY hint is used. LindormTable scans only hot data.
      Lindorm(main):018:0> scan 'chsTable', {STARTROW =>'row1', STOPROW=>'row9', HOT_ONLY=>true}
      // In this example, the TIMERANGE parameter is specified. LindormTable determines the scope of data that needs to scanned based on the values of the TIMERANGE parameter and COLD_BOUNDARY parameter. The value of the TIMERANGE parameter is measured in milliseconds.
      Lindorm(main):019:0> scan 'chsTable', {STARTROW =>'row1', STOPROW=>'row9', TIMERANGE => [0, 1568203111265]}
    • Method 2: Use ApsaraDB for HBase API for Java
      TableName tableName = TableName.valueOf("chsTable");
      Table table = connection.getTable(tableName);
      // In this example, the HOT_ONLY hint is not used. LindormTable scans both hot data and cold data.
      Scan scan = new Scan();
      ResultScanner scanner = table.getScanner(scan);
      for (Result result : scanner) {
          System.out.println("scan result:" + result);
      }
      // In this example, the HOT_ONLY hint is used. LindormTable scans only hot data.
      scan = new Scan();
      scan.setAttribute(AliLindormConstants.HOT_ONLY, Bytes.toBytes(true));
      // In this example, the TIMERANGE parameter is specified. LindormTable determines the scope of data that needs to scanned based on the values of the TIMERANGE parameter and COLD_BOUNDARY parameter. The value of the TIMERANGE parameter is measured in milliseconds.
      scan = new Scan();
      scan.setTimeRange(0, 1568203111265);
    Note
    • Cold storage is used only to archive data that is rarely accessed. In most cases, we recommend that you specify the HOT_ONLY hint or a time range to query only hot data. If cold data is often queried based on your business requirements, check whether the hot data and cold data boundary is set to an appropriate value.
    • After you update the value of a field in a row that is stored in cold storage, the new value of the field is stored in hot storage after the update. When this row is hit by a query that includes the HOT_ONLY hint or a time range that is used to specify only hot data, only the updated field is returned. If you want LindormTable to return the data of the entire row, you must delete the HOT_ONLY hint from the query statement or make sure that the specified time range covers the time period from the point in time when the row was inserted to the point in time when the row was last updated. Therefore, we recommend that you do not update data stored in the cold storage. If cold data needs to be frequently queried based on your business requirements, check whether the hot data and cold data boundary is set to an appropriate value.

View the sizes of hot data and cold data in a table

Important If no data is stored in the cold storage, data in the table may be stored in the random access memory (RAM). You can run the flush command to flush the data to disks, and then perform a major compaction operation. After the major compaction operation is complete, check the size of cold data.

You can view the sizes of hot data and cold data in a table in the console of the cluster management system. For more information, see Log on to the cluster management system.

Prioritize hot data selection

In scenarios in which a SCAN query is performed to query information such as all orders or chat records of a customer, LindormTable may scan hot data and cold data to query the required data. The query results are paginated based on the timestamps when the data rows were written to the table in descending order. In most cases, hot data appears before cold data. If you do not use the HOT_ONLY hint in a SCAN query, LindormTable scans hot data and cold data. As a result, the query response time increases. If you prioritize hot data selection in a query, LindormTable preferentially scans hot data. LindormTable scans cold data only if you specify it to return more results. For example, you specify LindormTable to return the next page of results. This way, the frequency of cold data access and the response time are reduced.

To enable the hot data prioritization feature, you need to set only the value of the COLD_HOT_MERGE parameter to true in your SCAN query statement. When the value of the COLD_HOT_MERGE feature is set to true, LindormTable preferentially scans hot data. LindormTable scans cold data only if you specify it to return more results.

  • Method 1: Use Apache HBase Shell
    Lindorm(main):002:0> scan 'chsTable', {COLD_HOT_MERGE=>true}
  • Method 2: Use ApsaraDB for HBase API for Java
    scan = new Scan();
    scan.setAttribute(AliHBaseConstants.COLD_HOT_MERGE, Bytes.toBytes(true));
    scanner = table.getScanner(scan);
Note
  • In scenarios in which data in specific fields in a row is updated, the row stores hot data and cold data. If you enable the hot data prioritization feature, the query results are returned in two batches. You can find two results of the same row key in the result set.
  • After the hot data prioritization feature is enable, the row key values of specific returned rows of cold data may be smaller than the row key values of specific returned rows of hot data because LindormTable returns hot data before cold data. The results that are returned for a SCAN query are not sequentially sorted. The rows of hot data and the rows of cold data are separately sorted based on row key values. For information about how returned rows are sorted, see the following sample results. In some scenarios, you can specify a row key to ensure the order of the results of a SCAN query. For example, you use a table to store information about orders that your customers make. You can specify a row key that consists of the column that store customer IDs and the column that stores the order creation time. This way, when you query the orders that were made by a customer, the returned orders are sorted based on the points in time when the orders were created.
// In the following example, the row whose row key value is coldRow stores cold data and the row whose row key value is hotRow stores hot data.
// In most cases, the coldRow row is returned before the hotRow row because rows in ApsaraDB for Lindorm are sorted in lexicographic order. 
HBase(main):001:0> scan 'chsTable'
ROW                                                                COLUMN+CELL
 coldRow                                                              column=f:value, timestamp=1560578400000, value=cold_value
 hotRow                                                               column=f:value, timestamp=1565848800000, value=hot_value
2 row(s)

// If you set the value of the COLD_HOT_MERGE parameter to true, LindormTable scans the row whose row key value is hotRow first. As a result, the hotRow row is returned before the coldRow row.
HBase(main):002:0> scan 'chsTable', {COLD_HOT_MERGE=>true}
ROW                                                                COLUMN+CELL
 hotRow                                                               column=f:value, timestamp=1565848800000, value=hot_value
 coldRow                                                              column=f:value, timestamp=1560578400000, value=cold_value
2 row(s)

Separate hot data and cold data based on the age of data in a column

You can specify a column as the separation key to separate hot data and cold data. This way, LindormTable does not separate hot data and cold data based on the timestamps when data rows were written to the table. The following sample code shows how to create a table in which hot data and cold data are separated based on the age of data in a specified column.

USE test;// Create a table that separately stores hot data and cold data.
CREATE TABLE dt (  
  p1 integer, p2 integer, c1 varchar, c2 bigint,  constraint pk primary key(p1 desc)) 
WITH (COMPRESSION = 'ZSTD', CHS ='86400', CHS_L2 = 'storagetype=COLD');
// Specify the c2 column as the separation key column for hot data and cold data separation. 
ALTER TABLE dt SET 'CHS_COLUMN'='COLUMN=c2'; 
Note
  • You can only execute SQL statements to create tables that separate hot data and cold data based on the age of data in a specified column. You cannot use an HBase shell to create tables that separate hot data and cold data based on the age of data in a specified column.
  • The separation key column must be of the LONG type and values in the column must be 13-digit timestamps. If the specified column does not exist or contains no data or the data type is not LONG, LindormTable does not separate hot data and cold data. In this case, LindormTable writes all data to hot storage.
  • If you specify LindormTable to separate hot data and cold data based on the age of data in the separation key column, LindormTable archives a row of data in cold storage after the age of the data record in this column in the row reaches the specified hot data and cold data boundary. For example, if the difference between a timestamp value in the c2 column and the timestamp of the current point in time is 172,800 seconds, the hot data and cold data boundary is set to 86,400 seconds, and the difference between the timestamp value in the c1 column that stores the timestamp when the row of data is written to the table and the timestamp of the current point in time is 3,600 seconds, LindormTable archives the row of data in cold storage when a major compaction operation is performed.
  • Column-based hot data and cold data separation operations cannot be automatically performed based on the hot data and cold data boundary that you specify. Cold data can be archived only when major compaction operations are performed. If you need to adjust the frequency of hot data and cold data archiving, modify the interval at which major compaction operations are performed.

FAQ

  • When does LindormTable archive cold data into cold storage?

    LindormTable asynchronously transfers cold data from hot storage to cold storage based on the compaction mechanism. By default, LindormTable automatically performs compaction operations at an interval that is equal to half of the hot data and cold data boundary. The minimum interval is 86,400 seconds. If you set the hot data and cold data boundary to 259,200 seconds, LindormTable performs compaction operations at an interval of 129,600 seconds. If you set the hot data and cold data boundary to 86,400 seconds, LindormTable performs compaction operations at an interval of 86,400 seconds.

  • Can I manually trigger a compaction operation?

    Yes, you can manually trigger a compaction operation. You can use an HBase shell to run the major_compact 'tableName' command on a table to trigger a major compaction. This way, cold data is transferred from hot storage to cold storage, and all files in hot storage are merged and all files in cold storage are merged. We recommend that you do not frequently run this command because the execution of this command increases I/O loads.

  • Why is cold data in my table archived at a low rate?

    Make sure that the kernel version of your wide table engine is V2.1.20 or later. If the version is earlier than V2.1.20, upgrade the wide table engine to the latest version. Then, LindormTable archives cold data into cold storage at a specific interval. For more information, see the answer to Question 1. If you do not want to upgrade the version of the wide table engine, you can contact Technical support to help you optimize parameter configurations.

Usage notes

For more information, see Introduction to the cold storage feature.