ApsaraDB for HBase: Hot and cold data separation

Last Updated: Apr 17, 2024

ApsaraDB for HBase Performance-enhanced Edition allows you to store hot and cold data separately on different types of storage media. This improves the efficiency of hot data queries and reduces data storage costs.

Background information

In big data scenarios, business data such as order data or monitoring data grows over time and requires a large amount of storage space. At the same time, a large amount of historical data is archived and rarely accessed. Enterprises need a cost-effective way to store this type of data. ApsaraDB for HBase Performance-enhanced Edition therefore provides the cold and hot data separation feature, which helps enterprises minimize storage costs with simple O&M configuration. The feature stores cold data on a separate medium (cold storage), which can reduce storage costs by two thirds compared with ultra disks.

ApsaraDB for HBase Performance-enhanced Edition can automatically separate the cold and hot data stored in the same table based on the time boundary that you specify. Cold data is automatically archived to the cold storage. You can access a table that stores cold and hot data separately in the same way that you access a standard table. When you query such a table, you only need to specify a query hint or a time range for the system to determine whether to scan cold or hot data.

How it works

ApsaraDB for HBase Performance-enhanced Edition determines whether the data that is written to a table is cold data or hot data based on the timestamp of the data and the specified time boundary. The timestamp is in milliseconds. New data is stored in the hot storage. The data is moved to the cold storage over time. You can change the time boundary for separating cold and hot data based on your business requirements. Data can be moved from the cold storage to the hot storage or from the hot storage to the cold storage.
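The rule described above can be modeled in a few lines of Java. This is a minimal sketch and not the service's implementation: the class name ColdHotClassifier and the method isCold are hypothetical, and the sketch assumes that data counts as cold once its age is strictly greater than the time boundary. It mainly illustrates the unit conversion between timestamps (milliseconds) and the time boundary (seconds). In the service itself, data is physically moved only during a major compaction, so the actual storage location can lag behind this rule.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper that models the cold/hot decision; not part of any HBase API.
final class ColdHotClassifier {
    private ColdHotClassifier() {}

    /**
     * Returns true if a cell with the given timestamp (milliseconds) is older
     * than the time boundary (seconds), measured from nowMs (milliseconds).
     */
    static boolean isCold(long cellTimestampMs, long nowMs, long coldBoundarySeconds) {
        long boundaryMs = TimeUnit.SECONDS.toMillis(coldBoundarySeconds);
        return nowMs - cellTimestampMs > boundaryMs;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long twoDaysAgo = now - TimeUnit.DAYS.toMillis(2);
        System.out.println(isCold(now, now, 86400));        // false
        System.out.println(isCold(twoDaysAgo, now, 86400)); // true
    }
}
```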

Usage notes

For more information about usage notes, see Cold storage.

Usage

To use cold storage, you must upgrade ApsaraDB for HBase Performance-enhanced Edition to V2.1.8 or later. You do not need to modify the client dependencies for reading and writing data. You only need to modify table schemas by using one of the following methods:

Enable cold storage

For more information about how to enable cold storage for a cluster, see Cold storage.

Specify a time boundary for a table

You can modify the COLD_BOUNDARY parameter to change the time boundary for separating cold and hot data. The time boundary is measured in seconds. For example, if COLD_BOUNDARY is set to 86400, data that was written more than 86,400 seconds (one day) ago is archived as cold data.
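Because COLD_BOUNDARY is expressed in seconds while retention periods are usually discussed in hours or days, a small conversion helper can prevent unit mistakes. The following sketch is illustrative only: the class ColdBoundaryValue and its method fromDuration are hypothetical names, and the check that the boundary must be positive is an assumption of this sketch, not a documented rule.

```java
import java.time.Duration;

// Hypothetical helper; only the COLD_BOUNDARY name and its unit (seconds) come from this topic.
final class ColdBoundaryValue {
    private ColdBoundaryValue() {}

    /** Converts a retention period in hot storage to a COLD_BOUNDARY string in seconds. */
    static String fromDuration(Duration retentionInHotStorage) {
        // Assumption of this sketch: a usable boundary is strictly positive.
        if (retentionInHotStorage.isNegative() || retentionInHotStorage.isZero()) {
            throw new IllegalArgumentException("time boundary must be positive");
        }
        return Long.toString(retentionInHotStorage.getSeconds());
    }

    public static void main(String[] args) {
        System.out.println(fromDuration(Duration.ofDays(1)));  // 86400
        System.out.println(fromDuration(Duration.ofDays(30))); // 2592000
    }
}
```

The returned string can be passed as the COLD_BOUNDARY value when you create or alter a column family.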

You do not need to set the property of a column family to COLD for cold and hot data separation. If you have set the property to COLD, remove the property. For more information, see Cold storage.

Shell

# Create a table that separately stores cold and hot data.
hbase(main):002:0> create 'chsTable', {NAME=>'f', COLD_BOUNDARY=>'86400'}
# Disable cold and hot data separation.
hbase(main):004:0> alter 'chsTable', {NAME=>'f', COLD_BOUNDARY=>""}
# Enable cold and hot data separation for a table or change the time boundary. The time boundary is measured in seconds.
hbase(main):005:0> alter 'chsTable', {NAME=>'f', COLD_BOUNDARY=>'86400'}

Java API

// Create a table that separately stores cold and hot data.
Admin admin = connection.getAdmin();
TableName tableName = TableName.valueOf("chsTable");
HTableDescriptor descriptor = new HTableDescriptor(tableName);
HColumnDescriptor cf = new HColumnDescriptor("f");
// The COLD_BOUNDARY parameter specifies the time boundary for separating cold and hot data. Unit: seconds. In this example, new data is archived as cold data after one day.
cf.setValue(AliHBaseConstants.COLD_BOUNDARY, "86400");
descriptor.addFamily(cf);
admin.createTable(descriptor);

// Disable cold and hot data separation.
// Note: Data is moved from the cold storage back to the hot storage only after a major compaction is performed.
descriptor = admin.getTableDescriptor(tableName);
cf = descriptor.getFamily(Bytes.toBytes("f"));
// Set COLD_BOUNDARY to null to disable cold and hot data separation.
cf.setValue(AliHBaseConstants.COLD_BOUNDARY, null);
admin.modifyTable(tableName, descriptor);

// Enable cold and hot data separation for a table or change the time boundary.
descriptor = admin.getTableDescriptor(tableName);
cf = descriptor.getFamily(Bytes.toBytes("f"));
// The COLD_BOUNDARY parameter specifies the time boundary for separating cold and hot data. Unit: seconds. In this example, new data is archived as cold data after one day.
cf.setValue(AliHBaseConstants.COLD_BOUNDARY, "86400");
admin.modifyTable(tableName, descriptor);

Write data

You can write data to a table that stores cold and hot data separately in the same way that you write data to a standard table. For more information, see Use the HBase Java API to access ApsaraDB for HBase Performance-enhanced Edition clusters or Use the multi-language API to access ApsaraDB for HBase Performance-enhanced Edition clusters. The timestamp of the data is the time when the data is written to the table. New data is stored in the hot storage (standard disks). When the age of the data exceeds the value of the COLD_BOUNDARY parameter, the system automatically moves the data to the cold storage during a major compaction. This process is transparent to users.

Query data

ApsaraDB for HBase Performance-enhanced Edition stores cold and hot data in the same table, so you query only one table. To query only data that is newer than the time boundary specified by COLD_BOUNDARY, configure the HOT_ONLY hint in a GET or SCAN statement. You can also configure the TimeRange parameter in a GET or SCAN statement to specify the time range of the data that you want to query. The system then determines whether to read hot data or cold data based on the time range. Queries on cold data take longer and deliver lower read throughput than queries on hot data.
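To reason about which storage a time range is likely to touch, the relationship between TimeRange and COLD_BOUNDARY can be modeled as follows. This is an illustrative sketch under stated assumptions, not the service's decision logic, which this topic does not describe in detail. The class TimeRangePlanner and its methods are hypothetical. The sketch assumes that data older than the current time minus the boundary is cold, that minStampMs is less than maxStampMs, and that the range is half-open [minStampMs, maxStampMs) as in HBase time ranges.

```java
import java.util.EnumSet;
import java.util.Set;

// Illustrative model only; it does not call HBase.
final class TimeRangePlanner {
    enum Tier { HOT, COLD }

    private TimeRangePlanner() {}

    /** Oldest timestamp (ms) that still counts as hot in this model. */
    static long hotOnlyMinStamp(long nowMs, long coldBoundarySeconds) {
        return nowMs - coldBoundarySeconds * 1000L;
    }

    /** Returns the tiers that a query over [minStampMs, maxStampMs) may touch. */
    static Set<Tier> tiersFor(long minStampMs, long maxStampMs,
                              long nowMs, long coldBoundarySeconds) {
        long cutoffMs = hotOnlyMinStamp(nowMs, coldBoundarySeconds);
        Set<Tier> tiers = EnumSet.noneOf(Tier.class);
        if (minStampMs < cutoffMs) {
            tiers.add(Tier.COLD); // the range includes timestamps older than the cutoff
        }
        if (maxStampMs > cutoffMs) {
            tiers.add(Tier.HOT);  // the range includes timestamps at or after the cutoff
        }
        return tiers;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long cutoff = hotOnlyMinStamp(now, 86400);
        System.out.println(tiersFor(cutoff, now + 1, now, 86400)); // [HOT]
        System.out.println(tiersFor(0, now + 1, now, 86400));      // [HOT, COLD]
    }
}
```

In this model, a time range whose lower bound is at or after the value returned by hotOnlyMinStamp touches only hot data, which is the effect that the HOT_ONLY hint is meant to achieve.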

Examples

Get

  • Shell

    # In this example, the HOT_ONLY hint is not used. The system may scan cold data.
    hbase(main):013:0> get 'chsTable', 'row1'
    # In this example, the HOT_ONLY hint is used. The system scans only hot data. If row1 is stored in the cold storage, no query result is returned.
    hbase(main):015:0> get 'chsTable', 'row1', {HOT_ONLY=>true}
    # In this example, the TimeRange parameter is specified. The system determines the scope of data that needs to be scanned based on the values of TimeRange and COLD_BOUNDARY. The value of TIMERANGE is measured in milliseconds.
    hbase(main):016:0> get 'chsTable', 'row1', {TIMERANGE => [0, 1568203111265]}
  • Java

    Table table = connection.getTable(TableName.valueOf("chsTable"));
    // In this example, the HOT_ONLY hint is not used. The system may scan cold data.
    Get get = new Get(Bytes.toBytes("row1"));
    System.out.println("result: " + table.get(get));
    // In this example, the HOT_ONLY hint is used. The system scans only hot data. If row1 is stored in the cold storage, no query result is returned.
    get = new Get(Bytes.toBytes("row1"));
    get.setAttribute(AliHBaseConstants.HOT_ONLY, Bytes.toBytes(true));
    // In this example, the TimeRange parameter is specified. The system determines the scope of data that needs to be scanned based on the values of TimeRange and COLD_BOUNDARY. The value of TIMERANGE is measured in milliseconds.
    get = new Get(Bytes.toBytes("row1"));
    get.setTimeRange(0, 1568203111265L);

Scan

If you do not configure the HOT_ONLY hint or a time range for the SCAN statement, both the cold data and hot data are queried. The query results are merged and returned based on how the SCAN operation of ApsaraDB for HBase works.

  • Shell

    # In this example, the HOT_ONLY hint is not used. The system scans both hot data and cold data.
    hbase(main):017:0> scan 'chsTable', {STARTROW =>'row1', STOPROW=>'row9'}
    # In this example, the HOT_ONLY hint is used. The system scans only hot data.
    hbase(main):018:0> scan 'chsTable', {STARTROW =>'row1', STOPROW=>'row9', HOT_ONLY=>true}
    # In this example, the TimeRange parameter is specified. The system determines the scope of data that needs to be scanned based on the values of TimeRange and COLD_BOUNDARY. The value of TIMERANGE is measured in milliseconds.
    hbase(main):019:0> scan 'chsTable', {STARTROW =>'row1', STOPROW=>'row9', TIMERANGE => [0, 1568203111265]}
  • Java

    TableName tableName = TableName.valueOf("chsTable");
    Table table = connection.getTable(tableName);
    // In this example, the HOT_ONLY hint is not used. The system scans both hot data and cold data.
    Scan scan = new Scan();
    ResultScanner scanner = table.getScanner(scan);
    for (Result result : scanner) {
        System.out.println("scan result:" + result);
    }
    // In this example, the HOT_ONLY hint is used. The system scans only hot data.
    scan = new Scan();
    scan.setAttribute(AliHBaseConstants.HOT_ONLY, Bytes.toBytes(true));
    // In this example, the TimeRange parameter is specified. The system determines the scope of data that needs to be scanned based on the values of TimeRange and COLD_BOUNDARY. The value of TIMERANGE is measured in milliseconds.
    scan = new Scan();
    scan.setTimeRange(0, 1568203111265L);
Note
  1. The cold storage is used only to archive data that is rarely accessed. In most cases, we recommend that you specify the HOT_ONLY hint or a time range to query only hot data. If your cluster receives a large number of queries that hit cold data, you can check whether the time boundary is set to an appropriate value.

  2. If you update a field in a row that is stored in the cold storage, the updated field is stored in the hot storage. When the row is hit by a query that uses the HOT_ONLY hint or a time range that covers only hot data, only the updated field in the hot storage is returned. To have the system return the entire row, remove the HOT_ONLY hint from the query or make sure that the specified time range covers the period from when the row was inserted to when it was last updated. We recommend that you do not update data that is stored in the cold storage. If you need to frequently update cold data, we recommend that you adjust the time boundary to move the data to the hot storage.
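The effect described in the second note can be illustrated with a toy model. The sketch below does not call HBase. The class PartialUpdateModel, its fields, and the column names are hypothetical and only mimic the behavior described above: an update lands in the hot storage, and a hot-only read cannot see the fields that remain in the cold storage.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of one row whose fields are split across storage tiers; not an HBase API.
final class PartialUpdateModel {
    final Map<String, String> coldCells = new LinkedHashMap<>();
    final Map<String, String> hotCells = new LinkedHashMap<>();

    /** An update writes a new cell, which lands in the hot storage. */
    void update(String column, String value) {
        hotCells.put(column, value);
    }

    /** Models a read of the row; with hotOnly set, cold cells are invisible. */
    Map<String, String> read(boolean hotOnly) {
        Map<String, String> row = new LinkedHashMap<>();
        if (!hotOnly) {
            row.putAll(coldCells);
        }
        row.putAll(hotCells); // newer hot cells override older cold cells
        return row;
    }

    public static void main(String[] args) {
        PartialUpdateModel row = new PartialUpdateModel();
        row.coldCells.put("f:status", "created");
        row.coldCells.put("f:amount", "100");
        row.update("f:status", "refunded");
        System.out.println(row.read(true));  // {f:status=refunded}
        System.out.println(row.read(false)); // {f:status=refunded, f:amount=100}
    }
}
```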

Query the sizes of cold and hot data

You can check the sizes of cold and hot data in a table on the User tables tab of ClusterManager. For more information, see Cluster management system.

Note

If the cold storage shows no data even though the table contains data older than the time boundary, the data may still be held in memory (RAM). Run the flush command to flush the data to disks, and then perform a major compaction. After the major compaction is complete, check the size of cold data again.

Prioritize hot data selection

When a SCAN query retrieves information such as all orders or chat records of a customer, the system may need to scan both hot data and cold data. The query results are typically paginated in descending order of the time when the rows were written, so hot data usually appears before cold data. If you do not use the HOT_ONLY hint in a SCAN query, the system scans both hot data and cold data, which increases the response time. If you prioritize hot data selection, the system scans hot data first and queries cold data only when more results are requested, for example, when a user clicks the next page icon. This reduces both the frequency of cold data access and the response time.

To prioritize hot data selection, you only need to set the COLD_HOT_MERGE parameter to true in a SCAN query. The system then scans hot data first and scans cold data only when more results are requested.

Shell

hbase(main):002:0> scan 'chsTable', {COLD_HOT_MERGE=>true}

Java

scan = new Scan();
scan.setAttribute(AliHBaseConstants.COLD_HOT_MERGE, Bytes.toBytes(true));
scanner = table.getScanner(scan);
Note
  • If only specific fields in a row are updated, the row contains both hot data and cold data. If you enable the hot data prioritization feature, the results for this row are returned in two batches, so the result set can contain two results with the same row key.

  • After the hot data prioritization feature is enabled, the system returns hot data before cold data. As a result, a returned row of cold data may have a smaller row key than a row of hot data that is returned before it, and the results of a SCAN query are not globally sorted by row key. The rows of hot data and the rows of cold data are each sorted by row key. For more information about how returned rows are sorted, see the following sample results. In some scenarios, you can design the row key to preserve the order of SCAN results. For example, in a table that stores orders, you can use a row key that consists of the customer ID followed by the order creation time. This way, when you query the orders of a customer, the returned orders are sorted by creation time.

# In this example, the row whose row key value is coldRow stores cold data and the row whose row key value is hotRow stores hot data.
# In most cases, the row whose row key value is coldRow is returned before the row whose row key value is hotRow because rows in ApsaraDB for HBase are sorted in lexicographical order.
hbase(main):001:0> scan 'chsTable'
ROW                                                                COLUMN+CELL
 coldRow                                                              column=f:value, timestamp=1560578400000, value=cold_value
 hotRow                                                               column=f:value, timestamp=1565848800000, value=hot_value
2 row(s)

# If you set COLD_HOT_MERGE to true, the system scans the row whose row key value is hotRow first. As a result, the row whose row key value is hotRow is returned before the row whose row key value is coldRow.
hbase(main):002:0> scan 'chsTable', {COLD_HOT_MERGE=>true}
ROW                                                                COLUMN+CELL
 hotRow                                                               column=f:value, timestamp=1565848800000, value=hot_value
 coldRow                                                              column=f:value, timestamp=1560578400000, value=cold_value
2 row(s)
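The ordering shown in the sample results can be reproduced with a small model. The sketch below does not call HBase. The class ColdHotMergeOrder and its methods are hypothetical, and Java string ordering stands in for the byte-wise lexicographical order of row keys, which is equivalent for ASCII keys. It also shows why a row that has data in both storages appears twice when hot data is prioritized.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Toy model of SCAN result ordering; not an HBase API.
final class ColdHotMergeOrder {
    private ColdHotMergeOrder() {}

    /** Default scan: hot and cold rows merged into one lexicographical order. */
    static List<String> defaultOrder(List<String> hotRows, List<String> coldRows) {
        TreeSet<String> all = new TreeSet<>(hotRows);
        all.addAll(coldRows); // a row key present in both tiers is merged into one result
        return new ArrayList<>(all);
    }

    /** Hot-first scan: sorted hot rows, followed by sorted cold rows. */
    static List<String> hotFirstOrder(List<String> hotRows, List<String> coldRows) {
        List<String> out = new ArrayList<>(new TreeSet<>(hotRows));
        out.addAll(new TreeSet<>(coldRows)); // a row key present in both tiers appears twice
        return out;
    }

    public static void main(String[] args) {
        List<String> hot = Arrays.asList("hotRow");
        List<String> cold = Arrays.asList("coldRow");
        System.out.println(defaultOrder(hot, cold));  // [coldRow, hotRow]
        System.out.println(hotFirstOrder(hot, cold)); // [hotRow, coldRow]
    }
}
```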