ApsaraDB for HBase:Cold and hot data separation

Last Updated:Mar 14, 2024

ApsaraDB for HBase Performance-enhanced Edition allows you to separate cold data from hot data in the same table. After you specify a time boundary for a table, the system automatically stores cold data in cold storage and hot data in hot storage.

Background information

In a big data scenario, business data such as order data or monitoring data grows over time. As your business develops, rarely used data is archived. Enterprises may want to use cost-effective storage to store this type of data to reduce costs. ApsaraDB for HBase Performance-enhanced Edition requires low O&M costs and supports cold and hot data separation to help enterprises reduce data storage costs. ApsaraDB for HBase Performance-enhanced Edition uses a new medium (cold storage) to store cold data, which allows you to reduce storage costs by two thirds compared with using ultra disks.

ApsaraDB for HBase Performance-enhanced Edition can automatically separate cold and hot data stored in the same table based on the time boundary that you specify. Cold data is automatically archived in the cold storage. You can access a table that separately stores cold and hot data in the same way that you access a standard table. When you query such a table, you need only to specify a query hint or a time range for the system to determine whether to scan cold data, hot data, or both. The separation itself is automatic and transparent to users.

For more information, see the Cost optimization for large amounts of data - Use ApsaraDB for HBase to separate cold and hot data topic in Yunqi Community.

How it works

ApsaraDB for HBase Performance-enhanced Edition determines whether the data that is written to a table is cold data based on the timestamp of the data and the time boundary that is specified by users. The timestamp is in milliseconds. New data is stored in the hot storage. The data is moved to the cold storage over time. You can change the time boundary for separating cold and hot data based on your business requirements. Data can be moved from the cold storage to the hot storage or from the hot storage to the cold storage.
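
The comparison described above can be sketched in plain Java. This is only an illustration, not the server implementation: the class and method names are hypothetical, and the rule "data is cold once its age exceeds the boundary" is an assumption based on the description in this topic. The sketch mainly shows the unit conversion between the boundary (seconds) and the timestamp (milliseconds).

```java
final class ColdHotSketch {
    /**
     * Assumed rule: a cell is cold once its age exceeds the time boundary.
     * The boundary is given in seconds, timestamps are in milliseconds.
     */
    static boolean isCold(long cellTimestampMs, long coldBoundarySeconds, long nowMs) {
        long ageMs = nowMs - cellTimestampMs;
        return ageMs > coldBoundarySeconds * 1000L;
    }

    public static void main(String[] args) {
        long nowMs = 1_568_203_111_265L;
        long boundarySeconds = 86_400L; // one day

        // A cell written one hour ago is still hot.
        System.out.println(isCold(nowMs - 3_600_000L, boundarySeconds, nowMs));   // false
        // A cell written two days ago is cold.
        System.out.println(isCold(nowMs - 172_800_000L, boundarySeconds, nowMs)); // true
    }
}
```

Note that in the real system the data is physically moved only during a major compaction, so a cell that qualifies as cold may remain in the hot storage for some time.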

Procedure

To use cold storage, you must upgrade ApsaraDB for HBase Performance-enhanced Edition to V2.1.8 or later. The version of the client dependency AliHBase-Connector must be later than 1.0.7/2.0.7. The version of HBase Shell must be later than alihbase-2.0.7-bin.tar.gz.

Before you use the Java API, install the SDK for Java and configure the parameters based on the steps in Use the HBase Java API to access ApsaraDB for HBase Performance-enhanced Edition clusters.

Before you use HBase Shell, follow the steps in Use HBase Shell to access an ApsaraDB for HBase Performance-enhanced Edition instance to download and configure HBase Shell.

Activate cold storage

Activate cold storage for the cluster that you want to manage. For more information, see Cold storage.

Specify a time boundary for a table

You can modify the COLD_BOUNDARY parameter to change the time boundary for separating cold and hot data. The time boundary is measured in seconds. For example, if COLD_BOUNDARY is set to 86400, new data is archived as cold data after 86,400 seconds, which is equal to one day.
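
Because the parameter takes a string that contains a number of seconds, it may help to derive the value from a duration instead of hard-coding it. The following is a minimal sketch that uses only the standard java.time API. The class and method names are hypothetical and are not part of the AliHBase client.

```java
import java.time.Duration;

final class ColdBoundaryValue {
    /** Converts a retention period into the string form used for COLD_BOUNDARY (seconds). */
    static String fromDuration(Duration hotRetention) {
        if (hotRetention.isZero() || hotRetention.isNegative()) {
            throw new IllegalArgumentException("hot retention must be positive");
        }
        return Long.toString(hotRetention.getSeconds());
    }

    public static void main(String[] args) {
        System.out.println(fromDuration(Duration.ofDays(1)));  // 86400
        System.out.println(fromDuration(Duration.ofDays(30))); // 2592000
    }
}
```

The returned string can then be passed wherever the examples in this topic use the literal "86400".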

You do not need to set the property of the column family to COLD for cold and hot data separation. If you have set the property to COLD, remove the property. For more information, see Cold storage.

Shell

// Create a table that separately stores cold and hot data.
hbase(main):002:0> create 'chsTable', {NAME=>'f', COLD_BOUNDARY=>'86400'}
// Disable cold and hot data separation.
hbase(main):004:0> alter 'chsTable', {NAME=>'f', COLD_BOUNDARY=>""}
// Enable cold and hot data separation for a table or change the time boundary. The time boundary is measured in seconds.
hbase(main):005:0> alter 'chsTable', {NAME=>'f', COLD_BOUNDARY=>'86400'}

Java API

// Create a table that separately stores cold and hot data.
Admin admin = connection.getAdmin();
TableName tableName = TableName.valueOf("chsTable");
HTableDescriptor descriptor = new HTableDescriptor(tableName);
HColumnDescriptor cf = new HColumnDescriptor("f");
// The COLD_BOUNDARY parameter specifies the time boundary for separating cold and hot data. Unit: seconds. In this example, new data is archived as cold data after one day.
cf.setValue(AliHBaseConstants.COLD_BOUNDARY, "86400");
descriptor.addFamily(cf);
admin.createTable(descriptor);

// Disable cold and hot data separation.
// Note: Data is moved from the cold storage back to the hot storage only after a major compaction is performed.
HTableDescriptor descriptor = admin
    .getTableDescriptor(tableName);
HColumnDescriptor cf = descriptor.getFamily("f".getBytes());
// Disable cold and hot data separation.
cf.setValue(AliHBaseConstants.COLD_BOUNDARY, null);
admin.modifyTable(tableName, descriptor);

// Enable cold and hot data separation for a table or change the time boundary.
HTableDescriptor descriptor = admin
    .getTableDescriptor(tableName);
HColumnDescriptor cf = descriptor.getFamily("f".getBytes());
// The COLD_BOUNDARY parameter specifies the time boundary for separating cold and hot data. Unit: seconds. In this example, new data is archived as cold data after one day.
cf.setValue(AliHBaseConstants.COLD_BOUNDARY, "86400");
admin.modifyTable(tableName, descriptor);

Write data

You can write data to a table that separately stores cold and hot data in the same way that you write data to a standard table. For more information, see Use the HBase Java API to access ApsaraDB for HBase Performance-enhanced Edition clusters or Use the multi-language API to access ApsaraDB for HBase Performance-enhanced Edition clusters. The timestamp of the data is the time when the data is written to a table. New data is stored in the hot storage (standard disks). If the storage duration of the data exceeds the value specified by the COLD_BOUNDARY parameter, the system automatically moves the data to the cold storage during the major compaction process. The process is completely transparent to users.

Read data

ApsaraDB for HBase stores cold and hot data in the same table, so you need to query only one table. If you want to query data whose storage duration is less than the value specified by the COLD_BOUNDARY parameter, you can configure the HOT_ONLY hint in a GET or SCAN statement to query only hot data. You can also configure the TimeRange parameter in a GET or SCAN statement to specify the time range of the data that you want to query. The system automatically determines whether the data that you want to query is hot or cold based on the time range that you specify. Queries on cold data have higher latency and lower throughput than queries on hot data.
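
To reason about whether a time range is likely to reach cold data, you can compare its lower bound with the current time minus the boundary. The sketch below is an assumption-based illustration in plain Java, not the logic that the server runs. The class and method names are hypothetical. It treats the range as half-open, [minMs, maxMs), which is how HBase time ranges are defined.

```java
final class HotTimeRange {
    /** Oldest timestamp (ms) that is still hot, assuming data turns cold once its age exceeds the boundary. */
    static long hotRangeStartMs(long nowMs, long coldBoundarySeconds) {
        return Math.max(0L, nowMs - coldBoundarySeconds * 1000L);
    }

    /** Returns true if the half-open range [minMs, maxMs) can include timestamps older than the hot range. */
    static boolean mayTouchCold(long minMs, long maxMs, long nowMs, long coldBoundarySeconds) {
        return minMs < maxMs && minMs < hotRangeStartMs(nowMs, coldBoundarySeconds);
    }

    public static void main(String[] args) {
        long nowMs = 1_568_203_111_265L;
        long boundarySeconds = 86_400L; // one day
        long hotStartMs = hotRangeStartMs(nowMs, boundarySeconds);

        System.out.println(hotStartMs);                                               // 1568116711265
        System.out.println(mayTouchCold(0L, nowMs, nowMs, boundarySeconds));          // true
        System.out.println(mayTouchCold(hotStartMs, nowMs + 1, nowMs, boundarySeconds)); // false
    }
}
```

Under these assumptions, passing hotStartMs and nowMs + 1 to setTimeRange would restrict a query to data that is still within the hot period.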

Examples

Get

Shell

// The query that does not contain the HOT_ONLY hint may hit cold data.
hbase(main):013:0> get 'chsTable', 'row1'
// The query that contains the HOT_ONLY hint hits only hot data. If row1 is stored in the cold storage, no query result is returned.
hbase(main):015:0> get 'chsTable', 'row1', {HOT_ONLY=>true}
// Query data within a time range that is specified by the TIMERANGE parameter. The system determines whether the query hits cold or hot data based on the values of the TIMERANGE and COLD_BOUNDARY parameters. The value of the TIMERANGE parameter is measured in milliseconds.
hbase(main):016:0> get 'chsTable', 'row1', {TIMERANGE => [0, 1568203111265]}

Java

Table table = connection.getTable(TableName.valueOf("chsTable"));
// The query that does not contain the HOT_ONLY hint may hit cold data.
Get get = new Get("row1".getBytes());
System.out.println("result: " + table.get(get));
// The query that contains the HOT_ONLY hint hits only hot data. If row1 is stored in the cold storage, no query result is returned.
get = new Get("row1".getBytes());
get.setAttribute(AliHBaseConstants.HOT_ONLY, Bytes.toBytes(true));
// Query data within a time range that is specified by the TIMERANGE parameter. The system determines whether the query hits cold or hot data based on the values of the TIMERANGE and COLD_BOUNDARY parameters. The value of the TIMERANGE parameter is measured in milliseconds.
get = new Get("row1".getBytes());
get.setTimeRange(0, 1568203111265L);

Scan

If you do not configure the HOT_ONLY hint or a time range for the SCAN statement, cold data and hot data are queried. The query results are merged and returned based on how the SCAN operation of ApsaraDB for HBase works.

Shell

// The query that does not contain the HOT_ONLY hint hits cold and hot data.
hbase(main):017:0> scan 'chsTable', {STARTROW =>'row1', STOPROW=>'row9'}
// The query that contains the HOT_ONLY hint hits only hot data.
hbase(main):018:0> scan 'chsTable', {STARTROW =>'row1', STOPROW=>'row9', HOT_ONLY=>true}
// Query data within a time range that is specified by the TIMERANGE parameter. The system determines whether the query hits cold or hot data based on the values of the TIMERANGE and COLD_BOUNDARY parameters. The value of the TIMERANGE parameter is measured in milliseconds.
hbase(main):019:0> scan 'chsTable', {STARTROW =>'row1', STOPROW=>'row9', TIMERANGE => [0, 1568203111265]}

Java

TableName tableName = TableName.valueOf("chsTable");
Table table = connection.getTable(tableName);
// The query that does not contain the HOT_ONLY hint hits cold and hot data.
Scan scan = new Scan();
ResultScanner scanner = table.getScanner(scan);
for (Result result : scanner) {
    System.out.println("scan result:" + result);
}
// The query that contains the HOT_ONLY hint hits only hot data.
scan = new Scan();
scan.setAttribute(AliHBaseConstants.HOT_ONLY, Bytes.toBytes(true));
// Query data within a time range that is specified by the TIMERANGE parameter. The system determines whether the query hits cold or hot data based on the values of the TIMERANGE and COLD_BOUNDARY parameters. The value of the TIMERANGE parameter is measured in milliseconds.
scan = new Scan();
scan.setTimeRange(0, 1568203111265L);

Note
  1. The cold storage is used only to archive data that is rarely accessed. Only a small share of queries should hit cold data. We recommend that you configure the HOT_ONLY hint or a time range for most queries so that they hit only hot data. If your cluster receives a large number of queries that hit cold data, check whether the time boundary is set to an appropriate value.

  2. If you update a field in a row that is stored in the cold storage, the field is moved to the hot storage after the update. When this row is hit by a query that carries the HOT_ONLY hint or has a time range that is configured to hit hot data, only the updated field in the hot storage is returned. If you want the system to return the entire row, you must delete the HOT_ONLY hint from the query statement or make sure that the specified time range covers the time period from when this row is inserted to when this row is last updated. We recommend that you do not update data that is stored in the cold storage. If you need to frequently update cold data, we recommend that you adjust the time boundary to move the data to the hot storage.
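
The partial-update behavior in item 2 can be illustrated with a small in-memory model. This is a simplified sketch and does not use the HBase client: the Cell class, the column names, and the cutoff value are hypothetical, and a hot-only read is modeled as "return only cells whose timestamp is at or after the hot range start".

```java
import java.util.Map;
import java.util.TreeMap;

final class PartialUpdateSketch {
    /** Hypothetical stand-in for a stored cell: a value plus its write timestamp in milliseconds. */
    static final class Cell {
        final String value;
        final long timestampMs;

        Cell(String value, long timestampMs) {
            this.value = value;
            this.timestampMs = timestampMs;
        }
    }

    /** Models a read that only sees cells written at or after hotStartMs. */
    static Map<String, String> visibleCells(Map<String, Cell> row, long hotStartMs) {
        Map<String, String> visible = new TreeMap<>();
        for (Map.Entry<String, Cell> entry : row.entrySet()) {
            if (entry.getValue().timestampMs >= hotStartMs) {
                visible.put(entry.getKey(), entry.getValue().value);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        long hotStartMs = 1_568_116_711_265L;
        Map<String, Cell> row = new TreeMap<>();
        row.put("f:amount", new Cell("42", hotStartMs - 1_000_000L)); // old field, cold
        row.put("f:status", new Cell("shipped", hotStartMs + 1_000L)); // recently updated, hot

        System.out.println(visibleCells(row, hotStartMs)); // {f:status=shipped}
        System.out.println(visibleCells(row, 0L));         // {f:amount=42, f:status=shipped}
    }
}
```

The first call corresponds to a query that is restricted to hot data and returns only the updated field. The second call corresponds to a query whose time range covers the whole lifetime of the row.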

Query the sizes of cold and hot data

You can check the sizes of cold and hot data in a table on the User tables tab of the cluster management system. If the ColdStorageSize field displays 0, the cold data may still be stored in the memory. Run the flush command to flush the data to disks, perform a major compaction, and then check the size of the cold data again.

Advanced features

Prioritize hot data selection

The system may look up cold and hot data for SCAN queries, for example, queries that are submitted to retrieve all order or chat records. The query results are paginated based on the timestamps of the data in descending order. In most cases, hot data appears before cold data. If the SCAN queries do not carry the HOT_ONLY hint, the system must scan cold and hot data. As a result, the query response time increases. If you prioritize hot data selection, cold data is queried only if you want to view more query results. For example, you can click the next page icon on the page if you want the system to return more results. This way, the frequency of cold data access is minimized and the response time is reduced.

To prioritize hot data selection, you need only to set the COLD_HOT_MERGE property to true in your SCAN query. This feature allows the system to scan hot data first. If you want to view more query results, the system starts scanning cold data.

Shell

hbase(main):002:0> scan 'chsTable', {COLD_HOT_MERGE=>true}

Java

scan = new Scan();
scan.setAttribute(AliHBaseConstants.COLD_HOT_MERGE, Bytes.toBytes(true));
scanner = table.getScanner(scan);

Note
  • In scenarios in which only specific columns of a row are updated, the row stores cold and hot data. If you scan this row after you enable hot data prioritization, the query results are returned in two batches. You can find two results for the same rowkey in the result set.

  • The system cannot ensure that the rowkey of the returned cold data is greater than the rowkey of the hot data because the system returns hot data before cold data. This indicates that the results that are returned for SCAN queries are not sequentially sorted. However, the records in the returned cold or hot data are sorted in the rowkey order, as shown in the following demo. In some scenarios, you can sort the SCAN results by creating appropriate rowkeys. For example, for a table of order records, you can create a rowkey that consists of the user ID and the order creation time. This way, you can retrieve records based on the user ID and sort records based on the creation time.

// In this example, the row whose rowkey is "coldRow" stores cold data and the row whose rowkey is "hotRow" stores hot data.
// In most cases, the row whose rowkey is "coldRow" is returned before the row whose rowkey is "hotRow" because rows in ApsaraDB for HBase are sorted in lexicographical order. 
hbase(main):001:0> scan 'chsTable'
ROW                                                                COLUMN+CELL
 coldRow                                                              column=f:value, timestamp=1560578400000, value=cold_value
 hotRow                                                               column=f:value, timestamp=1565848800000, value=hot_value
2 row(s)

// If you set COLD_HOT_MERGE to true, the system scans the row whose rowkey is "hotRow" first. As a result, the row whose rowkey is "hotRow" is returned before the row whose rowkey is "coldRow".
hbase(main):002:0> scan 'chsTable', {COLD_HOT_MERGE=>true}
ROW                                                                COLUMN+CELL
 hotRow                                                               column=f:value, timestamp=1565848800000, value=hot_value
 coldRow                                                              column=f:value, timestamp=1560578400000, value=cold_value
2 row(s)
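
The rowkey design mentioned in the note (user ID plus order creation time) can be sketched as follows. This is one possible layout under stated assumptions, not a prescribed format: the class name, the underscore separator, and the 13-digit zero-padded timestamp are illustrative choices. Fixed-width padding keeps the lexicographic order of the keys aligned with chronological order for the same user, so results that arrive hot-first can be re-sorted on the client by rowkey.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class OrderRowkeySketch {
    /** Builds "<userId>_<creation time in ms, zero-padded to 13 digits>". */
    static String rowkey(String userId, long createdMs) {
        return String.format(Locale.ROOT, "%s_%013d", userId, createdMs);
    }

    public static void main(String[] args) {
        // Simulated arrival order with hot data prioritized: the newer (hot) record comes first.
        List<String> arrived = new ArrayList<>();
        arrived.add(rowkey("user42", 1_565_848_800_000L)); // hot
        arrived.add(rowkey("user42", 1_560_578_400_000L)); // cold

        // Sorting by rowkey restores chronological order for the user.
        Collections.sort(arrived);
        System.out.println(arrived); // [user42_1560578400000, user42_1565848800000]
    }
}
```

This layout assumes that user IDs do not contain the separator and that all keys in a comparison share the same user ID prefix.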

Separate cold and hot data based on the fields in a rowkey

In addition to the timestamps of key-value pairs, ApsaraDB for HBase Performance-enhanced Edition can separate cold and hot data based on specific fields in a rowkey. For example, ApsaraDB for HBase Performance-enhanced Edition can parse the timestamp field in a rowkey to separate cold and hot data. If the separation of cold and hot data based on the timestamps of key-value pairs cannot meet your requirements, submit a ticket or consult the ApsaraDB for HBase Q&A DingTalk group.

Precautions

Take note of the precautions in Cold storage.