In a big data scenario, business data such as order or monitoring data grows over time, and data that is rarely accessed is typically archived. Enterprises may want a cost-effective storage medium for this type of data. ApsaraDB for HBase Performance-enhanced Edition supports cold and hot data separation to help enterprises reduce data storage costs. It stores cold data on a separate medium (cold storage), which saves two thirds of the storage costs compared with ultra disks.

ApsaraDB for HBase Performance-enhanced Edition can automatically separate the cold and hot data stored in a table based on user settings. The cold data is automatically archived in the cold storage. When you query a table that has cold and hot data stored separately, you only need to specify query hints or time ranges for the system to determine whether to scan the cold or hot data. This is an automated process which is transparent to users.

How it works

ApsaraDB for HBase Performance-enhanced Edition determines whether the data written into a table is cold data based on the timestamp of the data (in milliseconds) and the time boundary set by users. New data is stored in the hot storage. It ages with time and is finally moved to the cold storage. You can change the time boundary for separating cold and hot data as needed. Data can be moved from the cold storage to hot storage or from the hot storage to cold storage.
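
The rule described above can be modeled as a single comparison. The following Java sketch is an illustration only, not the service's implementation. The class and method names are hypothetical, and it assumes that a cell counts as cold once its age (current time minus cell timestamp, both in milliseconds) exceeds the COLD_BOUNDARY value in seconds.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper that models the cold/hot rule; not part of any ApsaraDB for HBase API. */
final class ColdHotRule {
    private ColdHotRule() {}

    /**
     * Returns true if a cell is older than the boundary.
     *
     * @param cellTimestampMs     cell timestamp in milliseconds
     * @param nowMs               current time in milliseconds
     * @param coldBoundarySeconds time boundary in seconds (the COLD_BOUNDARY value)
     */
    static boolean isCold(long cellTimestampMs, long nowMs, long coldBoundarySeconds) {
        long ageMs = nowMs - cellTimestampMs;
        // Assumption: data is cold only when its age is strictly greater than the boundary.
        return ageMs > TimeUnit.SECONDS.toMillis(coldBoundarySeconds);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long twoDaysAgo = now - TimeUnit.DAYS.toMillis(2);
        long oneHourAgo = now - TimeUnit.HOURS.toMillis(1);
        System.out.println("2 days old, boundary 1 day: cold=" + isCold(twoDaysAgo, now, 86400));
        System.out.println("1 hour old, boundary 1 day: cold=" + isCold(oneHourAgo, now, 86400));
    }
}
```

The sketch only classifies data by age. It does not model when the data is physically moved between the two storage types.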

Use cold and hot data separation

Note: To use cold storage, you must upgrade ApsaraDB for HBase Performance-enhanced Edition to a version later than 2.1.8. The version of the client dependency AliHBase-Connector must be later than 1.0.7/2.0.7. The version of HBase Shell must be later than alihbase-2.0.7-bin.tar.gz.

Before you use the Java API to access ApsaraDB for HBase, follow the steps described in Use the Java API to access ApsaraDB for HBase to install the Java SDK and configure parameters.

Before you use HBase Shell to access ApsaraDB for HBase, follow the steps described in Use HBase Shell to access ApsaraDB for HBase to download and configure HBase Shell.

Activate cold storage

Follow the steps described in Cold storage to activate cold storage for your ApsaraDB for HBase cluster.

Set a time boundary for a table

You can modify the COLD_BOUNDARY parameter to change the time boundary for separating cold and hot data. The time boundary is measured in seconds. For example, if COLD_BOUNDARY => 86400, newly inserted data ages out after 86,400 seconds (1 day) and is then archived as cold data.

You do not need to set the property of the column family to COLD for cold and hot data separation. If you have already set it to COLD, follow the instructions in Cold storage to remove the property.

HBase Shell

// Create a table that has cold and hot data stored separately.
hbase(main):002:0> create 'chsTable', {NAME=>'f', COLD_BOUNDARY=>'86400'}
// Disable cold and hot data separation.
hbase(main):004:0> alter 'chsTable', {NAME=>'f', COLD_BOUNDARY=>""}
// Enable cold and hot data separation for a table or change the time boundary.
hbase(main):005:0> alter 'chsTable', {NAME=>'f', COLD_BOUNDARY=>'86400'}

Java API

// Create a table that has cold and hot data stored separately.
Admin admin = connection.getAdmin();
TableName tableName = TableName.valueOf("chsTable");
HTableDescriptor descriptor = new HTableDescriptor(tableName);
HColumnDescriptor cf = new HColumnDescriptor("f");
// The COLD_BOUNDARY parameter specifies the time boundary in seconds for separating cold and hot data. In this example, new data ages out after 86,400 seconds (1 day) and is then archived as cold data.
cf.setValue(AliHBaseConstants.COLD_BOUNDARY, "86400");
descriptor.addFamily(cf);
admin.createTable(descriptor);

// Disable cold and hot data separation.
// Note: You must run a major compaction before you can move the data from the cold storage to hot storage.
HTableDescriptor descriptor = admin
    .getTableDescriptor(tableName);
HColumnDescriptor cf = descriptor.getFamily("f".getBytes());
// Disable cold and hot data separation.
cf.setValue(AliHBaseConstants.COLD_BOUNDARY, null);
admin.modifyTable(tableName, descriptor);

// Enable cold and hot data separation for a table or change the time boundary.
HTableDescriptor descriptor = admin
    .getTableDescriptor(tableName);
HColumnDescriptor cf = descriptor.getFamily("f".getBytes());
// The COLD_BOUNDARY parameter specifies the time boundary in seconds for separating cold and hot data. In this example, new data ages out after 86,400 seconds (1 day) and is then archived as cold data.
cf.setValue(AliHBaseConstants.COLD_BOUNDARY, "86400");
admin.modifyTable(tableName, descriptor);
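
The examples above pass the boundary as a string of seconds. If you prefer to express the hot retention period in days or hours, a small conversion helper can avoid arithmetic mistakes. The following is a minimal sketch: the class name ColdBoundaryValue and the rule that the period must be positive are assumptions made for this example, not part of the AliHBase-Connector API.

```java
import java.time.Duration;

/** Hypothetical helper; not part of the AliHBase-Connector API. */
final class ColdBoundaryValue {
    private ColdBoundaryValue() {}

    /** Converts a hot retention period to the string of seconds used for COLD_BOUNDARY. */
    static String fromDuration(Duration hotRetention) {
        // Assumption: a zero or negative period is treated as a caller error.
        if (hotRetention.isNegative() || hotRetention.isZero()) {
            throw new IllegalArgumentException("hot retention must be positive");
        }
        return Long.toString(hotRetention.getSeconds());
    }

    public static void main(String[] args) {
        System.out.println(fromDuration(Duration.ofDays(1)));   // prints 86400
        System.out.println(fromDuration(Duration.ofDays(30)));  // prints 2592000
    }
}
```

The returned string could then be passed to the call shown earlier, for example cf.setValue(AliHBaseConstants.COLD_BOUNDARY, ColdBoundaryValue.fromDuration(Duration.ofDays(1))).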

Insert data

You can write to a table that has cold and hot data stored separately in the same way as you write to a regular table. For more information about how to insert data into a table, see Use the Java API to access ApsaraDB for HBase or Use the API of a non-Java language to access ApsaraDB for HBase. The timestamp of data is the time when the data is written into a table. New data is stored in the hot storage (standard disks). When the age of the data exceeds the value specified in the COLD_BOUNDARY parameter, the system automatically moves the data to the cold storage during the major compaction process. The process is completely transparent to users.

Query data

ApsaraDB for HBase Performance-enhanced Edition allows you to use a table to store both cold and hot data. This saves you the effort of dealing with more than one table when you query data. You can set the HOT_ONLY hint in a GET or SCAN statement to query only hot data if you can confirm that the age of the target data is less than the value specified in the COLD_BOUNDARY parameter. You can also set the TimeRange parameter in a GET or SCAN statement to specify the time range of the data to be queried. The system automatically determines whether the target data is hot or cold based on the specified time range. Querying cold data takes more time than querying hot data, and the throughput of reading cold data is lower than that of reading hot data. For more information, see Cold storage.
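
When you query by time range, a range whose lower bound is no older than the time boundary should target hot data only. The following Java sketch computes such a range. It is an illustration under stated assumptions: the class name HotTimeRange is hypothetical, the range is half-open ([min, max)) as in the HBase TimeRange convention, and the sketch ignores the delay between data aging out and the major compaction that moves it.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical value class that holds a time range covering only data younger than the boundary. */
final class HotTimeRange {
    final long minStampMs; // inclusive lower bound, in milliseconds
    final long maxStampMs; // exclusive upper bound, in milliseconds

    private HotTimeRange(long minStampMs, long maxStampMs) {
        this.minStampMs = minStampMs;
        this.maxStampMs = maxStampMs;
    }

    /**
     * Builds the range [now - boundary, now + 1).
     *
     * @param nowMs               current time in milliseconds
     * @param coldBoundarySeconds the COLD_BOUNDARY value of the column family, in seconds
     */
    static HotTimeRange forBoundary(long nowMs, long coldBoundarySeconds) {
        long min = Math.max(0L, nowMs - TimeUnit.SECONDS.toMillis(coldBoundarySeconds));
        long max = nowMs + 1; // exclusive bound, so cells written at nowMs are included
        return new HotTimeRange(min, max);
    }

    public static void main(String[] args) {
        HotTimeRange range = forBoundary(System.currentTimeMillis(), 86400);
        System.out.println("TIMERANGE => [" + range.minStampMs + ", " + range.maxStampMs + "]");
    }
}
```

The two bounds could be passed to setTimeRange on a Get or Scan object, as in the examples in this topic.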

Examples

Get

HBase Shell

// Query data without the HOT_ONLY hint. The query may hit cold data.
hbase(main):013:0> get 'chsTable', 'row1'
// Query data with the HOT_ONLY hint. The query only hits the hot data. If row1 is stored in the cold storage, no query result is returned.
hbase(main):015:0> get 'chsTable', 'row1', {HOT_ONLY=>true}
// Query data within a specified time range. The system determines whether the query hits the cold or hot data based on the TIMERANGE and COLD_BOUNDARY settings. The value of the TIMERANGE parameter is measured in milliseconds.
hbase(main):016:0> get 'chsTable', 'row1', {TIMERANGE => [0, 1568203111265]}

Java API

Table table = connection.getTable(TableName.valueOf("chsTable"));
// Query data without the HOT_ONLY hint. The query may hit cold data.
Get get = new Get("row1".getBytes());
Result result = table.get(get);
System.out.println("Get operation: " + result);
// Query data with the HOT_ONLY hint. The query only hits the hot data. If row1 is stored in the cold storage, no query result is returned.
get = new Get("row1".getBytes());
get.setAttribute(AliHBaseConstants.HOT_ONLY, Bytes.toBytes(true));
// Query data within a specified time range. The system determines whether the query hits the cold or hot data based on the TIMERANGE and COLD_BOUNDARY settings. The value of the TIMERANGE parameter is measured in milliseconds.
get = new Get("row1".getBytes());
get.setTimeRange(0, 1568203111265L);

Scan

Note: If you do not set the HOT_ONLY hint or specify a time range for the SCAN statement, both the cold and hot data are queried. The query results are merged and returned to you. This behavior follows from how the HBase SCAN operation works.

HBase Shell

// Query data without the HOT_ONLY hint. The query hits both the cold and hot data.
hbase(main):017:0> scan 'chsTable', {STARTROW =>'row1', STOPROW=>'row9'}
// Query data with the HOT_ONLY hint. The query only hits the hot data.
hbase(main):018:0> scan 'chsTable', {STARTROW =>'row1', STOPROW=>'row9', HOT_ONLY=>true}
// Query data within a specified time range. The system determines whether the query hits the cold or hot data based on the TIMERANGE and COLD_BOUNDARY settings. The value of the TIMERANGE parameter is measured in milliseconds.
hbase(main):019:0> scan 'chsTable', {STARTROW =>'row1', STOPROW=>'row9', TIMERANGE => [0, 1568203111265]}

Java API

TableName tableName = TableName.valueOf("chsTable");
Table table = connection.getTable(tableName);
// Query data without the HOT_ONLY hint. The query hits both the cold and hot data.
Scan scan = new Scan();
ResultScanner scanner = table.getScanner(scan);
for (Result result : scanner) {
    System.out.println("scan result:" + result);
}
// Query data with the HOT_ONLY hint. The query only hits the hot data.
scan = new Scan();
scan.setAttribute(AliHBaseConstants.HOT_ONLY, Bytes.toBytes(true));
// Query data within a specified time range. The system determines whether the query hits the cold or hot data based on the TIMERANGE and COLD_BOUNDARY settings. The value of the TIMERANGE parameter is measured in milliseconds.
scan = new Scan();
scan.setTimeRange(0, 1568203111265L);

Notes:

1. The cold storage is intended only for archiving data that is rarely accessed. Only a small number of queries should hit the cold data. Most queries should carry the HOT_ONLY hint or specify a time range that targets the hot data only. If your cluster receives a large number of queries that hit the cold data, check whether the time boundary is set appropriately.

2. If you update a field in a row stored in the cold storage, the field is moved to the hot storage after it is updated. When this row is hit by a query that carries the HOT_ONLY hint or has a time range targeting the hot data, only the updated field in the hot storage is returned. If you want the system to return the entire row, you must delete the HOT_ONLY hint from the query or make sure that the specified time range covers the time period from when this row was inserted to when this row was last updated. We recommend that you do not update the data stored in the cold storage. If you need to update the cold data frequently, we recommend that you adjust the time boundary to move the data to the hot storage.
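
The effect described in note 2 can be reproduced with a toy in-memory model. The following Java sketch is not HBase code. The class PartialUpdateDemo, the field names, and the age-based filter are assumptions made for illustration. It shows why a hot-only read returns only the updated field of a row whose other fields are already cold.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

/** Toy in-memory model of one row; hypothetical, not HBase code. */
final class PartialUpdateDemo {
    /** A field value together with its write timestamp in milliseconds. */
    static final class Cell {
        final String value;
        final long timestampMs;

        Cell(String value, long timestampMs) {
            this.value = value;
            this.timestampMs = timestampMs;
        }
    }

    /** Returns the fields that a read sees; with hotOnly=true, fields older than the boundary are skipped. */
    static Map<String, String> read(Map<String, Cell> row, long nowMs,
                                    long coldBoundarySeconds, boolean hotOnly) {
        long boundaryMs = TimeUnit.SECONDS.toMillis(coldBoundarySeconds);
        Map<String, String> visible = new LinkedHashMap<>();
        for (Map.Entry<String, Cell> entry : row.entrySet()) {
            boolean cold = nowMs - entry.getValue().timestampMs > boundaryMs;
            if (!(hotOnly && cold)) {
                visible.put(entry.getKey(), entry.getValue().value);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long tenDaysAgo = now - TimeUnit.DAYS.toMillis(10);
        Map<String, Cell> row = new LinkedHashMap<>();
        row.put("f:status", new Cell("shipped", tenDaysAgo));
        row.put("f:amount", new Cell("42", tenDaysAgo));
        // Update one field of the old row; only this field gets a fresh timestamp.
        row.put("f:status", new Cell("returned", now));

        System.out.println("hot only:  " + read(row, now, 86400, true));
        System.out.println("full read: " + read(row, now, 86400, false));
    }
}
```

In this model, the hot-only read sees only f:status, while the full read sees both fields.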

Query the sizes of cold and hot data

You can check the sizes of cold and hot data in a table on the User tables tab of the cluster management system. If the ColdStorageSize field displays 0, this indicates that the cold data is still stored in RAM. You can run the flush command to flush the data to disks, and then run a major compaction to check the size of the cold data.

Advanced features

Prioritize hot data selection

The system may look up both the cold and hot data for SCAN queries, such as queries that retrieve all the order or chat records of a user. The query results are paginated based on the timestamps of the data in descending order. In most cases, the hot data is displayed ahead of the cold data. If the SCAN queries do not carry the HOT_ONLY hint, the system must scan both the cold and hot data, resulting in performance degradation. If you have prioritized hot data selection, cold data is queried only when you want to view more query results, for example, when you click Next on the page. This minimizes the cold data access frequency and reduces the response time.

To prioritize hot data selection, you only need to set the COLD_HOT_MERGE property to true in your scan query. This feature allows the system to scan hot data first. If you want to view more query results, the system then starts scanning cold data.

HBase Shell

hbase(main):002:0> scan 'chsTable', {COLD_HOT_MERGE=>true}

Java API

scan = new Scan();
scan.setAttribute(AliHBaseConstants.COLD_HOT_MERGE, Bytes.toBytes(true));
scanner = table.getScanner(scan);
			
Notes:

1. When hot data prioritization is enabled and a queried row stores both cold and hot data, the results for that row are returned in two batches. This means that you will find two results for the same rowkey in the result set.

2. The system cannot ensure that the rowkey of the returned cold data is greater than that of the hot data because it returns the hot data first. This means that the results returned upon SCAN queries are not sequentially sorted. However, the records within the returned hot data and within the returned cold data are each still sorted in the rowkey order, as shown in the following demo. In some scenarios, you can sort the SCAN results by designing the rowkeys appropriately. For example, you can create a rowkey that consists of user IDs and order creation times. Records can be retrieved based on user IDs and sorted based on creation times.

// Assume that the rowkey "coldRow" stores cold data and the rowkey "hotRow" stores hot data.
// In most cases, the rowkey "coldRow" is returned ahead of the rowkey "hotRow" because rows in HBase are sorted in lexicographical order.
hbase(main):001:0> scan 'chsTable'
ROW                                                                COLUMN CELL
 coldRow                                                              column=f:value, timestamp=1560578400000, value=cold_value
 hotRow                                                               column=f:value, timestamp=1565848800000, value=hot_value
2 row(s)

// When COLD_HOT_MERGE is set to true, the system scans the rowkey "hotRow" first. As a result, the rowkey "hotRow" is returned ahead of the rowkey "coldRow".
hbase(main):002:0> scan 'chsTable', {COLD_HOT_MERGE=>true}
ROW                                                                COLUMN CELL
 hotRow                                                               column=f:value, timestamp=1565848800000, value=hot_value
 coldRow                                                              column=f:value, timestamp=1560578400000, value=cold_value
2 row(s)
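
The ordering shown in the demo can be reproduced with a small in-memory model. The following Java sketch is not HBase code: the class HotFirstScanDemo and its Row type are hypothetical. It returns the hot entries in rowkey order, followed by the cold entries in rowkey order, and shows that a row with both hot and cold data appears twice.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy model of a hot-first scan; hypothetical, not HBase code. */
final class HotFirstScanDemo {
    /** One returned entry: a rowkey plus the storage type that the entry was read from. */
    static final class Row {
        final String rowkey;
        final boolean cold;

        Row(String rowkey, boolean cold) {
            this.rowkey = rowkey;
            this.cold = cold;
        }

        @Override
        public String toString() {
            return rowkey + (cold ? "(cold)" : "(hot)");
        }
    }

    private static final Comparator<Row> BY_ROWKEY = Comparator.comparing((Row r) -> r.rowkey);

    /** Returns hot entries in rowkey order, followed by cold entries in rowkey order. */
    static List<Row> hotFirstScan(List<Row> stored) {
        List<Row> hot = new ArrayList<>();
        List<Row> cold = new ArrayList<>();
        for (Row row : stored) {
            (row.cold ? cold : hot).add(row);
        }
        hot.sort(BY_ROWKEY);
        cold.sort(BY_ROWKEY);
        hot.addAll(cold);
        return hot;
    }

    public static void main(String[] args) {
        List<Row> stored = new ArrayList<>();
        stored.add(new Row("coldRow", true));
        stored.add(new Row("hotRow", false));
        // mixedRow has an old (cold) part and a recently updated (hot) part.
        stored.add(new Row("mixedRow", true));
        stored.add(new Row("mixedRow", false));
        System.out.println(hotFirstScan(stored));
    }
}
```

Each batch is sorted by rowkey, but the combined result is not, which is why a client that needs a global order has to sort the results itself or rely on the rowkey design.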
			
Separate cold and hot data based on the fields in a rowkey

In addition to the timestamps of KV pairs, ApsaraDB for HBase Performance-enhanced Edition can also separate cold and hot data based on specific fields in a rowkey. For example, it can parse the timestamp field in a rowkey to separate cold and hot data. If separating cold and hot data based on the timestamps of KV pairs cannot meet your requirements, submit a ticket or consult the ApsaraDB for HBase Q&A DingTalk group.

Considerations

For considerations about using cold and hot data separation, see Cold storage.