Background information

In big data scenarios, business data such as order or monitoring data grows over time. As the business develops, data that is rarely accessed is archived, and enterprises want a cost-effective storage medium for it to reduce costs. ApsaraDB for HBase Performance-enhanced Edition supports cold and hot data separation to help enterprises save on data storage: cold data is stored on a dedicated low-cost medium (cold storage), which can cut storage costs by about two thirds compared with ultra disks.

ApsaraDB for HBase Performance-enhanced Edition separates cold and hot data within the same table. The system automatically archives the cold data in the table to cold storage based on the time boundary that you set. When you query a table that stores cold and hot data separately, you only need to specify a query hint or a time range, and the system determines whether to scan the cold or hot data. The process is automatic and transparent to users.

For more information, see the Yunqi Community article Extreme Cost Optimization for Massive Data: ApsaraDB for HBase Integrated Cold and Hot Data Separation.

How it works

ApsaraDB for HBase Performance-enhanced Edition determines whether data written to a table is cold based on the timestamp of the data and the time boundary that you set (COLD_BOUNDARY, in seconds). New data is stored in hot storage. As it ages past the boundary, it is moved to cold storage. You can change the time boundary as needed, and data is then moved from cold storage to hot storage, or from hot storage to cold storage, accordingly.
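
The decision can be pictured with a minimal sketch. This is purely illustrative and not an ApsaraDB API; the variable names and values are assumptions:

     // Purely illustrative: how the age of a cell compares with the configured boundary.
     long coldBoundarySeconds = 86400L;                    // hypothetical COLD_BOUNDARY value (one day)
     long cellTimestampMs = 1568203111265L;                // hypothetical timestamp of a written cell
     long ageMs = System.currentTimeMillis() - cellTimestampMs;
     boolean isCold = ageMs > coldBoundarySeconds * 1000L; // older than the boundary -> archived to cold storage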

Use cold and hot data separation

The cold storage function requires ApsaraDB for HBase Performance-enhanced Edition server version 2.1.8 or later, client dependency AliHBase-Connector 1.0.7/2.0.7 or later, and shell package alihbase-2.0.7-bin.tar.gz or later.

Before you use the Java API, complete the Java SDK installation and parameter configuration.

Before you use the HBase shell, complete the download and configuration of the shell.

Activate cold storage

Enable the cold storage function of the cluster.

Set a time boundary for a table

You can modify the COLD_BOUNDARY parameter to change the time boundary for separating cold and hot data. COLD_BOUNDARY is specified in seconds. For example, COLD_BOUNDARY => 86400 means that data written more than 86400 seconds (one day) ago is automatically archived to the cold storage medium.

When you use cold and hot data separation, you do not need to set the COLD attribute on the column family. If the column family has already been set to COLD, remove the cold storage attribute first.

Shell


     # Create a table with cold and hot data separation
     hbase(main):002:0> create 'chsTable', {NAME=>'f', COLD_BOUNDARY=>'86400'}
     # Disable cold and hot data separation
     hbase(main):004:0> alter 'chsTable', {NAME=>'f', COLD_BOUNDARY=>""}
     # Enable cold and hot data separation for an existing table, or modify the time boundary (in seconds)
     hbase(main):005:0> alter 'chsTable', {NAME=>'f', COLD_BOUNDARY=>'86400'}
   

Java API


     // Create a table with cold and hot data separation
     Admin admin = connection.getAdmin();
     TableName tableName = TableName.valueOf("chsTable");
     HTableDescriptor descriptor = new HTableDescriptor(tableName);
     HColumnDescriptor cf = new HColumnDescriptor("f");
     // COLD_BOUNDARY sets the time boundary for cold and hot data separation, in seconds.
     // In this example, data written more than one day ago is treated as cold data.
     cf.setValue(AliHBaseConstants.COLD_BOUNDARY, "86400");
     descriptor.addFamily(cf);
     admin.createTable(descriptor);

     // Disable cold and hot data separation
     // Note: a major compaction is required to move the data from cold storage back to hot storage
     descriptor = admin.getTableDescriptor(tableName);
     cf = descriptor.getFamily("f".getBytes());
     cf.setValue(AliHBaseConstants.COLD_BOUNDARY, null);
     admin.modifyTable(tableName, descriptor);

     // Enable cold and hot data separation for an existing table, or modify the time boundary
     descriptor = admin.getTableDescriptor(tableName);
     cf = descriptor.getFamily("f".getBytes());
     // COLD_BOUNDARY sets the time boundary in seconds; here, data written more than one day ago is cold data
     cf.setValue(AliHBaseConstants.COLD_BOUNDARY, "86400");
     admin.modifyTable(tableName, descriptor);
   

Write data

Data is written to a cold and hot separated table in exactly the same way as to a common table. The timestamp of the written data is the current system time. New data is stored in hot storage (standard disks). As the data ages, once its write time is older than the COLD_BOUNDARY value, it is archived to cold storage during a major compaction. The process is completely transparent to users.
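
A minimal write sketch follows; the table name and column family match the earlier examples, and the qualifier and value are assumptions:

     // Writing to a cold/hot separated table is identical to writing to a common table.
     Table table = connection.getTable(TableName.valueOf("chsTable"));
     Put put = new Put(Bytes.toBytes("row1"));
     // No explicit timestamp is set, so the server uses the current time and the cell
     // starts in hot storage; it is archived to cold storage after it ages past COLD_BOUNDARY.
     put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("value"), Bytes.toBytes("hot_value"));
     table.put(put);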

Data query

ApsaraDB for HBase Performance-enhanced Edition allows you to store both cold and hot data in a single table, so you do not need to query multiple tables. If you know in advance that the data to be queried is hot data (its write time is within the COLD_BOUNDARY value), you can add the HOT_ONLY hint to the Get or Scan request to tell the server to query only the hot data. You can also set the TimeRange parameter in a Get or Scan request to specify the time range of the data to be queried; the system then automatically determines whether the target data is hot or cold based on the specified time range. Note that the latency of querying cold data is much higher than that of hot data, and the query throughput is limited by the cold storage.

Examples

Get

Shell


     # Queries without the HOT_ONLY hint may read cold data
     hbase(main):013:0> get 'chsTable', 'row1'
     # Queries with the HOT_ONLY hint read only hot data; if row1 is in cold storage, the query returns no result
     hbase(main):015:0> get 'chsTable', 'row1', {HOT_ONLY=>true}
     # Queries with a TimeRange: the system compares the TimeRange with the COLD_BOUNDARY to decide which data to scan
     # (note that TimeRange is specified as millisecond timestamps)
     hbase(main):016:0> get 'chsTable', 'row1', {TIMERANGE => [0, 1568203111265]}
   

Java


     Table table = connection.getTable(TableName.valueOf("chsTable"));
     // Queries without the HOT_ONLY hint may read cold data
     Get get = new Get("row1".getBytes());
     System.out.println("result: " + table.get(get));
     // Queries with the HOT_ONLY hint read only hot data; if row1 is in cold storage, the query returns no result
     get = new Get("row1".getBytes());
     get.setAttribute(AliHBaseConstants.HOT_ONLY, Bytes.toBytes(true));
     // Queries with a TimeRange: the system compares the TimeRange with the COLD_BOUNDARY to decide which data to scan
     // (note that TimeRange is specified as millisecond timestamps)
     get = new Get("row1".getBytes());
     get.setTimeRange(0, 1568203111265L);
   

Scan

If a Scan does not set the HOT_ONLY hint, or its TimeRange covers the cold data period, the cold data and hot data are accessed in parallel and the results are merged. This behavior is determined by how HBase scans work.
Shell

      # Queries without the HOT_ONLY hint may read cold data
      hbase(main):017:0> scan 'chsTable', {STARTROW=>'row1', STOPROW=>'row9'}
      # Queries with the HOT_ONLY hint read only hot data
      hbase(main):018:0> scan 'chsTable', {STARTROW=>'row1', STOPROW=>'row9', HOT_ONLY=>true}
      # Queries with a TimeRange: the system compares the TimeRange with the COLD_BOUNDARY to decide which data to scan
      # (note that TimeRange is specified as millisecond timestamps)
      hbase(main):019:0> scan 'chsTable', {STARTROW=>'row1', STOPROW=>'row9', TIMERANGE => [0, 1568203111265]}
    
Java

      TableName tableName = TableName.valueOf("chsTable");
      Table table = connection.getTable(tableName);
      // Queries without the HOT_ONLY hint may read cold data
      Scan scan = new Scan();
      ResultScanner scanner = table.getScanner(scan);
      for (Result result : scanner) {
          System.out.println("scan result:" + result);
      }
      // Queries with the HOT_ONLY hint read only hot data
      scan = new Scan();
      scan.setAttribute(AliHBaseConstants.HOT_ONLY, Bytes.toBytes(true));
      // Queries with a TimeRange: the system compares the TimeRange with the COLD_BOUNDARY to decide which data to scan
      // (note that TimeRange is specified as millisecond timestamps)
      scan = new Scan();
      scan.setTimeRange(0, 1568203111265L);
    
Note
  1. The cold storage is only used to archive data that is rarely accessed. Only a small portion of queries should hit the cold data; most queries should carry the HOT_ONLY hint or specify a time range so that they hit only the hot data. If your cluster receives a large number of queries that hit the cold data, check whether the time boundary is set appropriately.
  2. If you update a field of a row stored in the cold storage, the updated field is moved to the hot storage. When this row is hit by a query that carries the HOT_ONLY hint or specifies a time range targeting the hot data, only the updated field in the hot storage is returned. If you want the system to return the entire row, remove the HOT_ONLY hint from the query or make sure that the specified time range covers the period from when the row was inserted to when it was last updated, as sketched below. We recommend that you do not update data stored in the cold storage. If you need to update the cold data frequently, adjust the time boundary so that the data stays in the hot storage.
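
A hedged sketch of that workaround, assuming hypothetical insert and update times for the row:

     // Return the entire row by covering the period from the original insert to the latest update.
     Get get = new Get(Bytes.toBytes("row1"));
     long insertTime = 1560578400000L;   // hypothetical time (ms) when the row was first written
     long updateTime = 1565848800000L;   // hypothetical time (ms) of the latest update
     get.setTimeRange(insertTime, updateTime + 1);  // the upper bound of the range is exclusive
     Result result = table.get(get);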

Query the sizes of cold and hot data

You can check the sizes of cold and hot data in a table on the User tables tab of the cluster management system. If the ColdStorageSize field displays 0, the cold data is still in memory and has not been flushed to disk. You can run the flush command to flush the data to disks, and then run a major compaction, after which the size of the cold data is displayed.
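
For example, with the HBase Java Admin API (the admin and tableName variables follow the earlier Java examples):

     // Flush the in-memory data to disk, then trigger a major compaction so that
     // the archived cold data is reflected in ColdStorageSize.
     admin.flush(tableName);
     admin.majorCompact(tableName);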

Advanced features

Prioritize hot data selection

The system may look up both the cold and hot data for Scan queries, such as queries that retrieve all order or chat records. The query results are paginated based on the timestamps of the data in descending order, so in most cases the hot data is displayed before the cold data. If such Scan queries do not carry the HOT_ONLY hint, the system has to scan both the cold and hot data, which degrades performance. With hot data prioritization enabled, the cold data is queried only when you request more results, for example by clicking Next on the page. This minimizes cold data access and reduces the response time.

To enable hot data prioritization, set the COLD_HOT_MERGE attribute on the Scan. With this attribute, the system scans the hot data first, and starts scanning the cold data only when you request more results.

Shell


     hbase(main):002:0> scan 'chsTable', {COLD_HOT_MERGE=>true} 
   

Java


     scan = new Scan();
     scan.setAttribute(AliHBaseConstants.COLD_HOT_MERGE, Bytes.toBytes(true)); 
     scanner = table.getScanner(scan); 
   
Note
  • When hot data prioritization is enabled, if a row that stores both cold and hot data is queried, the results for that row are returned in two batches. This means that you may find two results for the same rowkey in the result set.
  • Because the hot data is returned first, the system cannot ensure that the rowkeys of the returned cold data are greater than those of the hot data. This means that the results returned by Scan queries are not globally sorted. However, the records within the returned cold data or hot data are still sorted in rowkey order, as shown in the following demo. In some scenarios, you can still obtain sorted Scan results by designing the rowkeys appropriately. For example, you can create a rowkey that consists of a user ID and the order creation time, so that records can be retrieved by user ID and sorted by creation time; a rowkey construction sketch follows the demo.

       # Assume that the row with rowkey "coldRow" is cold data, and the row with rowkey "hotRow" is hot data.
       # Normally, because HBase rows are sorted in lexicographic order, "coldRow" is returned before "hotRow".
     hbase(main):001:0> scan 'chsTable'
        ROW                               COLUMN+CELL
      coldRow                            column=f:value, timestamp=1560578400000, value=cold_value
      hotRow                             column=f:value, timestamp=1565848800000, value=hot_value
      2 row(s)
       # When COLD_HOT_MERGE is set, the rowkey order of the scan is broken: hot data is returned before
       # cold data, so in the result "hotRow" appears before "coldRow".
     hbase(main):002:0> scan 'chsTable', {COLD_HOT_MERGE=>true}
        ROW                              COLUMN+CELL
        hotRow                           column=f:value, timestamp=1565848800000, value=hot_value
       coldRow                           column=f:value, timestamp=1560578400000, value=cold_value
       2 row(s)
   
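A hypothetical sketch of such a rowkey design (the user ID, creation time, and key layout are assumptions, not part of the ApsaraDB API):

     // Prefixing the rowkey with the user ID groups a user's orders together; appending a
     // reversed creation time sorts that user's orders from newest to oldest.
     String userId = "user123";                         // hypothetical user ID
     long orderCreateTime = 1565848800000L;             // hypothetical order creation time in ms
     byte[] rowkey = Bytes.toBytes(userId + "_" + (Long.MAX_VALUE - orderCreateTime));
     Put put = new Put(rowkey);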

Separate cold and hot data based on the fields in a rowkey

In addition to the timestamps of KV pairs, ApsaraDB for HBase Performance-enhanced Edition can also separate cold and hot data based on specific fields in a rowkey. For example, it can parse the timestamp field in a rowkey to separate cold and hot data. If separating cold and hot data based on the timestamps of KV pairs cannot meet your requirements, submit a ticket or consult the ApsaraDB for HBase Q&A DingTalk group.

Note

For considerations about using cold and hot data separation, see Cold storage.