How do I separately store hot data and cold data based on custom time columns - Lindorm

LindormTable allows you to separately store hot and cold data based on custom time columns. After you purchase Capacity storage for a Lindorm instance, you can specify a custom time column in a table or secondary index and separately store hot and cold data based on the column. This improves the query efficiency of hot data while reducing the storage costs of cold data. This topic describes how to separately store hot data and cold data based on custom time columns and usage notes.

Prerequisites

Cold storage is enabled for the Lindorm instance. For more information, see Overview.
The version of LindormTable is 2.4.3 or later. For more information about how to view or upgrade the version of LindormTable, see Release notes of LindormTable and Upgrade the minor engine version of a Lindorm instance.
You are connected to LindormTable by using Lindorm-cli. For more information, see Use Lindorm-cli to connect to and use LindormTable.
The STORAGE_POLICY attribute of the tables or column families cannot be COLD, which indicates that the table for which you want to enable hot and cold data separation cannot be a table in the cold storage. For more information about how to modify the attribute of column families, see Configure cold storage.

Usage notes

Hot and cold data separation supports only tables created by using SQL statements. Tables created by using a HBase shell are not supported.
When the value in the custom time column of a row exceeds the specified hot and cold data boundary, the entire row is archived in cold storage.
Take note of the following rules when you configure custom time columns:
- The custom time column must be a primary key column.
- The custom time column cannot be the first column of the primary key.
- The data type of the custom time column must be BIGINT. The values in the custom time column are UNIX timestamps representing the number of milliseconds that have elapsed since the epoch time January 1, 1970, 00:00:00 UTC. If the specified custom time column does not exist, data is not written to the column, or the type of data in the column is incorrect, Lindorm cannot determine whether to store the row in cold storage or hot storage. In this case, all data in the row is stored in hot storage.

Procedure

Specify a hot and cold data boundary for a table

Specify a custom time column and hot and cold data boundary.
- Method 1: Specify the column and boundary for a table when you create the table.
  You can execute the following statement to create a table named dt, set the hot and cold data boundary to one day, and specify p2 as the column based on which hot and cold data is separately stored:
```
CREATE TABLE dt (p1 integer, p2 bigint, p3 bigint, c1 varchar,  constraint pk primary key(p1, p2, p3)) WITH(COMPRESSION = 'ZSTD', CHS ='86400', CHS_L2 = 'storagetype=COLD', CHS_COLUMN = 'COLUMN=p2'); 
```
  Parameters
  - CHS: The hot and cold data boundary for the table. Data that has been stored in the table for a period longer than this period is archived in cold storage. The unit of the value of this parameter is seconds. For example, if the value of this parameter is 86400, data is archived to cold storage after the data is stored in written to the table for 86,400 seconds, which is equal to one day.
  - COMPRESSION: The algorithm used to compress the data. This parameter applies to data in the entire table. The value of this parameter is not case-sensitive. Default value: NONE.
  - CHS_L2: The layer-2 attribute. In general, this parameter is configured to specify the storage type. Set this parameter to storagetype=COLD.
  - CHS_COLUMN: The custom time column.
- Method 2: Use the ALTER TABLE statement to add attributes to an existing table.
```
-- Hot and cold data separation is not enabled for the table when the table is created.
CREATE TABLE dt (p1 integer, p2 bigint, p3 bigint, c1 varchar,  constraint pk primary key(p1, p2, p3));

-- Enable hot and cold data separation for the dt table, set the hot and cold data boundary to one day, and specify p2 as the column based on which hot and cold data is separately stored.
ALTER TABLE dt SET 'CHS' ='86400', 'CHS_L2' = 'storagetype=COLD', 'CHS_COLUMN' = 'COLUMN=p2';
```
(Optional) Modify the hot and cold data boundary and custom time column of the table.
- You can execute the following statement to modify the hot and cold data boundary of the secondary index table:
```
ALTER TABLE dt SET 'CHS'='1000';
```
- You can execute the following statement to modify the custom time column of the secondary index table:
```
ALTER TABLE dt SET 'CHS_COLUMN'='COLUMN=p3'; 
```
(Optional) Disable hot and cold data separation for the table.
```
ALTER TABLE dt SET 'CHS'='', 'CHS_L2' = '', 'CHS_COLUMN'='';
```
Note
After you change the hot and cold data boundary or disable hot and cold data separation, the data is transferred from cold storage to hot storage after Lindorm performs the compaction operation. To immediately transfer the data from cold storage to hot storage, manually run the major_compact command.

Configure a hot and cold data boundary for a secondary index

By default, secondary indexes are stored in a table. Therefore, you can enable hot and cold data separation for a secondary index.

In this example, a secondary index is created for the dt table that is created in the preceding section.

Create a secondary index for the table and configure a custom time column and a hot and cold data boundary for the secondary index.
```
CREATE INDEX idx on dt (c1) WITH(CHS = '86400', CHS_L2 = 'storagetype=COLD', CHS_COLUMN='COLUMN=p2');
```
Note
The custom time column of a secondary index must be the same as the custom time column of the table.
(Optional) Modify the hot and cold data boundary and custom time column of the secondary index table.
- You can execute the following statement to modify the hot and cold data boundary of the secondary index table:
```
ALTER TABLE `dt.idx` SET 'CHS'='10000';
```
- You can execute the following statement to modify the custom time column of the secondary index table:
```
ALTER TABLE `dt.idx` SET 'CHS_COLUMN'='COLUMN=p2'; 
```
Note
The name of a secondary index table is in the following format: <Table name>.<Secondary index name>. The name contains a period (.), which is a special character. Therefore, you must enclose the name of the table name to escape it when you specify the table name in the statement. For example, if the name of the table is test.idx, the table name that you specify in the statement must be `test.idx`.
(Optional) Disable hot and cold data separation for the secondary index table.
```
ALTER TABLE `dt.idx` SET 'CHS'='', 'CHS_L2' = '', 'CHS_COLUMN'='';
```

Write BLOB data

You can write data to a table that separately stores hot data and cold data in the same manner as that you write data to a standard table. Data that is written to a table with hot and cold data separation enabled is stored in hot storage first, such as standard and performance-based storage. When a row of data has been stored in hot storage for a period that is longer than the specified boundary, the data is archived to cold storage when compaction is performed. The period for which a row of data is stored can be calculated by using the following formula: Current time - Value of the time column in the row.

Data queries

Lindorm uses one table to store hot data and cold data. This way, you can query all data within one table. We recommend that you specify a time range for the value of the custom time column as a query condition. Lindorm determines whether to query only the hot storage or cold storage, or query both the hot and cold storage based on the time range that you specify. If you do not specify a time range as a condition in a query, cold data may be queried. In this case, the throughput of the query is throttled by the specification of the cold storage. For more information, see Overview.

In the value of custom column cannot be specified as a query condition, you can also use the _l_hot_only_ hint in the query to query only data in the hot storage.

Examples

Use the GET method to query data

// p2 indicates the custom time column.
SELECT * FROM dt WHERE p1 = 10 AND p2 = 10;

Use the SCAN method to query data within a specified range
```
// p2 indicates the custom time column.
SELECT * FROM dt WHERE p2 > 10 AND p2 < 1000;
```
Important
If LindormTable determines that a SELECT query is an inefficient query, LindormTable does not process the query by default and throws an exception. For more information, see SELECT.

Use hints to query only hot data

SELECT /*+ _l_hot_only_(true) */ * FROM dt WHERE p1>1;

Best practices

In Internet of Vehicles (IoV) scenarios, the primary key of a table generally consists of the Vehicle Identification Number (VIN) column and the timestamp column. The timestamp column is usually specified as the custom time column based on which hot and cold data is determined. When you query the data of a vehicle within a specific time range, Lindorm can determine whether to query cold data or query only hot data based on the specified time range.

USE test;
CREATE TABLE dt (
  vin varchar, ts bigint, c1 varchar, c2 varchar, constraint pk primary key(vin, ts)) WITH
(COMPRESSION = 'ZSTD', CHS ='86400', CHS_L2 = 'storagetype=COLD', CHS_COLUMN = 'COLUMN=ts'); // The time column must be included in the primary key.
// Query the data of a vehicle within a time range.
SELECT * FROM dt WHERE vin='xxxx' AND ts > 1675065122000 AND ts < 1675975222000;

FAQs

Q: Is a row of cold data still stored in the cold storage after it is updated?
A: If the data that you update is not in the custom time column, the row is stilled stored in the cold storage. If the data that you update is in the custom time column, Lindorm determines whether to store this row in the cold storage based on the updated data in the custom time column. For example, a table contains the following columns: p1, p2, c1, and c2. The primary key of the table includes the p1 and p2 columns. In a row of data, the values in the p1, p2, c1, and c2 columns are row1, 2023.1.28, c1, and c2 individually. The hot and cold data boundary for the table is one day, and the current date is January 30, 2023. In this case, this row is determined as cold data and is stored in the cold storage. If you update data in the c1 and c2 columns, this row is still stored in the cold storage. If you update the value in p2 to 2023.1.30, this row is determined as hot data until Feb 2, 2023 based on the hot and cold data boundary.
Note
If a custom time column is specified as the primary key, the values in the column cannot be updated.
Q: Is a row archived to the cold storage if no value is specified for the custom time column of the row?
A: No, the row is not archived to the cold storage. Hot and cold data separation is implemented based on the value in the custom time column. A row with the custom time column unspecified is stored in the hot storage.