Lindorm: Partition index

Last Updated: Apr 28, 2024

The partition index feature resolves storage issues and high-concurrency access issues that occur when a wide table stores large amounts of data. When you create a search index for a table, you can specify a data partitioning policy. The server automatically partitions the data in the table and stores the partitions. When you query the data, LindormTable automatically performs partition pruning. This topic describes the available data partitioning policies and how to use them.

Prerequisites

Scenarios

  • Your business data has time attributes. Examples of such business data include Internet of Vehicles (IoV) data, order details, and message logs.

  • Your business data has a characteristic that you can use to classify the data. For example, data in a merchant table is classified by merchant ID and the merchant ID column is used as a query condition, or data in an IoT device table is classified by device ID and the device ID column is used as a query condition.

Data partitioning policies

Preparations

Before you use the partition index feature, execute the following statement to create the test table that is used in the examples in this topic:

CREATE TABLE IF NOT EXISTS search_table (user_id bigint, storeId varchar, goodsId varchar, goodsPrice smallint, orderTime bigint, info varchar, constraint primary key (user_id asc));

Hash partitioning

When you use this policy, data is hashed before it is stored. This prevents unbalanced data distribution. In scenarios in which large amounts of data are written, this policy helps distribute the data evenly across partitions. By default, the search index feature performs hash partitioning based on the primary key of the wide table in your Lindorm instance. You can also specify a custom hash partition key.

The following sample code provides examples on how to perform hash partitioning:

  • Create a search index that uses default hash partitioning. By default, hash partitioning is performed based on the primary key of the wide table, and the number of partitions is twice the number of search nodes.

    CREATE INDEX IF NOT EXISTS idx USING SEARCH ON search_table (storeId, goodsId, goodsPrice);
  • Create a search index, set the number of partitions to 64, and perform hash partitioning based on the storeId column of the wide table. In this case, first-level hash partitioning is performed.

    CREATE INDEX IF NOT EXISTS idx USING SEARCH ON search_table (storeId, goodsId, goodsPrice) partition by hash(storeId) partitions 64;
    Important
    • If a column is used as a condition in most of your queries, specify the column as a custom partition key. This way, queries that filter on the column can be pruned to fewer partitions, as shown in the query example after this list.

    • When you use custom partition keys, you can set the number of partitions to a larger value based on your requirements. If you are not sure how many partitions to configure, set the value to 64 for a test.

    • If you perform hash partitioning based on a single column of a wide table (first-level hash partitioning) and the column that is specified as the partition key contains frequently accessed values, a large volume of data can be written to a single partition. For example, a single value in the storeId column may account for 10% of the data in the table. This degrades both query performance and write performance. In this case, we recommend that you perform multi-level hash partitioning based on two or three columns of the wide table (second-level or third-level hash partitioning). For more information, see Multi-level hash partitioning (advanced usage).

    • Limits on custom partition keys:
      • Partition key values cannot be changed.
      • Partition key values cannot be empty.
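
The following query sketch shows how a custom partition key is typically used as a filter condition. The statement and the sample values are hypothetical and only illustrate the behavior described above: because storeId is the custom hash partition key, a filter on storeId allows LindormTable to prune the query to the partitions that contain the matching data.

-- Hypothetical query against the test table. The filter on the custom
-- partition key (storeId) allows partition pruning, so only the partitions
-- that contain storeId = 'store_001' are scanned.
SELECT goodsId, goodsPrice
FROM search_table
WHERE storeId = 'store_001' AND goodsPrice > 100;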

Time range partitioning

You can perform time range partitioning on time series data, for example, by week or by month. When you perform time range partitioning, data within the same time range is stored in the same partition, and expired data in a partition can be automatically removed.

The following sample code provides examples on how to perform time range partitioning on time series data:

  • Create an index and perform time range partitioning based on the orderTime column. The first partition starts 30 days before the point in time when you create the index, a new partition is created every 7 days, and data in a partition is retained for 90 days.

    CREATE INDEX idx USING SEARCH ON search_table (storeId, goodsId, goodsPrice, orderTime)
    partition by range time(orderTime) partitions 4
    with (indexState=ACTIVE, RANGE_TIME_PARTITION_START='30', RANGE_TIME_PARTITION_INTERVAL='7', RANGE_TIME_PARTITION_TTL='90');
  • Create an index and perform time range partitioning based on the orderTime column. The first partition starts 180 days before the point in time when you create the index, a new partition is created every 30 days, data in a partition is retained for 180 days, and the unit of the values in the partition key column is seconds.

    CREATE INDEX idx USING SEARCH ON search_table (storeId, goodsId, goodsPrice, orderTime)
    partition by range time(orderTime) partitions 4 
    with (indexState=ACTIVE, RANGE_TIME_PARTITION_START='180', RANGE_TIME_PARTITION_INTERVAL='30', RANGE_TIME_PARTITION_TTL='180', RANGE_TIME_PARTITION_FIELD_TIMEUNIT='s');

The following list describes the parameters that are used in the preceding code.

  • RANGE_TIME_PARTITION_START (required): The number of days before the point in time when you create the index. The first partition starts at the point in time that is the specified number of days before the index is created. Specify this parameter when your table contains historical data. If the timestamp of a row of historical data is earlier than the start time of the first partition, an error message is returned.

  • RANGE_TIME_PARTITION_INTERVAL (required): The interval, in days, at which a new partition is created. For example, if you set RANGE_TIME_PARTITION_INTERVAL to '7', a new partition is created every seven days.

  • RANGE_TIME_PARTITION_TTL (optional): The retention period, in days, of the data in a partition. For example, if you set RANGE_TIME_PARTITION_TTL to '180', data that has been retained in a partition for more than 180 days is automatically removed.

  • RANGE_TIME_PARTITION_FIELD_TIMEUNIT (optional): The unit of the values in the partition key column. Default value: ms (milliseconds).
    • If you set this parameter to s (seconds), each value in the partition key column is 10 digits in length.
    • If you set this parameter to ms, each value in the partition key column is 13 digits in length.
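
To illustrate the time unit, the following sketch shows how rows might be written to the test table with orderTime values in the default millisecond unit and in seconds. The UPSERT statements and the sample values are assumptions for illustration; the key point is that the unit of the values you write must match RANGE_TIME_PARTITION_FIELD_TIMEUNIT.

-- Assumed write statements for illustration. With the default unit (ms), the
-- orderTime value is a 13-digit millisecond timestamp.
UPSERT INTO search_table (user_id, storeId, goodsId, goodsPrice, orderTime)
VALUES (1001, 'store_001', 'goods_001', 99, 1714276800000);

-- If RANGE_TIME_PARTITION_FIELD_TIMEUNIT is set to 's', the orderTime value
-- must be a 10-digit second-level timestamp instead.
UPSERT INTO search_table (user_id, storeId, goodsId, goodsPrice, orderTime)
VALUES (1002, 'store_001', 'goods_002', 199, 1714276800);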

Multi-level hash partitioning (advanced usage)

Set salt_factor to 4, set the number of partitions to 64, and perform hash partitioning based on the storeId and goodsId columns in a wide table of a Lindorm instance. In this case, second-level hash partitioning is performed.

// Data with the same values in the storeId column and different values in the goodsId column is separately stored in four partitions.
CREATE INDEX idx USING SEARCH ON search_table (storeId, goodsId, goodsPrice) partition by hash(storeId(salt_factor=4),goodsId) partitions 64;
Important
  • The salt_factor parameter is used to further partition data rows that contain the same value in the storeId field. We recommend that you set the salt_factor parameter to a small value. The number of subpartitions equals the total number of partitions divided by 2 raised to the power of the salt_factor value. For example, if the number of partitions is 16 and the value of the salt_factor parameter is greater than 4, the data partitions cannot be split into subpartitions by using the hash partitioning strategy. The following examples describe how data is rehashed based on the value of the salt_factor parameter:
    • salt_factor=1: specifies that data rows that include the same value in the storeId field and different values in the goodsId field are partitioned to the number of subpartitions that is equal to 1/2 of the total number of partitions.
    • salt_factor=2: specifies that data rows that include the same value in the storeId field and different values in the goodsId field are partitioned to the number of subpartitions that is equal to 1/4 of the total number of partitions.
    • salt_factor=3: specifies that data rows that include the same value in the storeId field and different values in the goodsId field are partitioned to the number of subpartitions that is equal to 1/8 of the total number of partitions.
  • Multi-level hash partitioning, which includes second-level and third-level hash partitioning, can be used to split data partitions and can also greatly improve query efficiency. For example, if data is partitioned based on the storeId and goodsId columns, LindormSearch scans only one partition when the query conditions include both the storeId and goodsId columns. This reduces the amount of data that the system scans, as shown in the query example after this list.
  • If you use second-level hash partitioning, you must specify a salt factor for the first-level partition key. In the preceding example, the first-level partition key is the storeId column. If you use third-level hash partitioning, you must specify salt factors for both the first-level and second-level partition keys.
  • Limits on custom partition keys:
    • Partition key values cannot be changed.
    • Partition key values cannot be empty.
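
The following query sketch illustrates the preceding point about multi-level hash partitioning. The statement and the sample values are hypothetical: because the filter conditions include both partition key columns (storeId and goodsId), LindormSearch can locate the single partition that holds the matching rows.

-- Hypothetical query. The filter contains both partition key columns
-- (storeId and goodsId), so only one partition needs to be scanned.
SELECT goodsPrice
FROM search_table
WHERE storeId = 'store_001' AND goodsId = 'goods_001';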