Partition indexing is a feature designed to address storage and high-concurrency access challenges for large wide tables. When you create a search index, you can specify a data partitioning policy. The server then automatically splits and stores the data. During queries, the system automatically performs partition pruning. This topic describes data partitioning policies and how to use them.
Prerequisites
Scenarios
The Lindorm search index provides two partitioning policies: hash partitioning and time partitioning. You can select a policy based on the business scenarios described below.
Hash partitioning
Query partition pruning: If your query conditions always include an equality query on a specific column, consider using hash partitioning on that column. This method ensures that identical values are always routed to the same partition group. During a query, partition pruning can be used to retrieve data only from the specific partitions that contain the query value. For example, for a device table, if query conditions always include a device ID, you can set the device ID column as the hash partition key.
Balanced partition storage: When you select a hash partition key, you must also consider the data distribution. If the selected partition key causes data hot spots, you can use the multi-level hash partitioning feature to distribute the data more evenly across multiple partitions.
Time partitioning
Query partition pruning: If your business data has a time attribute, such as data from the Internet of Vehicles (IoV), order details, or message logs, and your queries often use a time range, you can use time partitioning. For example, you might query order data from the last 7 days or the last month. This feature narrows the query scope to specific shards based on the specified time range.
Control the storage limit of a single shard: If a single index or shard stores a large amount of data, its write and query performance can be affected. If your index data grows rapidly, you can use the time partitioning policy to control the amount of data in a single shard.
Preparations
Before you use partition indexing, create a test table using the following statement:
CREATE TABLE IF NOT EXISTS search_table (user_id BIGINT, storeId VARCHAR, goodsId VARCHAR, goodsPrice SMALLINT, orderTime BIGINT, info VARCHAR, PRIMARY KEY (user_id asc));Hash partitioning
The Lindorm search index supports up to three levels of partitioning. A hash partition configuration cannot be changed after it is created. The following examples show the syntax for hash partitioning.
Primary partition
Create a search index. By default, hash partitioning is based on the primary key of the wide table. The default number of partitions is twice the number of search nodes.
CREATE INDEX IF NOT EXISTS idx USING SEARCH ON search_table (storeId, goodsId, goodsPrice);Create a search index with hash partitioning on the
storeIdcolumn and64partitions. ThePARTITION BY hash(storeId)clause specifiesstoreIdas the partition key, and thepartitions 64clause sets the number of partitions to64.CREATE INDEX IF NOT EXISTS idx USING SEARCH ON search_table (storeId, goodsId, goodsPrice) PARTITION BY hash(storeId) PARTITIONS 64;NoteWhen you use hash partitioning, you must estimate the required number of partitions based on the data volume. A single partition should contain 50 million to 100 million records and have a storage size of 30 GB to 50 GB.
Multi-level partition
If the total data volume of the index is large (tens of billions of records or more) and the primary partition key causes uneven data distribution, you can use multi-level partitioning to optimize the data distribution.
The wide table engine must be version 2.8.1 or later. The search engine must be version 3.9.22 or later.
Multi-level partitioning cannot be used with time partitioning.
Create a search index with hash partitioning based on the
storeIdandgoodsIdcolumns of the wide table. Set the salt factors (salt_factor) for these columns to2and4, respectively. Thepartitions 64clause sets the total number of partitions to64.CREATE INDEX IF NOT EXISTS idx USING SEARCH ON search_table (storeId, goodsId, goodsPrice) PARTITION BY hash(storeId(salt_factor=2), goodsId(salt_factor=4)) partitions 64;NoteThe partition salt factor (salt_factor) is a small integer that controls the degree of hashing for different partition key values. Follow these steps to determine the total number of partitions and the salt factor for each partition key:
Estimate the required number of partitions based on the total data volume. Then, select a power of 2 (
) that is close to your estimate as the total number of partitions. For example, if the total data volume is 2 billion entries and each partition contains about 50 million entries, you need approximately 40partitions. In this case, you can selectas the total number of partitions. If the total number of partitions is
, the sum of the salt factors for all partition keys must be 6. You can assign salt factors to control the hashing range for different partition keys.For example, if the salt factor for
storeIdis2and the salt factor forgoodsIdis 4, data with the samestoreIdvalue but differentgoodsIdvalues is distributed acrossof the total partitions. If the salt factor for storeIdis3and the salt factor forgoodsIdis3, data with the samestoreIdvalue but differentgoodsIdvalues is distributed acrossof the total partitions.
Create a partition index with
_idas the secondary partition key.NoteIf the primary partition key causes uneven data distribution and no suitable secondary partition key is available for your business, specify
_idas the secondary partition key. `_id` corresponds to the composite primary key of the wide table.CREATE INDEX IF NOT EXISTS idx USING SEARCH ON search_table (storeId, goodsId, goodsPrice) PARTITION BY hash(storeId(salt_factor=2), _id(salt_factor=4)) partitions 64;
Time partitioning
For time series data, such as order data or message logs, you can specify a time column to partition the data by a time range. For example, you can partition data by week or month. Data within the same time range is stored in the same partition, and data in old partitions can be automatically deleted based on a TTL policy.
Create an index and partition it by the business time column
orderTime.RANGE_TIME_PARTITION_START='30'specifies that partitions are created starting from 30 days ago.RANGE_TIME_PARTITION_INTERVAL='7'automatically creates a new partition every 7 days.RANGE_TIME_PARTITION_TTL='90'sets the default Time to Live (TTL) for partition data to 90 days. Data older than 90 days is automatically deleted.CREATE INDEX idx USING SEARCH ON search_table (storeId, goodsId, goodsPrice, orderTime) PARTITION BY RANGE time(orderTime) partitions 4 WITH ( indexState=ACTIVE, RANGE_TIME_PARTITION_START='30', RANGE_TIME_PARTITION_INTERVAL='7', RANGE_TIME_PARTITION_TTL='90' );ImportantThe RANGE_TIME_PARTITION_START parameter controls the start time for partitioning. It specifies the number of days before the current date from which to start creating partitions.
For example, assume the earliest date in the `orderTime` column is March 16, 2021. To retain all historical data, you must calculate the number of days between March 16, 2021, and the day you create the index. Then, set this number as the value for the RANGE_TIME_PARTITION_START parameter. After the start time is set, records with an `orderTime` value earlier than the start time are not indexed.
Create an index partitioned by the business time column
orderTime, with new partitions automatically created monthly starting from six months ago. By default, data is retained for six months and the unit for the partition field is set to seconds (s).CREATE INDEX idx USING SEARCH ON search_table (storeId, goodsId, goodsPrice, orderTime) partition by range time(orderTime) partitions 4 with ( indexState=ACTIVE, RANGE_TIME_PARTITION_START='180', RANGE_TIME_PARTITION_INTERVAL='30', RANGE_TIME_PARTITION_TTL='180', RANGE_TIME_PARTITION_FIELD_TIMEUNIT='s' );
The following table describes the parameters for time range partitioning.
Parameter | Required | Description |
RANGE_TIME_PARTITION_START | Yes | Specifies the number of days before the index creation to start creating partitions. This applies to scenarios with historical data. If the timestamp of historical data is earlier than the partition start time, an error is reported. |
RANGE_TIME_PARTITION_INTERVAL | Yes | Specifies the interval in days for creating new partitions. For example, |
RANGE_TIME_PARTITION_TTL | No | Specifies the number of days to retain partition data. For example, |
RANGE_TIME_PARTITION_FIELD_TIMEUNIT | No | Specifies the unit of the time partition field for your business. The default unit is milliseconds (ms).
|
RANGE_TIME_PARTITION_MAX_OVERLAP | No | If the data being written has a future timestamp, this parameter specifies the maximum allowed interval between the future timestamp and the current time, in days. If not specified, there is no limit. |
RANGE_TIME_PARTITION_FORMAT | No | Specifies the write format of the time partition field for your business. The default value is |