Hot data partition splitting - PolarDB - Alibaba Cloud Documentation Center

This topic describes how to split a hot data partition in PolarDB-X.

As a distributed database, PolarDB-X distributes the partitions of a partitioned table to different storage nodes as evenly as possible and exploits the overall system resources to prevent single points of failure. The data of partitions in a RANGE or LIST partitioned table is distributed based on your definition. In a HASH partitioned table, PolarDB-X uses the consistent hashing algorithm to map the values of the partition key to a specific hash value, which is then mapped to the hash space where a partition resides. In a HASH partitioned table whose data is evenly distributed based on the partition key such as the primary key, PolarDB-X ensures that data is evenly distributed to partitions. Otherwise, data may be unevenly distributed to partitions, which causes data skew.

In this example, an order table is used to show how data skew is caused in a HASH partitioned table. The primary key of the table is the id column whose values are auto-increment IDs. The following sample code provides an example on how to create the table:

CREATE TABLE orders (
id int(11) NOT NULL AUTO_INCREMENT,
seller_id int(11) DEFAULT NULL,
PRIMARY KEY (id)
)

In this order table, if you want to evenly distribute its data, you can partition this table based on the unique primary key by using the HASH partitioning policy. PolarDB-X uses the consistent hashing algorithm to ensure that data is evenly distributed to partitions. However, when you query data from this table, data is more frequently queried based on the seller_id column instead of the id column. As shown in the following figure, if you want to query the data of a seller whose ID is 88, the system cannot perform partition pruning based on the seller_id column. In this case, the system needs to scan all partitions, which is inefficient.

Alternatively, you can change the partition key and partition this table based on the seller_id column by using the HASH partitioning policy. This way, when you query data from this table by the seller_id column, you can apply the partition pruning technology in the optimizer to exclude most irrelevant partitions. The system needs to only scan partial partitions to meet your business requirements. The following figure shows the method. However, if you use this method to partition the table based on the seller_id column, data of the same seller is distributed to the same partition. As a result, the data volume of the partition that contains the data of the big sellers is excessively large and data skew occurs. If the data of the big seller is distributed to one partition such as P5, as shown in the following figure, this partition becomes a hot data partition.

Solution

First, specify a column as the second partition key column.

alter table orders partition by key(seller_id,id) partitions 5

This operation does not change the data distribution of the orders table. No data is rehashed, the number of partitions is still 5, and the id column is specified as the second partition key column. Only the partition metadata of the table is modified at a low cost. In this case, the id column is not used for route computing and is used only as a placeholder of the partition key.

The following figure intuitively compares the hash space of each partition in the orders table before and after the id column is specified as the second partition key column.

Before the id column is specified

After the id column is specified

If only one partition key column is used and a hot data partition exists, a large amount of data is mapped to a specific hash value instead of a value range. The hot data partition cannot be further split. If two partition key columns are used, the hash space changes from one-dimensional to two-dimensional, which provides more flexibility and allows you to further split a hot data partition. In PolarDB-X, you can split hot data partitions based on the second partition key column. For example, the data of the big seller whose ID is 88 in the orders table in the preceding example is distributed to the P5 partition. You can execute the following statement to split the data of the big seller:

alter table orders split into H88_ partitions 2 by hot value(88)

The following figure shows the splitting results.

To split the hot data partition, the system splits the data of the big seller whose ID is 88 in the P5 partition into N partitions based on the id column. In the example, the value of N is 2 and the data is split into two partitions named H88_1 and H88_2.

The data of the sellers whose ID is not 88 is not affected before and after the data of the big seller whose ID is 88 is split. For example, the data in the P1 and P2 partitions remain unchanged. Only the hot data is changed. Before the data of the big seller whose ID is 88 is split, the data of the seller is automatically routed to the P5 partition. After the data of the big seller whose ID is 88 is split, the data of the seller is routed to the H88_1 and H88_2 partitions.

Limits

You can split a hot data partition only in a KEY partitioned table. The KEY partitioning policy is a type of the HASH partitioning policy.

References

Split a hot data partition