How to perform schema design in AnalyticDB for MySQL, distribution key, partition key - AnalyticDB for MySQL

This topic describes how to design the schema of an AnalyticDB for MySQL table to optimize the table performance. The schema includes the table type, distribution key, partition key, primary key, and clustered index of the AnalyticDB for MySQL table.

Select a table type

AnalyticDB for MySQL supports replicated tables and standard tables. When you select a table type, take note of the following items:

A replicated table stores a data replica on each node of the AnalyticDB for MySQL cluster to which the table belongs. We recommend that you limit the volume of data stored in each replicated table to a maximum of 20,000 rows.
A standard table is a partitioned table and can make full use of the query capability of a distributed system to improve query efficiency. Each standard table can store up to hundreds of billions of data entries.

Select a distribution key

If you want to import incremental data, you can specify a distribution key and a partition key when you create a standard table. Use the DISTRIBUTED BY HASH(column_name,...) clause to specify the distribution key and divide the table into partitions based on the hash value of the column_name field. For more information, see CREATE TABLE.

Syntax
```
DISTRIBUTED BY HASH(column_name,...)
```
Usage notes
- Select fields whose values are evenly distributed as the distribution key, such as the transaction ID, device ID, user ID, or auto-increment columns.
  Note
  Do not select fields of the date, time, and timestamp types as the distribution key. The preceding fields are prone to data skew during data writing and may degrade write performance. Most queries are limited to a period of time such as a day or a month. In this case, the data that you want to query may exist on only a single node, and queries cannot leverage the processing capabilities of all nodes in a distributed database system. We recommend that you select fields of the date and time types as the partition key. For more information, see the "Select a partition key" section of this topic.
- To reduce data shuffles, select fields that can be used to join tables as the distribution key. For example, if you want to query historical orders based on the customer, you can select the customer_id field as the distribution key.
- To filter data based on a distribution key, select the frequently used query condition fields as the distribution key.
- Each table can contain only a single distribution key. Each distribution key can contain one or more fields. Select fewer fields to make the distribution key more suitable for a variety of complex queries.
- If you do not specify a distribution key when you create a table, the following rules apply:
  - If the table has a primary key, AnalyticDB for MySQL uses the primary key as the distribution key.
  - If the table does not have a primary key, AnalyticDB for MySQL adds the __adb_auto_id__ field to the table and uses the field as the primary key and the distribution key.

Select a partition key

If a single shard contains a large amount of data after you specify a distribution key, you can divide the shard into partitions to improve data access performance. When you create a table, use the PARTITION BY clause to specify a partition key to partition the table. For more information, see CREATE TABLE.

Syntax
- Use the value of the column_name field to partition the table.
```
PARTITION BY VALUE(column_name)
```
- Convert the value of the column_name field into the %Y%m%d format. Example: 20210101.
```
PARTITION BY VALUE(DATE_FORMAT(column_name, '%Y%m%d'))
```
- Convert the value of the column_name field into the %Y%m format. Example: 202101.
```
PARTITION BY VALUE(DATE_FORMAT(column_name, '%Y%m'))
```
- Convert the value of the column_name field into the %Y format. Example: 2021.
```
PARTITION BY VALUE(DATE_FORMAT(column_name, '%Y'))
```
Usage notes
- When a table contains a large amount of data, specify an appropriate partition key for the table. If the table does not contain partitions or is improperly partitioned, the performance of the AnalyticDB for MySQL cluster may degrade.
- You can partition a table by year, month, day, or the original value. To ensure the query and write performance or stability of the AnalyticDB for MySQL cluster, specify a partition key to assign an appropriate amount of data to each partition.
- A partition that contains 300 million to 1 billion data entries is considered appropriate. For example, if a partition contains less than 300 million data entries, you can partition the table by month instead of by day. If a partition contains more than 1 billion data entries, you can partition the table by day instead of by month.
- We recommend that you keep data at rest in partitions. If multiple existing partitions are frequently updated each day, check whether the partition key is appropriate.
- You can use the LIFECYCLE N parameter to manage the lifecycle of a table. Partitions are sorted and only N partitions are retained.
  Important
  Each table supports a limited number of partitions. The data in a partitioned table cannot be permanently retained. For more information, see Limits.

Select a primary key

A primary key is used as a unique identifier of each record. When you create a table, you can use the PRIMARY KEY clause to specify a primary key. For more information, see CREATE TABLE.

Syntax
```
PRIMARY KEY (column_name,...)
```
Usage notes
- Only tables that have a primary key support operations such as DELETE and UPDATE.
- The primary key of an AnalyticDB for MySQL table can be a single field or a combination of multiple fields. To ensure high performance, we recommend that you select one or a small number of numeric fields as the primary key.
- A primary key must include a distribution key and a partition key. We recommend that you place a distribution key and a partition key in the front section of a composite primary key.

Select a clustered index

In a clustered index, the logical order of key values determines the physical order of corresponding rows in a table. When you select a clustered index, take note of the following items:

Each table supports a single clustered index. For information about how to create a clustered index, see CREATE TABLE.
We recommend that you select a field that is always included in data queries as the clustered index. For example, you want to allow students to query only their own final exam grades in an educational administration system. In this case, you can set the student ID to the clustered index to ensure data locality and improve data query performance.
A clustered index sorts the entire table, which results in lower data write performance and high CPU utilization. We recommend that you do not use clustered indexes.

Example

Create a table named customer that meets the following requirements:

The table data is partitioned by using the login_time field that is in the %Y%m%d format.
Only data in the last 30 partitions is retained, which indicates that the lifecycle value of the table is 30.
The table data is distributed by using the customer_id field.
The primary key is set to login_time, customer_id, phone_num.

Execute the following statement to create the customer table:

CREATE TABLE customer (
customer_id bigint NOT NULL COMMENT 'Customer ID',
customer_name varchar NOT NULL COMMENT 'Customer name',
phone_num bigint NOT NULL COMMENT 'Phone number',
city_name varchar NOT NULL COMMENT 'City',
sex int NOT NULL COMMENT 'Gender',
id_number varchar NOT NULL COMMENT 'ID card number',
home_address varchar NOT NULL COMMENT 'Home address',
office_address varchar NOT NULL COMMENT 'Office address',
age int NOT NULL COMMENT 'Age',
login_time timestamp NOT NULL COMMENT 'Logon time',
PRIMARY KEY (login_time, customer_id, phone_num)
 )
DISTRIBUTED BY HASH(customer_id)
PARTITION BY VALUE(DATE_FORMAT(login_time, '%Y%m%d')) LIFECYCLE 30
COMMENT 'Customer information table';

FAQ

Q: How do I query all partitions of a table and the statistics about the partitions after I divide the table into partitions?

A: Execute the following statements to query all partitions of the table and the statistics about the partitions:

SELECT partition_id, --The name of the partition.
          row_count, -- The total number of rows in the partition.
          local_data_size, -- The local data storage of the partition.
          index_size, -- The index size of the partition.
          pk_size, -- The size of the primary key index of the partition.
          remote_data_size -- The remote data storage of the partition.
FROM information_schema.kepler_partitions
WHERE schema_name = '$DB'
 AND table_name ='$TABLE' 
 AND partition_id > 0;

Important

If no compaction is triggered, you cannot query the partitions that are included in incremental data. To query all partitions in real time, execute the SELECT DISTINCT $partition_column from $db.$table; statement.

Q: What factors affect the number of shards? Can I manually change the number of shards?
A: The number of shards is automatically calculated based on the specifications of the AnalyticDB for MySQL cluster that you specify when you create the cluster. You cannot manually change the number of shards.
Q: Is the number of shards affected when I change the configurations of an AnalyticDB for MySQL cluster?
A: No, the number of shards is not affected when you change the configurations of the cluster.
Q: Can I change the distribution key or partition key of an AnalyticDB for MySQL table?
A: No, you cannot change the distribution key or partition key of an AnalyticDB for MySQL table. For information about how to use a different distribution key or partition key, see the "Change a partition or distribution key" section of the ALTER TABLE topic.