Schema design determines how AnalyticDB for MySQL stores and retrieves data. A well-designed schema keeps rows that are queried together on the same node, balances each node's workload, and avoids data hotspots. This document covers table types, distribution keys, partition keys, primary keys, and clustered index keys.
Table types
AnalyticDB for MySQL provides two table types. Choose based on data volume:
| Table type | How data is stored | Recommended data volume |
|---|---|---|
| Replicated table | A full copy on every cluster node | Up to 20,000 rows |
| Standard table (partitioned table) | Sharded across nodes by distribution key | Tens of millions to hundreds of billions of rows |
Use replicated tables for small dimension tables that are frequently joined against large fact tables. Use standard tables for everything else.
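As a sketch of the two table types, the statements below create a small dimension table as a replicated table and a large fact table as a standard table. Table and column names are illustrative, and the `DISTRIBUTED BY BROADCAST` clause is assumed here as the replicated-table syntax; see CREATE TABLE for the authoritative syntax.

```sql
-- Small dimension table: a full copy is stored on every node (replicated table).
CREATE TABLE dim_region (
  region_id int NOT NULL,
  region_name varchar NOT NULL,
  PRIMARY KEY (region_id)
) DISTRIBUTED BY BROADCAST;

-- Large fact table: sharded across nodes by the distribution key (standard table).
CREATE TABLE fact_orders (
  order_id bigint NOT NULL,
  region_id int NOT NULL,
  amount decimal(16, 2) NOT NULL,
  PRIMARY KEY (order_id)
) DISTRIBUTED BY HASH(order_id);
```

A join between fact_orders and dim_region never needs to move the dimension table across the network, because every node already holds a full copy of it.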
Distribution key
Design goal: Minimize data movement during query execution by co-locating rows that are queried together on the same node.
When you create a standard table, specify a distribution key using the DISTRIBUTED BY HASH clause. AnalyticDB for MySQL shards the table by hashing the distribution key value and routing each row to the corresponding node.
```sql
DISTRIBUTED BY HASH(column_name, ...)
```
Choosing a distribution key
Select a distribution key based on these criteria, in order of priority:
Even value distribution — pick a high-cardinality column so rows spread evenly across nodes. Good choices include transaction IDs, device IDs, user IDs, and auto-increment columns. Uneven distribution causes data skew and degrades write performance.
Join columns — if a query joins two tables on a specific column, use that column as the distribution key for both tables. This routes matching rows to the same node and eliminates cross-node data shuffles. For example, to query historical orders by customer, use `customer_id` as the distribution key.
GROUP BY columns — if queries group rows by a specific field, selecting it as the distribution key lets the query optimizer use the data distribution property to perform the grouping locally on each node. This reduces network overhead between nodes and can save up to 60% of the computing time for the group operator.
Frequent filter columns — columns that appear often in `WHERE` clauses let AnalyticDB for MySQL read only the relevant shards and skip the rest.
Fewest fields possible — each table has exactly one distribution key, but it can span multiple columns. Fewer columns in the distribution key make it useful across a wider variety of queries.
Do not use DATE, TIME, or TIMESTAMP columns as the distribution key. Date-type values cluster data into a small number of hash buckets, causing severe data skew and degrading write performance. Most queries also filter by a time range, which concentrates all matching rows on a single node—defeating the distributed architecture. Use fields of the DATE or TIME types as partition keys instead.
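To illustrate the join-column criterion above, the sketch below distributes two hypothetical tables on the same column, so a join on `customer_id` finds all matching rows on the same node. Table and column names are illustrative.

```sql
CREATE TABLE customer_info (
  customer_id bigint NOT NULL,
  customer_name varchar NOT NULL,
  PRIMARY KEY (customer_id)
) DISTRIBUTED BY HASH(customer_id);

CREATE TABLE orders (
  customer_id bigint NOT NULL,
  order_id bigint NOT NULL,
  order_amount decimal(16, 2) NOT NULL,
  PRIMARY KEY (customer_id, order_id)
) DISTRIBUTED BY HASH(customer_id);

-- Matching rows of both tables live on the same node, so this join
-- runs node-locally and requires no cross-node shuffle.
SELECT c.customer_name, SUM(o.order_amount)
FROM orders o
JOIN customer_info c ON o.customer_id = c.customer_id
GROUP BY c.customer_name;
```

Note that both primary keys begin with the distribution key column, which the Primary key section below requires.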
Default behavior when no distribution key is specified:
If the table has a primary key, AnalyticDB for MySQL uses the primary key as the distribution key.
If the table has no primary key, the system adds an `__adb_auto_id__` column and uses it as both the primary key and the distribution key.
For AnalyticDB for MySQL clusters in Elastic mode with 8 or 16 cores, the default number of shards is 32.
For AnalyticDB for MySQL clusters in Elastic mode with 32 or more cores, the default number of shards depends on the number of Elastic I/O Units (EIUs). The relationship is as follows:
| Number of EIUs | Number of shards |
|---|---|
| EIU = 1 | 32 |
| 2 &lt;= EIU &lt; 4 | 64 |
| 4 &lt;= EIU &lt; 8 | 128 |
| 8 &lt;= EIU &lt; 16 | 192 |
| 16 &lt;= EIU &lt; 32 | 256 |
| 32 &lt;= EIU &lt; 64 | 512 |
| EIU &gt;= 64 | 1024 |
For AnalyticDB for MySQL clusters in Reserved mode, the default number of shards depends on the number of node groups. The relationship is as follows:
| Number of node groups | Number of shards |
|---|---|
| 1 | 16 |
| 2 &lt;= node groups &lt; 4 | 32 |
| 4 &lt;= node groups &lt; 8 | 64 |
| 8 &lt;= node groups &lt; 16 | 128 |
| 16 &lt;= node groups &lt; 32 | 256 |
| 32 &lt;= node groups &lt; 64 | 512 |
| &gt;= 64 | 1024 |
Note: For more information about the default number of shards, see FAQ.
Partition key
Design goal: Further partition each shard to improve data access performance. If a single shard contains a large amount of data after you specify a distribution key, you can subdivide each shard using a partition key. You can also use the LIFECYCLE setting to manage the data lifecycle and enable efficient deletion of old data at scale.
Specify a partition key using the PARTITION BY VALUE clause:
```sql
-- Partition by the raw column value
PARTITION BY VALUE(column_name)
-- Partition by day (e.g., 20210101)
PARTITION BY VALUE(DATE_FORMAT(column_name, '%Y%m%d'))
PARTITION BY VALUE(FROM_UNIXTIME(column_name, '%Y%m%d'))
-- Partition by month (e.g., 202101)
PARTITION BY VALUE(DATE_FORMAT(column_name, '%Y%m'))
PARTITION BY VALUE(FROM_UNIXTIME(column_name, '%Y%m'))
-- Partition by year (e.g., 2021)
PARTITION BY VALUE(DATE_FORMAT(column_name, '%Y'))
PARTITION BY VALUE(FROM_UNIXTIME(column_name, '%Y'))
```
Use the LIFECYCLE N keyword to set a retention window. AnalyticDB for MySQL keeps the most recent N partitions and automatically drops older ones.
Choosing a partition key
Supported granularities: year, month, day, or raw value. There is no finer granularity.
A data volume of 300 million to 1 billion rows for each subpartition is considered a reasonable division. If the data volume is less than 300 million rows, the partitioning granularity is too small. You can increase the granularity, for example, from day to month. If the data volume is greater than 1 billion rows, the partitioning granularity is too large. You can decrease the granularity, for example, from month to day.
Match granularity to data volume: too coarse a granularity (for example, partitioning by year for a table that grows by millions of rows per day) creates oversized partitions that degrade both query and write performance. Too fine a granularity creates an excessive number of small partitions with the same result.
Treat partitions as stable units: avoid designs where many historical partitions are updated frequently every day. Frequent updates across many partitions indicate a poor partition column choice.
Set a lifecycle: because the maximum number of partitions per table is limited, data in a partitioned table cannot be retained indefinitely. Always define a `LIFECYCLE` value to prevent hitting this limit. For partition limits, see Limits.
You can run the following command to view the partition information of a table.
```sql
SELECT
  partition_id,     -- Partition name
  row_count,        -- Total number of rows in the partition
  local_data_size,  -- Size of the local storage occupied by the partition
  index_size,       -- Index size of the partition
  pk_size,          -- Size of the primary key index of the partition
  remote_data_size  -- Size of the remote storage occupied by the partition
FROM information_schema.kepler_partitions
WHERE schema_name = '<db_name>' AND table_name = '<tbl_name>';
```
To diagnose whether your partition column is a good choice, use Distribution field reasonability diagnostics.
Primary key
A primary key uniquely identifies each row. Only tables with a primary key support DELETE and UPDATE operations.
```sql
PRIMARY KEY (column_name, ...)
```
Choosing a primary key
Must include the distribution key and partition key: the primary key must contain every column in the distribution key and every column in the partition key. Place the distribution key and partition key columns first in the composite primary key definition.
Prefer numeric columns: numeric primary key fields improve lookup performance compared to string fields.
Use as few columns as possible: a narrower primary key reduces index overhead.
For details on the PRIMARY KEY syntax, see CREATE TABLE.
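Because only tables with a primary key support DELETE and UPDATE, the sketch below defines a hypothetical user_profile table with a primary key and then modifies individual rows. Table and column names are illustrative.

```sql
CREATE TABLE user_profile (
  user_id bigint NOT NULL,
  nickname varchar NOT NULL,
  city varchar NOT NULL,
  PRIMARY KEY (user_id)
) DISTRIBUTED BY HASH(user_id);

-- Both statements work only because the table defines a primary key;
-- on a table without one, row-level UPDATE and DELETE are not supported.
UPDATE user_profile SET city = 'Hangzhou' WHERE user_id = 1001;
DELETE FROM user_profile WHERE user_id = 1002;
```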
Clustered index key
A clustered index physically orders table rows by the index key value. Queries that consistently filter on the clustered index key benefit from strong data locality—matching rows are stored together on disk.
When to use a clustered index
Clustered indexes are not intended for all tables. Each table supports only one clustered index, and maintaining it consumes CPU resources. Use a clustered index only when:
Queries consistently filter on a specific column or set of columns, and
The query response time improvement justifies the additional CPU cost of maintaining the index.
Example: in a student information system where each student queries only their own grades, define the student ID as the clustered index key. All rows for a given student are stored contiguously, so the query reads a single sequential block instead of scanning the table.
For the syntax to create a clustered index, see CREATE TABLE.
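For the student-grades example above, a minimal sketch of the table definition follows, assuming the `CLUSTERED KEY` clause for declaring the clustered index; the index, table, and column names are illustrative, and CREATE TABLE is the authoritative syntax reference.

```sql
CREATE TABLE student_grades (
  student_id bigint NOT NULL,
  course_id bigint NOT NULL,
  score int NOT NULL,
  PRIMARY KEY (student_id, course_id),
  CLUSTERED KEY idx_student (student_id)
) DISTRIBUTED BY HASH(student_id);

-- Rows for one student are stored contiguously on disk, so this query
-- reads a single sequential block instead of scanning scattered rows.
SELECT course_id, score
FROM student_grades
WHERE student_id = 2023001;
```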
Example
Create a customer table with the following schema:
Distribution key: `customer_id` — distributes rows evenly and supports customer-based join queries
Partition key: `login_time` partitioned by day (`%Y%m%d`) — enables efficient expiration of old data
Lifecycle: 30 — retains data for the 30 most recent partitions
Primary key: `(login_time, customer_id, phone_num)` — starts with the partition key and distribution key
```sql
CREATE TABLE customer (
  customer_id bigint NOT NULL COMMENT 'Customer ID',
  customer_name varchar NOT NULL COMMENT 'Customer name',
  phone_num bigint NOT NULL COMMENT 'Phone number',
  city_name varchar NOT NULL COMMENT 'City',
  sex int NOT NULL COMMENT 'Gender',
  id_number varchar NOT NULL COMMENT 'ID card number',
  home_address varchar NOT NULL COMMENT 'Home address',
  office_address varchar NOT NULL COMMENT 'Office address',
  age int NOT NULL COMMENT 'Age',
  login_time timestamp NOT NULL COMMENT 'Logon time',
  PRIMARY KEY (login_time, customer_id, phone_num)
)
DISTRIBUTED BY HASH(customer_id)
PARTITION BY VALUE(DATE_FORMAT(login_time, '%Y%m%d')) LIFECYCLE 30
COMMENT 'Customer information table';
```
The primary key begins with login_time (partition key) and customer_id (distribution key), followed by phone_num to enforce row-level uniqueness.
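A query that benefits from all three design choices of this schema might look as follows; the date literal and customer ID are illustrative. The time-range filter on login_time prunes the scan to a single day's partition, and the equality filter on customer_id routes the query to a single shard.

```sql
SELECT customer_name, phone_num, login_time
FROM customer
WHERE login_time >= '2021-01-01 00:00:00'
  AND login_time <  '2021-01-02 00:00:00'
  AND customer_id = 12345;
```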
FAQ
How do I view all partitions of a table and their statistics?
Run the following query against information_schema.kepler_partitions:
```sql
SELECT
  partition_id,     -- Partition name
  row_count,        -- Total rows in the partition
  local_data_size,  -- Local storage used by the partition
  index_size,       -- Index size
  pk_size,          -- Primary key index size
  remote_data_size  -- Remote storage used by the partition
FROM information_schema.kepler_partitions
WHERE schema_name = '$DB'
  AND table_name = '$TABLE'
  AND partition_id > 0;
```
Partitions in incremental data for which compaction has not been triggered do not appear in this query. To get a real-time list of all partitions, run `SELECT DISTINCT $partition_column FROM $db.$table;`.
What determines the number of shards? Can I change it?
The number of shards is calculated automatically from the cluster specifications at creation time. It cannot be changed manually.
Does scaling the cluster up or down change the number of shards?
No. Cluster upgrades and downgrades do not affect the number of shards.
Can I change the distribution key or partition key after creating a table?
No. The distribution key and partition key cannot be modified on an existing table; to change them, you must recreate the table. For the table modifications that are supported, see ALTER TABLE.
What's next
CREATE TABLE — full syntax reference for table creation
ALTER TABLE — options for modifying an existing table
Distribution field reasonability diagnostics — diagnose distribution and partition field choices
Limits — partition and table limits