This topic describes how to design the schema of an AnalyticDB for MySQL table to optimize table performance. The schema includes the table type, partition key, distribution key, primary key, and clustered index of the AnalyticDB for MySQL table.

Select a table type

AnalyticDB for MySQL supports replicated tables and fact tables. When you select a table type, take note of the following items:
  • A replicated table stores a data replica on each node of the cluster to which the table belongs. We recommend that you limit the volume of data stored in each replicated table to a maximum of 20,000 rows.
  • A fact table is also a partitioned table. It can make full use of the query capability of a distributed system to improve query efficiency. Each fact table can store hundreds of billions of data entries.

Select a distribution key

If you need to import incremental data for your business, you can specify both a distribution key and a partition key to synchronize incremental data when you create a fact table. To create a table in an AnalyticDB for MySQL cluster, you must use the DISTRIBUTED BY HASH(column_name,...) statement to specify the distribution key. Then, the table is partitioned by the hash value of the columns specified by column_name. For more information, see CREATE TABLE.

  • Syntax
    DISTRIBUTED BY HASH(column_name,...)
  • Precautions
    • Select fields whose values are evenly distributed as the distribution key, such as the transaction ID, device ID, user ID, or auto-increment columns.
      Note Do not select fields of the date, time, and timestamp types as the distribution key. These fields are prone to data skew during data writing and may degrade write performance. Most queries are limited to a period of time such as a day or a month. In this case, the data to be queried may exist on only a single node, and queries cannot leverage the processing capabilities of all nodes in a distributed database system. We recommend that you select fields of the date and time types as the partition key. For more information, see the "Select a partition key" section of this topic.
    • Select fields that can be used to join tables as the distribution key to reduce data shuffles. For example, you can select the customer_id field as the distribution key if you want to query historical orders by customer.
    • Select the frequently used query condition fields as the distribution key to filter data.
    • Each table can contain only a single distribution key. Each distribution key can contain one or more fields. Select fewer fields to make the distribution key more suitable for a variety of complex queries.
    • If you do not specify the distribution key when you create a table, the following rules apply:
      • If the table has a primary key, the primary key is used as the distribution key.
      • If the table does not have a primary key, the __adb_auto_id__ field is added and used as the primary key and the distribution key.

Select a partition key

If a single shard contains a large amount of data after the distribution key is specified, you can specify partitions in the shard by using the partition key to improve the performance of the shard during data access. When you create a table, you can specify list partitions by using the PARTITION BY clause. Then, data is classified based on the preceding partitions. For more information, see CREATE TABLE.

  • Syntax
    • Use the value of column_name to partition the table.
      PARTITION BY VALUE(column_name)
    • Change the value of column_name into the %Y%m%d format. Example: 20210101.
      PARTITION BY VALUE(DATE_FORMAT(column_name, '%Y%m%d'))
    • Change the value of column_name into the %Y%m format. Example: 202101.
      PARTITION BY VALUE(DATE_FORMAT(column_name, '%Y%m'))
    • Change the value of column_name into the %Y format. Example: 2021.
      PARTITION BY VALUE(DATE_FORMAT(column_name, '%Y'))
  • Precautions
    • If a table contains a large amount of data, specify an appropriate number of list partitions in the table. Otherwise, the performance of the AnalyticDB for MySQL cluster may degrade.
    • A table can be partitioned by year, month, day, or the original value. Properly partition a table to ensure an appropriate amount of data in each partition. Otherwise, the query and write performance or stability of the AnalyticDB for MySQL cluster may degrade.
    • The appropriate range of data entries in a single list partition is from 300 million to 1 billion. If a list partition contains less than 300 million data entries, you can partition the table by month instead of by day. If a list partition contains more than 1 billion data entries, you can partition the table by day instead of by month.
    • We recommend that you keep data at rest in list partitions. If multiple existing list partitions are frequently updated each day, check whether the partition key fields are appropriate.
    • LIFECYCLE N is used for lifecycle management of a table. Partitions are sorted and only N partitions are retained. The rest is filtered out.
      Important Data within a partitioned table cannot be permanently retained because of the limited number of partitions that a table can support. For more information about the limit on the number of partitions, see Limits.

Select a primary key

A primary key serves as a unique identifier for a record. When you create a table, you can specify the primary key by using the PRIMARY KEY clause. For more information, see CREATE TABLE.

  • Syntax
    PRIMARY KEY (column_name,...)
  • Precautions
    • Operations such as DELETE and UPDATE are supported only for tables for which a primary key is specified.
    • The primary key of an AnalyticDB for MySQL table can be a single field or a combination of multiple fields. We recommend that you select one or a small number of numeric fields as the primary key to ensure high performance.
    • A primary key must include a distribution key and a partition key. We recommend that you put a distribution key and a partition key in the front section of a composite primary key.

Select a clustered index

The logical order of key values in a clustered index determines the physical order of the corresponding rows in a table. When you select a clustered index, take note of the following items:
  • Each table can support a single clustered index. For more information about how to create a clustered index, see CREATE TABLE.
  • We recommend that you select a field that is always included in data queries as the clustered index. For example, assume that you want to allow students to be able to query only their own final exam grades in an educational administration system. In this case, the student ID can be used as the clustered index to ensure data locality and improve data query performance.
  • A clustered index sorts the entire table, which causes lower data write performance and high CPU utilization. In this case, we recommend that you do not use clustered indexes.

Example

Create a table named customer that meets the following requirements:
  • The data in the customer table is partitioned by using the login_time field that is in the %Y%m%d format.
  • Only data in the last 30 partitions is retained, which indicates that the lifecycle of the table is 30.
  • The data in the customer table is distributed by using the customer_id field.
  • login_time, customer_id, phone_num is set as the composite primary key.
The following statement is used to create the customer table:
CREATE TABLE customer (
customer_id bigint NOT NULL COMMENT 'Customer ID',
customer_name varchar NOT NULL COMMENT 'Customer name',
phone_num bigint NOT NULL COMMENT 'Phone number',
city_name varchar NOT NULL COMMENT 'City',
sex int NOT NULL COMMENT 'Gender',
id_number varchar NOT NULL COMMENT 'ID card number',
home_address varchar NOT NULL COMMENT 'Home address',
office_address varchar NOT NULL COMMENT 'Office address',
age int NOT NULL COMMENT 'Age',
login_time timestamp NOT NULL COMMENT 'Logon time',
PRIMARY KEY (login_time, customer_id, phone_num)
 )
DISTRIBUTED BY HASH(customer_id)
PARTITION BY VALUE(DATE_FORMAT(login_time, '%Y%m%d')) LIFECYCLE 30
COMMENT 'Customer information table';

FAQ

  • Q: How do I query all list partitions of a table and their statistics after I partition the table into list partitions?
    A: Execute the following statement to query all list partitions of the table and their statistics:
    SELECT partition_id, -- Partition name
              row_count, -- Total number of rows in the partition
              local_data_size, -- Size of the space occupied by the partition stored in local storage
              index_size, -- Size of the partition index
              pk_size, -- Size of the primary key of the index
              remote_data_size -- Amount of space occupied by the partition stored in remote storage
    FROM information_schema.kepler_partitions
    WHERE schema_name = '$DB'
     AND table_name ='$TABLE' 
     AND partition_id > 0;
    Important Partitions in incremental data for which compaction is not triggered are not returned. To view all real-time list partitions, execute the SELECT distinct $partition_column from $db.$table; statement.
  • Q: What factors affect the number of shards? Can I manually adjust the number of shards?

    A: The number of shards is calculated by the system based on the initial specifications of the cluster that are specified when the cluster is created. The number of shards cannot be manually adjusted.

  • Q: Is the number of shards affected if I change cluster specifications?

    A: No, the number of shards is not affected by changes to cluster specifications.

  • Q: Can I change distribution or partition keys in AnalyticDB for MySQL?

    A: No, you cannot change distribution or partition keys in AnalyticDB for MySQL. If you want to use a different distribution or partition key for your table, perform the operations described in the "Modify a partition or distribution key" section of the ALTER TABLE topic.