This topic describes how to design the schema of an AnalyticDB for MySQL table to optimize the table performance. The schema includes the table type, partition key, distribution key, primary key, and clustered index of the AnalyticDB for MySQL table.

Select the table type

AnalyticDB for MySQL supports replicated tables and fact tables. When you select the table type, take note of the following items:
  • A replicated table stores a data replica on each node of the cluster to which it belongs. We recommend that you limit the volume of data stored in each replicated table to a maximum of 20,000 rows.
  • A fact table is also a partitioned table. It can make full use of the query capability of a distributed system to improve query efficiency. Each fact table can store up to hundreds of millions of data entries.

Select the distribution key

If you need to import incremental data for your business, you can specify both the distribution key and the partition key to synchronize incremental data when you create a fact table. To create a table in an AnalyticDB for MySQL cluster, you must use the DISTRIBUTED BY HASH(column_name,...) statement to specify the distribution key. Then, the table is partitioned by the hash value of the columns specified by the column_name field. For more information, see CREATE TABLE.

  • Syntax
    DISTRIBUTED BY HASH(column_name,...)
  • Precautions
    • Select fields that can be used to join tables as the distribution key. For example, you can select the customer_id field as the distribution key when you want to query historical orders by customer.

      AnalyticDB for MySQL allows you to select multiple fields as the distribution key.

    • Select fields whose values are evenly distributed as the distribution key, such as the transaction ID, device ID, user ID, or auto-increment columns.
    • If you do not specify the distribution key when you create a table, the following rules apply:
      • If the table has a primary key, the primary key is used as the distribution key.
      • If the table does not have a primary key, the __adb_auto_id__ field is added and used as the primary key and the distribution key.
    • By default, an AnalyticDB for MySQL V3.0 Elastic mode for Cluster Edition cluster that has 8 or 16 cores has 32 shards.
    • By default, the number of shards in an AnalyticDB for MySQL V3.0 Elastic mode for Cluster Edition cluster that has 32 or more cores is determined by the number of elastic I/O units (EIUs). The following table describes the mappings between the number of shards and the number of EIUs.
      EIUs Shards
      EIUs = 1 32
      2 <= EIUs < 4 64
      4 <= EIUs < 8 128
      8 <= EIUs < 16 192
      16 <= EIUs < 32 256
      32 <= EIUs < 64 512
      EIU >= 64 1,024
    • By default, the number of shards in an AnalyticDB for MySQL V3.0 Reserved mode for Cluster Edition cluster is determined by the number of node groups. The following table describes the mappings between the number of shards and the number of node groups.
      Node groups Shards
      Node groups = 1 16
      2 <= Node groups < 4 32
      4 <= Node groups < 8 64
      8 <= Node groups < 16 128
      16 <= Node groups < 32 256
      32 <= Node groups < 64 512
      Node groups >= 64 1,024
      Note For frequently asked questions about the default number of shards, see FAQ.

Select the partition key

If the amount of data in a single shard is large after the distribution key is specified, you can specify partitions in the shard by using the partition key to improve the performance of the shard during data access. When you create a table, you can specify list partitions by using the PARTITION BY clause. Then, data is classified based on the preceding partitions. For more information, see CREATE TABLE.

  • Syntax
    • Use the value of the column_name field to partition the table. Syntax:
      PARTITION BY VALUE(column_name)
    • Change the value of the column_name field into the %Y%m%d format, such as 20210101. Syntax:
      PARTITION BY VALUE(DATE_FORMAT(column_name, '%Y%m%d'))
    • Change the value of the column_name field into the %Y%m format, such as 202101. Syntax:
      PARTITION BY VALUE(DATE_FORMAT(column_name, '%Y%m'))
    • Change the value of the column_name field into the %Y format, such as 2021. Syntax:
      PARTITION BY VALUE(DATE_FORMAT(column_name, '%Y'))
  • Precautions
    • If a table contains a large amount of data, specify an appropriate number of list partitions in the table. Otherwise, the performance of the AnalyticDB for MySQL cluster may degrade.
    • A table can be partitioned by year, month, day, or the original value. Properly partition a table to ensure an appropriate amount of data in each partition. Otherwise, the query and write performance or stability of the AnalyticDB for MySQL cluster may degrade.
    • The appropriate range of data entries in a single list partition is from 300 million to 1 billion. If a list partition contains less than 300 million data entries, you can partition the table by month instead of by day. If a list partition contains more than 1 billion data entries, you can partition the table by day instead of by month.
    • We recommend that you keep data at rest in list partitions. If you frequently update a list partition, select an appropriate field for this list partition.
    • LIFECYCLE N is used for lifecycle management of the table. Partitions are sorted and only N partitions are retained. The rest is filtered out.
      Note Data within a partitioned table cannot be permanently retained because of the limited number of partitions that a table can support. For more information about the limit on the number of partitions, see Limits.

Select the primary key

A primary key serves as a unique identifier for a record. When you create a table, you can define the primary key by using the PRIMARY KEY clause. For more information, see CREATE TABLE.

  • Syntax
    PRIMARY KEY (column_name,...)
  • Precautions
    • Operations such as DELETE and UPDATE are supported only for tables for which primary keys are defined.
    • The primary key of an AnalyticDB for MySQL table can be a single field or a combination of multiple fields. We recommend that you use a single field of the NUMERIC type as the primary key to ensure high performance.
    • A primary key must include a distribution key and a partition key. We recommend that you put a distribution key and a partition key in the front section of a composite primary key.

Select the clustered index

The logical order of the key values in a clustered index determines the physical order of the corresponding rows in a table. When you select the clustered index, take note of the following items:
  • Each table can support a single clustered index. For more information about how to create a clustered index, see CREATE TABLE.
  • We recommend that you select a field that is always included in data queries as the clustered index. For example, you want to allow students to be able to query only their own final exam grades in an educational administration system. In this case, the student ID can be used as the clustered index to ensure data locality and improve data query performance.
  • A clustered index sorts the entire table, which causes lower data write performance and high CPU utilization. In this case, we recommend that you do not use clustered indexes.

Example

Create a customer table that meets the following requirements:
  • The customer table is partitioned by using the login_time field, and the logon time is changed into the %Y%m%d format.
  • Only data in the last 30 partitions is retained, which indicates that the lifecycle of the table is 30.
  • The data in the customer table is distributed by using the customer_id field.
  • The login_time, customer_id, phone_num field combination is set as the composite primary key.
The following statements are used to create the customer table:
CREATE TABLE customer (
customer_id bigint NOT NULL COMMENT 'Customer ID',
customer_name varchar NOT NULL COMMENT 'Customer name',
phone_num bigint NOT NULL COMMENT 'Phone number',
city_name varchar NOT NULL COMMENT 'City',
sex int NOT NULL COMMENT 'Gender',
id_number varchar NOT NULL COMMENT 'ID card number',
home_address varchar NOT NULL COMMENT 'Home address',
office_address varchar NOT NULL COMMENT 'Office address',
age int NOT NULL COMMENT 'Age',
login_time timestamp NOT NULL COMMENT 'Logon time',
PRIMARY KEY (login_time, customer_id, phone_num)
 )
DISTRIBUTED BY HASH(customer_id)
PARTITION BY VALUE(DATE_FORMAT(login_time, '%Y%m%d')) LIFECYCLE 30
COMMENT 'Customer information table';

FAQ

  • Q: How do I view all list partitions of a table and their statistics after I partition the table into list partitions?
    A: Execute the following statements to view all list partitions of the table and their statistics:
    SELECT partition_id, -- Partition name
              row_count, -- Total number of rows in the partition
              local_data_size, -- Size of the space occupied by the partition stored in local storage
              index_size, -- Size of the partition index
              pk_size, -- Size of the primary key of the index
              remote_data_size -- Size of the space occupied by the partition stored in remote storage
    FROM information_schema.kepler_partitions
    WHERE schema_name = '$DB'
     AND table_name ='$TABLE' 
     AND partition_id > 0;
    Notice Partitions in incremental data for which compaction is not triggered are not returned. To view all real-time list partitions, execute the select distinct $partition_column from $db.$table; statement.
  • Q: What factors affect the number of shards? Can I manually adjust the number of shards?

    A: The number of shards is calculated by the system based on the initial specifications of the cluster that are specified when the cluster is created. You cannot manually adjust the number of shards.

  • Q: Is the number of shards affected if I modify the specifications of the cluster?

    A: No, the number of shards is not affected if you modify the specifications of the cluster.