This topic describes how to select the table type, partition key, distribution key, primary key, and clustered index of an AnalyticDB for MySQL table to optimize the table performance.

Select a replicated table or a fact table

  • A replicated table stores a data replica on each node of the cluster to which it belongs. We recommend that you limit the size of each replicated table to store a maximum of 20,000 rows.
  • A fact table is also a partition table. It is designed to make full use of the query capability of a distributed system. Each fact table can store tens of millions or even hundreds of millions of data entries.

Select an appropriate distribution key

To create a fact table in an AnalyticDB for MySQL cluster, you must use the DISTRIBUTED BY HASH(column_name,...) statement to specify the distribution key. Then, the table is partitioned by the hash value of the columns specified by column_name. AnalyticDB for MySQL allows you to select multiple columns as the distribution key.

You can specify the distribution key based on the following rules:

  • Select fields that can be used to join tables as the distribution key. For example, you can select the user_id field as the distribution key when you analyze data by user or group users.
  • Select fields whose values are evenly distributed as the distribution key, such as the transaction ID, device ID, user ID, or auto-increment columns.

Select an appropriate partition key

If you need to import incremental data for your business, you can specify both the distribution key and partition key when you create a fact table. You can use the partition key to synchronize incremental data.

When you create a fact table, you can use the PARTITION BY {VALUE(column_name) | VALUE(date_format(column_name, ?)} statement to specify partitions.

For example, PARTITION BY VALUE(column_name) specifies that the table is partitioned based on the column specified by the column_name field. PARTITION BY VALUE(DATE_FORMAT(column_name, '%Y%m%d')) specifies that the table is partitioned based on the column specified by the column_name field after the column is formatted to a date format such as 20190101.

# Use the value of the ds field to partition the table.
PARTITION BY VALUE(ds)
# Format the value of the ds field into the day and use the day to partition the table.
PARTITION BY VALUE(DATE_FORMAT(ds, '%Y%m%d'))
# Format the value of the ds field into the month and use the month to partition the table.
PARTITION BY VALUE(DATE_FORMAT(ds, '%Y%m'))
# Format the value of the ds field into the year and use the year to partition the table.
PARTITION BY VALUE(DATE_FORMAT(ds, '%Y'))

You can specify partitions based on the following rules:

  • If a table contains a large amount of data, specify an appropriate number of list partitions in the table. Otherwise, the performance of AnalyticDB for MySQL may be reduced.
  • If you use the PARTITION BY Value(column_name | date_format) statement to specify list partitions while creating a table, the table is partitioned as specified. A table can be partitioned by year, month, day, or the original value. Partition a table properly to ensure an appropriate amount of data in each partition. Otherwise, the read and write performance or stability of AnalyticDB for MySQL may be reduced.
  • The appropriate range of data entries in a single list partition is 300 million to 1 billion. If a list partition contains less than 300 million entries, you can partition the table by month instead of by day. If a list partition contains more than 1 billion entries, you can partition the table by day instead of by month.
  • We recommend that you keep data at rest in list partitions. If you frequently update a list partition, select an appropriate field for this list partition.

Select a clustered index

The logical order of the key values in a clustered index determines the physical order of the corresponding rows in a table. Each table supports only one clustered index.

You can specify the clustered index based on the following rules:

Select a field that is always included in data queries as the clustered index. For example, on a data analysis platform for e-commerce sellers, each seller views only their own data. You can select the seller ID as the clustered index to ensure data locality and improve data query performance.

Select an appropriate primary key

You can specify the primary key in a table to remove duplicate data. Only tables with the primary key specified allow you to update data by using the DELETE or UPDATE statement.

You can specify the primary key based on the following rules:

  • Select fields of the numeric type as the primary key to ensure better table performance.

    AnalyticDB for MySQL allows you to combine strings or fields as the primary key.

  • The primary key must contain the distribution key and partition key.