Schema design determines how AnalyticDB for MySQL stores and retrieves data. A well-designed schema keeps rows that are queried together on the same node, balances each node's workload, and avoids data hotspots. This document covers table types, distribution keys, partition keys, primary keys, and clustered index keys.
Table types
AnalyticDB for MySQL provides two table types. Choose based on data volume:
| Table type | How data is stored | Recommended data volume |
|---|---|---|
| Replicated table | A full copy on every cluster node | Up to 20,000 rows |
| Standard table (partitioned table) | Sharded across nodes by distribution key | Tens of millions to hundreds of billions of rows |
Use replicated tables for small dimension tables that are frequently joined against large fact tables. Use standard tables for everything else.
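As a sketch of the two table types, the statements below create a small dimension table as a replicated table and a large fact table as a standard table. Table and column names are illustrative, and the `DISTRIBUTED BY BROADCAST` clause is assumed here as the replicated-table syntax; see CREATE TABLE for the authoritative syntax.

```sql
-- Small dimension table: a full copy is stored on every node (replicated table).
CREATE TABLE dim_region (
  region_id int NOT NULL,
  region_name varchar NOT NULL,
  PRIMARY KEY (region_id)
) DISTRIBUTED BY BROADCAST;

-- Large fact table: sharded across nodes by the distribution key (standard table).
CREATE TABLE fact_orders (
  order_id bigint NOT NULL,
  region_id int NOT NULL,
  amount decimal(16, 2) NOT NULL,
  PRIMARY KEY (order_id)
) DISTRIBUTED BY HASH(order_id);
```

A join between fact_orders and dim_region never needs to move the dimension table across the network, because every node already holds a full copy of it.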
Distribution key
Design goal: Minimize data movement during query execution by co-locating rows that are queried together on the same node.
When you create a standard table, specify a distribution key using the DISTRIBUTED BY HASH clause. AnalyticDB for MySQL shards the table by hashing the distribution key value and routing each row to the corresponding node.
```sql
DISTRIBUTED BY HASH(column_name, ...)
```
Choosing a distribution key
Select a distribution key based on these criteria, in order of priority:
Even value distribution — pick a high-cardinality column so rows spread evenly across nodes. Good choices include transaction IDs, device IDs, user IDs, and auto-increment columns. Uneven distribution causes data skew and degrades write performance.
Join columns — if a query joins two tables on a specific column, use that column as the distribution key for both tables. This routes matching rows to the same node and eliminates cross-node data shuffles. For example, to query historical orders by customer, use `customer_id` as the distribution key.
GROUP BY columns — if queries group rows by a specific field, selecting it as the distribution key lets the query optimizer use the data distribution property to perform the grouping locally on each node. This reduces network overhead between nodes and can save up to 60% of the computing time for the group operator.
Frequent filter columns — columns that appear often in `WHERE` clauses let AnalyticDB for MySQL read only the relevant shards and skip the rest.
Fewest fields possible — each table has exactly one distribution key, but it can span multiple columns. Fewer columns in the distribution key make it useful across a wider variety of queries.
Do not use DATE, TIME, or TIMESTAMP columns as the distribution key. Date-type values cluster data into a small number of hash buckets, causing severe data skew and degrading write performance. Most queries also filter by a time range, which concentrates all matching rows on a single node—defeating the distributed architecture. Use fields of the DATE or TIME types as partition keys instead.
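To illustrate the join-column criterion above, the sketch below distributes two hypothetical tables on the same column, so a join on `customer_id` finds all matching rows on the same node. Table and column names are illustrative.

```sql
CREATE TABLE customer_info (
  customer_id bigint NOT NULL,
  customer_name varchar NOT NULL,
  PRIMARY KEY (customer_id)
) DISTRIBUTED BY HASH(customer_id);

CREATE TABLE orders (
  customer_id bigint NOT NULL,
  order_id bigint NOT NULL,
  order_amount decimal(16, 2) NOT NULL,
  PRIMARY KEY (customer_id, order_id)
) DISTRIBUTED BY HASH(customer_id);

-- Matching rows of both tables live on the same node, so this join
-- runs node-locally and requires no cross-node shuffle.
SELECT c.customer_name, SUM(o.order_amount)
FROM orders o
JOIN customer_info c ON o.customer_id = c.customer_id
GROUP BY c.customer_name;
```

Note that both primary keys begin with the distribution key column, which the Primary key section below requires.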
Default behavior when no distribution key is specified:
If the table has a primary key, AnalyticDB for MySQL uses the primary key as the distribution key.
If the table has no primary key, the system adds an `__adb_auto_id__` column and uses it as both the primary key and the distribution key.
For AnalyticDB for MySQL clusters in Elastic mode with 8 or 16 cores, the default number of shards is 32.
For AnalyticDB for MySQL clusters in Elastic mode with 32 or more cores, the default number of shards depends on the number of Elastic I/O Units (EIUs). The relationship is as follows:
| Number of EIUs | Number of shards |
|---|---|
| EIU = 1 | 32 |
| 2 &lt;= EIU &lt; 4 | 64 |
| 4 &lt;= EIU &lt; 8 | 128 |
| 8 &lt;= EIU &lt; 16 | 192 |
| 16 &lt;= EIU &lt; 32 | 256 |
| 32 &lt;= EIU &lt; 64 | 512 |
| EIU &gt;= 64 | 1024 |
For AnalyticDB for MySQL clusters in Reserved mode, the default number of shards depends on the number of node groups. The relationship is as follows:
| Number of node groups | Number of shards |
|---|---|
| 1 | 16 |
| 2 &lt;= node groups &lt; 4 | 32 |
| 4 &lt;= node groups &lt; 8 | 64 |
| 8 &lt;= node groups &lt; 16 | 128 |
| 16 &lt;= node groups &lt; 32 | 256 |
| 32 &lt;= node groups &lt; 64 | 512 |
| &gt;= 64 | 1024 |
Note: For more information about the default number of shards, see FAQ.
Partition key
Design goal: Further partition each shard to improve data access performance. If a single shard contains a large amount of data after you specify a distribution key, you can subdivide each shard using a partition key. You can also use the LIFECYCLE setting to manage the data lifecycle and enable efficient deletion of old data at scale.
Specify a partition key using the PARTITION BY VALUE clause:
```sql
-- Partition by the raw column value
PARTITION BY VALUE(column_name)
-- Partition by day (e.g., 20210101)
PARTITION BY VALUE(DATE_FORMAT(column_name, '%Y%m%d'))
PARTITION BY VALUE(FROM_UNIXTIME(column_name, '%Y%m%d'))
-- Partition by month (e.g., 202101)
PARTITION BY VALUE(DATE_FORMAT(column_name, '%Y%m'))
PARTITION BY VALUE(FROM_UNIXTIME(column_name, '%Y%m'))
-- Partition by year (e.g., 2021)
PARTITION BY VALUE(DATE_FORMAT(column_name, '%Y'))
PARTITION BY VALUE(FROM_UNIXTIME(column_name, '%Y'))
```
Use the LIFECYCLE N keyword to set a retention window. AnalyticDB for MySQL keeps the most recent N partitions and automatically drops older ones.
Choosing a partition key
Supported granularities: year, month, day, or raw value. There is no finer granularity.
A data volume of 300 million to 1 billion rows for each subpartition is considered a reasonable division. If the data volume is less than 300 million rows, the partitioning granularity is too small. You can increase the granularity, for example, from day to month. If the data volume is greater than 1 billion rows, the partitioning granularity is too large. You can decrease the granularity, for example, from month to day.
Match granularity to data volume: too coarse a granularity (for example, partitioning by year for a table that grows by millions of rows per day) creates oversized partitions that degrade both query and write performance. Too fine a granularity creates an excessive number of small partitions with the same result.
Treat partitions as stable units: avoid designs where many historical partitions are updated frequently every day. Frequent updates across many partitions indicate a poor partition column choice.
Set a lifecycle: because the maximum number of partitions per table is limited, data in a partitioned table cannot be retained indefinitely. Always define a `LIFECYCLE` value to prevent hitting this limit. For partition limits, see Limits.
You can run the following command to view the partition information of a table.
```sql
SELECT
  partition_id,     -- Partition name
  row_count,        -- Total number of rows in the partition
  local_data_size,  -- Size of the local storage occupied by the partition
  index_size,       -- Index size of the partition
  pk_size,          -- Size of the primary key index of the partition
  remote_data_size  -- Size of the remote storage occupied by the partition
FROM information_schema.kepler_partitions
WHERE schema_name = '<db_name>' AND table_name = '<tbl_name>';
```
To diagnose whether your partition column is a good choice, use Distribution field reasonability diagnostics.
Primary key
A primary key uniquely identifies each row. Only tables with a primary key support DELETE and UPDATE operations.
```sql
PRIMARY KEY (column_name, ...)
```
Choosing a primary key
Must include the distribution key and partition key: the primary key must contain every column in the distribution key and every column in the partition key. Place the distribution key and partition key columns first in the composite primary key definition.
Prefer numeric columns: numeric primary key fields improve lookup performance compared to string fields.
Use as few columns as possible: a narrower primary key reduces index overhead.
For details on the PRIMARY KEY syntax, see CREATE TABLE.
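Because only tables with a primary key support DELETE and UPDATE, the sketch below defines a hypothetical user_profile table with a primary key and then modifies individual rows. Table and column names are illustrative.

```sql
CREATE TABLE user_profile (
  user_id bigint NOT NULL,
  nickname varchar NOT NULL,
  city varchar NOT NULL,
  PRIMARY KEY (user_id)
) DISTRIBUTED BY HASH(user_id);

-- Both statements work only because the table defines a primary key;
-- on a table without one, row-level UPDATE and DELETE are not supported.
UPDATE user_profile SET city = 'Hangzhou' WHERE user_id = 1001;
DELETE FROM user_profile WHERE user_id = 1002;
```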
Clustered index key
A clustered index physically orders table rows by the index key value. Queries that consistently filter on the clustered index key benefit from strong data locality—matching rows are stored together on disk.
When to use a clustered index
Clustered indexes are not intended for all tables. Each table supports only one clustered index, and maintaining it consumes CPU resources. Use a clustered index only when:
Queries consistently filter on a specific column or set of columns, and
The query response time improvement justifies the additional CPU cost of maintaining the index.
Example: in a student information system where each student queries only their own grades, define the student ID as the clustered index key. All rows for a given student are stored contiguously, so the query reads a single sequential block instead of scanning the table.
For the syntax to create a clustered index, see CREATE TABLE.
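For the student-grades example above, a minimal sketch of the table definition follows, assuming the `CLUSTERED KEY` clause for declaring the clustered index; the index, table, and column names are illustrative, and CREATE TABLE is the authoritative syntax reference.

```sql
CREATE TABLE student_grades (
  student_id bigint NOT NULL,
  course_id bigint NOT NULL,
  score int NOT NULL,
  PRIMARY KEY (student_id, course_id),
  CLUSTERED KEY idx_student (student_id)
) DISTRIBUTED BY HASH(student_id);

-- Rows for one student are stored contiguously on disk, so this query
-- reads a single sequential block instead of scanning scattered rows.
SELECT course_id, score
FROM student_grades
WHERE student_id = 2023001;
```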
Example
Create a customer table with the following schema:
Distribution key: `customer_id` — distributes rows evenly and supports customer-based join queries
Partition key: `login_time` partitioned by day (`%Y%m%d`) — enables efficient expiration of old data
Lifecycle: 30 — retains data for the 30 most recent partitions
Primary key: `(login_time, customer_id, phone_num)` — starts with the partition key and distribution key
```sql
CREATE TABLE customer (
  customer_id bigint NOT NULL COMMENT 'Customer ID',
  customer_name varchar NOT NULL COMMENT 'Customer name',
  phone_num bigint NOT NULL COMMENT 'Phone number',
  city_name varchar NOT NULL COMMENT 'City',
  sex int NOT NULL COMMENT 'Gender',
  id_number varchar NOT NULL COMMENT 'ID card number',
  home_address varchar NOT NULL COMMENT 'Home address',
  office_address varchar NOT NULL COMMENT 'Office address',
  age int NOT NULL COMMENT 'Age',
  login_time timestamp NOT NULL COMMENT 'Logon time',
  PRIMARY KEY (login_time, customer_id, phone_num)
)
DISTRIBUTED BY HASH(customer_id)
PARTITION BY VALUE(DATE_FORMAT(login_time, '%Y%m%d')) LIFECYCLE 30
COMMENT 'Customer information table';
```
The primary key begins with login_time (partition key) and customer_id (distribution key), followed by phone_num to enforce row-level uniqueness.
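A query that benefits from all three design choices of this schema might look as follows; the date literal and customer ID are illustrative. The time-range filter on login_time prunes the scan to a single day's partition, and the equality filter on customer_id routes the query to a single shard.

```sql
SELECT customer_name, phone_num, login_time
FROM customer
WHERE login_time >= '2021-01-01 00:00:00'
  AND login_time <  '2021-01-02 00:00:00'
  AND customer_id = 12345;
```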
FAQ
How do I view all partitions of a table and their statistics?
Run the following query against information_schema.kepler_partitions:
```sql
SELECT
  partition_id,     -- Partition name
  row_count,        -- Total rows in the partition
  local_data_size,  -- Local storage used by the partition
  index_size,       -- Index size
  pk_size,          -- Primary key index size
  remote_data_size  -- Remote storage used by the partition
FROM information_schema.kepler_partitions
WHERE schema_name = '$DB'
  AND table_name = '$TABLE'
  AND partition_id > 0;
```
Partitions in incremental data for which compaction has not been triggered do not appear in this query. To get a real-time list of all partitions, run `SELECT DISTINCT $partition_column FROM $db.$table;`.
What determines the number of shards? Can I change it?
The number of shards is calculated automatically from the cluster specifications at creation time. It cannot be changed manually.
Does scaling the cluster up or down change the number of shards?
No. Cluster upgrades and downgrades do not affect the number of shards.
Can I change the distribution key or partition key after creating a table?
No. The distribution key and partition key cannot be modified on an existing table; to change them, you must recreate the table. For the table modifications that are supported, see ALTER TABLE.
What's next
CREATE TABLE — full syntax reference for table creation
ALTER TABLE — options for modifying an existing table
Distribution field reasonability diagnostics — diagnose distribution and partition field choices
Limits — partition and table limits