The choice of appropriate distribution keys is vital to improving the table query performance. This topic describes the distribution key choices.
Choose one or more columns where data is evenly distributed
If you choose one or more distribution columns where data is unevenly distributed, data skew may occur. In this case, some compute nodes may store more data and process more query requests than other compute nodes. However, based on the Cannikin Law, a query is not complete until all compute nodes finish data processing. As a result, the query is only as fast as the slowest compute node. Therefore, we recommend that you do not choose distribution columns with Boolean or date values.
Choose a distribution column that is frequently used in JOIN clauses
The following figures show three join methods. You can join two tables by using a collocated join. If the join key is also the distribution key, two tables can be joined within the associated compute nodes without data movement. If you do not choose a distribution column that is frequently used in JOIN clauses, you must perform a redistribute motion on the larger one of the two tables that you want to join and then perform a redistributed join. You can also perform a broadcast motion on the smaller one of the two tables and then perform a broadcast join. Both the redistribute and broadcast motions increase network overheads.
Choose a frequently used query condition as the distribution key
This enables AnalyticDB for PostgreSQL to filter compute nodes based on the distribution key. If you do not specify a distribution key, the primary key of the table is used as the distribution key. If the table does not have a primary key, the first column is used as the distribution key.
A distribution key can be defined from one or more columns. Example:
create table t1(c1 int, c2 int) distributed by (c1,c2);
Exercise caution when you choose random distribution
Random distribution does not support collocated joins or compute node filtering.