The selection of appropriate distribution keys is vital to improving the table query performance. This topic describes how to choose distribution keys.

Choose the columns in which data is evenly distributed

If the distribution columns that you choose have unevenly distributed data, data skew may occur, where some compute nodes contain much more data than the others. This leads to heightened loads on the compute nodes and prolonged response times. Therefore, we recommend that you do not choose distribution columns of the Boolean, time, or date type.

Choose a distribution column that is frequently used for joining tables

This way, collocated joins can be implemented, as shown in the following figures. If the join key is the same as the distribution key, the join can be completed within the relevant compute nodes without data movement. If the distribution key is different from the join key in a query, the system must perform a redistribute motion or a broadcast motion before it can join the tables. Both redistributed and broadcast joins cause heavier network overheads than a collocated join.

Figure 1. Collocated join
Collocated join
Figure 2. Redistributed join
Redistributed join
Figure 3. Broadcast join
Broadcast join

Choose the columns whose values are frequently used as query conditions

This allows AnalyticDB for PostgreSQL to filter compute nodes based on the distribution key before it sends query requests to them. If you do not specify a distribution key when you create a table, the primary key of the table is used as the distribution key. If the table does not have a primary key, the first column is used as the distribution key.

The distribution key can be set to more than one columns. Example:

create table t1(c1 int, c2 int) distributed by (c1,c2);

We recommend that you do not choose random distribution

Random distribution does not support collocated joins or compute node filtering.