AnalyticDB for PostgreSQL uses a massively parallel processing (MPP) architecture to distribute data across compute nodes in hash, random, or replicated mode. When queries are being executed, data must be redistributed or broadcast to implement data exchange. Data exchange makes up a large portion of the query execution costs and consumes a large amount of network resources. We recommend that you minimize or prevent data exchange.
To minimize data exchange, you must analyze the database query schema to find out which queries incur the highest costs. Then, you can design your database based on the analysis. Query costs are determined by the amount of time and computing resources required for query execution. Other factors that affect query costs include the query frequency and the query impact on other queries and database operations. You must find out which tables are referenced in the most costly queries, evaluate whether these tables assume important roles in query execution, and consider how these tables are joined and aggregated.
In star schema scenarios where a large table is used to join multiple small tables, we recommend that you use a column that is included in both the large table (fact table) and the largest small table (dimension table) as the distribution column. The largest dimension table is determined by the amount of data that participates in common join operations, but not the size of the table. If WHERE clauses are involved in table queries, only part of rows are joined. For other dimension tables, we recommend that you use their primary keys or foreign keys as partition keys based on their actual join conditions.
If the preceding method cannot be used to implement a collocated join between a dimension table and a fact table, we recommend that you use replicated distribution.
If a table does not have a suitable distribution column or participate in joins, we recommend that you use random distribution.