AnalyticDB for PostgreSQL distributes data among multiple compute nodes as a massively parallel processing (MPP) data warehousing service. This may cause data skew. This topic describes how to identify and eliminate data skew.

Typically, data skew may lead to slow query, out of memory (OOM), or full disk issues. Data skew is caused by the following reasons:

  • Data storage: Data is unevenly stored among multiple compute nodes.
  • Data redistribution: Data is unevenly distributed after a shuffle during queries.
  • Window function query: After data is redistributed by using PARTITION BY clauses and window functions, the data volume on specific compute nodes is significantly larger than that on other compute nodes.

Identify data skew

Data skew caused by data storage can be identified with ease from data distribution among all compute nodes. Data skew caused by data redistribution and window function queries can be identified only from monitoring results. This section describes how to identify data skew caused by database storage and table storage.

  • Execute the following statement to identify data skew caused by database storage:
    select gp_execution_dbid(), datname, pg_size_pretty(pg_database_size(datname)) 
    from gp_dist_random('pg_database') order by 2, 1, pg_database_size(datname) desc;
  • Execute the following statement to identify data skew caused by table storage:
    select gp_execution_dbid(),datname, pg_size_pretty(pg_total_relation_size('<Table name>'))
    from gp_dist_random('gp_id');

Eliminate data skew

  • Data skew caused by data storage

    Data skew caused by data storage occurs due to an inappropriate distribution key. For example, assume that a column in a table contains a large number of identical values. If this column is selected as the distribution key for hash distribution, the data volume on the compute nodes where these values reside is much larger than that on other compute nodes.

    To resolve this issue, we recommend that you select one or more columns where data is evenly distributed as the distribution key. You can also use random distribution.

  • Data skew caused by data redistribution

    Data skew caused by data redistribution occurs in two scenarios. It may occur when one or more columns where GROUP BY operations are performed are not selected as the distribution key. It may also occur when table joins incur a shuffle.

    For the first scenario, AnalyticDB for PostgreSQL uses the following multi-stage aggregation procedure to resolve the data skew issue:

    1. Aggregates data among compute nodes.
    2. Redistributes the aggregated data by using PARTITION BY clauses among compute nodes.
    3. Aggregates the redistributed data among compute nodes.
    4. Returns the results to the coordinator node.

    For the second scenario, you can use a redistribution method similar to the solution for data skew caused by data storage. We recommend that you select the join key as the distribution key to implement collocated joins.

  • Data skew caused by window function queries

    AnalyticDB for PostgreSQL does not provide optimization suggestions for data skew caused by window function queries.