This topic describes how to optimize grouping and aggregation queries in AnalyticDB for MySQL.

Grouping and aggregation process

AnalyticDB for MySQL is a distributed data warehouse. By default, it performs the following steps to execute a distributed aggregate query:

  1. Perform partial aggregation on data.

    Partial aggregate nodes use only a small amount of memory. The aggregation process is complete in a streaming manner, which prevents workloads of partial aggregate nodes from piling up.

  2. After partial aggregation is complete, redistribute data among nodes based on partial aggregation results obtained by grouping and then perform final aggregation.

    Partial aggregation results are transferred over networks to the nodes of a downstream stage. (For more information, see Execution plan elements.) The amount of data to be transferred over networks is small because the partial aggregation is performed on the data. This reduces the network pressure. After the data is redistributed, final aggregation is performed. On the final aggregate node, the values and aggregation state of a group must be maintained in the memory until all data is processed. This ensures that no new data needs to be processed for a specific group value. Therefore, the final aggregate node may occupy a large amount of memory.

For example, the following SQL statement for grouping and aggregation is executed:

SELECT sum(A), max(B) FROM tb1 GROUP BY C,D;

When the preceding statement is executed to perform grouping and aggregation, partial aggregation is first performed on data on the Node1 and Node2 nodes of the upstream stage. Partial aggregation results are partial sum(A), partial sum(B), C, and D. These partial aggregation results are transferred over networks to the Node 3 and Node4 nodes of the downstream stage for final aggregation, as shown in the following figure.

Two-step aggregation process

Use hints to optimize grouping and aggregation

  • Scenarios

    In most scenarios, two-step aggregation can strike a good balance between memory and network resources. However, in special scenarios, two-step aggregation may not be the best choice. For example, large numbers of groups must be processed by using grouping and aggregation because the GROUP BY column has a large number of unique values.

    Assume a scenario that requires mobile numbers or user IDs for grouping. If you use the two-step aggregation method, partial aggregation is performed although only a small amount of data can be aggregated. Moreover, the partial aggregation step involves multiple operations, such as calculating hash values of groups, deduplication, and executing aggregation functions. The amount of data to be transferred over networks is not reduced in the partial aggregation step due to large numbers of groups. However, large amounts of computing resources are consumed.

  • Solution

    To solve the preceding problem of a low aggregation rate, you can add the /*aggregation_path_type=single_agg*/ hint to skip partial aggregation and directly perform final aggregation when you execute a query. This reduces unnecessary computing overheads.

    Note If the /*aggregation_path_type=single_agg*/ hint is used in an SQL statement, all grouping and aggregation queries in the SQL statement use the specified optimization process. Therefore, the best method is to first analyze the characteristics of aggregation operators in the original execution plan, evaluate the benefits brought by the hint, and then decide whether to use this optimization scheme.
  • Optimization description

    If the aggregation rate is low, the amount of data to be transferred over networks is not reduced on the Node1 and Node2 nodes during partial aggregation and consumes large amounts of computing resources.

    After optimization, partial aggregation is not performed on the Node1 and Node2 nodes. All data (A, B, C, and D) is directly aggregated by the Node3 and Node4 nodes of the downstream stage, which reduces the amount of required computing resources, as shown in the following figure.

    Large numbers of groups in a grouping and aggregation query
    Note This optimization may not reduce memory usage. If the aggregation rate is low, large amounts of data are accumulated in memory for deduplication and aggregation to ensure that all data for a specific group value is processed.