Boost AP Query Performance by Removing Redundant GROUP BY and ORDER BY - PolarDB

Redundant GROUP BY or ORDER BY clauses force the database to perform unnecessary sorting or hashing operations, consuming significant CPU and memory resources. This overhead is especially costly for analytical processing (AP) queries on large datasets. PolarDB for MySQL automatically detects and removes these redundant operations using functional dependency (FD) analysis — delivering up to 29% faster query performance on TPC-H benchmarks without any changes to your application code.

Scope

Product series: Cluster Edition, Standard Edition
Version: MySQL 8.0.2, revision 8.0.2.2.33 or later

Enable the feature

Two parameters control this optimization:

Parameter	Level	Default	Valid values
`loose_groupby_elimination_mode`	Global/Session	`REPLICA_ON`	`REPLICA_ON`, `ON`, `OFF`
`loose_orderby_elimination_mode`	Global/Session	`REPLICA_ON`	`REPLICA_ON`, `ON`, `OFF`

Valid values:

REPLICA_ON (default): Enables the feature on read-only (RO) nodes only.
ON: Enables the feature on all nodes.
OFF: Disabled.

Parameter naming by interface:

PolarDB console: Parameters include the loose_ prefix for compatibility with MySQL configuration files. Find and modify loose_groupby_elimination_mode and loose_orderby_elimination_mode.
Database session (using the SET command): Remove the loose_ prefix and use the original parameter name — groupby_elimination_mode and orderby_elimination_mode.

How it works

The optimizer eliminates redundant GROUP BY and ORDER BY clauses by inferring functional dependencies (FDs) from table metadata and query conditions.

What is a functional dependency?

If column A uniquely determines column B, B is functionally dependent on A — written as A -> B. For example, in a user table where user_id is the primary key, user_id -> nickname holds because each user_id maps to exactly one nickname.

Sources of functional dependency:

Source	Example
Primary keys and unique keys	`user_id` (primary key) determines all other columns in the row
Equi-join conditions	`t1.pk = t2.fk` propagates the uniqueness of `t1.pk` into `t2`
Constant filter conditions	`WHERE status = 'active'` makes `status` a constant in the result

Optimization rules:

Rule	Condition	Result
`GROUP BY` elimination	`GROUP BY` columns include a primary key or uniquely identify each row	Entire `GROUP BY` clause removed
`GROUP BY` elimination (constants)	All `GROUP BY` columns are constants	`GROUP BY` removed and `LIMIT 1` added
`ORDER BY` elimination	All `ORDER BY` columns are constants	Entire `ORDER BY` clause removed
`GROUP BY`/`ORDER BY` simplification	FD exists between columns in the clause (e.g., `GROUP BY x, y` where `x -> y`)	Clause simplified (e.g., to `GROUP BY x`)

Optimization scenarios

Scenario 1: Grouping by primary or unique key

Trigger conditions:

The GROUP BY clause includes a table's primary key.
Other columns in the GROUP BY list are functionally determined by that key.

Business scenario: Calculate the total order count per user and display their username.

Original SQL:

-- user_id is the primary key of the user table and uniquely determines user_name.
SELECT
    u.user_id,
    u.user_name,
    COUNT(o.order_id)
FROM user u
JOIN orders o ON u.user_id = o.user_id
GROUP BY
    u.user_id,
    u.user_name;

What the optimizer does: Because user_id is the primary key, the FD user_id -> user_name holds. The optimizer simplifies GROUP BY u.user_id, u.user_name to GROUP BY u.user_id, removing the redundant grouping on user_name.

Optimized SQL:

SELECT
    u.user_id,
    u.user_name,
    COUNT(o.order_id)
FROM user u
JOIN orders o ON u.user_id = o.user_id
GROUP BY u.user_id;

Verify the optimization:

SHOW WARNINGS;

The Message column shows the rewritten query with GROUP BY u.user_id only, confirming that u.user_name has been removed from the grouping key.

Scenario 2: Grouping by constants

Trigger conditions:

All columns in the GROUP BY clause appear as equality constants in the WHERE clause.

Business scenario: Query a specific row by exact column values.

Original SQL:

SELECT a, b FROM t1 WHERE a = 1 AND b = 1 GROUP BY a, b;

What the optimizer does: Both a and b are constants in the WHERE clause, so the result set contains at most one row. The optimizer removes the GROUP BY and adds LIMIT 1.

Optimized SQL:

SELECT a, b FROM t1 WHERE a = 1 AND b = 1 LIMIT 1;

Verify the optimization:

SHOW WARNINGS;

The Message column shows the rewritten query with LIMIT 1 and no GROUP BY clause.

Scenario 3: Multi-table join with complex grouping (TPC-H Q10)

Trigger conditions:

A multi-table join propagates a primary key's FD through equi-join conditions.
All non-key columns in the GROUP BY list are transitively determined by the primary key.

Original SQL (TPC-H Q10):

SELECT
    c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment,
    SUM(l_extendedprice * (1 - l_discount)) AS revenue
FROM customer, orders, lineitem, nation
WHERE c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND c_nationkey = n_nationkey
  -- Other filter conditions...
GROUP BY
    c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment
ORDER BY revenue DESC
LIMIT 20;

What the optimizer does:

Primary key dependency: c_custkey is the primary key of the customer table, so it determines c_name, c_acctbal, c_phone, c_address, c_comment, and c_nationkey.
Dependency propagation: The join condition c_nationkey = n_nationkey and the fact that n_nationkey is the primary key of nation means c_custkey also determines n_name.
Simplification: All non-key columns in the GROUP BY list are functionally dependent on c_custkey. The optimizer simplifies the clause to GROUP BY c_custkey.

Optimized SQL:

SELECT
    c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment,
    SUM(l_extendedprice * (1 - l_discount)) AS revenue
FROM customer, orders, lineitem, nation
WHERE c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND c_nationkey = n_nationkey
  -- Other filter conditions...
GROUP BY c_custkey
ORDER BY revenue DESC
LIMIT 20;

Verify the optimization:

SHOW WARNINGS;

The Message column shows the rewritten query with GROUP BY c_custkey only, confirming that all other columns have been removed from the grouping key.

Scenario 4: ORDER BY simplification with a computed column

Trigger conditions:

The ORDER BY clause includes a column that is a deterministic function of other columns in the same clause.
That column is defined as UNIQUE NOT NULL.

Table schema:

CREATE TABLE t1 (
  a INT,
  b INT,
  c INT AS (a + b) UNIQUE NOT NULL
);

Original SQL:

EXPLAIN SELECT a, b FROM t1 ORDER BY a, b, c;

What the optimizer does: Column c is computed from a and b (c = a + b) and defined as UNIQUE NOT NULL, establishing the FD (a, b) -> c. The optimizer simplifies ORDER BY a, b, c to ORDER BY a, b.

Verify the optimization:

SHOW WARNINGS;

+-------+------+-------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message                                                                                                                       |
+-------+------+-------------------------------------------------------------------------------------------------------------------------------+
| Note  | 1003 | /* select#1 */ select `test`.`t1`.`a` AS `a`,`test`.`t1`.`b` AS `b` from `test`.`t1` order by `test`.`t1`.`a`,`test`.`t1`.`b` |
+-------+------+-------------------------------------------------------------------------------------------------------------------------------+

The rewritten query in the Message column confirms that c has been removed from the sort key.

Performance benchmark

Tests on a TPC-H 100 GB standard dataset show that enabling GROUP BY elimination significantly improves Q10 query performance.

Test scenario	With elimination	Without elimination	Improvement
In-Memory Column Index (IMCI) at 1 DOP (within 1 day of data generation)	48 seconds	68 seconds	29%
In-Memory Column Index (IMCI) at 32 DOP (within 32 days of data generation)	1.9 seconds	2.6 seconds	27%

The TPC-H implementation described here is based on TPC-H benchmarking methodology and cannot be compared with published TPC-H benchmark results. These tests do not fully comply with all TPC-H requirements. Actual performance improvements may vary based on query complexity, data distribution, and cluster specifications.