Optimize Subquery Performance by Pushing WHERE Clauses into IN Subqueries - PolarDB

When a query combines an outer WHERE condition with an IN subquery that uses GROUP BY, PolarDB for MySQL can push the outer condition directly into the subquery before materialization. This reduces the number of rows materialized and improves query performance.

Prerequisites

Before you begin, ensure that your cluster runs one of the following engine versions. To check the version, see Query the engine version.

MySQL 8.0.1, revision version 8.0.1.1.42 or later
MySQL 8.0.2, revision version 8.0.2.19 or later

How it works

When a query matches the following pattern, the optimizer pushes the outer WHERE condition into the subquery before materialization:

SELECT ... FROM t
WHERE <condition on t.col>
  AND (t.col, ...) IN (SELECT col, ... FROM t2 GROUP BY ...);

Because every row in the result set must satisfy t.col = subquery_col, the condition on t.col implies an equivalent constraint on subquery_col. The optimizer rewrites the subquery to apply this constraint earlier, so fewer rows are materialized and scanned.

Example of the rewrite: The query

SELECT * FROM t WHERE a < 5 AND (a, b) IN (SELECT c, MAX(d) FROM t2 GROUP BY c);

is internally rewritten to:

SELECT * FROM t WHERE a < 5 AND (a, b) IN (SELECT c, MAX(d) FROM t2 WHERE c < 5 GROUP BY c);

Where the pushed condition lands depends on whether the correlated column appears in GROUP BY:

Column position	Pushed to
In `GROUP BY`	Subquery `WHERE` clause
Not in `GROUP BY` (for example, `MAX(col)`)	Subquery `HAVING` clause

Limitations

The following cases do not support condition pushdown:

LIMIT clause in the subquery — the subquery result is already bounded, so applying the filter after LIMIT would change query semantics.
```
SELECT * FROM t WHERE a > 1 AND (a, b) IN (SELECT c, MAX(d) FROM t2 GROUP BY c LIMIT 2);
```
Non-deterministic outer condition or subquery reference — the outer condition column references a subquery or is non-deterministic (for example, RAND()), potentially yielding different results under the same input conditions.
```
SELECT * FROM t WHERE a > 5*RAND() AND (a, b) IN (SELECT c, MAX(d) FROM t2 GROUP BY c);
```
Stored function in the outer condition — complex stored functions might not benefit from repeated executions and are therefore not considered for pushdown.
```
CREATE FUNCTION f1() RETURNS INT
BEGIN
...
END;

SELECT * FROM t WHERE a > f1() AND (a, b) IN (SELECT c, MAX(d) FROM t2 GROUP BY c);
```

Enable condition pushdown

Configure the parameter

Set the loose_subquery_cond_pushdown_mode parameter at the Global level. For instructions, see Configure cluster and node parameters.

Parameter	Level	Description
`loose_subquery_cond_pushdown_mode`	Global	Controls condition pushdown from `WHERE` clauses to `IN` subqueries. Valid values: `OFF` (default, feature disabled), `ON` (enables condition pushdown), `REPLICA_ON` (enabled on read-only nodes only).

Use optimizer hints

To control pushdown for individual queries without changing loose_subquery_cond_pushdown_mode, use optimizer hints with QB_NAME to name the target query block:

SUBQUERY_CONDITION_PUSHDOWN(@<qb_name>) — forces pushdown for the named query block.
NO_SUBQUERY_CONDITION_PUSHDOWN(@<qb_name>) — suppresses pushdown for the named query block.

CREATE TABLE t1 (a INT, b INT, c INT, d INT);
CREATE TABLE t2 (e INT, f INT, g INT);

-- Suppress pushdown for the subquery named subq1
SELECT /*+ NO_SUBQUERY_CONDITION_PUSHDOWN(@subq1) */ * FROM t1
WHERE t1.c < 25
  AND (t1.a, t1.c) IN (SELECT /*+ QB_NAME(subq1) */ t2.e, MAX(t2.g) FROM t2 WHERE t2.e < 5 GROUP BY t2.e);

-- Force pushdown for the subquery named subq1
SELECT /*+ SUBQUERY_CONDITION_PUSHDOWN(@subq1) */ * FROM t1
WHERE t1.c < 25
  AND (t1.a, t1.c) IN (SELECT /*+ QB_NAME(subq1) */ t2.e, MAX(t2.g) FROM t2 WHERE t2.e < 5 GROUP BY t2.e);

Examples

The examples below use the following tables:

CREATE TABLE t1 (a INT, b INT, c INT, d INT);
CREATE TABLE t2 (e INT, f INT, g INT);

Example 1: Condition pushed to HAVING

Run EXPLAIN FORMAT=TREE to compare execution plans before and after enabling condition pushdown:

EXPLAIN FORMAT=TREE SELECT * FROM t1
WHERE t1.c < 25
  AND (t1.a, t1.c) IN (SELECT t2.e, MAX(t2.g) FROM t2 WHERE t2.e < 5 GROUP BY t2.e);

Output:

-- Before enabling loose_subquery_cond_pushdown_mode
-> Filter: ((t1.c < 25) and <in_optimizer>((t1.a,t1.c),(t1.a,t1.c) in (select #2)))  (cost=*** rows=***)
    -> Table scan on t1  (cost=*** rows=***)
    -> Select #2 (subquery in condition; run only once)
        -> Filter: ((t1.a = `<materialized_subquery>`.e) and (t1.c = `<materialized_subquery>`.`MAX(t2.g)`))
            -> Limit: 1 row(s)
                -> Index lookup on <materialized_subquery> using <auto_distinct_key> (e=t1.a, MAX(t2.g)=t1.c)
                    -> Materialize with deduplication
                        -> Table scan on <temporary>
                            -> Aggregate using temporary table
                                -> Filter: (t2.e < 5)  (cost=*** rows=***)
                                    -> Table scan on t2  (cost=*** rows=***)

-- After enabling loose_subquery_cond_pushdown_mode
-> Filter: ((t1.c < 25) and <in_optimizer>((t1.a,t1.c),(t1.a,t1.c) in (select #2)))  (cost=*** rows=***)
    -> Table scan on t1  (cost=*** rows=***)
    -> Select #2 (subquery in condition; run only once)
        -> Filter: ((t1.a = `<materialized_subquery>`.e) and (t1.c = `<materialized_subquery>`.`MAX(t2.g)`))
            -> Limit: 1 row(s)
                -> Index lookup on <materialized_subquery> using <auto_distinct_key> (e=t1.a, MAX(t2.g)=t1.c)
                    -> Materialize with deduplication
                        -> Filter: (max(t2.g) < 25)
                            -> Table scan on <temporary>
                                -> Aggregate using temporary table
                                    -> Filter: (t2.e < 5)  (cost=*** rows=***)
                                        -> Table scan on t2  (cost=*** rows=***)

Note

t1.c is equivalent to MAX(t2.g) in the subquery, so the outer condition t1.c < 25 implies MAX(t2.g) < 25. Because g is not in GROUP BY, the optimizer pushes the condition as a HAVING filter — visible as Filter: (max(t2.g) < 25) inside the materialization step.

Example 2: Condition pushed to WHERE

EXPLAIN FORMAT=TREE SELECT * FROM t1
WHERE ((t1.a < 2 OR t1.a = 5) AND t1.b > 3)
  AND (t1.a, t1.b, t1.c) IN (
    SELECT t2.e, t2.f, MAX(t2.g)
    FROM t2
    WHERE t2.e < 5
    GROUP BY t2.e, t2.f
  );

Output:

-- Before enabling loose_subquery_cond_pushdown_mode
-> Filter: ((t1.b > 3) and ((t1.a < 2) or (t1.a = 5)) and <in_optimizer>((t1.a,t1.b,t1.c),(t1.a,t1.b,t1.c) in (select #2)))  (cost=*** rows=***)
    -> Table scan on t1  (cost=*** rows=***)
    -> Select #2 (subquery in condition; run only once)
        -> Filter: ((t1.a = `<materialized_subquery>`.e) and (t1.b = `<materialized_subquery>`.f) and (t1.c = `<materialized_subquery>`.`MAX(t2.g)`))
            -> Limit: 1 row(s)
                -> Index lookup on <materialized_subquery> using <auto_distinct_key> (e=t1.a, f=t1.b, MAX(t2.g)=t1.c)
                    -> Materialize with deduplication
                        -> Table scan on <temporary>
                            -> Aggregate using temporary table
                                -> Filter: (t2.e < 5)  (cost=*** rows=***)
                                    -> Table scan on t2  (cost=*** rows=***)

-- After enabling loose_subquery_cond_pushdown_mode
-> Filter: ((t1.b > 3) and ((t1.a < 2) or (t1.a = 5)) and <in_optimizer>((t1.a,t1.b,t1.c),(t1.a,t1.b,t1.c) in (select #2)))  (cost=*** rows=***)
    -> Table scan on t1  (cost=*** rows=***)
    -> Select #2 (subquery in condition; run only once)
        -> Filter: ((t1.a = `<materialized_subquery>`.e) and (t1.b = `<materialized_subquery>`.f) and (t1.c = `<materialized_subquery>`.`MAX(t2.g)`))
            -> Limit: 1 row(s)
                -> Index lookup on <materialized_subquery> using <auto_distinct_key> (e=t1.a, f=t1.b, MAX(t2.g)=t1.c)
                    -> Materialize with deduplication
                        -> Table scan on <temporary>
                            -> Aggregate using temporary table
                                -> Filter: ((t2.e < 5) and (t2.f > 3) and ((t2.e < 2) or (t2.e = 5)))  (cost=*** rows=***)
                                    -> Table scan on t2  (cost=*** rows=***)

Note

t1.a and t1.b map to t2.e and t2.f, both of which are in GROUP BY. The optimizer pushes the compound condition ((t1.a < 2 OR t1.a = 5) AND t1.b > 3) directly to the subquery's WHERE clause as ((t2.e < 2 OR t2.e = 5) AND t2.f > 3).

Key rule: Whether a pushed condition becomes a WHERE filter or a HAVING filter depends on where the correlated column appears. Columns in GROUP BY produce a WHERE filter (Example 2); columns outside GROUP BY, such as aggregate results, produce a HAVING filter (Example 1).