Correlated subquery pull-up - PolarDB - Alibaba Cloud Documentation Center

This topic describes the background and usage of the correlated subquery pull-up feature.

Applicability

This feature is supported in PolarDB for PostgreSQL clusters that run PostgreSQL 14 and have a minor engine version of 2.0.14.8.11.0 or later.

Note

You can view the minor engine version in the console or by running the SHOW polardb_version; statement. If your cluster does not meet the version requirement, you can upgrade the minor engine version.

Background information

The PostgreSQL optimizer uses SubLink to represent a subquery and its associated operators in an expression. The following types of SubLink are available:

EXISTS_SUBLINK: implements an EXISTS (SELECT ...) subquery.
ALL_SUBLINK: implements an ALL (SELECT ...) subquery.
ANY_SUBLINK: implements an ANY (SELECT ...) subquery or an IN (SELECT ...) subquery.

The optimizer usually attempts to pull up correlated subqueries that use the ANY, IN, EXISTS, or NOT EXISTS operator. This allows the subquery and its outer query to be optimized together into an execution plan that uses a semi-join or an anti-join, which improves query performance. For an ANY_SUBLINK, if the subquery references variables from the outer query, the subquery is not pulled up. This misses the opportunity for joint optimization with the outer query. The subquery can only be optimized independently, which greatly increases the SQL running time.

PolarDB for PostgreSQL and let you use a parameter to control the pull-up of correlated ANY_SUBLINK subqueries. For correlated subqueries that use IN or ANY, the subquery can be pulled up even if it references variables from the outer query. This expands the optimizer's search space and helps generate a better execution plan.

Usage

The polar_enable_pullup_with_lateral parameter enables or disables the pull-up of correlated ANY_SUBLINK subqueries. Valid values are:

ON (default): Enables the pull-up of correlated ANY_SUBLINK subqueries.
OFF: Disables the pull-up of correlated ANY_SUBLINK subqueries.

Example

Prepare data.

CREATE TABLE t1 (a INT, b INT);
INSERT INTO t1 SELECT i, 1 FROM generate_series(1, 100000) i;
CREATE TABLE t2 AS SELECT * FROM t1;

View the execution plan and running time after the feature is disabled.

=> SET polar_enable_pullup_with_lateral TO OFF;

=> EXPLAIN (COSTS OFF, ANALYZE)
   SELECT * FROM t1
   WHERE t1.a IN (SELECT a FROM t2 WHERE t2.b = t1.b AND t2.b = 1);
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Seq Scan on t1 (actual time=67.631..1641827.119 rows=100000 loops=1)
   Filter: (SubPlan 1)
   SubPlan 1
     ->  Result (actual time=0.005..13.124 rows=50000 loops=100000)
           One-Time Filter: (t1.b = 1)
           ->  Seq Scan on t2 (actual time=0.005..7.718 rows=50000 loops=100000)
                 Filter: (b = 1)
 Planning Time: 0.145 ms
 Execution Time: 1641847.702 ms
(9 rows)

View the execution plan and running time after the feature is enabled.

=> SET polar_enable_pullup_with_lateral TO ON;

=> EXPLAIN (COSTS OFF, ANALYZE)
   SELECT * FROM t1
   WHERE t1.a IN (SELECT a FROM t2 WHERE t2.b = t1.b AND t2.b = 1);
                                 QUERY PLAN
----------------------------------------------------------------------------
 Hash Semi Join (actual time=64.783..173.482 rows=100000 loops=1)
   Hash Cond: (t1.a = t2.a)
   ->  Seq Scan on t1 (actual time=0.016..25.440 rows=100000 loops=1)
         Filter: (b = 1)
   ->  Hash (actual time=64.550..64.551 rows=100000 loops=1)
         Buckets: 131072  Batches: 2  Memory Usage: 2976kB
         ->  Seq Scan on t2 (actual time=0.010..30.330 rows=100000 loops=1)
               Filter: (b = 1)
 Planning Time: 0.195 ms
 Execution Time: 178.050 ms
(10 rows)

The example shows that after the subquery is pulled up, the optimizer combines the subquery and the outer query into a semi-join. The filter conditions in the subquery can then significantly reduce the result set of the outer query. This greatly reduces the running time. If the subquery is not pulled up, it cannot filter the rows returned by the outer query.