This topic describes the background and usage of the correlated subquery pull-up feature.
Applicability
This feature is supported in PolarDB for PostgreSQL clusters that run PostgreSQL 14 and have a minor engine version of 2.0.14.8.11.0 or later.
You can view the minor engine version in the console or by running the SHOW polardb_version; statement. If your cluster does not meet the version requirement, you can upgrade the minor engine version.
Background information
The PostgreSQL optimizer uses SubLink to represent a subquery and its associated operators in an expression. The following types of SubLink are available:
EXISTS_SUBLINK: implements anEXISTS (SELECT ...)subquery.ALL_SUBLINK: implements anALL (SELECT ...)subquery.ANY_SUBLINK: implements anANY (SELECT ...)subquery or anIN (SELECT ...)subquery.
The optimizer usually attempts to pull up correlated subqueries that use the ANY, IN, EXISTS, or NOT EXISTS operator. This allows the subquery and its outer query to be optimized together into an execution plan that uses a semi-join or an anti-join, which improves query performance. For an ANY_SUBLINK, if the subquery references variables from the outer query, the subquery is not pulled up. This misses the opportunity for joint optimization with the outer query. The subquery can only be optimized independently, which greatly increases the SQL running time.
PolarDB for PostgreSQL and let you use a parameter to control the pull-up of correlated ANY_SUBLINK subqueries. For correlated subqueries that use IN or ANY, the subquery can be pulled up even if it references variables from the outer query. This expands the optimizer's search space and helps generate a better execution plan.
Usage
The polar_enable_pullup_with_lateral parameter enables or disables the pull-up of correlated ANY_SUBLINK subqueries. Valid values are:
ON (default): Enables the pull-up of correlated
ANY_SUBLINKsubqueries.OFF: Disables the pull-up of correlated
ANY_SUBLINKsubqueries.
Example
Prepare data.
CREATE TABLE t1 (a INT, b INT);
INSERT INTO t1 SELECT i, 1 FROM generate_series(1, 100000) i;
CREATE TABLE t2 AS SELECT * FROM t1;View the execution plan and running time after the feature is disabled.
=> SET polar_enable_pullup_with_lateral TO OFF;
=> EXPLAIN (COSTS OFF, ANALYZE)
SELECT * FROM t1
WHERE t1.a IN (SELECT a FROM t2 WHERE t2.b = t1.b AND t2.b = 1);
QUERY PLAN
---------------------------------------------------------------------------------
Seq Scan on t1 (actual time=67.631..1641827.119 rows=100000 loops=1)
Filter: (SubPlan 1)
SubPlan 1
-> Result (actual time=0.005..13.124 rows=50000 loops=100000)
One-Time Filter: (t1.b = 1)
-> Seq Scan on t2 (actual time=0.005..7.718 rows=50000 loops=100000)
Filter: (b = 1)
Planning Time: 0.145 ms
Execution Time: 1641847.702 ms
(9 rows)View the execution plan and running time after the feature is enabled.
=> SET polar_enable_pullup_with_lateral TO ON;
=> EXPLAIN (COSTS OFF, ANALYZE)
SELECT * FROM t1
WHERE t1.a IN (SELECT a FROM t2 WHERE t2.b = t1.b AND t2.b = 1);
QUERY PLAN
----------------------------------------------------------------------------
Hash Semi Join (actual time=64.783..173.482 rows=100000 loops=1)
Hash Cond: (t1.a = t2.a)
-> Seq Scan on t1 (actual time=0.016..25.440 rows=100000 loops=1)
Filter: (b = 1)
-> Hash (actual time=64.550..64.551 rows=100000 loops=1)
Buckets: 131072 Batches: 2 Memory Usage: 2976kB
-> Seq Scan on t2 (actual time=0.010..30.330 rows=100000 loops=1)
Filter: (b = 1)
Planning Time: 0.195 ms
Execution Time: 178.050 ms
(10 rows)The example shows that after the subquery is pulled up, the optimizer combines the subquery and the outer query into a semi-join. The filter conditions in the subquery can then significantly reduce the result set of the outer query. This greatly reduces the running time. If the subquery is not pulled up, it cannot filter the rows returned by the outer query.