通過子串連下推提升IN/ANY子句的SQL執行效率-雲原生資料庫 PolarDB-阿里雲

PolarDB PostgreSQL版支援子串連下推的查詢改寫功能，能夠提升含有IN/ANY子句的SQL的執行效率。

背景

在PostgreSQL中，ANY類型的子串連（對應IN和ANY子句）會被嘗試上拉為半串連（semi join）。然而，如果與之串連的表是一個無法上拉的子查詢，PostgreSQL不支援為其產生參數化路徑，只能作為一個整體獨立執行，當子查詢中的資料量較大時，可能導致整個SQL執行效率顯著降低。

例如，以下SQL由於子查詢中存在GROUP BY語句導致無法上拉，其執行時間主要在t_big表的掃描和排序上，並且隨著t_big表規模的增大，執行時間亦相應延長。

EXPLAIN ANALYZE SELECT * FROM (SELECT a, sum(b) b FROM t_big GROUP BY a)v WHERE a IN (SELECT a FROM t_small);
                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Semi Join  (cost=0.55..59523.15 rows=10000 width=12) (actual time=0.064..1237.621 rows=2 loops=1)
   Merge Cond: (t_big.a = t_small.a)
   ->  GroupAggregate  (cost=0.42..46910.99 rows=1000000 width=12) (actual time=0.033..1113.615 rows=1000000 loops=1)
         Group Key: t_big.a
         ->  Index Scan using t_big_a_idx on t_big  (cost=0.42..31910.99 rows=1000000 width=8) (actual time=0.024..420.575 rows=1000000 loops=1)
   ->  Index Only Scan using t_small_a_idx on t_small  (cost=0.13..12.16 rows=2 width=4) (actual time=0.028..0.030 rows=2 loops=1)
         Heap Fetches: 2
 Planning Time: 0.256 ms
 Execution Time: 1237.700 ms
(9 rows)

如果能夠將ANY類型的子串連下推至子查詢中，則可以利用子查詢中的索引，從而提升執行效率。

EXPLAIN ANALYZE SELECT * FROM (SELECT a, sum(b) b FROM t_big WHERE a IN (SELECT a FROM t_small) GROUP BY a)v;
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=17.96..17.99 rows=2 width=12) (actual time=0.061..0.064 rows=2 loops=1)
   Group Key: t_big.a
   ->  Sort  (cost=17.96..17.96 rows=2 width=8) (actual time=0.054..0.056 rows=2 loops=1)
         Sort Key: t_big.a
         Sort Method: quicksort  Memory: 25kB
         ->  Nested Loop  (cost=1.46..17.95 rows=2 width=8) (actual time=0.031..0.045 rows=2 loops=1)
               ->  Unique  (cost=1.03..1.04 rows=2 width=4) (actual time=0.014..0.017 rows=2 loops=1)
                     ->  Sort  (cost=1.03..1.03 rows=2 width=4) (actual time=0.013..0.014 rows=2 loops=1)
                           Sort Key: t_small.a
                           Sort Method: quicksort  Memory: 25kB
                           ->  Seq Scan on t_small  (cost=0.00..1.02 rows=2 width=4) (actual time=0.005..0.006 rows=2 loops=1)
               ->  Index Scan using t_big_a_idx on t_big  (cost=0.42..8.44 rows=1 width=8) (actual time=0.010..0.011 rows=1 loops=2)
                     Index Cond: (a = t_small.a)
 Planning Time: 0.527 ms
 Execution Time: 0.143 ms
(15 rows)

前提條件

支援的PolarDB PostgreSQL版的版本如下：

PostgreSQL 14（核心小版本2.0.14.13.28.0及以上）
PostgreSQL 11（核心小版本2.0.11.15.44.0及以上）

說明

您可在控制台查看核心小版本號碼，也可以通過SHOW polardb_version;語句查看。如未滿足核心小版本要求，請升級核心小版本。

應用情境

子串連下推功能適用於IN/ANY子句引用的包含GROUP BY語句的子查詢，並且該子查詢涉及一個大表。將IN/ANY子句下推至子查詢中，能夠利用大表的索引，從而減少對大表資料的訪問。

功能限制

子串連下推功能存在以下限制：

IN/ANY子句必須引用一個有GROUP BY語句的子查詢，否則當前原生PostgreSQL將直接產生參數化路徑，不使用子串連下推功能。
IN/ANY子句的列必須包含在GROUP BY語句的列中，否則下推後的SQL不等價。
當前查詢塊不能存在外串連，否則下推後的SQL不等價。
目前只支援單列的情境。例如：a in (select a from t)或者a = any(select a from t)。
目前只支援SELECT和CREATE TABLE AS語句。

使用說明

子串連下推功能受參數控制，相關的參數名稱及其作用如下所示：

參數名稱

描述

polar_cbqt_pushdown_sublink

用於控制子串連下推功能的開關，取值如下：

OFF（預設）：關閉子串連下推功能。
ON：開啟子串連下推功能，由基於代價的查詢變換（CBQT）架構根據代價決定是否應用下推。
FORCE：強制開啟子串連下推功能，不受CBQT架構控制。該參數值一般用於 HINT，不建議全域設定。

樣本

資料準備

CREATE TABLE t_small(a int);
CREATE TABLE t_big(a int, b int, c int);

CREATE INDEX ON t_big(a);

INSERT INTO t_big SELECT i, i, i FROM generate_series(1, 1000000)i;
INSERT INTO t_small VALUES(1), (1000000);

ANALYZE t_small, t_big;

原始查詢

在原始查詢計劃中，t_big.a = t_small.a串連條件無法作為參數化路徑進行下推，導致t_big表需要進行全表掃描，執行效率很低。

EXPLAIN ANALYZE SELECT * FROM (SELECT a, sum(b) b FROM t_big GROUP BY a)v WHERE a IN (SELECT a FROM t_small);

返回結果如下：

                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Semi Join  (cost=1.46..59510.27 rows=10000 width=12) (actual time=0.049..1239.128 rows=2 loops=1)
   Merge Cond: (t_big.a = t_small.a)
   ->  GroupAggregate  (cost=0.42..46909.23 rows=1000000 width=12) (actual time=0.034..1113.324 rows=1000000 loops=1)
         Group Key: t_big.a
         ->  Index Scan using t_big_a_idx on t_big  (cost=0.42..31909.23 rows=1000000 width=8) (actual time=0.025..412.650 rows=1000000 loops=1)
   ->  Sort  (cost=1.03..1.03 rows=2 width=4) (actual time=0.012..0.013 rows=2 loops=1)
         Sort Key: t_small.a
         Sort Method: quicksort  Memory: 25kB
         ->  Seq Scan on t_small  (cost=0.00..1.02 rows=2 width=4) (actual time=0.005..0.006 rows=2 loops=1)
 Planning Time: 0.219 ms
 Execution Time: 1239.208 ms
(11 rows)

通過CBQT開啟子串連下推

開啟CBQT和子串連下推功能後，將a in (select a from t_small)子句下推到子查詢中，可以利用串連條件為t_big表產生參數化路徑，掃描的資料大大減少，執行時間明顯縮短。

說明

原計劃的代價需要大於參數polar_cbqt_cost_threshold，子串連下推查詢改寫才會應用。

-- 開啟CBQT功能
SET polar_enable_cbqt to on;

-- 開啟子串連下推功能
SET polar_cbqt_pushdown_sublink to on;

EXPLAIN ANALYZE SELECT * FROM (SELECT a, sum(b) b FROM t_big GROUP BY a)v WHERE a IN (SELECT a FROM t_small);

返回結果如下：

                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=17.96..17.99 rows=2 width=12) (actual time=0.056..0.059 rows=2 loops=1)
   Group Key: t_big.a
   ->  Sort  (cost=17.96..17.96 rows=2 width=8) (actual time=0.051..0.052 rows=2 loops=1)
         Sort Key: t_big.a
         Sort Method: quicksort  Memory: 25kB
         ->  Nested Loop  (cost=1.46..17.95 rows=2 width=8) (actual time=0.032..0.045 rows=2 loops=1)
               ->  Unique  (cost=1.03..1.04 rows=2 width=4) (actual time=0.014..0.018 rows=2 loops=1)
                     ->  Sort  (cost=1.03..1.03 rows=2 width=4) (actual time=0.014..0.015 rows=2 loops=1)
                           Sort Key: t_small.a
                           Sort Method: quicksort  Memory: 25kB
                           ->  Seq Scan on t_small  (cost=0.00..1.02 rows=2 width=4) (actual time=0.007..0.008 rows=2 loops=1)
               ->  Index Scan using t_big_a_idx on t_big  (cost=0.42..8.44 rows=1 width=8) (actual time=0.010..0.010 rows=1 loops=2)
                     Index Cond: (a = t_small.a)
 Planning Time: 0.518 ms
 Execution Time: 0.141 ms
(15 rows)

通過HINT開啟子串連下推

通過HINT的方式，在SQL層級強制開啟子串連下推功能，同樣將a in (select a from t_small)子句下推到子查詢中，可以利用串連條件為t_big表產生參數化路徑，掃描的資料大大減少，執行時間明顯縮短。

-- polar_cbqt_pushdown_sublink參數預設關閉
SET polar_cbqt_pushdown_sublink to off;

EXPLAIN ANALYZE /*+ Set(polar_cbqt_pushdown_sublink force) */ SELECT * FROM (SELECT a, sum(b) b FROM t_big GROUP BY a)v WHERE a IN (SELECT a FROM t_small);

返回結果如下：

                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=17.96..17.99 rows=2 width=12) (actual time=0.073..0.076 rows=2 loops=1)
   Group Key: t_big.a
   ->  Sort  (cost=17.96..17.96 rows=2 width=8) (actual time=0.067..0.069 rows=2 loops=1)
         Sort Key: t_big.a
         Sort Method: quicksort  Memory: 25kB
         ->  Nested Loop  (cost=1.46..17.95 rows=2 width=8) (actual time=0.026..0.040 rows=2 loops=1)
               ->  Unique  (cost=1.03..1.04 rows=2 width=4) (actual time=0.011..0.015 rows=2 loops=1)
                     ->  Sort  (cost=1.03..1.03 rows=2 width=4) (actual time=0.010..0.011 rows=2 loops=1)
                           Sort Key: t_small.a
                           Sort Method: quicksort  Memory: 25kB
                           ->  Seq Scan on t_small  (cost=0.00..1.02 rows=2 width=4) (actual time=0.005..0.006 rows=2 loops=1)
               ->  Index Scan using t_big_a_idx on t_big  (cost=0.42..8.44 rows=1 width=8) (actual time=0.009..0.009 rows=1 loops=2)
                     Index Cond: (a = t_small.a)
 Planning Time: 0.788 ms
 Execution Time: 0.156 ms
(15 rows)