how to use partitionedtablescannode, performance comparison with append - PolarDB

This topic describes the PartitionedTable Scan feature, including its background, usage, and performance.

Background

PolarDB for PostgreSQL does not limit the number of partitions in a partitioned table. If a table has more than two levels of partitions, the number of partitions increases exponentially.

For example, a table has two levels of partitions. The first level has 100 hash partitions. Each first-level partition is then divided into 100 hash subpartitions. The entire partitioned table has 10,000 partitions. If you query this partitioned table, the query plan is as follows:

explain analyze select * from part_hash;
                                 QUERY PLAN                                  
-----------------------------------------------------------------------------
 Append  (cost=0.00..344500.00 rows=16300000 width=22)
   ->  Seq Scan on part_hash_sys0102  (cost=0.00..26.30 rows=1630 width=22)
   ->  Seq Scan on part_hash_sys0103  (cost=0.00..26.30 rows=1630 width=22)
   ->  Seq Scan on part_hash_sys0104  (cost=0.00..26.30 rows=1630 width=22)
  ...
  ...
  ...
   ->  Seq Scan on part_hash_sys10198  (cost=0.00..26.30 rows=1630 width=22)
   ->  Seq Scan on part_hash_sys10199  (cost=0.00..26.30 rows=1630 width=22)
   ->  Seq Scan on part_hash_sys10200  (cost=0.00..26.30 rows=1630 width=22)
 Planning Time: 3183.644 ms
 Execution Time: 633.779 ms
(10003 rows)

The results show that the query is slow. This is because the optimizer handles partitioned tables by first generating an optimal plan for each partition and then using the Append operator to combine these plans into a final plan for the partitioned table. If a partitioned table has a small number of partitions, this process is fast and users will not notice a delay. However, as the number of partitions grows, the process becomes noticeably slower. Queries on partitioned tables can become much slower than queries on standard tables.

For example, in the preceding SQL statement, the table part_hash has 10,000 partitions. Its Planning Time is about 3 seconds. In contrast, the Planning Time for a query on a standard table is only 0.1 milliseconds. This is a difference of several hundred times. In addition to a long Planning Time, the query process also uses a large amount of memory, which can cause an out-of-memory (OOM) error.

This drawback of partitioned tables is more obvious in join queries:

CREATE TABLE part_hash2 (a int, b int, c varchar(10))
PARTITION BY HASH(a) SUBPARTITION BY HASH (b) PARTITIONS 100 SUBPARTITIONS 100;

explain analyze select count(*) from part_hash a join part_hash2 b on a.a=b.b where b.c = '0001';
                                                                            QUERY PLAN                                                                             
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=48970442.90..48970442.91 rows=1 width=8) (actual time=6466.854..6859.935 rows=1 loops=1)
   ->  Gather  (cost=48970442.68..48970442.89 rows=2 width=8) (actual time=397.780..6859.902 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=48969442.68..48969442.69 rows=1 width=8) (actual time=4.748..11.768 rows=1 loops=3)
               ->  Merge Join  (cost=1403826.01..42177776.01 rows=2716666667 width=0) (actual time=4.736..11.756 rows=0 loops=3)
                     Merge Cond: (a.a = b.b)
                     ->  Sort  (cost=1093160.93..1110135.93 rows=6790000 width=4) (actual time=4.734..8.588 rows=0 loops=3)
                           Sort Key: a.a
                           Sort Method: quicksort  Memory: 25kB
                           Worker 0:  Sort Method: quicksort  Memory: 25kB
                           Worker 1:  Sort Method: quicksort  Memory: 25kB
                           ->  Parallel Append  (cost=0.00..229832.35 rows=6790000 width=4) (actual time=4.665..8.518 rows=0 loops=3)
                                 ->  Parallel Seq Scan on part_hash_sys0102 a  (cost=0.00..19.59 rows=959 width=4) (actual time=0.001..0.001 rows=0 loops=1)
                                 ->  Parallel Seq Scan on part_hash_sys0103 a_1  (cost=0.00..19.59 rows=959 width=4) (actual time=0.001..0.001 rows=0 loops=1)
                                 ->  Parallel Seq Scan on part_hash_sys0104 a_2  (cost=0.00..19.59 rows=959 width=4) (actual time=0.001..0.001 rows=0 loops=1)
                                  ...
                     ->  Sort  (cost=310665.08..310865.08 rows=80000 width=4) (never executed)
                           Sort Key: b.b
                           ->  Append  (cost=0.00..304150.00 rows=80000 width=4) (never executed)
                                 ->  Seq Scan on part_hash2_sys0102 b  (cost=0.00..30.38 rows=8 width=4) (never executed)
                                       Filter: ((c)::text = '0001'::text)
                                 ->  Seq Scan on part_hash2_sys0103 b_1  (cost=0.00..30.38 rows=8 width=4) (never executed)
                                       Filter: ((c)::text = '0001'::text)
                                 ->  Seq Scan on part_hash2_sys0104 b_2  (cost=0.00..30.38 rows=8 width=4) (never executed)
                                       Filter: ((c)::text = '0001'::text)
                                  ... 

 Planning Time: 221082.616 ms
 Execution Time: 9500.148 ms
(30018 rows)

This shows that during a full table scan, a partitioned table is less efficient than a standard table. This happens because the query has no conditions to limit the scan to specific partitions. All advantages of a partitioned table are lost. You can use partition pruning to limit a query to a small number of partitions. However, some online analytical processing (OLAP) scenarios require a full table scan of the entire partitioned table.

Overview

To solve this problem, PolarDB for PostgreSQL provides the PartitionedTableScan operator. This query operator for partitioned tables is more efficient than the Append operator. It significantly reduces Planning Time and consumes less memory, which helps prevent Out of Memory (OOM) errors.

The following shows the Planning Time and memory usage for two separate SQL queries run with the PartitionedTableScan operator.

explain analyze select * from part_hash;
                                 QUERY PLAN                                  
----------------------------------------------------------------------------------------------------------------------------------------------------
PartitionedTableScan on part_hash  (cost=0.00..1.00 rows=1 width=22) (actual time=134.348..134.352 rows=0 loops=1)(Iteration partition number 10000)
   Scan Partitions: part_hash_sys0102, part_hash_sys0103, ...part_hash_sys10198, part_hash_sys10199, part_hash_sys10200
   ->  Seq Scan on part_hash  (cost=0.00..1.00 rows=1 width=22)
 Planning Time: 293.778 ms
 Execution Time: 384.202 ms
(5 rows)

explain analyze select count(*) from part_hash a join part_hash2 b on a.a=b.b where b.c = '0001';
                                                                            QUERY PLAN                                                                             
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate  (cost=2.02..2.03 rows=1 width=8) (actual time=152.322..152.326 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..2.02 rows=1 width=0) (actual time=152.308..152.311 rows=0 loops=1)
         Join Filter: (a.a = b.b)
         ->  PartitionedTableScan on part_hash a  (cost=0.00..1.00 rows=1 width=4) (actual time=152.305..152.306 rows=0 loops=1)(Iteration partition number 10000)
               Scan Partitions: part_hash_sys0102, part_hash_sys0103,, part_hash_sys10198, part_hash_sys10199, part_hash_sys10200
               ->  Seq Scan on part_hash a  (cost=0.00..1.00 rows=1 width=4)
         ->  PartitionedTableScan on part_hash2 b  (cost=0.00..1.00 rows=1 width=4) (never executed)
               ->  Seq Scan on part_hash2 b  (cost=0.00..1.00 rows=1 width=4)
                     Filter: ((c)::text = '0001'::text)
 Planning Time: 732.952 ms
 Execution Time: 436.927 ms
(11 rows)

As you can see, in this example, the Planning Time is significantly reduced after using the PartitionedTableScan operator. The following table shows a comparison of the specific values:

Type	`Append`	`PartitionedTableScan`
Single Query Planning Time	3183.644ms	293.778ms
Join Query Planning Time	221082.616ms	732.952ms

Applicability

The PartitionedTableScan operator is supported only on PolarDB for PostgreSQL 14 with a minor engine version of 2.0.11.9.32.0 or later.
PartitionedTableScan currently supports only SELECT statements. It does not support DML statements.
PartitionedTableScan does not support partition-wise join. If you enable the partition-wise join feature, a PartitionedTableScan plan is not generated.

Note

The PartitionedTable Scan feature is available only for clusters with a minor engine version of 2.0.11.9.32.0 or later. If you have an existing cluster with an earlier minor engine version and need to use this feature, contact us to enable it.

Usage

The following examples show how to use the PartitionedTableScan feature.

First, create a partitioned table.

CREATE TABLE prt1 (a int, b int, c varchar) PARTITION BY Hash(a) partitions 16;

Enable PartitionedTableScan using a parameter

You can enable or disable the PartitionedTableScan feature by setting the polar_num_parts_for_pts parameter.

Parameter

Valid values

Default value

Description

polar_num_parts_for_pts

-1 to INT_MAX

Specifies the threshold for the number of subpartitions in a partitioned table that triggers the PartitionedTableScan feature. The values of this parameter are described as follows:

If polar_num_parts_for_pts is set to 0 (the default value), PartitionedTableScan is never enabled, regardless of the number of subpartitions.
If polar_num_parts_for_pts is set to -1, PartitionedTableScan is always enabled, regardless of the number of subpartitions.
If polar_num_parts_for_pts is greater than 0, the PartitionedTableScan feature is automatically enabled when the number of subpartitions in a table exceeds the value of the polar_num_parts_for_pts parameter. For example, if you set this parameter to 64, the PartitionedTableScan feature is automatically enabled only when the number of subpartitions is greater than 64.

The following is an example:

SET polar_num_parts_for_pts to -1;
explain select * from prt1;
                           QUERY PLAN                            
-----------------------------------------------------------------
 PartitionedTableScan on prt1  (cost=0.00..1.00 rows=1 width=40)
   ->  Seq Scan on prt1  (cost=0.00..1.00 rows=1 width=40)
(2 rows)

Use a HINT

Use the HINT syntax PARTEDSCAN(table alias). The following is an example:

EXPLAIN select /*+PARTEDSCAN(prt1) */ select * from prt1;
                           QUERY PLAN                            
-----------------------------------------------------------------
 PartitionedTableScan on prt1  (cost=0.00..1.00 rows=1 width=40)
   ->  Seq Scan on prt1  (cost=0.00..1.00 rows=1 width=40)
(2 rows)

Parallel query

PolarDB for PostgreSQL supports parallel query for partitioned tables. This feature enables efficient queries on large datasets. Similar to Append, PartitionedTableScan also supports parallel query. However, the parallel query implementation for PartitionedTableScan is called Parallel PartitionedTableScan and supports only two modes of parallelism: inter-partition parallelism and hybrid parallelism.

Inter-partition parallelism

Inter-partition parallelism means that each worker queries a different partition. This allows multiple workers to query the entire partitioned table in parallel.

EXPLAIN (COSTS OFF) select /*+PARTEDSCAN(prt1) */ * from prt1;
                 QUERY PLAN                  
---------------------------------------------
 Gather
   Workers Planned: 4
   ->  Parallel PartitionedTableScan on prt1
         ->  Seq Scan on prt1
(4 rows)

As shown in the preceding plan, four parallel workers are started for the partitioned table (Workers Planned: 4). Each worker is responsible for querying one partition. A clear indicator is the operator named Parallel PartitionedTableScan.

Hybrid parallelism

Hybrid parallelism means that execution can run in parallel both between partitions and within each partition. This provides the highest degree of parallelism for a parallel query.

EXPLAIN (COSTS OFF) select /*+PARTEDSCAN(prt1) */ * from prt1;
                 QUERY PLAN                  
---------------------------------------------
 Gather
   Workers Planned: 8
   ->  Parallel PartitionedTableScan on prt1
         ->  Parallel Seq Scan on prt1
(4 rows)

As shown in the preceding plan, the query uses eight workers for parallel execution (Workers Planned: 8). Queries can run in parallel between partitions and also within each partition. A clear indicator is the operator named Parallel PartitionedTableScan.

Both parallel modes have their own cost models. The optimizer chooses the optimal one.

Partition pruning

Like Append, PartitionedTableScan supports partition pruning in three stages. For more information, see Partition pruning.

Performance comparison

PartitionedTableScan is more efficient than Append. The following test data shows a performance comparison between PartitionedTableScan and Append.

Note

The following data is from a development environment and is for reference only. Actual performance may vary based on your configuration and conditions. The purpose of this test is to compare the performance of Append and PartitionedTableScan in a consistent environment where the operator is the only variable.

The following SQL statements are used for testing:

explain select  * from prt1 where b = 10; 

explain select /*+PARTEDSCAN(prt1) */ * from prt1 where b = 10;

Planning time of a single SQL statement

Number of partitions	Append Planning Time	PartitionedTableScan Planning Time
16	0.266ms	0.067ms
32	1.820ms	0.258ms
64	3.654ms	0.402ms
128	7.010ms	0.664ms
256	14.095ms	1.247ms
512	27.697ms	2.328ms
1024	73.176ms	4.165ms

Memory (Memory usage of a single SQL statement)

Number of partitions	Append Memory	PartitionedTableScan Memory
16	1,170 KB	1,044 KB
32	1,240 KB	1,044 KB
64	2,120 KB	1,624 KB
128	2,244 KB	1,524 KB
256	2,888 KB	2,072 KB
512	4,720 KB	3,012 KB
1024	8,236 KB	5,280 KB

Queries per second (QPS)

pgbench -i --scale=10
pgbench -c 64 -j 64 -n -T60
Query:
	explain select  * from prt1 where b = 10; 
	explain select /*+PARTEDSCAN(prt1) */ * from prt1 where b = 10;

Number of partitions	Append QPS	PartitionedTableScan QPS
16	25,318	93,950
32	10,906	61,879
64	5,281	30,839
128	2,195	16,684
256	920	8,372
512	92	3,708
1024	21	1,190

Conclusion

The preceding comparison between PartitionedTableScan and Append shows that PartitionedTableScan significantly improves performance over Append as the number of partitions increases. Therefore, if your partitioned tables have many partitions and a long Planning Time, we recommend using PartitionedTableScan for optimization.