×
Community Blog How PostgreSQL UDF Can Accelerate the Performance of Compound Queries of Single- and Multi-value Fields

How PostgreSQL UDF Can Accelerate the Performance of Compound Queries of Single- and Multi-value Fields

This article describes the methods to accelerate compound queries of single-value fields and multi-value fields over 100 times, focusing on PostgreSQL UDF implementation.

Background

In applications, especially for businesses that use PostgreSQL multi-value columns (arrays, full-text search, and JSON), a single-value column also needs to be queried in addition to multi-value columns.

Therefore, in cases, when both types of queries coexist, the database selects a single column or multiple columns to compose the index based on the cost. However, it cannot reach the best performance but only simplifies the user's index design. Let's consider the following example to understand this better.


create table tbl(gid int, c1 int[]);  
  
  
insert into tbl select random()*99, gen_randarr(999, 10) from generate_series(1,10000000); 

Since gid has 100 values and c1 has 10 values (the value range is less than 1,000), the user may query by gid or c1, or by combining both. When user queries by combining both fields, the current method is not efficient. This includes btree_gin.

This leads to a critical question, how to accelerate the query?

Demo

Let's have a quick look at the following demo to understand different acceleration methods:

Step 1. Create a function to generate a random number.

CREATE OR REPLACE FUNCTION public.gen_randarr(integer, integer)  
 RETURNS integer[]  
 LANGUAGE sql  
 STRICT  
AS $function$  
  select array(select (random()*$1)::int from generate_series(1,$2));  
$function$;  

Step 2. Now, create a test table that contains a single-value column and a multi-value column.

create table tbl(gid int, c1 int[]); 

Step 3. Write 10 millions pieces of data.

insert into tbl select random()*99, gen_randarr(999, 10) from generate_series(1,10000000); 

Traditional Acceleration Method 1

Now, implementing the traditional acceleration method, create a compound index of single-value columns and multi-value columns.

create extension btree_gin;  
set maintenance_work_mem ='8GB';  
create index idx_tbl_1 on tbl using gin (gid, c1);  

Determine the performance of this compound query as shown below.

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where gid=1 and c1 @> array[1,2,3];  
                                                     QUERY PLAN                                                        
---------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on public.tbl  (cost=96.00..97.02 rows=1 width=65) (actual time=12.810..12.810 rows=0 loops=1)  
   Output: gid, c1  
   Recheck Cond: ((tbl.gid = 1) AND (tbl.c1 @> '{1,2,3}'::integer[]))  
   Buffers: shared hit=184  
   ->  Bitmap Index Scan on idx_tbl_1  (cost=0.00..96.00 rows=1 width=0) (actual time=12.807..12.807 rows=0 loops=1)  
         Index Cond: ((tbl.gid = 1) AND (tbl.c1 @> '{1,2,3}'::integer[]))  
         Buffers: shared hit=184  
 Planning time: 0.154 ms  
 Execution time: 12.838 ms  
(9 rows)  
postgres=# explain (analyze,verbose,timing,costs,buffers)  select * from tbl where gid=1 and c1 && array[1,2,3];  
                                                         QUERY PLAN                                                           
----------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on public.tbl  (cost=129.80..3433.25 rows=3297 width=65) (actual time=17.453..22.486 rows=2932 loops=1)  
   Output: gid, c1  
   Recheck Cond: ((tbl.gid = 1) AND (tbl.c1 && '{1,2,3}'::integer[]))  
   Heap Blocks: exact=2906  
   Buffers: shared hit=3089  
   ->  Bitmap Index Scan on idx_tbl_1  (cost=0.00..128.97 rows=3297 width=0) (actual time=17.121..17.121 rows=2932 loops=1)  
         Index Cond: ((tbl.gid = 1) AND (tbl.c1 && '{1,2,3}'::integer[]))  
         Buffers: shared hit=183  
 Planning time: 0.223 ms  
 Execution time: 22.761 ms  
(10 rows)

In case, you do not seem to have the motivation to optimize further, you will observe that this performance appears to be quite OK.

Multiple index conditions are also used. In fact, this index is a bitmap scan performed after the two indexes are combined internally.

Traditional Acceleration Method 2

The second method lets you create a single-value column index and a multi-value column index, separately.

postgres=# drop index idx_tbl_1;  
DROP INDEX  
  
postgres=# create index idx_tbl_1 on tbl (gid);  
CREATE INDEX  
  
postgres=# create index idx_tbl_2 on tbl using gin (c1);  

The actual effect is worse than that of GIN compound indexes.

explain (analyze,verbose,timing,costs,buffers) select * from tbl where gid=1 and c1 @> array[1,2,3];  
  
explain (analyze,verbose,timing,costs,buffers)  select * from tbl where gid=1 and c1 && array[1,2,3];  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where gid=1 and c1 @> array[1,2,3];  
                                                      QUERY PLAN                                                         
-----------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on public.tbl  (cost=72.09..83.25 rows=1 width=65) (actual time=12.848..12.848 rows=0 loops=1)  
   Output: gid, c1  
   Recheck Cond: (tbl.c1 @> '{1,2,3}'::integer[])  
   Filter: (tbl.gid = 1)  
   Rows Removed by Filter: 13  
   Heap Blocks: exact=13  
   Buffers: shared hit=131  
   ->  Bitmap Index Scan on idx_tbl_2  (cost=0.00..72.09 rows=11 width=0) (actual time=12.810..12.810 rows=13 loops=1)  
         Index Cond: (tbl.c1 @> '{1,2,3}'::integer[])  
         Buffers: shared hit=118  
 Planning time: 0.254 ms  
 Execution time: 12.874 ms  
(12 rows)  
  
postgres=# explain (analyze,verbose,timing,costs,buffers)  select * from tbl where gid=1 and c1 && array[1,2,3];  
                                                              QUERY PLAN                                                                 
---------------------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on public.tbl  (cost=3534.41..6837.86 rows=3297 width=65) (actual time=69.636..74.613 rows=2932 loops=1)  
   Output: gid, c1  
   Recheck Cond: ((tbl.gid = 1) AND (tbl.c1 && '{1,2,3}'::integer[]))  
   Heap Blocks: exact=2906  
   Buffers: shared hit=2982 read=279  
   ->  BitmapAnd  (cost=3534.41..3534.41 rows=3297 width=0) (actual time=69.002..69.002 rows=0 loops=1)  
         Buffers: shared hit=76 read=279  
         ->  Bitmap Index Scan on idx_tbl_1  (cost=0.00..1089.93 rows=106333 width=0) (actual time=13.538..13.538 rows=100704 loops=1)  
               Index Cond: (tbl.gid = 1)  
               Buffers: shared read=279  
         ->  Bitmap Index Scan on idx_tbl_2  (cost=0.00..2442.58 rows=310077 width=0) (actual time=50.878..50.878 rows=296887 loops=1)  
               Index Cond: (tbl.c1 && '{1,2,3}'::integer[])  
               Buffers: shared hit=76  
 Planning time: 0.147 ms  
 Execution time: 74.886 ms  
(15 rows)  

Acceleration Method that Improves the Performance by Over 100 Times

This article focuses on the method that is applicable for scenarios where the compound queries of single-value columns and multi-value columns are used. In fact, we use the UDF and indexing expressions functions of PostgreSQL.

UDF combines a single-value column and a multi-value column into a new multi-value column. Indexes on expressions are created for this UDF with the purpose to remove the internal BITMAP combination and use only one inverted tree. This inverted tree contains the values of the single-value and multi-value columns. Follow the steps listed below to implement this acceleration method:

Step 1. Create a UDF and combine the gid and c1 values as shown in the following example.

create or replace function gen_newarr(int, anyarray) returns text[] as $$  
declare  
  res text[] := '{}';  
  x int;  
begin  
  foreach x in array $2 loop  
    res := array_append(res, $1||'_'||x);  
  end loop;  
  return res;  
end;  
$$ language plpgsql strict immutable;  
  
  
  
  
postgres=# select gen_newarr(123,array[1,2,3,4]);  
-[ RECORD 1 ]-------------------------  
gen_newarr | {123_1,123_2,123_3,123_4}  

Step 2. Create indexes on expressions.

set maintenance_work_mem ='8GB';  
create index idx_tbl_2 on tbl using gin (gen_newarr(gid, c1));  

Step 3. An expression query is used during the query. Therefore, you need to change the SQL statement as shown below.

postgres=# explain (analyze,verbose,timing,costs,buffers)  select * from tbl where gen_newarr(gid, c1) @> array['1_1','1_2','1_3'];  
                                                    QUERY PLAN                                                       
-------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on public.tbl  (cost=12.01..13.27 rows=1 width=65) (actual time=0.146..0.146 rows=0 loops=1)  
   Output: gid, c1  
   Recheck Cond: (gen_newarr(tbl.gid, tbl.c1) @> '{1_1,1_2,1_3}'::text[])  
   Buffers: shared hit=14  
   ->  Bitmap Index Scan on idx_tbl_2  (cost=0.00..12.01 rows=1 width=0) (actual time=0.144..0.144 rows=0 loops=1)  
         Index Cond: (gen_newarr(tbl.gid, tbl.c1) @> '{1_1,1_2,1_3}'::text[])  
         Buffers: shared hit=14  
 Planning time: 0.092 ms  
 Execution time: 0.174 ms  
(9 rows)  
  
postgres=# explain (analyze,verbose,timing,costs,buffers)  select * from tbl where gen_newarr(gid, c1) && array['1_1','1_2','1_3'];  
                                                         QUERY PLAN                                                            
-----------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on public.tbl  (cost=1220.70..133422.08 rows=149251 width=65) (actual time=1.020..6.034 rows=2932 loops=1)  
   Output: gid, c1  
   Recheck Cond: (gen_newarr(tbl.gid, tbl.c1) && '{1_1,1_2,1_3}'::text[])  
   Heap Blocks: exact=2906  
   Buffers: shared hit=2919  
   ->  Bitmap Index Scan on idx_tbl_2  (cost=0.00..1183.38 rows=149251 width=0) (actual time=0.640..0.640 rows=2932 loops=1)  
         Index Cond: (gen_newarr(tbl.gid, tbl.c1) && '{1_1,1_2,1_3}'::text[])  
         Buffers: shared hit=13  
 Planning time: 0.102 ms  
 Execution time: 6.348 ms  
(10 rows)

The query is as follows.

select * from tbl where gen_newarr(gid, c1) && array['1_1','1_2','1_3'];  
  
 gid |                   c1                      
-----+-----------------------------------------  
   1 | {62,904,204,618,917,227,388,352,167,1}  
   1 | {825,126,174,409,340,285,231,942,3,136}  
   1 | {222,418,799,881,728,582,558,2,368,196}  
   1 | {847,197,690,1,288,468,179,521,799,196}  
   1 | {867,316,447,747,953,998,370,360,558,3}  
   1 | {249,963,669,929,534,945,388,816,1,601}  
   1 | {925,609,108,981,712,681,906,832,3,275}  
   1 | {3,354,253,947,588,598,401,89,246,968}  
   1 | {323,121,22,3,7,714,80,619,178,439}  
   1 | {866,1,185,704,932,882,496,324,264,882}  

......

As a result, performance improves significantly.

Acceleration Method for a Compound Query of a Single-Value Column and Full-Text Search

Similarly, combine a single-value column with a full-text search column to accelerate performance.

create table tbl123(gid int, ts tsvector);  
  
insert into tbl123 select random()*99, array_to_tsvector(gen_randarr(999, 10)::text[]) from generate_series(1,10000000);  
create index idx_tbl123_1 on tbl123 using gin ( array_to_tsvector(gen_newarr(gid, tsvector_to_array(ts))) );  
explain (analyze,verbose,timing,costs,buffers)  select * from tbl123 where array_to_tsvector(gen_newarr(gid, tsvector_to_array(ts))) @@ tsquery '1_1 & 1_2 & 1_3';  
  
  
explain (analyze,verbose,timing,costs,buffers)  select * from tbl123 where array_to_tsvector(gen_newarr(gid, tsvector_to_array(ts))) @@ tsquery '1_1 | 1_2 | 1_3';  
  
  
postgres=# explain (analyze,verbose,timing,costs,buffers)  select * from tbl123 where array_to_tsvector(gen_newarr(gid, tsvector_to_array(ts))) @@ tsquery '1_1 & 1_2 & 1_3';  
                                                               QUERY PLAN                                                                  
-----------------------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on public.tbl123  (cost=12.01..13.28 rows=1 width=77) (actual time=0.207..0.207 rows=0 loops=1)  
   Output: gid, ts  
   Recheck Cond: (array_to_tsvector(gen_newarr(tbl123.gid, tsvector_to_array(tbl123.ts))) @@ '''1_1'' & ''1_2'' & ''1_3'''::tsquery)  
   Buffers: shared hit=14  
   ->  Bitmap Index Scan on idx_tbl123_1  (cost=0.00..12.01 rows=1 width=0) (actual time=0.204..0.204 rows=0 loops=1)  
         Index Cond: (array_to_tsvector(gen_newarr(tbl123.gid, tsvector_to_array(tbl123.ts))) @@ '''1_1'' & ''1_2'' & ''1_3'''::tsquery)  
         Buffers: shared hit=14  
 Planning time: 0.080 ms  
 Execution time: 0.238 ms  
(9 rows)  
  
postgres=# explain (analyze,verbose,timing,costs,buffers)  select * from tbl123 where array_to_tsvector(gen_newarr(gid, tsvector_to_array(ts))) @@ tsquery '1_1 | 1_2 | 1_3';  
                                                               QUERY PLAN                                                                  
-----------------------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on public.tbl123  (cost=1220.70..136709.34 rows=149251 width=77) (actual time=0.971..5.988 rows=2970 loops=1)  
   Output: gid, ts  
   Recheck Cond: (array_to_tsvector(gen_newarr(tbl123.gid, tsvector_to_array(tbl123.ts))) @@ '''1_1'' | ''1_2'' | ''1_3'''::tsquery)  
   Heap Blocks: exact=2937  
   Buffers: shared hit=2950  
   ->  Bitmap Index Scan on idx_tbl123_1  (cost=0.00..1183.38 rows=149251 width=0) (actual time=0.612..0.612 rows=2970 loops=1)  
         Index Cond: (array_to_tsvector(gen_newarr(tbl123.gid, tsvector_to_array(tbl123.ts))) @@ '''1_1'' | ''1_2'' | ''1_3'''::tsquery)  
         Buffers: shared hit=13  
 Planning time: 0.029 ms  
 Execution time: 6.284 ms  
(10 rows)  
  
postgres=# select * from tbl123 where array_to_tsvector(gen_newarr(gid, tsvector_to_array(ts))) @@ tsquery '1_1 | 1_2 | 1_3';  
 gid |                            ts                               
-----+-----------------------------------------------------------  
   1 | '180' '219' '253' '262' '282' '3' '633' '657' '807' '809'  
   1 | '1' '166' '261' '670' '807' '860' '897' '922' '93' '964'  
   1 | '1' '174' '211' '319' '322' '532' '84' '849' '869' '993'  

......  

The preceding example shows how performance significantly improves in this scenario as well.

Conclusion

Following points sum up the highlights of this article:

  • For a partition table, each partition defines the corresponding index. However, there can be many partitions in a case when a single-value type contains a large number of values, which may not be a good thing.
  • For partition indexes, currently, PostgreSQL does not support creating a compound index with multiple trees for a single table. One tree is built with a single-value column, and the value points to another tree. Another tree is a GIN inverted tree built with multi-value columns.
  • As mentioned in this article, use UDF to combine a single-value column and a multi-value column into a new multi-value column. This new column contains the attributes of the single-value column. This achieves the same effect as that of partition tables or partition indexes. Performance is significantly improved.

When a multi-value column itself contains the attributes of a single-value column, you do not need to create a compound index for a single-value column and a multi-value column. Instead, you only need to create a multi-value column index.

On the other hand, when a multi-value column does not contain the attributes of a single-value column, you can use UDF if you have a compound query of single-value columns and multi-value columns. This combines a single-value column and a multi-value column into a new multi-value column and creates a GIN inverted index for this new multi-value column. This significantly improves performance. In this use case, the speed is increased by over 100 times.

Build your own PostgreSQL solution on Alibaba Cloud with ApsaraDB for RDS PostgreSQL.

0 0 0
Share on

digoal

215 posts | 15 followers

You may also like

Comments