×
Community Blog Optimizations with Full-Text Search in PostgreSQL

Optimizations with Full-Text Search in PostgreSQL

This article shows how PostgreSQL, equipped with both GIN indexes and the cost-based optimizer (CBO), can automatically choose the most optimal query method.

By Digoal.

With built-in GIN indexes, PostgreSQL supports full-text search and searching multi-value data types including arrays. GIN in the GIN index stands for Generalized Inverted Index. PostgreSQL can also conduct queries that do not contain any keyword and that simply scan, skipping the token above the main tree.

All of these kinds of queries can be cost-effective by being optimized to the data they are querying. The later type is very cost-effective, in particular, because through skipping a token that contains a relatively large amount of data, these queries can go much faster than queries than use tokens. Moreover, for this particular kind of function and query, PostgreSQL also uses a cost-based optimizer (CBO), which is a type of execution plan optimizer, which automatically choose the best index for the particular job.

In this article we will look at four different query examples to show how PostgreSQL, equipped with both GIN indexes and the cost-based optimizer (CBO), can use the most optimal plan to choose the best index. Last, we will summarize what we discovered about from our example queries.

Example 1. Full-Text Search: "Does Not Contain" Query

For this type of query, you'll need to set up some things. First, create a test table:

postgres=# create table notcontain (id int, info tsvector);  
CREATE TABLE  

Next, create a function to generate random strings.

CREATE OR REPLACE FUNCTION   
gen_rand_str(integer)    
 RETURNS text    
 LANGUAGE sql    
 STRICT    
AS $function$    
  select string_agg(a[(random()*6)::int+1],'') from generate_series(1,$1), (select array['a','b','c','d','e','f',' ']) t(a);    
$function$;   

Insert 1 million entries of test data:

postgres=# insert into notcontain select generate_series(1,1000000), to_tsvector(gen_rand_str(256));   

After you do all of this, now create full-text index, which is a type of GIN index.

create index idx_notcontain_info on notcontain using gin (info);  

Now, it's time for the query. The query this time does not contain a keyword. So you're want to query one of the records:

postgres=# select * from notcontain limit 1;  
-[ RECORD 1 ]----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
id   | 1  
info | 'afbbeeccbf':3 'b':16 'bdcdfd':2 'bdcfbcecdeeaed':8 'bfedfecbfab':7 'cd':9 'cdcaefaccdccadeafadededddcbdecdaefbcfbdaefcec':14 'ceafecff':6 'd':17,18 'dbc':12 'dceabcdcbdca':10 'dddfdbffffeaca':13 'deafcccfbcdebdaecda':11 'dfbadcdebdedbfa':19 'eb':15 'ebe':1 'febdcbdaeaeabbdadacabdbbedfafcaeabbdcedaeca':5 'fedeecbcdfcdceabbabbfcdd':4  

Next, you'll want to do a test. For the text, the database automatically chooses Full Table Scan, and does not use the GIN index. You may ask why wasn't an index used? The reason for this is because the number of data records here containing the keyword amounts to relatively small number. Therefore, it is not cost-effective to use indexes for filtering in a query that does not contain a keyword, as a result the optimizer doesn't choose this as an option. However, note that, if the query contains a keyword, it will be very cost-effective to use the GIN indexes. "Contains" and "Does Not Contain" are inversions of each other, and so are the associated costs.

select * from notcontain t1 where info @@ to_tsquery ('!eb');  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where info @@ to_tsquery ('!eb');  
                                                             QUERY PLAN                                                               
------------------------------------------------------------------------------------------------------------------------------------  
 Seq Scan on postgres.notcontain t1  (cost=0.00..318054.51 rows=950820 width=412) (actual time=0.016..1087.463 rows=947911 loops=1)  
   Output: id, info  
   Filter: (t1.info @@ to_tsquery('!eb'::text))  
   Rows Removed by Filter: 52089  
   Buffers: shared hit=55549  
 Planning time: 0.131 ms  
 Execution time: 1134.571 ms  
(7 rows)  

Next, you can force disable the Full Table Scan, and allow the database to choose the indexes. From this, we can see that searches that use index are really slow. Most of the time we can trust the database to find the most efficient method to complete our queries. All of these can be calibrated as long as cost factors and environmental performance measuring are sufficiently accurate.

postgres=# set enable_seqscan=off;  
SET  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where info @@ to_tsquery ('!eb');  
                                                                       QUERY PLAN                                                                         
--------------------------------------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on postgres.notcontain t1  (cost=82294981.00..82600120.25 rows=950820 width=412) (actual time=1325.587..1540.145 rows=947911 loops=1)  
   Output: id, info  
   Recheck Cond: (t1.info @@ to_tsquery('!eb'::text))  
   Heap Blocks: exact=55549  
   Buffers: shared hit=171948  
   ->  Bitmap Index Scan on idx_notcontain_info  (cost=0.00..82294743.30 rows=950820 width=0) (actual time=1315.663..1315.663 rows=947911 loops=1)  
         Index Cond: (t1.info @@ to_tsquery('!eb'::text))  
         Buffers: shared hit=116399  
 Planning time: 0.135 ms  
 Execution time: 1584.670 ms  
(10 rows)  

Example 2. Full-Text Search: "Does Not Contain" Query

For this example query, you will create some unevenly distributed data. The token for this data will contain a large number of repeated content, and you will filter them out with a "Does Not Contain" operator. So the question remains, will the index be used?

For this example, first generate some test data:

postgres=# truncate notcontain ;  
TRUNCATE TABLE  
postgres=# insert into notcontain select generate_series(1,1000000), to_tsvector('abc');  
INSERT 0 1000000  

Given that the test search entries that do not contain ABC, the database automatically chooses indexed scanning and skips data blocks that do not need to be searched.

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where info @@ to_tsquery ('!abc');  
                                                              QUERY PLAN                                                                 
---------------------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on postgres.notcontain t1  (cost=220432.15..220433.71 rows=1 width=21) (actual time=107.936..107.936 rows=0 loops=1)  
   Output: id, info  
   Recheck Cond: (t1.info @@ to_tsquery('!abc'::text))  
   Buffers: shared hit=268  
   ->  Bitmap Index Scan on idx_notcontain_info  (cost=0.00..220432.15 rows=1 width=0) (actual time=107.933..107.933 rows=0 loops=1)  
         Index Cond: (t1.info @@ to_tsquery('!abc'::text))  
         Buffers: shared hit=268  
 Planning time: 0.183 ms  
 Execution time: 107.962 ms  
(9 rows)  

Next, you'll want to force enable Full Table Scan. From this, we can see that the performance is truly no match for indexed scanning, which proves that PostgreSQL is a cost-based optimizer and can automatically choose the most optimal execution plan.

postgres=# set enable_bitmapscan =off;  
SET  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where info @@ to_tsquery ('!abc');  
                                                         QUERY PLAN                                                           
----------------------------------------------------------------------------------------------------------------------------  
 Seq Scan on postgres.notcontain t1  (cost=0.00..268870.00 rows=1 width=21) (actual time=1065.436..1065.436 rows=0 loops=1)  
   Output: id, info  
   Filter: (t1.info @@ to_tsquery('!abc'::text))  
   Rows Removed by Filter: 1000000  
   Buffers: shared hit=6370  
 Planning time: 0.059 ms  
 Execution time: 1065.449 ms  
(7 rows)  

Example 3. Ordinary B-Tree Indexes: "Not Equal To" Searches

This example will be an ordinary search that uses a B-tree index. So the question for us is whether PostgreSQL supports the "Not Equal To" indexed search. The test method which will be used for this example will be similar to that of GIN test. In this test, I you will use both unevenly and evenly distributed data.

For a query that uses the "Does Not Contain" operator on unevenly distributed data, the number of records filtered out with indexes is a relatively small number. So far, as we have seen in this tutorial, "Does Not Contain" searches that use B-tree indexes are not supported at the kernel level. Granted, though, this can be achieved by skipping Branch nodes that do not need to be scanned with INDEX SKIP SCAN.

postgres=# truncate notcontain ;  
TRUNCATE TABLE  
postgres=# insert into notcontain select generate_series(1,1000000);  
INSERT 0 1000000  
postgres=# create index idx1 on notcontain (id);  
CREATE INDEX  
postgres=# set enable_bitmapscan =on;  
SET  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where id<>1;  
                                                           QUERY PLAN                                                              
---------------------------------------------------------------------------------------------------------------------------------  
 Seq Scan on postgres.notcontain t1  (cost=0.00..16925.00 rows=999999 width=36) (actual time=0.011..110.592 rows=999999 loops=1)  
   Output: id, info  
   Filter: (t1.id <> 1)  
   Rows Removed by Filter: 1  
   Buffers: shared hit=4425  
 Planning time: 0.195 ms  
 Execution time: 156.013 ms  
(7 rows)  
  
  
postgres=# set enable_seqscan=off;  
SET  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where id<>1;  
                                                                   QUERY PLAN                                                                      
-------------------------------------------------------------------------------------------------------------------------------------------------  
 Seq Scan on postgres.notcontain t1  (cost=10000000000.00..10000016925.00 rows=999999 width=36) (actual time=0.011..110.964 rows=999999 loops=1)  
   Output: id, info  
   Filter: (t1.id <> 1)  
   Rows Removed by Filter: 1  
   Buffers: shared hit=4425  
 Planning time: 0.062 ms  
 Execution time: 156.461 ms  
(7 rows)  

Next, you're going to want to change the SQL write method to achieve an indexed search. As INDEX SKIP SCAN is not used in this case, we need a JOIN process. Otherwise, the resulting performance will not be very optimal.

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where not exists (select 1 from notcontain t2 where t1.id=t2.id and t2.id=1);  
                                                                      QUERY PLAN                                                                        
------------------------------------------------------------------------------------------------------------------------------------------------------  
 Merge Anti Join  (cost=0.85..25497.28 rows=999999 width=36) (actual time=0.023..277.639 rows=999999 loops=1)  
   Output: t1.id, t1.info  
   Merge Cond: (t1.id = t2.id)  
   Buffers: shared hit=7164  
   ->  Index Scan using idx1 on postgres.notcontain t1  (cost=0.42..22994.22 rows=1000000 width=36) (actual time=0.009..148.520 rows=1000000 loops=1)  
         Output: t1.id, t1.info  
         Buffers: shared hit=7160  
   ->  Index Only Scan using idx1 on postgres.notcontain t2  (cost=0.42..3.04 rows=1 width=4) (actual time=0.007..0.008 rows=1 loops=1)  
         Output: t2.id  
         Index Cond: (t2.id = 1)  
         Heap Fetches: 1  
         Buffers: shared hit=4  
 Planning time: 0.223 ms  
 Execution time: 322.798 ms  
(14 rows)  
postgres=# set enable_mergejoin=off;  
SET  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where not exists (select 1 from notcontain t2 where t1.id=t2.id and t2.id=1);  
                                                                  QUERY PLAN                                                                    
----------------------------------------------------------------------------------------------------------------------------------------------  
 Hash Anti Join  (cost=3.05..27053.05 rows=999999 width=36) (actual time=0.060..251.232 rows=999999 loops=1)  
   Output: t1.id, t1.info  
   Hash Cond: (t1.id = t2.id)  
   Buffers: shared hit=4432  
   ->  Seq Scan on postgres.notcontain t1  (cost=0.00..14425.00 rows=1000000 width=36) (actual time=0.011..84.659 rows=1000000 loops=1)  
         Output: t1.id, t1.info  
         Buffers: shared hit=4425  
   ->  Hash  (cost=3.04..3.04 rows=1 width=4) (actual time=0.014..0.014 rows=1 loops=1)  
         Output: t2.id  
         Buckets: 1024  Batches: 1  Memory Usage: 9kB  
         Buffers: shared hit=4  
         ->  Index Only Scan using idx1 on postgres.notcontain t2  (cost=0.42..3.04 rows=1 width=4) (actual time=0.010..0.011 rows=1 loops=1)  
               Output: t2.id  
               Index Cond: (t2.id = 1)  
               Heap Fetches: 1  
               Buffers: shared hit=4  
 Planning time: 0.147 ms  
 Execution time: 297.127 ms  
(18 rows)  
  
postgres=# set enable_seqscan=off;  
SET  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where not exists (select 1 from notcontain t2 where t1.id=t2.id and t2.id=1);  
                                                                      QUERY PLAN                                                                        
------------------------------------------------------------------------------------------------------------------------------------------------------  
 Hash Anti Join  (cost=3.48..35622.27 rows=999999 width=36) (actual time=0.036..324.401 rows=999999 loops=1)  
   Output: t1.id, t1.info  
   Hash Cond: (t1.id = t2.id)  
   Buffers: shared hit=7164  
   ->  Index Scan using idx1 on postgres.notcontain t1  (cost=0.42..22994.22 rows=1000000 width=36) (actual time=0.017..149.383 rows=1000000 loops=1)  
         Output: t1.id, t1.info  
         Buffers: shared hit=7160  
   ->  Hash  (cost=3.04..3.04 rows=1 width=4) (actual time=0.011..0.011 rows=1 loops=1)  
         Output: t2.id  
         Buckets: 1024  Batches: 1  Memory Usage: 9kB  
         Buffers: shared hit=4  
         ->  Index Only Scan using idx1 on postgres.notcontain t2  (cost=0.42..3.04 rows=1 width=4) (actual time=0.008..0.009 rows=1 loops=1)  
               Output: t2.id  
               Index Cond: (t2.id = 1)  
               Heap Fetches: 1  
               Buffers: shared hit=4  
 Planning time: 0.141 ms  
 Execution time: 369.749 ms  
(18 rows)  

In fact, in the SQL statement, we can also use <1 or >1 or null.

Next, PostgreSQL uses multi-core parallel computing to allow incredible performance improvements to Full Table Scan. Parallel scanning can remarkably improve the performance in case with a significantly large number of records.

postgres=# create  unlogged table ptbl(id int);  
CREATE TABLE  
postgres=# insert into ptbl select generate_series(1,100000000);  
  
postgres=# alter table ptbl set (parallel_workers =32);  
  
\timing  
  
Non-concurrently query:  
postgres=# set max_parallel_workers_per_gather =0;  
postgres=# select count(*) from ptbl where id<>1;  
  count     
----------  
 99999999  
(1 row)  
  
Time: 11863.151 ms (00:11.863)  
  
Concurrently query:  
postgres=# set max_parallel_workers_per_gather =32;  
postgres=# select count(*) from ptbl where id<>1;  
  count     
----------  
 99999999  
(1 row)  
  
Time: 610.017 ms  

Performance saw a marked improvement when we used a parallel query.

Example 4. Ordinary Partial B-Tree Indexes: "Not Equal To" Searches

For this last example query, we will look at "Not Equal To" queries. Generally, for fixed "Not Equal To" queries, we can use the partial index function of PostgreSQL.

create table tbl (id int, info text, crt_time timestamp, c1 int);

select * from tbl where c1<>1;

insert into tbl select generate_series(1,10000000), 'test', now(), 1;
insert into tbl values (1,'abc',now(),2);

create index idx_tbl_1 on tbl(id) where c1<>1;

Interestingly, it takes only 0.03 ms to run a "Not Equal To" search over 10 million entries of data using partial index.

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where c1<>1;
                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_tbl_1 on postgres.tbl  (cost=0.12..1.44 rows=1 width=21) (actual time=0.015..0.015 rows=1 loops=1)
   Output: id, info, crt_time, c1
   Buffers: shared hit=1 read=1
 Planning time: 0.194 ms
 Execution time: 0.030 ms
(5 rows)

Summary

From the examples above, we can learn several things:

  1. With built-in GIN indexes, PostgreSQL supports full-text search and searching multi-value data types including arrays.
  2. PostgreSQL uses a cost-based execution plan optimizer. It automatically chooses the best execution plan. When running a "Does Not Contain" search, PostgreSQL automatically chooses whether to use indexed scanning.
  3. With regards to B-tree indexes, technically it can be used to implement the "Not Equal To" searches (INDEX SKIP SCAN), but it is currently not supported at the kernel level. We can use indexed scanning by adjusting the current SQL write methods.
  4. PostgreSQL uses multi-core parallel computing to allow incredible performance improvements to Full Table Scan. Parallel scanning can remarkably improve the performance in case with a significantly large number of records.
  5. PostgreSQL supports partial index, which supports partitioned or partial indexes. Performance is outstanding for "Not Equal To" queries with fixed conditions.
0 0 0
Share on

digoal

77 posts | 4 followers

You may also like

Comments