×
Community Blog Open Source PolarDB Uses pg_trgm GIN Indexes to Implement Efficient Fuzzy Queries

Open Source PolarDB Uses pg_trgm GIN Indexes to Implement Efficient Fuzzy Queries

This article describes how open source PolarDB uses pg_trgm GIN indexes to implement efficient fuzzy queries like '%xxx%'.

Background

PolarDB's cloud-native computing-storage separation architecture offers cost-effective data storage, scalable operations, high-speed multi-machine parallel computing, and efficient data search and processing. By combining PolarDB with computing algorithms, it enables businesses to derive value from their data and turn it into productivity.

This article describes how open source PolarDB uses pg_trgm GIN indexes to implement efficient fuzzy queries like '%xxx%'.

The test environment used for this demonstration is macOS + Docker. For detailed instructions on PolarDB deployment, please refer to the following article: Simple Deployment of PolarDB

Principle

pg_trgm adds two spaces before and one space after the string, then splits the string into groups of three consecutive characters to generate a set of tokens. For example, hello is split into {" h"," he",ell,hel,llo,"lo "}.

postgres=# select show_trgm('hello');  
            show_trgm              
---------------------------------  
 {"  h"," he",ell,hel,llo,"lo "}  
(1 row)  

When you create a GIN index for these tokens and perform a fuzzy search (or even support a regular search), you can apply tokenization to the conditions of the target string without adding spaces before and after the target string, unless prefix or suffix restrictions are specified. This allows you to quickly match the target row using the GIN index.

Use pg_trgm GIN Indexes to Implement Efficient Fuzzy Queries Like '%xxx%' in PolarDB

1.  Create a table and generate 2 million test texts.

create table tbl (id int, info text);  
  
insert into tbl select id, md5(random()::text) from generate_series(1,1000000) id;  
insert into tbl select id, md5(random()::text) from generate_series(1,1000000) id;  

2.  In the absence of indexes, a full table scan is required to perform a fuzzy query, which is time-consuming.

explain (analyze,verbose,timing,costs,buffers) select * from tbl where info like '%abcd%';  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where info like '%abcd%';  
                                                  QUERY PLAN                                                     
---------------------------------------------------------------------------------------------------------------  
 Seq Scan on public.tbl  (cost=0.00..41665.50 rows=200 width=37) (actual time=2.505..522.958 rows=851 loops=1)  
   Output: id, info  
   Filter: (tbl.info ~~ '%abcd%'::text)  
   Rows Removed by Filter: 1999149  
   Buffers: shared hit=16645 read=22 dirtied=8334  
 Planning Time: 1.643 ms  
 Execution Time: 523.138 ms  
(7 rows)  

3.  Create the pg_trgm plug-in and the GIN index.

postgres=# create extension pg_trgm ;  
CREATE EXTENSION  
  
create index on tbl using gin (info gin_trgm_ops);  

4.  Use pg_trgm GIN index to realize high-efficiency fuzzy query like '%xxx%'.

explain (analyze,verbose,timing,costs,buffers) select * from tbl where info like '%abcd%';  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where info like '%abcd%';  
                                                        QUERY PLAN                                                          
--------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on public.tbl  (cost=29.55..762.82 rows=200 width=37) (actual time=2.445..3.962 rows=851 loops=1)  
   Output: id, info  
   Recheck Cond: (tbl.info ~~ '%abcd%'::text)  
   Rows Removed by Index Recheck: 96  
   Heap Blocks: exact=926  
   Buffers: shared hit=946  
   ->  Bitmap Index Scan on tbl_info_idx  (cost=0.00..29.50 rows=200 width=0) (actual time=2.287..2.288 rows=947 loops=1)  
         Index Cond: (tbl.info ~~ '%abcd%'::text)  
         Buffers: shared hit=20  
 Planning Time: 0.239 ms  
 Execution Time: 4.112 ms  
(11 rows)  

We can see that the performance is improved by more than 100 times.

0 1 0
Share on

digoal

276 posts | 24 followers

You may also like

Comments