
PostgreSQL Fuzzy Search Best Practices: Single-word, Double-word, and Multi-word Fuzzy Search Methods

This article describes PostgreSQL best practices to optimize full-text search, fuzzy search, regex matches, and custom fuzzy searches.

By Digoal

Background

Prefix fuzzy searches, suffix fuzzy searches, fully fuzzy searches (patterns with wildcards on both ends), and regex matches are common text search requirements.

PostgreSQL offers powerful text search capabilities and efficiently supports full-text search, fuzzy search, and regex search. Its built-in pg_trgm plug-in, which accelerates such searches, has no equivalent in most other databases. Expression indexes and GIN indexes are also built in.

Different fuzzy search requirements call for different optimization methods. Like other databases, PostgreSQL uses B-tree indexes to accelerate prefix fuzzy searches and suffix fuzzy searches; an expression index on the reverse() function also accelerates suffix fuzzy searches.

PostgreSQL uses the pg_trgm plug-in with GIN indexes to accelerate fully fuzzy searches and regex matches (very effective when the search pattern contains 3 or more characters). PostgreSQL can also use custom GIN expression indexes to accelerate custom fuzzy searches.

I. Optimization for Prefix and Suffix Fuzzy Searches

1. Optimization for Prefix Fuzzy Searches: B-tree supports prefix fuzzy searches

1.1 With the default index operator class ("index ops class"), this method only works when the comparison uses COLLATE "C". If the database's lc_collate is not C, you must explicitly specify COLLATE "C" for both the index and the search condition.

Indexes can be used only when the COLLATE value is the same for indexes and search criteria.

Example.

test=# create table test(id int, info text);      
CREATE TABLE      
test=# insert into test select generate_series(1,1000000),md5(random()::text);      
INSERT 0 1000000      
test=# create index idx on test(info collate "C");      
CREATE INDEX      
      
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like 'abcd%' collate "C";      
                                                      QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------      
 Index Scan using idx on public.test  (cost=0.42..16.76 rows=100 width=37) (actual time=0.057..0.093 rows=18 loops=1)      
   Output: id, info      
   Index Cond: ((test.info >= 'abcd'::text) AND (test.info < 'abce'::text))      
   Filter: (test.info ~~ 'abcd%'::text COLLATE "C")      
   Buffers: shared hit=18 read=3      
 Planning time: 0.424 ms      
 Execution time: 0.124 ms      
(7 rows)      
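If, on the other hand, the search condition omits COLLATE "C" while the database's default collation is not C, the index above cannot be used. A hedged counter-check (plan output omitted; a sequential scan is expected):

test=# explain select * from test where info like 'abcd%';
-- expected: Seq Scan on test, because the LIKE comparison uses the database's
-- default collation, which does not match the index's COLLATE "C".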

1.2 When the database's default lc_collate is not C, another option is to create the B-tree index with the corresponding "pattern ops" operator class, so that values are compared character by character instead of by the locale's collation rules.

The PostgreSQL documentation explains this as follows.

The operator classes text_pattern_ops, varchar_pattern_ops, and bpchar_pattern_ops   
support B-tree indexes on the types text, varchar, and char respectively.   
  
The difference from the default operator classes is that the values are compared strictly character by character rather than according to the locale-specific collation rules.   
  
This makes these operator classes suitable for use by queries involving pattern   
matching expressions (LIKE or POSIX regular expressions) when the database   
does not use the standard "C" locale.  

Example.

test=# drop table test;  
DROP TABLE  
test=# create table test(id int, info text);      
CREATE TABLE  
test=# insert into test select generate_series(1,1000000),md5(random()::text);      
INSERT 0 1000000  
test=# create index idx on test(info text_pattern_ops);   
CREATE INDEX  
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like 'abcd%' collate "zh_CN";      
                                                      QUERY PLAN                                                        
----------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx on public.test  (cost=0.42..16.76 rows=100 width=37) (actual time=0.038..0.059 rows=12 loops=1)  
   Output: id, info  
   Index Cond: ((test.info ~>=~ 'abcd'::text) AND (test.info ~<~ 'abce'::text))  
   Filter: (test.info ~~ 'abcd%'::text COLLATE "zh_CN")  
   Buffers: shared hit=12 read=3  
 Planning time: 0.253 ms  
 Execution time: 0.081 ms  
(7 rows)  
  
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like 'abcd%' collate "C";      
                                                      QUERY PLAN                                                        
----------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx on public.test  (cost=0.42..16.76 rows=100 width=37) (actual time=0.027..0.050 rows=12 loops=1)  
   Output: id, info  
   Index Cond: ((test.info ~>=~ 'abcd'::text) AND (test.info ~<~ 'abce'::text))  
   Filter: (test.info ~~ 'abcd%'::text COLLATE "C")  
   Buffers: shared hit=15  
 Planning time: 0.141 ms  
 Execution time: 0.072 ms  
(7 rows)  

When you use "pattern ops", the index supports both LIKE and anchored regular expressions, as follows.

test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info ~ '^abcd';  
                                                      QUERY PLAN                                                        
----------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx on public.test  (cost=0.42..16.76 rows=100 width=37) (actual time=0.031..0.061 rows=12 loops=1)  
   Output: id, info  
   Index Cond: ((test.info ~>=~ 'abcd'::text) AND (test.info ~<~ 'abce'::text))  
   Filter: (test.info ~ '^abcd'::text)  
   Buffers: shared hit=15  
 Planning time: 0.213 ms  
 Execution time: 0.083 ms  
(7 rows)  

2. Optimization for Suffix Fuzzy Searches: Use an index on the reverse() expression to support suffix fuzzy searches.

2.1 With the default index operator class, this method only works when the comparison uses COLLATE "C". If the database's lc_collate is not C, you must explicitly specify COLLATE "C" for both the index and the search condition.

Indexes can be used only when the COLLATE value is the same for indexes and search criteria.

Example.

test=# create index idx1 on test(reverse(info) collate "C");      
CREATE INDEX      
test=# select * from test limit 1;      
 id |               info                     
----+----------------------------------      
  1 | b3275976cdd437a033d4329775a52514      
(1 row)      
      
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where reverse(info) like '4152%' collate "C";      
                                                        QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------      
 Index Scan using idx1 on public.test  (cost=0.42..4009.43 rows=5000 width=37) (actual time=0.061..0.097 rows=18 loops=1)      
   Output: id, info      
   Index Cond: ((reverse(test.info) >= '4152'::text) AND (reverse(test.info) < '4153'::text))      
   Filter: (reverse(test.info) ~~ '4152%'::text COLLATE "C")      
   Buffers: shared hit=18 read=3      
 Planning time: 0.128 ms      
 Execution time: 0.122 ms      
(7 rows)      
      
test=# select * from test where reverse(info) like '4152%' collate "C";      
   id   |               info                     
--------+----------------------------------      
 847904 | abe2ecd90393b5275df8e34a39702514      
 414702 | 97f66d26545329321164042657d02514      
 191232 | 7820972c6220c2b01d46c11ebb532514      
 752742 | 93232ac39c6632e2540df44627c42514      
 217302 | 39e518893a1a7b1e691619bd1fc42514      
      1 | b3275976cdd437a033d4329775a52514      
 615718 | 4948f94c484c13dc6c4fae8a3db52514      
 308815 | fc2918ceff7c7a4dafd2e04031062514      
 149521 | 546d963842ea5ca593e622c810262514      
 811093 | 4b6eca2eb6b665af67b2813e91a62514      
 209000 | 1dfd0d4e326715c1739f031cca992514      
 937616 | 8827fd81f5b673fb5afecbe3e11b2514      
 419553 | bd6e01ce360af16137e8b6abc8ab2514      
 998324 | 7dff51c19dc5e5d9979163e7d14c2514      
 771518 | 8a54e30003a48539fff0aedc73ac2514      
 691566 | f90368348e3b6bf983fcbe10db2d2514      
 652274 | 8bf4a97b5f122a5540a21fa85ead2514      
 233437 | 739ed715fc203d47e37e79b5bcbe2514      
(18 rows)    
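In practice, an application that receives a suffix pattern such as %2514 rewrites it as a prefix search on the reversed string. A minimal sketch of the equivalent rewrite (the planner folds the immutable expression into the constant '4152%'):

test=# select * from test where reverse(info) like (reverse('2514') || '%') collate "C";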

2.2 When the database's default lc_collate is not C, another option is to create the B-tree index with the corresponding "pattern ops" operator class, so that values are compared character by character instead of by the locale's collation rules.

When you use "pattern ops", index search supports the syntax of LIKE and regular expressions.

Example.

test=# create index idx1 on test(reverse(info) text_pattern_ops);      
CREATE INDEX  
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where reverse(info) like '4152%';  
                                                        QUERY PLAN                                                          
--------------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx1 on public.test  (cost=0.42..4009.43 rows=5000 width=37) (actual time=0.026..0.049 rows=12 loops=1)  
   Output: id, info  
   Index Cond: ((reverse(test.info) ~>=~ '4152'::text) AND (reverse(test.info) ~<~ '4153'::text))  
   Filter: (reverse(test.info) ~~ '4152%'::text)  
   Buffers: shared hit=15  
 Planning time: 0.102 ms  
 Execution time: 0.072 ms  
(7 rows)  
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where reverse(info) ~ '^4152';  
                                                        QUERY PLAN                                                          
--------------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx1 on public.test  (cost=0.42..4009.43 rows=5000 width=37) (actual time=0.031..0.063 rows=12 loops=1)  
   Output: id, info  
   Index Cond: ((reverse(test.info) ~>=~ '4152'::text) AND (reverse(test.info) ~<~ '4153'::text))  
   Filter: (reverse(test.info) ~ '^4152'::text)  
   Buffers: shared hit=15  
 Planning time: 0.148 ms  
 Execution time: 0.087 ms  
(7 rows)  

3. Optimization for Both Prefix and Suffix Fuzzy Searches: Use the pg_trgm index to support prefix and suffix fuzzy searches.

Note: To achieve a good index filtering effect, enter at least 1 character for prefix fuzzy searches and at least 2 characters for suffix fuzzy searches.

To efficiently support multi-byte characters (such as Chinese characters), the lc_ctype of the database must not be "C". The index is effective only if tokens are split correctly, and only a correct lc_ctype setting allows the characters in multi-byte strings to be split one by one.

Indexes can be used only when the COLLATE value is the same for indexes and search criteria.

test=# \l+ test      
                                              List of databases      
 Name |  Owner   | Encoding |  Collate   |   Ctype    | Access privileges |  Size  | Tablespace | Description       
------+----------+----------+------------+------------+-------------------+--------+------------+-------------      
 test | postgres | UTF8     | zh_CN.utf8 | zh_CN.utf8 |                   | 245 MB | pg_default |       
(1 row)      
      
test=# create extension pg_trgm;      
      
test=# create table test001(c1 text);      
CREATE TABLE  

Use the following function to generate random Chinese strings.

test=# create or replace function gen_hanzi(int) returns text as $$                  
declare        
  res text;        
begin        
  if $1 >=1 then        
    select string_agg(chr(19968+(random()*20901)::int), '') into res from generate_series(1,$1);        
    return res;        
  end if;        
  return null;        
end;        
$$ language plpgsql strict;        
CREATE FUNCTION      

Now, generate random data.

test=# insert into test001 select gen_hanzi(20) from generate_series(1,100000);      
INSERT 0 100000      
      
test=# create index idx_test001_1 on test001 using gin (c1 gin_trgm_ops);      
CREATE INDEX      
      
test=# select * from test001 limit 5;      
                    c1                          
------------------------------------------      
 埳噪办甾讷昃碇玾陧箖燋邢賀浮媊踮菵暔谉橅      
 秌橑籛鴎拟倶敤麁鼋醠轇坙騉鏦纗蘛婃坹娴儅      
 蔎緾鎧爪鵬二悲膼朠麻鸂鋬楨窷違繇糭嘓索籓      
 馳泅薬鐗愅撞窍浉渗蛁灎厀攚摐瞪拡擜詜隝緼      
 襳铺煃匶瀌懲荼黹樆惺箧搔羾憯墆鋃硍蔓恧顤      
(5 rows)      

Next, implement the fuzzy search.

test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '你%';      
                                                      QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------      
 Bitmap Heap Scan on public.test001  (cost=5.08..15.20 rows=10 width=61) (actual time=0.030..0.034 rows=3 loops=1)      
   Output: c1      
   Recheck Cond: (test001.c1 ~~ '你%'::text)      
   Heap Blocks: exact=3      
   Buffers: shared hit=7      
   ->  Bitmap Index Scan on idx_test001_1  (cost=0.00..5.08 rows=10 width=0) (actual time=0.020..0.020 rows=3 loops=1)      
         Index Cond: (test001.c1 ~~ '你%'::text)      
         Buffers: shared hit=4      
 Planning time: 0.119 ms      
 Execution time: 0.063 ms      
(10 rows)      
      
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%恧顤';      
                                                      QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------      
 Bitmap Heap Scan on public.test001  (cost=5.08..15.20 rows=10 width=61) (actual time=0.031..0.034 rows=1 loops=1)      
   Output: c1      
   Recheck Cond: (test001.c1 ~~ '%恧顤'::text)      
   Rows Removed by Index Recheck: 1      
   Heap Blocks: exact=2      
   Buffers: shared hit=6      
   ->  Bitmap Index Scan on idx_test001_1  (cost=0.00..5.08 rows=10 width=0) (actual time=0.020..0.020 rows=2 loops=1)      
         Index Cond: (test001.c1 ~~ '%恧顤'::text)      
         Buffers: shared hit=4      
 Planning time: 0.136 ms      
 Execution time: 0.062 ms      
(11 rows)      

II. Optimization for Fully Fuzzy Searches

You can use the pg_trgm plug-in to support fully fuzzy searches.

Note: For pg_trgm to efficiently support multi-byte characters (such as Chinese characters), the lc_ctype of the database cannot be "C". The index is effective only if tokens are split correctly, and only a correct lc_ctype setting allows the characters in multi-byte strings to be split one by one.

It is recommended to enter 3 or more characters; otherwise, the index filtering effect is poor (the cause is analyzed later).

Example.

test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%燋邢賀%';      
                                                      QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------      
 Bitmap Heap Scan on public.test001  (cost=5.08..15.20 rows=10 width=61) (actual time=0.038..0.038 rows=1 loops=1)      
   Output: c1      
   Recheck Cond: (test001.c1 ~~ '%燋邢賀%'::text)      
   Heap Blocks: exact=1      
   Buffers: shared hit=5      
   ->  Bitmap Index Scan on idx_test001_1  (cost=0.00..5.08 rows=10 width=0) (actual time=0.025..0.025 rows=1 loops=1)      
         Index Cond: (test001.c1 ~~ '%燋邢賀%'::text)      
         Buffers: shared hit=4      
 Planning time: 0.170 ms      
 Execution time: 0.076 ms      
(10 rows)      
      
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%燋邢%';      
                                                              QUERY PLAN                                                                    
--------------------------------------------------------------------------------------------------------------------------------------      
 Bitmap Heap Scan on public.test001  (cost=7615669.08..7615679.20 rows=10 width=61) (actual time=147.524..178.232 rows=1 loops=1)      
   Output: c1      
   Recheck Cond: (test001.c1 ~~ '%燋邢%'::text)      
   Rows Removed by Index Recheck: 99999      
   Heap Blocks: exact=1137      
   Buffers: shared hit=14429      
   ->  Bitmap Index Scan on idx_test001_1  (cost=0.00..7615669.08 rows=10 width=0) (actual time=147.377..147.377 rows=100000 loops=1)      
         Index Cond: (test001.c1 ~~ '%燋邢%'::text)      
         Buffers: shared hit=13292      
 Planning time: 0.133 ms      
 Execution time: 178.265 ms      
(11 rows)      

III. Optimization for Regex Matches

The syntax for a PostgreSQL regex match is string ~ 'pattern' (case-sensitive) or string ~* 'pattern' (case-insensitive). For more information about the matching operators and functions, refer to the PostgreSQL documentation.

Example.

test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 ~ '12[0-9]{3,9}';      
                                                       QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------      
 Bitmap Heap Scan on public.test001  (cost=65.08..75.20 rows=10 width=61) (actual time=0.196..0.196 rows=0 loops=1)      
   Output: c1      
   Recheck Cond: (test001.c1 ~ '12[0-9]{3,9}'::text)      
   Rows Removed by Index Recheck: 1      
   Heap Blocks: exact=1      
   Buffers: shared hit=50      
   ->  Bitmap Index Scan on idx_test001_1  (cost=0.00..65.08 rows=10 width=0) (actual time=0.183..0.183 rows=1 loops=1)      
         Index Cond: (test001.c1 ~ '12[0-9]{3,9}'::text)      
         Buffers: shared hit=49      
 Planning time: 0.452 ms      
 Execution time: 0.221 ms      
(11 rows)      
      
test01=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 ~ '宸朾啣' collate "zh_CN";   
                                                      QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.test001  (cost=6.58..19.42 rows=10 width=61) (actual time=0.061..0.061 rows=1 loops=1)
   Output: c1
   Recheck Cond: (test001.c1 ~ '宸朾啣'::text COLLATE "zh_CN")
   Heap Blocks: exact=1
   Buffers: shared hit=5
   ->  Bitmap Index Scan on idx_test001_1  (cost=0.00..6.58 rows=10 width=0) (actual time=0.049..0.049 rows=1 loops=1)
         Index Cond: (test001.c1 ~ '宸朾啣'::text COLLATE "zh_CN")
         Buffers: shared hit=4
 Planning time: 0.238 ms
 Execution time: 0.082 ms
(10 rows)

Refer to contrib/pg_trgm/trgm_regexp.c for the principle of the regex match index.

Principle of the pg_trgm Fuzzy Search

First, pg_trgm pads the string with two spaces at the beginning and one space at the end. The padded string is then split so that every 3 consecutive characters form a token, and a GIN inverted index is built over these tokens. To view the tokens of a string, use the following method.

test=# select show_trgm('123');      
        show_trgm              
-------------------------      
 {"  1"," 12",123,"23 "}      
(1 row)      

The following explains why pg_trgm prefix, suffix, and fully fuzzy searches each require a minimum number of input characters.

  • For prefix fuzzy searches (such as a%), at least 1 character must be provided: the two padding spaces at the start of the string let a single leading character form a complete token.
  • For suffix fuzzy searches (such as %ab), at least 2 characters must be provided: only one padding space is added at the end of the string, so two trailing characters are needed to form a complete token.
  • For fully fuzzy searches (such as %abcd%), at least 3 characters must be provided: no padding applies in the middle of a string, so the pattern itself must contain a complete 3-character token (here, abc and bcd).

These minimums exist because every token generated by pg_trgm is exactly three characters long; an index lookup is only possible when the pattern yields at least one complete token.
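To see why, compare the tokens generated for two characters with those generated for three; a minimal sketch (the results shown in the comments are what show_trgm is expected to return, not output copied from the test database):

test=# select show_trgm('ab');   -- {"  a"," ab","ab "} : padded tokens only
test=# select show_trgm('abc');  -- {"  a"," ab",abc,"bc "} : contains the interior token abc

A fully fuzzy pattern such as %ab% cannot form a complete interior token, so the index scan cannot narrow the candidate set, which is exactly what the slow '%燋邢%' plan in section II shows.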


IV. Optimization of Fuzzy Searches With Fewer Than 3 Input Characters

The pg_trgm approach is not effective when a fully fuzzy search provides only 1 or 2 characters. In this case, you can use an expression index with GIN.

First, use an expression to split each string into an array containing every single character and every pair of adjacent characters, and then create a GIN index on that array.

Example.

test=# create or replace function split001(text) returns text[] as $$      
declare      
  res text[];      
begin      
  select regexp_split_to_array($1,'') into res;      
  for i in 1..length($1)-1 loop      
    res := array_append(res, substring($1,i,2));      
  end loop;      
  return res;      
end;      
$$ language plpgsql strict immutable;      
CREATE FUNCTION      
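-- A quick check of what split001() produces (a sketch; the expected result shown
-- in the comment is derived from the function definition, not taken from the test
-- database): every single character plus every pair of adjacent characters.
test=# select split001('你好abc');
-- expected: {你,好,a,b,c,你好,好a,ab,bc}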
      
test=# create index idx_test001_2 on test001 using gin (split001(c1));      
      
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where split001(c1) @> array['你好'];      
                                                       QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------      
 Bitmap Heap Scan on public.test001  (cost=8.87..550.12 rows=500 width=61) (actual time=0.041..0.041 rows=0 loops=1)      
   Output: c1      
   Recheck Cond: (split001(test001.c1) @> '{你好}'::text[])      
   Buffers: shared hit=4      
   ->  Bitmap Index Scan on idx_test001_2  (cost=0.00..8.75 rows=500 width=0) (actual time=0.039..0.039 rows=0 loops=1)      
         Index Cond: (split001(test001.c1) @> '{你好}'::text[])      
         Buffers: shared hit=4      
 Planning time: 0.104 ms      
 Execution time: 0.068 ms      
(9 rows)      
      
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where split001(c1) @> array['你'];      
                                                       QUERY PLAN                                                              
-------------------------------------------------------------------------------------------------------------------------      
 Bitmap Heap Scan on public.test001  (cost=8.87..550.12 rows=500 width=61) (actual time=0.063..0.183 rows=86 loops=1)      
   Output: c1      
   Recheck Cond: (split001(test001.c1) @> '{你}'::text[])      
   Heap Blocks: exact=80      
   Buffers: shared hit=84      
   ->  Bitmap Index Scan on idx_test001_2  (cost=0.00..8.75 rows=500 width=0) (actual time=0.048..0.048 rows=86 loops=1)      
         Index Cond: (split001(test001.c1) @> '{你}'::text[])      
         Buffers: shared hit=4      
 Planning time: 0.101 ms      
 Execution time: 0.217 ms      
(10 rows)      
      
test=# select * from test001 where split001(c1) @> array['你'];      
                    c1                          
------------------------------------------      
 殐踨洪冨垓丩贤閚偉垢胸鍘崩你萭隡劭芛雫袰      
 靅慨热脸罆淓寘鰻总襎戍謸枨陪丼倫柆套你仮      
......      

V. Optimization for Similarity Searches

Fuzzy searches and regex matches return records that exactly match the search criteria. A third search type is the similarity search. For example, for the string "PostgreSQL", an input of "p0stgresgl" can still be matched by similarity. The pg_trgm plug-in is used here as well. To support Chinese characters, the following requirements also apply.

For pg_trgm to support similarity searches on Chinese characters, the lc_ctype of the database cannot be "C". The result is good only if tokens are split correctly, and only a correct lc_ctype setting allows the characters in multi-byte strings to be split one by one. (Character classification determines, for example, what counts as a letter and what its upper-case equivalent is.)

It is recommended to enter 3 or more characters; otherwise, the effect will not be good (see the token analysis in the principle section above).

Example.

test=# create index idx_test001_3 on test001 using gist (c1 gist_trgm_ops);      
CREATE INDEX      
        
test=# explain (analyze,verbose,timing,costs,buffers) SELECT t, c1 <-> '癷磛鹚蠌鳃蠲123鶡埀婎鳊苿奶垨惸溴蔻筴熝憡' AS dist        
  FROM test001 t        
  ORDER BY dist LIMIT 5;        
                                                                   QUERY PLAN                                                                          
-------------------------------------------------------------------------------------------------------------------------------------------------      
 Limit  (cost=0.28..0.52 rows=5 width=89) (actual time=37.462..37.639 rows=5 loops=1)      
   Output: t.*, ((c1 <-> '癷磛鹚蠌鳃蠲123鶡埀婎鳊苿奶垨惸溴蔻筴熝憡'::text))      
   Buffers: shared hit=1631      
   ->  Index Scan using idx_test001_3 on public.test001 t  (cost=0.28..4763.28 rows=100000 width=89) (actual time=37.461..37.636 rows=5 loops=1)      
         Output: t.*, (c1 <-> '癷磛鹚蠌鳃蠲123鶡埀婎鳊苿奶垨惸溴蔻筴熝憡'::text)      
         Order By: (t.c1 <-> '癷磛鹚蠌鳃蠲123鶡埀婎鳊苿奶垨惸溴蔻筴熝憡'::text)      
         Buffers: shared hit=1631      
 Planning time: 0.089 ms      
 Execution time: 37.668 ms      
(9 rows)      
      
test=# SELECT t, c1 <-> '癷磛鹚蠌鳃蠲123鶡埀婎鳊苿奶垨惸溴蔻筴熝憡' AS dist        
  FROM test001 t        
  ORDER BY dist LIMIT 5;        
                     t                      |   dist         
--------------------------------------------+----------      
 (癷磛鹚蠌鳃蠲你鶡埀婎鳊苿奶垨惸溴蔻筴熝憡) | 0.307692      
 (坆桻悁斾耾瑚豌腏炁悿隖轲盃挜稐睟礓蜮铅湆) | 0.976744      
 (癷鉜餯祂鼃恫蝅瓟顡廕梍蛸歡僷贊敔欓侑韌鐹) | 0.976744      
 (癷嚯鳬戚蹪熼胘檙佌欔韜挹樷覄惶蹝顼鑜鞖媗) | 0.976744      
 (癷饎瞲餿堒歃峽盾豼擔禞嵪豦咢脉馄竨济隘缄) | 0.976744      
(5 rows)      
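Besides ordering by the <-> distance, pg_trgm also provides the similarity() function and the % similarity operator (governed by the pg_trgm.similarity_threshold setting on recent PostgreSQL versions), and the GiST index above can accelerate the % operator as well; the distance equals 1 minus the similarity. A minimal sketch, not run against the test data:

test=# set pg_trgm.similarity_threshold = 0.3;
test=# SELECT c1, similarity(c1, '癷磛鹚蠌鳃蠲123鶡埀婎鳊苿奶垨惸溴蔻筴熝憡') AS sml
  FROM test001
  WHERE c1 % '癷磛鹚蠌鳃蠲123鶡埀婎鳊苿奶垨惸溴蔻筴熝憡'
  ORDER BY sml DESC LIMIT 5;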

VI. Summary

1. If you only need prefix fuzzy searches (the string LIKE "xx%"), use the B-tree index with collate="C". If collate is not "C", you can use the corresponding "pattern ops" (for example, text_pattern_ops) to create a B-tree index.

2. If you only need suffix fuzzy searches (the string LIKE "%abc", which is equivalent to the reversed string LIKE "cba%"), use a B-tree index on the reverse() expression with collate="C". If collate is not "C", you can use the corresponding "pattern ops" (for example, text_pattern_ops) to create the B-tree index.

3. If you need fully fuzzy searches that contain Chinese characters, use a database with lc_ctype <> "C" together with the GIN index of the pg_trgm plug-in. (The index is effective only if tokens are split correctly; only a correct lc_ctype setting allows the characters in multi-byte strings to be split one by one. Character classification determines, for example, what counts as a letter and what its upper-case equivalent is.)

4. If you need fully fuzzy searches that do not contain any Chinese characters, use the GIN index of the pg_trgm plug-in.

5. If you need regex searches, use the GIN index of the pg_trgm plug-in.

6. If the number of input characters is less than 3, use GIN expression indexes along with arrays to accelerate the fuzzy search.
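The corresponding index definitions are summarized below (a minimal sketch; tbl and col are placeholders, and split001 refers to the expression function defined in section IV):

create index on tbl (col collate "C");                 -- 1. prefix fuzzy search, collate "C"
create index on tbl (col text_pattern_ops);            -- 1. prefix fuzzy search, non-C collate
create index on tbl (reverse(col) collate "C");        -- 2. suffix fuzzy search
create index on tbl (reverse(col) text_pattern_ops);   -- 2. suffix fuzzy search, non-C collate
create index on tbl using gin (col gin_trgm_ops);      -- 3/4/5. fully fuzzy and regex searches
create index on tbl using gin (split001(col));         -- 6. searches with fewer than 3 characters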

VII. Performance

100 million records are provided, and each record contains 15 random Chinese characters. Follow the steps listed below to test the performance of the prefix fuzzy search, suffix fuzzy search, and fully fuzzy search.

Step 1. Generate test data.

vi test.sql      
insert into test001 select gen_hanzi(15) from generate_series(1,2500000);      
      
pgbench -n -r -P 1 -f ./test.sql -c 40 -j 40 -t 1 test      
    
    
    
test=# select count(*) from test001;      
   count         
-----------      
 100000000      
(1 row)      
test=# select * from test001 limit 10;    
               c1                   
--------------------------------    
 釾笉皜鰈确艄騚馺腃彊釲忰采汦擇    
 槮搮圮墔婂蹾飘孡鶒镇赀聵線麯櫕    
 孨鄈韞萅赫炧暤蟠檼駧餪崉娲譌筯    
 烸喖醝稦怩鷟棾奜妛曫仾飛饡绘韦    
 撑豁襉峊炠眏罱襄彊鰮莆壏妒辷阛    
 蜁愊鶱磹贰帵眲嚉榑苍潵檐簄椰魨    
 瑄翁蠃巨躋壾蛸湗鑂顂櫟砣八癱栵    
 馇巍笿鞒装棊嘢恓煓熴锠鋈蹃煿屓    
 訆韄踔牤嘇糺絢軿鵑燿螛梋鰢謇郼    
 撲蓨伤釱糕觩嬖蓷鰼繩圆醷熌靉掑    
(10 rows)    

Step 2. Create indexes as shown below.

test=# set maintenance_work_mem ='32GB';      
test=# create index idx_test001_1 on test001 using gin (c1 gin_trgm_ops);      

The resulting table and index sizes are as follows.

test=# \di+  
                             List of relations  
 Schema |     Name      | Type  |  Owner   |  Table  | Size  | Description   
--------+---------------+-------+----------+---------+-------+-------------  
 public | idx_test001_1 | index | postgres | test001 | 12 GB |   
(1 row)  
  
test=# \dt+  
                      List of relations  
 Schema |  Name   | Type  |  Owner   |  Size   | Description   
--------+---------+-------+----------+---------+-------------  
 public | test001 | table | postgres | 7303 MB |   
(1 row)  

Step 3. Test the fuzzy search performance.

Step 3.1 Performance of prefix fuzzy search is as follows:

  • Response time: 9 ms
  • 4701 rows are returned.
select * from test001 where c1 like '你%';    
    
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '你%';    
                                                          QUERY PLAN                                                              
------------------------------------------------------------------------------------------------------------------------------    
 Bitmap Heap Scan on public.test001  (cost=89.50..10161.50 rows=10000 width=46) (actual time=1.546..8.868 rows=4701 loops=1)    
   Output: c1    
   Recheck Cond: (test001.c1 ~~ '你%'::text)    
   Rows Removed by Index Recheck: 85    
   Heap Blocks: exact=4776    
   Buffers: shared hit=4784    
   ->  Bitmap Index Scan on idx_test001_1  (cost=0.00..87.00 rows=10000 width=0) (actual time=0.879..0.879 rows=4786 loops=1)    
         Index Cond: (test001.c1 ~~ '你%'::text)    
         Buffers: shared hit=8    
 Planning time: 0.099 ms    
 Execution time: 9.166 ms    
(11 rows)    

Step 3.2 Performance of suffix fuzzy search is as follows:

  • Response time: 0.25 ms
  • 2 rows are returned.
select * from test001 where c1 like '%靉掑';    
    
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%靉掑';    
                                                         QUERY PLAN                                                             
----------------------------------------------------------------------------------------------------------------------------    
 Bitmap Heap Scan on public.test001  (cost=89.50..10161.50 rows=10000 width=46) (actual time=0.049..0.223 rows=2 loops=1)    
   Output: c1    
   Recheck Cond: (test001.c1 ~~ '%靉掑'::text)    
   Rows Removed by Index Recheck: 87    
   Heap Blocks: exact=89    
   Buffers: shared hit=94    
   ->  Bitmap Index Scan on idx_test001_1  (cost=0.00..87.00 rows=10000 width=0) (actual time=0.031..0.031 rows=89 loops=1)    
         Index Cond: (test001.c1 ~~ '%靉掑'::text)    
         Buffers: shared hit=5    
 Planning time: 0.113 ms    
 Execution time: 0.249 ms    
(11 rows)    

Step 3.3 Performance of fully fuzzy search is as follows:

  • Response time: 0.2 ms
  • 1 row is returned.
select * from test001 where c1 like '%螛梋鰢%';    
    
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%螛梋鰢%';    
                                                         QUERY PLAN                                                             
----------------------------------------------------------------------------------------------------------------------------    
 Bitmap Heap Scan on public.test001  (cost=89.50..10161.50 rows=10000 width=46) (actual time=0.044..0.175 rows=1 loops=1)    
   Output: c1    
   Recheck Cond: (test001.c1 ~~ '%螛梋鰢%'::text)    
   Rows Removed by Index Recheck: 81    
   Heap Blocks: exact=82    
   Buffers: shared hit=87    
   ->  Bitmap Index Scan on idx_test001_1  (cost=0.00..87.00 rows=10000 width=0) (actual time=0.027..0.027 rows=82 loops=1)    
         Index Cond: (test001.c1 ~~ '%螛梋鰢%'::text)    
         Buffers: shared hit=5    
 Planning time: 0.112 ms    
 Execution time: 0.201 ms    
(11 rows)    

Summary

For fuzzy searches based on three or more characters, pg_trgm is a good solution.

For fuzzy searches based on 1 or 2 characters, you can achieve good performance by using expression indexes.

postgres=# create or replace function split_12(text) returns text[] as $$
declare
 res text[];
begin
  select regexp_split_to_array($1, '') into res;
  for i in 1..length($1)-1 loop
    res := array_append(res, substring($1, i, 2));
  end loop;
  return res;
end;
$$ language plpgsql strict immutable;
CREATE FUNCTION
postgres=# select split_12('abc你好');
           split_12           
------------------------------
 {a,b,c,你,好,ab,bc,c你,你好}
(1 row)

create index idx2 on tbl using gin (split_12(col));
  
select * from tbl where split_12(col) @> array['单字或双字'];

It is recommended that the application check the number of input characters and choose the appropriate SQL syntax for the search accordingly, as sketched below.
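A minimal sketch of such dispatch logic (a hypothetical helper; the table tbl(col text), the split_12 expression index, and the pg_trgm GIN index from the previous sections are assumed):

create or replace function fuzzy_query(pat text) returns setof text as $$
begin
  if char_length(pat) < 3 then
    -- 1 or 2 characters: use the GIN index on split_12(col)
    return query select col from tbl where split_12(col) @> array[pat];
  else
    -- 3 or more characters: the pg_trgm GIN index handles %pat% efficiently
    return query select col from tbl where col like ('%' || pat || '%');
  end if;
end;
$$ language plpgsql strict;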

lc_collate, lc_ctype

Check whether PostgreSQL is actually using the required locale. The LC_COLLATE and LC_CTYPE settings are determined when a database is created, and cannot be changed except by creating a new database. Other locale settings including LC_MESSAGES and LC_MONETARY are initially determined by the environment the server is started in, but can be changed on-the-fly. You can check the active locale settings using the SHOW command.
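For example, you can run the following in psql:

test=# show lc_collate;
test=# show lc_ctype;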

lc_ctype and lc_collate can only be specified when the database is created and cannot be modified afterwards. Refer to the PostgreSQL documentation to learn more about the roles of lc_ctype and lc_collate.

The locale settings influence the following SQL features:

  • Sort order in queries using ORDER BY or the standard comparison operators on textual data
  • The upper, lower, and initcap functions
  • Pattern matching operators (LIKE, SIMILAR TO, and POSIX-style regular expressions); locales affect both case-insensitive matching and the classification of characters by character-class regular expressions
  • The to_char family of functions
  • The ability to use indexes with LIKE clauses

Build your own PostgreSQL solution on Alibaba Cloud with ApsaraDB for RDS PostgreSQL.
