How to Use PostgreSQL to Efficiently Search With Split Fields?

By Digoal.

In some application scenarios, multi-field matching may be required. For example, consider a scenario in which a table includes several fields like artist, track, album, composer, and Lyrics.

In this kind of scenario, users may want to use the word-breaking function to find matches for singers and performers like the Hong Kong singer Andy Lau or Canadian personality Justin Bieber, for example, and have it return TRUE if any field matches. The traditional practice is to establish a word-breaking index on each field and perform the matching one by one. However, this leads to lengthy SQL statements and requires lots of OR operations. Can a better method be used in this scenario?

Well, actually, the answer's yes. We can convert this whole record into one string and create a word-breaking index on this string. At this point, the question is what is the format of the output, and if this will impact the result of the word-breaking.

Output Record Format in PostgreSQL

Now, create a table and inset the value.

create table t1(id int, c1 text, c2 text, c3 text);    
insert into t1 values (1 , 'speed-fast e5a1cbb8' , 'speed-fast e5a1cbb8' , 'abc');    
  
postgres=# select t1::text from t1;  
                t1                   
-----------------------------------  
 (1,speed-fast e5a1cbb8,speed-fast e5a1cbb8,abc)  
(1 row)  
  
postgres=# \df+ record_out  
                                                                  List of functions  
   Schema   |    Name    | Result data type | Argument data types |  Type  | Security | Volatility |  Owner   | Language | Source code | Description   
------------+------------+------------------+---------------------+--------+----------+------------+----------+----------+-------------+-------------  
 pg_catalog | record_out | cstring          | record              | normal | invoker  | stable     | postgres | internal | record_out  | I/O  
(1 row)

Now consider the source code of the record output src/backend/utils/adt/rowtypes.c.

/*  
 * record_out           - output routine for any composite type.  
 */  
Datum  
record_out(PG_FUNCTION_ARGS)  
{  
...  
        /* And build the result string */  
        initStringInfo(&buf);  
  
        appendStringInfoChar(&buf, '(');  // Use parentheses (()) to enclose the field.  
  
        for (i = 0; i < ncolumns; i++)  
        {  
...  
                if (needComma)  
                        appendStringInfoChar(&buf, ',');  // Use a comma (,) between fields.  
                needComma = true;  
...  
                /* Detect whether we need double quotes for this value */  
                nq = (value[0] == '\0');        /* force quotes for empty string */  
                for (tmp = value; *tmp; tmp++)  
                {  
                        char            ch = *tmp;  
  
                        if (ch == '"' || ch == '\\' ||  
                                ch == '(' || ch == ')' || ch == ',' ||  
                                isspace((unsigned char) ch))  
                        {  
                                nq = true;  
                                break;  
                        }  
                }  
  
                /* And emit the string */  
                if (nq)  
                        appendStringInfoCharMacro(&buf, '"');  // For some types, use the double quotation markers ("")
                for (tmp = value; *tmp; tmp++)  
                {  
                        char            ch = *tmp;  
  
                        if (ch == '"' || ch == '\\')  
                                appendStringInfoCharMacro(&buf, ch);  
                        appendStringInfoCharMacro(&buf, ch);  
                }  
                if (nq)  
                        appendStringInfoCharMacro(&buf, '"');  
        }  
  
        appendStringInfoChar(&buf, ')');  
...

Problems with SCWS Word-Breaking

So it may seem that no problems should occur. Only a comma and double quotes are added. These are all characters and scws should be able to process them.

In reality, however, some problems do exist in this scenario. Consider the following example. The two words only differ in the ending parts. If an additional comma is added, the result is like this.

postgres=# select * from ts_debug('scwscfg', 'speed-fast e5a1cbb8,');  
 alias | description | token | dictionaries | dictionary | lexemes   
-------+-------------+-------+--------------+------------+---------  
 k     | head        | speed    | {}           |            |   
 a     | adjective   | fast    | {simple}     | simple     | {fast}  
 e     | exclamation | e5a   | {simple}     | simple     | {e5a}  
 e     | exclamation | 1cbb  | {simple}     | simple     | {1cbb}  
 e     | exclamation | 8     | {simple}     | simple     | {8}  
 u     | auxiliary   | ,     | {}           |            |   
(6 rows)  
  
postgres=# select * from ts_debug('scwscfg', 'speed-fast e5a1cbb8');  
 alias | description |  token   | dictionaries | dictionary |  lexemes     
-------+-------------+----------+--------------+------------+------------  
 k     | head        | speed       | {}           |            |   
 a     | adjective   | fast       | {simple}     | simple     | {fast}  
 e     | exclamation | e5a1cbb8 | {simple}     | simple     | {e5a1cbb8}  
(3 rows)

Problem analysis

The following diagram shows the word-breaking steps in PostgreSQL.

First, use parse to split a string into multiple tokens and specify the type of each token.

Therefore, when creating the text search configuration, you need to specify a parser, which is also the core part of the word-breaking process.

Command:     CREATE TEXT SEARCH CONFIGURATION  
Description: define a new text search configuration  
Syntax:  
CREATE TEXT SEARCH CONFIGURATION name (  
    PARSER = parser_name |  
    COPY = source_config  
)

At the same time, token types supported by the parser must also be specified when you create the parser.

Command:     CREATE TEXT SEARCH PARSER  
Description: define a new text search parser  
Syntax:  
CREATE TEXT SEARCH PARSER name (  
    START = start_function ,  
    GETTOKEN = gettoken_function ,  
    END = end_function ,  
    LEXTYPES = lextypes_function  
    [, HEADLINE = headline_function ]  
)

View parsers that have been created.

postgres=# select * from pg_ts_parser ;  
 prsname | prsnamespace |   prsstart   |     prstoken     |   prsend   |  prsheadline  |   prslextype     
---------+--------------+--------------+------------------+------------+---------------+----------------  
 default |           11 | prsd_start   | prsd_nexttoken   | prsd_end   | prsd_headline | prsd_lextype  
 scws    |         2200 | pgscws_start | pgscws_getlexeme | pgscws_end | prsd_headline | pgscws_lextype  
 jieba   |         2200 | jieba_start  | jieba_gettoken   | jieba_end  | prsd_headline | jieba_lextype  
(3 rows)

View the token types that a parser supports.

postgres=# select * from ts_token_type('scws');  
 tokid | alias |  description    
-------+-------+---------------  
    97 | a     | adjective  
    98 | b     | difference  
    99 | c     | conjunction  
   100 | d     | adverb  
   101 | e     | exclamation  
   102 | f     | position  
   103 | g     | word root  
   104 | h     | head  
   105 | i     | idiom  
   106 | j     | abbreviation  
   107 | k     | head  
   108 | l     | temp  
   109 | m     | numeral  
   110 | n     | noun  
   111 | o     | onomatopoeia  
   112 | p     | prepositional  
   113 | q     | quantity  
   114 | r     | pronoun  
   115 | s     | space  
   116 | t     | time  
   117 | u     | auxiliary  
   118 | v     | verb  
   119 | w     | punctuation  
   120 | x     | unknown  
   121 | y     | modal  
   122 | z     | status  
(26 rows)

Each token type corresponds to one or more dictionaries for matchmaking.

ALTER TEXT SEARCH CONFIGURATION name  
    ADD MAPPING FOR token_type [, ... ] WITH dictionary_name [, ... ]

View the map information of configured token types and dictionaries.

postgres=# select * from pg_ts_config_map ;

Next, for the first matching dictionary, convert the token output into a lexeme.

(Will remove stop words), delete plural forms, so on.

The following functions can be used to debug word-breaking problems.

ts_token_type(parser_name text, OUT tokid integer, OUT alias text, OUT description text): Returns token types supported by the parser.
ts_parse(parser_name text, txt text, OUT tokid integer, OUT token text): Specifies a parser and returns the string as a token.
ts_debug(config regconfig, document text, OUT alias text, OUT description text, OUT token text, OUT dictionaries regdictionary[], OUT dictionary regdictionary, OUT lexemes text[]): Specifies the word-breaking configuration and returns the string as a token and additional information.

In the preceding case, when the scws parser is used, the token output changes.

postgres=# select * from pg_ts_parser ;  
 prsname | prsnamespace |   prsstart   |     prstoken     |   prsend   |  prsheadline  |   prslextype     
---------+--------------+--------------+------------------+------------+---------------+----------------  
 default |           11 | prsd_start   | prsd_nexttoken   | prsd_end   | prsd_headline | prsd_lextype  
 scws    |         2200 | pgscws_start | pgscws_getlexeme | pgscws_end | prsd_headline | pgscws_lextype  
 jieba   |         2200 | jieba_start  | jieba_gettoken   | jieba_end  | prsd_headline | jieba_lextype  
(3 rows)  
  
postgres=# select * from ts_parse('scws', '子远e5a1cbb8,');  
 tokid | token   
-------+-------  
   107 | speed  
    97 | fast  
   101 | e5a  
   101 | 1cbb  
   101 | 8  
   117 | ,  
(6 rows)

Solution

Without modifying the scws code, we can first replace the comma with a space, because scws will ignore spaces.

postgres=# select replace(t1::text, ',', ' ') from t1;   
              replace                
-----------------------------------  
 (1 speed-fast e5a1cbb8 speed-fast e5a1cbb8 abc)  
(1 row)  
  
postgres=# select to_tsvector('scwscfg', replace(t1::text, ',', ' ')) from t1;   
              to_tsvector                
---------------------------------------  
 '1':1 'abc':6 'e5a1cbb8':3,5 'fast':2,4  
(1 row)

Full-text Search Index

postgres=# create or replace function rec_to_text(anyelement) returns text as 
$$
  
select $1::text;  

$$
 language sql strict immutable;  
CREATE FUNCTION  
  
postgres=# create index idx on t1 using gin (to_tsvector('scwscfg', replace(rec_to_text(t1), ',', ' ')));  
CREATE INDEX  
  
SQL Writing  
postgres=# explain verbose select * from t1 where to_tsvector('scwscfg', replace(rec_to_text(t1), ',', ' ')) @@ to_tsquery('scwscfg', 'speed-fast e5a1cbb8');  
                                                                  QUERY PLAN                                                                     
-----------------------------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on public.t1  (cost=4.50..6.52 rows=1 width=100)  
   Output: c1, c2, c3, c4  
   Recheck Cond: (to_tsvector('scwscfg'::regconfig, replace(rec_to_text(t1.*), ','::text, ' '::text)) @@ '''fast'' & ''e5a1cbb8'''::tsquery)  
   ->  Bitmap Index Scan on idx  (cost=0.00..4.50 rows=1 width=0)  
         Index Cond: (to_tsvector('scwscfg'::regconfig, replace(rec_to_text(t1.*), ','::text, ' '::text)) @@ '''fast'' & ''e5a1cbb8'''::tsquery)  
(5 rows)

References

For work-breaking, each CPU core can process about 44,400 words per second.

postgres=# create extension pg_scws;  
CREATE EXTENSION  
Time: 6.544 ms  
postgres=# alter function to_tsvector(regconfig,text) volatile;  
ALTER FUNCTION  
postgres=# select to_tsvector('scwscfg','How can I speed up the loading speeds of PostgreSQL?');  
                                       to_tsvector                                         
-----------------------------------------------------------------------------------------  
 'postgresql':4 'How':2 'How can I':1 'word-breaking':6 'speed up':3 'loading':7 'PostgreSQL':5 'speeds':8  
(1 row)  
Time: 0.855 ms  
postgres=# set zhparser.dict_in_memory = t;  
SET  
Time: 0.339 ms  
postgres=# explain (buffers,timing,costs,verbose,analyze) select to_tsvector('scwscfg','How can I speed up the loading speeds of PostgreSQL?') from generate_series(1,100000);  
                                                               QUERY PLAN                                                                 
------------------------------------------------------------------------------------------------------------  
 Function Scan on pg_catalog.generate_series  (cost=0.00..260.00 rows=1000 width=0) (actual time=11.431..17971.197 rows=100000 loops=1)  
   Output: to_tsvector('scwscfg'::regconfig, 'How can I speed up the loading speeds of PostgreSQL?'::text)  
   Function Call: generate_series(1, 100000)  
   Buffers: temp read=172 written=171  
 Planning time: 0.042 ms  
 Execution time: 18000.344 ms  
(6 rows)  
Time: 18000.917 ms  
postgres=# select 8*100000/18.000344;  
      ?column?        
--------------------  
 44443.595077960732  
(1 row)

For the CPU:

Architecture:          x86_64  
CPU op-mode(s):        32-bit, 64-bit  
Byte Order:            Little Endian  
CPU(s):                32  
On-line CPU(s) list:   0-31  
Thread(s) per core:    1  
Core(s) per socket:    32  
Socket(s):             1  
NUMA node(s):          1  
Vendor ID:             GenuineIntel  
CPU family:            6  
Model:                 63  
Model name:            Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz  
Stepping:              2  
CPU MHz:               2494.224  
BogoMIPS:              4988.44  
Hypervisor vendor:     KVM  
Virtualization type:   full  
L1d cache:             32K  
L1i cache:             32K  
L2 cache:              256K  
L3 cache:              30720K  
NUMA node0 CPU(s):     0-31

Community

How to Use PostgreSQL to Efficiently Search With Split Fields?

Output Record Format in PostgreSQL

Problems with SCWS Word-Breaking

Problem analysis

Solution

Full-text Search Index

References

Read previous post:

Read next post:

digoal

You may also like

Comments

digoal

Related Products

PolarDB for PostgreSQL

AnalyticDB for PostgreSQL

ApsaraDB RDS for PostgreSQL

Database for FinTech Solution