Double Eleven Technology Series: Word Segmentation and Index Searching

This article introduces how Alibaba uses PostgreSQL index searching and word segmentation technologies for its annual Double Eleven online shopping festival.

The Alibaba Cloud 2021 Double 11 Cloud Services Sale is live now! For a limited time only you can turbocharge your cloud journey with core Alibaba Cloud products available from just $1, while you can win up to $1,111 in cash plus $1,111 in Alibaba Cloud credits in the Number Guessing Contest.

PostgreSQL has a good reputation in the search field. This is particularly evident in the GIS field where it has taken the leading position in the industry for many years. Fortunately for us, PostgreSQL turns out to be a great solution for e-commerce applications as well. In this article, we will introduce PostgreSQL index searching and word segmentation technologies used for Alibaba's Double Eleven (Singles' Day) annual online shopping festival.

Full Text Search Types

Full text search involves two data types: phrases and search word combinations.

https://www.postgresql.org/docs/9.6/static/datatype-textsearch.html

1. Phrase Type

Character strings are converted into phrases based on the selected phrasing rule. The phrases can be considered as a group of abstracted lexemes.

Example:

postgres=> select * from pg_ts_config;
  cfgname   | cfgnamespace | cfgowner | cfgparser 
------------+--------------+----------+-----------
 simple     |           11 |       10 |      3722
 danish     |           11 |       10 |      3722
 dutch      |           11 |       10 |      3722
 english    |           11 |       10 |      3722
 finnish    |           11 |       10 |      3722
 french     |           11 |       10 |      3722
 german     |           11 |       10 |      3722
 hungarian  |           11 |       10 |      3722
 italian    |           11 |       10 |      3722
 norwegian  |           11 |       10 |      3722
 portuguese |           11 |       10 |      3722
 romanian   |           11 |       10 |      3722
 russian    |           11 |       10 |      3722
 spanish    |           11 |       10 |      3722
 swedish    |           11 |       10 |      3722
 turkish    |           11 |       10 |      3722
(16 rows)

Phrase conversion: followed by location information  
postgres=> select to_tsvector('english', 'Hi i''m digoal, a pger, are you pger?');
            to_tsvector             
------------------------------------
 'digoal':4 'hi':1 'm':3 'pger':6,9
(1 row)

Stop words are filtered out, which are meaningless,  for example, i, an, a, am, yes, and no (user defined).

2. Search Word Type

The search words refer to the words you want to search, for example, postgresql, or postgresql or mysql.

You also need to select a configuration rule to judge and filter lexemes. The phrases you want to search can be combined randomly.

postgres=> select to_tsquery('a');
NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
 to_tsquery 
------------
 
(1 row)

postgres=> select to_tsquery('english', 'mysql|postgresql');
       to_tsquery       
------------------------
 'mysql' | 'postgresql'
(1 row)

postgres=> select to_tsquery('english', 'mysql&postgresql');
       to_tsquery       
------------------------
 'mysql' & 'postgresql'
(1 row)

postgres=> select to_tsquery('english', 'mysql&postgresql|abc');
           to_tsquery           
--------------------------------
 'mysql' & 'postgresql' | 'abc'
(1 row)

postgres=> select to_tsquery('english', '(mysql&postgresql)|abc');
           to_tsquery           
--------------------------------
 'mysql' & 'postgresql' | 'abc'
(1 row)

postgres=> select to_tsquery('english', 'mysql&(postgresql|abc)');
             to_tsquery             
------------------------------------
 'mysql' & ( 'postgresql' | 'abc' )
(1 row)

Moreover, tsquery supports prefix query, for example:

postgres=> select to_tsquery('english', 'postgres:*' );
 to_tsquery 
------------
 'postgr':*
(1 row)

Lexeme starting with postgr

Tsvector also involves the section and weight concepts, for example, body, title, and sub-title. There are four levels, A, B, C, and D, indicating the layer and location where the lexeme resides.

Lexemes that have positions can further be labeled with a weight, which can be A, B, C, or D. D is the default and hence is not shown on output:

SELECT 'a:1A fat:2B,4C cat:5D'::tsvector;
          tsvector          
----------------------------
 'a':1A 'cat':5 'fat':2B,4C

Full Text Search Type Operator

We have talked about data types. Next, let's learn about the data type operators. Like the operators in mathematics, word segmentation type has the match, include, add, and adjacent operators.

Example:

Whether the phrase includes the words to be searched.
to_tsvector('fat cats ate rats') @@ to_tsquery('cat & rat')
 return 
true


Combine the two phrases.
'a:1 b:2'::tsvector || 'c:1 d:2 b:3'::tsvector
  return
'a':1 'b':2,5 'c':3 'd':4


Execute the AND operation on the two search words
'fat | rat'::tsquery && 'cat'::tsquery
  return
( 'fat' | 'rat' ) & 'cat'


Execute the OR operation on the two search words
'fat | rat'::tsquery || 'cat'::tsquery
  return
( 'fat' | 'rat' ) | 'cat'


Not search word.
!! 'cat'::tsquery
  return
!'cat'


It indicates two adjacent search words. This is a brilliant function newly added to 9.6. For example, if you type in "hello <-> world", it matches hello world, but not hello digoal world.  This is because the location information is used.  
to_tsquery('fat') <-> to_tsquery('rat')
  return
'fat' <-> 'rat'

postgres=> select to_tsvector('english', 'Hi i''m digoal, a pger, are you pger?') @@ to_tsquery(
$$
'digoal' <-> 'pger'
$$
);
 ?column? 
----------
 f
(1 row)

postgres=> select to_tsvector('english', 'Hi i''m digoal, a pger, are you pger?');
            to_tsvector             
------------------------------------
 'digoal':4 'hi':1 'm':3 'pger':6,9
(1 row)

postgres=> select to_tsvector('english', 'hello world');
     to_tsvector     
---------------------
 'hello':1 'world':2
(1 row)

postgres=> select to_tsvector('english', 'hello world') @@ to_tsquery(
$$
'hello' <-> 'world'
$$
);
 ?column? 
----------
 t
(1 row)

postgres=> select to_tsvector('english', 'hello world') @@ to_tsquery(
$$
'world' <-> 'hello'
$$
);
 ?column? 
----------
 f
(1 row)

postgres=> select to_tsvector('english', 'hello world') @@ to_tsquery(
$$
'world' & 'hello'
$$
);
 ?column? 
----------
 t
(1 row)


Include relationship between two words.
'cat'::tsquery @> 'cat & rat'::tsquery
  return
false

'cat'::tsquery <@ 'cat & rat'::tsquery
  return
true

Full Text Search Type Functions

These functions are database built-in functions of the tsvector or tsquery type, supporting various commonly used functions.

Example:

Converting array into the phrase type, obtaining the current tsconfig (English, Chinese, etc.), obtaining the phrase length (number of lexemes), converting character strings into search words, and converting phrases into search words (including location information)

Example Use Case

The following describes the phrase conversion function added to PG 9.6, which supports adjacency.

For example, when you type in Chinese Taoist Culture, it is segmented into Chinese <-> Taoist <-> Culture. The match is positive only when the words are adjacent. Otherwise, the words cannot be matched.

The word segmentation such as "Chinese population census, Taoist proportion xx, education level xx" cannot be matched. If you want a positive match, type in Chinese & Taoist & Culture. This function is convenient.

postgres=# select phraseto_tsquery('hello digoal');
   phraseto_tsquery   
----------------------
 'hello' <-> 'digoal'
(1 row)

postgres=# select phraseto_tsquery('hello digoal zhou');
        phraseto_tsquery         
---------------------------------
 'hello' <-> 'digoal' <-> 'zhou'
(1 row)

postgres=# select plainto_tsquery('hello digoal zhou');
       plainto_tsquery       
-----------------------------
 'hello' & 'digoal' & 'zhou'
(1 row)

postgres=# select plainto_tsquery('hello digoal zhou, this is china');
            plainto_tsquery            
---------------------------------------
 'hello' & 'digoal' & 'zhou' & 'china'
(1 row)

postgres=# select phraseto_tsquery('hello digoal zhou, this is china');
              phraseto_tsquery               
---------------------------------------------
 'hello' <-> 'digoal' <-> 'zhou' <3> 'china'
(1 row)

Phrases supporting or not supporting indexing.

querytree('foo & ! bar'::tsquery)  
  
foo supports indexing.

!bar does not support indexing.

Add or remove the weight (namely, ABCD) of tsvector.

setweight('fat:2,4 cat:3 rat:5B'::tsvector, 'A') -- Add weight A to all lexemes.
  
'cat':3A 'fat':2,4 'rat':5A

setweight('fat:2,4 cat:3 rat:5B'::tsvector, 'A', '{cat,rat}') -- Add weight A to cat and rat.

'cat':3A 'fat':2,4 'rat':5A

strip('fat:2,4 cat:3 rat:5A'::tsvector) -- Remove weights.

'cat' 'fat' 'rat'

Remove the specified lexeme from tsvector. If you know some words are meaningless, remove them.

ts_delete('fat:2,4 cat:3 rat:5A'::tsvector, 'fat')   Return    'cat':3 'rat':5A
ts_delete('fat:2,4 cat:3 rat:5A'::tsvector, ARRAY['fat','rat'])    Return    'cat':3

Filter phrases based on weight. For example, only check whether the title and sub-title are matched (assume that the weights of title and sub-title are A and B respectively).

ts_filter('fat:2,4 cat:3b rat:5A'::tsvector, '{a,b}')    Return    'cat':3B 'rat':5A

The words in bold are positively matched. This is a useful function, especially in demonstration.

For example, some blog platforms set sensitive words, which will be filtered out. With this function, I can quickly find the words that match the sensitive word rules.

    ts_headline('x y z', 'z'::tsquery)    Return    x y <b>z</b>

The match percentage is useful in many scenarios. For example, rank the words based on match degree. The word with the highest match percentage is ranked first.

ts_rank(textsearch, query)    0.818

select *, ts_rank(fts,to_tsquery('supernovae & x-ray')) as rank 
from apod
where fts  @@ to_tsquery('supernovae & x-ray') 
order by rank desc limit 5;

Or use the <=> operator.

select *
from apod
where fts  @@ to_tsquery('supernovae & x-ray') 
order by fts <=> to_tsquery('supernovae & x-ray') desc limit 5;

Phrases include the title, sub-title, body, and section (namely, weight), so different coefficients can be set for different weights, for example:

That is, the match degree is calculated based on the weights you have set.

Set the coefficients of A, B, C, and D to {0.1, 0.2, 0.4, 1.0}.

ts_rank_cd('{0.1, 0.2, 0.4, 1.0}', textsearch, query)    Return    2.01317

Rewrite search words, like SQL REWRITE or the text replacement function.

Replace a with foo|bar.
ts_rewrite('a & b'::tsquery, 'a'::tsquery, 'foo|bar'::tsquery)    Return    'b' & ( 'foo' | 'bar' )

Batch replacement is supported. For example, use QUERY to replace the s fields with t.   
SELECT ts_rewrite('a & b'::tsquery, 'SELECT t,s FROM aliases')    Return    'b' & ( 'foo' | 'bar' )

Calculate the phrase and convert it into tsquery, including the lexeme distance information.

tsquery_phrase(to_tsquery('fat'), to_tsquery('cat'))    Return    'fat' <-> 'cat'-- Create a phrase in which fat and cat are adjacent.

tsquery_phrase(to_tsquery('fat'), to_tsquery('cat'), 10)    Return    'fat' <10> 'cat'-- Create a phrase in which 10 tokens (including the tokens filtered out) exist between fat and cat.

Convert tsvector into array, which does not include location information.

tsvector_to_array('fat:2,4 cat:3 rat:5A'::tsvector)    Return   {cat,fat,rat}

The automatic phrase update function is important. If you have updated the text fields, how are the phrases automatically updated?

The answer is two built-in triggers.
  
CREATE TRIGGER ... tsvector_update_trigger(tsvcol, 'pg_catalog.swedish', title, body)  
  
CREATE TRIGGER ... tsvector_update_trigger_column(tsvcol, configcol, title, body)

Expand tsvector to convert it into multiple records.

unnest('fat:2,4 cat:3 rat:5A'::tsvector)    (cat,{3},{D}) ...

Full Text Search Type Debugging Functions

If you have added a dictionary, and set or modified the phrasing rule, you may need to view the phrasing result after modification.

Phrasing includes several steps, for example, splitting a character string into tokens (including location information and weight) according to the dictionary, filtering unneeded tokens based on the token properties and ts config, and returning tsvector.

Using the debugging functions can display the original splitting information.

https://www.postgresql.org/docs/9.6/static/functions-textsearch.html

Example:

Indexing

In addition to type and function, a database must support indexing of a data type; otherwise, the database cannot fully support this data type.

The phrases in PostgreSQL support indexes, including GIN, GiST, and SP-Gist.

The following is an example of using indexes.

create table test(id serial primary key, content text, ts tsvector);
  
create index idx_test_ts on test using gin (ts);

select * from test where ts @@ to_tsquery('english', 'hello <-> digoal');  -- Indexing is implemented.

Multilingual Phrasing

PostgreSQL supports customized dictionary and phrasing configuration.

Chinese phrases can be added. There are many methods of adding Chinese phrases on the Internet.

https://github.com/fxsjy/jieba

https://github.com/jaiminpan/pg_scws

For this purpose, you can consider using Alibaba Cloud RDS PostgreSQL because it contains the zhparser phrase.

Fuzzy Query

Word segmentation and fuzzy search are two functions. Word segmentation cannot provide the fuzzy match function. For example, if you type in a regular expression or condition with fuzzy prefix and suffix for search, fuzzy search can hit the text that cannot be searched by word segmentation.

Word segmentation can match only the prefixes, but do not support the fuzzy prefixes, suffixes, or regular expressions.

postgres=# select to_tsquery('postgresql:*');
   to_tsquery   
----------------
 'postgresql':*
(1 row)

postgres=# select to_tsquery('postgres:*');
 to_tsquery 
------------
 'postgr':*
(1 row)

If you require the fuzzy search function, use the pg_trgm plugin in PostgreSQL for indexing.

Similarity-based search and fuzzy match are supported.

postgres=# create extension pg_trgm;
CREATE EXTENSION

postgres=# explain select * from tb where info ~ '5821a'; -- Fuzzy prefix and suffix.
                                 QUERY PLAN                                   
----------------------------------------------------------------------------  
 Bitmap Heap Scan on tb  (cost=103.75..3677.71 rows=1000 width=9)  
   Recheck Cond: (info ~ '5821a'::text)  
   ->  Bitmap Index Scan on idx_tb_2  (cost=0.00..103.50 rows=1000 width=0)  
         Index Cond: (info ~ '5821a'::text)  
(4 rows)  
Time: 0.647 ms  

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_split where info ~ '^33.+7.+9$' limit 10; -- Regular expression search.  
                                                           QUERY PLAN                                                              
---------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=43.75..80.78 rows=10 width=57) (actual time=19.573..21.212 rows=10 loops=1)  
   Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8  
   Buffers: shared hit=566  
   ->  Bitmap Heap Scan on public.t_split  (cost=43.75..3746.56 rows=1000 width=57) (actual time=19.571..21.206 rows=10 loops=1)  
         Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8  
         Recheck Cond: (t_split.info ~ '^33.+7.+9$'::text)  
         Rows Removed by Index Recheck: 647  
         Heap Blocks: exact=552  
         Buffers: shared hit=566  
         ->  Bitmap Index Scan on idx9  (cost=0.00..43.50 rows=1000 width=0) (actual time=11.712..11.712 rows=39436 loops=1)  
               Index Cond: (t_split.info ~ '^33.+7.+9$'::text)  
               Buffers: shared hit=14  
 Planning time: 0.301 ms  
 Execution time: 21.255 ms  
(14 rows)  
Time: 21.995 ms

Advantages of PostgreSQL

As you can see from the examples above, it is really easy to apply PostgreSQL to real-life e-commerce applications. PostgreSQL's powerful search and indexing features has helped Alibaba make the Double Eleven online shopping festival successful year after year. The problems used to be solved by search engine or external plugin can be solved by PostgreSQL today. Here are some of the advantages of using PostgreSQL for word segmentation and index searching:

You do not need to transfer data from one database to another. Real-time search is implemented with a low cost.
Text and phrases are stored in one database, ensuring the consistency in searching.
The numbers of search times and returned results are not limited, providing controllable flexibility.
The lexeme weights and weight coefficients can be customized to improve search flexibility.
Phrases can be ranked and searched by similarities.
Lexeme distance coefficients are supported. The search can be implemented based on absolute distance coefficients. For example, 'Fast & Furious' is converted into 'Fast <1> Furious', and the movie can be accurately matched. It is impossible to match 'speed is fast, very furious'.
Phrases can be customized. If the pre-defined phrases cannot meet your requirement, define your own phrases.
The tsvector granularity is adjustable.

References

PostgreSQL has a long history. It originated from University of California, Berkeley with a BSD-LIKE license. PostgreSQL is user-friendly; many products have been developed based on PostgreSQL.