Using the Built-In and Custom Ranking Algorithms of PostgreSQL for Full-Text Searches

By Digoal.

In my previous article in this series, Using PostgreSQL to Create an Efficient Search Engine , specifically in the section "Sorting Algorithms", I previously discussed the PostgreSQL ranking algorithm. We will be discussing these algorithms and other related topics more in this article. Also, in that same previous article, we also discussed how tsvector divides the document into four levels: title, author, abstract, and content. For these four levels, you can set the corresponding weight for ranking computing. And, you can also set the correction mask of ranking.

However, there are limitations here. Four levels is far from meeting all business needs. So the question remains, is there a way for a more-refined ranking algorithm that can implement more levels in PostgreSQL? Is it possible to customize ranking, that is?

Also, consider the following real-world example. In the e-commerce industry, you can store tags for each shop, and each tag can have a corresponding coefficient. It is reasonable to say that the four levels system is insufficient here. Rather, a system based on these tags and the corresponding coefficients is more suitable. Coefficients can be dynamically adjusted. During the search, some of the shop's tags are hit, and millions of shops may be searched, but finally, the way to rank them should be computed based on the weight because 10,000 of them can be obtained. In this case, the way to refine the ranking is shown.

In this article, we will explore how you can refine the ranking of shops from the e-commerce example above through two different methods, using tsvector and using a multi-dimensional array, to customize rankings.

Method 1: Using Tsvector

The first method is to use Tsvector. For this example, first look at the shop tag table:

create table tbl (    
  shop_id int8 primary key,   -- The shop ID    
  tags text                   -- Multi-value type, lavel 1: score 1, label 2: score 2, …

For the tags field, we will be using a user-defined function (UDF) index. That is, we'll make our own custom function. For our user-defined function, an array to store tags or a tsvector index can be used. If the tsvector is used, it will be our best choice then to use the PostgreSQL full-text search syntax, which, of course, contains things like Contain, Notcontain, and Distance, as we discussed in the previous article. Consider the following example:

national_foot baths:0.99,national_dining:0.1,entertaining_KTV:0.45

Now consider the tag weight table:

create table tbl_weight (    
  tagid int primary key,   -- The Tag ID     
  tagname name,            -- The tag name   
  desc text,               -- The label description   
  weight float8            -- The tag weight   
);    

create index idx_tbl_weight_1 on tbl_weight (tagname);

Next, there's a UDF to convert the text to the tag array and the tsvector:

create or replace function text_to_tsvector(text) returns tsvector as 
$$
    
  select array_to_tsvector(array_agg(substring(id,'(.+):'))) from unnest(regexp_split_to_array($1, ',')) as t(id);    

$$
 language sql strict immutable;    
    
postgres=# select text_to_tsvector('abc:1.1,bc:100,c:293');    
 text_to_tsvector     
------------------    
 'abc' 'bc' 'c'    
(1 row)

Now, you'll want to create a tsvector expression index

create index idx_tbl_1 on tbl using gin (text_to_tsvector(tags));

Below is the UDF for fetching the weights of the hit tags

postgres=# select substring('bc:1.1,abc:100,c:293','[^,]?abc:([\d\.]+)') ;    
 substring     
-----------    
 100    
(1 row)    
    
postgres=# select substring('abc:1.1,bc:100,c:293','[^,]?abc:([\d\.]+)') ;    
 substring     
-----------    
 1.1    
(1 row)

If not hit, NULL is returned.

postgres=# select substring('abc:1.1,bc:100,c:293','[^,]?adbc:([\d\.]+)') ;    
 substring     
-----------    
     
(1 row)    
    
postgres=# select substring('abc:1.1,bc:100,c:293','[^,]?adbc:([\d\.]+)') is null;    
 ?column?     
----------    
 t    
(1 row)

Then, when it comes to Full-text search, I recommend that you reference my article Using PostgreSQL to Create an Efficient Search Engine .

select

After which, let's look at Refined ranking. The ranking may be obtained a ranking value based on the algorithm by combining the "hit tags, scores, and weights of the tags". The score of the hit tag is obtained based on what you found for the weights of the hit tags, and the corresponding weight value is obtained from tbl_weight based on the hit tag.

The algorithm is encapsulated into the UDF, and finally the Ranking is obtained. The UDF content of the ranking algorithm is omitted here. Of course, though, make sure to compile the corresponding algorithm based on your own situation or business needs. What follows is some reference code:

create or replace function cat_ranking(tsquery) returns float8 as 
$$
  
declare  
    
begin  
  for each x in array (contains_element) loop  
    search hit element's score.  
    search hit element's weight.  
    cat ranking and increment  
  end loop;  
  return res;  
end;  

$$
 language plpgsql strict;

For the process to follow, you can delete the tags and corresponding scores by using the regexp_replace function. Then, append tags and corresponding scores with the concat function. After that, then, modify the element score with the regexp_replace function. Note that all of the above can be operated using regular expressions.

Method 2: Using a Multi-Dimensional Array

For this second method, let's use a multi-dimensional array. In reality, using arrays to store tags and weights is actually much easier to program than using tsvector, as we did in the previous example.

For this method, you'll want to first understand some array functions before doing anything else.

Find the location according to the element, find the location according to the tag, and obtain the score from score[] according to the location.  
  
postgres=# select array_position(array[1,2,null,null,2,2,3,1],null);  
 array_position   
----------------  
              3  
(1 row)  
  
postgres=# select array_positions(array[1,2,null,null,2,2,3,1],null);  
 array_positions   
-----------------  
 {3,4}  
(1 row)  
  
postgres=# select array_positions(array[1,2,null,null,2,2,3,1],2);  
 array_positions   
-----------------  
 {2,5,6}  
(1 row)  
  
Find the element at a certain location  
  
array[i]  
  
postgres=# select (array[1,2,null,null,2,2,3,1])[1];  
 array   
-------  
     1  
(1 row)  
  
postgres=# select (array[1,2,null,null,2,2,3,1])[3];  
 array   
-------  
        
(1 row)  
  
postgres=# select (array[1,2,null,null,2,2,3,1])[5];  
 array   
-------  
     2  
(1 row)  
  
Append an element  
  
array_append  
  
Replace an element  
  
array_replace  
  
Delete an element  
  
array_remove: Note that if the same elements exist, they will be deleted (if the same scores exist, it must be noted that the element needs to be deleted by deleting the location)  
  
postgres=# select array_remove(array[1,2,null,null,2,2,3,1],2);  
   array_remove      
-------------------  
 {1,NULL,NULL,3,1}  
(1 row)  
  
Delete the element at a certain location  
  
postgres=# create or replace function array_remove(anyarray,int[]) returns anyarray as 
$$
  
  select array(select $1[i] from (select id from generate_series(1,array_length($1,1)) t(id) where id <> all( $2) ) t(i))  

$$
 language sql strict;  
CREATE FUNCTION  
postgres=# select array_remove(array[1,2,null,null,2,2,3,1],array[1,2]);  
    array_remove       
---------------------  
 {NULL,NULL,2,2,3,1}  
(1 row)  
  
postgres=# select array_remove(array[1,2,null,null,2,2,3,1],array[3,5]);  
   array_remove     
------------------  
 {1,2,NULL,2,3,1}  
(1 row)

Next, you'll want take a look at the shop tag table:

create table tbl (    
  shop_id int8 primary key,   -- the ID of the shop    
  tags text[],                -- array, label 1, label 2… 
  scores float8[]             -- array, score 1, score 2… 
);     
  
create index idx_tbl_1 on tbl using gin(tags);  
national_foot baths, national_dining, entertainment_KTV  
  
0.99,0.1,0.45

And also the tag weight table:

create table tbl_weight (    
  tagid int primary key,   -- The tag ID     
  tagname name,            -- The tag name
  desc text,               -- The label description    
  weight float8            -- The tag weight    
);    
  
create index idx_tbl_weight_1 on tbl_weight (tagname);

For the array queries for Contain, Notcontain, and Intersection, see this page.

After all of that, you'll want to create a refined ranking algorithm, which is similar to the one we used in method 1. A user-defined function (UDF) can be customized for this purpose. Overall, by using this method, you can simplify your overall development workload and increase the efficiency of the resulting system without using regular expressions.

For more information, I recommend that you check out this page.

: 5244987673819126 April 23, 2020 at 8:31 pm

Hello, I am trying to solve the issue when I need more weight levels than just 4 and I found your article. Unfortunately I am not that experienced in the SQL, could you please explain a little bit more the first method with ts_vector? I mean I do not understand how to put everything together. Let's say I have tsquery which I want to rank. I call the cat_ranking(tsquery), where did you get the 'contains_element' please?

0 0

Community

Using the Built-In and Custom Ranking Algorithms of PostgreSQL for Full-Text Searches

Method 1: Using Tsvector

Method 2: Using a Multi-Dimensional Array

Read previous post:

Read next post:

digoal

You may also like

Comments

5244987673819126 April 23, 2020 at 8:31 pm

digoal

Related Products

PolarDB for PostgreSQL

AnalyticDB for PostgreSQL

Vector Retrieval Service for Milvus

OpenSearch