×
Community Blog Location Matching Filtering Syntax for PostgreSQL Full-Text Searches

Location Matching Filtering Syntax for PostgreSQL Full-Text Searches

In this article, we look at the location matching filtering syntax of PostgreSQL full-text searches and how you can use this syntax for textual analysis.

By Digoal.

Full-text search feature of PostgreSQL offers a rich array of functions. Besides the open parser and dictionary, it also support built-in and also some extended ranking algorithms.

In fact, location filtering is also an important aspect involved in search queries. Consider the term "Fast and Furious", for example, it contains location information based on how the words are segmented. Therefore, when conducting a search, we should mind the distance between the words "Fast" and "Furious" so to be able to improve the accuracy of the search.

The distance search syntax of PostgreSQL is as follows:

select * from tbl where ts @@ 'Speed <Distance Value> Passion'::tsquery;  
  
For example:  
  
select * from tbl where ts @@ 'Speed <1> Passion'::tsquery;  

In the above example, the first line is the general template, and the last line is an example. The distance value of 1 indicates the relative distance between the terms 'speed' and 'passion'.

Now let's consider a more complex example. For this example, let's look at two long phrases, and see how this system can analysis and work with these long, dense phrases.

Example Application

For this you'll want to create a test table.

postgres=# create table ts_test (id int, info text, ts tsvector);  
CREATE TABLE  

After that, you'll want to write test data. Consider this data below:

postgres=# insert into ts_test values (1, 'Passion, Innovation, Persistence, Speed—Beijing North Venture Taxi Co., LTD.'s quality service highlights', to_tsvector('scwscfg', 'Passion, Innovation, Persistence, Speed—Beijing North Venture Taxi Co., LTD.'s quality service highlights'));  
INSERT 0 1  
postgres=# insert into ts_test values (1, 'The Speed and Passion of Film 8 at the Box Office', to_tsvector('scwscfg', 'The Speed and Passion of Film 8 at the Box Office'));  
INSERT 0 1  

Next, view the segmentation results. In these results, you'll see that the location information has been included in the segmentation.

postgres=# select * from ts_test;  
 id |                              info                               |                                                     ts                                                       
----+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------  
  1 | Passion, Innovation, Persistence, and Speed—Beijing North Venture Taxi CO., LTD.'s quality service highlights | 'quality service':9 'CO.':8 'Taxi':7 'Venture':6 'Innovation':2 'Beijing':5 'Persistence':3 'high':10 'Passion':1 'lights':11 'Speed':4  
  1 | The Speed and Passion of Film 8 at the Box Office                                       | 'Passion':3 'Film':1 'at the':5 'Box Office':4 'Speed':2  
(2 rows)  

Now you'll want to write more interference data.

postgres=# insert into ts_test select 2, 'Passion, Innovation, Persistence, and Speed—Beijing North Venture Taxi CO., LTD.'s quality service highlights', to_tsvector('scwscfg', 'Passion, Innovation, Persistence, and Speed—Beijing North Venture Taxi CO., LTD.'s quality service highlights') from generate_series(1,1000000);  
INSERT 0 1000000  

After this, create an index plug-in RUM, which supports location search.

postgres=# create extension rum;  
CREATE EXTENSION  

Create a RUM segmentation index.

postgres=# CREATE INDEX rumidx ON ts_test USING rum (ts rum_tsvector_ops);  
CREATE INDEX  

Next, you'll want to check the query results. The first query does not include the location, so we can see that all records have been matched.

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from ts_test where ts @@ 'Speed & Passion'::tsquery;  
                                                             QUERY PLAN                                                               
------------------------------------------------------------------------------------------------------------------------------------  
 Index Scan using rumidx on public.ts_test  (cost=9.60..41.24 rows=25 width=68) (actual time=223.602..456.956 rows=1000002 loops=1)  
   Output: id, info, ts  
   Index Cond: (ts_test.ts @@ '''Speed'' & ''Passion'''::tsquery)  
   Buffers: shared hit=38132  
 Planning time: 0.226 ms  
 Execution time: 541.545 ms  
(6 rows)  

The second query includes a location, and only matches one record, which is the "Fast and Furious" we want.

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from ts_test where ts @@ 'Speed <1> Passion'::tsquery;  
                                                          QUERY PLAN                                                            
------------------------------------------------------------------------------------------------------------------------------  
 Index Scan using rumidx on public.ts_test  (cost=9.60..41.24 rows=25 width=68) (actual time=270.675..270.677 rows=1 loops=1)  
   Output: id, info, ts  
   Index Cond: (ts_test.ts @@ '''Speed'' <-> ''Passion'''::tsquery)  
   Buffers: shared hit=1095  
 Planning time: 0.111 ms  
 Execution time: 273.840 ms  
(6 rows)  

PostgreSQL segmenter is relatively simple to use. Simply follow the instructions given below.

With a Distance Value Search
  
postgres=# select * from ts_test where ts @@ 'Speed <1> Passion'::tsquery;  
 id |           info            |                     ts                       
----+---------------------------+--------------------------------------------  
  1 | The Speed and Passion of Film 8 at the Box Office | 'Passion':3 'Film':1 'at the':5 'Box Office':4 'Speed':2  
(1 row)  
  
Without a Distance Value Search 
  
postgres=# select * from ts_test where ts @@ 'Speed & Passion'::tsquery limit 5;  
 id |                              info                               |                                                     ts                                                       
----+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------  
  1 | Passion, Innovation, Persistence, and Speed—Beijing North Venture Taxi CO., LTD.'s quality service highlights | 'Quality Service':9 'Co.':8 'Taxi':7 'Venture':6 'Innovation':2 'Beijing':5 'Persistence':3 'High':10 'Passion':1 'light':11 'Speed':4  
  1 | The Speed and Passion of Film 8 at the Box Office                                       | 'Passion':3 'Film':1 'at the':5 'Box Office':4 'Speed':2  
  2 | Passion, Innovation, Persistence, and Speed—Beijing North Venture Taxi CO., LTD.'s quality service highlights | 'Quality Service':9 'Co.':8 'Tax':7 'Venture':6 'Innovation':2 'Beijing':5 'Persistence':3 'High':10 'passion':1 'light':11 'Speed':4  
  2 | Passion, Innovation, Persistence, and Speed—Beijing North Venture Taxi CO., LTD.'s quality service highlights | 'Quality Service':9 'Co.':8 'Taxi':7 'Venture':6 'Innovation':2 'Beijing':5 'Persistence':3 'High':10 'Passion':1 'light':11 'Speed':4  
  2 | Passion, Innovation, Persistence, and Speed—Beijing North Venture Taxi CO., LTD.'s quality service highlights | 'Quality Service':9 'Co.':8 'Taxi':7 'Venture':6 'Innovation':2 'Beijing':5 'Persistence':3 'High':10 'Passion':1 'light':11 'Speed':4  
(5 rows)  

Last, if the distance is a range, we can also use user-defined function (UDF) to implement filtering.

-- Create a UDF that retrieves the distance internal between two words.
create or replace function get_lexeme_pos_range(tsvector, text, text) returns int4range as 
$$

declare
  a_pos int[];
  b_pos int[];
  m1 int;
  m2 int;
begin
  -- All the positions where the first word appears.
  select positions into a_pos from (select * from unnest($1)) t where lexeme=$2;
  -- All the positions where the second word appears.
  select positions into b_pos from (select * from unnest($1)) t where lexeme=$3;

  -- The position of where the two words appear, find the interval.
  -- Return the range type.
  select min(abs(t1.pos-t2.pos)), max(abs(t1.pos-t2.pos)) into m1,m2 from 
    (select unnest(a_pos) pos) t1 cross join
    (select unnest(b_pos) pos) t2;

  return int4range(m1,m2+1);
end;

$$
 language plpgsql strict immutable;


-- Test, find the distance of 1 to 2 (excluding 2).

postgres=# select get_lexeme_pos_range(ts, 'Speed', 'Passion'), * from ts_test where ts @@ tsquery 'Speed & Passion' and get_lexeme_pos_range(ts, 'Speed', 'Passion') && int4range(1,2) limit 1;
 get_lexeme_pos_range | id |           info            |                     ts                     
----------------------+----+---------------------------+--------------------------------------------
 [1,2)                |  1 | The Speed and Passion of Film 8 at the Box Office | 'Passion':3 'Film':1 'at the':5 'Box Office':4 'Speed':2
(1 row)

Time: 0.713 ms

-- Test, take the distance of 2 to 5 (excluding 5)

postgres=# select get_lexeme_pos_range(ts, 'Speed', 'Passion'), * from ts_test where ts @@ tsquery 'Speed & Passion' and get_lexeme_pos_range(ts, 'Speed', 'Passion') && int4range(2,5) limit 1;
 get_lexeme_pos_range | id |                              info                               |                                                     ts                                                     
----------------------+----+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------
 [3,4)                |  1 | Passion, Innovation, Persistence, Speed—Beijing North Venture Taxi Co., LTD.'s quality service highlights | 'quality service':9 'Co.':8 'Taxi':7 'Venture':6 'Innovation':2 'Beijing':5 'Persistence':3 'High':10 'Passion':1 'light':11 'Speed':4
(1 row)

Time: 0.682 ms
0 0 0
Share on

digoal

277 posts | 24 followers

You may also like

Comments