×
Community Blog The RUM index and Full-Text Search Using PostgreSQL

The RUM index and Full-Text Search Using PostgreSQL

The recent update of PostgreSQL 9.6 has made many enhancements in full-text search and has also brought us RUM plug-in support.

By Digoal.

Full-text search and fuzzy query are two extremely common and widely used in query types. In fact, these two types of queries happen to also be part of the backbone of many modern online search engines.

Usually when we think about search engines, we just think of them as satisfying our search needs, not considering what's going on in the background. But, in reality, there's a lot of things going on, besides the synchronization of data to the search engine, there's also issues of synchronization latency, updates, and even data consistency to deal with.

All of it is rather quite impressive when you take a minute to think about it all.

Well, now let's get into how this all works with PostgreSQL. The built-in full-text search data types and index support of PostgreSQL for full-fuzzy queries can provide for all of your needs, and the efficiency of the entire system is pretty good as well. For example, the results can be returned in milliseconds for a search in 1 billion Tokens.

Next, adding to all of this, the recent update of PostgreSQL 9.6 has made many enhancements in the full-text search capabilities and optimization of PosgreSQL. Moreover, the RUM plug-in, as Oleg Bartunov, a core committer of the PG Community puts it, sort of opens a Pandora's Box in that it greatly improving the search efficiency of PostgreSQL, being even better than GIN.

Scenario

Consider the following scenario, and see how the updates to PostgreSQL can help to provide some powerful optimizations.

This is a common scenario. Many users will use commas to separate the elements to be searched, store them in the database as strings, and then use fuzzy queries to retrieve the data.

create table test(c1 text);
insert into test values ('1,100,2331,344,502,.........');
insert ............
.....

Following this, for the 10 million such records in these databases, queries are executed based on the combination of elements.

select * from test where c1 like '%1%' or c1 like '%502%' and c1 like '%2331%';  

However, all of this leads to very low efficiency for queries. This is problematic. If you want to return data within milliseconds, it is almost unimaginable. However, PostgreSQL's updates can help with this scenario.

PostgreSQL Array

For one, the above scenario can be met by the arrays in PostgreSQL.

create table arr_test(c1 int[]);

create index idx_arr_test on arr_test using gin(c1);

insert into arr_test values(array[1,100,2331,344,502,......]);
......

PostgreSQL arrays support GIN indexes, which allow for quicker searches. For example, below is a search which searched the records which contain 1 or 2 in 10 million records. It was finished in only a few milliseconds.

postgres=# explain analyze select * from arr_test where c1 && array[1,2] order by c1 offset 19000 limit 100;
                                                                QUERY PLAN                                                                 
-------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=112837.69..112837.94 rows=100 width=424) (actual time=91.440..91.475 rows=100 loops=1)
   ->  Sort  (cost=112790.19..113039.57 rows=99750 width=424) (actual time=82.915..90.477 rows=19100 loops=1)
         Sort Key: c1
         Sort Method: external merge  Disk: 8440kB
         ->  Bitmap Heap Scan on arr_test  (cost=816.06..93595.94 rows=99750 width=424) (actual time=9.180..37.380 rows=19925 loops=1)
               Recheck Cond: (c1 && '{1,2}'::integer[])
               Heap Blocks: exact=19605
               ->  Bitmap Index Scan on idx_arr_test  (cost=0.00..791.12 rows=99750 width=0) (actual time=5.196..5.196 rows=19925 loops=1)
                     Index Cond: (c1 && '{1,2}'::integer[])
 Planning time: 0.131 ms
 Execution time: 93.929 ms
(11 rows)

PostgreSQL Full-Text Search

PostgreSQL also supports full-text searches. You can store the element as a tsvector and use tsquery for queries.

postgres=# create table gin_test(c1 tsvector);
CREATE TABLE

postgres=# create index idx_gin_test on gin_test using gin (c1) ;
CREATE INDEX

These PostgreSQL full-text searches also support indexes to accelerate queries. For example, you can search records which contain 1 or 2 in 10 million records.

How RUM Changes Everything

When it comes to using GIN indexes, the scan method is BITMAP, so a SORT action needs to be performed as result. But as such, this can be quite time consuming if you have a rather large LIST.

However, the RUM index interface, a plug-in of PostgreSQL 9.6, provides more powerful support for full-text search, and directly uses the INDEX SCAN interface without the need for SORT. In other words, RUM also implements <=> text similarity attribute retrieval.

Prostgres Professional's CEO and a core committer of the PG Community Oleg Bartunov said that RUM in many ways has sort of opened Pandora's box. Besides PostgreSQL 9.6 having made great strides in full-text search optimization, RUM is also an important and even immense change-that really changes everything.

To understand RUM better, let's forget search engines, and use PostgreSQL instead, playing with some of its features. First let's test RUM.

postgres=# create table rum_test(c1 tsvector);
CREATE TABLE

postgres=# CREATE INDEX rumidx ON rum_test USING rum (c1 rum_tsvector_ops);
CREATE INDEX

For more information related to this text, you can also check out my recent article Similarity Analysis for PostgreSQL Text Databases.

Comparison: Array vs Full-Text Search and GIN vs RUM

After doing the simple test above, let's also compare the array GIN index, the full-text search GIN index, and the full-text search RUM index. The table structure of each is as follows:

postgres=# create table rum_test(c1 tsvector);
CREATE TABLE

postgres=# create table gin_test(c1 tsvector);
CREATE TABLE

postgres=# create table arr_test(c1 int[]);
CREATE TABLE

Insert 10 million records with each field having 100 random values, which is equivalent to performing the matching operation in 1 billion random values. Do this for all of them:

$ vi test.sql
insert into rum_test select to_tsvector(string_agg(c1::text,',')) from  (select (100000*random())::int from generate_series(1,100)) t(c1);

$ pgbench -M prepared -n -r -P 1 -f ./test.sql -c 50 -j 50 -t 200000


$ vi test.sql
insert into gin_test select to_tsvector(string_agg(c1::text,',')) from  (select (100000*random())::int from generate_series(1,100)) t(c1);

$ pgbench -M prepared -n -r -P 1 -f ./test.sql -c 50 -j 50 -t 200000


$ vi test.sql
insert into arr_test select array_agg(c1) from  (select (100000*random())::int from generate_series(1,100)) t(c1);

$ pgbench -M prepared -n -r -P 1 -f ./test.sql -c 50 -j 50 -t 200000

Next, create an index for each of them:

postgres=# set maintenance_work_mem ='64GB';
SET
postgres=# CREATE INDEX rumidx ON rum_test USING rum (c1 rum_tsvector_ops);
CREATE INDEX

postgres=# create index idx_gin_test on gin_test using gin (c1) ;
CREATE INDEX

postgres=# create index idx_arr_test on arr_test using gin (c1) ;
CREATE INDEX

Comparing Query Efficiency

Now let's compare the query efficiency. Query the records that contain 1 or 2.

Full-text search type with rum index:
postgres=# explain analyze select * from rum_test where c1 @@ to_tsquery('English','1 | 2');
                                                            QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 Index Scan using rumidx on rum_test  (cost=16.00..99121.61 rows=99749 width=1387) (actual time=6.403..24.981 rows=19840 loops=1)
   Index Cond: (c1 @@ '''1'' | ''2'''::tsquery)
 Planning time: 0.075 ms
 Execution time: 26.086 ms
(4 rows)

Full-text search with GIN ind
postgres=# explain analyze select * from gin_test where c1 @@ to_tsquery('english','1 | 2');
                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on gin_test  (cost=816.06..99386.94 rows=99750 width=1387) (actual time=9.551..34.121 rows=19847 loops=1)
   Recheck Cond: (c1 @@ '''1'' | ''2'''::tsquery)
   Heap Blocks: exact=19764
   ->  Bitmap Index Scan on idx_gin_test  (cost=0.00..791.12 rows=99750 width=0) (actual time=5.554..5.554 rows=19847 loops=1)
         Index Cond: (c1 @@ '''1'' | ''2'''::tsquery)
 Planning time: 0.113 ms
 Execution time: 35.279 ms
(7 rows)

Array with Gin index:

postgres=# explain analyze select * from arr_test where c1 && array[1,2];
                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on arr_test  (cost=816.06..93595.94 rows=99750 width=424) (actual time=9.148..31.648 rows=19925 loops=1)
   Recheck Cond: (c1 && '{1,2}'::integer[])
   Heap Blocks: exact=19605
   ->  Bitmap Index Scan on idx_arr_test  (cost=0.00..791.12 rows=99750 width=0) (actual time=5.214..5.214 rows=19925 loops=1)
         Index Cond: (c1 && '{1,2}'::integer[])
 Planning time: 0.095 ms
 Execution time: 32.810 ms
(7 rows)

Now sort the output:

Full-text search with rum index:
postgres=# explain analyze select * from rum_test where c1 @@ to_tsquery('english','1 | 2') order by c1 <=> to_tsquery('english','1 | 2') offset 19000 limit 100;
                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=18988.45..19088.30 rows=100 width=1391) (actual time=58.912..59.165 rows=100 loops=1)
   ->  Index Scan using rumidx on rum_test  (cost=16.00..99620.35 rows=99749 width=1391) (actual time=16.426..57.892 rows=19100 loops=1)
         Index Cond: (c1 @@ '''1'' | ''2'''::tsquery)
         Order By: (c1 <=> '''1'' | ''2'''::tsquery)
 Planning time: 0.133 ms
 Execution time: 59.220 ms
(6 rows)

Full-text search with GIN index:
postgres=# explain analyze select * from gin_test where c1 @@ to_tsquery('english','1 | 2') order by c1 offset 19000 limit 100;
                                                                QUERY PLAN                                                                 
-------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=176684.69..176684.94 rows=100 width=1387) (actual time=117.809..117.865 rows=100 loops=1)
   ->  Sort  (cost=176637.19..176886.57 rows=99750 width=1387) (actual time=94.889..116.929 rows=19100 loops=1)
         Sort Key: c1
         Sort Method: external merge  Disk: 26968kB
         ->  Bitmap Heap Scan on gin_test  (cost=816.06..99386.94 rows=99750 width=1387) (actual time=9.625..38.336 rows=19847 loops=1)
               Recheck Cond: (c1 @@ '''1'' | ''2'''::tsquery)
               Heap Blocks: exact=19764
               ->  Bitmap Index Scan on idx_gin_test  (cost=0.00..791.12 rows=99750 width=0) (actual time=5.610..5.610 rows=19847 loops=1)
                     Index Cond: (c1 @@ '''1'' | ''2'''::tsquery)
 Planning time: 0.134 ms
 Execution time: 126.122 ms
(11 rows)

Array with GIN index:

postgres=# explain analyze select * from arr_test where c1 && array[1,2] order by c1 offset 19000 limit 100;
                                                                QUERY PLAN                                                                 
-------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=112837.69..112837.94 rows=100 width=424) (actual time=90.619..90.656 rows=100 loops=1)
   ->  Sort  (cost=112790.19..113039.57 rows=99750 width=424) (actual time=82.067..89.622 rows=19100 loops=1)
         Sort Key: c1
         Sort Method: external merge  Disk: 8440kB
         ->  Bitmap Heap Scan on arr_test  (cost=816.06..93595.94 rows=99750 width=424) (actual time=9.087..36.870 rows=19925 loops=1)
               Recheck Cond: (c1 && '{1,2}'::integer[])
               Heap Blocks: exact=19605
               ->  Bitmap Index Scan on idx_arr_test  (cost=0.00..791.12 rows=99750 width=0) (actual time=5.138..5.138 rows=19925 loops=1)
                     Index Cond: (c1 && '{1,2}'::integer[])
 Planning time: 0.122 ms
 Execution time: 93.057 ms
(11 rows)

Some Additional Capabilities of RUM

Rum search supports similarity ranking, which is useful in search. The similarity score indicates the similarity between the text and the search condition.

// Word Breaking example
postgres=#  select * from to_tsvector('jiebacfg', 'Xiao Ming graduated with a master's degree from the Institute of Computing Technology of the Chinese Academy of Sciences, and later went on to study at Kyoto University in Japan.');
                                   to_tsvector                                    
----------------------------------------------------------------------------------
 'Chinese Academy of Sciences':5 'Xiao Ming':1 'Kyoto University':10 'graduated':3 'study':11 'master's degree':2 'Institute of Computing Technology':6
(1 row)
// With Similarities
postgres=#  select * from rum_ts_distance(to_tsvector('jiebacfg', 'Xiao Ming graduated with a master's degree from the Institute of Computing Technology of the Chinese Academy of Sciences, and later went on to study at Kyoto University in Japan.') , to_tsquery('Institute of Computing Technology'));
 rum_ts_distance 
-----------------
         16.4493
(1 row)
// Without similarities
postgres=#  select * from rum_ts_distance(to_tsvector('jiebacfg', 'Xiao Ming graduated with a master's degree from the Institute of Computing Technology of the Chinese Academy of Sciences, and later went on to study at Kyoto University in Japan.') , to_tsquery('Computing Technology'));
 rum_ts_distance 
-----------------
        Infinity
(1 row)
// One or the other has similarities
postgres=# select * from rum_ts_distance(to_tsvector('jiebacfg', 'Xiao Ming graduated with a master's degree from the Institute of Computing Technology of the Chinese Academy of Sciences, and later went on to study at Kyoto University in Japan.') , to_tsquery('The Institute of Computing Technology | master's degree'));
 rum_ts_distance 
-----------------
         8.22467
(1 row)
// Both have similarities
postgres=# select * from rum_ts_distance(to_tsvector('jiebacfg', 'Xiao Ming graduated with a master's degree from the Institute of Computing Technology of the Chinese Academy of Sciences, and later went on to study at Kyoto University in Japan.') , to_tsquery('Institute of Computing Technology & master's degree'));
 rum_ts_distance 
-----------------
         32.8987
(1 row)
// Order
postgres=# create table test15(c1 tsvector);
CREATE TABLE
postgres=# insert into test15 values (to_tsvector('jiebacfg', 'hello china, i''m digoal')), (to_tsvector('jiebacfg', 'hello world, i''m postgresql')), (to_tsvector('jiebacfg', 'how are you, i''m digoal'));
INSERT 0 3
postgres=# select * from test15;
                         c1                          
-----------------------------------------------------
 ' ':2,5,9 'china':3 'digoal':10 'hello':1 'm':8
 ' ':2,5,9 'hello':1 'm':8 'postgresql':10 'world':3
 ' ':2,4,7,11 'digoal':12 'm':10
(3 rows)
postgres=# create index idx_test15 on test15 using rum(c1 rum_tsvector_ops);
CREATE INDEX
postgres=# select *,c1 <=> to_tsquery('hello') from test15;
                         c1                          | ?column? 
-----------------------------------------------------+----------
 ' ':2,5,9 'china':3 'digoal':10 'hello':1 'm':8     |  16.4493
 ' ':2,5,9 'hello':1 'm':8 'postgresql':10 'world':3 |  16.4493
 ' ':2,4,7,11 'digoal':12 'm':10                     | Infinity
(3 rows)
postgres=# explain select *,c1 <=> to_tsquery('postgresql') from test15 order by c1 <=> to_tsquery('postgresql');
                                   QUERY PLAN                                   
--------------------------------------------------------------------------------
 Index Scan using idx_test15 on test15  (cost=3600.25..3609.06 rows=3 width=36)
   Order By: (c1 <=> to_tsquery('postgresql'::text))
(2 rows)

Summary

To sum up our comparison, RUM is very powerful and supports similarity search and non-BITMAP scans. In terms of query efficiency, our output results in our own comparison here show that RUM is twice as efficient as GIN and simple array query. So, as you can see, PostgreSQL 9.6 brings a lot of great things to the table. The one thing that excites me and many other developers, of course, is RUM.

0 0 0
Share on

digoal

281 posts | 24 followers

You may also like

Comments