Structure of PostgreSQL RUM Index

This short article discusses the structure of the PostgreSQL RUM Index with additional info not provided in the ctid (row number) of GIN posting list|tree.

By digoal

Background

The structure of GIN is listed below:

https://www.postgresql.org/docs/12/gin-implementation.html

The TREE constructed by the index field entry stores the row number corresponding to the entry in posting tree|list at the end. There is no other information.

Therefore, GIN indexes support retrievals, such as inclusion and intersection.

However, besides inclusion and intersection, ranking is also required for full-text retrieval, which requires additional information, such as the number of target words in this row.

GIN finds the corresponding heap tuples, extracts them one by one, calculates the number of target words in each matched row, and ranks the results.

Are there any better methods?

RUM Index

The RUM index is similar to the GIN index, but some attribute values are appended to each ctid(itempoint) of posting list|tree. For example, if the ctid(1,10) row contains the word "alibaba" with a total quantity of 100, the number 100 will be stored.

RUM access method - inverted index with additional information in posting lists

Here, iptr refers to the heap tuple row number:

Therefore, RUM can solve the low-efficiency problem of GIN (such as in ranking sort):

Slow Ranking: It needs position information about lexems to ranking. The GIN index doesn't store the positions of lexems. So, after index scan, we need an additional heap scan to retrieve lexems positions.
Slow Phrase Search with GIN Index: This problem relates to the previous problem. It needs position information to perform phrase search.
Slow Ordering by Timestamp: The GIN index can't store some related information in index with lexemes. So, it is necessary to perform an additional heap scan.

RUM solves these problems by storing additional information in posting tree, such as positional information of lexemes or timestamps. You can get an idea of RUM from the following picture:

https://github.com/postgrespro/rum

SELECT t, a <=> to_tsquery('english', 'beautiful | place') AS rank  
    FROM test_rum  
    WHERE a @@ to_tsquery('english', 'beautiful | place')  
    ORDER BY a <=> to_tsquery('english', 'beautiful | place');  
                t                |  rank  
---------------------------------+---------  
 It looks like a beautiful place | 8.22467  
 The situation is most beautiful | 16.4493  
 It is a beautiful               | 16.4493  
(3 rows)

It also stores other content (such as values of other fields) in the additional information of the itempoint and supports various types of sorting.

For example, use the following method to store the t (time) value in addition to the itempoint:

Construct an inverted tree based on the t (tsvector) field and store the content of field d of this record with the additional content corresponding to the ctid of the posting list|tree of the inverted tree

We can conduct a full-text search by t and sorting by d.

CREATE INDEX tsts_idx ON tsts USING rum (t rum_tsvector_addon_ops, d)  
    WITH (attach = 'd', to = 't');  
  
Now we can execute the following queries:  
  
EXPLAIN (costs off)  
    SELECT id, d, d <=> '2016-05-16 14:21:25' FROM tsts WHERE t @@ 'wr&qh' ORDER BY d <=> '2016-05-16 14:21:25' LIMIT 5;  
                                    QUERY PLAN  
-----------------------------------------------------------------------------------  
 Limit  
   ->  Index Scan using tsts_idx on tsts  
         Index Cond: (t @@ '''wr'' & ''qh'''::tsquery)  
         Order By: (d <=> 'Mon May 16 14:21:25 2016'::timestamp without time zone)  
(4 rows)  
  
SELECT id, d, d <=> '2016-05-16 14:21:25' FROM tsts WHERE t @@ 'wr&qh' ORDER BY d <=> '2016-05-16 14:21:25' LIMIT 5;  
 id  |                d                |   ?column?  
-----+---------------------------------+---------------  
 355 | Mon May 16 14:21:22.326724 2016 |      2.673276  
 354 | Mon May 16 13:21:22.326724 2016 |   3602.673276  
 371 | Tue May 17 06:21:22.326724 2016 |  57597.326724  
 406 | Wed May 18 17:21:22.326724 2016 | 183597.326724  
 415 | Thu May 19 02:21:22.326724 2016 | 215997.326724  
(5 rows)

Community

Structure of PostgreSQL RUM Index

Background

RUM Index

References

Read previous post:

Read next post:

digoal

You may also like

Comments

digoal

Related Products

PolarDB for PostgreSQL

Vector Retrieval Service for Milvus

AnalyticDB for PostgreSQL

ApsaraDB RDS for PostgreSQL