RUM extension - PolarDB - Alibaba Cloud Documentation Center

Prerequisites

Before you begin, ensure that:

Your PolarDB for PostgreSQL cluster runs one of the following minor engine versions or later: You can view the minor engine version in the console or by running SHOW polardb_version;. If your version does not meet the requirement, upgrade the minor engine version. For more information, see View the minor engine version.

PostgreSQL version	Minimum minor engine version
PostgreSQL 18	2.0.18.1.2.0
PostgreSQL 17	2.0.17.6.4.0
PostgreSQL 16	2.0.16.8.3.0
PostgreSQL 15	2.0.15.7.1.1
PostgreSQL 14	2.0.14.5.3.0

The smlar extension is not active in the same database schema. The % operator of rum conflicts with the % operator of smlar. If both extensions are required, install them in separate schemas.

GIN vs. RUM: when to use each

The RUM index trades a larger index and slower writes for significantly faster sorting, phrase search, and attached-column ordering. The table below summarizes the trade-offs.

Dimension	GIN index	RUM index
Relevance sorting	Slow; requires table lookups	Fast; sorts directly within the index
Phrase search	Slow; requires table lookups for word positions	Fast; positions stored in the index
Attached-column sorting	Not supported	Supported; attach a column (e.g., timestamp) to the index
Write performance	Faster	Slower; more complex index schema
Index size	Smaller	Larger; stores extra position and column data
Prefix search	Supported	Supported by `rum_tsvector_ops`; not supported by `rum__hash_` operator classes

Choose the RUM index for workloads that frequently sort full-text search results by relevance, time, price, or any other attached column. For write-intensive tables or simple keyword matching without sorting, the built-in GIN index is more cost-effective.

Install the extension

Connect to your database and run:

CREATE EXTENSION rum;

To uninstall:

DROP EXTENSION rum;

Operators and operator classes

An operator class defines how a RUM index processes a specific data type. Choosing the correct operator class determines which operators PostgreSQL can use to accelerate queries with the index. For background, see Operator classes and operator families.

Operators

Operator	Input types	Return type	Description
`A @@ B`	Left: `tsvector`, right: `tsquery`	`bool`	Returns `true` if the full-text vector matches the query
`A <=> B`	Left: `tsvector`, right: `tsquery`	`float4`	Returns the distance between the vector and the query; a smaller value means higher relevance
`A <=> B`	`timestamp`, `timestamptz`, `int2`, `int4`, `int8`, `float4`, `float8`, `money`, `oid`	`float8`	Returns the absolute difference between two values; time differences are in seconds (microsecond precision); money differences are in cents
`A <=\| B`	`timestamp`, `timestamptz`, `int2`, `int4`, `int8`, `float4`, `float8`, `money`, `oid`	`float8`	Returns `B - A` if `A ≤ B`; otherwise returns infinity
`A \|=> B`	`timestamp`, `timestamptz`, `int2`, `int4`, `int8`, `float4`, `float8`, `money`, `oid`	`float8`	Returns `A - B` if `A > B`; otherwise returns infinity

Operator classes for `tsvector`

Operator class	Supported operators	Notes
`rum_tsvector_ops`	`WHERE`: `@@`; `ORDER BY`: `<=>`	Stores lexemes and their positions. Supports full-text search, prefix search, and relevance sorting. The most common choice for full-text search.
`rum_tsvector_hash_ops`	`WHERE`: `@@`; `ORDER BY`: `<=>`	Stores hash values and positions instead of raw lexemes. Supports full-text search and relevance sorting, but not prefix search. The index may be smaller than `rum_tsvector_ops`, but hash collisions can require a recheck.
`rum_tsvector_addon_ops`	`WHERE`: `@@`	Attaches data from an additional column (such as a `timestamp`) to the `tsvector` index entries. Enables full-text search on the primary column and efficient `ORDER BY` on the attached column in a single index scan. The attached column's type must have a corresponding `rum_<TYPE>_ops` class, and `ORDER BY` must use `<=>`, `<=\|`, or `\|=>`.
`rum_tsvector_hash_addon_ops`	`WHERE`: `@@`	Same as `rum_tsvector_addon_ops` but stores hash values of lexemes. Does not support prefix search. The index may be smaller, but hash collisions can require a recheck and may make searches slower than `rum_tsvector_addon_ops`.
`rum_tsquery_ops`	`WHERE`: `@@`	Indexes a `tsquery` column for inverted matching: quickly finds which stored query conditions match a given document.

Operator classes for arrays

Operator class	Applicable types	Supported operators	Notes
`rum_anyarray_ops`	`anyarray` (e.g., `int[]`, `text[]`, `varchar[]`)	`WHERE`: `&&`, `@>`, `<@`, `=`, `%`; `ORDER BY`: `<=>`	Supports array overlap, containment, equality, and similarity checks (similarity is computed against a threshold; arrays exceeding the threshold are considered similar), plus sorting by distance between arrays.
`rum_anyarray_addon_ops`	`anyarray` (e.g., `int[]`, `text[]`, `varchar[]`)	`WHERE`: `&&`, `@>`, `<@`, `=`, `%`; `ORDER BY`: `<=>`	Same as `rum_anyarray_ops` with an attached column. The `%` operator uses the same threshold-based similarity check. The attached column's type must have a corresponding `rum_<TYPE>_ops` class, and `ORDER BY` must use `<=>`, `<=\|`, or `\|=>`.

Operator classes for other data types

Operator class	Applicable types	Supported operators
`rum_<TYPE>_ops`	`int2`, `int4`, `int8`, `float4`, `float8`, `money`, `oid`, `time`, `timetz`, `date`, `interval`, `macaddr`, `inet`, `cidr`, `text`, `varchar`, `char`, `bytea`, `bit`, `varbit`, `numeric`, `timestamp`, `timestamptz`	`WHERE`: `<`, `<=`, `=`, `>=`, `>`; `ORDER BY` (`int2`, `int4`, `int8`, `float4`, `float8`, `money`, `oid`, `timestamp`, `timestamptz` only): `<=>`, `<=\|`, `\|=>`

Use rum_<TYPE>_ops for range queries and distance sorting on non-text, non-array columns.

Examples

Sort full-text search results by relevance

This example creates a tsvector column with a RUM index and queries it ordered by relevance score.

Create the table and insert test data:

CREATE TABLE t1(
  t text,
  t_vec tsvector GENERATED ALWAYS AS (to_tsvector('pg_catalog.english', t)) STORED
);

INSERT INTO t1(t) VALUES ('The situation is most beautiful');
INSERT INTO t1(t) VALUES ('It is a beautiful');
INSERT INTO t1(t) VALUES ('It looks like a beautiful place');

Create a RUM index using the rum_tsvector_ops operator class:

CREATE INDEX t1_t_vec_idx ON t1 USING rum (t_vec rum_tsvector_ops);

Query and sort by relevance. The <=> operator returns a distance value — smaller means more relevant:

SET enable_seqscan TO off;

SELECT t, t_vec <=> to_tsquery('english', 'beautiful | place') AS rank
FROM t1
WHERE t_vec @@ to_tsquery('english', 'beautiful | place')
ORDER BY t_vec <=> to_tsquery('english', 'beautiful | place');

Result:

t                                |  rank
---------------------------------+---------
 It looks like a beautiful place | 8.22467
 The situation is most beautiful | 16.4493
 It is a beautiful               | 16.4493

Sort by timestamp with full-text filtering

This example shows how to attach a timestamp column to a tsvector index. Both full-text filtering and timestamp-based sorting complete in a single index scan.

Create the table and insert sample data:

CREATE TABLE tsts (id int, t tsvector, d timestamp);

INSERT INTO tsts VALUES
(354, to_tsvector('wr qh'), '2016-05-16 14:21:22.326724'),
(355, to_tsvector('wr qh'), '2016-05-16 13:21:22.326724'),
(356, to_tsvector('ts op'), '2016-05-16 18:21:22.326724'),
(358, to_tsvector('ts op'), '2016-05-16 23:21:22.326724'),
(371, to_tsvector('wr qh'), '2016-05-17 06:21:22.326724'),
(406, to_tsvector('wr qh'), '2016-05-18 17:21:22.326724'),
(415, to_tsvector('wr qh'), '2016-05-19 02:21:22.326724');

Create a RUM index with the d column attached to the t index entries. The WITH (attach = 'd', to = 't') clause stores values from d alongside the tsvector entries for t, so sorting on d requires no separate table lookup:
```
CREATE INDEX tsts_idx ON tsts USING rum (t rum_tsvector_addon_ops, d) WITH (attach = 'd', to = 't');
```

Query records matching the full-text filter and sort by proximity to a target timestamp:

SET enable_seqscan TO off;

EXPLAIN (costs off)
SELECT id, d, d <=> '2016-05-16 14:21:25' AS distance
FROM tsts
WHERE t @@ 'wr&qh'
ORDER BY d <=> '2016-05-16 14:21:25'
LIMIT 5;

The execution plan confirms that both filtering and sorting use a single index scan:

QUERY PLAN
------------------------------------------------------------------------------
 Limit
   ->  Index Scan using tsts_idx on tsts
         Index Cond: (t @@ '''wr'' & ''qh'''::tsquery)
         Order By: (d <=> '2016-05-16 14:21:25'::timestamp without time zone)

Result:

id  |             d              |   distance
-----+----------------------------+---------------
 354 | 2016-05-16 14:21:22.326724 |      2.673276
 355 | 2016-05-16 13:21:22.326724 |   3602.673276
 371 | 2016-05-17 06:21:22.326724 |  57597.326724
 406 | 2016-05-18 17:21:22.326724 | 183597.326724
 415 | 2016-05-19 02:21:22.326724 | 215997.326724

Query and sort arrays by similarity

This example indexes an integer array column and retrieves rows that overlap with a target array, ordered by similarity.

Create the table and insert sample data:

CREATE TABLE test_array (id serial, i int2[]);
INSERT INTO test_array(i) VALUES ('{}'), ('{0}'), ('{1,2,3,4}'), ('{1,2,3}'), ('{1,2}'), ('{1}');

Create a RUM index using the rum_anyarray_ops operator class:

CREATE INDEX idx_array ON test_array USING rum (i rum_anyarray_ops);

Find rows that contain the element 1 and sort by similarity to {1}. The && operator checks for array overlap, and <=> returns the distance — smaller means more similar:
```
SELECT *
FROM test_array
WHERE i && '{1}'
ORDER BY i <=> '{1}' ASC;
```
Result:
```
i
-----------
 {1}
 {1,2}
 {1,2,3}
 {1,2,3,4}
```

Match documents against stored query rules

This example indexes a tsquery column to perform inverted matching: given a new document, find all stored query rules that the document satisfies. This pattern is useful for subscription or alert systems.

Create the query rule table and insert rules:

CREATE TABLE query (id serial, q tsquery, tag text);

INSERT INTO query (q, tag) VALUES
('supernova & star', 'sn'),
('black', 'color'),
('big & bang & black & hole', 'bang'),
('spiral & galaxy', 'shape'),
('black & hole', 'color');

Create a RUM index on the tsquery column:

CREATE INDEX query_idx ON query USING rum(q rum_tsquery_ops);

Match a new document against all stored rules:

SELECT *
FROM query
WHERE to_tsvector('black holes never exists before we think about them') @@ q;

Result:

id | q       | tag
---+---------+-------
 2 | 'black' | color

Limitations

Write performance and index size: The RUM index stores extra information, including word positions and attached-column data. This makes it larger than a GIN index and increases the overhead of INSERT and UPDATE operations. Evaluate the trade-off carefully for write-intensive tables where storage is a concern.
No prefix search with hash operator classes: Indexes created with rum_tsvector_hash_ops or rum_tsvector_hash_addon_ops do not support prefix search. These operator classes store hash values of lexemes rather than the original text.

References

rum on GitHub — official documentation and source code

Prerequisites

GIN vs. RUM: when to use each

Install the extension

Operators and operator classes

Operators

Operator classes for tsvector

Operator classes for arrays

Operator classes for other data types

Examples

Sort full-text search results by relevance

Sort by timestamp with full-text filtering

Query and sort arrays by similarity

Match documents against stored query rules

Limitations

References

Operator classes for `tsvector`