This topic describes how to use the RUM plug-in of ApsaraDB RDS for PostgreSQL to run full-text searches.
Prerequisites
- PostgreSQL 13
- PostgreSQL 12
- PostgreSQL 11
- PostgreSQL 10
Background information
Generalized Inverted Index (GIN) allows you to run full-text searches by using the tsvector and tsquery data types. However, this may produce the following issues:
- Slow sorting
ApsaraDB RDS for PostgreSQL can sort words only after it obtains the locations of the words. However, GIN does not store word locations. As a result, after ApsaraDB RDS for PostgreSQL runs a scan based on a GIN index, it must run another scan to retrieve the word locations.
- Slow queries for phrases
GIN can search for phrases only after it obtains the locations of the phrases.
- Slow sorting of timestamps
GIN does not store related information in indexes that contain morphemes. Therefore, an additional scan is required.
The RUM plug-in of ApsaraDB RDS for PostgreSQL is designed based on GIN. It allows you to store word or timestamp locations in RUM indexes.
However, the RUM plug-in requires more time than GIN to construct and insert indexes. This is because the RUM plug-in generates indexes based on write-ahead logging (WAL) logs and the generated RUM indexes contain more information than the keys that are used for encryption.
Universal operators
The RUM plug-in provides the following operators.
Operator | Data type | Description |
---|---|---|
tsvector <=> tsquery | float4 | Returns the distances between the data objects of the tsvector type and those of the tsquery type. |
timestamp <=> timestamp | float8 | Returns the distance between two timestamps. |
timestamp <=| timestamp | float8 | Returns only the distance to the left-side timestamp. |
timestamp |=> timestamp | float8 | Returns only the distance to the right-side timestamp. |
The following sections describe the functions that are provided by the RUM plug-in.
rum_tsvector_ops
- Supported data types: tsvector.
- Description: This function stores phrases of the tsvector data type along with the
locations of the phrases. This function allows you to sort phrases by using the
<=>
operator. This function also allows you to search for phrases based on prefixes. -
Examples:
- Execute the following statements to create a table:
CREATE TABLE test_rum(t text, a tsvector); CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON test_rum FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger('a', 'pg_catalog.english', 't'); INSERT INTO test_rum(t) VALUES ('The situation is most beautiful'); INSERT INTO test_rum(t) VALUES ('It is a beautiful'); INSERT INTO test_rum(t) VALUES ('It looks like a beautiful place');
- Execute the following statement to create the RUM plug-in:
CREATE EXTENSION rum;
- Execute the following statement to create an index:
CREATE INDEX rumidx ON test_rum USING rum (a rum_tsvector_ops);
- Run the following two types of queries:
SELECT t, a <=> to_tsquery('english', 'beautiful | place') AS rank FROM test_rum WHERE a @@ to_tsquery('english', 'beautiful | place') ORDER BY a <=> to_tsquery('english', 'beautiful | place');
The following result is returned:
t | rank ---------------------------------+--------- It looks like a beautiful place | 8.22467 The situation is most beautiful | 16.4493 It is a beautiful | 16.4493 (3 rows)
SELECT t, a <=> to_tsquery('english', 'place | situation') AS rank FROM test_rum WHERE a @@ to_tsquery('english', 'place | situation') ORDER BY a <=> to_tsquery('english', 'place | situation');
The following result is returned:
t | rank ---------------------------------+--------- The situation is most beautiful | 16.4493 It looks like a beautiful place | 16.4493 (2 rows)
- Execute the following statements to create a table:
rum_tsvector_hash_ops
- Supported data types: tsvector.
- Description: This function stores phrases of the tsvector data type along with the
hash values and locations of the phrases. This function allows you to sort phrases
by using the
<=>
operator. However, this function does not allow you to search for phrases based on prefixes.
rum_TYPE_ops
- Supported data types:
- The
<
,<=
,=
,>=
,>
, and<=>
operators support the following data types: int2, int4, int8, float4, float8, money, oid, time, timetz, date, interval, macaddr, inet, cidr, text, varchar, char, bytea, bit, varbit, numeric, timestamp, and timestamptz. - The
<=|
and|=>
operators support the following data types: int2, int4, int8, float4, float8, money, oid, timestamp, and timestamptz.
- The
- Description: This function can be used with the rum_tsvector_addon_ops, rum_tsvector_hash_addon_ops, and rum_anyarray_addon_ops functions.
rum_tsvector_addon_ops
- Supported data types: tsvector.
- Description: This function stores the word segmentation method that is used by the tsvector data type. This method also stores all the other word segmentation methods that are supported by the fields of the RUM plug-in.
-
Examples:
- Execute the following statements to create a table with an index:
CREATE TABLE tsts (id int, t tsvector, d timestamp); \copy tsts from 'rum/data/tsts.data' CREATE INDEX tsts_idx ON tsts USING rum (t rum_tsvector_addon_ops, d) WITH (attach = 'd', to = 't');
- Run the following query:
EXPLAIN (costs off) SELECT id, d, d <=> '2016-05-16 14:21:25' FROM tsts WHERE t @@ 'wr&qh' ORDER BY d <=> '2016-05-16 14:21:25' LIMIT 5;
The following result is returned:
QUERY PLAN ----------------------------------------------------------------------------------- Limit -> Index Scan using tsts_idx on tsts Index Cond: (t @@ '''wr'' & ''qh'''::tsquery) Order By: (d <=> 'Mon May 16 14:21:25 2016'::timestamp without time zone) (4 rows)
- Run the following query:
SELECT id, d, d <=> '2016-05-16 14:21:25' FROM tsts WHERE t @@ 'wr&qh' ORDER BY d <=> '2016-05-16 14:21:25' LIMIT 5;
The following result is returned:
id | d | ? column? -----+---------------------------------+--------------- 355 | Mon May 16 14:21:22.326724 2016 | 2.673276 354 | Mon May 16 13:21:22.326724 2016 | 3602.673276 371 | Tue May 17 06:21:22.326724 2016 | 57597.326724 406 | Wed May 18 17:21:22.326724 2016 | 183597.326724 415 | Thu May 19 02:21:22.326724 2016 | 215997.326724 (5 rows)
Note When the RUM plug-in creates an index on a table whose data is sorted based on referenced additional information, errors may occur. This is because the RUM plug-in uses a suffix tree that has the following limits: Both edges and suffixes must be of a fixed length, and suffixes cannot have child nodes.
- Execute the following statements to create a table with an index:
rum_tsvector_hash_addon_ops
- Supported data types: tsvector.
- Description: This function stores the hash values of the word libraries that are used by the tsvector data type. This function also stores the fields of the RUM plug-in. However, this function does not support prefix-based searches.
rum_tsquery_ops
- Supported data types: tsquery.
- Description: This function stores the branches of query trees.
-
Examples:
- Execute the following statements to create a table with an index:
CREATE TABLE test_array (i int2[]); INSERT INTO test_array VALUES ('{}'), ('{0}'), ('{1,2,3,4}'), ('{1,2,3}'), ('{1,2}'), ('{1}'); CREATE INDEX idx_array ON test_array USING rum (i rum_anyarray_ops);
- Run the following query:
SET enable_seqscan TO off; EXPLAIN (COSTS OFF) SELECT * FROM test_array WHERE i && '{1}' ORDER BY i <=> '{1}' ASC;
The following result is returned:
QUERY PLAN ------------------------------------------ Index Scan using idx_array on test_array Index Cond: (i && '{1}'::smallint[]) Order By: (i <=> '{1}'::smallint[]) (3 rows)
- Run the following query:
SELECT * FROM test_array WHERE i && '{1}' ORDER BY i <=> '{1}' ASC;
The following result is returned:
i ----------- {1} {1,2} {1,2,3} {1,2,3,4} (4 rows)
- Execute the following statements to create a table with an index:
rum_anyarray_ops
- Supported data types: anyarray.
- Description: This function stores anyarray elements at lengths that are similar to
the lengths of arrays. This function supports the following operators:
&&
,@>
,<@
,=
, and%
. This function also allows you to sort data by using the<=>
operator. -
Examples:
- Execute the following statements to create a table with an index:
CREATE TABLE test_array (i int2[]); INSERT INTO test_array VALUES ('{}'), ('{0}'), ('{1,2,3,4}'), ('{1,2,3}'), ('{1,2}'), ('{1}'); CREATE INDEX idx_array ON test_array USING rum (i rum_anyarray_ops);
- Run the following query:
SET enable_seqscan TO off; EXPLAIN (COSTS OFF) SELECT * FROM test_array WHERE i && '{1}' ORDER BY i <=> '{1}' ASC;
The following result is returned:
QUERY PLAN ------------------------------------------ Index Scan using idx_array on test_array Index Cond: (i && '{1}'::smallint[]) Order By: (i <=> '{1}'::smallint[]) (3 rows)
- Run the following query:
SELECT * FROM test_array WHERE i && '{1}' ORDER BY i <=> '{1}' ASC;
The following result is returned:
i ----------- {1} {1,2} {1,2,3} {1,2,3,4} (4 rows)
- Execute the following statements to create a table with an index:
rum_anyarray_addon_ops
- Supported data types: anyarray.
- Description: This function stores anyarray elements. This function also stores all the elements that are supported for the fields of the RUM plug-in.