This topic describes how to use the RUM plug-in of ApsaraDB RDS for PostgreSQL to run full-text searches.

Prerequisites

Your RDS instance runs one of the following PostgreSQL versions:
  • PostgreSQL 13
  • PostgreSQL 12
  • PostgreSQL 11
  • PostgreSQL 10

Background information

Generalized Inverted Index (GIN) allows you to run full-text searches by using the tsvector and tsquery data types. However, this may produce the following issues:

  • Slow sorting

    ApsaraDB RDS for PostgreSQL can sort words only after it obtains the locations of the words. However, GIN does not store word locations. As a result, after ApsaraDB RDS for PostgreSQL runs a scan based on a GIN index, it must run another scan to retrieve the word locations.

  • Slow queries for phrases

    GIN can search for phrases only after it obtains the locations of the phrases.

  • Slow sorting of timestamps

    GIN does not store related information in indexes that contain morphemes. Therefore, an additional scan is required.

The RUM plug-in of ApsaraDB RDS for PostgreSQL is designed based on GIN. It allows you to store word or timestamp locations in RUM indexes.

However, the RUM plug-in requires more time than GIN to construct and insert indexes. This is because the RUM plug-in generates indexes based on write-ahead logging (WAL) logs and the generated RUM indexes contain more information than the keys that are used for encryption.

Universal operators

The RUM plug-in provides the following operators.

Operator Data type Description
tsvector <=> tsquery float4 Returns the distances between the data objects of the tsvector type and those of the tsquery type.
timestamp <=> timestamp float8 Returns the distance between two timestamps.
timestamp <=| timestamp float8 Returns only the distance to the left-side timestamp.
timestamp |=> timestamp float8 Returns only the distance to the right-side timestamp.
Note The last three operators are also supported for the following data types: timestamptz, int2, int4, int8, float4, float8, money, and oid.

The following sections describe the functions that are provided by the RUM plug-in.

rum_tsvector_ops

  • Supported data types: tsvector.
  • Description: This function stores phrases of the tsvector data type along with the locations of the phrases. This function allows you to sort phrases by using the <=> operator. This function also allows you to search for phrases based on prefixes.
  • Examples:

    1. Execute the following statements to create a table:
      CREATE TABLE test_rum(t text, a tsvector);
      CREATE TRIGGER tsvectorupdate
      BEFORE UPDATE OR INSERT ON test_rum
      FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger('a', 'pg_catalog.english', 't');
      INSERT INTO test_rum(t) VALUES ('The situation is most beautiful');
      INSERT INTO test_rum(t) VALUES ('It is a beautiful');
      INSERT INTO test_rum(t) VALUES ('It looks like a beautiful place');
    2. Execute the following statement to create the RUM plug-in:
      CREATE EXTENSION rum;
    3. Execute the following statement to create an index:
      CREATE INDEX rumidx ON test_rum USING rum (a rum_tsvector_ops);
    4. Run the following two types of queries:
      • SELECT t, a <=> to_tsquery('english', 'beautiful | place') AS rank
            FROM test_rum
            WHERE a @@ to_tsquery('english', 'beautiful | place')
            ORDER BY a <=> to_tsquery('english', 'beautiful | place');

        The following result is returned:

                        t                |  rank
        ---------------------------------+---------
         It looks like a beautiful place | 8.22467
         The situation is most beautiful | 16.4493
         It is a beautiful               | 16.4493
        (3 rows)
      • SELECT t, a <=> to_tsquery('english', 'place | situation') AS rank
            FROM test_rum
            WHERE a @@ to_tsquery('english', 'place | situation')
            ORDER BY a <=> to_tsquery('english', 'place | situation');

        The following result is returned:

                        t                |  rank
        ---------------------------------+---------
         The situation is most beautiful | 16.4493
         It looks like a beautiful place | 16.4493
        (2 rows)

rum_tsvector_hash_ops

  • Supported data types: tsvector.
  • Description: This function stores phrases of the tsvector data type along with the hash values and locations of the phrases. This function allows you to sort phrases by using the <=> operator. However, this function does not allow you to search for phrases based on prefixes.

rum_TYPE_ops

  • Supported data types:
    • The <, <=, =, >=, >, and <=> operators support the following data types: int2, int4, int8, float4, float8, money, oid, time, timetz, date, interval, macaddr, inet, cidr, text, varchar, char, bytea, bit, varbit, numeric, timestamp, and timestamptz.
    • The <=| and |=> operators support the following data types: int2, int4, int8, float4, float8, money, oid, timestamp, and timestamptz.
  • Description: This function can be used with the rum_tsvector_addon_ops, rum_tsvector_hash_addon_ops, and rum_anyarray_addon_ops functions.

rum_tsvector_addon_ops

  • Supported data types: tsvector.
  • Description: This function stores the word segmentation method that is used by the tsvector data type. This method also stores all the other word segmentation methods that are supported by the fields of the RUM plug-in.
  • Examples:

    1. Execute the following statements to create a table with an index:
      CREATE TABLE tsts (id int, t tsvector, d timestamp);
      \copy tsts from 'rum/data/tsts.data'
      CREATE INDEX tsts_idx ON tsts USING rum (t rum_tsvector_addon_ops, d)
          WITH (attach = 'd', to = 't');
    2. Run the following query:
      EXPLAIN (costs off)
          SELECT id, d, d <=> '2016-05-16 14:21:25' FROM tsts WHERE t @@ 'wr&qh' ORDER BY d <=> '2016-05-16 14:21:25' LIMIT 5;

      The following result is returned:

                                          QUERY PLAN
      -----------------------------------------------------------------------------------
       Limit
         ->  Index Scan using tsts_idx on tsts
               Index Cond: (t @@ '''wr'' & ''qh'''::tsquery)
               Order By: (d <=> 'Mon May 16 14:21:25 2016'::timestamp without time zone)
      (4 rows)
    3. Run the following query:
      SELECT id, d, d <=> '2016-05-16 14:21:25' FROM tsts WHERE t @@ 'wr&qh' ORDER BY d <=> '2016-05-16 14:21:25' LIMIT 5;

      The following result is returned:

       id  |                d                |   ? column?
      -----+---------------------------------+---------------
       355 | Mon May 16 14:21:22.326724 2016 |      2.673276
       354 | Mon May 16 13:21:22.326724 2016 |   3602.673276
       371 | Tue May 17 06:21:22.326724 2016 |  57597.326724
       406 | Wed May 18 17:21:22.326724 2016 | 183597.326724
       415 | Thu May 19 02:21:22.326724 2016 | 215997.326724
      (5 rows)
      Note When the RUM plug-in creates an index on a table whose data is sorted based on referenced additional information, errors may occur. This is because the RUM plug-in uses a suffix tree that has the following limits: Both edges and suffixes must be of a fixed length, and suffixes cannot have child nodes.

rum_tsvector_hash_addon_ops

  • Supported data types: tsvector.
  • Description: This function stores the hash values of the word libraries that are used by the tsvector data type. This function also stores the fields of the RUM plug-in. However, this function does not support prefix-based searches.

rum_tsquery_ops

  • Supported data types: tsquery.
  • Description: This function stores the branches of query trees.
  • Examples:

    1. Execute the following statements to create a table with an index:
      CREATE TABLE test_array (i int2[]);
      INSERT INTO test_array VALUES ('{}'), ('{0}'), ('{1,2,3,4}'), ('{1,2,3}'), ('{1,2}'), ('{1}');
      CREATE INDEX idx_array ON test_array USING rum (i rum_anyarray_ops);
    2. Run the following query:
      SET enable_seqscan TO off;
      
      EXPLAIN (COSTS OFF) SELECT * FROM test_array WHERE i && '{1}' ORDER BY i <=> '{1}' ASC;

      The following result is returned:

                      QUERY PLAN
      ------------------------------------------
       Index Scan using idx_array on test_array
         Index Cond: (i && '{1}'::smallint[])
         Order By: (i <=> '{1}'::smallint[])
      (3 rows)
    3. Run the following query:
      SELECT * FROM test_array WHERE i && '{1}' ORDER BY i <=> '{1}' ASC;

      The following result is returned:

           i
      -----------
       {1}
       {1,2}
       {1,2,3}
       {1,2,3,4}
      (4 rows)

rum_anyarray_ops

  • Supported data types: anyarray.
  • Description: This function stores anyarray elements at lengths that are similar to the lengths of arrays. This function supports the following operators: &&, @>, <@, =, and . This function also allows you to sort data by using the <=> operator.
  • Examples:

    1. Execute the following statements to create a table with an index:
      CREATE TABLE test_array (i int2[]);
      INSERT INTO test_array VALUES ('{}'), ('{0}'), ('{1,2,3,4}'), ('{1,2,3}'), ('{1,2}'), ('{1}');
      CREATE INDEX idx_array ON test_array USING rum (i rum_anyarray_ops);
    2. Run the following query:
      SET enable_seqscan TO off;
      
      EXPLAIN (COSTS OFF) SELECT * FROM test_array WHERE i && '{1}' ORDER BY i <=> '{1}' ASC;

      The following result is returned:

                      QUERY PLAN
      ------------------------------------------
       Index Scan using idx_array on test_array
         Index Cond: (i && '{1}'::smallint[])
         Order By: (i <=> '{1}'::smallint[])
      (3 rows)
    3. Run the following query:
      SELECT * FROM test_array WHERE i && '{1}' ORDER BY i <=> '{1}' ASC;

      The following result is returned:

           i
      -----------
       {1}
       {1,2}
       {1,2,3}
       {1,2,3,4}
      (4 rows)

rum_anyarray_addon_ops

  • Supported data types: anyarray.
  • Description: This function stores anyarray elements. This function also stores all the elements that are supported for the fields of the RUM plug-in.