Use the pg_bigm extension to perform full-text search - AnalyticDB

pg_bigm is an open source third-party extension that is supported by AnalyticDB for PostgreSQL. The pg_bigm extension provides full-text search capabilities and allows you to create a 2-gram Generalized Inverted Index (GIN) index to accelerate the search process.

The pg_bigm extension extracts trigrams that consist of two characters from text. A trigram that has fewer than two characters contains a prefix or a suffix of one space. Example:

postgres=> SELECT show_bigm('full text search');
show_bigm                            
------------------------------------------------------------------
 {" f"," s"," t",ar,ch,ea,ex,fu,"h ","l ",ll,rc,se,"t ",te,ul,xt}
(1 row)

Usage notes

Only the following minor versions of AnalyticDB for PostgreSQL instances in elastic storage mode support the pg_bigm extension:
- AnalyticDB for PostgreSQL V6.0: V6.6.2.1 or later.
- AnalyticDB for PostgreSQL V7.0: V7.0.6.1 or later.
You can create a GIN index on a column whose length is not larger than 107,374,180 bytes (approximately 102 MB). Example:
```
postgres=> CREATE TABLE t1 (description text);
CREATE TABLE

postgres=> CREATE INDEX t1_idx ON t1 USING gin (description gin_bigm_ops);
CREATE INDEX

postgres=> INSERT INTO t1 SELECT repeat('A', 107374181);
ERROR:OUT OF memory
```
If an AnalyticDB for PostgreSQL database does not store data in the ASCII format, we recommend that you change the encoding format of the database to UTF-8. You can execute the following statement to query the encoding format of the current AnalyticDB for PostgreSQL database:
```
SELECT pg_encoding_to_char(encoding)
FROM pg_database
WHERE datname = current_database();
```

Install the extension

Install the pg_bigm extension on the Extensions page of an AnalyticDB for PostgreSQL instance. For more information, see Install, update, and uninstall extensions.

Use the extension

Create an index

The following sample code provides an example on how to create an index:

CREATE TABLE pg_tools (tool text, description text);

INSERT INTO pg_tools VALUES ('pg_hint_plan', 'Tool that allows a user to specify an optimizer HINT to PostgreSQL');
INSERT INTO pg_tools VALUES ('pg_dbms_stats', 'Tool that allows a user to stabilize planner statistics in PostgreSQL');
INSERT INTO pg_tools VALUES ('pg_bigm', 'Tool that provides 2-gram full text search capability in PostgreSQL');
INSERT INTO pg_tools VALUES ('pg_trgm', 'Tool that provides 3-gram full text search capability in PostgreSQL');

CREATE INDEX pg_tools_idx ON pg_tools USING gin (description gin_bigm_ops);
CREATE INDEX pg_tools_multi_idx ON pg_tools USING gin (tool gin_bigm_ops, description gin_bigm_ops) WITH (FASTUPDATE = off);

Perform full-text search

The following sample code provides an example on how to use the pg_bigm extension to perform full-text search:

postgres=> SELECT * FROM pg_tools WHERE description LIKE '%search%';
 tool   |  description
---------+---------------------------------------------------------------------
 pg_bigm | Tool that provides 2-gram FULL text SEARCH capability IN PostgreSQL
 pg_trgm | Tool that provides 3-gram FULL text SEARCH capability IN PostgreSQL
(2 ROWS)

Perform similarity search

The following sample code provides an example on how to use the =% operator to perform similarity search:

postgres=> SET pg_bigm.similarity_limit TO 0.2;
SET
postgres=> SELECT tool FROM pg_tools WHERE tool =% 'bigm';
tool
---------
 pg_bigm
 pg_trgm
(2 ROWS)

Common functions provided by the extension

likequery()

You can use the slikequery() function of the pg_bigm extension and the % operator to return a string that can be identified by the LIKE keyword. For more information about the likequery() function, see the "Appendixes" section of this topic. Example:

postgres=> SELECT likequery('pg_bigm has improved the full text search performance by 200%');
likequery
-------------------------------------------------------------------
 %pg\_bigm has improved the FULL text SEARCH performance BY 200\%%
(1 row)

postgres=> SELECT * FROM pg_tools WHERE description LIKE likequery('search');
 tool   | description
---------+---------------------------------------------------------------------
 pg_bigm | Tool that provides 2-gram FULL text SEARCH capability IN PostgreSQL
 pg_trgm | Tool that provides 3-gram FULL text SEARCH capability IN PostgreSQL
(2 ROWS)

show_bigm()

You can use the show_bigm() function of the pg_bigm extension to return an array of all 2-gram elements in a string. For more information about the show_bigm() function, see the "Appendixes" section of this topic. Example:

postgres=> SELECT show_bigm('full text search');
show_bigm
------------------------------------------------------------------
 {" f"," s"," t",ar,ch,ea,ex,fu,"h ","l ",ll,rc,se,"t ",te,ul,xt}
(1 ROW)

bigm_similarity()

You can use the bigm_similarity() function of the pg_bigm extension to return the greatest similarity between two strings.

Important

When you use the 2-gram algorithm, a string may contain a prefix or a suffix of one space. Take note of the following items:
- The similarity between 'ABC' and 'B' is 0.
- The similarity between 'ABC' and 'A' is 0.25.
The bigm_similarity() function is case-sensitive. For example, the similarity between 'ABC' and 'abc' is 0.

Example:

postgres=> SELECT bigm_similarity('full text search', 'text similarity search');
 bigm_similarity
-----------------
 0.5714286
(1 ROW)

postgres=> SELECT bigm_similarity('ABC', 'A');
 bigm_similarity
-----------------
 0.25
(1 ROW)

postgres=> SELECT bigm_similarity('ABC', 'B');
 bigm_similarity
-----------------
 0
(1 ROW)

postgres=> SELECT bigm_similarity('ABC', 'abc');
 bigm_similarity
-----------------
 0
(1 ROW)

pg_gin_pending_stats()

You can use the pg_gin_pending_stats() function of the pg_bigm extension to return the numbers of pages and tuples in the pending list of a GIN index. Example:

postgres=> SELECT * FROM pg_gin_pending_stats('pg_tools_idx');
 pages | tuples
-------+--------
     0 |      0
(1 ROW)

When you create a GIN index, you can set the fastupdate parameter to off. In this case, the index does not have a pending list and 0 is returned. Example:

CREATE INDEX ON your_table_name (your_column_name gin_trgm_ops) WITH (fastupdate = off);

Appendixes

Grand Unified Configuration (GUC) parameters

pg_bigm.last_update: the date on which the pg_bigm extension was last updated. This is a read-only parameter.

pg_bigm.enable_recheck: specifies whether to perform a recheck. Default value: on. You can specify the parameter based on your business requirements. We recommend that you retain the default value to ensure correct results.

pg_bigm.gin_key_limit: the maximum number of 2-gram elements that can be used for full-text search. You can specify the parameter based on your business requirements. Default value: 0, which specifies that all 2-gram elements are used. If the performance degrades when all 2-gram elements are used, you can set the parameter to a small positive value to improve performance.

pg_bigm.similarity_limit: the minimum threshold that is used for similarity search. The tuples whose similarity exceeds the specified threshold are returned as similarity search results.

Functions

Function	Type of the return value	Description
`likequery`(text)	string	Returns a string that can be identified by the `LIKE` keyword. The result of this function ranges from 0 to 1. A value of 0 indicates that the two strings are completely dissimilar. A value of 1 indicates that the two strings are identical.
`show_bigm`(text)	text[ ]	Returns an array of all 2-gram elements in a string.
`bigm_similarity`(text, text)	real	Returns a floating-point number that indicates the greatest similarity between 2-gram elements in two strings. The number ranges from 0 to 1. A value of 0 indicates that 2-gram elements in the two strings are completely dissimilar. A value of 1 indicates that the 2-gram element set in the first string and a continuous extent of an ordered 2-gram element set in the second string are identical.

Operators

Operator	Type of the return value	Description
text % text	boolean	Returns true if the first string is similar to the second string. This operator is similar to the `LIKE` and `ILIKE` operators in SQL statements and allows you to use 2-gram indexes to accelerate search.
text =% text	boolean	Returns true if the similarity between a trigram in the first string and a trigram in the second string is equal to the threshold that is specified by the `pg_bigm.similarity_limit` parameter.

Index operators

Operator	Description
gin_bigm_ops	Converts text data into a trigram set and uses a GIN index to store the trigram set.