2-gram full text search-pg_bigm - PolarDB

The pg_bigm extension creates a 2-gram Generalized Inverted Index (GIN) index to accelerate full-text search in PolarDB for PostgreSQL (Compatible with Oracle). It supports non-alphabet languages and delivers faster results for short keywords (1–2 characters) than the built-in pg_trgm extension.

Prerequisites

Before you begin, ensure that you have:

A PolarDB for PostgreSQL (Compatible with Oracle) cluster running one of the following revision versions:
- Version 2.0: revision version 2.0.14.2.0 or later
- Version 1.0: revision version 1.1.28 or later

To check the revision version of your cluster, run:

SHOW polar_version;

pg_bigm vs pg_trgm

pg_bigm is developed based on pg_trgm. The table below summarizes the key differences to help you choose.

Functionality	pg_trgm	pg_bigm
Phrase matching model	3-gram	2-gram
Index types	GIN and GiST (Generalized Search Tree)	GIN only
Operators	`LIKE`, `ILIKE`, `~`, `~*`	`LIKE` only
Non-alphabet full-text search	Not supported	Supported
Full-text search with 1–2 character keywords	Slow (1)	Fast
Similarity search	Supported	Supported
Maximum indexed column size	238,609,291 bytes (~228 MB)	107,374,180 bytes (~102 MB)

(1) pg_trgm falls back to a full index scan rather than a normal index scan for keywords of 1–2 characters, which is significantly slower.

Choose pg_bigm when you need to search non-alphabet text (such as CJK characters) or frequently search with short keywords of 1–2 characters.

Usage notes

The GIN index column size limit is 107,374,180 bytes (~102 MB). Creating an index on a larger column fails. For example:

CREATE TABLE t1 (description text);
CREATE INDEX t1_idx ON t1 USING gin (description gin_bigm_ops);
INSERT INTO t1 SELECT repeat('A', 107374181);  -- fails: exceeds the size limit

If your data is not ASCII-encoded, use UTF-8 encoding. To check the current encoding:

SELECT pg_encoding_to_char(encoding)
FROM pg_database
WHERE datname = current_database();

Basic operations

Enable the extension

CREATE EXTENSION pg_bigm;

Create a GIN index

When creating a GIN index with pg_bigm, specify the gin_bigm_ops operator class.

CREATE TABLE pg_tools (tool text, description text);

INSERT INTO pg_tools VALUES ('pg_hint_plan', 'Tool that allows a user to specify an optimizer HINT to PostgreSQL');
INSERT INTO pg_tools VALUES ('pg_dbms_stats', 'Tool that allows a user to stabilize planner statistics in PostgreSQL');
INSERT INTO pg_tools VALUES ('pg_bigm', 'Tool that provides 2-gram full text search capability in PostgreSQL');
INSERT INTO pg_tools VALUES ('pg_trgm', 'Tool that provides 3-gram full text search capability in PostgreSQL');

-- Single-column index
CREATE INDEX pg_tools_idx ON pg_tools USING gin (description gin_bigm_ops);

-- Multi-column index with FASTUPDATE disabled
CREATE INDEX pg_tools_multi_idx ON pg_tools USING gin (tool gin_bigm_ops, description gin_bigm_ops) WITH (FASTUPDATE = off);

Run a full-text search

Use the LIKE operator to search indexed columns.

SELECT * FROM pg_tools WHERE description LIKE '%search%';

Result:

   tool   |                             description
----------+---------------------------------------------------------------------
 pg_bigm  | Tool that provides 2-gram full text search capability in PostgreSQL
 pg_trgm  | Tool that provides 3-gram full text search capability in PostgreSQL
(2 rows)

Run a similarity search

Use the =% operator to find strings similar to a keyword.

SELECT tool FROM pg_tools WHERE tool =% 'bigm';

Result:

   tool
---------
 pg_bigm
(1 row)

Disable the extension

DROP EXTENSION pg_bigm;

Built-in functions

likequery

Generates a LIKE-compatible search string by wrapping the keyword in % and escaping any % characters in the input.

Syntax: likequery(keyword text) → text

Example 1: Escape a literal % in the input.

SELECT likequery('pg_bigm has improved the full text search performance by 200%');

Result:

                            likequery
-------------------------------------------------------------------
 %pg\_bigm has improved the full text search performance by 200\%%
(1 row)

Example 2: Use likequery in a WHERE clause to avoid manually writing % patterns.

SELECT * FROM pg_tools WHERE description LIKE likequery('search');

Result:

   tool   |                             description
----------+---------------------------------------------------------------------
 pg_bigm  | Tool that provides 2-gram full text search capability in PostgreSQL
 pg_trgm  | Tool that provides 3-gram full text search capability in PostgreSQL
(2 rows)

show_bigm

Returns an array of all 2-gram elements extracted from a string. The function pads the input with a leading and trailing space before extraction.

Syntax: show_bigm(string text) → text[]

Example:

SELECT show_bigm('full text search');

Result:

                          show_bigm
------------------------------------------------------------------
 {" f"," s"," t",ar,ch,ea,ex,fu,"h ","l ",ll,rc,se,"t ",te,ul,xt}
(1 row)

bigm_similarity

Returns a floating-point similarity score between two strings, based on their shared 2-gram elements. The score ranges from 0 (completely different) to 1 (identical).

Syntax: bigm_similarity(string1 text, string2 text) → float4

The function pads each string with a leading and trailing space before comparison, and the comparison is case-sensitive.

Examples:

SELECT bigm_similarity('full text search', 'text similarity search');

Result:

 bigm_similarity
-----------------
        0.571429
(1 row)

-- Space padding means 'A' shares the " A" bigram with 'ABC' but not the "AB" or "BC" bigrams
SELECT bigm_similarity('ABC', 'A');

Result:

 bigm_similarity
-----------------
            0.25
(1 row)

-- 'B' has no overlapping bigrams with 'ABC' after space padding
SELECT bigm_similarity('ABC', 'B');

Result:

 bigm_similarity
-----------------
               0
(1 row)

-- Case-sensitive: 'abc' and 'ABC' share no bigrams
SELECT bigm_similarity('ABC', 'abc');

Result:

 bigm_similarity
-----------------
               0
(1 row)

pg_gin_pending_stats

Returns the number of pages and tuples in the pending list of a GIN index.

Syntax: pg_gin_pending_stats(index regclass) → (pages int, tuples int)

If FASTUPDATE is set to false for the index, the GIN index has no pending list, and this function returns 0, 0.

Example:

SELECT * FROM pg_gin_pending_stats('pg_tools_idx');

Result:

 pages | tuples
-------+--------
     0 |      0
(1 row)

Configuration parameters

pg_bigm.enable_recheck

Controls whether to apply a recheck step after retrieving candidates from the GIN index.

Default: on

How full-text search works internally: pg_bigm splits the search keyword into 2-gram elements and uses the GIN index to retrieve candidate rows. Because the index lookup is approximate, the candidate set may include false positives — rows that contain the matching 2-grams but not the actual keyword. The recheck step filters out these false positives by evaluating each candidate against the original search condition.

For example, when searching for trial, both "He is awaiting trial" and "It was a trivial mistake" are retrieved as candidates because trivial contains all the 2-grams of trial (tr, ri, ia, al). With recheck enabled, only the correct result is returned.

Example — recheck enabled (default):

CREATE TABLE tbl (doc text);

INSERT INTO tbl VALUES('He is awaiting trial');
INSERT INTO tbl VALUES('It was a trivial mistake');

CREATE INDEX tbl_idx ON tbl USING gin (doc gin_bigm_ops);

SET enable_seqscan TO off;

EXPLAIN ANALYZE SELECT * FROM tbl WHERE doc LIKE likequery('trial');

Result:

                                                     QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tbl  (cost=20.00..24.01 rows=1 width=32) (actual time=0.020..0.021 rows=1 loops=1)
   Recheck Cond: (doc ~~ '%trial%'::text)
   Rows Removed by Index Recheck: 1
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on tbl_idx  (cost=0.00..20.00 rows=1 width=0) (actual time=0.013..0.013 rows=2 loops=1)
         Index Cond: (doc ~~ '%trial%'::text)
 Planning Time: 0.117 ms
 Execution Time: 0.043 ms
(8 rows)

The index returns 2 candidates; the recheck removes the false positive, leaving 1 correct result:

SELECT * FROM tbl WHERE doc LIKE likequery('trial');

Result:

         doc
----------------------
 He is awaiting trial
(1 row)

Example — recheck disabled:

With pg_bigm.enable_recheck = off, the recheck step is skipped and false positives are included in the results.

SET pg_bigm.enable_recheck = off;

SELECT * FROM tbl WHERE doc LIKE likequery('trial');

Result:

           doc
--------------------------
 He is awaiting trial
 It was a trivial mistake
(2 rows)

Keep pg_bigm.enable_recheck at its default value (on) to get accurate results.

pg_bigm.gin_key_limit

Sets the maximum number of 2-gram elements used during a GIN index scan.

Default: 0 (all 2-gram elements are used)

Using all 2-grams produces the most accurate candidate set, but long keywords generate many 2-grams and can increase the cost of the GIN index scan. Reduce this value to limit the number of 2-grams used and speed up the scan.

However, using fewer 2-grams increases the number of false positives in the candidate set, which in turn increases the workload on the recheck step. Tune this parameter only when GIN index scan performance is a bottleneck, and test the impact on result accuracy.

pg_bigm.similarity_limit

Sets the similarity threshold for similarity search. Only rows whose similarity score meets or exceeds this threshold are returned by the =% operator.