Use pg_bigm to accelerate 2-gram full-text search - PolarDB

The pg_bigm extension accelerates full-text search in PolarDB for PostgreSQL (Compatible with Oracle) by building a 2-gram Generalized Inverted Index (GIN). Unlike the 3-gram-based pg_trgm, pg_bigm supports non-alphabet languages such as Chinese and Japanese, and delivers faster searches on short keywords of 1–2 characters.

Prerequisites

Before you begin, ensure that your PolarDB for PostgreSQL (Compatible with Oracle) cluster runs one of the following engine versions:

PolarDB for Oracle 2.0 (revision version 2.0.14.2.0 or later)
PolarDB for Oracle 1.0 (revision version 1.1.28 or later)

To check your current revision version, run:

SHOW polar_version;

Choose between pg_bigm and pg_trgm

Both pg_bigm and pg_trgm support full-text search and similarity search. The following table summarizes the key differences to help you decide which extension fits your use case.

Feature	pg_trgm	pg_bigm
Phrase matching model	3-gram	2-gram
Index types	GIN and GiST	GIN only
Operators	`LIKE`, `ILIKE`, `~`, `~*`	`LIKE`
Non-alphabet full-text search	Not supported	Supported
Full-text search on 1–2 character keywords	Slow	Fast
Similarity search	Supported	Supported
Maximum indexed column size	~228 MB (238,609,291 bytes)	~102 MB (107,374,180 bytes)

When to use pg_bigm: Choose pg_bigm when your data contains non-alphabet characters (such as Chinese or Japanese), or when users frequently search with keywords of 1–2 characters.

When to use pg_trgm: Choose pg_trgm when you need broader operator support (ILIKE, ~, ~*) or larger indexed columns, and your data is primarily alphabetic.

Usage notes

Column size limit: A GIN index column cannot exceed 107,374,180 bytes (~102 MB). Inserting oversized data after index creation raises an error. The following statements demonstrate this behavior:
```
CREATE TABLE t1 (description text);
CREATE INDEX t1_idx ON t1 USING gin (description gin_bigm_ops);
INSERT INTO t1 SELECT repeat('A', 107374181);
```

Encoding: For non-ASCII data, use UTF-8 encoding. To check the current database encoding:

SELECT pg_encoding_to_char(encoding)
FROM pg_database
WHERE datname = current_database();

Get started

Install the extension

CREATE EXTENSION pg_bigm;

Create a GIN index

Specify gin_bigm_ops as the operator class when creating a GIN index on a column.

CREATE TABLE pg_tools (tool text, description text);

INSERT INTO pg_tools VALUES ('pg_hint_plan', 'Tool that allows a user to specify an optimizer HINT to PostgreSQL');
INSERT INTO pg_tools VALUES ('pg_dbms_stats', 'Tool that allows a user to stabilize planner statistics in PostgreSQL');
INSERT INTO pg_tools VALUES ('pg_bigm', 'Tool that provides 2-gram full text search capability in PostgreSQL');
INSERT INTO pg_tools VALUES ('pg_trgm', 'Tool that provides 3-gram full text search capability in PostgreSQL');

-- Single-column GIN index
CREATE INDEX pg_tools_idx ON pg_tools USING gin (description gin_bigm_ops);

-- Multi-column GIN index with FASTUPDATE disabled
CREATE INDEX pg_tools_multi_idx ON pg_tools USING gin (tool gin_bigm_ops, description gin_bigm_ops) WITH (FASTUPDATE = off);

If FASTUPDATE is set to off for a GIN index, the index has no pending list.

Run a full-text search

Use LIKE with % wildcards to search indexed columns:

SELECT * FROM pg_tools WHERE description LIKE '%search%';

Expected output:

  tool   |                             description
---------+---------------------------------------------------------------------
 pg_bigm | Tool that provides 2-gram full text search capability in PostgreSQL
 pg_trgm | Tool that provides 3-gram full text search capability in PostgreSQL
(2 rows)

Run a similarity search

Use the =% operator to find records similar to a keyword:

SELECT tool FROM pg_tools WHERE tool =% 'bigm';

Expected output:

  tool
---------
 pg_bigm
(1 row)

Uninstall the extension

DROP EXTENSION pg_bigm;

Functions

likequery

Generates a LIKE-compatible search string by wrapping the input with % and escaping any literal % characters with \.

Syntax: likequery(keyword text) → text

Example 1: Escape a keyword that contains a percent sign.

SELECT likequery('pg_bigm has improved the full text search performance by 200%');

Output:

                              likequery
-------------------------------------------------------------------
 %pg\_bigm has improved the full text search performance by 200\%%
(1 row)

Example 2: Use likequery directly in a query to avoid manually writing wildcards.

SELECT * FROM pg_tools WHERE description LIKE likequery('search');

Output:

  tool   |                             description
---------+---------------------------------------------------------------------
 pg_bigm | Tool that provides 2-gram full text search capability in PostgreSQL
 pg_trgm | Tool that provides 3-gram full text search capability in PostgreSQL
(2 rows)

show_bigm

Returns all 2-gram elements of a string as an array. The function adds a space before and after the string before extracting 2-grams.

Syntax: show_bigm(string text) → text[]

Example:

SELECT show_bigm('full text search');

Output:

                            show_bigm
------------------------------------------------------------------
 {" f"," s"," t",ar,ch,ea,ex,fu,"h ","l ",ll,rc,se,"t ",te,ul,xt}
(1 row)

bigm_similarity

Returns the similarity between two strings as a floating-point number between 0 and 1, where 0 means completely different and 1 means identical.

Syntax: bigm_similarity(string1 text, string2 text) → float4

Key behaviors:

Case-sensitive: 'ABC' and 'abc' have a similarity of 0. Note that pg_trgm's similarity function is not case-sensitive, so results differ between the two extensions.
Space padding: A space is added before and after each string before computing 2-grams. This means 'ABC' and 'B' have a similarity of 0, while 'ABC' and 'A' have a similarity of 0.25.

Examples:

SELECT bigm_similarity('full text search', 'text similarity search');

Output: 0.571429

SELECT bigm_similarity('ABC', 'A');

Output: 0.25

SELECT bigm_similarity('ABC', 'B');

Output: 0

SELECT bigm_similarity('ABC', 'abc');

Output: 0

pg_gin_pending_stats

Returns the number of pages and tuples in the pending list of a GIN index.

Syntax: pg_gin_pending_stats(index regclass) → (pages int, tuples int)

If FASTUPDATE is set to off for a GIN index, the index has no pending list and this function returns (0, 0).

Example:

SELECT * FROM pg_gin_pending_stats('pg_tools_idx');

Output:

 pages | tuples
-------+--------
     0 |      0
(1 row)

Parameters

pg_bigm.enable_recheck

Controls whether a recheck step filters index scan results against the actual query condition. Keep the default value on to get accurate results.

Default: on

When enable_recheck is on, the GIN index scan may return candidate rows that do not actually match the query. The recheck step eliminates these false positives. Setting it to off skips the recheck, which can return incorrect rows.

The following example demonstrates the difference.

Step 1: Create a table and index with test data.

CREATE TABLE tbl (doc text);

INSERT INTO tbl VALUES('He is awaiting trial');
INSERT INTO tbl VALUES('It was a trivial mistake');

CREATE INDEX tbl_idx ON tbl USING gin (doc gin_bigm_ops);

Step 2: Run a query with recheck enabled (default).

SET enable_seqscan TO off;

EXPLAIN ANALYZE SELECT * FROM tbl WHERE doc LIKE likequery('trial');

Output:

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tbl  (cost=20.00..24.01 rows=1 width=32) (actual time=0.020..0.021 rows=1 loops=1)
   Recheck Cond: (doc ~~ '%trial%'::text)
   Rows Removed by Index Recheck: 1
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on tbl_idx  (cost=0.00..20.00 rows=1 width=0) (actual time=0.013..0.013 rows=2 loops=1)
         Index Cond: (doc ~~ '%trial%'::text)
 Planning Time: 0.117 ms
 Execution Time: 0.043 ms
(8 rows)

The Rows Removed by Index Recheck: 1 line shows that the index returned 2 candidate rows but the recheck discarded 1 false positive — the GIN index matched both rows because they share the 2-gram tr and ia, but only one row actually contains the substring trial. The recheck filters out this false positive, leaving the correct result:

SELECT * FROM tbl WHERE doc LIKE likequery('trial');

         doc
----------------------
 He is awaiting trial
(1 row)

Step 3: Run the same query with recheck disabled.

SET pg_bigm.enable_recheck = off;

SELECT * FROM tbl WHERE doc LIKE likequery('trial');

           doc
--------------------------
 He is awaiting trial
 It was a trivial mistake
(2 rows)

With recheck disabled, both rows are returned even though 'It was a trivial mistake' does not match 'trial'. Disable recheck only if you understand and accept the risk of false positives.

pg_bigm.gin_key_limit

Sets the maximum number of 2-gram elements used when searching. The default value 0 means all 2-gram elements are used.

Default: 0 (use all 2-gram elements)

If a query with many 2-gram elements causes performance issues, lower this value to limit the number of elements used in the index lookup.

pg_bigm.similarity_limit

Sets the similarity threshold for the =% operator. Only rows with a similarity score above this threshold are returned as similarity search results.