Accelerate text similarity and fuzzy searches with pg_trgm - AnalyticDB

The pg_trgm extension adds text similarity search to AnalyticDB for PostgreSQL. It is particularly useful when:

A search term may be misspelled or imprecise
You need to rank results by how closely they match a query string
You want to accelerate LIKE '%...%' fuzzy searches with an index

Version requirements

AnalyticDB for PostgreSQL V6.0 instances of V6.3.8.9 or later support the pg_trgm extension.
AnalyticDB for PostgreSQL V7.0 instances of V7.0.2 or later support the pg_trgm extension.

To check your instance's minor version, see View the minor engine version.

Install the extension

Install pg_trgm on the Extensions page of your instance before use. For instructions, see Install, update, and uninstall extensions.

Key concepts

Trigrams

A trigram is a group of three consecutive characters extracted from a string. pg_trgm measures string similarity by counting the number of trigrams that two strings share.

A trigram with fewer than three characters is padded with up to two leading spaces and one trailing space.

For example:

SELECT show_trgm('abc');

Output:

        show_trgm
-------------------------
 {"  a"," ab","abc","bc "}
(1 row)

Similarity score

The similarity() function returns a value between 0 and 1:

0 — the strings share no trigrams
1 — the strings are identical

Distance

Distance is the complement of similarity: distance = 1 − similarity(). The <-> operator returns the distance between two strings, so a lower value means a closer match.

Perform a text similarity search

After installing pg_trgm, use trigram matching to find strings similar to a query term and rank results from best match to worst.

Create a test table:

CREATE TABLE test_trgm (t text);
INSERT INTO test_trgm values('word'), ('This is a pg_trgm test'), ('word test'), ('w0rd'), ('test word');

Method 1: Filter and rank by similarity

Use the similarity() function and the % operator to find rows whose similarity to the query term exceeds the pg_trgm.similarity_threshold (default: 0.3), then sort by score.

SELECT t, similarity(t, 'word') AS sml
  FROM test_trgm
  WHERE t % 'word'
  ORDER BY sml DESC, t;

Output:

     t     | sml
-----------+-----
 word      |   1
 test word | 0.5
 word test | 0.5
(3 rows)

Method 2: Order by distance

Use the <-> operator to sort all rows by distance from the query term, with the closest match first.

SELECT t, t <-> 'word' AS dist
  FROM test_trgm
  ORDER BY dist LIMIT 10;

Output:

           t            | dist
------------------------+------
 word                   |    0
 word test              |  0.5
 test word              |  0.5
 w0rd                   | 0.75
 This is a pg_trgm test |    1
(5 rows)

Use Method 1 when you want to filter out low-relevance results. Use Method 2 when you want a ranked list of all rows regardless of similarity score.

Accelerate fuzzy searches with an index

The LIKE operator does not use indexes, so fuzzy searches on large tables require a full sequential scan. pg_trgm provides Generalized Search Tree (GiST) and Generalized Inverted Index (GIN) index operators that let you create an index on a text column to speed up LIKE '%...%' queries.

Without an index, a LIKE '%abcd%' query on a table with 1,000,000 rows performs a sequential scan:

CREATE TABLE test_trgm (t text);
INSERT INTO test_trgm
SELECT md5(random()::text) FROM generate_series(1,1000000) i;

EXPLAIN ANALYZE SELECT * FROM test_trgm WHERE t LIKE '%abcd%';

                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..259.63 rows=422 width=32) (actual time=137.606..179.674 rows=431 loops=1)
   ->  Seq Scan on test_trgm  (cost=0.00..254.00 rows=141 width=32) (actual time=0.961..136.977 rows=146 loops=1)
         Filter: (t ~~ '%abcd%'::text)
         Rows Removed by Filter: 333458
 Optimizer: Postgres-based planner
 Planning Time: 0.328 ms
   (slice0)    Executor memory: 37K bytes.
   (slice1)    Executor memory: 36K bytes avg x 3 workers, 36K bytes max (seg0).
 Memory used:  128000kB
 Execution Time: 180.533 ms
(10 rows)

With a GIN index, the same query uses a Bitmap Index Scan and runs significantly faster:

CREATE INDEX trgm_idx ON test_trgm USING GIN (t gin_trgm_ops);

EXPLAIN ANALYZE SELECT * FROM test_trgm WHERE t LIKE '%abcd%';

                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=48.67..882.67 rows=8000 width=32) (actual time=4.353..4.550 rows=431 loops=1)
   ->  Bitmap Heap Scan on test_trgm  (cost=48.67..776.00 rows=2667 width=32) (actual time=1.612..3.091 rows=146 loops=1)
         Recheck Cond: (t ~~ '%abcd%'::text)
         Rows Removed by Index Recheck: 10
         ->  Bitmap Index Scan on trgm_idx  (cost=0.00..48.00 rows=2667 width=0) (actual time=1.536..1.537 rows=163 loops=1)
               Index Cond: (t ~~ '%abcd%'::text)
 Optimizer: Postgres-based planner
 Planning Time: 1.353 ms
   (slice0)    Executor memory: 44K bytes.
   (slice1)    Executor memory: 2438K bytes avg x 3 workers, 2438K bytes max (seg0).
 Memory used:  128000kB
 Execution Time: 5.385 ms
(12 rows)

Execution time drops from 180.533 ms to 5.385 ms — a reduction of more than 33x.

Reference

GUC parameters

Grand Unified Configuration (GUC) parameters control the similarity thresholds used by pg_trgm operators.

`pg_trgm.similarity_threshold`

Controls the minimum similarity score required for the % operator to return true. Default: 0.3. Valid range: 0 to 1.

Raise this value to require a closer match before rows are included in results. Lower it to cast a wider net and tolerate more variation.

`pg_trgm.word_similarity_threshold`

Controls the minimum word similarity score required for the <% and %> operators to return true. Default: 0.6. Valid range: 0 to 1.

Raise this value to require a closer word-level match. Lower it to match more loosely against partial words.

Functions

Function	Return type	Description
`similarity(text, text)`	`real`	Returns a similarity score from 0 (no shared trigrams) to 1 (identical strings).
`show_trgm(text)`	`text[]`	Returns an array of all trigrams in the string. Primarily useful for debugging.
`word_similarity(text, text)`	`real`	Returns the greatest similarity between the trigrams in the first string and any continuous extent of an ordered trigram set in the second string. Returns 0 (no match) to 1 (exact match).

Operators

Operator	Return type	Description
`text % text`	boolean	Returns `true` if the similarity score exceeds `pg_trgm.similarity_threshold`.
`text <% text`	boolean	Returns `true` if the word similarity of the first string against the second exceeds `pg_trgm.word_similarity_threshold`.
`text %> text`	boolean	Commutator of `<%`.
`text <-> text`	`real`	Returns the distance between two strings (1 − `similarity()`).
`text <<-> text`	`real`	Returns the word distance between two strings (1 − `word_similarity()`).
`text <->> text`	`real`	Commutator of `<<->`.

Index operators

Operator	Description
`gist_trgm_ops`	Converts text to a trigram set and stores it in a GiST index.
`gin_trgm_ops`	Converts text to a trigram set and stores it in a GIN index.

What's next

Install, update, and uninstall extensions
View the minor engine version
For the full set of pg_trgm capabilities supported in ApsaraDB RDS for PostgreSQL, see Supported extensions.