All Products
Search
Document Center

AnalyticDB for PostgreSQL:pg_trgm

Last Updated:Jan 02, 2024

The pg_trgm extension provides functions and operators that can be used to determine the similarity of text and index operators that support fast search for similar strings. You can use database search tools or indexes to accelerate fuzzy searches for text.

Overview

The pg_trgm extension introduces the trigram concept. A trigram consists of three consecutive characters that are taken from a string. pg_trgm extracts trigrams that consist of three characters. A trigram that has less than three characters contains a prefix of up to two spaces or a suffix of only one space. Example:

postgres=# SELECT show_trgm('abc');
        show_trgm
-------------------------
 {"  a"," ab","abc","bc "}
(1 row)

You can measure the similarity between two strings by counting the number of trigrams that the strings share. This idea is very effective for measuring the similarity of words in many natural languages. This topic describes how to use the pg_trgm extension.

Usage notes

  • You cannot create the pg_trgm extension. To install or upgrade the pg_trgm extension, contact technical support.

  • AnalyticDB for PostgreSQL V6.0 instances of V6.3.8.9 or later support the pg_trgm extension.

  • AnalyticDB for PostgreSQL V7.0 instances of V7.0.2 or later support the pg_trgm extension.

Note

For information about how to view the minor version of an instance, see View the minor engine version.

Use trigram matching to perform a text similarity search

After the pg_trgm extension is created, you can use trigram matching to measure the similarity of text and return search results in an order from the best match to the worst match. For example, create a test table and insert data into the table.

CREATE TABLE test_trgm (t text);
INSERT INTO test_trgm values('word'), ('This is a pg_trgm test'), ('word test'), ('w0rd'), ('test word');

You can use one of the following methods to perform a text similarity search based on trigram matching:

  • Method 1: Use the similarity() function and the % operator in the pg_trgm extension. For example, query the similarity between the values of column t and word. For more information about the similarity() function and the % operator, see the "Appendixes" section of this topic.

    SELECT t, similarity(t, 'word') AS sml
      FROM test_trgm
      WHERE t % 'word'
      ORDER BY sml DESC, t;

    The column values that contain word are displayed in descending order of similarity.

         t     | sml
    -----------+-----
     word      |   1
     test word | 0.5
     word test | 0.5
    (3 rows)
  • Method 2: Use the <-> operator in the pg_trgm extension. For example, query the distance between the values of column t and word. The distance is the opposite of similarity. The column values are displayed in order of distance from the nearest to the farthest. For more information about the <-> operator, see the "Appendixes" section of this topic.

    postgres=# SELECT t, t <-> 'word' AS dist
      FROM test_trgm
      ORDER BY dist LIMIT 10;
               t            | dist
    ------------------------+------
     word                   |    0
     word test              |  0.5
     test word              |  0.5
     w0rd                   | 0.75
     This is a pg_trgm test |    1
    (5 rows)

Use indexes to accelerate fuzzy searches

Before the pg_trgm extension became available, the LIKE operator can be used to perform fuzzy searches. However, the LIKE operator does not support indexes and provides poor query performance. The pg_trgm extension provides GiST and GIN index operators that allow you to create an index for a text column to accelerate similarity searches. The following example describes how to use the pg_trgm extension and the GIN index to accelerate fuzzy searches:

Create a test table and insert data into the table.

CREATE TABLE test_trgm (t text);
INSERT INTO test_trgm 
SELECT md5(random()::text) FROM generate_series(1,1000000) i; 

Use the EXPLAIN ANALYZE statement to perform a fuzzy search on the table. The execution plan shows that sequential scans are performed on the entire table. This results in poor query performance in scenarios that involve large amounts of data.

postgres=# explain analyze  SELECT * FROM test_trgm WHERE t LIKE '%abcd%';
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..259.63 rows=422 width=32) (actual time=137.606..179.674 rows=431 loops=1)
   ->  Seq Scan on test_trgm  (cost=0.00..254.00 rows=141 width=32) (actual time=0.961..136.977 rows=146 loops=1)
         Filter: (t ~~ '%abcd%'::text)
         Rows Removed by Filter: 333458
 Optimizer: Postgres-based planner
 Planning Time: 0.328 ms
   (slice0)    Executor memory: 37K bytes.
   (slice1)    Executor memory: 36K bytes avg x 3 workers, 36K bytes max (seg0).
 Memory used:  128000kB
 Execution Time: 180.533 ms
(10 rows)

Use the gin_trgm_ops index operator in the pg_trgm extension to create a GIN index. For more information about the gin_trgm_ops index operator, see the "Appendixes" section of this topic.

CREATE INDEX trgm_idx ON test_trgm USING GIN (t gin_trgm_ops);

Use the EXPLAIN ANALYZE statement to perform a fuzzy search on the table. The execution plan shows that bitmap index scans are performed. The query performance is significantly improved.

postgres=# explain analyze  SELECT * FROM test_trgm WHERE t LIKE '%abcd%';
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=48.67..882.67 rows=8000 width=32) (actual time=4.353..4.550 rows=431 loops=1)
   ->  Bitmap Heap Scan on test_trgm  (cost=48.67..776.00 rows=2667 width=32) (actual time=1.612..3.091 rows=146 loops=1)
         Recheck Cond: (t ~~ '%abcd%'::text)
         Rows Removed by Index Recheck: 10
         ->  Bitmap Index Scan on trgm_idx  (cost=0.00..48.00 rows=2667 width=0) (actual time=1.536..1.537 rows=163 loops=1)
               Index Cond: (t ~~ '%abcd%'::text)
 Optimizer: Postgres-based planner
 Planning Time: 1.353 ms
   (slice0)    Executor memory: 44K bytes.
   (slice1)    Executor memory: 2438K bytes avg x 3 workers, 2438K bytes max (seg0).
 Memory used:  128000kB
 Execution Time: 5.385 ms
(12 rows)

Appendixes

GUC parameters

pg_trgm.similarity_threshold

Specifies the current similarity threshold that is used by the % operator. The threshold must be between 0 and 1. Default value: 0.3.

pg_trgm.word_similarity_threshold

Specifies the current word similarity threshold that is used by the <% and %> operators. The threshold must be between 0 and 1. Default value: 0.6.

Functions

Function

Type of the return value

Description

similarity(text, text)

real

Returns a number that indicates the similarity between the two strings. The number ranges from 0 to 1.

  • 0 indicates that the two strings are completely dissimilar.

  • 1 indicates that the two strings are identical.

show_trgm(text)

text[]

Returns an array of all trigrams in the string. In actual scenarios, this function is seldom useful except for debugging.

word_similarity(text, text)

real

Returns a number that indicates the greatest similarity between the trigram set in the first string and a continuous extent of an ordered trigram set in the second string. The number ranges from 0 to 1.

  • 0 indicates that the two trigram sets are completely dissimilar.

  • 1 indicates that the trigram set in the first string and a continuous extent of an ordered trigram set in the second string are identical.

Operators

Operator

Type of the return value

Description

text % text

boolean

Returns true if the result of the function is greater than the threshold that is specified by the pg_trgm.similarity_threshold parameter.

text <% text

boolean

Returns true if the similarity between the trigram set in the first string and a continuous extent of an ordered trigram set in the second string is greater than the threshold that is specified by the pg_trgm.word_similarity_threshold parameter.

text %> text

boolean

The commutator of the <% operator.

text <-> text

real

Returns the distance between two strings. The distance is one minus the similarity() function value.

text <<-> text

real

Returns the distance between two strings. The distance is one minus the word_similarity() function value.

text <->> text

real

The commutator of the <<-> operator.

Index operators

Operator

Description

gist_trgm_ops

Converts text data to a trigram set and uses a GiST index to store the trigram set.

gin_trgm_ops

Converts text data to a trigram set and uses a GIN index to store the trigram set.