
AnalyticDB: BM25 high-performance full-text search

Last Updated: Feb 19, 2025

This topic provides an overview of the pgsearch extension and describes how to install, uninstall, and use the extension to perform full-text search.

Overview

The pgsearch extension helps you create and manage Best Matching 25 (BM25) indexes that are built on Tantivy, a high-performance full-text search engine. The BM25 algorithm is widely used in search engines, such as Elasticsearch, to score and rank rows based on how often a term occurs in a row and how rare the term is across rows. The algorithm is suitable for scenarios in which you want to search for keywords or phrases across a large number of rows. You can use BM25 indexes to perform full-text search on AnalyticDB for PostgreSQL tables and obtain accurate matches ranked by relevance score.

Installation and uninstallation

The pgsearch extension does not support GUI-based installation. To install the pgsearch extension, submit a ticket. After the extension is installed, you must restart your AnalyticDB for PostgreSQL instance. To uninstall the pgsearch extension, submit a ticket.
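
After the restart, you can verify that the extension is available. The following check queries the standard PostgreSQL pg_extension catalog and assumes that the extension registers under the name pgsearch:

-- List the installed extension and its version (generic PostgreSQL check).
SELECT extname, extversion FROM pg_extension WHERE extname = 'pgsearch';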

Test table

In this example, a test table that has dozens of rows is used. You can call the pgsearch.create_test_table() function to create the test table.

CALL pgsearch.create_test_table(table_name => 'mock_items', schema_name => 'public');

Table schema

CREATE TABLE mock_items (
    description TEXT,
    rating INTEGER CHECK (rating BETWEEN 1 AND 5),
    category VARCHAR(255),
    in_stock BOOLEAN,
    metadata JSONB,
    created_at TIMESTAMP,
    last_updated_date DATE,
    latest_available_time TIME
);

Index management

Important

The creation and deletion of indexes cannot be rolled back.

Create an index

You can call the pgsearch.create_bm25() function to create a BM25 index. The function allows you to create an index on multiple fields and specify an index configuration for each field. Creating multiple BM25 indexes on a table consumes excessive resources and degrades user experience. We recommend that you create only one BM25 index per table.

Syntax

CALL pgsearch.create_bm25(
  index_name => '<index_name>',
  table_name => '<table_name>',
  schema_name => '<schema_name>',
  text_fields => '<text_fields>|pgsearch.field()',
  numeric_fields => '<numeric_fields>|pgsearch.field()',
  boolean_fields => '<boolean_fields>|pgsearch.field()',
  json_fields => '<json_fields>|pgsearch.field()',
  datetime_fields => '<datetime_fields>|pgsearch.field()'
)

Parameters

Parameter

Required

Supported data type

Description

index_name

Yes

STRING

The name of the index.

table_name

Yes

STRING

The name of the table for which you want to create the index.

schema_name

No

STRING

The schema of the table. By default, the current schema is used.

text_fields

Yes (at least one of the five field parameters)

  • VARCHAR

  • VARCHAR[]

  • TEXT

  • TEXT[]

The text fields to be indexed and their corresponding index configurations. Specify an index configuration for each field by using a JSON5 string or invoking the pgsearch.field() function.

  • JSON5 string

    A JSON5 string consists of key-value pairs. Set the key to the name of a field in the table and the value to the index configuration. Supported index configuration items:

    • fast: specifies whether the field supports fast random access, which accelerates scoring and filtering. Default value: false.

    • fieldnorms: specifies whether the length of the text field is stored. Default value: true. The BM25 score can be computed only when you set the fieldnorms configuration item to true.

    • tokenizer: the tokenizer and its corresponding configuration, which is in the JSON5 format. For more information, see the "Tokenizers" section of this topic.

    • record: the information about the index. Default value: position. For more information, see the "Records" section of this topic.

  • pgsearch.field() function.

    You can invoke the pgsearch.field() function to generate the index configuration of a specific field. To specify index configurations for multiple fields, invoke multiple pgsearch.field() functions and concatenate the functions by using double vertical bars (||).

    Syntax

    pgsearch.field(<name>, <fast>, <fieldnorms>, <record>, <expand_dots>, <tokenizer>)

    Parameters

    • name: the name of the column to be indexed, which is of the text type.

    • For more information about the fast, fieldnorms, record, expand_dots, and tokenizer parameters, see the preceding information in this table.

Important

Only AnalyticDB for PostgreSQL instances of V7.2.1.0 or later support the pgsearch.field() function. For these instances, you must invoke the pgsearch.field() function when you create an index. Otherwise, the custom word segmentation dictionary and the stop word dictionary of the jieba tokenizer do not take effect.

numeric_fields

  • INT2

  • INT2[]

  • INT4

  • INT4[]

  • INT8

  • INT8[]

  • OID

  • OID[]

  • XID

  • XID[]

  • FLOAT4

  • FLOAT4[]

  • FLOAT8

  • FLOAT8[]

  • NUMERIC

  • NUMERIC[]

The numeric fields to be indexed and their corresponding index configurations. Specify an index configuration for each field by using a JSON5 string or invoking the pgsearch.field() function.

  • JSON5 string

    A JSON5 string consists of key-value pairs. Set the key to the name of a field in the table and the value to the index configuration. Supported index configuration items:

    fast: specifies whether the field supports fast random access, which accelerates scoring and filtering. Default value: true.

  • pgsearch.field() function.

    You can invoke the pgsearch.field() function to generate the index configuration of a specific field. To specify index configurations for multiple fields, invoke multiple pgsearch.field() functions and concatenate the functions by using double vertical bars (||). For more information, see the preceding information in this table.

boolean_fields

  • BOOLEAN

  • BOOLEAN[]

The Boolean fields to be indexed and their corresponding index configurations.

  • JSON5 string

    A JSON5 string consists of key-value pairs. Set the key to the name of a field in the table and the value to the index configuration. Supported index configuration items:

    fast: specifies whether the field supports fast random access, which accelerates scoring and filtering. Default value: true.

  • pgsearch.field() function.

    You can invoke the pgsearch.field() function to generate the index configuration of a specific field. To specify index configurations for multiple fields, invoke multiple pgsearch.field() functions and concatenate the functions by using double vertical bars (||). For more details, see the preceding information in this table.

json_fields

  • JSON

  • JSONB

The JSON fields to be indexed and their corresponding index configurations.

  • JSON5 string

    A JSON5 string consists of key-value pairs. Set the key to the name of a field in the table and the value to the index configuration. After the index is created, the index can be used to perform search on the nested text fields within JSON values. Supported index configuration items:

    • fast: specifies whether the field supports fast random access, which accelerates scoring and filtering. Default value: false.

    • expand_dots: specifies whether JSON keys that contain periods (.) are expanded. Default value: true. For example, if you set the expand_dots configuration item to true, {"metadata.color": "red"} is indexed as {"metadata": {"color": "red"}}.

    • tokenizer: the tokenizer and its corresponding configuration. We recommend that you set this parameter to en_stem. For more information, see the "Tokenizers" section of this topic.

    • record: the information about the index. Default value: position. For more information, see the "Records" section of this topic.

  • pgsearch.field() function.

    You can invoke the pgsearch.field() function to generate the index configuration of a specific field. To specify index configurations for multiple fields, invoke multiple pgsearch.field() functions and concatenate the functions by using double vertical bars (||). For more details, see the preceding information in this table.

datetime_fields

  • DATE

  • DATE[]

  • TIMESTAMP

  • TIMESTAMP[]

  • TIMESTAMPTZ

  • TIMESTAMPTZ[]

  • TIME

  • TIME[]

  • TIMETZ

  • TIMETZ[]

The date and time fields to be indexed and their corresponding index configurations. If you do not specify an index configuration, the UTC time zone and the RFC 3339 format are used.

  • JSON5 string

    A JSON5 string consists of key-value pairs. Set the key to the name of a field in the table and the value to the index configuration. Supported index configuration items:

    fast: specifies whether the field supports fast random access, which accelerates scoring and filtering. Default value: true.

  • pgsearch.field() function.

    You can invoke the pgsearch.field() function to generate the index configuration of a specific field. To specify index configurations for multiple fields, invoke multiple pgsearch.field() functions and concatenate the functions by using double vertical bars (||). For more details, see the preceding information in this table.

Records

The pgsearch extension supports three types of records.

  • raw: does not perform word segmentation.

  • freq: records the row ID and the term frequency.

  • position: records the row ID, term frequency, and occurrence position.
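
For example, the following call is a minimal sketch based on the mock_items test table. It sets the record configuration item of the description field to freq, which is sufficient for BM25 scoring but does not support phrase or proximity search. Phrase and proximity search require the position record type.

CALL pgsearch.create_bm25(
    index_name => 'search_idx',
    table_name => 'mock_items',
    -- record: "freq" stores row IDs and term frequencies, but not occurrence positions.
    text_fields => '{description: {record: "freq", tokenizer: {type: "default"}}}'
);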

Tokenizers

The pgsearch extension provides a variety of built-in tokenizers, such as jieba, ngram, lindera, en_stem, and whitespace. The tokenizers can meet your business requirements without the need to install the pg_jieba or zhparser extension.

Tokenizer name

Description

Configuration

default

Segments text based on spaces and punctuation marks, and converts the text into lowercase letters. This tokenizer filters out terms whose length exceeds 255 bytes.

{type: "default"}

raw

Does not segment text.

{type: "raw"}

en_stem

Segments text based on spaces and punctuation marks, converts the text into lowercase letters, and then extracts the stem of each term. This tokenizer filters out terms whose length exceeds 40 characters.

{type: "en_stem"}

whitespace

Segments text based on spaces.

{type: "whitespace"}

ngram

Segments text into a sequence of n-grams by using the following parameters. An n-gram is a sequence of characters of length n.

  • min_gram: the minimum length of each n-gram. For example, if you set the min_gram parameter to 2, each n-gram contains a minimum of 2 characters.

  • max_gram: the maximum length of each n-gram. For example, if you set the max_gram parameter to 5, each n-gram contains up to 5 characters.

  • prefix_only: specifies whether to generate n-grams only from the beginning of a word so that all n-grams are prefixes of the word. Valid values: true and false. If you set the prefix_only parameter to false, n-grams are generated from every position in the word, with lengths in the range specified by the min_gram and max_gram parameters.

{type: "ngram", min_gram: 1, max_gram: 2, prefix_only: true}

chinese_compatible

Segments text based on spaces and punctuation marks, which is suitable for Chinese text. Each run of consecutive non-Chinese characters generates one token, and each Chinese character generates a separate token. Characters that are not letters or digits, such as punctuation marks, do not generate tokens. For example, the string "我爱吃橙子 oranges!12" is segmented into seven tokens: 我, 爱, 吃, 橙, 子, oranges, and 12.

{type: "chinese_compatible"}

chinese_lindera

Segments text based on the lindera tokenizer and the CC-CEDICT dictionary.

{type: "chinese_lindera"}

korean_lindera

Segments text based on the lindera tokenizer and the KoDic dictionary.

{type: "korean_lindera"}

japanese_lindera

Segments text based on the lindera tokenizer and the IPADIC dictionary.

{type: "japanese_lindera"}

jieba

Segments text based on the jieba tokenizer, which is suitable for most Chinese text. The jieba tokenizer supports custom word segmentation dictionaries and stop word dictionaries. For more information, see Configure a custom word segmentation dictionary and Configure a stop word dictionary.

  • hmm: specifies whether to enable Hidden Markov Model (HMM) assisted word segmentation. Default value: true. When you set the parameter to true, the tokenizer uses HMM to handle out-of-vocabulary words. HMM-assisted word segmentation determines word boundaries based on context information to improve the accuracy of word segmentation. If the dictionary is relatively complete, you can set the parameter to false to disable HMM-assisted word segmentation. This accelerates the processing of word segmentation.

  • search: specifies whether to use the finest granularity of word segmentation. Default value: true. This further segments the long words and increases the matching accuracy of phrases. When you set the parameter to false, the text is segmented at a coarse granularity. For example, the text "南京市长江大桥" is segmented into "南京/京市/南京市/长江/大桥/长江大桥" when you set the search parameter to true. When you set the search parameter to false, the text is segmented into "南京市/长江大桥".

Note

Only AnalyticDB for PostgreSQL V7.0 instances of V7.2.1.0 or later support the hmm and search parameters.

{type: "jieba",hmm=>true,search=>true}

You can invoke the pgsearch.tokenizer() function to generate a string of configuration items for a tokenizer. This simplifies the configuration process. The following section describes the syntax and parameters:

Syntax

pgsearch.tokenizer(<name>, <min_gram>, <max_gram>, <prefix_only>, <search>, <hmm>, <dict>, <stopword>, <lowercase>, <remove_long>, <stemmer>);

Parameters

  • name: the name of the tokenizer, which is of the text type. For information about the valid values, see the "Tokenizers" section of this topic.

  • For information about the min_gram, max_gram, prefix_only, search, and hmm parameters, see the "Tokenizers" section of this topic.

  • dict: the name of the word segmentation dictionary used by the jieba tokenizer, which is of the text type. For more information, see Configure a custom word segmentation dictionary.

  • stopword: the name of the stop word dictionary, which is of the text type. By default, you do not need to specify this parameter. For more information, see Configure a stop word dictionary.

  • lowercase: specifies whether to convert keywords into lowercase, which is of the Boolean type. Default value: true.

  • remove_long: the length threshold for removing keywords, which is of the integer type. By default, you do not need to specify this parameter. If you set this parameter, keywords whose length is greater than or equal to remove_long bytes are filtered out. Each Chinese character occupies 3 bytes.

  • stemmer: the language used for stemming, which is of the text type. By default, you do not need to specify this parameter. If you set this parameter, stems are extracted. For example, run, running, runs, and ran are all reduced to the stem run, and a search for any of these forms retrieves relevant results. Only English stemming is supported, and you must set the value to en.

Important

Only AnalyticDB for PostgreSQL instances of V7.2.1.0 or later support the pgsearch.tokenizer() function. For these instances, you must invoke the pgsearch.tokenizer() function when you create an index. Otherwise, the custom word segmentation dictionary and stop word dictionary of the jieba tokenizer do not take effect.
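
For reference, you can cast the result of pgsearch.tokenizer() to text to inspect the configuration string that the function generates, as the filter example later in this topic does. The following statement is a sketch, and the exact output format may vary by version:

-- Generate and display the configuration string for an ngram tokenizer.
SELECT pgsearch.tokenizer('ngram', min_gram=>2, max_gram=>3, prefix_only=>false)::text;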

Examples of creating indexes

  • Use the text_fields parameter to create an index.

    -- Use the ngram tokenizer. You must specify all configuration fields of the ngram tokenizer.
    CALL pgsearch.create_bm25(
        index_name => 'search_idx',
        table_name => 'mock_items',
        text_fields => '{description: { tokenizer: {type: "ngram", min_gram: 2, max_gram: 3, prefix_only: false}}}'
    );
    
    -- Use the jieba tokenizer.
    CALL pgsearch.create_bm25(
        index_name => 'search_idx',
        table_name => 'mock_items',
        text_fields => '{description: { tokenizer: {type: "jieba"}}}'
    );
    
    -- Use the lindera tokenizer.
    CALL pgsearch.create_bm25(
        index_name => 'search_idx',
        table_name => 'mock_items',
        text_fields => '{description: { tokenizer: {type: "chinese_lindera"}}}'
    );
    -- Create an index on multiple fields.
    CALL pgsearch.create_bm25(
        index_name => 'search_idx',
        table_name => 'mock_items',
        text_fields => '{description: {fast: false, fieldnorms: true, tokenizer: {type: "jieba"}}, category: {}}',
        datetime_fields => '{created_at: {fast: true}, last_updated_date: {fast: true}}',
        numeric_fields => '{rating: {fast: true}}',
        json_fields => '{metadata: {fast: true, expand_dots: true, tokenizer: {type: "en_stem"}, record: "position"}}',
        boolean_fields => '{in_stock: {fast: true} }'
    );
    
    -- In most cases, you can configure only the tokenizer field for the text_fields and json_fields parameters. Use the default values for other parameters.  
    CALL pgsearch.create_bm25(
        index_name => 'search_idx',
        table_name => 'mock_items',
        text_fields => '{description: {tokenizer: {type: "jieba"}}, category: {}}',
        datetime_fields => '{created_at: {}, last_updated_date: {}}',
        numeric_fields => '{rating: {}}',
        json_fields => '{metadata: {tokenizer: {type: "en_stem"}}}',
        boolean_fields => '{in_stock: {} }'
    );
  • Use the pgsearch.field() and pgsearch.tokenizer() functions to create indexes and query the word segmentation effect.

    -- Set the fast parameter of the description column to true, set the record parameter to position, and set the tokenizer to jieba. The jieba tokenizer is configured in non-search mode, user_dict is used as the custom word segmentation dictionary, and CN_SIMPLE is used as the built-in stop word dictionary.
    -- Use the default configuration for the category column.
    -- The configurations of the two columns are concatenated by using the double vertical bar (||).
    CALL pgsearch.create_bm25(
        index_name => 'search_idx',
        table_name => 'mock_items',
        text_fields => pgsearch.field('description', fast=>true, record=>'position', tokenizer=>pgsearch.tokenizer('jieba', search=>false, dict=>'user_dict', stopword=>'CN_SIMPLE')) 
                        || pgsearch.field('category')
    );
    
    CALL pgsearch.create_bm25(
        index_name => 'search_idx',
        table_name => 'mock_items',
        text_fields => pgsearch.field('description', tokenizer=>pgsearch.tokenizer('jieba')),
        datetime_fields => '{created_at: {}, last_updated_date: {}}',
        numeric_fields => '{rating: {}}',
        json_fields => '{metadata: {tokenizer: {type: "en_stem"}}}',
        boolean_fields => '{in_stock: {} }'
    );
    
    -- Query the word segmentation effect by using the specified jieba tokenizer.
    SELECT pgsearch.tokenizer(pgsearch.tokenizer('jieba', hmm=>false, search=>false, dict=>'custom_dict_1'), '永和服装饰品有限公司');
    
    -- Use various filters to process word segmentation results.
    SELECT pgsearch.tokenizer(
        pgsearch.tokenizer('jieba',
        search=>false, dict=>'user_dict', stopword=>'CN_SIMPLE',
        lowercase=>false, remove_long=>27, stemmer=>'en')::text,
        '永和服装饰品有限公司。 Shoping'
    );

Query the configuration of an index

SELECT * FROM pgsearch.schema_bm25('index_name'::regclass);

Delete an index

DROP INDEX index_name;

Rebuild an index

REINDEX INDEX index_name;

Search syntax

@@@ operator

You can use the @@@ operator to sort search results by BM25 score and return the top results, in a manner similar to k-nearest neighbor (KNN) search.

Syntax

SELECT * FROM <table_name> 
ORDER BY <index_col> @@@ pgsearch.config('<query>');

Parameters

  • <table_name>: the name of the table.

  • <index_col>: the name of an indexed field that is specified when you create the index. We recommend that you use the first indexed field that appears in the search keyword specified by the query parameter.

  • <query>: the search keyword.

  • pgsearch.config: the search function, which accepts a character string or another type of search object. Search objects can be combined and nested at any granularity. When a search keyword string is passed to the pgsearch.config() function, the pgsearch.parse() function implicitly parses the string into a search object. The following sample statements produce the same query result:

    SELECT * FROM mock_items 
    ORDER BY description @@@ pgsearch.config('description:socks');

    SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
      query => pgsearch.parse('description:socks')
    );

Basic queries

Specific field search

Text field search

Query the rows that include a specified word or phrase. If the phrase contains spaces, you must enclose the phrase in double quotation marks ("). The following statements query the rows that include the word "keyboard" or the phrase "hello world".

SELECT * FROM mock_items 
ORDER BY description @@@ pgsearch.config('description:keyboard');

-- Query the rows that include the space-separated phrase "hello world".
SELECT * FROM mock_items 
ORDER BY description @@@ pgsearch.config('description:"hello world"');

JSON field search

Query the rows that include {"metadata": {"color": "white"}}.

SELECT * FROM mock_items 
ORDER BY metadata @@@ pgsearch.config('metadata.color:white');

DATETIME field search

When you query a DATETIME field, the UTC time zone and the RFC 3339 format are used.

SELECT * FROM mock_items ORDER BY created_at @@@ pgsearch.config('created_at:"2023-05-01T09:12:34Z"') LIMIT 10;

SELECT * FROM mock_items ORDER BY created_at @@@ pgsearch.config('created_at:"2023-05-01T04:12:34-05:00"') LIMIT 10;

Proximity operator search

You can use the ~ operator to query phrases whose keywords are separated by no more than the specified number of other words. For example, the description field includes the phrase "ergonomic metal keyboard". The following statement matches this phrase because "ergonomic" and "keyboard" are separated by one word.

SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config('description:"ergonomic keyboard"~1');

Efficient filter search

Filters apply only to numeric and Boolean fields on which BM25 indexes are created. Compared with the WHERE clause in a standard SQL statement, filter search returns results faster.

SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config('description:keyboard AND rating:<4');

Rank improvement search

To improve the result ranking of a search keyword, append a caret (^) followed by a boost coefficient. The coefficient increases the BM25 scores of matching results, which improves their ranking.

SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config('description:keyboard^2 OR category:electronics^3');

Boolean operator search

You can use the AND, OR, and NOT operators to filter multiple keywords. Parentheses () can be used to manage grouping and operation sequence.

SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config('description:keyboard OR category:toy');
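
Parentheses can also combine Boolean logic with the filter syntax described in the preceding section. The following statement is a sketch that reuses the rating filter shown earlier:

-- Match rows that contain "keyboard" in description or "toy" in category, and whose rating is less than 4.
SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config('(description:keyboard OR category:toy) AND rating:<4');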

Set operator search

You can use the IN operator to achieve the same effect as one or more OR operators while consuming less CPU.

When you use the IN operator, separate terms with spaces, and the word segmentation result of each term must contain no more than one token. The number of tokens generated by Chinese word segmentation depends on the tokenizer and is difficult to control. Therefore, the IN operator is applicable only to English. For Chinese text, we recommend that you use multiple OR operators instead.

SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config('description:IN [keyboard toy]');

Limit- and offset-based search

You can use the OFFSET and LIMIT clauses to perform search.

SELECT * FROM mock_items
ORDER BY description @@@ pgsearch.config('description:socks') OFFSET 2 LIMIT 10;

Advanced query

Obtain the word segmentation result

You can use the pgsearch.tokenizer() function to obtain the word segmentation result. The function returns an array that consists of the segmented words.

Syntax

SELECT pgsearch.tokenizer(<tokenizer_config>, <query_str>);

Parameters

  • tokenizer_config: a JSON string that contains the tokenizer configuration. You can also invoke the configuration-generating overload of the pgsearch.tokenizer() function to produce this string. For more information, see the "Tokenizers" section of this topic.

  • query_str: the character string that you want to segment.

Example

SELECT pgsearch.tokenizer('{type: "ngram", min_gram: 1, max_gram: 2, prefix_only: true}', 'hell');
SELECT pgsearch.tokenizer('{type: "jieba"}', '数据仓库');
SELECT pgsearch.tokenizer(pgsearch.tokenizer('jieba'), '永和服装饰品有限公司');

Term set search

You can search for rows that include any keyword in an array of terms.

Example

SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
    query => pgsearch.term_set(
        terms => ARRAY[
            pgsearch.term(field => 'description', VALUE => 'socks'),
            pgsearch.term(field => 'description', VALUE => 'novel')
        ]
    )
);

Parameters

  • field: the search field. If you leave this parameter empty, all index fields are searched.

  • terms: an array of terms to be searched.

Note

By default, the tokenizer converts indexed keywords into lowercase. Queries that pass uppercase keywords to the pgsearch.term() function do not return relevant results. We recommend that you use lowercase keywords in your queries.

Phrase search

You can query the rows that include an array of ordered keywords. Before you perform phrase search, make sure that the record configuration item of the index is set to position.

Example

SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
    query => pgsearch.phrase(
        field => 'description',
        phrases => ARRAY['little', 'red', 'riding', 'hood'],
        slop => 0
    )
);

Parameters

  • field: the search field. If you leave this parameter empty, all index fields are searched.

  • phrases: an array of keywords. A row is returned only if it includes all keywords in the array and the keywords appear in the same order as in the array.

  • slop: the maximum distance between two keywords. This parameter is optional. A value of 0 specifies that the keywords in a matching row must be adjacent and appear in the array order. For example, the little, red, riding, and hood keywords must be consecutive in the row. A value greater than 0 allows other words to appear between any two of the keywords in a matching row. For an example, see the sketch after this list.
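
For example, the following sketch sets the slop parameter to 1. A row whose description field includes "ergonomic metal keyboard" matches even though one word separates the two keywords:

SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
    query => pgsearch.phrase(
        field => 'description',
        phrases => ARRAY['ergonomic', 'keyboard'],
        -- Allow at most one other word between the keywords.
        slop => 1
    )
);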

Phrase prefix search

You can search for rows that include the content prefixed with an array of ordered keywords. Before you perform phrase prefix search, make sure that an index is created on the search field.

Example

SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
    query => pgsearch.phrase_prefix(
        field => 'description',
        phrases => ARRAY['little', 'red', 'riding', 'hood'],
        max_expansion => 1
    )
);

Parameters

  • field: the search field. If you leave this parameter empty, all index fields are searched.

  • phrases: an array of ordered keywords that form the prefix. You can query character strings that are prefixed with the specified keywords. For example, if you set the phrases parameter to ARRAY['little', 're'], "little re" is used as the prefix. In this case, "little red riding hood" can be returned.

  • max_expansion: the maximum number of keyword variants that can be expanded from the prefix. This parameter is optional. A smaller value narrows the search scope.

Word segmentation search (tokenizer_terms)

You can use a specified tokenizer to segment the query text and then query the generated tokens. The pgsearch.tokenizer_terms() function supports Boolean logic between the generated tokens.

Example

SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
    pgsearch.tokenizer_terms('description', '朝阳百货', pgsearch.tokenizer('jieba'), 'OR')
);

Parameters

  • tokenizer: the configuration string of the tokenizer specified by the pgsearch.tokenizer() function. Default value: jieba.

  • operator: the operator that specifies the Boolean logic between the tokens generated by word segmentation. Valid values: OR and AND. Default value: OR. For example, the word "朝阳百货" is segmented into "朝阳/百货". If you set the operator parameter to OR, the rows that include "朝阳" or "百货" are returned. If you set the operator parameter to AND, only the rows that include both "朝阳" and "百货" are returned.

All search

You can search for all indexed rows. Each row receives the same score of 1.0, so all results rank equally.

SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
    query =>  pgsearch.all()
);

Boolean search

You can query the rows that are filtered by using the logical relationships defined in a subquery.

Example

SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
    query => pgsearch.boolean(
	    should => ARRAY[
		    pgsearch.parse('description:socks'),
		    pgsearch.phrase_prefix(field => 'description', phrases => ARRAY['book']),
		    pgsearch.term(field => 'description', VALUE => 'writer'),
		    pgsearch.fuzzy_term(field => 'description', VALUE => 'wow')
	    ],
        must_not => ARRAY[
            pgsearch.term(field => 'description', VALUE => 'writer')
        ],
        must => ARRAY[
            pgsearch.term(field => 'rating', VALUE => 4)
        ]
    )
);
-- You can configure one or more of the should, must_not, and must parameters.
SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
    query => pgsearch.boolean(
	    should => pgsearch.parse('description:socks')
    )
);

Parameters

  • must: the conditions that must be matched for an array of search objects. The rows that match all conditions specified by the must parameter can be returned.

  • must_not: the conditions that cannot be matched for an array of search objects. The rows that match a condition specified by the must_not parameter are not returned.

  • should: the conditions that are optional for an array of search objects. If the must parameter is not specified, a row is returned only if it matches at least one condition specified by the should parameter.

Rank boost search

You can increase the score of a subquery by using rank boost search to boost the result ranking of the subquery. In this example, query => pgsearch.parse('description:socks') is used.

Example

SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
    query => pgsearch.boost(query => pgsearch.parse('description:socks'), boost => 2)
);

Parameters

  • boost: the coefficient that is multiplied by each result score.

  • query: the search object.

Constant score search

You can assign a constant score to the rows that match the conditions specified by a subquery. In this example, query => pgsearch.all() is used. This prevents unnecessary score calculations for the subquery.

Example

SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
    query => pgsearch.const_score(query => pgsearch.all(), score => 2)
);

Parameters

  • score: the constant score of each result of the subquery.

  • query: the search object.

Maximum disjunction search

You can query the rows that match one or more subqueries. The rows that match more conditions can obtain higher BM25 scores.

Example

SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
    query => pgsearch.disjunction_max(
      disjuncts => ARRAY[
          pgsearch.parse('description:socks'),
          pgsearch.parse('description:Generic')
      ],
      tie_breaker => 0.75
    )
);

Parameters

  • disjuncts: an array that consists of one or more subqueries.

  • tie_breaker: the coefficient that is used to adjust the BM25 score of a row that matches multiple subqueries. This parameter is optional. The final score of such a row is the BM25 score of the best matching subquery plus tie_breaker multiplied by the scores of the other matching subqueries. In the preceding example, if a row scores 1.0 on the pgsearch.parse('description:socks') subquery and 0.5 on the pgsearch.parse('description:Generic') subquery, the final BM25 score of the row is 1.0 + 0.75 × 0.5 = 1.375.

Empty search

You can use empty search as a placeholder in test scenarios or edge cases. In this case, no rows are returned.

SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
    query => pgsearch.empty()
);

Fuzzy term search

You can query rows that include terms similar to the search keyword. Results can be returned even if the spelling is not exactly correct.

Example

SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
    query => pgsearch.fuzzy_term(
      field => 'description', 
      VALUE => 'wow',
      distance => 2,
      tranposition_cost_one => true,
      prefix => true)
);

Parameters

  • field: the search field. If you leave this parameter empty, all index fields are searched.

  • value: the search keyword. Fuzzy matching based on the Levenshtein distance is used to search for results similar to the specified keyword.

  • distance: the maximum edit distance (maximum number of single-character changes) for matching the specified field. This parameter is optional. Default value: 2. Maximum value: 2.

  • tranposition_cost_one: specifies whether the exchange of two adjacent characters is considered a single edit in the Levenshtein distance calculation. This parameter is optional. Default value: true. If you set the tranposition_cost_one parameter to false, the exchange is considered two separate edits (deletion and insertion).

  • prefix: specifies whether the prefix of the search keyword is excluded from the calculation of the fuzzy edit distance. This parameter is optional. Default value: true. If you set the prefix parameter to false, the entire search keyword is used for the calculation.

Range search

You can query the rows that include values within a specific range. Range search applies to fields specified by the numeric_fields and datetime_fields parameters. A datetime example is provided after the parameter list below.

Example

SELECT * FROM mock_items ORDER BY rating @@@ pgsearch.config(
    query => pgsearch.range(
        field => 'rating',
        RANGE => '[1,4)'::int4range
  )
);

Parameters

  • field: the search field. If you leave this parameter empty, all index fields are searched.

  • range: the value range for matching the specified field. The following range types are supported: INT4RANGE, INT8RANGE, DATERANGE, TSRANGE, and TSTZRANGE.
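
Range search also applies to datetime fields. The following sketch assumes the created_at column of the test table and uses a TSRANGE value. As described earlier, datetime values are interpreted in the UTC time zone:

SELECT * FROM mock_items ORDER BY created_at @@@ pgsearch.config(
    query => pgsearch.range(
        field => 'created_at',
        -- Match rows created in May 2023 (UTC).
        RANGE => '[2023-05-01, 2023-06-01)'::tsrange
    )
);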

Regular expression search

You can query the rows that include the words that match a specific regular expression.

Example

SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
    query => pgsearch.regex(
		field => 'description',
		PATTERN => '(glass|screen|like|cloth|phone)'
	)
);
SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
    query => pgsearch.regex(
		field => 'description',
		PATTERN => '(.*screen.*)'
	)
);

Parameters

  • field: the search field. If you leave this parameter empty, all index fields are searched.

  • pattern: the regular expression pattern.

Others

Obtain BM25 scores

You can include the @@@ operator in the SELECT list of a statement and use the AS keyword to obtain BM25 scores.

By default, the ORDER BY clause sorts results in ascending order in AnalyticDB for PostgreSQL databases. Because a higher BM25 score indicates a higher result ranking, the negative values of BM25 scores are returned so that the best matches sort first. For example, if a BM25 score is 2.86, the return value is -2.86. BM25 scores are calculated only for searches on the fields specified by the text_fields and json_fields parameters. BM25 scores cannot be calculated for searches on the fields specified by the numeric_fields, datetime_fields, and boolean_fields parameters.

Syntax

SELECT *, <index_col> @@@ pgsearch.config('<query>') AS bm25
FROM <table_name>
ORDER BY bm25;

Example

-- 1. Perform full-text search only on a string field. A BM25 score is returned.
SELECT description, rating, description @@@ pgsearch.config('description:socks') AS bm25 FROM mock_items ORDER BY bm25 LIMIT 1;

-- 2. Perform full-text search on a string field and filter data on a numeric field. A BM25 score is returned.
SELECT description, rating, description @@@ pgsearch.config('description:socks AND rating:4') AS bm25 FROM mock_items ORDER BY bm25 LIMIT 4;

-- 3. Perform full-text search on a JSON field. A BM25 score is returned.
SELECT metadata, metadata @@@ pgsearch.config('metadata.color:White') AS bm25 FROM mock_items ORDER BY bm25 LIMIT 1;

-- 4. Perform range search on the rating field. BM25 scores cannot be calculated. The default value -1 is returned.
SELECT description, rating, rating @@@ pgsearch.config('rating:[4 TO 5]') AS bm25 FROM mock_items ORDER BY bm25 LIMIT 1;

Query result

-- 1. Perform full-text search only on a string field. A BM25 score is returned.
  description  | rating |    bm25    
---------------+--------+------------
 Generic socks |      4 | -2.1048825
(1 row)

-- 2. Perform full-text search on a string field and filter data on a numeric field. A BM25 score is returned.
  description  | rating |    bm25    
---------------+--------+------------
 Generic socks |      4 | -3.1081846
(1 row)

-- 3. Perform full-text search on a JSON field. A BM25 score is returned.
                 metadata                 |   bm25    
------------------------------------------+-----------
 {"color": "White", "location": "China"} | -3.453373
(1 row)

-- 4. Perform range search on the rating field. BM25 scores cannot be calculated. The default value -1 is returned.
   description    | rating | bm25 
------------------+--------+------
 Plastic Keyboard |      4 |   -1
(1 row)