All Products
Search
Document Center

AnalyticDB:High-performance BM25 full-text search

Last Updated:Mar 28, 2026

The pgsearch extension brings BM25-based full-text search to AnalyticDB for PostgreSQL. Built on Tantivy, a high-performance search engine, it ranks results by relevance — the same algorithm that powers Elasticsearch. Use it to search keywords and phrases across large tables directly from SQL, without an external search service.

Prerequisites

Before you begin, ensure that you have:

  • An AnalyticDB for PostgreSQL instance

  • A submitted ticket to install the pgsearch extension (no console-based installation is available)

  • Restarted your instance after the extension is installed

To uninstall the pgsearch extension, submit a ticket.

Quick start

This section walks through the full workflow: create a test table, build a BM25 index, and run your first search.

1. Create a test table.

CALL pgsearch.create_test_table(table_name => 'mock_items', schema_name => 'public');

This creates a mock_items table with the following schema:

CREATE TABLE mock_items (
  description TEXT,
  rating INTEGER CHECK (rating BETWEEN 1 AND 5),
  category VARCHAR(255),
  in_stock BOOLEAN,
  metadata JSONB,
  created_at TIMESTAMP,
  last_updated_date DATE,
  latest_available_time TIME
);

2. Create a BM25 index.

CALL pgsearch.create_bm25(
    index_name => 'search_idx',
    table_name => 'mock_items',
    text_fields => '{description: {tokenizer: {type: "jieba"}}, category: {}}',
    datetime_fields => '{created_at: {}, last_updated_date: {}}',
    numeric_fields => '{rating: {}}',
    json_fields => '{metadata: {tokenizer: {type: "en_stem"}}}',
    boolean_fields => '{in_stock: {}}'
);

3. Run a search.

SELECT * FROM mock_items
ORDER BY description @@@ pgsearch.config('description:keyboard')
LIMIT 10;

Results are ranked by BM25 score — the most relevant rows appear first.

Create a BM25 index

Call pgsearch.create_bm25() to create an index. Create only one BM25 index per table — multiple indexes consume excessive resources.

Important

Index creation and deletion cannot be rolled back.

Syntax

CALL pgsearch.create_bm25(
  index_name => '<index_name>',
  table_name => '<table_name>',
  schema_name => '<schema_name>',
  text_fields => '<text_fields> | pgsearch.field()',
  numeric_fields => '<numeric_fields> | pgsearch.field()',
  boolean_fields => '<boolean_fields> | pgsearch.field()',
  json_fields => '<json_fields> | pgsearch.field()',
  datetime_fields => '<datetime_fields> | pgsearch.field()'
)

Parameters

ParameterRequiredSupported data typesDescription
index_nameYesSTRINGName of the index.
table_nameYesSTRINGName of the table to index.
schema_nameNoSTRINGSchema of the table. Defaults to the current schema.
text_fieldsYes (at least one)VARCHAR, VARCHAR[], TEXT, TEXT[]Text fields to index, with per-field configuration. Use a JSON5 string or pgsearch.field().
numeric_fieldsNoINT2, INT2[], INT4, INT4[], INT8, INT8[], OID, OID[], XID, XID[], FLOAT4, FLOAT4[], FLOAT8, FLOAT8[], NUMERIC, NUMERIC[]Numeric fields to index.
boolean_fieldsNoBOOLEAN, BOOLEAN[]Boolean fields to index.
json_fieldsNoJSON, JSONBJSON fields to index. Supports full-text search on nested text fields within JSON values.
datetime_fieldsNoDATE, DATE[], TIMESTAMP, TIMESTAMP[], TIMESTAMPTZ, TIMESTAMPTZ[], TIME, TIME[], TIMETZ, TIMETZ[]Date and time fields to index. Defaults to UTC timezone and RFC 3339 format.

Field configuration options

Each field accepts a JSON5 configuration string or pgsearch.field(). The following options apply:

OptionDescriptionDefault
fastEnables fast random access for scoring and filtering.false for text and JSON fields; true for numeric, boolean, and datetime fields
fieldnormsStores field length. Required for BM25 score computation.true
tokenizerTokenizer and its configuration. See Tokenizers.
recordIndex record type: raw, freq, or position. position enables phrase search.position
expand_dots(JSON fields only) Expands JSON keys that contain periods. For example, {"metadata.color": "red"} is indexed as {"metadata": {"color": "red"}}.true

pgsearch.field() function

pgsearch.field() generates the index configuration for a single field. To configure multiple fields, concatenate multiple calls with ||.

pgsearch.field(<name>, <fast>, <fieldnorms>, <record>, <expand_dots>, <tokenizer>)
Important

AnalyticDB for PostgreSQL V7.2.1.0 and later require pgsearch.field() when creating an index. Without it, custom word segmentation dictionaries and stop word dictionaries for the jieba tokenizer do not take effect.

Tokenizers

The pgsearch extension includes built-in tokenizers for English, Chinese, Korean, and Japanese. No additional extensions are needed.

TokenizerDescriptionConfiguration
defaultSplits on spaces and punctuation, lowercases, removes tokens longer than 255 bytes.{type: "default"}
rawTreats the entire field value as a single token.{type: "raw"}
en_stemSplits on spaces and punctuation, lowercases, and applies English stemming. Removes tokens longer than 40 characters. Use for JSON fields with English content.{type: "en_stem"}
whitespaceSplits on whitespace only.{type: "whitespace"}
ngramGenerates n-gram sequences. Parameters: min_gram, max_gram, prefix_only.{type: "ngram", min_gram: 1, max_gram: 2, prefix_only: true}
chinese_compatibleSplits on spaces and punctuation. Each Chinese character becomes a separate token; consecutive non-Chinese characters form one token. For example, "我爱吃橙子 oranges!12"我, 爱, 吃, 橙, 子, oranges, 12.{type: "chinese_compatible"}
chinese_linderaChinese tokenizer using the CC-CEDICT dictionary.{type: "chinese_lindera"}
korean_linderaKorean tokenizer using the KoDic dictionary.{type: "korean_lindera"}
japanese_linderaJapanese tokenizer using the IPADIC dictionary.{type: "japanese_lindera"}
jiebaChinese tokenizer using the jieba library. Supports custom word segmentation dictionaries and stop word dictionaries. Additional parameters: hmm (Hidden Markov Model for out-of-vocabulary words, default true) and search (fine-grained segmentation, default true). For more information, see Configure a custom word segmentation dictionary and Configure a stop word dictionary.{type: "jieba", hmm: true, search: true}
Note

The hmm and search parameters for the jieba tokenizer are available on AnalyticDB for PostgreSQL V7.2.1.0 and later.

Verify tokenizer output before creating an index.

Use pgsearch.tokenizer() to see how a tokenizer segments text before committing to an index.

-- Check how jieba segments Chinese text
SELECT pgsearch.tokenizer('{type: "jieba"}', '数据仓库');

-- Check how ngram segments text
SELECT pgsearch.tokenizer('{type: "ngram", min_gram: 1, max_gram: 2, prefix_only: true}', 'hell');

pgsearch.tokenizer() function

pgsearch.tokenizer() generates a tokenizer configuration string. Use it with pgsearch.field() to configure tokenizers and apply filters such as stemming and stop word removal.

pgsearch.tokenizer(<name>, <min_gram>, <max_gram>, <prefix_only>, <search>, <hmm>, <dict>, <stopword>, <lowercase>, <remove_long>, <stemmer>)
ParameterDescription
nameTokenizer name. See the tokenizer table above for valid values.
min_gram, max_gram, prefix_onlyngram tokenizer parameters.
search, hmmjieba tokenizer parameters.
dictCustom word segmentation dictionary name for jieba. See Configure a custom word segmentation dictionary.
stopwordStop word dictionary name. See Configure a stop word dictionary.
lowercaseConverts keywords to lowercase. Default: true.
remove_longRemoves tokens whose length in bytes equals or exceeds this value. Each Chinese character is 3 bytes.
stemmerExtracts word stems. Set to en for English. For example, run, running, runs, and ran all resolve to run.
Important

AnalyticDB for PostgreSQL V7.2.1.0 and later require pgsearch.tokenizer() to apply custom word segmentation dictionaries and stop word dictionaries for the jieba tokenizer.

Examples

Use a JSON5 configuration string:

-- ngram tokenizer — all parameters must be specified
CALL pgsearch.create_bm25(
    index_name => 'search_idx',
    table_name => 'mock_items',
    text_fields => '{description: {tokenizer: {type: "ngram", min_gram: 2, max_gram: 3, prefix_only: false}}}'
);

-- jieba tokenizer
CALL pgsearch.create_bm25(
    index_name => 'search_idx',
    table_name => 'mock_items',
    text_fields => '{description: {tokenizer: {type: "jieba"}}}'
);

-- Index multiple field types
CALL pgsearch.create_bm25(
    index_name => 'search_idx',
    table_name => 'mock_items',
    text_fields => '{description: {tokenizer: {type: "jieba"}}, category: {}}',
    datetime_fields => '{created_at: {}, last_updated_date: {}}',
    numeric_fields => '{rating: {}}',
    json_fields => '{metadata: {tokenizer: {type: "en_stem"}}}',
    boolean_fields => '{in_stock: {}}'
);

Use `pgsearch.field()` and `pgsearch.tokenizer()` (required on V7.2.1.0 and later):

-- Configure description with fast access, position records, and a jieba tokenizer
-- using a custom dictionary and built-in CN_SIMPLE stop words.
-- Concatenate multiple field configurations with ||.
CALL pgsearch.create_bm25(
    index_name => 'search_idx',
    table_name => 'mock_items',
    text_fields => pgsearch.field('description',
                       fast => true,
                       record => 'position',
                       tokenizer => pgsearch.tokenizer('jieba',
                           search => false,
                           dict => 'user_dict',
                           stopword => 'CN_SIMPLE'))
                   || pgsearch.field('category')
);

-- Verify jieba tokenizer output with filters applied
SELECT pgsearch.tokenizer(
    pgsearch.tokenizer('jieba',
        search => false, dict => 'user_dict', stopword => 'CN_SIMPLE',
        lowercase => false, remove_long => 27, stemmer => 'en')::text,
    '永和服装饰品有限公司。 Shoping'
);

Search syntax

All searches use the @@@ operator with pgsearch.config(). Results are sorted by BM25 score in descending order.

SELECT * FROM <table_name>
ORDER BY <index_col> @@@ pgsearch.config('<query>')
LIMIT <n>;
  • <index_col>: an indexed field name. Use the field that the search keyword targets.

  • <query>: the search expression. When you pass a string, pgsearch.parse() implicitly parses it.

The following two statements are equivalent:

SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config('description:socks');
SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(query => pgsearch.parse('description:socks'));

Basic queries

Search a text field

Search for a word or phrase. Enclose multi-word phrases in double quotation marks.

-- Search for the word "keyboard"
SELECT * FROM mock_items
ORDER BY description @@@ pgsearch.config('description:keyboard');

-- Search for the exact phrase "hello world"
SELECT * FROM mock_items
ORDER BY description @@@ pgsearch.config('description:"hello world"');

Search a JSON field

-- Search for rows where metadata contains {"color": "white"}
SELECT * FROM mock_items
ORDER BY metadata @@@ pgsearch.config('metadata.color:white');

Search a datetime field

Datetime fields use UTC timezone and RFC 3339 format.

SELECT * FROM mock_items
ORDER BY created_at @@@ pgsearch.config('created_at:"2023-05-01T09:12:34Z"')
LIMIT 10;

Filter on numeric and Boolean fields

Use filter expressions inside pgsearch.config() for numeric and Boolean fields. Filters on indexed fields are faster than standard SQL WHERE clauses.

-- Full-text search on description, filtered to rating < 4
SELECT * FROM mock_items
ORDER BY description @@@ pgsearch.config('description:keyboard AND rating:<4');

Proximity search

Use ~N to match terms separated by up to N words.

-- Match "ergonomic keyboard" with at most 1 word between them
SELECT * FROM mock_items
ORDER BY description @@@ pgsearch.config('description:"ergonomic keyboard"~1');

Boolean operators

Combine search terms with AND, OR, and NOT. Use parentheses to group conditions.

SELECT * FROM mock_items
ORDER BY description @@@ pgsearch.config('description:keyboard OR category:toy');

Rank improvement

Use ^ followed by a multiplier to boost the BM25 score of specific terms, pushing matching rows higher in results.

SELECT * FROM mock_items
ORDER BY description @@@ pgsearch.config('description:keyboard^2 OR category:electronics^3');

Set search (IN operator)

The IN operator is equivalent to multiple OR conditions but uses less CPU. Each phrase must produce a single token.

Note

Use IN for English only. Chinese word segmentation produces unpredictable token counts, making token-level IN unreliable. Use OR for Chinese.

SELECT * FROM mock_items
ORDER BY description @@@ pgsearch.config('description:IN [keyboard toy]');

Pagination

SELECT * FROM mock_items
ORDER BY description @@@ pgsearch.config('description:socks')
OFFSET 2 LIMIT 10;

Advanced queries

Term set search

Search for rows that match any term in an array.

SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
    query => pgsearch.term_set(
        terms => ARRAY[
            pgsearch.term(field => 'description', value => 'socks'),
            pgsearch.term(field => 'description', value => 'novel')
        ]
    )
);
Note

The tokenizer lowercases indexed keywords. Use lowercase values in pgsearch.term().

ParameterDescription
fieldField to search. Leave blank to search all indexed fields.
termsArray of terms to match.

Phrase search

Search for rows containing an array of keywords in order. The index record option must be set to position.

SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
    query => pgsearch.phrase(
        field => 'description',
        phrases => ARRAY['little', 'red', 'riding', 'hood'],
        slop => 0
    )
);
ParameterDescription
fieldField to search. Leave blank to search all indexed fields.
phrasesOrdered array of keywords. All must appear in the row in the same order.
slopMaximum distance between adjacent keywords. 0 means the keywords must be consecutive. Values greater than 0 allow other words between them.

Phrase prefix search

Search for rows that start with an ordered sequence of keywords.

SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
    query => pgsearch.phrase_prefix(
        field => 'description',
        phrases => ARRAY['little', 're'],
        max_expansion => 1
    )
);
ParameterDescription
fieldField to search. Leave blank to search all indexed fields.
phrasesPrefix keyword array. For example, ARRAY['little', 're'] matches "little red riding hood".
max_expansionMaximum number of term variants expanded from the prefix. Narrows the search scope.

Tokenizer terms search

Segment the query with a specified tokenizer, then search the resulting tokens. Supports Boolean logic between tokens.

SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
    pgsearch.tokenizer_terms('description', '朝阳百货', pgsearch.tokenizer('jieba'), 'OR')
);
ParameterDescription
tokenizerTokenizer configuration string from pgsearch.tokenizer(). Default: jieba.
operatorBoolean operator between tokens. OR returns rows matching any token. AND requires all tokens. Default: OR.

Fuzzy term search

Match terms within a specified edit distance (Levenshtein distance), returning results even if the spelling is imperfect.

SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
    query => pgsearch.fuzzy_term(
        field => 'description',
        value => 'wow',
        distance => 2,
        tranposition_cost_one => true,
        prefix => true
    )
);
ParameterDescription
fieldField to search. Leave blank to search all indexed fields.
valueSearch keyword. Fuzzy matching uses Levenshtein distance.
distanceMaximum edit distance. Default: 2. Maximum: 2.
tranposition_cost_oneCounts swapping two adjacent characters as one edit instead of two. Default: true.
prefixExcludes the keyword prefix from the edit distance calculation. Default: true.

Range search

Search for values within a numeric or datetime range.

SELECT * FROM mock_items ORDER BY rating @@@ pgsearch.config(
    query => pgsearch.range(
        field => 'rating',
        range => '[1,4)'::int4range
    )
);
ParameterDescription
fieldField to search. Leave blank to search all indexed fields.
rangeValue range. Supported types: INT4RANGE, INT8RANGE, DATERANGE, TSRANGE, TSTZRANGE.

Regular expression search

Match rows containing words that fit a regular expression pattern.

SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
    query => pgsearch.regex(
        field => 'description',
        pattern => '(glass|screen|like|cloth|phone)'
    )
);
ParameterDescription
fieldField to search. Leave blank to search all indexed fields.
patternRegular expression pattern.

Boolean query

Combine subqueries with must, must_not, and should conditions.

SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
    query => pgsearch.boolean(
        should => ARRAY[
            pgsearch.parse('description:socks'),
            pgsearch.phrase_prefix(field => 'description', phrases => ARRAY['book']),
            pgsearch.term(field => 'description', value => 'writer'),
            pgsearch.fuzzy_term(field => 'description', value => 'wow')
        ],
        must_not => ARRAY[
            pgsearch.term(field => 'description', value => 'writer')
        ],
        must => ARRAY[
            pgsearch.term(field => 'rating', value => 4)
        ]
    )
);
ParameterDescription
mustConditions that all matching rows must satisfy.
must_notConditions that exclude rows when matched.
shouldConditions where at least one must match (when must is not set).

Rank boost

Multiply the BM25 score of a subquery by a constant to push its results higher in rankings.

SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
    query => pgsearch.boost(query => pgsearch.parse('description:socks'), boost => 2)
);
ParameterDescription
boostMultiplier applied to each result score.
queryThe subquery to boost.

Constant score search

Assign a fixed score to all rows matching a subquery, bypassing BM25 score calculations.

SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
    query => pgsearch.const_score(query => pgsearch.all(), score => 2)
);
ParameterDescription
scoreFixed score assigned to each matching row.
queryThe subquery to evaluate.

Maximum disjunction search

Match rows against multiple subqueries. Rows matching more subqueries receive higher scores.

SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
    query => pgsearch.disjunction_max(
        disjuncts => ARRAY[
            pgsearch.parse('description:socks'),
            pgsearch.parse('description:Generic')
        ],
        tie_breaker => 0.75
    )
);
ParameterDescription
disjunctsArray of subqueries.
tie_breakerScore adjustment for rows matching multiple subqueries. The final score is: best_match_score + tie_breaker × sum_of_other_scores. For example, if a row scores 1.0 on the first subquery and 0.5 on the second, the final score is 1.0 + 0.75 × 0.5 = 1.375.

All search

Return all rows that have the indexed field, each with a score of 1.0.

SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
    query => pgsearch.all()
);

Empty search

Return no rows. Use as a placeholder in test scenarios or edge cases.

SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
    query => pgsearch.empty()
);

Get BM25 scores

Place the @@@ expression in the SELECT list and alias it to retrieve the BM25 score for each row.

AnalyticDB for PostgreSQL returns scores as negative values so that ORDER BY score (ascending) naturally ranks the highest-scoring rows first. A score of -2.86 means the actual BM25 score is 2.86.

BM25 scores are computed only for text_fields and json_fields. Searches on numeric_fields, datetime_fields, and boolean_fields return -1 as the score.

SELECT *, mock_items @@@ pgsearch.config(<query>) AS bm25
FROM mock_items
ORDER BY bm25;

Examples:

-- Full-text search on a text field
SELECT description, rating, description @@@ pgsearch.config('description:socks') AS bm25
FROM mock_items ORDER BY bm25 LIMIT 1;
   description  | rating |    bm25
----------------+--------+------------
 Generic socks  |      4 | -2.1048825
(1 row)
-- Full-text search with a numeric filter
SELECT description, rating, description @@@ pgsearch.config('description:socks AND rating:4') AS bm25
FROM mock_items ORDER BY bm25 LIMIT 4;
   description  | rating |    bm25
----------------+--------+------------
 Generic socks  |      4 | -3.1081846
(1 row)
-- Search on a JSON field
SELECT metadata, metadata @@@ pgsearch.config('metadata.color:White') AS bm25
FROM mock_items ORDER BY bm25 LIMIT 1;
                 metadata                  |   bm25
-------------------------------------------+-----------
 {"color": "White", "location": "China"}   | -3.453373
(1 row)
-- Range search on a numeric field — BM25 score is not available, returns -1
SELECT description, rating, rating @@@ pgsearch.config('rating:[4 TO 5]') AS bm25
FROM mock_items ORDER BY bm25 LIMIT 1;
    description    | rating | bm25
-------------------+--------+------
 Plastic Keyboard  |      4 |   -1
(1 row)

Index management

Query index configuration

SELECT * FROM pgsearch.schema_bm25('search_idx'::regclass);

Delete an index

DROP INDEX search_idx;

Rebuild an index

REINDEX INDEX search_idx;
Important

Index deletion cannot be rolled back.

What's next