The pgsearch extension brings BM25-based full-text search to AnalyticDB for PostgreSQL. Built on Tantivy, a high-performance search engine library, it ranks results by BM25 relevance, the same algorithm that powers Elasticsearch. Use it to search keywords and phrases across large tables directly from SQL, without an external search service.
Prerequisites
Before you begin, ensure that you have:
An AnalyticDB for PostgreSQL instance
A submitted ticket to install the pgsearch extension (no console-based installation is available)
Restarted your instance after the extension is installed
To uninstall the pgsearch extension, submit a ticket.
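After the restart, one quick way to confirm the extension is active is to query the standard pg_extension catalog. This is a sketch and assumes the extension registers under the name pgsearch:

```sql
-- Check that the pgsearch extension is installed and see its version
SELECT extname, extversion
FROM pg_extension
WHERE extname = 'pgsearch';
```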
Quick start
This section walks through the full workflow: create a test table, build a BM25 index, and run your first search.
1. Create a test table.
CALL pgsearch.create_test_table(table_name => 'mock_items', schema_name => 'public');
This creates a mock_items table with the following schema:
CREATE TABLE mock_items (
description TEXT,
rating INTEGER CHECK (rating BETWEEN 1 AND 5),
category VARCHAR(255),
in_stock BOOLEAN,
metadata JSONB,
created_at TIMESTAMP,
last_updated_date DATE,
latest_available_time TIME
);
2. Create a BM25 index.
CALL pgsearch.create_bm25(
index_name => 'search_idx',
table_name => 'mock_items',
text_fields => '{description: {tokenizer: {type: "jieba"}}, category: {}}',
datetime_fields => '{created_at: {}, last_updated_date: {}}',
numeric_fields => '{rating: {}}',
json_fields => '{metadata: {tokenizer: {type: "en_stem"}}}',
boolean_fields => '{in_stock: {}}'
);
3. Run a search.
SELECT * FROM mock_items
ORDER BY description @@@ pgsearch.config('description:keyboard')
LIMIT 10;
Results are ranked by BM25 score; the most relevant rows appear first.
Create a BM25 index
Call pgsearch.create_bm25() to create an index. Create only one BM25 index per table — multiple indexes consume excessive resources.
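Before creating a new index, it can help to confirm the table does not already have a BM25 index. A minimal sketch using the standard pg_indexes view (the table name is illustrative):

```sql
-- List existing indexes on the target table before creating another one
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'mock_items';
```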
Index creation and deletion cannot be rolled back.
Syntax
CALL pgsearch.create_bm25(
index_name => '<index_name>',
table_name => '<table_name>',
schema_name => '<schema_name>',
text_fields => '<text_fields> | pgsearch.field()',
numeric_fields => '<numeric_fields> | pgsearch.field()',
boolean_fields => '<boolean_fields> | pgsearch.field()',
json_fields => '<json_fields> | pgsearch.field()',
datetime_fields => '<datetime_fields> | pgsearch.field()'
)
Parameters
| Parameter | Required | Supported data types | Description |
|---|---|---|---|
| index_name | Yes | STRING | Name of the index. |
| table_name | Yes | STRING | Name of the table to index. |
| schema_name | No | STRING | Schema of the table. Defaults to the current schema. |
| text_fields | Yes (at least one) | VARCHAR, VARCHAR[], TEXT, TEXT[] | Text fields to index, with per-field configuration. Use a JSON5 string or pgsearch.field(). |
| numeric_fields | No | INT2, INT2[], INT4, INT4[], INT8, INT8[], OID, OID[], XID, XID[], FLOAT4, FLOAT4[], FLOAT8, FLOAT8[], NUMERIC, NUMERIC[] | Numeric fields to index. |
| boolean_fields | No | BOOLEAN, BOOLEAN[] | Boolean fields to index. |
| json_fields | No | JSON, JSONB | JSON fields to index. Supports full-text search on nested text fields within JSON values. |
| datetime_fields | No | DATE, DATE[], TIMESTAMP, TIMESTAMP[], TIMESTAMPTZ, TIMESTAMPTZ[], TIME, TIME[], TIMETZ, TIMETZ[] | Date and time fields to index. Defaults to UTC timezone and RFC 3339 format. |
Field configuration options
Each field accepts a JSON5 configuration string or pgsearch.field(). The following options apply:
| Option | Description | Default |
|---|---|---|
| fast | Enables fast random access for scoring and filtering. | false for text and JSON fields; true for numeric, boolean, and datetime fields |
| fieldnorms | Stores field length. Required for BM25 score computation. | true |
| tokenizer | Tokenizer and its configuration. See Tokenizers. | — |
| record | Index record type: raw, freq, or position. position enables phrase search. | position |
| expand_dots | (JSON fields only) Expands JSON keys that contain periods. For example, {"metadata.color": "red"} is indexed as {"metadata": {"color": "red"}}. | true |
pgsearch.field() function
pgsearch.field() generates the index configuration for a single field. To configure multiple fields, concatenate multiple calls with ||.
pgsearch.field(<name>, <fast>, <fieldnorms>, <record>, <expand_dots>, <tokenizer>)
AnalyticDB for PostgreSQL V7.2.1.0 and later require pgsearch.field() when creating an index. Without it, custom word segmentation dictionaries and stop word dictionaries for the jieba tokenizer do not take effect.
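As a minimal sketch of the || pattern (field names follow the mock_items table from the quick start; a fuller example appears under Examples below):

```sql
-- Sketch: two text fields configured with pgsearch.field(),
-- concatenated with || into a single text_fields argument
CALL pgsearch.create_bm25(
    index_name => 'search_idx',
    table_name => 'mock_items',
    text_fields => pgsearch.field('description',
                       tokenizer => pgsearch.tokenizer('jieba'))
                || pgsearch.field('category')
);
```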
Tokenizers
The pgsearch extension includes built-in tokenizers for English, Chinese, Korean, and Japanese. No additional extensions are needed.
| Tokenizer | Description | Configuration |
|---|---|---|
| default | Splits on spaces and punctuation, lowercases, removes tokens longer than 255 bytes. | {type: "default"} |
| raw | Treats the entire field value as a single token. | {type: "raw"} |
| en_stem | Splits on spaces and punctuation, lowercases, and applies English stemming. Removes tokens longer than 40 characters. Use for JSON fields with English content. | {type: "en_stem"} |
| whitespace | Splits on whitespace only. | {type: "whitespace"} |
| ngram | Generates n-gram sequences. Parameters: min_gram, max_gram, prefix_only. | {type: "ngram", min_gram: 1, max_gram: 2, prefix_only: true} |
| chinese_compatible | Splits on spaces and punctuation. Each Chinese character becomes a separate token; consecutive non-Chinese characters form one token. For example, "我爱吃橙子 oranges!12" → 我, 爱, 吃, 橙, 子, oranges, 12. | {type: "chinese_compatible"} |
| chinese_lindera | Chinese tokenizer using the CC-CEDICT dictionary. | {type: "chinese_lindera"} |
| korean_lindera | Korean tokenizer using the KoDic dictionary. | {type: "korean_lindera"} |
| japanese_lindera | Japanese tokenizer using the IPADIC dictionary. | {type: "japanese_lindera"} |
| jieba | Chinese tokenizer using the jieba library. Supports custom word segmentation dictionaries and stop word dictionaries. Additional parameters: hmm (Hidden Markov Model for out-of-vocabulary words, default true) and search (fine-grained segmentation, default true). For more information, see Configure a custom word segmentation dictionary and Configure a stop word dictionary. | {type: "jieba", hmm: true, search: true} |
The hmm and search parameters for the jieba tokenizer are available on AnalyticDB for PostgreSQL V7.2.1.0 and later.
Verify tokenizer output before creating an index.
Use pgsearch.tokenizer() to see how a tokenizer segments text before committing to an index.
-- Check how jieba segments Chinese text
SELECT pgsearch.tokenizer('{type: "jieba"}', '数据仓库');
-- Check how ngram segments text
SELECT pgsearch.tokenizer('{type: "ngram", min_gram: 1, max_gram: 2, prefix_only: true}', 'hell');
pgsearch.tokenizer() function
pgsearch.tokenizer() generates a tokenizer configuration string. Use it with pgsearch.field() to configure tokenizers and apply filters such as stemming and stop word removal.
pgsearch.tokenizer(<name>, <min_gram>, <max_gram>, <prefix_only>, <search>, <hmm>, <dict>, <stopword>, <lowercase>, <remove_long>, <stemmer>)
| Parameter | Description |
|---|---|
| name | Tokenizer name. See the tokenizer table above for valid values. |
| min_gram, max_gram, prefix_only | ngram tokenizer parameters. |
| search, hmm | jieba tokenizer parameters. |
| dict | Custom word segmentation dictionary name for jieba. See Configure a custom word segmentation dictionary. |
| stopword | Stop word dictionary name. See Configure a stop word dictionary. |
| lowercase | Converts keywords to lowercase. Default: true. |
| remove_long | Removes tokens whose length in bytes equals or exceeds this value. Each Chinese character is 3 bytes. |
| stemmer | Extracts word stems. Set to en for English. For example, run, running, runs, and ran all resolve to run. |
AnalyticDB for PostgreSQL V7.2.1.0 and later require pgsearch.tokenizer() to apply custom word segmentation dictionaries and stop word dictionaries for the jieba tokenizer.
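As a quick sanity check of the filter parameters, the stemmer can be exercised on its own. A sketch; parameter support may vary by version:

```sql
-- Verify that the en stemmer reduces inflected forms to a common stem
SELECT pgsearch.tokenizer(
    pgsearch.tokenizer('default', stemmer => 'en')::text,
    'run running runs'
);
```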
Examples
Use a JSON5 configuration string:
-- ngram tokenizer — all parameters must be specified
CALL pgsearch.create_bm25(
index_name => 'search_idx',
table_name => 'mock_items',
text_fields => '{description: {tokenizer: {type: "ngram", min_gram: 2, max_gram: 3, prefix_only: false}}}'
);
-- jieba tokenizer
CALL pgsearch.create_bm25(
index_name => 'search_idx',
table_name => 'mock_items',
text_fields => '{description: {tokenizer: {type: "jieba"}}}'
);
-- Index multiple field types
CALL pgsearch.create_bm25(
index_name => 'search_idx',
table_name => 'mock_items',
text_fields => '{description: {tokenizer: {type: "jieba"}}, category: {}}',
datetime_fields => '{created_at: {}, last_updated_date: {}}',
numeric_fields => '{rating: {}}',
json_fields => '{metadata: {tokenizer: {type: "en_stem"}}}',
boolean_fields => '{in_stock: {}}'
);
Use `pgsearch.field()` and `pgsearch.tokenizer()` (required on V7.2.1.0 and later):
-- Configure description with fast access, position records, and a jieba tokenizer
-- using a custom dictionary and built-in CN_SIMPLE stop words.
-- Concatenate multiple field configurations with ||.
CALL pgsearch.create_bm25(
index_name => 'search_idx',
table_name => 'mock_items',
text_fields => pgsearch.field('description',
fast => true,
record => 'position',
tokenizer => pgsearch.tokenizer('jieba',
search => false,
dict => 'user_dict',
stopword => 'CN_SIMPLE'))
|| pgsearch.field('category')
);
-- Verify jieba tokenizer output with filters applied
SELECT pgsearch.tokenizer(
pgsearch.tokenizer('jieba',
search => false, dict => 'user_dict', stopword => 'CN_SIMPLE',
lowercase => false, remove_long => 27, stemmer => 'en')::text,
'永和服装饰品有限公司。 Shoping'
);
Search syntax
All searches use the @@@ operator with pgsearch.config(). Results are sorted by BM25 score in descending order.
SELECT * FROM <table_name>
ORDER BY <index_col> @@@ pgsearch.config('<query>')
LIMIT <n>;
<index_col>: an indexed field name. Use the field that the search keyword targets.
<query>: the search expression. When you pass a string, pgsearch.parse() parses it implicitly.
The following two statements are equivalent:
SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config('description:socks');
SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(query => pgsearch.parse('description:socks'));
Basic queries
Search a text field
Search for a word or phrase. Enclose multi-word phrases in double quotation marks.
-- Search for the word "keyboard"
SELECT * FROM mock_items
ORDER BY description @@@ pgsearch.config('description:keyboard');
-- Search for the exact phrase "hello world"
SELECT * FROM mock_items
ORDER BY description @@@ pgsearch.config('description:"hello world"');
Search a JSON field
-- Search for rows where metadata contains {"color": "white"}
SELECT * FROM mock_items
ORDER BY metadata @@@ pgsearch.config('metadata.color:white');
Search a datetime field
Datetime fields use UTC timezone and RFC 3339 format.
SELECT * FROM mock_items
ORDER BY created_at @@@ pgsearch.config('created_at:"2023-05-01T09:12:34Z"')
LIMIT 10;
Filter on numeric and Boolean fields
Use filter expressions inside pgsearch.config() for numeric and Boolean fields. Filters on indexed fields are faster than standard SQL WHERE clauses.
-- Full-text search on description, filtered to rating < 4
SELECT * FROM mock_items
ORDER BY description @@@ pgsearch.config('description:keyboard AND rating:<4');
Proximity search
Use ~N to match terms separated by up to N words.
-- Match "ergonomic keyboard" with at most 1 word between them
SELECT * FROM mock_items
ORDER BY description @@@ pgsearch.config('description:"ergonomic keyboard"~1');
Boolean operators
Combine search terms with AND, OR, and NOT. Use parentheses to group conditions.
SELECT * FROM mock_items
ORDER BY description @@@ pgsearch.config('description:keyboard OR category:toy');
Rank improvement
Use ^ followed by a multiplier to boost the BM25 score of specific terms, pushing matching rows higher in results.
SELECT * FROM mock_items
ORDER BY description @@@ pgsearch.config('description:keyboard^2 OR category:electronics^3');
Set search (IN operator)
The IN operator is equivalent to multiple OR conditions but uses less CPU. Each phrase must produce a single token.
Use IN for English only. Chinese word segmentation produces unpredictable token counts, making token-level IN unreliable. Use OR for Chinese.
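When the keywords are Chinese, use OR instead, for example (a sketch; the terms are illustrative):

```sql
-- Chinese terms: prefer OR over IN, since jieba segmentation may emit
-- more than one token per phrase
SELECT * FROM mock_items
ORDER BY description @@@ pgsearch.config('description:袜子 OR description:键盘')
LIMIT 10;
```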
SELECT * FROM mock_items
ORDER BY description @@@ pgsearch.config('description:IN [keyboard toy]');
Pagination
SELECT * FROM mock_items
ORDER BY description @@@ pgsearch.config('description:socks')
OFFSET 2 LIMIT 10;
Advanced queries
Term set search
Search for rows that match any term in an array.
SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
query => pgsearch.term_set(
terms => ARRAY[
pgsearch.term(field => 'description', value => 'socks'),
pgsearch.term(field => 'description', value => 'novel')
]
)
);
The tokenizer lowercases indexed keywords. Use lowercase values in pgsearch.term().
| Parameter | Description |
|---|---|
| field | Field to search. Leave blank to search all indexed fields. |
| terms | Array of terms to match. |
Phrase search
Search for rows containing an array of keywords in order. The index record option must be set to position.
SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
query => pgsearch.phrase(
field => 'description',
phrases => ARRAY['little', 'red', 'riding', 'hood'],
slop => 0
)
);
| Parameter | Description |
|---|---|
| field | Field to search. Leave blank to search all indexed fields. |
| phrases | Ordered array of keywords. All must appear in the row in the same order. |
| slop | Maximum distance between adjacent keywords. 0 means the keywords must be consecutive. Values greater than 0 allow other words between them. |
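For instance, raising slop to 1 also matches rows where one extra word separates two adjacent keywords (a sketch based on the query above):

```sql
-- slop => 1: also matches e.g. "little red fancy riding hood",
-- where one word intervenes between adjacent keywords
SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
    query => pgsearch.phrase(
        field => 'description',
        phrases => ARRAY['little', 'red', 'riding', 'hood'],
        slop => 1
    )
);
```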
Phrase prefix search
Search for rows that start with an ordered sequence of keywords.
SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
query => pgsearch.phrase_prefix(
field => 'description',
phrases => ARRAY['little', 're'],
max_expansion => 1
)
);
| Parameter | Description |
|---|---|
| field | Field to search. Leave blank to search all indexed fields. |
| phrases | Prefix keyword array. For example, ARRAY['little', 're'] matches "little red riding hood". |
| max_expansion | Maximum number of term variants expanded from the prefix. Narrows the search scope. |
Tokenizer terms search
Segment the query with a specified tokenizer, then search the resulting tokens. Supports Boolean logic between tokens.
SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
pgsearch.tokenizer_terms('description', '朝阳百货', pgsearch.tokenizer('jieba'), 'OR')
);
| Parameter | Description |
|---|---|
| tokenizer | Tokenizer configuration string from pgsearch.tokenizer(). Default: jieba. |
| operator | Boolean operator between tokens. OR returns rows matching any token. AND requires all tokens. Default: OR. |
Fuzzy term search
Match terms within a specified edit distance (Levenshtein distance), returning results even if the spelling is imperfect.
SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
query => pgsearch.fuzzy_term(
field => 'description',
value => 'wow',
distance => 2,
transposition_cost_one => true,
prefix => true
)
);
| Parameter | Description |
|---|---|
| field | Field to search. Leave blank to search all indexed fields. |
| value | Search keyword. Fuzzy matching uses Levenshtein distance. |
| distance | Maximum edit distance. Default: 2. Maximum: 2. |
| transposition_cost_one | Counts swapping two adjacent characters as one edit instead of two. Default: true. |
| prefix | Excludes the keyword prefix from the edit distance calculation. Default: true. |
Range search
Search for values within a numeric or datetime range.
SELECT * FROM mock_items ORDER BY rating @@@ pgsearch.config(
query => pgsearch.range(
field => 'rating',
range => '[1,4)'::int4range
)
);
| Parameter | Description |
|---|---|
| field | Field to search. Leave blank to search all indexed fields. |
| range | Value range. Supported types: INT4RANGE, INT8RANGE, DATERANGE, TSRANGE, TSTZRANGE. |
Regular expression search
Match rows containing words that fit a regular expression pattern.
SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
query => pgsearch.regex(
field => 'description',
pattern => '(glass|screen|like|cloth|phone)'
)
);
| Parameter | Description |
|---|---|
| field | Field to search. Leave blank to search all indexed fields. |
| pattern | Regular expression pattern. |
Boolean query
Combine subqueries with must, must_not, and should conditions.
SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
query => pgsearch.boolean(
should => ARRAY[
pgsearch.parse('description:socks'),
pgsearch.phrase_prefix(field => 'description', phrases => ARRAY['book']),
pgsearch.term(field => 'description', value => 'writer'),
pgsearch.fuzzy_term(field => 'description', value => 'wow')
],
must_not => ARRAY[
pgsearch.term(field => 'description', value => 'writer')
],
must => ARRAY[
pgsearch.term(field => 'rating', value => 4)
]
)
);
| Parameter | Description |
|---|---|
| must | Conditions that all matching rows must satisfy. |
| must_not | Conditions that exclude rows when matched. |
| should | Conditions where at least one must match (when must is not set). |
Rank boost
Multiply the BM25 score of a subquery by a constant to push its results higher in rankings.
SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
query => pgsearch.boost(query => pgsearch.parse('description:socks'), boost => 2)
);
| Parameter | Description |
|---|---|
| boost | Multiplier applied to each result score. |
| query | The subquery to boost. |
Constant score search
Assign a fixed score to all rows matching a subquery, bypassing BM25 score calculations.
SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
query => pgsearch.const_score(query => pgsearch.all(), score => 2)
);
| Parameter | Description |
|---|---|
| score | Fixed score assigned to each matching row. |
| query | The subquery to evaluate. |
Maximum disjunction search
Match rows against multiple subqueries. Rows matching more subqueries receive higher scores.
SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
query => pgsearch.disjunction_max(
disjuncts => ARRAY[
pgsearch.parse('description:socks'),
pgsearch.parse('description:Generic')
],
tie_breaker => 0.75
)
);
| Parameter | Description |
|---|---|
| disjuncts | Array of subqueries. |
| tie_breaker | Score adjustment for rows matching multiple subqueries. The final score is: best_match_score + tie_breaker × sum_of_other_scores. For example, if a row scores 1.0 on the first subquery and 0.5 on the second, the final score is 1.0 + 0.75 × 0.5 = 1.375. |
All search
Return all rows that have the indexed field, each with a score of 1.0.
SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
query => pgsearch.all()
);
Empty search
Return no rows. Use as a placeholder in test scenarios or edge cases.
SELECT * FROM mock_items ORDER BY description @@@ pgsearch.config(
query => pgsearch.empty()
);
Get BM25 scores
Place the @@@ expression in the SELECT list and alias it to retrieve the BM25 score for each row.
AnalyticDB for PostgreSQL returns scores as negative values so that ORDER BY score (ascending) naturally ranks the highest-scoring rows first. A score of -2.86 means the actual BM25 score is 2.86.
BM25 scores are computed only for text_fields and json_fields. Searches on numeric_fields, datetime_fields, and boolean_fields return -1 as the score.
SELECT *, mock_items @@@ pgsearch.config(<query>) AS bm25
FROM mock_items
ORDER BY bm25;
Examples:
-- Full-text search on a text field
SELECT description, rating, description @@@ pgsearch.config('description:socks') AS bm25
FROM mock_items ORDER BY bm25 LIMIT 1;
 description | rating | bm25
----------------+--------+------------
Generic socks | 4 | -2.1048825
(1 row)
-- Full-text search with a numeric filter
SELECT description, rating, description @@@ pgsearch.config('description:socks AND rating:4') AS bm25
FROM mock_items ORDER BY bm25 LIMIT 4;
 description | rating | bm25
----------------+--------+------------
Generic socks | 4 | -3.1081846
(1 row)
-- Search on a JSON field
SELECT metadata, metadata @@@ pgsearch.config('metadata.color:White') AS bm25
FROM mock_items ORDER BY bm25 LIMIT 1;
 metadata | bm25
-------------------------------------------+-----------
{"color": "White", "location": "China"} | -3.453373
(1 row)
-- Range search on a numeric field; BM25 score is not available, returns -1
SELECT description, rating, rating @@@ pgsearch.config('rating:[4 TO 5]') AS bm25
FROM mock_items ORDER BY bm25 LIMIT 1;
 description | rating | bm25
-------------------+--------+------
Plastic Keyboard | 4 | -1
(1 row)
Index management
Query index configuration
SELECT * FROM pgsearch.schema_bm25('search_idx'::regclass);
Delete an index
DROP INDEX search_idx;
Rebuild an index
REINDEX INDEX search_idx;
Index deletion cannot be rolled back.