Full-text search converts natural language text into queryable data. AnalyticDB for PostgreSQL builds on the PostgreSQL engine to provide full-text search — letting you index, search, rank, and highlight text data without leaving your data warehouse.
Why full-text search
Standard SQL operators such as LIKE and ILIKE work for simple text matching, but fall short in three ways:
No linguistic support. LIKE 'satisfy%' misses satisfies, satisfied, and other word forms, and covering every form with OR conditions is error-prone at scale.
No relevance ranking. When thousands of rows match, there is no built-in way to order results by how well they match the query.
No index support. Every query scans the full table, which becomes prohibitively slow on large datasets.
Full-text search in AnalyticDB for PostgreSQL addresses all three limitations through precomputed indexes.
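The ranking gap in particular can be sketched with PostgreSQL's standard ts_rank function; the articles table and body column below are hypothetical:

```sql
-- Hypothetical table: articles(id bigint, body text)

-- LIKE only tests substring containment and offers no ordering by relevance:
SELECT id FROM articles WHERE body LIKE '%satisfy%';

-- Full-text search matches every word form and ranks results:
SELECT id,
       ts_rank(to_tsvector('english', body),
               to_tsquery('english', 'satisfy')) AS rank
FROM articles
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'satisfy')
ORDER BY rank DESC
LIMIT 10;
```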
How it works
AnalyticDB for PostgreSQL prepares text for search in three steps:
Tokenize — Parse text into tokens and classify them by type (numbers, adjectives, adverbs, and so on). The default parser handles most languages; custom parsers are available for specialized use cases.
Normalize — Map tokens to canonical word forms. For example, satisfies and satisfied both normalize to satisfy. Dictionaries drive this process; custom dictionaries are supported.
Index — Store normalized words in a tsvector, then build a Generalized Inverted Index (GIN) over it for fast lookups.
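The three steps can be sketched end to end; the document table below is hypothetical, and to_tsvector performs both tokenization and normalization in a single call:

```sql
-- Hypothetical table
CREATE TABLE document (id bigint, text text);

-- Steps 1–2: tokenize and normalize
SELECT to_tsvector('english', 'Satisfied customers satisfy auditors');

-- Step 3: build a GIN index over the normalized words
CREATE INDEX document_text_idx
    ON document USING GIN (to_tsvector('english', text));
```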
Key concepts
tsvector
A tsvector is a sorted list of normalized words with their positions in the source text. Use to_tsvector to convert a string:
SELECT to_tsvector('a fat cat jumped on a mat and ate two fat rats');
to_tsvector
---------------------------------------------------------------
'ate':9 'cat':3 'fat':2,11 'jump':4 'mat':7 'rat':12 'two':10
(1 row)

Notice that:
Conjunctions (a, on, and) are dropped as stop words.
jumped is normalized to jump, and rats to rat.
'fat':2,11 means fat appears at positions 2 and 11.
tsquery
A tsquery stores the words to search for. Use to_tsquery to build a query, then the @@ operator to test whether a tsvector contains a match:
-- Exact match (normalized): both 'cat' and 'cats' return true
SELECT to_tsvector('a fat cat jumped on a mat and ate two fat rats') @@ to_tsquery('cat');
?column?
----------
t
(1 row)
SELECT to_tsvector('a fat cat jumped on a mat and ate two fat rats') @@ to_tsquery('cats');
?column?
----------
t
(1 row)
-- Boolean OR: matches documents containing 'cat' or 'dog'
SELECT to_tsvector('a fat cat jumped on a mat and ate two fat rats') @@ to_tsquery('cat | dog');
?column?
----------
t
(1 row)
-- Boolean AND: no match because 'dog' is not in the text
SELECT to_tsvector('a fat cat jumped on a mat and ate two fat rats') @@ to_tsquery('cat & dog');
?column?
----------
f
(1 row)

Both cat and cats return t because to_tsquery normalizes the search term to its root form.
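Beyond & and |, to_tsquery also supports negation with !, and the standard PostgreSQL function plainto_tsquery builds an AND query directly from plain text; a brief sketch:

```sql
-- Negation: contains 'cat' but not 'dog' — returns t
SELECT to_tsvector('a fat cat jumped on a mat and ate two fat rats')
       @@ to_tsquery('cat & !dog');

-- plainto_tsquery normalizes plain text into an AND query: 'fat' & 'rat'
SELECT plainto_tsquery('fat rats');
```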
Distance search
Distance search is supported only on AnalyticDB for PostgreSQL V7.0 instances.
Use the phrase search operator <N> to match words separated by an exact distance of N positions:
-- 'cat' followed by 'jump' at distance 1: matches
SELECT to_tsvector('a fat cat jumped on a mat and ate two fat rats') @@ to_tsquery('cat<1>jump');
?column?
----------
t
(1 row)
-- 'cat' and 'mat' at distance 2: no match (actual distance is 4)
SELECT to_tsvector('a fat cat jumped on a mat and ate two fat rats') @@ to_tsquery('cat<2>mat');
?column?
----------
f
(1 row)
-- 'cat' and 'mat' at distance 4: matches
SELECT to_tsvector('a fat cat jumped on a mat and ate two fat rats') @@ to_tsquery('cat<4>mat');
?column?
----------
t
(1 row)

AnalyticDB for PostgreSQL extends the standard operator with <N,M> to match a range of distances from N to M:
-- 'cat' and 'mat' with distance between 1 and 5: matches
SELECT to_tsvector('a fat cat jumped on a mat and ate two fat rats') @@ to_tsquery('cat<1,5>mat');
?column?
----------
t
(1 row)

cat<1,5>mat matches any document where cat and mat are between 1 and 5 positions apart.
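In standard PostgreSQL, <-> is shorthand for <1>, and phraseto_tsquery builds a phrase query from plain text; a brief sketch:

```sql
-- <-> is equivalent to <1>: 'cat' immediately followed by 'jump'
SELECT to_tsvector('a fat cat jumped on a mat and ate two fat rats')
       @@ to_tsquery('cat <-> jump');

-- phraseto_tsquery produces 'cat' <-> 'jump' from plain text
SELECT phraseto_tsquery('cat jumped');
```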
Chinese word segmentation with zhparser
Chinese text has no spaces between words, so the default PostgreSQL parser cannot segment it correctly. AnalyticDB for PostgreSQL bundles the zhparser extension to address this.
zhparser is built on Simple Chinese Word Segmentation (SCWS), an open source engine written in C that uses word frequency dictionaries. It integrates as a PostgreSQL extension via a dynamic-link library, making it compatible with the standard full-text search framework, and supports custom configuration and custom dictionaries.
To enable Chinese word segmentation, create a text search configuration that uses the zhparser parser:
-- Create a text search configuration using the zhparser parser
CREATE TEXT SEARCH CONFIGURATION zh_cn (PARSER = zhparser);
-- Map word types: n=noun, v=verb, a=adjective, i=idiom, e=exclamation, l=temporary idiom, x=custom
ALTER TEXT SEARCH CONFIGURATION zh_cn ADD MAPPING FOR n,v,a,i,e,l,x WITH simple;

After this configuration is created, pass 'zh_cn' as the configuration argument to to_tsvector and to_tsquery.
For the full configuration guide, see Use zhparser to perform Chinese word segmentation.
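Once the zh_cn configuration exists, Chinese text segments through the same functions; the exact tokens produced depend on the active dictionaries, so output may differ between instances:

```sql
-- Segment a Chinese sentence with the zh_cn configuration
SELECT to_tsvector('zh_cn', '我爱北京天安门');

-- Match a segmented word the same way as English
SELECT to_tsvector('zh_cn', '我爱北京天安门') @@ to_tsquery('zh_cn', '北京');
```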
Custom dictionaries
If the built-in word frequency dictionary does not segment your domain-specific terms correctly, add entries to the zhparser.zhprs_custom_word system table:
Table "zhparser.zhprs_custom_word"
Column | Type | Collation | Nullable | Default
--------+------------------+-----------+----------+-----------------------
word | text | | not null |
tf | double precision | | | '1'::double precision
idf | double precision | | | '1'::double precision
attr | character(1) | | | '@'::bpchar
Indexes:
"zhprs_custom_word_pkey" PRIMARY KEY, btree (word)
Check constraints:
"zhprs_custom_word_attr_check" CHECK (attr = '@'::bpchar OR attr = '!'::bpchar)

| Column | Type | Description |
|---|---|---|
| word | text | The custom word (primary key) |
| tf | double precision | Term frequency weight (default: 1); part of TF-IDF (term frequency-inverse document frequency) scoring |
| idf | double precision | Inverse document frequency weight (default: 1) |
| attr | character(1) | @ = regular word segment; ! = stop word |
Custom dictionaries are stored in the data directory of each data node, making them database-level configuration.
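A minimal sketch of adding entries, assuming the sync_zhprs_custom_word() helper described in the zhparser documentation (verify the function name on your instance); the example words are illustrative:

```sql
-- Add a domain term as a regular word segment (attr defaults to '@')
INSERT INTO zhparser.zhprs_custom_word (word) VALUES ('云原生');

-- Register a stop word
INSERT INTO zhparser.zhprs_custom_word (word, attr) VALUES ('的', '!');

-- Load the updated dictionary; new sessions pick up the change
SELECT sync_zhprs_custom_word();
```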
Full-text search indexes
For large tables, use a Generalized Inverted Index (GIN) to speed up tsvector queries. An inverted index stores the mapping from each word to its positions in the dataset, enabling fast lookups instead of full table scans.
CREATE INDEX text_idx ON document USING GIN (to_tsvector('zh_cn', text));

Queries that apply the @@ operator to the same to_tsvector expression automatically use the GIN index.
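For the index to be used, the query must apply @@ to the same expression that was indexed, including the configuration argument; a sketch against the document table above (the id column is hypothetical):

```sql
-- Uses text_idx because the expression matches the indexed one exactly
SELECT id
FROM document
WHERE to_tsvector('zh_cn', text) @@ to_tsquery('zh_cn', '北京');
```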