PostgreSQL's default text parser splits input on spaces and punctuation, which works well for European languages where word boundaries are marked with spaces. Chinese text has no spaces between words, so the default parser treats an entire sentence as a single token and cannot match individual words. The zhparser extension solves this by applying a dedicated Chinese lexicon and segmentation algorithm, enabling full-text search over Chinese content in AnalyticDB for PostgreSQL.
Prerequisites
Before you begin, ensure that you have:
An AnalyticDB for PostgreSQL instance
The zhparser extension installed on the Extensions page of your instance (see Install, update, and uninstall extensions)
How full-text search works in PostgreSQL
Full-text search in PostgreSQL relies on two data types:
tsvector — a preprocessed document representation that stores normalized lexemes with their positions
tsquery — a search query expression
Two common query patterns:
Direct query:
SELECT name FROM <table>
WHERE to_tsvector('english', name) @@ to_tsquery('english', 'friend');
With a Generalized Inverted Index (GIN) index for better performance:
CREATE INDEX <idx_name> ON <table> USING gin(to_tsvector('english', name));
Once you configure zhparser, replace 'english' with 'zh_cn' in both patterns to enable Chinese segmentation.
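As a rough illustration of the Chinese variant, the following sketch applies both patterns with 'zh_cn'. It assumes the zh_cn configuration and token mappings from the Configure zhparser section are already in place. The table articles, its columns, and the index name idx_articles_content are hypothetical names chosen for this example.

```sql
-- Hypothetical table holding Chinese text.
CREATE TABLE articles (id int, content text);
INSERT INTO articles VALUES (1, '有两种方法进行全文检索');

-- GIN index on the segmented form of the column (assumes zh_cn exists).
CREATE INDEX idx_articles_content
    ON articles USING gin(to_tsvector('zh_cn', content));

-- The WHERE expression matches the indexed expression,
-- so the planner is able to use the index.
SELECT id, content
FROM articles
WHERE to_tsvector('zh_cn', content) @@ to_tsquery('zh_cn', '全文检索');
```

The index expression and the query expression must use the same configuration name, otherwise the index cannot serve the query.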
Configure zhparser
Step 1: Create a text search configuration
After installing the extension, create a text search configuration named zh_cn that uses zhparser as its parser:
CREATE TEXT SEARCH CONFIGURATION zh_cn (PARSER = zhparser);
To verify that the configuration was created, run \dF in psql to list text search configurations, or \dFp to list text search parsers.
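If you prefer a query over psql meta-commands, the same check can be sketched against the system catalogs. This assumes the standard PostgreSQL catalogs pg_ts_config and pg_ts_parser are readable on your instance.

```sql
-- Show the zh_cn configuration together with the parser it uses.
SELECT c.cfgname, p.prsname
FROM pg_ts_config AS c
JOIN pg_ts_parser AS p ON p.oid = c.cfgparser
WHERE c.cfgname = 'zh_cn';
```

If the configuration exists, the query should return one row that pairs zh_cn with zhparser.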
Step 2: Review available token types
zhparser classifies Chinese text into 26 token types. Run the following to see them:
SELECT * FROM ts_token_type('zhparser');
The output lists all available types:
tokid | alias | description
-------+-------+--------------------
97 | a | adjective
98 | b | differentiation
99 | c | conjunction
100 | d | adverb
101 | e | exclamation
102 | f | position
103 | g | root
104 | h | head
105 | i | idiom
106 | j | abbreviation
107 | k | tail
108 | l | tmp
109 | m | numeral
110 | n | noun
111 | o | onomatopoeia
112 | p | prepositional
113 | q | quantity
114 | r | pronoun
115 | s | space
116 | t | time
117 | u | auxiliary
118 | v | verb
119 | w | punctuation
120 | x | unknown
121 | y | modal
122 | z | status
(26 rows)
To check the current zh_cn configuration:
SELECT * FROM pg_ts_config_map
WHERE mapcfg = (SELECT oid FROM pg_ts_config WHERE cfgname = 'zh_cn');
Step 3: Map token types to a dictionary
Add mappings to specify which token types are indexed. The following example maps nouns, verbs, adjectives, idioms, exclamations, and temporary idioms to the simple dictionary:
ALTER TEXT SEARCH CONFIGURATION zh_cn ADD MAPPING FOR n,v,a,i,e,l WITH simple;
To remove those mappings:
ALTER TEXT SEARCH CONFIGURATION zh_cn DROP MAPPING IF EXISTS FOR n,v,a,i,e,l;
Step 4: Verify segmentation
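To see which token type the parser assigns to each segment, you can inspect the raw parser output. Only token types that have a mapping end up in the index, so this helps explain why a given word is missing from a tsvector. The following is a minimal sketch that assumes the standard PostgreSQL functions ts_parse and ts_token_type are available on your instance. The sample sentence is the one used throughout this topic.

```sql
-- Segment the sentence with the zhparser parser and label each token
-- with its type alias and description.
SELECT p.token, t.alias, t.description
FROM ts_parse('zhparser', '有两种方法进行全文检索') AS p
JOIN ts_token_type('zhparser') AS t ON t.tokid = p.tokid;
```

Tokens whose alias is not covered by a mapping on zh_cn are dropped by to_tsvector and to_tsquery.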
Test to_tsvector and to_tsquery with the zh_cn configuration:
SELECT to_tsvector('zh_cn', '有两种方法进行全文检索');
to_tsvector
--------------------------------------
'全文检索':4 '方法':2 '有':1 '进行':3
(1 row)
SELECT to_tsquery('zh_cn', '有两种方法进行全文检索');
to_tsquery
-------------------------------------
'有' & '方法' & '进行' & '全文检索'
(1 row)
Custom dictionaries
zhparser supports a custom dictionary table — zhparser.zhprs_custom_word — for adding domain-specific terms or stop words. The table is created automatically when you install the extension.
Table structure
CREATE TABLE zhparser.zhprs_custom_word
(
word text PRIMARY KEY, -- Custom word
tf FLOAT DEFAULT '1.0', -- Term frequency (TF). Default: 1.0.
idf FLOAT DEFAULT '1.0', -- Inverse document frequency (IDF). Default: 1.0.
attr CHAR DEFAULT '@', CHECK(attr = '@' OR attr = '!') -- Word type: @ (new word), ! (stop word).
);
Configure the custom dictionary
Add a mapping so that unknown tokens (type x) are looked up in the custom dictionary:
ALTER TEXT SEARCH CONFIGURATION zh_cn ADD MAPPING FOR x WITH simple;
Add and remove words
Add a word to the custom dictionary:
INSERT INTO zhparser.zhprs_custom_word(word, attr) VALUES('两种方法', '@');
Remove a word:
DELETE FROM zhparser.zhprs_custom_word WHERE word = '两种方法';
Query all entries:
SELECT * FROM zhparser.zhprs_custom_word;
Reload and verify
After any change to zhparser.zhprs_custom_word, reload the table for the changes to take effect:
SELECT sync_zhprs_custom_word();
Run the same query before and after to confirm the effect:
SELECT to_tsvector('zh_cn', '有两种方法进行全文检索');
Before adding 两种方法, the phrase splits into individual tokens:
to_tsvector
--------------------------------------
'全文检索':4 '方法':2 '有':1 '进行':3
(1 row)
After adding 两种方法, the phrase is treated as a single unit:
to_tsvector
----------------------------------------------
'两种方法':2 '全文检索':4 '有':1 '进行':3
(1 row)
What's next
Full-Text Search — PostgreSQL full-text search reference
Text Search Functions and Operators — functions and operators for full-text search