AnalyticDB: Zhparser

Last Updated: Mar 28, 2026

PostgreSQL's default text parser splits input on spaces and punctuation, which works well for European languages where word boundaries are marked with spaces. Chinese text has no spaces between words, so the default parser treats an entire sentence as a single token and cannot match individual words. The zhparser extension solves this by applying a dedicated Chinese lexicon and segmentation algorithm, enabling full-text search over Chinese content in AnalyticDB for PostgreSQL.

Prerequisites

Before you begin, ensure that the zhparser extension is installed on your AnalyticDB for PostgreSQL instance.

How full-text search works in PostgreSQL

Full-text search in PostgreSQL relies on two data types:

  • tsvector — a preprocessed document representation that stores normalized lexemes with their positions

  • tsquery — a search query expression

Two common query patterns:

Direct query:

SELECT name FROM <table>
WHERE to_tsvector('english', name) @@ to_tsquery('english', 'friend');

With a Generalized Inverted Index (GIN) for better performance:

CREATE INDEX <idx_name> ON <table> USING gin(to_tsvector('english', name));

Once you configure zhparser, replace 'english' with 'zh_cn' in both patterns to enable Chinese segmentation.
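For example, the two patterns above might look like the following with Chinese segmentation. The table name products, the column name, the index name, and the search term are placeholders for illustration:

```sql
-- GIN index over the zh_cn-segmented column (hypothetical names)
CREATE INDEX idx_products_name_zh ON products
    USING gin(to_tsvector('zh_cn', name));

-- Direct query using the same configuration
SELECT name FROM products
WHERE to_tsvector('zh_cn', name) @@ to_tsquery('zh_cn', '全文检索');
```

The index is used only when the query expression matches the indexed expression, so both must name the same configuration ('zh_cn').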

Configure zhparser

Step 1: Create a text search configuration

After installing the extension, create a text search configuration named zh_cn that uses zhparser as its parser:

CREATE TEXT SEARCH CONFIGURATION zh_cn (PARSER = zhparser);

To verify the configuration was created, run \dF or \dFp in psql.
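If you are not connecting through psql, you can check the standard PostgreSQL catalogs directly instead of using the meta-commands:

```sql
-- Confirm that the zh_cn configuration exists and which parser it uses
SELECT c.cfgname, p.prsname
FROM pg_ts_config c
JOIN pg_ts_parser p ON p.oid = c.cfgparser
WHERE c.cfgname = 'zh_cn';
```

The query should return one row with cfgname zh_cn and prsname zhparser.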

Step 2: Review available token types

zhparser classifies Chinese text into 26 token types. Run the following to see them:

SELECT ts_token_type('zhparser');

The output lists all available types:

 tokid | alias |      description
-------+-------+--------------------
    97 | a     | adjective
    98 | b     | differentiation
    99 | c     | conjunction
   100 | d     | adverb
   101 | e     | exclamation
   102 | f     | position
   103 | g     | root
   104 | h     | head
   105 | i     | idiom
   106 | j     | abbreviation
   107 | k     | tail
   108 | l     | tmp
   109 | m     | numeral
   110 | n     | noun
   111 | o     | onomatopoeia
   112 | p     | prepositional
   113 | q     | quantity
   114 | r     | pronoun
   115 | s     | space
   116 | t     | time
   117 | u     | auxiliary
   118 | v     | verb
   119 | w     | punctuation
   120 | x     | unknown
   121 | y     | modal
   122 | z     | status
(26 rows)

To check the current zh_cn configuration:

SELECT * FROM pg_ts_config_map
WHERE mapcfg = (SELECT oid FROM pg_ts_config WHERE cfgname = 'zh_cn');

Step 3: Map token types to a dictionary

Add mappings to specify which token types are indexed. The following example maps nouns, verbs, adjectives, idioms, exclamations, and temporary idioms to the simple dictionary:

ALTER TEXT SEARCH CONFIGURATION zh_cn ADD MAPPING FOR n,v,a,i,e,l WITH simple;

To remove those mappings:

ALTER TEXT SEARCH CONFIGURATION zh_cn DROP MAPPING IF EXISTS FOR n,v,a,i,e,l;
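To inspect how a sample sentence is tokenized and which token types are actually mapped to a dictionary, you can use PostgreSQL's built-in ts_debug function (sentence reused from the verification example below):

```sql
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('zh_cn', '有两种方法进行全文检索');
```

Tokens whose type has no mapping show an empty dictionaries list and a NULL lexemes column, meaning they are dropped from the index.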

Step 4: Verify segmentation

Test to_tsvector and to_tsquery with the zh_cn configuration:

SELECT to_tsvector('zh_cn', '有两种方法进行全文检索');
              to_tsvector
--------------------------------------
 '全文检索':4 '方法':2 '有':1 '进行':3
(1 row)

SELECT to_tsquery('zh_cn', '有两种方法进行全文检索');
              to_tsquery
-------------------------------------
 '有' & '方法' & '进行' & '全文检索'
(1 row)
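Putting the two together, a direct match against the zh_cn configuration can be tested without a table (query terms chosen from the segmentation above):

```sql
SELECT to_tsvector('zh_cn', '有两种方法进行全文检索')
       @@ to_tsquery('zh_cn', '方法 & 全文检索') AS matched;
-- matched is true when every term in the tsquery appears in the tsvector
```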

Custom dictionaries

zhparser supports a custom dictionary table — zhparser.zhprs_custom_word — for adding domain-specific terms or stop words. The table is created automatically when you install the extension.

Table structure

CREATE TABLE zhparser.zhprs_custom_word
(
    word text PRIMARY KEY,                                    -- Custom word
    tf   FLOAT DEFAULT '1.0',                                -- Term frequency (TF). Default: 1.0.
    idf  FLOAT DEFAULT '1.0',                                -- Inverse document frequency (IDF). Default: 1.0.
    attr CHAR  DEFAULT '@', CHECK(attr = '@' OR attr = '!')  -- Word type: @ (new word), ! (stop word).
);

Configure the custom dictionary

Add a mapping so that unknown tokens (type x) are looked up in the custom dictionary:

ALTER TEXT SEARCH CONFIGURATION zh_cn ADD MAPPING FOR x WITH simple;

Add and remove words

Add a word to the custom dictionary:

INSERT INTO zhparser.zhprs_custom_word(word, attr) VALUES('两种方法', '@');

Remove a word:

DELETE FROM zhparser.zhprs_custom_word WHERE word = '两种方法';

Query all entries:

SELECT * FROM zhparser.zhprs_custom_word;

Reload and verify

After any change to zhparser.zhprs_custom_word, reload the table for the changes to take effect:

SELECT sync_zhprs_custom_word();

Run the same query before and after to confirm the effect:

SELECT to_tsvector('zh_cn', '有两种方法进行全文检索');

Before adding `两种方法` — the phrase splits into individual tokens:

              to_tsvector
--------------------------------------
 '全文检索':4 '方法':2 '有':1 '进行':3
(1 row)

After adding `两种方法` — the phrase is treated as a single unit:

                  to_tsvector
----------------------------------------------
 '两种方法':2 '全文检索':4 '有':1 '进行':3
(1 row)
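The attr column also lets you suppress terms. As a sketch, the following registers the common verb 有 as a stop word (attr = '!') and reloads the dictionary; whether a given word should be a stop word depends on your corpus:

```sql
-- Register 有 as a stop word, then reload the custom dictionary
INSERT INTO zhparser.zhprs_custom_word(word, attr) VALUES ('有', '!');
SELECT sync_zhprs_custom_word();

-- 有 should no longer appear in the segmentation output
SELECT to_tsvector('zh_cn', '有两种方法进行全文检索');
```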
