The zhparser extension enables Chinese full-text search on ApsaraDB RDS for PostgreSQL. Unlike English, Chinese text has no spaces between words, so PostgreSQL's built-in parser cannot segment it correctly. zhparser segments Chinese text based on semantics, enabling accurate full-text indexing and search.
Prerequisites
Before you begin, make sure that:
- The RDS instance runs PostgreSQL 10 or later.
- The minor engine version is 20230830 or later. For PostgreSQL 17, the minor engine version must be 20241030 or later.
- zhparser is added to the shared_preload_libraries parameter of the instance. For instructions, see Modify the parameters of an ApsaraDB RDS for PostgreSQL instance.
New and recreated extensions require minor engine version 20230830 or later. If your instance runs an earlier version, update it before creating the extension. See Update the minor engine version. If the extension is already installed on an earlier version, it continues to work. For more information, see [Product changes/Feature changes] Limits on extension creation for ApsaraDB RDS for PostgreSQL instances.
Enable zhparser
Run the following statements to create the extension and configure a text search configuration named testzhcfg:
CREATE EXTENSION zhparser;
CREATE TEXT SEARCH CONFIGURATION testzhcfg (PARSER = zhparser);
ALTER TEXT SEARCH CONFIGURATION testzhcfg ADD MAPPING FOR n,v,a,i,e,l WITH simple;
-- Optional: enable short-word compounding for more granular segmentation
ALTER ROLE CURRENT_ROLE SET zhparser.multi_short=on;
The ADD MAPPING FOR n,v,a,i,e,l clause maps six token types (n noun, v verb, a adjective, i idiom, e exclamation, l idiomatic phrase) to the simple dictionary, so only tokens of these types are indexed.
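To see the full set of token types the parser exposes, and decide whether your workload needs additional mappings, you can query PostgreSQL's standard ts_token_type() function. This is a stock PostgreSQL call, not specific to this guide; the exact aliases and descriptions returned depend on the installed zhparser version:

```sql
-- List all token types registered by the zhparser parser.
-- Each row contains the numeric tokid, the one-letter alias used in
-- ADD MAPPING FOR (such as n, v, a), and a short description.
SELECT * FROM ts_token_type('zhparser');
```

Any alias returned here can be appended to the ADD MAPPING FOR list if you need those token types indexed as well.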
Verify segmentation
PostgreSQL full-text search uses three key functions:
| Function | Purpose | Example input → output |
|---|---|---|
| ts_parse() | Returns raw token output with token type IDs | Text → raw tokens from the parser |
| to_tsvector() | Converts text to normalized lexemes for indexing | Text → indexed lexeme list |
| to_tsquery() | Converts a phrase into a query expression | Phrase → query for matching against a tsvector |
To verify that zhparser segments text correctly, run the following test queries:
-- Test raw token output
-- Returns a (tokid, token) result set, where tokid identifies the token type
SELECT * FROM ts_parse('zhparser', 'hello world! 2010年保障房建设在全国范围内获全面启动,从中央到地方纷纷加大 了 保 障 房 的 建 设 和 投 入 力 度 。 2011年,保障房进入了更大规模的建设阶段。 住房城乡建设部党组书记、部长姜伟新去年底在全国住房城乡建设工作会议上表示,要继续推进保障性安居工程建设。 ');
-- Convert text to a tsvector (normalized lexeme list used for indexing)
SELECT to_tsvector('testzhcfg','"今年保障房新开工数量虽然有所下调,但实际的年度在建规模以及竣工规模会超以往年份,相对应的对资金的需求也会创历史纪录。"陈国强说。 在他看来,与2011年相比,2012年的保障房建设在资金配套上的压力将更为严峻。 ');
-- Convert a search phrase to a tsquery for use with the @@ operator
SELECT to_tsquery('testzhcfg', '保障房资金压力');
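Putting the last two functions together, a tsvector and a tsquery can be matched directly with the @@ operator, which is a quick way to confirm that a phrase is found in a piece of sample text before you build an index. This is an optional sanity check, not part of the required setup:

```sql
-- Returns true when the lexemes produced from the query phrase
-- are present in the lexemes produced from the text
SELECT to_tsvector('testzhcfg', '2012年的保障房建设在资金配套上的压力将更为严峻')
       @@ to_tsquery('testzhcfg', '保障房');
```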
Create a full-text index and run queries
After enabling zhparser, create a GIN (Generalized Inverted Index) index on the column to search. The following example creates a full-text index on the name column of table t1:
-- Replace t1 and name with your actual table name and column name
CREATE INDEX idx_t1 ON t1 USING gin (to_tsvector('testzhcfg', upper(name)));
Use the @@ operator to match rows against a search query:
SELECT * FROM t1 WHERE to_tsvector('testzhcfg', upper(t1.name)) @@ to_tsquery('testzhcfg', '(防火)');
The index and the WHERE clause must use the same expression — to_tsvector('testzhcfg', upper(name)) — so that PostgreSQL can use the index during query execution.
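If you also want to order matches by relevance rather than only filter them, PostgreSQL's built-in ts_rank() function can be combined with the same expressions. As above, t1, name, and the search phrase 防火 are placeholders for your own table, column, and query:

```sql
-- Rank matching rows by how well they match the query.
-- The WHERE clause reuses the indexed expression, so the GIN index
-- is still usable; ts_rank is computed only for the matching rows.
SELECT name,
       ts_rank(to_tsvector('testzhcfg', upper(name)),
               to_tsquery('testzhcfg', '防火')) AS rank
FROM t1
WHERE to_tsvector('testzhcfg', upper(name)) @@ to_tsquery('testzhcfg', '防火')
ORDER BY rank DESC;
```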
Add custom word segments
To add domain-specific terms that the built-in dictionary does not segment correctly, insert them into pg_ts_custom_word.
The following example adds 保障房资 as a single word segment:
-- Check the current segmentation result
SELECT to_tsquery('testzhcfg', '保障房资金压力');
-- Add the new word segment
INSERT INTO pg_ts_custom_word VALUES ('保障房资');
-- Sync the dictionary to apply the change
SELECT zhprs_sync_dict_xdb();
-- Reconnect to the database (the new session picks up the updated dictionary)
\c
-- Verify the new segmentation result
SELECT to_tsquery('testzhcfg', '保障房资金压力');
After adding, deleting, or changing word segments, you must call zhprs_sync_dict_xdb() and reconnect to the database for the changes to take effect. Changes do not apply to existing sessions.
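Deleting a custom word segment follows the same pattern. The example below assumes that pg_ts_custom_word stores the segment in a column named word; check the table definition on your instance (for example with \d pg_ts_custom_word) before running it:

```sql
-- Remove the custom word segment (column name is an assumption)
DELETE FROM pg_ts_custom_word WHERE word = '保障房资';
-- Sync the dictionary, then reconnect so new sessions pick up the change
SELECT zhprs_sync_dict_xdb();
```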
Custom dictionary limits
| Limit | Value |
|---|---|
| Maximum number of custom word segments | 1,000,000 |
| Maximum length per word segment | 128 bytes |
If the number of word segments exceeds 1,000,000, word segments beyond the limit are ignored. Word segments longer than 128 bytes are truncated at the 128th byte.
The custom dictionary and the built-in dictionary are active at the same time.