zhparser is a PostgreSQL extension for Chinese full-text search. It tokenizes Chinese text into word segments so you can build full-text indexes and run text queries on Chinese content in PolarDB for PostgreSQL.
Enable zhparser
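Before running the steps below, it can be useful to confirm that the extension is offered by the cluster and whether it is already installed. A minimal sketch using the standard PostgreSQL view pg_available_extensions (that PolarDB exposes this view unchanged is an assumption):

```sql
-- Lists zhparser if the extension is available on this cluster.
-- installed_version is NULL until CREATE EXTENSION has been run
-- in the current database.
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name = 'zhparser';
```

An empty result would suggest that the extension is not available on the cluster.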
Step 1: Install the extension and create a text search configuration.
CREATE EXTENSION zhparser;
CREATE TEXT SEARCH CONFIGURATION testzhcfg (PARSER = zhparser);
ALTER TEXT SEARCH CONFIGURATION testzhcfg ADD MAPPING FOR n,v,a,i,e,l WITH simple;
Step 2: (Optional) Configure segmentation parameters.
Set parameters at the role level to control how zhparser segments text:
Parameter | Default | Description |
zhparser.multi_short | off | Combines short words into compound segments |
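Role-level settings are stored in the standard pg_db_role_setting catalog and are picked up by sessions that start after the change. As a sketch, the following query lists any zhparser overrides already in place; it assumes the catalog is readable by your role:

```sql
-- setrole = 0 means the setting applies to all roles (ALTER ROLE ALL);
-- setdatabase = 0 means it is not restricted to a single database.
SELECT s.setdatabase, s.setrole, cfg AS setting
FROM pg_db_role_setting AS s
CROSS JOIN LATERAL unnest(s.setconfig) AS cfg
WHERE cfg LIKE 'zhparser.%';
```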
To enable multi_short for all roles:
ALTER ROLE ALL SET zhparser.multi_short = on;
Step 3: Test the parser.
Run a quick test to verify the extension is working:
SELECT * FROM ts_parse('zhparser', 'hello world! 2010年保障房建设在全国范围内获全面启动,从中央到地方纷纷加大了保障房的建设和投入力度。2011年,保障房进入了更大规模的建设阶段。住房城乡建设部党组书记、部长姜伟新去年底在全国住房城乡建设工作会议上表示,要继续推进保障性安居工程建设。');
Verify the text search vector and query functions:
SELECT to_tsvector('testzhcfg', '"今年保障房新开工数量虽然有所下调,但实际的年度在建规模以及竣工规模会超以往年份,相对应的对资金的需求也会创历史纪录。"陈国强说。在他看来,与2011年相比,2012年的保障房建设在资金配套上的压力将更为严峻。');
SELECT to_tsquery('testzhcfg', '保障房资金压力');
Create a full-text index
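The index example in this section assumes that a table t1 with a text column name already exists. A minimal sketch of such a table follows; the id column and the sample rows are hypothetical and only serve to make the example reproducible:

```sql
-- Hypothetical table used by the index example
CREATE TABLE t1 (
    id   serial PRIMARY KEY,
    name text
);

-- Hypothetical sample rows containing Chinese product names
INSERT INTO t1 (name) VALUES
    ('防火门'),
    ('防火涂料'),
    ('保温材料');
```

With only a few rows, the planner will usually prefer a sequential scan, so index usage is easier to observe on a larger table.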
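Use a Generalized Inverted Index (GIN) to speed up full-text search on Chinese text. The following example creates a GIN index on the name column of table t1 and then runs a query against it. The WHERE clause repeats the indexed to_tsvector expression exactly, which is what allows the planner to use the index: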
-- Create the GIN index (uses the testzhcfg configuration created in Step 1)
CREATE INDEX idx_t1 ON t1 USING gin (to_tsvector('testzhcfg', upper(name)));
-- Query using the index
SELECT * FROM t1 WHERE to_tsvector('testzhcfg', upper(t1.name)) @@ to_tsquery('testzhcfg', '(防火)');
Customize a Chinese word segmentation dictionary
The default dictionary covers common Chinese words. For domain-specific terms such as industry jargon or product names, add custom word segments to pg_ts_custom_word.
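To judge whether a term needs a custom entry, it helps to see how the parser currently splits it and which token type each piece receives. The sketch below joins the built-in ts_parse and ts_token_type functions; the sample phrase is the one used in the steps that follow:

```sql
-- Show each token produced by zhparser together with its token type alias
SELECT t.alias, t.description, p.token
FROM ts_parse('zhparser', '保障房资金压力') AS p
JOIN ts_token_type('zhparser') AS t ON t.tokid = p.tokid;
```

Only tokens whose type is mapped in the text search configuration (n, v, a, i, e, l in testzhcfg) end up in to_tsvector and to_tsquery output.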
Step 1: Check the current segmentation result.
SELECT to_tsquery('testzhcfg', '保障房资金压力');
Step 2: Add the custom word segment.
INSERT INTO pg_ts_custom_word VALUES ('保障房资');
Step 3: Sync the dictionary and reconnect.
SELECT zhprs_sync_dict_xdb();
After the sync completes, close and reopen the connection:
\c
Step 4: Verify the new segmentation result.
SELECT to_tsquery('testzhcfg', '保障房资金压力');
Limits
Limit | Value | Behavior when exceeded |
Custom word segments | 1,000,000 | Word segments beyond the limit are ignored |
Word segment length | 128 bytes | Bytes beyond 128 are truncated |
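Both limits can be checked before syncing the dictionary. A minimal sketch, assuming a UTF-8 database, where each of the Chinese characters in the sample word occupies 3 bytes:

```sql
-- Number of custom word segments currently stored (limit: 1,000,000)
SELECT count(*) AS custom_words FROM pg_ts_custom_word;

-- Byte length of a candidate word segment (limit: 128 bytes).
-- In a UTF-8 database this four-character word is 12 bytes long.
SELECT octet_length('保障房资') AS bytes;
```

Because the length limit is counted in bytes rather than characters, a word made of 3-byte characters reaches it after 42 characters (126 bytes).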
The custom dictionary and the default dictionary are both active at the same time.
After any add, delete, or update to word segments, run SELECT zhprs_sync_dict_xdb(); and reconnect for the changes to take effect.