PostgreSQL's built-in full-text search cannot split Chinese text into words because Chinese does not use spaces as delimiters. The zhparser extension adds Chinese word segmentation to PolarDB for PostgreSQL (Compatible with Oracle), enabling accurate full-text indexing and search on Chinese text fields.
Prerequisites
Before you begin, ensure that you have:
A PolarDB for PostgreSQL (Compatible with Oracle) cluster
Superuser privileges or a role with
CREATE EXTENSIONprivileges
Enable Chinese tokenization
Install the extension
CREATE EXTENSION IF NOT EXISTS zhparser;Create a text search configuration
A text search configuration ties a parser to one or more dictionaries. The following example creates a configuration named zh_cfg using the zhparser parser, then maps the most useful token types to the simple dictionary.
-- Create the configuration and assign the zhparser parser.
CREATE TEXT SEARCH CONFIGURATION zh_cfg (PARSER = zhparser);
-- Map token types to the simple dictionary.
-- The simple dictionary lowercases tokens and filters stop words.
ALTER TEXT SEARCH CONFIGURATION zh_cfg ADD MAPPING FOR n,v,a,i,e,l WITH simple;The token types in the ADD MAPPING FOR clause correspond to Chinese part-of-speech categories:
| Token type | Part of speech | Example |
|---|---|---|
n | Noun | 政府, 住房 |
v | Verb | 建设, 增加 |
a | Adjective | 大规模, 严峻 |
i | Idiom (fixed phrase) | 任重道远 |
e | Exclamation | 哦, 啊 |
l | String (multi-character fixed phrase) | 保障房 |
Map only the token types relevant to your use case. For example, omit e (exclamations) to reduce noise in search results, or omit i (idioms) if your content does not contain fixed phrases.
Configure zhparser parameters
zhparser exposes adjustable parameters that you can set at the role or database level using ALTER ROLE or ALTER DATABASE.
-- Enable multi-short phrase mode.
-- This mode combines contiguous single characters into longer words,
-- which improves recall for abbreviated or informal Chinese text.
ALTER ROLE ALL SET zhparser.multi_short = ON;Verify the configuration
Run the following three queries in sequence to confirm that each layer of the pipeline works correctly. The queries use the sample text that appears in the full-text search examples later in this topic.
Step 1: Check raw lexical analysis
SELECT * FROM ts_parse('zhparser', 'In 2011, affordable housing entered a larger-scale construction phase.');This returns the raw token stream from zhparser. Expected output includes tokens such as 2011, affordable housing, and construction.
Step 2: Check the tsvector output
SELECT to_tsvector('zh_cfg', 'Compared to 2011, the pressure on funding for affordable housing construction in 2012 will be more severe.');Expected output:
'2011':2 '2012':11 'affordable housing':6 'construction':9 'funding':8 'pressure':7The output lists the indexed lexemes and their positions. Verify that the key terms from the input appear as lexemes.
Step 3: Check query parsing
SELECT to_tsquery('zh_cfg', 'affordable housing & funding & pressure');Expected output:
'affordable housing' & 'funding' & 'pressure'If all three steps produce expected output, the configuration is ready to use.
Create and use a full-text index
Prepare sample data
-- Create a sample table.
CREATE TABLE t1 (
id SERIAL PRIMARY KEY,
name TEXT
);
-- Insert test rows.
INSERT INTO t1 (name) VALUES
('The central government increases the construction and investment in affordable housing'),
('The funding pressure for affordable housing is severe'),
('Fire and theft prevention are the focus of security work');Create a GIN index
A Generalized Inverted Index (GIN) index on a tsvector expression enables fast full-text lookups. Create the index on the name column using the zh_cfg configuration:
CREATE INDEX idx_gin_t1_name ON t1 USING gin (to_tsvector('zh_cfg', name));Always use the two-argument form of to_tsvector—to_tsvector('zh_cfg', name)—in the index definition. This locks the configuration name into the index so that its contents are not affected by the session-level default_text_search_config setting. The query in the WHERE clause must use the same two-argument form with the same configuration name. If you omit the configuration name in the query—for example, to_tsvector(name)—the planner cannot verify that the index contents match, and the index will not be used.
Run full-text queries
Use the @@ operator to match a tsvector against a tsquery. The to_tsvector expression in the WHERE clause must exactly match the expression in the CREATE INDEX statement.
Example 1: Single-term search
SELECT name FROM t1
WHERE to_tsvector('zh_cfg', name) @@ to_tsquery('zh_cfg', 'affordable housing');Expected result:
The central government increases the construction and investment in affordable housing
The funding pressure for affordable housing is severeExample 2: AND search
SELECT name FROM t1
WHERE to_tsvector('zh_cfg', name) @@ to_tsquery('zh_cfg', 'affordable housing & funding');Expected result:
The funding pressure for affordable housing is severeExample 3: Single-term search
SELECT name FROM t1
WHERE to_tsvector('zh_cfg', name) @@ to_tsquery('zh_cfg', 'fire prevention');Expected result:
Fire and theft prevention are the focus of security work