The built-in full-text index feature in PostgreSQL does not effectively process Chinese text. This topic describes how to use the zhparser extension in and PolarDB for PostgreSQL (Compatible with Oracle) to enable efficient and accurate Chinese full-text indexing.
Enable Chinese tokenization
Enable the extension.
CREATE EXTENSION IF NOT EXISTS zhparser;Create a Chinese tokenization configuration.
-- Create a configuration named zh_cfg and specify the parser as zhparser. CREATE TEXT SEARCH CONFIGURATION zh_cfg (PARSER = zhparser); -- Add a mapping for specific token types, such as nouns (n), verbs (v), and adjectives (a). -- This uses the 'simple' dictionary, which converts text to lowercase and checks for stop words. ALTER TEXT SEARCH CONFIGURATION zh_cfg ADD MAPPING FOR n,v,a,i,e,l WITH simple;Set global parameters.
The
zhparserextension provides adjustable parameters. You can set these parameters usingALTER ROLEorALTER DATABASE.-- Example: Enable the "multi-short phrase" mode. This mode attempts to combine contiguous single characters to form longer words. ALTER ROLE ALL SET zhparser.multi_short = ON;Test the tokenization results.
-- 1. Test basic lexical analysis (ts_parse). SELECT * FROM ts_parse('zhparser', 'In 2011, affordable housing entered a larger-scale construction phase.'); -- 2. Test converting text to a word embedding (to_tsvector). -- This uses the 'zh_cfg' configuration that we just created. SELECT to_tsvector('zh_cfg', 'Compared to 2011, the pressure on funding for affordable housing construction in 2012 will be more severe.'); -- 3. Test converting a query string to a query tree (to_tsquery). -- 'affordable housing' & 'funding' & 'pressure' SELECT to_tsquery('zh_cfg', 'affordable housing & funding & pressure');
Create and use a full-text index
Prepare a sample table and data:
-- Create a sample table t1. CREATE TABLE t1 ( id SERIAL PRIMARY KEY, name TEXT ); -- Insert some test data. INSERT INTO t1 (name) VALUES ('The central government increases the construction and investment in affordable housing'), ('The funding pressure for affordable housing is severe'), ('Fire and theft prevention are the focus of security work');Create a GIN index for the target field:
Create a GIN index for the
namefield in thet1table. The index is based on thezh_cfgconfiguration.-- Create a GIN index for the name field using the 'zh_cfg' configuration. CREATE INDEX idx_gin_t1_name ON t1 USING gin (to_tsvector('zh_cfg', name));Execute an efficient full-text index query:
You can use the
@@operator to run the query. Theto_tsvectorexpression in theWHEREclause must exactly match the expression in theCREATE INDEXstatement. This ensures that the optimizer uses the created index.-- Example 1: Query for records that contain "affordable housing". -- The query expression to_tsvector('zh_cfg', name) exactly matches the index definition and will use the index efficiently. SELECT name FROM t1 WHERE to_tsvector('zh_cfg', name) @@ to_tsquery('zh_cfg', 'affordable housing'); -- Example 2: Query for records that contain both "affordable housing" and "funding" (using the & operator). SELECT name FROM t1 WHERE to_tsvector('zh_cfg', name) @@ to_tsquery('zh_cfg', 'affordable housing & funding'); -- Example 3: Query for records that contain "fire prevention". SELECT name FROM t1 WHERE to_tsvector('zh_cfg', name) @@ to_tsquery('zh_cfg', 'fire prevention');