All Products
Search
Document Center

PolarDB:Use Chinese tokenization

Last Updated:Jan 05, 2026

The built-in full-text index feature in PostgreSQL does not effectively process Chinese text. This topic describes how to use the zhparser extension in and PolarDB for PostgreSQL (Compatible with Oracle) to enable efficient and accurate Chinese full-text indexing.

Enable Chinese tokenization

  • Enable the extension.

    CREATE EXTENSION IF NOT EXISTS zhparser;
  • Create a Chinese tokenization configuration.

    -- Create a configuration named zh_cfg and specify the parser as zhparser.
    CREATE TEXT SEARCH CONFIGURATION zh_cfg (PARSER = zhparser);
    
    -- Add a mapping for specific token types, such as nouns (n), verbs (v), and adjectives (a).
    -- This uses the 'simple' dictionary, which converts text to lowercase and checks for stop words.
    ALTER TEXT SEARCH CONFIGURATION zh_cfg ADD MAPPING FOR n,v,a,i,e,l WITH simple;
  • Set global parameters.

    The zhparser extension provides adjustable parameters. You can set these parameters using ALTER ROLE or ALTER DATABASE.

    -- Example: Enable the "multi-short phrase" mode. This mode attempts to combine contiguous single characters to form longer words.
    ALTER ROLE ALL SET zhparser.multi_short = ON;
  • Test the tokenization results.

    -- 1. Test basic lexical analysis (ts_parse).
    SELECT * FROM ts_parse('zhparser', 'In 2011, affordable housing entered a larger-scale construction phase.');
    
    -- 2. Test converting text to a word embedding (to_tsvector).
    --    This uses the 'zh_cfg' configuration that we just created.
    SELECT to_tsvector('zh_cfg', 'Compared to 2011, the pressure on funding for affordable housing construction in 2012 will be more severe.');
    
    -- 3. Test converting a query string to a query tree (to_tsquery).
    --    'affordable housing' & 'funding' & 'pressure'
    SELECT to_tsquery('zh_cfg', 'affordable housing & funding & pressure');
    

Create and use a full-text index

  1. Prepare a sample table and data:

    -- Create a sample table t1.
    CREATE TABLE t1 (
        id SERIAL PRIMARY KEY,
        name TEXT
    );
    
    -- Insert some test data.
    INSERT INTO t1 (name) VALUES
    ('The central government increases the construction and investment in affordable housing'),
    ('The funding pressure for affordable housing is severe'),
    ('Fire and theft prevention are the focus of security work');
  2. Create a GIN index for the target field:

    Create a GIN index for the name field in the t1 table. The index is based on the zh_cfg configuration.

    -- Create a GIN index for the name field using the 'zh_cfg' configuration.
    CREATE INDEX idx_gin_t1_name ON t1 USING gin (to_tsvector('zh_cfg', name));
  3. Execute an efficient full-text index query:

    You can use the @@ operator to run the query. The to_tsvector expression in the WHERE clause must exactly match the expression in the CREATE INDEX statement. This ensures that the optimizer uses the created index.

    -- Example 1: Query for records that contain "affordable housing".
    -- The query expression to_tsvector('zh_cfg', name) exactly matches the index definition and will use the index efficiently.
    SELECT name FROM t1 WHERE to_tsvector('zh_cfg', name) @@ to_tsquery('zh_cfg', 'affordable housing');
    
    -- Example 2: Query for records that contain both "affordable housing" and "funding" (using the & operator).
    SELECT name FROM t1 WHERE to_tsvector('zh_cfg', name) @@ to_tsquery('zh_cfg', 'affordable housing & funding');
    
    -- Example 3: Query for records that contain "fire prevention".
    SELECT name FROM t1 WHERE to_tsvector('zh_cfg', name) @@ to_tsquery('zh_cfg', 'fire prevention');