All Products
Search
Document Center

PolarDB:pg_jieba (Chinese tokenization)

Last Updated:Mar 30, 2026

PostgreSQL's built-in full-text search supports alphabet-based languages but cannot tokenize Chinese text. pg_jieba bridges this gap: it tokenizes Chinese text into searchable lexemes, enabling full-text search for Chinese content in PolarDB for PostgreSQL.

Supported versions

pg_jieba is available on PolarDB for PostgreSQL clusters running:

  • PostgreSQL 15 (revision version 15.7.2.0 or later)

  • PostgreSQL 14 (revision version 14.5.2.0 or later)

  • PostgreSQL 11 (revision version 1.1.28 or later)

To check your current revision version:

  • PostgreSQL 15 and 14:

    SELECT version();
  • PostgreSQL 11:

    SHOW polar_version;

How it works

pg_jieba provides three text search configurations. Each produces a different tokenization strategy suited to different search scenarios:

Configuration

Mode

Best for

jiebacfg

Exact mode

Indexing. Splits text into the most precise set of terms, without redundant sub-words.

jiebaqry

Full mode

Maximizing recall in queries. Extracts all possible word combinations, including sub-words.

jiebacfg_pos

Exact mode with position

When you need character-offset positions for each token. Also includes stop words that exact mode drops.

Choosing between exact mode and full mode: Use jiebacfg for most full-text index scenarios — it returns fewer, more precise tokens. Use jiebaqry when missing a search hit matters more than filtering noise; it returns every possible sub-word, which improves recall at the cost of potential duplicate matches.

Enable and disable the extension

Note

Only privileged accounts can run these statements.

To create the pg_jieba extension:

CREATE EXTENSION pg_jieba;

To remove it:

DROP EXTENSION pg_jieba;

Tokenize Chinese text

All three configurations work with the standard to_tsvector function. The configuration name is the first argument.

Example 1 — Input: '小明硕士毕业于中国科学院计算所,后在日本京都大学深造'

Exact mode (jiebacfg):

SELECT * FROM to_tsvector('jiebacfg', '小明硕士毕业于中国科学院计算所,后在日本京都大学深造');
                        to_tsvector
----------------------------------------------------------------------------------
 '中国科学院':5 '小明':1 '日本京都大学':10 '毕业':3 '深造':11 '硕士':2 '计算所':6
(1 row)

Full mode (jiebaqry) — returns more tokens, including sub-words like '中国', '学院', and '科学':

SELECT * FROM to_tsvector('jiebaqry', '小明硕士毕业于中国科学院计算所,后在日本京都大学深造');
                                  to_tsvector
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 '中国':5 '中国科学院':9 '京都':16 '大学':17 '学院':7 '小明':1 '日本':15 '日本京都大学':18 '毕业':3 '深造':19 '硕士':2 '科学':6 '科学院':8 '计算':10 '计算所':11
(1 row)

Exact mode with position (jiebacfg_pos) — includes character offsets and stop words like '于', '后', '在':

SELECT * FROM to_tsvector('jiebacfg_pos', '小明硕士毕业于中国科学院计算所,后在日本京都大学深造');
                                       to_tsvector
------------------------------------------------------------------------------------------------------------------------------------------
 '中国科学院:7':5 '于:6':4 '后:16':8 '在:17':9 '小明:0':1 '日本京都大学:18':10 '毕业:4':3 '深造:24':11 '硕士:2':2 '计算所:12':6 ',:15':7
(1 row)

Example 2 — Input: '李小福是创新办主任也是云计算方面的专家'

Exact mode (jiebacfg):

SELECT * FROM to_tsvector('jiebacfg', '李小福是创新办主任也是云计算方面的专家');
                    to_tsvector
-------------------------------------------------------------------
 '专家':11 '主任':5 '云计算':8 '创新':3 '办':4 '方面':9 '李小福':1
(1 row)

Full mode (jiebaqry) — also returns '计算' as a sub-word of '云计算':

SELECT * FROM to_tsvector('jiebaqry', '李小福是创新办主任也是云计算方面的专家');
                          to_tsvector
-----------------------------------------------------------------------------
 '专家':12 '主任':5 '云计算':9 '创新':3 '办':4 '方面':10 '李小福':1 '计算':8
(1 row)

Exact mode with position (jiebacfg_pos) — includes stop words like '是', '也', '的' with their offsets:

SELECT * FROM to_tsvector('jiebacfg_pos', '李小福是创新办主任也是云计算方面的专家');
                                              to_tsvector
---------------------------------------------------------------------------------------------------------------------------
 '专家:17':11 '主任:7':5 '也:9':6 '云计算:11':8 '创新:4':3 '办:6':4 '方面:14':9 '是:10':7 '是:3':2 '李小福:0':1 '的:16':10
(1 row)

Use custom dictionaries

pg_jieba supports multiple custom dictionaries. Load domain-specific terms into a custom dictionary to improve segmentation accuracy for your content.

Important

Before using custom dictionaries, add pg_jieba to the shared_preload_libraries parameter. Configure this in the console — see . The cluster restarts after you save this change.

The following example shows how custom dictionary terms change segmentation results. Without the custom dictionary, '阿里云' and '研发工程师' are split into sub-words. After loading the custom dictionary, they are treated as single tokens.

  1. Insert terms into the first custom dictionary (index 0, weight 10):

    INSERT INTO jieba_user_dict VALUES ('阿里云');
    INSERT INTO jieba_user_dict VALUES ('研发工程师', 0, 10);
  2. Check segmentation with the built-in dictionary. At this point the custom dictionary is not active:

    SELECT * FROM to_tsvector('jiebacfg', 'zth是阿里云的一个研发工程师');
                    to_tsvector
    ------------------------------------------------------
     'zth':1 '一个':6 '云':4 '工程师':8 '研发':7 '阿里':3
    (1 row)

    Notice that '阿里云' is split into '阿里' and '云', and '研发工程师' is split into '研发' and '工程师'.

  3. Switch to custom dictionary 0:

    SELECT jieba_load_user_dict(0);
     jieba_load_user_dict
    ----------------------
    
    (1 row)
  4. Run the same query again. The custom terms are now recognized as single tokens:

    SELECT * FROM to_tsvector('jiebacfg', 'zth是阿里云的一个研发工程师');
                    to_tsvector
    --------------------------------------------
     'zth':1 '一个':5 '研发工程师':6 '阿里云':3
    (1 row)

What's next