
PolarDB:pg_jieba

Last Updated: Mar 28, 2026

PostgreSQL's built-in full-text search does not segment Chinese text, which makes it unsuitable for Chinese-language content. pg_jieba integrates the jieba tokenizer into PolarDB for PostgreSQL (Compatible with Oracle), enabling accurate Chinese word segmentation for full-text search queries.

Supported versions

pg_jieba requires one of the following engine versions:

  • PolarDB for PostgreSQL (Compatible with Oracle) 2.0 (revision version 2.0.14.2.0 or later)

  • PolarDB for PostgreSQL (Compatible with Oracle) 1.0 (revision version 1.1.28 or later)

To check your revision version, run:

SHOW polar_version;

Text search configurations

pg_jieba provides three text search configurations. Each produces different tokenization results for the same input:

  • jiebacfg (exact mode): Splits text into the most precise tokens without redundancy. Use this for standard search queries.

  • jiebaqry (full mode): Extracts all possible word combinations, including overlapping tokens. Use this when broader recall matters more than precision.

  • jiebacfg_pos (exact mode with position): Same as exact mode, but each token also carries its character offset in the input, and stop words are kept in the output. Use this for result highlighting or tokenization debugging.

Comparison — the same input processed by all three configurations:

Input: '小明硕士毕业于中国科学院计算所,后在日本京都大学深造'

jiebacfg     →  '中国科学院':5 '小明':1 '日本京都大学':10 '毕业':3 '深造':11 '硕士':2 '计算所':6
jiebaqry     →  '中国':5 '中国科学院':9 '京都':16 '大学':17 '学院':7 '小明':1 '日本':15 '日本京都大学':18 '毕业':3 '深造':19 '硕士':2 '科学':6 '科学院':8 '计算':10 '计算所':11
jiebacfg_pos →  '中国科学院:7':5 '于:6':4 '后:16':8 '在:17':9 '小明:0':1 '日本京都大学:18':10 '毕业:4':3 '深造:24':11 '硕士:2':2 '计算所:12':6 ',:15':7
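The practical difference between the modes shows up at query time: a tsquery built from a sub-word can match the full-mode vector but not the exact-mode one. A minimal sketch using the input above (the expected results follow from the token lists shown; actual output depends on the loaded dictionary):

SELECT to_tsvector('jiebacfg', '小明硕士毕业于中国科学院计算所')
       @@ to_tsquery('jiebacfg', '中国');
-- jiebacfg emits '中国科学院' as a single token, so the sub-word
-- '中国' is expected not to match (f)

SELECT to_tsvector('jiebaqry', '小明硕士毕业于中国科学院计算所')
       @@ to_tsquery('jiebacfg', '中国');
-- jiebaqry additionally emits '中国', so the same query is
-- expected to match (t)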

Install and remove the extension

Note

Only privileged accounts can run the following statements.

To install pg_jieba:

CREATE EXTENSION pg_jieba;

To remove pg_jieba:

DROP EXTENSION pg_jieba;
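To check whether the extension is currently installed, you can query the standard PostgreSQL system catalog:

SELECT extname, extversion FROM pg_extension WHERE extname = 'pg_jieba';
-- Returns one row if pg_jieba is installed in the current database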

Query examples

Pass a text search configuration name as the first argument to to_tsvector.

Exact mode

SELECT * FROM to_tsvector('jiebacfg', '李小福是创新办主任也是云计算方面的专家');

Output:

to_tsvector
-------------------------------------------------------------------
 '专家':11 '主任':5 '云计算':8 '创新':3 '办':4 '方面':9 '李小福':1
(1 row)

Full mode

SELECT * FROM to_tsvector('jiebaqry', '李小福是创新办主任也是云计算方面的专家');

Output:

to_tsvector
-----------------------------------------------------------------------------
 '专家':12 '主任':5 '云计算':9 '创新':3 '办':4 '方面':10 '李小福':1 '计算':8
(1 row)

Exact mode with position

SELECT * FROM to_tsvector('jiebacfg_pos', '李小福是创新办主任也是云计算方面的专家');

Output:

to_tsvector
---------------------------------------------------------------------------------------------------------------------------
 '专家:17':11 '主任:7':5 '也:9':6 '云计算:11':8 '创新:4':3 '办:6':4 '方面:14':9 '是:10':7 '是:3':2 '李小福:0':1 '的:16':10
(1 row)
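In a real table, the usual PostgreSQL full-text search pattern applies: build a GIN index over the tsvector expression and filter with the @@ operator. A minimal sketch with a hypothetical articles table (the table and column names are illustrative only):

-- Hypothetical table for illustration
CREATE TABLE articles (id bigserial PRIMARY KEY, body text);

-- Index the jiebacfg tokenization of body; the expression in the
-- index must match the expression used in queries for the planner
-- to use the index
CREATE INDEX articles_body_fts ON articles
    USING gin (to_tsvector('jiebacfg', body));

-- Find rows whose body contains the token 专家
SELECT id FROM articles
WHERE to_tsvector('jiebacfg', body) @@ to_tsquery('jiebacfg', '专家');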

Custom dictionaries

pg_jieba supports multiple custom dictionaries. Switching to a custom dictionary overrides the predefined segmentation for words you define.

Important

To use custom dictionaries, add pg_jieba to the shared_preload_libraries parameter in the console. For instructions, see Configure cluster parameters. The cluster restarts after you save this parameter. Proceed with caution.

The following steps show how to add entries to a custom dictionary and switch to it.

  1. Insert words into the first custom dictionary (index 0). The second statement also sets an explicit weight of 10:

    INSERT INTO jieba_user_dict VALUES ('阿里云');
    INSERT INTO jieba_user_dict VALUES ('研发工程师', 0, 10);
  2. Verify segmentation with the predefined dictionary (before switching):

    SELECT * FROM to_tsvector('jiebacfg', 'zth是阿里云的一个研发工程师');

    Output:

    to_tsvector
    ------------------------------------------------------
     'zth':1 '一个':6 '云':4 '工程师':8 '研发':7 '阿里':3
    (1 row)

    The predefined dictionary splits 阿里云 and 研发工程师 into smaller tokens.

  3. Switch to custom dictionary 0:

    SELECT jieba_load_user_dict(0);
  4. Run the same query with custom dictionary 0 active:

    SELECT * FROM to_tsvector('jiebacfg', 'zth是阿里云的一个研发工程师');

    Output:

    to_tsvector
    --------------------------------------------
     'zth':1 '一个':5 '研发工程师':6 '阿里云':3
    (1 row)

    The custom dictionary now treats 阿里云 and 研发工程师 as single tokens.
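With dictionary 0 loaded, queries can match the custom words directly, because to_tsquery tokenizes the search phrase with the same dictionary. A sketch continuing the example above:

SELECT to_tsvector('jiebacfg', 'zth是阿里云的一个研发工程师')
       @@ to_tsquery('jiebacfg', '研发工程师');
-- With custom dictionary 0 active, 研发工程师 is a single token on
-- both sides, so this is expected to return t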

What's next