PostgreSQL's built-in full-text search does not segment Chinese text, which makes it unsuitable for Chinese-language content. pg_jieba integrates the jieba tokenizer into PolarDB for PostgreSQL (Compatible with Oracle), enabling accurate Chinese word segmentation for full-text search queries.
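A quick way to see the limitation, assuming a stock PostgreSQL instance (the built-in `simple` configuration is always available):

```sql
-- The default parser has no knowledge of Chinese word boundaries,
-- so a run of Chinese characters is typically indexed as one token.
SELECT * FROM to_tsvector('simple', '云计算方面的专家');
-- With pg_jieba installed, the same text can be split into meaningful words.
```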
Supported versions
pg_jieba requires one of the following engine versions:
PolarDB for PostgreSQL (Compatible with Oracle) 2.0 (revision version 2.0.14.2.0 or later)
PolarDB for PostgreSQL (Compatible with Oracle) 1.0 (revision version 1.1.28 or later)
To check your revision version, run:
SHOW polar_version;

Text search configurations
pg_jieba provides three text search configurations. Each produces different tokenization results for the same input:
| Configuration | Mode | Behavior |
|---|---|---|
| jiebacfg | Exact mode | Splits text into the most precise tokens without redundancy. Use this for standard search queries. |
| jiebaqry | Full mode | Extracts all possible word combinations, including overlapping tokens. Use this when broader recall matters more than precision. |
| jiebacfg_pos | Exact mode with position | Same as exact mode, plus subscript position information for each token; stop words are included in the output. Use this for result highlighting or tokenization debugging. |
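Any of these configurations can also be passed to to_tsquery, so a match query is tokenized the same way as the indexed documents. A minimal sketch (the articles table and its body column are hypothetical):

```sql
-- Tokenize both the stored text and the search term with the same configuration.
SELECT id
FROM articles
WHERE to_tsvector('jiebacfg', body) @@ to_tsquery('jiebacfg', '专家');
```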
For comparison, here is the same input processed by all three configurations:
Input: '小明硕士毕业于中国科学院计算所,后在日本京都大学深造'
jiebacfg → '中国科学院':5 '小明':1 '日本京都大学':10 '毕业':3 '深造':11 '硕士':2 '计算所':6
jiebaqry → '中国':5 '中国科学院':9 '京都':16 '大学':17 '学院':7 '小明':1 '日本':15 '日本京都大学':18 '毕业':3 '深造':19 '硕士':2 '科学':6 '科学院':8 '计算':10 '计算所':11
jiebacfg_pos → '中国科学院:7':5 '于:6':4 '后:16':8 '在:17':9 '小明:0':1 '日本京都大学:18':10 '毕业:4':3 '深造:24':11 '硕士:2':2 '计算所:12':6 ',:15':7

Install and remove the extension
Only privileged accounts can run the following statements.
To install pg_jieba:
CREATE EXTENSION pg_jieba;

To remove pg_jieba:
DROP EXTENSION pg_jieba;

Query examples
Pass a text search configuration name as the first argument to to_tsvector.
Exact mode
SELECT * FROM to_tsvector('jiebacfg', '李小福是创新办主任也是云计算方面的专家');

Output:
to_tsvector
-------------------------------------------------------------------
'专家':11 '主任':5 '云计算':8 '创新':3 '办':4 '方面':9 '李小福':1
(1 row)

Full mode
SELECT * FROM to_tsvector('jiebaqry', '李小福是创新办主任也是云计算方面的专家');

Output:
to_tsvector
-----------------------------------------------------------------------------
'专家':12 '主任':5 '云计算':9 '创新':3 '办':4 '方面':10 '李小福':1 '计算':8
(1 row)

Exact mode with position
SELECT * FROM to_tsvector('jiebacfg_pos', '李小福是创新办主任也是云计算方面的专家');

Output:
to_tsvector
---------------------------------------------------------------------------------------------------------------------------
'专家:17':11 '主任:7':5 '也:9':6 '云计算:11':8 '创新:4':3 '办:6':4 '方面:14':9 '是:10':7 '是:3':2 '李小福:0':1 '的:16':10
(1 row)

Custom dictionaries
pg_jieba supports multiple custom dictionaries. Switching to a custom dictionary overrides the predefined segmentation for words you define.
To use custom dictionaries, add pg_jieba to the shared_preload_libraries parameter in the console. For instructions, see Configure cluster parameters. The cluster restarts after you save this parameter. Proceed with caution.
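Once the cluster has restarted, you can confirm that the library is preloaded before defining dictionary entries (a hedged check, not part of the original procedure):

```sql
SHOW shared_preload_libraries;  -- the value should include pg_jieba
```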
The following steps show how to add entries to a custom dictionary and switch to it.
Insert words into the first custom dictionary (index 0, weight 10):

INSERT INTO jieba_user_dict VALUES ('阿里云');
INSERT INTO jieba_user_dict VALUES ('研发工程师', 0, 10);

Verify segmentation with the predefined dictionary (before switching):
SELECT * FROM to_tsvector('jiebacfg', 'zth是阿里云的一个研发工程师');

Output:
to_tsvector
------------------------------------------------------
 'zth':1 '一个':6 '云':4 '工程师':8 '研发':7 '阿里':3
(1 row)

The predefined dictionary splits 阿里云 and 研发工程师 into smaller tokens.

Switch to custom dictionary 0:
SELECT jieba_load_user_dict(0);

Run the same query with custom dictionary 0 active:
SELECT * FROM to_tsvector('jiebacfg', 'zth是阿里云的一个研发工程师');

Output:
to_tsvector
--------------------------------------------
 'zth':1 '一个':5 '研发工程师':6 '阿里云':3
(1 row)

The custom dictionary now treats 阿里云 and 研发工程师 as single tokens.
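In production, recomputing to_tsvector over every row at query time is expensive. A common pattern is a GIN index over the segmented text; the articles table and body column below are hypothetical:

```sql
-- Index the jiebacfg tokenization so match queries can use it directly.
CREATE INDEX idx_articles_body_fts
    ON articles
    USING gin (to_tsvector('jiebacfg', body));

-- A query must use the same expression for the planner to pick the index.
SELECT id
FROM articles
WHERE to_tsvector('jiebacfg', body) @@ to_tsquery('jiebacfg', '研发工程师');
```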