PostgreSQL's built-in full-text search supports alphabet-based languages but cannot tokenize Chinese text. pg_jieba bridges this gap: it tokenizes Chinese text into searchable lexemes, enabling full-text search for Chinese content in PolarDB for PostgreSQL.
Supported versions
pg_jieba is available on PolarDB for PostgreSQL clusters running:
PostgreSQL 15 (revision version 15.7.2.0 or later)
PostgreSQL 14 (revision version 14.5.2.0 or later)
PostgreSQL 11 (revision version 1.1.28 or later)
To check your current revision version:
PostgreSQL 15 and 14:
SELECT version();PostgreSQL 11:
SHOW polar_version;
How it works
pg_jieba provides three text search configurations. Each produces a different tokenization strategy suited to different search scenarios:
Configuration | Mode | Best for |
| Exact mode | Indexing. Splits text into the most precise set of terms, without redundant sub-words. |
| Full mode | Maximizing recall in queries. Extracts all possible word combinations, including sub-words. |
| Exact mode with position | When you need character-offset positions for each token. Also includes stop words that exact mode drops. |
Choosing between exact mode and full mode: Use jiebacfg for most full-text index scenarios — it returns fewer, more precise tokens. Use jiebaqry when missing a search hit matters more than filtering noise; it returns every possible sub-word, which improves recall at the cost of potential duplicate matches.
Enable and disable the extension
Only privileged accounts can run these statements.
To create the pg_jieba extension:
CREATE EXTENSION pg_jieba;To remove it:
DROP EXTENSION pg_jieba;Tokenize Chinese text
All three configurations work with the standard to_tsvector function. The configuration name is the first argument.
Example 1 — Input: '小明硕士毕业于中国科学院计算所,后在日本京都大学深造'
Exact mode (jiebacfg):
SELECT * FROM to_tsvector('jiebacfg', '小明硕士毕业于中国科学院计算所,后在日本京都大学深造'); to_tsvector
----------------------------------------------------------------------------------
'中国科学院':5 '小明':1 '日本京都大学':10 '毕业':3 '深造':11 '硕士':2 '计算所':6
(1 row)Full mode (jiebaqry) — returns more tokens, including sub-words like '中国', '学院', and '科学':
SELECT * FROM to_tsvector('jiebaqry', '小明硕士毕业于中国科学院计算所,后在日本京都大学深造'); to_tsvector
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
'中国':5 '中国科学院':9 '京都':16 '大学':17 '学院':7 '小明':1 '日本':15 '日本京都大学':18 '毕业':3 '深造':19 '硕士':2 '科学':6 '科学院':8 '计算':10 '计算所':11
(1 row)Exact mode with position (jiebacfg_pos) — includes character offsets and stop words like '于', '后', '在':
SELECT * FROM to_tsvector('jiebacfg_pos', '小明硕士毕业于中国科学院计算所,后在日本京都大学深造'); to_tsvector
------------------------------------------------------------------------------------------------------------------------------------------
'中国科学院:7':5 '于:6':4 '后:16':8 '在:17':9 '小明:0':1 '日本京都大学:18':10 '毕业:4':3 '深造:24':11 '硕士:2':2 '计算所:12':6 ',:15':7
(1 row)Example 2 — Input: '李小福是创新办主任也是云计算方面的专家'
Exact mode (jiebacfg):
SELECT * FROM to_tsvector('jiebacfg', '李小福是创新办主任也是云计算方面的专家'); to_tsvector
-------------------------------------------------------------------
'专家':11 '主任':5 '云计算':8 '创新':3 '办':4 '方面':9 '李小福':1
(1 row)Full mode (jiebaqry) — also returns '计算' as a sub-word of '云计算':
SELECT * FROM to_tsvector('jiebaqry', '李小福是创新办主任也是云计算方面的专家'); to_tsvector
-----------------------------------------------------------------------------
'专家':12 '主任':5 '云计算':9 '创新':3 '办':4 '方面':10 '李小福':1 '计算':8
(1 row)Exact mode with position (jiebacfg_pos) — includes stop words like '是', '也', '的' with their offsets:
SELECT * FROM to_tsvector('jiebacfg_pos', '李小福是创新办主任也是云计算方面的专家'); to_tsvector
---------------------------------------------------------------------------------------------------------------------------
'专家:17':11 '主任:7':5 '也:9':6 '云计算:11':8 '创新:4':3 '办:6':4 '方面:14':9 '是:10':7 '是:3':2 '李小福:0':1 '的:16':10
(1 row)Use custom dictionaries
pg_jieba supports multiple custom dictionaries. Load domain-specific terms into a custom dictionary to improve segmentation accuracy for your content.
Before using custom dictionaries, add pg_jieba to the shared_preload_libraries parameter. Configure this in the console — see . The cluster restarts after you save this change.
The following example shows how custom dictionary terms change segmentation results. Without the custom dictionary, '阿里云' and '研发工程师' are split into sub-words. After loading the custom dictionary, they are treated as single tokens.
Insert terms into the first custom dictionary (index
0, weight10):INSERT INTO jieba_user_dict VALUES ('阿里云'); INSERT INTO jieba_user_dict VALUES ('研发工程师', 0, 10);Check segmentation with the built-in dictionary. At this point the custom dictionary is not active:
SELECT * FROM to_tsvector('jiebacfg', 'zth是阿里云的一个研发工程师');to_tsvector ------------------------------------------------------ 'zth':1 '一个':6 '云':4 '工程师':8 '研发':7 '阿里':3 (1 row)Notice that
'阿里云'is split into'阿里'and'云', and'研发工程师'is split into'研发'and'工程师'.Switch to custom dictionary
0:SELECT jieba_load_user_dict(0);jieba_load_user_dict ---------------------- (1 row)Run the same query again. The custom terms are now recognized as single tokens:
SELECT * FROM to_tsvector('jiebacfg', 'zth是阿里云的一个研发工程师');to_tsvector -------------------------------------------- 'zth':1 '一个':5 '研发工程师':6 '阿里云':3 (1 row)
What's next
Specify cluster parameters — configure
shared_preload_librariesto enable custom dictionaries