Add Chinese Full-Text Search to PolarDB PostgreSQL with pg_jieba

pg_jieba is an open source third-party extension for Chinese full-text search in PolarDB for PostgreSQL (Compatible with Oracle).

Search modes

pg_jieba provides three text search configurations. Each produces different tokenization behavior for the same input:

Configuration	Mode	Behavior
`jiebacfg`	Exact mode	Splits text into precise, non-overlapping tokens. Stopwords are excluded from results.
`jiebaqry`	Full mode	Returns all matching word combinations, including overlapping sub-words. Redundant tokens may appear.
`jiebacfg_pos`	Exact mode with position	Same as exact mode, but appends the zero-based character offset of each token (for example, `小明:0`) and also returns stopwords and punctuation.

Choose a mode:

Use jiebacfg for standard search indexing where clean, deduplicated tokens are needed.
Use jiebaqry when broad recall matters more than precision (for example, substring matching).
Use jiebacfg_pos when you need token position data, or when you want to inspect stopwords for debugging.
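
As a rough illustration of the first recommendation, the sketch below indexes a text column with jiebacfg and queries it with the same configuration. The table docs, its columns, and the index name are hypothetical. to_tsvector, plainto_tsquery, the @@ operator, and GIN indexes are standard PostgreSQL features, not anything specific to pg_jieba.

```sql
-- Hypothetical table, used for illustration only.
CREATE TABLE docs (
    id   integer PRIMARY KEY,
    body text NOT NULL
);

INSERT INTO docs (id, body) VALUES
    (1, '小明硕士毕业于中国科学院计算所，后在日本京都大学深造'),
    (2, '李小福是创新办主任也是云计算方面的专家');

-- Expression index over the exact-mode tokens of each row.
CREATE INDEX docs_body_jieba_idx
    ON docs USING gin (to_tsvector('jiebacfg', body));

-- Tokenize the search text with the same configuration that built the index,
-- and repeat the indexed expression so the planner can use the index.
SELECT id, body
FROM docs
WHERE to_tsvector('jiebacfg', body) @@ plainto_tsquery('jiebacfg', '云计算');
```

Using one configuration on both sides keeps the document tokens and the query tokens consistent.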

Enable pg_jieba

Run the following statement as a privileged account to create the extension:

CREATE EXTENSION pg_jieba;

To remove the extension:

DROP EXTENSION pg_jieba;

Note

Only privileged accounts can run CREATE EXTENSION and DROP EXTENSION.
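
To confirm that the extension is installed and that its text search configurations are registered, you can query the standard PostgreSQL catalogs. This is a sketch that assumes the configurations are registered under the names listed in the Search modes table.

```sql
-- Installed extension and its version, if present.
SELECT extname, extversion
FROM pg_extension
WHERE extname = 'pg_jieba';

-- Text search configurations whose names start with "jieba".
SELECT cfgname
FROM pg_ts_config
WHERE cfgname LIKE 'jieba%'
ORDER BY cfgname;
```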

Tokenize Chinese text

All three modes use the to_tsvector(config, text) function. Pass the configuration name as the first argument.

Example 1: `'小明硕士毕业于中国科学院计算所，后在日本京都大学深造'`

Exact mode (jiebacfg):

SELECT * FROM to_tsvector('jiebacfg', '小明硕士毕业于中国科学院计算所，后在日本京都大学深造');

                           to_tsvector
----------------------------------------------------------------------------------
 '中国科学院':5 '小明':1 '日本京都大学':10 '毕业':3 '深造':11 '硕士':2 '计算所':6
(1 row)

Full mode (jiebaqry):

SELECT * FROM to_tsvector('jiebaqry', '小明硕士毕业于中国科学院计算所，后在日本京都大学深造');

                                                          to_tsvector
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 '中国':5 '中国科学院':9 '京都':16 '大学':17 '学院':7 '小明':1 '日本':15 '日本京都大学':18 '毕业':3 '深造':19 '硕士':2 '科学':6 '科学院':8 '计算':10 '计算所':11
(1 row)

Exact mode with position (jiebacfg_pos):

SELECT * FROM to_tsvector('jiebacfg_pos', '小明硕士毕业于中国科学院计算所，后在日本京都大学深造');

                                                          to_tsvector
------------------------------------------------------------------------------------------------------------------------------------------
 '中国科学院:7':5 '于:6':4 '后:16':8 '在:17':9 '小明:0':1 '日本京都大学:18':10 '毕业:4':3 '深造:24':11 '硕士:2':2 '计算所:12':6 '，:15':7
(1 row)

Example 2: `'李小福是创新办主任也是云计算方面的专家'`

Exact mode (jiebacfg):

SELECT * FROM to_tsvector('jiebacfg', '李小福是创新办主任也是云计算方面的专家');

                       to_tsvector
-------------------------------------------------------------------
 '专家':11 '主任':5 '云计算':8 '创新':3 '办':4 '方面':9 '李小福':1
(1 row)

Full mode (jiebaqry):

SELECT * FROM to_tsvector('jiebaqry', '李小福是创新办主任也是云计算方面的专家');

                             to_tsvector
-----------------------------------------------------------------------------
 '专家':12 '主任':5 '云计算':9 '创新':3 '办':4 '方面':10 '李小福':1 '计算':8
(1 row)

Exact mode with position (jiebacfg_pos):

SELECT * FROM to_tsvector('jiebacfg_pos', '李小福是创新办主任也是云计算方面的专家');

                                                    to_tsvector
---------------------------------------------------------------------------------------------------------------------------
 '专家:17':11 '主任:7':5 '也:9':6 '云计算:11':8 '创新:4':3 '办:6':4 '方面:14':9 '是:10':7 '是:3':2 '李小福:0':1 '的:16':10
(1 row)
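
The difference between exact mode and full mode matters most when a search term is a sub-word of a longer token. The following sketch reuses the sentence from Example 2 with the standard @@ match operator and to_tsquery. Judging from the outputs above, exact mode keeps 云计算 as a single token while full mode also emits 计算, so the first query would be expected to return false and the second true. Treat this as an expectation, not verified output, because the result also depends on how the query text itself is tokenized.

```sql
-- Exact mode: the document vector contains '云计算' but no separate '计算' token.
SELECT to_tsvector('jiebacfg', '李小福是创新办主任也是云计算方面的专家')
       @@ to_tsquery('jiebacfg', '计算') AS exact_match;

-- Full mode: the document vector also contains the sub-word '计算'.
SELECT to_tsvector('jiebaqry', '李小福是创新办主任也是云计算方面的专家')
       @@ to_tsquery('jiebaqry', '计算') AS full_match;
```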

Configure custom dictionaries

pg_jieba supports multiple custom dictionaries. Load a custom dictionary to make the tokenizer recognize domain-specific terms, such as product names or technical jargon, as single tokens.

Important

Before using custom dictionaries, add pg_jieba to the shared_preload_libraries parameter in the console. Saving this change restarts the cluster, so proceed with caution.
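
You can check the current value from a SQL session with the standard PostgreSQL SHOW command. pg_jieba should appear in the list once the parameter change has taken effect. The value itself is changed in the console, not with SQL.

```sql
-- Lists the libraries that are preloaded at server start.
SHOW shared_preload_libraries;
```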

Understand the jieba_user_dict schema

Custom dictionary entries are stored in the jieba_user_dict table. Each entry accepts up to three fields:

Field	Description	Default
Term	The word or phrase to add	Required
Dictionary index	Identifies which custom dictionary the term belongs to. `0` is the first (and default) dictionary.	`0`
Weight value	Controls the term's priority during tokenization.	`10`

Add terms to a custom dictionary

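Insert terms into the dictionary before loading it. The following example adds two terms to dictionary 0 with weight 10. The first statement relies on the default dictionary index and weight, and the second sets them explicitly: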

INSERT INTO jieba_user_dict VALUES ('阿里云');
INSERT INTO jieba_user_dict VALUES ('研发工程师', 0, 10);

Without a custom dictionary loaded, the built-in tokenizer splits these terms:

SELECT * FROM to_tsvector('jiebacfg', 'zth是阿里云的一个研发工程师');

                    to_tsvector
------------------------------------------------------
 'zth':1 '一个':6 '云':4 '工程师':8 '研发':7 '阿里':3
(1 row)

阿里云 and 研发工程师 are each split into sub-tokens because the built-in dictionary does not recognize them as single terms.

Load the custom dictionary

Switch to dictionary 0 to activate the terms you inserted:

SELECT jieba_load_user_dict(0);

 jieba_load_user_dict
----------------------

(1 row)

Run the same query again:

SELECT * FROM to_tsvector('jiebacfg', 'zth是阿里云的一个研发工程师');

              to_tsvector
--------------------------------------------
 'zth':1 '一个':5 '研发工程师':6 '阿里云':3
(1 row)

阿里云 and 研发工程师 are now recognized as single tokens.
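
Because each entry carries a dictionary index, a second dictionary can be kept alongside the first. The sketch below follows the same positional pattern as the earlier INSERT statements. The term 云原生 and the index 1 are illustrative choices, and the sketch assumes that loading dictionary 1 makes it the active one in the same way that dictionary 0 was activated above.

```sql
-- Illustrative term added to a second custom dictionary (index 1), weight 10.
INSERT INTO jieba_user_dict VALUES ('云原生', 1, 10);

-- Inspect the stored entries across all dictionaries.
SELECT * FROM jieba_user_dict;

-- Assumption: this switches the active custom dictionary to index 1.
SELECT jieba_load_user_dict(1);
```

Rerun a to_tsvector query after switching to see which custom terms are recognized.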
