pg_jieba is a PostgreSQL extension that adds Chinese full-text search to ApsaraDB RDS for PostgreSQL. It integrates the Jieba Chinese text segmentation library into PostgreSQL's built-in full-text search framework, so you can run to_tsvector and @@ queries on Chinese text using standard PostgreSQL syntax.
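For example, once the extension is installed (installation steps follow below), a match test on a sentence used in the examples in this topic looks like this:

```sql
-- Segment the document with the jiebacfg configuration, then test it
-- against a parsed query. 云计算 is produced as a single token by the
-- default dictionary, so the match succeeds.
SELECT to_tsvector('jiebacfg', '李小福是创新办主任也是云计算方面的专家')
       @@ to_tsquery('jiebacfg', '云计算') AS matched;
-- matched = t
```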
Prerequisites
Before you begin, ensure that you have:
An ApsaraDB RDS for PostgreSQL instance running PostgreSQL 10 or later
The minor engine version 20241030 or later if your instance runs PostgreSQL 17 (see Update the minor engine version)
pg_jieba added to the shared_preload_libraries parameter (see Modify instance parameters)
A privileged account with permission to run CREATE EXTENSION and DROP EXTENSION
Install and remove the extension
Install pg_jieba:
CREATE EXTENSION pg_jieba;
Remove pg_jieba:
DROP EXTENSION pg_jieba;
Only privileged accounts can run CREATE EXTENSION and DROP EXTENSION.
Search Chinese text
Pass your text and configuration name to to_tsvector(), then match it against a query built with to_tsquery() or plainto_tsquery(). The default configuration is jiebacfg.
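to_tsquery also understands the standard boolean operators (& for AND, | for OR, ! for NOT), while plainto_tsquery simply ANDs every token it extracts from plain text. A short sketch, reusing a sentence from the examples below:

```sql
-- The text contains both 云计算 and 专家 as tokens, so requiring
-- 云计算 while excluding 专家 fails to match.
SELECT to_tsvector('jiebacfg', '李小福是创新办主任也是云计算方面的专家')
       @@ to_tsquery('jiebacfg', '云计算 & !专家') AS matched;
-- matched = f
```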
The following examples show how jiebacfg tokenizes Chinese text:
SELECT * FROM to_tsvector('jiebacfg', '小明硕士毕业于中国科学院计算所,后在日本京都大学深造');
 to_tsvector
--------------------------------------------------------------------------------------------------------------
'中国科学院':5 '于':4 '后':8 '在':9 '小明':1 '日本京都大学':10 '毕业':3 '深造':11 '硕士':2 '计算所':6 ',':7
(1 row)

SELECT * FROM to_tsvector('jiebacfg', '李小福是创新办主任也是云计算方面的专家');
 to_tsvector
-------------------------------------------------------------------------------------------
'专家':11 '主任':5 '也':6 '云计算':8 '创新':3 '办':4 '方面':9 '是':2,7 '李小福':1 '的':10
(1 row)

To check which version of pg_jieba is installed:
SELECT * FROM pg_available_extensions WHERE name = 'pg_jieba';
Search a table with a GIN index
For production workloads, compute the tsvector once using a generated column and index it with GIN. This avoids recalculating the token vector for every query.
Generated columns with STORED require PostgreSQL 12 or later. If your instance runs PostgreSQL 10 or 11, store the tsvector in a regular column and update it with a trigger instead.
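A minimal sketch of that trigger-based approach for PostgreSQL 10 and 11 (the table, function, and trigger names here are illustrative, not part of pg_jieba):

```sql
-- Regular tsvector column maintained by a trigger instead of a
-- generated column.
CREATE TABLE articles_v10 (
    id   serial PRIMARY KEY,
    body text,
    fts  tsvector
);

-- Recompute the tsvector whenever the body is inserted or updated.
CREATE FUNCTION articles_v10_fts_update() RETURNS trigger AS $$
BEGIN
    NEW.fts := to_tsvector('jiebacfg', coalesce(NEW.body, ''));
    RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER articles_v10_fts_trg
    BEFORE INSERT OR UPDATE OF body ON articles_v10
    FOR EACH ROW EXECUTE PROCEDURE articles_v10_fts_update();

-- The GIN index works the same way as with a generated column.
CREATE INDEX articles_v10_fts_idx ON articles_v10 USING GIN (fts);
```

EXECUTE PROCEDURE is used here because EXECUTE FUNCTION is only accepted from PostgreSQL 11 onward.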
-- Create a table with a generated tsvector column
CREATE TABLE articles (
id serial PRIMARY KEY,
body text,
fts tsvector GENERATED ALWAYS AS (to_tsvector('jiebacfg', body)) STORED
);
-- Build a GIN index on the tsvector column
CREATE INDEX articles_fts_idx ON articles USING GIN (fts);
-- Insert sample data
INSERT INTO articles (body) VALUES
('小明硕士毕业于中国科学院计算所,后在日本京都大学深造'),
('李小福是创新办主任也是云计算方面的专家');
-- Search: returns matching rows, not just tokens
SELECT id, body
FROM articles
WHERE fts @@ plainto_tsquery('jiebacfg', '云计算');
 id | body
----+----------------------------------------
2 | 李小福是创新办主任也是云计算方面的专家
(1 row)

Extended features
The features available depend on your installed version of pg_jieba.
Version 1.1.0: custom dictionaries and offset-based segmentation
Use custom dictionaries
Custom dictionaries let you add domain-specific terms that the built-in dictionary does not recognize.
The jieba_user_dict table stores custom dictionary entries. Each entry has up to three fields:
| Field | Description | Example |
|---|---|---|
| Word | The term to add | '阿里云' |
| Dictionary index | The sequence number of the custom dictionary (0 = first dictionary) | 0 |
| Weight value | Segmentation priority (higher = stronger preference) | 10 |
Insert entries into the default custom dictionary (index 0):
-- Minimal: word only (inserted into dictionary 0)
INSERT INTO jieba_user_dict VALUES ('阿里云');
-- Full: word, dictionary index, weight value
INSERT INTO jieba_user_dict VALUES ('研发工程师', 0, 10);
Without loading the custom dictionary, pg_jieba uses its built-in dictionary:
SELECT * FROM to_tsvector('jiebacfg', 'zth是阿里云的一个研发工程师');
 to_tsvector
------------------------------------------------------
'zth':1 '一个':6 '云':4 '工程师':8 '研发':7 '阿里':3
(1 row)

Load custom dictionary 0, then run the same query. 阿里云 and 研发工程师 are now recognized as single terms:
SELECT jieba_load_user_dict(0);
SELECT * FROM to_tsvector('jiebacfg', 'zth是阿里云的一个研发工程师');
 to_tsvector
--------------------------------------------
'zth':1 '一个':5 '研发工程师':6 '阿里云':3
(1 row)

View segmentation results with character offsets
Use the jiebacfg_pos configuration to see the character offset of each token, which is useful for highlighting matched terms in your application.
SELECT * FROM to_tsvector('jiebacfg_pos', 'zth是阿里云的一个研发工程师');
 to_tsvector
--------------------------------------------------------------------------------------
'zth:0':1 '一个:8':6 '云:6':4 '工程师:12':8 '是:3':2 '的:7':5 '研发:10':7 '阿里:4':3
(1 row)

Each token is formatted as word:offset, where offset is the zero-based character position of the first character of the word in the original string.
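Because the offsets are zero-based character positions, they map directly onto PostgreSQL's one-based substr(), which is convenient when highlighting matches in application code. For instance, using the '工程师:12' token from the output above:

```sql
-- 工程师 starts at character offset 12 (zero-based) and is 3 characters
-- long. substr() is one-based, so add 1 to the offset.
SELECT substr('zth是阿里云的一个研发工程师', 12 + 1, 3);
-- returns 工程师
```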
Version 1.2.0: optimized dictionary loading
Version 1.2.0 reduces CPU utilization and memory usage for jieba_load_user_dict() and adds a second parameter to control whether the default dictionary is loaded alongside the custom dictionary.
Syntax
jieba_load_user_dict(dictionary_index, use_default_dictionary)
| Parameter | Description | Values |
|---|---|---|
| dictionary_index | Sequence number of the custom dictionary to load | Integer (0 = first custom dictionary) |
| use_default_dictionary | Whether to load the built-in dictionary | 0 = load default dictionary; 1 = skip default dictionary |
Load custom dictionary with the default dictionary
INSERT INTO jieba_user_dict VALUES ('阿里云');
INSERT INTO jieba_user_dict VALUES ('研发工程师', 0, 10);
-- Load custom dictionary 0, also use the default dictionary
SELECT jieba_load_user_dict(0, 0);
SELECT * FROM to_tsvector('jiebacfg', 'zth是阿里云的一个研发工程师');
 to_tsvector
--------------------------------------------
'zth':1 '一个':5 '研发工程师':6 '阿里云':3
(1 row)

Load custom dictionary only (skip the default dictionary)
-- Load custom dictionary 0, skip the default dictionary
SELECT jieba_load_user_dict(0, 1);
SELECT * FROM to_tsvector('jiebacfg', 'zth是阿里云的一个研发工程师');
 to_tsvector
------------------------------------------------------
'zth':1 '一个':6 '云':4 '工程师':8 '研发':7 '阿里':3
(1 row)

When the default dictionary is skipped, compound terms in the custom dictionary may not segment correctly if they depend on base-word recognition from the default dictionary.
If jieba_user_dict or jieba_load_user_dict() does not exist, update the minor engine version of your instance to 20220730 or later (see Update the minor engine version), then reinstall the extension:
DROP EXTENSION pg_jieba;
CREATE EXTENSION pg_jieba;