pg_jieba is a PostgreSQL extension that adds Chinese full-text search to ApsaraDB RDS for PostgreSQL. It integrates the Jieba Chinese text segmentation library into PostgreSQL's built-in full-text search framework, so you can run to_tsvector and @@ queries on Chinese text using standard PostgreSQL syntax.
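For example, once the extension is installed (installation steps follow below), a match test on a sentence used in the examples in this topic looks like this:

```sql
-- Segment the document with the jiebacfg configuration, then test it
-- against a parsed query. 云计算 is produced as a single token by the
-- default dictionary, so the match succeeds.
SELECT to_tsvector('jiebacfg', '李小福是创新办主任也是云计算方面的专家')
       @@ to_tsquery('jiebacfg', '云计算') AS matched;
-- matched = t
```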
Prerequisites
Before you begin, ensure that you have:
An ApsaraDB RDS for PostgreSQL instance running PostgreSQL 10 or later
The minor engine version 20241030 or later if your instance runs PostgreSQL 17 (see Update the minor engine version)
pg_jieba added to the shared_preload_libraries parameter (see Modify instance parameters)
A privileged account with permission to run CREATE EXTENSION and DROP EXTENSION
Install and remove the extension
Install pg_jieba:
CREATE EXTENSION pg_jieba;
Remove pg_jieba:
DROP EXTENSION pg_jieba;
Only privileged accounts can run CREATE EXTENSION and DROP EXTENSION.
Search Chinese text
Pass your text and configuration name to to_tsvector(), then match it against a query built with to_tsquery() or plainto_tsquery(). The default configuration is jiebacfg.
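to_tsquery also understands the standard boolean operators (& for AND, | for OR, ! for NOT), while plainto_tsquery simply ANDs every token it extracts from plain text. A short sketch, reusing a sentence from the examples below:

```sql
-- The text contains both 云计算 and 专家 as tokens, so requiring
-- 云计算 while excluding 专家 fails to match.
SELECT to_tsvector('jiebacfg', '李小福是创新办主任也是云计算方面的专家')
       @@ to_tsquery('jiebacfg', '云计算 & !专家') AS matched;
-- matched = f
```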
The following examples show how jiebacfg tokenizes Chinese text:
SELECT * FROM to_tsvector('jiebacfg', '小明硕士毕业于中国科学院计算所,后在日本京都大学深造');
 to_tsvector
--------------------------------------------------------------------------------------------------------------
'中国科学院':5 '于':4 '后':8 '在':9 '小明':1 '日本京都大学':10 '毕业':3 '深造':11 '硕士':2 '计算所':6 ',':7
(1 row)

SELECT * FROM to_tsvector('jiebacfg', '李小福是创新办主任也是云计算方面的专家');
 to_tsvector
-------------------------------------------------------------------------------------------
'专家':11 '主任':5 '也':6 '云计算':8 '创新':3 '办':4 '方面':9 '是':2,7 '李小福':1 '的':10
(1 row)

To check which version of pg_jieba is installed:
SELECT * FROM pg_available_extensions WHERE name = 'pg_jieba';
Search a table with a GIN index
For production workloads, compute the tsvector once using a generated column and index it with GIN. This avoids recalculating the token vector for every query.
Generated columns with STORED require PostgreSQL 12 or later. If your instance runs PostgreSQL 10 or 11, store the tsvector in a regular column and update it with a trigger instead.
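A minimal sketch of that trigger-based approach for PostgreSQL 10 and 11 (the table, function, and trigger names here are illustrative, not part of pg_jieba):

```sql
-- Regular tsvector column maintained by a trigger instead of a
-- generated column.
CREATE TABLE articles_v10 (
    id   serial PRIMARY KEY,
    body text,
    fts  tsvector
);

-- Recompute the tsvector whenever the body is inserted or updated.
CREATE FUNCTION articles_v10_fts_update() RETURNS trigger AS $$
BEGIN
    NEW.fts := to_tsvector('jiebacfg', coalesce(NEW.body, ''));
    RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER articles_v10_fts_trg
    BEFORE INSERT OR UPDATE OF body ON articles_v10
    FOR EACH ROW EXECUTE PROCEDURE articles_v10_fts_update();

-- The GIN index works the same way as with a generated column.
CREATE INDEX articles_v10_fts_idx ON articles_v10 USING GIN (fts);
```

EXECUTE PROCEDURE is used here because EXECUTE FUNCTION is only accepted from PostgreSQL 11 onward.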
-- Create a table with a generated tsvector column
CREATE TABLE articles (
id serial PRIMARY KEY,
body text,
fts tsvector GENERATED ALWAYS AS (to_tsvector('jiebacfg', body)) STORED
);
-- Build a GIN index on the tsvector column
CREATE INDEX articles_fts_idx ON articles USING GIN (fts);
-- Insert sample data
INSERT INTO articles (body) VALUES
('小明硕士毕业于中国科学院计算所,后在日本京都大学深造'),
('李小福是创新办主任也是云计算方面的专家');
-- Search: returns matching rows, not just tokens
SELECT id, body
FROM articles
WHERE fts @@ plainto_tsquery('jiebacfg', '云计算');
 id | body
----+----------------------------------------
2 | 李小福是创新办主任也是云计算方面的专家
(1 row)

Extended features
The features available depend on your installed version of pg_jieba.
Version 1.1.0: custom dictionaries and offset-based segmentation
Use custom dictionaries
Custom dictionaries let you add domain-specific terms that the built-in dictionary does not recognize.
The jieba_user_dict table stores custom dictionary entries. Each entry has up to three fields:
| Field | Description | Example |
|---|---|---|
| Word | The term to add | '阿里云' |
| Dictionary index | The sequence number of the custom dictionary (0 = first dictionary) | 0 |
| Weight value | Segmentation priority (higher = stronger preference) | 10 |
Insert entries into the default custom dictionary (index 0):
-- Minimal: word only (inserted into dictionary 0)
INSERT INTO jieba_user_dict VALUES ('阿里云');
-- Full: word, dictionary index, weight value
INSERT INTO jieba_user_dict VALUES ('研发工程师', 0, 10);
Without loading the custom dictionary, pg_jieba uses its built-in dictionary:
SELECT * FROM to_tsvector('jiebacfg', 'zth是阿里云的一个研发工程师');
 to_tsvector
------------------------------------------------------
'zth':1 '一个':6 '云':4 '工程师':8 '研发':7 '阿里':3
(1 row)

Load custom dictionary 0, then run the same query. 阿里云 and 研发工程师 are now recognized as single terms:
SELECT jieba_load_user_dict(0);
SELECT * FROM to_tsvector('jiebacfg', 'zth是阿里云的一个研发工程师');
 to_tsvector
--------------------------------------------
'zth':1 '一个':5 '研发工程师':6 '阿里云':3
(1 row)

View segmentation results with character offsets
Use the jiebacfg_pos configuration to see the character offset of each token, which is useful for highlighting matched terms in your application.
SELECT * FROM to_tsvector('jiebacfg_pos', 'zth是阿里云的一个研发工程师');
 to_tsvector
--------------------------------------------------------------------------------------
'zth:0':1 '一个:8':6 '云:6':4 '工程师:12':8 '是:3':2 '的:7':5 '研发:10':7 '阿里:4':3
(1 row)

Each token is formatted as word:offset, where offset is the zero-based character position of the first character of the word in the original string.
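Because the offsets are zero-based character positions, they map directly onto PostgreSQL's one-based substr(), which is convenient when highlighting matches in application code. For instance, using the '工程师:12' token from the output above:

```sql
-- 工程师 starts at character offset 12 (zero-based) and is 3 characters
-- long. substr() is one-based, so add 1 to the offset.
SELECT substr('zth是阿里云的一个研发工程师', 12 + 1, 3);
-- returns 工程师
```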
Version 1.2.0: optimized dictionary loading
Version 1.2.0 reduces CPU utilization and memory usage for jieba_load_user_dict() and adds a second parameter to control whether the default dictionary is loaded alongside the custom dictionary.
Syntax
jieba_load_user_dict(dictionary_index, use_default_dictionary)
| Parameter | Description | Values |
|---|---|---|
| dictionary_index | Sequence number of the custom dictionary to load | Integer (0 = first custom dictionary) |
| use_default_dictionary | Whether to load the built-in dictionary | 0 = load default dictionary; 1 = skip default dictionary |
Load custom dictionary with the default dictionary
INSERT INTO jieba_user_dict VALUES ('阿里云');
INSERT INTO jieba_user_dict VALUES ('研发工程师', 0, 10);
-- Load custom dictionary 0, also use the default dictionary
SELECT jieba_load_user_dict(0, 0);
SELECT * FROM to_tsvector('jiebacfg', 'zth是阿里云的一个研发工程师');
 to_tsvector
--------------------------------------------
'zth':1 '一个':5 '研发工程师':6 '阿里云':3
(1 row)

Load custom dictionary only (skip the default dictionary)
-- Load custom dictionary 0, skip the default dictionary
SELECT jieba_load_user_dict(0, 1);
SELECT * FROM to_tsvector('jiebacfg', 'zth是阿里云的一个研发工程师');
 to_tsvector
------------------------------------------------------
'zth':1 '一个':6 '云':4 '工程师':8 '研发':7 '阿里':3
(1 row)

When the default dictionary is skipped, compound terms in the custom dictionary may not segment correctly if they depend on base-word recognition from the default dictionary.
If jieba_user_dict or jieba_load_user_dict() does not exist, update the minor engine version of your instance to 20220730 or later (see Update the minor engine version), then reinstall the extension:
DROP EXTENSION pg_jieba;
CREATE EXTENSION pg_jieba;