Domain-specific terms — such as product names, legal phrases, or industry jargon — are often split incorrectly by general-purpose tokenizers, which causes BM25 full-text search to miss or misrank relevant results. The jieba tokenizer in AnalyticDB for PostgreSQL supports custom word segmentation dictionaries, letting you add specialized vocabulary so the tokenizer treats those terms as single tokens.
Prerequisites
Before you begin, make sure you have:
- An AnalyticDB for PostgreSQL V7.0 instance running minor version V7.2.1.0 or later. To check your minor version, see View the minor version of an instance. To upgrade, see UpgradeDBVersion.
- The pgsearch extension installed on your instance. For installation instructions, see the "Installation and uninstallation" section of BM25 high-performance full-text search.
Limitations
Only the jieba tokenizer supports custom dictionaries.
How it works
Custom dictionaries are stored in the pgsearch.jieba_custom_word table. The dict and word columns form a composite primary key, so duplicate word segments in the same dictionary are not allowed. The default dictionary is named default.
When you add a word segment:
- If you omit the dictionary name, the word is added to default.
- If the specified dictionary does not exist, a new dictionary is created automatically.
- If the specified dictionary already exists, the word is added to it.
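If you update or delete a word segment without specifying a dictionary name, the statement scans all dictionaries and affects the matching entry in each one. To limit the change to a single dictionary, include the dict column in the WHERE clause.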
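The rules above can be checked directly against the table. The following is a minimal sketch that assumes only the dict and word columns described in this section; the word used in the last query is just the sample term from this topic.

```sql
-- List every custom entry, grouped by dictionary.
SELECT dict, word
FROM pgsearch.jieba_custom_word
ORDER BY dict, word;

-- Count the entries in each dictionary.
SELECT dict, count(*) AS entries
FROM pgsearch.jieba_custom_word
GROUP BY dict
ORDER BY dict;

-- Before running an UPDATE or DELETE without a dict filter,
-- check which dictionaries contain the word and would be affected.
SELECT dict, word
FROM pgsearch.jieba_custom_word
WHERE word = '永和服装饰品';
```

Running the last query first is a cheap way to avoid changing an entry in a dictionary you did not intend to touch.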
Manage dictionary entries
Add, update, or delete entries in the default dictionary
-- Add a word to the default dictionary (implicit)
INSERT INTO pgsearch.jieba_custom_word(word) VALUES('永和服装饰品');
-- Equivalent explicit form. Run only one of these two INSERT statements for a given word: the (dict, word) primary key rejects duplicates.
INSERT INTO pgsearch.jieba_custom_word(dict, word) VALUES('default', '永和服装饰品');
-- Update a word in the default dictionary
UPDATE pgsearch.jieba_custom_word SET word = '永和' WHERE dict = 'default' AND word = '永和服装饰品';
-- Delete a word from the default dictionary
DELETE FROM pgsearch.jieba_custom_word WHERE dict = 'default' AND word = '永和服装饰品';
Add, update, or delete entries in a custom dictionary
-- Add a word to a custom dictionary (created automatically if it does not exist)
INSERT INTO pgsearch.jieba_custom_word(dict, word) VALUES('custom_dict', '永和服装饰品');
-- Update a word in a custom dictionary
UPDATE pgsearch.jieba_custom_word SET word = '永和' WHERE dict = 'custom_dict' AND word = '永和服装饰品';
-- Delete a word from a custom dictionary
DELETE FROM pgsearch.jieba_custom_word WHERE dict = 'custom_dict' AND word = '永和服装饰品';
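Because the dict and word columns form a composite primary key, inserting a word that already exists in a dictionary fails. When you load several terms at once, one way to skip entries that are already present is sketched below. It uses only standard PostgreSQL syntax and assumes that both columns accept text values; the second sample word is illustrative only.

```sql
-- Add several words to custom_dict in one statement,
-- skipping any word that the dictionary already contains.
INSERT INTO pgsearch.jieba_custom_word(dict, word)
SELECT 'custom_dict', v.word
FROM (VALUES
    ('永和服装饰品'),
    ('有限公司')        -- illustrative sample term
) AS v(word)
WHERE NOT EXISTS (
    SELECT 1
    FROM pgsearch.jieba_custom_word w
    WHERE w.dict = 'custom_dict'
      AND w.word = v.word
);
```

The NOT EXISTS guard makes the statement safe to rerun, which is convenient when the term list is maintained in a script.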
Load a dictionary
After updating entries in pgsearch.jieba_custom_word, reload the dictionary into memory:
SELECT pgsearch.reload_user_dict('custom_dict');
The reload does not automatically apply to existing indexed data. To apply the updated dictionary to existing data, complete the following steps in order:
- Close existing database connections and re-establish them.
- Rebuild the indexes on tables that use the dictionary.
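The sequence described above can be sketched as follows. This is an outline rather than a complete script: the reconnect step is shown as a psql meta-command in a comment (an assumption that you are working in psql), and the index rebuild is left as a comment because the exact command depends on how the index was created.

```sql
-- After editing entries in pgsearch.jieba_custom_word,
-- reload the dictionary into memory.
SELECT pgsearch.reload_user_dict('custom_dict');

-- Step 1: close the connection and open a new one.
-- In psql, for example, \c with no arguments reconnects
-- to the current database as the current user:
--   \c

-- Step 2: rebuild the indexes on tables that use custom_dict
-- (not shown here).

-- Optional check in the new session: confirm how the tokenizer
-- now segments a sample string.
SELECT pgsearch.tokenizer(
    pgsearch.tokenizer('jieba', dict=>'custom_dict'),
    '永和服装饰品有限公司'
);
```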
Create a BM25 index with a custom dictionary
Specify a custom dictionary when calling pgsearch.create_bm25().
Single column:
CALL pgsearch.create_bm25(
    index_name => '<index_name>',
    table_name => '<table_name>',
    text_fields => pgsearch.field('<column_name>', tokenizer=>pgsearch.tokenizer('jieba', dict=>'<dict_name>'))
);
Multiple columns with different dictionaries:
CALL pgsearch.create_bm25(
    index_name => '<index_name>',
    table_name => '<table_name>',
    text_fields => pgsearch.field('<column1_name>', tokenizer=>pgsearch.tokenizer('jieba', hmm=>false, SEARCH=>false, dict=>'<dict_name>'))
                || pgsearch.field('<column2_name>', tokenizer=>pgsearch.tokenizer('jieba', hmm=>false, SEARCH=>false, dict=>'<dict2_name>'))
);
Replace the placeholders with actual values:
| Placeholder | Description |
|---|---|
| <index_name> | Name of the BM25 index to create |
| <table_name> | Name of the table to index |
| <column_name> | Name of the column to index (<column1_name> and <column2_name> in the multi-column example) |
| <dict_name> | Name of the custom dictionary |
| <dict2_name> | Name of the custom dictionary for the second column |
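As an illustration, the single-column template might be filled in as shown below. The products table, its columns, and the index name are hypothetical and exist only for this example. The call passes just the three parameters shown in this topic; if pgsearch.create_bm25() requires additional parameters in your version, add them as described in BM25 high-performance full-text search.

```sql
-- Hypothetical table used only for this example.
CREATE TABLE products (
    id   bigint PRIMARY KEY,
    name text
);

INSERT INTO products(id, name) VALUES (1, '永和服装饰品有限公司');

-- Index the name column with the jieba tokenizer and custom_dict.
CALL pgsearch.create_bm25(
    index_name => 'products_name_idx',
    table_name => 'products',
    text_fields => pgsearch.field('name', tokenizer=>pgsearch.tokenizer('jieba', dict=>'custom_dict'))
);
```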
Verify word segmentation
Use pgsearch.tokenizer() to confirm how the jieba tokenizer segments text with your custom dictionary. This lets you catch segmentation issues before building an index.
For example, after adding 永和服装饰品 to custom_dict, you can verify that the tokenizer treats it as a single token:
SELECT pgsearch.tokenizer(pgsearch.tokenizer('jieba', hmm=>false, SEARCH=>false, dict=>'custom_dict'), '永和服装饰品有限公司');
For more information about pgsearch.tokenizer() parameters, see the "Parameters" section of BM25 high-performance full-text search.
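To see what the custom dictionary changes, you can run the same text through the tokenizer twice and compare the results. The sketch below assumes that the word exists only in custom_dict and not in the default dictionary; if it was added to both, the two results are expected to match.

```sql
-- Segmentation using the default dictionary.
SELECT pgsearch.tokenizer(
    pgsearch.tokenizer('jieba', hmm=>false, SEARCH=>false, dict=>'default'),
    '永和服装饰品有限公司'
);

-- Segmentation using the custom dictionary.
SELECT pgsearch.tokenizer(
    pgsearch.tokenizer('jieba', hmm=>false, SEARCH=>false, dict=>'custom_dict'),
    '永和服装饰品有限公司'
);
```

If the two results are identical and the custom term is still split, check that the dictionary was reloaded and that the session was re-established after the reload.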
What's next
- BM25 high-performance full-text search: full usage guide, including index creation, search queries, and tokenizer parameters