Stop words are common, high-frequency terms that carry little search value, such as "的" (a possessive particle), "是" ("is"), or "the". Without stop word filtering, a search for "同一个世界" ("the same world") treats every token equally. With a stop word dictionary configured, the jieba tokenizer focuses on the terms that distinguish documents, which improves BM25 search relevance and query performance.
The pgsearch extension includes two built-in stop word dictionaries: CN_SIMPLE for Chinese and EN_SIMPLE for English.
Prerequisites
Before you begin, ensure that you have:
- An AnalyticDB for PostgreSQL V7.0 instance running minor version V7.2.1.0 or later
To check the minor version of your instance, see View the minor version of an instance. To upgrade, see UpgradeDBVersion.
Manage stop words
Stop word dictionaries are stored in the pgsearch.stopword_dict table. The dict column holds the dictionary name and the word column holds the stop word. Together, they form the composite primary key, so duplicate stop words within the same dictionary are not allowed.
The jieba tokenizer uses a dictionary named default by default. An instance can have multiple stop word dictionaries.
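Because the dictionaries are stored in a table, they can be inspected with ordinary queries. The following is a minimal sketch that relies only on the dict and word columns described above. It assumes your account can read pgsearch.stopword_dict; the rows returned depend on what your instance contains.

```sql
-- List every stop word dictionary that has entries, with its size.
SELECT dict, count(*) AS word_count
FROM pgsearch.stopword_dict
GROUP BY dict
ORDER BY dict;

-- List the stop words stored in the default dictionary.
SELECT word
FROM pgsearch.stopword_dict
WHERE dict = 'default'
ORDER BY word;
```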
Add a stop word
When adding a stop word, always specify a dictionary name. The default dictionary always exists. If you name any other dictionary that does not exist yet, pgsearch creates it automatically and adds the stop word to it.
Add a stop word to the default dictionary:
INSERT INTO pgsearch.stopword_dict(dict, word) VALUES('default', '的');
Add a stop word to a custom dictionary named user_stop_cn:
INSERT INTO pgsearch.stopword_dict(dict, word) VALUES('user_stop_cn', '的');
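To load several stop words at once, a multi-row INSERT may be more convenient. The sketch below assumes that the standard PostgreSQL ON CONFLICT clause is accepted on this table; that is an assumption, not something this page states. If it holds, rows that would violate the (dict, word) primary key are skipped instead of failing the whole statement.

```sql
-- Add several stop words to user_stop_cn in one statement.
-- Assumption: ON CONFLICT works on pgsearch.stopword_dict as it does on a regular table.
INSERT INTO pgsearch.stopword_dict(dict, word)
VALUES
  ('user_stop_cn', '的'),
  ('user_stop_cn', '是'),
  ('user_stop_cn', '一个')
ON CONFLICT (dict, word) DO NOTHING;  -- skip words already in the dictionary
```

If ON CONFLICT is not available in your environment, insert the rows one at a time as shown above and ignore or handle duplicate-key errors.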
Update a stop word
Update a stop word in the default dictionary:
UPDATE pgsearch.stopword_dict SET word = '一个' WHERE dict = 'default' AND word = '的';
Update a stop word in the user_stop_cn dictionary:
UPDATE pgsearch.stopword_dict SET word = '一个' WHERE dict = 'user_stop_cn' AND word = '的';
Remove a stop word
Remove a stop word from the default dictionary:
DELETE FROM pgsearch.stopword_dict WHERE dict = 'default' AND word = '的';
Remove a stop word from the user_stop_cn dictionary:
DELETE FROM pgsearch.stopword_dict WHERE dict = 'user_stop_cn' AND word = '的';
If you omit the dict filter in an UPDATE or DELETE statement, pgsearch scans all dictionaries and modifies every matching stop word across all of them.
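Before running a statement without the dict filter, it can help to preview which dictionaries contain the word. The following sketch assumes that changes to pgsearch.stopword_dict are transactional like those on a regular table, so the deletion can be inspected and rolled back.

```sql
-- Preview every dictionary that currently contains the stop word.
SELECT dict, word
FROM pgsearch.stopword_dict
WHERE word = '的';

-- Remove the word from all dictionaries inside a transaction.
BEGIN;
DELETE FROM pgsearch.stopword_dict WHERE word = '的';

-- Check the result: this should return no rows after the delete.
SELECT dict, word
FROM pgsearch.stopword_dict
WHERE word = '的';

-- Use COMMIT to keep the change, or ROLLBACK to undo it.
ROLLBACK;
```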
Reload a dictionary
After updating a dictionary, reload it into memory:
SELECT pgsearch.reload_stopword_dict('user_stop_cn');
Reloading the dictionary does not automatically update existing indexed data. To apply the new stop word configuration to existing data, complete the following steps in order:
- Close all existing database connections and reconnect.
- Rebuild the indexes on the affected tables.
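The full sequence, from changing a dictionary to applying it, can be sketched as follows. Only statements that appear on this page are used; the reconnect and rebuild steps are left as comments because they depend on your client and on how the indexes were created.

```sql
-- Step 1: change the dictionary (here, add one stop word to user_stop_cn).
INSERT INTO pgsearch.stopword_dict(dict, word) VALUES('user_stop_cn', '一个');

-- Step 2: reload the changed dictionary into memory.
SELECT pgsearch.reload_stopword_dict('user_stop_cn');

-- Step 3: close all existing database connections and reconnect
--         (done from the client or application, not with SQL).

-- Step 4: rebuild the BM25 indexes on the affected tables so that
--         existing rows are tokenized with the new stop words.
```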
Create a BM25 index with a stop word dictionary
When creating a BM25 index, use the stopword parameter to assign a stop word dictionary to the jieba tokenizer:
CALL pgsearch.create_bm25(
index_name => '<index_name>',
table_name => '<table_name>',
text_fields => pgsearch.field('<column_name>', tokenizer => pgsearch.tokenizer('jieba', SEARCH => false, dict => '<dict_name>', stopword => '<stopword_dict_name>'))
);
Replace the placeholders with your actual values:
| Placeholder | Description | Example |
|---|---|---|
| `<index_name>` | Name of the BM25 index to create | my_bm25_index |
| `<table_name>` | Table to index | articles |
| `<column_name>` | Text column to index | content |
| `<dict_name>` | jieba segmentation dictionary | default |
| `<stopword_dict_name>` | Stop word dictionary to apply | user_stop_cn |
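With the example values from the table filled in, the call might look like the following. This is a sketch that assumes a table named articles with a text column named content already exists and meets any other requirements for BM25 indexing; both names are illustrative.

```sql
-- Create a BM25 index on articles.content that filters stop words
-- from the user_stop_cn dictionary during jieba tokenization.
CALL pgsearch.create_bm25(
  index_name  => 'my_bm25_index',
  table_name  => 'articles',
  text_fields => pgsearch.field(
    'content',
    tokenizer => pgsearch.tokenizer(
      'jieba',
      SEARCH   => false,
      dict     => 'default',       -- jieba segmentation dictionary
      stopword => 'user_stop_cn'   -- stop word dictionary
    )
  )
);
```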
Verify stop word filtering
Run the following query to confirm that stop words are filtered correctly. The example uses the user_stop_cn dictionary on the input string 同一个世界:
SELECT pgsearch.tokenizer(pgsearch.tokenizer('jieba', stopword => 'user_stop_cn'), '同一个世界');
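To make the effect easier to see, the same string can be tokenized against two dictionaries and the results compared. The sketch below assumes that '一个' has been added to user_stop_cn but not to default, and that user_stop_cn has been reloaded. The exact tokens depend on how jieba segments the input, so no output is shown here.

```sql
-- Tokenize with the default stop word dictionary.
SELECT pgsearch.tokenizer(pgsearch.tokenizer('jieba', stopword => 'default'), '同一个世界');

-- Tokenize with the custom dictionary.
SELECT pgsearch.tokenizer(pgsearch.tokenizer('jieba', stopword => 'user_stop_cn'), '同一个世界');
```

If jieba emits '一个' as a separate token, it should appear in the first result and be absent from the second.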