Configure a stop word dictionary

The pgsearch extension installed on an AnalyticDB for PostgreSQL instance provides built-in stop word dictionaries in Chinese (CN_SIMPLE) and English (EN_SIMPLE). This topic describes how to configure and use a stop word dictionary.

Supported versions

AnalyticDB for PostgreSQL V7.0 instances of minor version V7.2.1.0 or later.

Note

For information about how to view the minor version of an AnalyticDB for PostgreSQL instance, see View the minor version of an instance. If your AnalyticDB for PostgreSQL instance does not meet the preceding requirements, we recommend that you update the minor version of the instance. For more information, see UpgradeDBVersion.

Update a stop word dictionary

Stop word dictionaries are stored in the pgsearch.stopword_dict table. To add, remove, or change stop words, update the rows of this table. The default stop word dictionary of the jieba tokenizer in an AnalyticDB for PostgreSQL instance is named default, and you can create additional stop word dictionaries in the same instance. When you add a stop word, you must specify a dictionary. If the specified dictionary exists, the stop word is added to it. If the specified dictionary does not exist, it is created and the stop word is added to it. If you do not specify a dictionary when you change or remove a stop word, all dictionaries are scanned and the stop word is changed or removed in every dictionary that contains it.

Add, remove, or change a stop word in the default dictionary.

-- Add a stop word to the default dictionary by specifying the dictionary name.
INSERT INTO pgsearch.stopword_dict(dict,word) VALUES('default', '的');

-- Remove a stop word from the default dictionary.
DELETE FROM pgsearch.stopword_dict WHERE dict='default' AND word='的';

-- Change a stop word in the default dictionary.
UPDATE pgsearch.stopword_dict SET word ='一个' WHERE dict='default' AND word ='的';

Add, remove, or change a stop word in the user_stop_cn dictionary.

-- Add a stop word to the user_stop_cn dictionary.
INSERT INTO pgsearch.stopword_dict(dict,word) VALUES('user_stop_cn', '的');

-- Remove a stop word from the user_stop_cn dictionary.
DELETE FROM pgsearch.stopword_dict WHERE dict='user_stop_cn' AND word='的';

-- Change a stop word in the user_stop_cn dictionary.
UPDATE pgsearch.stopword_dict SET word ='一个' WHERE dict='user_stop_cn' AND word ='的';
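
The statements above handle one stop word at a time. The following is a minimal sketch of a few related operations, assuming that pgsearch.stopword_dict behaves like a regular PostgreSQL table with the dict and word columns. The dictionary name user_stop_en and the sample words are hypothetical.

```sql
-- Add several stop words in one statement.
-- user_stop_en is a hypothetical dictionary name; it is created if it does not exist.
INSERT INTO pgsearch.stopword_dict(dict, word)
VALUES ('user_stop_en', 'the'),
       ('user_stop_en', 'of'),
       ('user_stop_en', 'and');

-- Remove a stop word without specifying a dictionary.
-- Every dictionary that contains the word is affected.
DELETE FROM pgsearch.stopword_dict WHERE word = 'of';

-- List the dictionaries and the number of stop words in each.
SELECT dict, count(*) AS word_count
FROM pgsearch.stopword_dict
GROUP BY dict
ORDER BY dict;
```

Omitting the dict condition is convenient for cleanup, but it also touches the default dictionary, so include dict in the WHERE clause when only one dictionary is meant to change.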

Note

The dictionary name column dict and the stop word column word serve as the composite primary key of the table. You cannot add duplicate stop words to the same dictionary.
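
Because of the composite primary key, inserting a stop word that already exists in a dictionary fails. One possible way to make an insert safe to repeat is to check for the row first. This is a sketch that uses standard PostgreSQL syntax and assumes nothing about the table beyond the dict and word columns.

```sql
-- Insert the stop word only if the (dict, word) pair is not already present.
INSERT INTO pgsearch.stopword_dict(dict, word)
SELECT 'user_stop_cn', '的'
WHERE NOT EXISTS (
    SELECT 1
    FROM pgsearch.stopword_dict
    WHERE dict = 'user_stop_cn' AND word = '的'
);
```

Running the statement a second time inserts no rows instead of raising a duplicate key error.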

Load a dictionary

After you update a dictionary, you must execute the SELECT pgsearch.reload_stopword_dict() statement to reload the dictionary into memory. In this example, the user_stop_cn dictionary is loaded.

SELECT pgsearch.reload_stopword_dict('user_stop_cn');

After you load the dictionary into memory, you must perform the following steps in sequence to allow the dictionary to take effect on the existing data.

1. Close the existing database connections and re-establish them.
2. Rebuild the indexes. Updating a dictionary does not change data that is already indexed, so the dictionary takes effect on existing data only after the indexes are rebuilt.
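
The sequence of updating, reloading, and reconnecting can be sketched as follows. The stop word '了' is an arbitrary example. The index rebuild is left as a comment because the exact rebuild command depends on how the index was created.

```sql
-- Step 1: update the dictionary (example stop word chosen for illustration).
INSERT INTO pgsearch.stopword_dict(dict, word) VALUES ('user_stop_cn', '了');

-- Step 2: reload the dictionary into memory.
SELECT pgsearch.reload_stopword_dict('user_stop_cn');

-- Step 3: close this connection and open a new one from your client.

-- Step 4 (new connection): check that the tokenizer now filters the new stop word.
SELECT pgsearch.tokenizer(pgsearch.tokenizer('jieba', stopword=>'user_stop_cn'), '同一个世界');

-- Step 5: rebuild the indexes that use the user_stop_cn dictionary
-- so that the change also applies to the data that is already indexed.
```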

Use a dictionary to create an index

When you create a BM25 index, you can use the stopword parameter to specify a stop word dictionary for the jieba tokenizer.

CALL pgsearch.create_bm25(
    index_name => '<index_name>',
    table_name => '<table_name>',
    text_fields => pgsearch.field('<column_name>', tokenizer=>pgsearch.tokenizer('jieba', SEARCH=>false, dict=>'<dict_name>', stopword=>'<stopword_dict_name>'))
);
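
As a concrete illustration of the template above, the following sketch creates a BM25 index that uses the user_stop_cn stop word dictionary. The table articles, the column content, and the index name articles_bm25_idx are hypothetical. The sketch also assumes that the SEARCH and dict parameters of the tokenizer can be omitted, as in the tokenizer call in the last section of this topic.

```sql
-- Hypothetical table used only for this example.
CREATE TABLE articles (
    id      bigint PRIMARY KEY,
    content text
);

-- Create a BM25 index whose jieba tokenizer filters stop words
-- from the user_stop_cn dictionary.
CALL pgsearch.create_bm25(
    index_name  => 'articles_bm25_idx',
    table_name  => 'articles',
    text_fields => pgsearch.field(
        'content',
        tokenizer => pgsearch.tokenizer('jieba', stopword => 'user_stop_cn')
    )
);
```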

Query the effect of stop word filtering

Run the tokenizer against a sample string to check how a stop word dictionary affects segmentation.

SELECT pgsearch.tokenizer(pgsearch.tokenizer('jieba', stopword=>'user_stop_cn'), '同一个世界');
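
To see what a custom dictionary changes, it can help to tokenize the same string with two dictionaries and compare the results. This sketch assumes that the default dictionary can be passed by name to the stopword parameter in the same way as a custom dictionary.

```sql
-- Tokenize with the default stop word dictionary.
SELECT pgsearch.tokenizer(pgsearch.tokenizer('jieba', stopword=>'default'), '同一个世界');

-- Tokenize the same string with the custom dictionary.
SELECT pgsearch.tokenizer(pgsearch.tokenizer('jieba', stopword=>'user_stop_cn'), '同一个世界');
```

Tokens that appear in the first result but not in the second are the ones filtered only by user_stop_cn.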