In full-text search, the accuracy of word segmentation is crucial to the quality of search results. General-purpose tokenizers provide only basic support and often fail to meet the unique requirements of specialized fields or industries. For example, when you work with legal documents, specialized terms such as "force majeure" and "contract performance" may not be included in the general vocabulary. As a result, search results may be inaccurate, or key information may be lost. If word segmentation is inaccurate, the returned results cannot fully address the query intent of the user, which impairs the user experience.
To enhance word segmentation accuracy and retrieval efficiency, the jieba tokenizer allows you to configure custom word segmentation dictionaries. You can add proper nouns, industry terms, and trending words to the dictionaries based on the requirements of your industry or specific application scenarios to implement word segmentation that better meets practical requirements. This topic describes how to configure and use a custom word segmentation dictionary.
Supported versions
The minor version of the AnalyticDB for PostgreSQL V7.0 instance must be V7.2.1.0 or later.
For information about how to view the minor version of an AnalyticDB for PostgreSQL instance, see View the minor version of an instance. If your AnalyticDB for PostgreSQL instance does not meet the preceding requirements, we recommend that you update the minor version of the instance. For more information, see UpgradeDBVersion.
Prerequisites
The pgsearch extension is installed on an AnalyticDB for PostgreSQL instance. For information about how to install the pgsearch extension, see the "Installation and uninstallation" section of the BM25 high-performance full-text search topic.
Limits
Only the jieba tokenizer supports custom dictionaries.
Update a dictionary
The dictionaries of the jieba tokenizer are stored in the pgsearch.jieba_custom_word table. You can update the data in this table to add, remove, or change word segments. The default dictionary of the jieba tokenizer in an AnalyticDB for PostgreSQL instance is named default, and you can create multiple custom dictionaries in an instance. If you do not specify a dictionary when you add a word segment, the default dictionary is updated. If you specify a dictionary other than default and the dictionary does not exist, the dictionary is created and the word segment is added to it. If the specified dictionary exists, the word segment is added to it. If you do not specify a dictionary when you change or remove a word segment, all dictionaries are scanned and the word segment is changed or removed wherever it appears.
Add, remove, or change a custom word segment in the default dictionary.
-- Add a custom word segment to the default dictionary without specifying a dictionary name.
INSERT INTO pgsearch.jieba_custom_word(word) VALUES ('永和服装饰品');

-- Add a custom word segment to the default dictionary by specifying the dictionary name.
INSERT INTO pgsearch.jieba_custom_word(dict, word) VALUES ('default', '永和服装饰品');

-- Remove a custom word segment from the default dictionary.
DELETE FROM pgsearch.jieba_custom_word WHERE dict = 'default' AND word = '永和服装饰品';

-- Change a custom word segment in the default dictionary.
UPDATE pgsearch.jieba_custom_word SET word = '永和' WHERE dict = 'default' AND word = '永和服装饰品';

Add, remove, or change a custom word segment in a custom dictionary.
-- Add a custom word segment to the custom_dict dictionary.
INSERT INTO pgsearch.jieba_custom_word(dict, word) VALUES ('custom_dict', '永和服装饰品');

-- Remove a custom word segment from the custom_dict dictionary.
DELETE FROM pgsearch.jieba_custom_word WHERE dict = 'custom_dict' AND word = '永和服装饰品';

-- Change a custom word segment in the custom_dict dictionary.
UPDATE pgsearch.jieba_custom_word SET word = '永和' WHERE dict = 'custom_dict' AND word = '永和服装饰品';
The dictionary name column dict and the custom word segment column word form the composite primary key of the table. You cannot add duplicate word segments to the same dictionary.
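Because (dict, word) is the primary key, an INSERT that repeats an existing entry fails with a unique-constraint violation. As a sketch using standard PostgreSQL ON CONFLICT syntax (the dictionary name legal_dict and the terms are illustrative, and this assumes your instance version supports ON CONFLICT), you can make bulk loads idempotent:

```sql
-- Bulk-load legal terms into a hypothetical legal_dict dictionary.
-- Rows that already exist are skipped instead of aborting the whole statement.
INSERT INTO pgsearch.jieba_custom_word(dict, word)
VALUES
    ('legal_dict', '不可抗力'),  -- "force majeure"
    ('legal_dict', '合同履行')   -- "contract performance"
ON CONFLICT (dict, word) DO NOTHING;
```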
Load a dictionary
After you update a dictionary in the pgsearch.jieba_custom_word table, you must execute the pgsearch.reload_user_dict() function to reload the dictionary into memory. The following example loads the custom_dict dictionary.
SELECT pgsearch.reload_user_dict('custom_dict');

After you load the dictionary into memory, you must perform the following steps in sequence to allow the custom dictionary to take effect on the existing data.
Close the existing database connections and re-establish connections.
Rebuild the relevant indexes. Updating a dictionary does not retokenize the existing data in the table; the new segmentation takes effect on existing data only after the indexes are rebuilt.
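As a sketch of the rebuild step (index, table, and column names are placeholders, and this assumes a BM25 index can be dropped with a standard DROP INDEX statement), you can drop the old index and re-create it so that existing rows are retokenized with the updated dictionary:

```sql
-- Drop the old BM25 index (assumption: standard DROP INDEX applies to BM25 indexes).
DROP INDEX IF EXISTS <index_name>;

-- Re-create the index; existing rows are tokenized with the updated custom_dict dictionary.
CALL pgsearch.create_bm25(
    index_name => '<index_name>',
    table_name => '<table_name>',
    text_fields => pgsearch.field('<column_name>', tokenizer => pgsearch.tokenizer('jieba', dict => 'custom_dict'))
);
```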
Use a dictionary to create an index
When you create a BM25 index, you can specify a custom word segmentation dictionary for the jieba tokenizer.
Specify a dictionary for a column of a table.
CALL pgsearch.create_bm25(
    index_name => '<index_name>',
    table_name => '<table_name>',
    text_fields => pgsearch.field('<column_name>', tokenizer => pgsearch.tokenizer('jieba', dict => '<dict_name>'))
);

Specify different dictionaries for different columns in the same table.
CALL pgsearch.create_bm25(
    index_name => '<index_name>',
    table_name => '<table_name>',
    text_fields => pgsearch.field('<column1_name>', tokenizer => pgsearch.tokenizer('jieba', hmm => false, SEARCH => false, dict => '<dict_name>')) ||
                   pgsearch.field('<column2_name>', tokenizer => pgsearch.tokenizer('jieba', hmm => false, SEARCH => false, dict => '<dict2_name>'))
);
Query the word segmentation effect of a custom dictionary
You can use the pgsearch.tokenizer() function to specify a custom word segmentation dictionary for the jieba tokenizer to query the word segmentation effect of the custom dictionary. For more information about the pgsearch.tokenizer() function, see the "Parameters" section of the BM25 high-performance full-text search topic.
Query the word segmentation effect of a custom dictionary. For example, after 永和服装饰品 is added to the custom_dict dictionary, that phrase is expected to be kept as a single token instead of being split into smaller default segments.
SELECT pgsearch.tokenizer(
    pgsearch.tokenizer('jieba', hmm => false, SEARCH => false, dict => 'custom_dict'),
    '永和服装饰品有限公司'
);