The analysis-ik plug-in is a Chinese tokenization extension for Alibaba Cloud Elasticsearch (ES). It includes several built-in default dictionaries that you can use directly. You can also update the dictionaries to customize the defaults or add new entries, which optimizes tokenization and makes the results better fit your scenarios. The plug-in supports dynamic loading of dictionary files from Object Storage Service (OSS), which simplifies remote management and improves operational efficiency. This topic describes the tokenization rules and dictionary types of the analysis-ik plug-in and explains how to update its dictionaries and use the plug-in.
Background information
The analysis-ik plug-in consists of three main components: tokenizers, dictionary files, and dictionary update mechanisms.
Tokenizers: Split Chinese text into meaningful words (tokens) and determine the tokenization granularity.
Dictionary files: Provide the vocabulary that the tokenizers use for tokenization. These files can be extended and customized.
Dictionary update methods: Cold updates and hot updates are supported. This lets you flexibly adjust dictionaries as needed to ensure that the tokenization meets your business requirements.
Tokenization rules
The analysis-ik plug-in supports the following tokenization rules:
ik_max_word: Splits text at the finest granularity. Suitable for term queries.
ik_smart: Splits text at a coarse granularity. Suitable for phrase queries.
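The two rules are invoked through the standard Elasticsearch `_analyze` API; only the `analyzer` value differs between them. As a minimal sketch (the request-body format is standard Elasticsearch, not specific to this plug-in), the following Python snippet builds the bodies for both rules:

```python
import json

def analyze_body(analyzer: str, text: str) -> str:
    """Build the JSON body of a GET _analyze request."""
    return json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)

# Finest granularity, suited to term queries:
fine = analyze_body("ik_max_word", "computer Chinese character input method")
# Coarse granularity, suited to phrase queries:
coarse = analyze_body("ik_smart", "computer Chinese character input method")
print(fine)
print(coarse)
```

Sending either body to `GET _analyze` on a cluster with the plug-in installed returns the token list for the corresponding granularity.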
Dictionary types
The following table describes the dictionary types supported by the analysis-ik plug-in.
| Dictionary type | Description | Dictionary file requirement | Supported update methods |
| --- | --- | --- | --- |
| Main dictionary | A default main dictionary is built in. If you specify the main dictionary when you create an ES index, the cluster matches the data written to the index against the words in the main dictionary and creates an index for the matched words. You can then retrieve the indexed data by using the corresponding keywords. | One word per line, saved as a .dic file. | Cold update and hot update |
| Stopword dictionary | A default stopword dictionary is built in. If you specify a stopword dictionary when you create an ES index, the cluster matches the data written to the index against the words in the stopword dictionary. Matched words are filtered out and do not appear in the inverted index. | One word per line, saved as a .dic file. | Cold update and hot update |
| Preposition dictionary | A default preposition dictionary is built in. | Not applicable | Updates are not supported. |
| Quantifier dictionary | A default quantifier dictionary is built in. | Not applicable | Updates are not supported. |
| suffix.dic | Stores suffixes to help the tokenizer split words that have suffixes. | Not applicable | Updates are not supported. |
| surname.dic | Stores Chinese surnames to help the tokenizer recognize personal names. | Not applicable | Updates are not supported. |
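Custom dictionary files can be checked against the stated requirements before upload. The following Python sketch is a hypothetical helper, not part of the plug-in; it validates the .dic naming rules (letters, digits, and underscores, at most 30 characters, with a .dic extension, as described in the update sections of this topic) and the one-word-per-line content rule. Whether the 30-character limit includes the extension is an assumption noted in the code.

```python
import re
from pathlib import Path

# Letters, digits, underscores; .dic extension.
# Assumption: the 30-character limit applies to the base name without the extension.
NAME_RE = re.compile(r"^[A-Za-z0-9_]{1,30}\.dic$")

def validate_dic(path: str, content: str) -> list[str]:
    """Return a list of problems; an empty list means the file looks valid."""
    problems = []
    name = Path(path).name
    if not NAME_RE.match(name):
        problems.append(f"invalid file name: {name}")
    for i, line in enumerate(content.splitlines(), start=1):
        word = line.strip()
        if not word:
            problems.append(f"line {i}: empty line")
        elif any(ch.isspace() for ch in word):
            problems.append(f"line {i}: more than one word")
    return problems

print(validate_dic("my_words.dic", "elasticsearch\ntokenizer"))  # []
```

Running such a check locally avoids a failed upload or a dictionary entry that the tokenizer silently ignores.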
Dictionary update methods
If the default dictionaries do not meet your business requirements, you can update them. The analysis-ik plug-in supports the following dictionary update methods.
| Update method | Description | Scenarios |
| --- | --- | --- |
| Cold update (Standard Update) | Dictionary changes take effect after the ES cluster restarts and apply to the entire cluster. The system sends the uploaded dictionary file to the ES nodes and then restarts the nodes to load the file. After the restart, the new configuration takes effect. Note: Only the main and stopword dictionaries can be changed. | Updates that can tolerate a cluster restart. Perform them during off-peak hours. |
| Hot update (Rolling Update) | Changes to dictionary file content are loaded by the nodes without a cluster restart. A restart is required only if the number of dictionary files or the file names change. | Updates that must take effect without restarting the cluster. |
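The restart condition for hot updates, as stated later in this topic, is that content-only changes are loaded live while changing the number of dictionary files or their names forces a cluster restart. The sketch below is an illustration of that rule, not an actual API:

```python
def hot_update_needs_restart(old_files: set[str], new_files: set[str]) -> bool:
    """A hot (rolling) update loads changed file content without a restart;
    only a change in the set of dictionary file names forces a cluster restart."""
    return old_files != new_files

print(hot_update_needs_restart({"a.dic"}, {"a.dic"}))           # False: content-only change
print(hot_update_needs_restart({"a.dic"}, {"a.dic", "b.dic"}))  # True: a file was added
```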
Prerequisites
Make sure that the instance is in the Normal status. You can view the instance status on the Basic Information page of the instance.
Note: The operations in this topic are demonstrated on an Alibaba Cloud ES 7.10.0 instance. The console interface and supported features may vary for different versions. The actual console interface takes precedence.
(Optional) To update a dictionary, complete the following preparations:
To update a dictionary by using the Upload OSS File method, first create an OSS bucket and upload the required dictionary file to the bucket.
To update a dictionary by using the Upload Local File method, first save the required dictionary file to your computer.
Update IK dictionaries
If the default IK dictionaries do not meet your business requirements, you can update them. Before you update, familiarize yourself with the corresponding update methods. For an index that is configured with IK tokenization, an updated dictionary takes effect only on new data, which includes newly added and updated data. If you want the update to take effect for all data, you must re-create the index.
Cold update
To perform a cold update on an IK dictionary, follow these steps:
A cold update (Standard Update in the console) triggers a cluster restart. To ensure that your business is not affected, we recommend that you perform the update during off-peak hours.
Go to the instance details page.
Log on to the Alibaba Cloud Elasticsearch console.
In the left navigation pane, click Elasticsearch Instances.
In the top menu bar, select a resource group and a region.
In the Elasticsearch Instances list, click the ID of the target instance to go to its details page.
Go to the cold update page for the analysis-ik plug-in.
In the left navigation pane, choose .
On the Built-in Plug-ins tab, find the analysis-ik plug-in, and then click Standard Update in the Actions column.
Perform the cold update.
In the IK Dictionary Configuration - Cold Update dialog box, find the dictionary that you want to update and click Edit. Follow the on-screen instructions to upload the dictionary file and click Save.
You can upload a dictionary file in one of the following ways:
Upload Local File: Click the icon or drag a local file to upload it.
Upload OSS File: Enter the bucket name and the name of the dictionary file. Then, click Add.
The bucket and the Alibaba Cloud ES instance must be in the same region.
Automatic synchronization of dictionary files in OSS is not supported. If the content of the source dictionary file in OSS changes, you must manually perform an IK dictionary update to apply the changes.
Note:
You can upload only one file in the DIC format for each dictionary type. The uploaded file replaces the existing one.
The file name must have the .dic extension. The name can be up to 30 characters in length and can contain uppercase letters, lowercase letters, digits, and underscores (_).
To restore a default dictionary file, download the default file and upload it again. To obtain the default dictionary files, see:
To restart the instance, select the risk notification check box and click OK.
After the ES instance restarts, the dictionary file is updated.
(Optional) Verify that the dictionary is effective.
Click the icon in the upper-left corner and choose to open the Code Editor page.
For example, run the following command to split the input text computer Chinese character input method at a coarse granularity.
Note: When you use this command, replace the value of text with a word from your dictionary.

```json
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "computer Chinese character input method"
}
```

The expected result is as follows.

```json
{
  "tokens" : [
    { "token" : "computer", "start_offset" : 0, "end_offset" : 3, "type" : "CN_WORD", "position" : 0 },
    { "token" : "Chinese character input", "start_offset" : 3, "end_offset" : 7, "type" : "CN_WORD", "position" : 1 },
    { "token" : "method", "start_offset" : 7, "end_offset" : 9, "type" : "CN_WORD", "position" : 2 }
  ]
}
```
Hot update
To perform a hot update on an IK dictionary, follow these steps:
A cluster restart is not required if only the file content changes. If you change the number of files or the file names, the cluster must be restarted. To avoid business interruptions, perform this operation during off-peak hours. After the restart, the dictionary takes effect automatically.
Go to the instance details page.
Log on to the Alibaba Cloud Elasticsearch console.
In the left navigation pane, click Elasticsearch Instances.
In the top menu bar, select a resource group and a region.
In the Elasticsearch Instances list, click the ID of the target instance to go to its details page.
Go to the hot update page for the analysis-ik plug-in.
In the left navigation pane, choose .
On the Built-in Plug-ins tab, find the analysis-ik plug-in, and then click Rolling Update in the Actions column.
Perform the hot update.
In the IK Dictionary Configuration - Hot Update dialog box, click Edit for the target dictionary. Follow the on-screen instructions to upload the dictionary file and click Save.
You can upload a dictionary file in one of the following ways:
Upload Local File: Click the icon or drag a local file to upload it.
Upload OSS File: Enter the bucket name and the name of the dictionary file. Then, click Add.
The bucket and the Alibaba Cloud ES instance must be in the same region.
Automatic synchronization of dictionary files in OSS is not supported. If the content of the source dictionary file in OSS changes, you must manually perform an IK dictionary update to apply the changes.
Note:
You can upload multiple dictionary files. The file name must have the .dic extension, can contain uppercase letters, lowercase letters, digits, and underscores (_), and must be 30 characters or less in length.
To modify an uploaded dictionary file, click the icon to the right of the file to download and modify it. Then, delete the original file and upload the modified version. You must click Save after you delete the original file. Otherwise, when you upload the modified file, an error message indicates that a file with the same name already exists.
Click OK and wait for the dictionary to load on the ES nodes.
The plug-in on the Alibaba Cloud ES nodes can automatically load dictionary files, but the time it takes for each node to retrieve the file varies. After the file is loaded, the dictionary takes effect. This process may take some time.
(Optional) Verify that the dictionary is effective.
Click the icon in the upper-left corner and choose to open the Code Editor page.
For example, run the following command to split the input text computer Chinese character input method at a coarse granularity.
Note: When you use this command, replace the value of text with a word from your dictionary.

```json
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "computer Chinese character input method"
}
```

The expected result is as follows.

```json
{
  "tokens" : [
    { "token" : "computer", "start_offset" : 0, "end_offset" : 3, "type" : "CN_WORD", "position" : 0 },
    { "token" : "Chinese character input", "start_offset" : 3, "end_offset" : 7, "type" : "CN_WORD", "position" : 1 },
    { "token" : "method", "start_offset" : 7, "end_offset" : 9, "type" : "CN_WORD", "position" : 2 }
  ]
}
```
Use the analysis-ik plug-in
This example shows how to use the IK tokenizer and the Pinyin filter to tokenize specified text.
Go to the Kibana Dev Tools page of the ES instance.
Click the icon in the upper-left corner and choose to open the code editor.
Create an index and configure the IK tokenizer and Pinyin filter.
On the Dev Tools page, run the following command to create the ik_pinyin index and a custom analyzer named ik_pinyin_analyzer. This analyzer uses the ik_max_word fine-grained tokenization rule and a Pinyin filter to convert Chinese words into Pinyin.
Note: The Pinyin filter runs after Chinese tokenization is complete. The analyzer first tokenizes the Chinese text and then converts the tokens into Pinyin for output.

```json
PUT ik_pinyin
{
  "settings": {
    "analysis": {
      "filter": {
        "my_pinyin": {
          "type": "pinyin",
          "keep_separate_first_letter": false,
          "keep_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "lowercase": true,
          "remove_duplicated_term": true
        }
      },
      "analyzer": {
        "ik_pinyin_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["my_pinyin"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "ik_pinyin_analyzer"
      }
    }
  }
}
```

The core parameters are described as follows:
Pinyin filter (filter)
Note: For more information about the configurations of the Pinyin analysis plug-in, see Pinyin Analysis for Elasticsearch.

| Parameter | Description |
| --- | --- |
| my_pinyin | The name of the Pinyin filter that you define. |
| type | Set to pinyin to specify the Pinyin filter. |
| keep_separate_first_letter | Set to false so that the first letter of each character is not kept as a separate token. |
| keep_full_pinyin | Set to true to keep the full Pinyin. |
| keep_original | Set to true to keep the original input text. |
| limit_first_letter_length | Set to 16 to limit the length of the first-letter abbreviation to a maximum of 16 characters. |
| lowercase | Set to true to output the Pinyin in lowercase. |
| remove_duplicated_term | Set to true to remove duplicate terms. For example, this prevents results such as "zh, zh". |

Analyzer (analyzer)

| Parameter | Description |
| --- | --- |
| ik_pinyin_analyzer | The name of the analyzer that you define. |
| type | Set to custom to specify a custom analyzer. |
| tokenizer | Set to ik_max_word to split text at the finest granularity. |
| filter | Set to my_pinyin to call the my_pinyin Pinyin filter. |

If the index is created, Elasticsearch returns a success response.
Verify the tokenization results.
Run the following command to tokenize the input text This is a test.

```json
GET ik_pinyin/_analyze
{
  "text": "This is a test",
  "analyzer": "ik_pinyin_analyzer"
}
```

The expected result is as follows.

```json
{
  "tokens" : [
    { "token" : "zhe", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0 },
    { "token" : "This is", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0 },
    { "token" : "zs", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0 },
    { "token" : "shi", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 1 },
    { "token" : "ge", "start_offset" : 2, "end_offset" : 3, "type" : "CN_CHAR", "position" : 2 },
    { "token" : "a", "start_offset" : 2, "end_offset" : 3, "type" : "CN_CHAR", "position" : 2 },
    { "token" : "g", "start_offset" : 2, "end_offset" : 3, "type" : "CN_CHAR", "position" : 2 },
    { "token" : "ce", "start_offset" : 3, "end_offset" : 5, "type" : "CN_WORD", "position" : 3 },
    { "token" : "shi", "start_offset" : 3, "end_offset" : 5, "type" : "CN_WORD", "position" : 4 },
    { "token" : "test", "start_offset" : 3, "end_offset" : 5, "type" : "CN_WORD", "position" : 4 },
    { "token" : "cs", "start_offset" : 3, "end_offset" : 5, "type" : "CN_WORD", "position" : 4 }
  ]
}
```
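In the result above, tokens such as "zs" and "cs" are first-letter abbreviations of the Pinyin syllables ("zhe shi" and "ce shi"), truncated according to limit_first_letter_length. The following Python sketch illustrates how such an abbreviation is formed; it is a simplified illustration of the behavior, not the plug-in's actual implementation:

```python
def first_letter_abbrev(syllables: list[str], limit: int = 16) -> str:
    """Join the first letter of each Pinyin syllable, truncated to `limit`
    characters (the limit_first_letter_length setting in the example index)."""
    return "".join(s[0] for s in syllables)[:limit]

print(first_letter_abbrev(["zhe", "shi"]))  # zs
print(first_letter_abbrev(["ce", "shi"]))   # cs
```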
References
API for hot-updating IK dictionaries: UpdateHotIkDicts
API for cold-updating IK dictionaries: UpdateDict