The IK analyzer plugin (analysis-ik) provides smart Chinese text tokenization for Elasticsearch, breaking down Chinese text into meaningful terms for accurate search results. This plugin supports two tokenization modes—ik_max_word for comprehensive indexing and ik_smart for precise searching—and enables dynamic dictionary updates without cluster restarts.
When to use this plugin
Use the IK analyzer plugin when:
Building Chinese-language search features: E-commerce product search, article search, or any application requiring Chinese text analysis.
Improving search quality over default analyzers: The built-in smartcn analyzer provides basic Chinese tokenization, but the IK analyzer offers more accurate segmentation and custom dictionary support.
Managing domain-specific terminology: Add industry terms, product names, or company-specific vocabulary to improve tokenization accuracy.
Maintaining search quality without downtime: Update dictionaries dynamically as your vocabulary evolves.
Prerequisites
Before you begin, ensure that you have:
An Alibaba Cloud Elasticsearch instance in Normal status
Administrative access to the Elasticsearch console
(Optional) An Object Storage Service (OSS) bucket in the same region as your Elasticsearch instance, if using the OSS upload method
Basic understanding of Elasticsearch analyzers and index mappings
Important: The IK analyzer plugin comes pre-installed on Alibaba Cloud Elasticsearch instances. You do not need to install it manually.
Note: Console interfaces and available features may vary across Elasticsearch versions. This guide uses version 7.10.0 for examples.
Understand the plugin components
The IK analyzer plugin consists of three main components that work together to tokenize Chinese text:
Tokenizers: The analysis engines that break Chinese text into individual terms
Dictionary files: Word lists that guide tokenization decisions
Update mechanisms: Methods for refreshing dictionaries with new terms
Tokenization modes
The plugin provides two tokenization modes optimized for different purposes:
| Mode | Use case | Behavior |
| --- | --- | --- |
| ik_max_word | Indexing | Fine-grained segmentation. Exhaustively splits text into all possible word combinations to maximize search recall. |
| ik_smart | Searching | Coarse-grained segmentation. Produces fewer, more semantically meaningful tokens for precise queries. |
Recommended configuration: Use ik_max_word for indexing to capture all possible term combinations, and ik_smart for search queries to match precise phrases. This asymmetric approach balances recall and precision.
```
PUT /my_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
```

Dictionary types
The plugin includes several dictionary files that control tokenization behavior:
| Dictionary type | File name | Description | Editable | Update methods |
| --- | --- | --- | --- | --- |
| Main dictionary | main.dic | Contains over 270,000 common Chinese words. The primary vocabulary for tokenization. | Yes | Dynamic update, update requiring restart |
| Stop word list | stopword.dic | Words filtered out during analysis. The default list includes English stop words: a, an, and, are, as, at, be, and so on. | Yes | Dynamic update, update requiring restart |
| Preposition dictionary | preposition.dic | Chinese prepositions and function words for specialized filtering. | Yes | Update requiring restart only |
| Quantifier dictionary | quantifier.dic | Chinese measure words and quantifiers (e.g., 个、只、条). | Yes | Update requiring restart only |
| Surname dictionary | surname.dic | Common Chinese surnames for name recognition. | No | Not supported |
| Suffix dictionary | suffix.dic | Used for tokenizing terms with suffixes. | No | Not supported |
Format requirements: All dictionary files must be (a sample file follows this list):
Saved with UTF-8 encoding
Plain text files with the .dic extension
One word per line
Named using only letters, digits, and underscores
Named with at most 30 characters (excluding the extension)
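A minimal sample (the file name my_custom_terms.dic and the terms themselves are illustrative, not part of the shipped dictionaries). Each line is loaded as a single term:

```
云原生
跨境电商
直播带货
```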
Update methods
The plugin supports two methods for updating dictionaries, each suited for different scenarios:
| Method | When to use | Supported dictionaries | Impact | Limitations |
| --- | --- | --- | --- | --- |
| Dynamic update | Runtime updates without downtime | Main dictionary, stop word list | No cluster restart needed. Each node loads the updated dictionary automatically. | Cannot add or remove dictionary files, only modify content. Changing file names triggers a restart. |
| Update requiring restart | Initial dictionary setup, adding new dictionary files, or updating preposition/quantifier dictionaries | All dictionary types | Triggers a rolling cluster restart. Downtime depends on cluster size and configuration. | Service interruption during restart. |
Performance note: During dynamic updates, each node retrieves and loads dictionary files at different times. Changes take effect progressively across the cluster, not simultaneously.
Data impact: Dictionary updates only affect newly indexed or updated documents. To apply changes to all existing data, you must reindex.
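A minimal reindex sketch (the index names are placeholders; it assumes my_index_v2 was created beforehand with the updated analyzer configuration):

```
POST _reindex
{
  "source": { "index": "my_index" },
  "dest": { "index": "my_index_v2" }
}
```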
Configure the IK analyzer
You can configure IK analyzer in two ways: as the default analyzer for an index, or for specific fields.
Set IK as the default analyzer
This configuration applies IK tokenization to all text fields in the index:
```
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "ik_max_word"
        }
      }
    }
  }
}
```

Configure IK for specific fields
This configuration gives you fine-grained control over which fields use IK tokenization:
```
PUT /my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "tags": {
        "type": "text",
        "analyzer": "ik_smart"
      }
    }
  }
}
```

Use IK with the Pinyin plugin
You can combine IK analyzer with the Pinyin analysis plugin to support both Chinese characters and Pinyin input:
```
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik_pinyin_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["my_pinyin", "word_delimiter"]
        }
      },
      "filter": {
        "my_pinyin": {
          "type": "pinyin",
          "keep_first_letter": true,
          "keep_separate_first_letter": false,
          "keep_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "lowercase": true,
          "remove_duplicated_term": true
        }
      }
    }
  }
}
```

Important: The Pinyin filter runs after Chinese tokenization completes. The analyzer first tokenizes Chinese text with IK, then converts the resulting tokens into Pinyin.
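To see the combined behavior, test the custom analyzer with the _analyze API (the exact token list depends on your Pinyin plugin version, so treat the expected output as approximate):

```
GET /my_index/_analyze
{
  "analyzer": "ik_pinyin_analyzer",
  "text": "中国"
}
```

With the settings above, the response typically includes per-character Pinyin tokens (zhong, guo), the first-letter abbreviation (zg), and the original token (中国), because keep_full_pinyin, keep_first_letter, and keep_original are all enabled.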
Update dictionaries
You can add custom terms to dictionaries or modify existing dictionaries to improve tokenization accuracy for your domain.
Method 1: Dynamic update (no restart needed)
Use this method to update the main dictionary or stop word list without interrupting service.
When to use:
Adding new product names, technical terms, or domain vocabulary
Updating stop words based on search analytics
Frequent dictionary changes in development or testing
Limitations:
Only the main dictionary (main.dic) and the stop word list (stopword.dic) are supported
Cannot add or remove dictionary files—only modify existing file content
Changing file names or file count triggers a cluster restart
Procedure
Log on to the Alibaba Cloud Elasticsearch console.
Navigate to your Elasticsearch instance:
In the left navigation menu, choose Elasticsearch Clusters.
Click the name of your target instance.
In the left navigation pane, click Plug-ins.
On the Built-in Plug-ins tab, find analysis-ik in the plug-in list.
Click Rolling Update in the Actions column.
In the Configure IK Dictionaries - Rolling Update panel, click Edit for the target dictionary type.
Add your custom terms using one of two methods:
Option A: Upload a local file
Select Upload On-premises File.
Select your .dic file (UTF-8 encoded, one word per line).
Click Save.
Option B: Upload from OSS
Select Upload OSS File.
Enter the OSS bucket name.
Enter the complete path to your dictionary file, including file name and extension.
Click Save.
Note:
To update an existing dictionary, download and modify it, remove the original file, and upload the updated one.
The OSS bucket and the Elasticsearch instance must be in the same region.
File name restrictions: maximum 30 characters (excluding the .dic extension), containing only letters, digits, and underscores.
Review your changes and click OK.
The IK configuration takes effect after each node loads the updated dictionary. Because nodes load the file at different times, wait about one to two minutes and then verify that the dictionary is updated.
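For example, if you added 云原生 (an illustrative term) to the main dictionary, confirm that it is now emitted as a single token:

```
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "云原生"
}
```

Once every node has loaded the new dictionary, the response contains one 云原生 token rather than several shorter tokens.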
Method 2: Update requiring restart
Use this method for initial dictionary setup, adding new dictionary files, or updating preposition/quantifier dictionaries.
When to use:
First-time dictionary configuration
Adding new dictionary files (not just modifying existing files)
Updating preposition or quantifier dictionaries
Major dictionary overhauls
Warning: This method triggers a rolling cluster restart. Your cluster will experience temporary service interruption as nodes restart sequentially. Plan this update during maintenance windows.
Procedure
Log on to the Alibaba Cloud Elasticsearch console.
Navigate to your Elasticsearch instance:
In the left navigation menu, choose Elasticsearch Clusters.
Click the name of your target instance.
In the left navigation pane, click Plug-ins.
On the Built-in Plug-ins tab, find analysis-ik in the plug-in list.
Click Standard Update in the Actions column.
In the Configure IK Dictionaries - Standard Update panel, click Edit for the target dictionary type, and upload the required dictionary file:
Option A: Upload a local file
Select Upload On-premises File.
Select your .dic file (UTF-8 encoded, one word per line).
Click Save.
Option B: Upload from OSS
Select Upload OSS File.
Enter the OSS bucket name.
Enter the complete path to your dictionary file, including file name and extension.
Click Save.
Note:
You can upload only one .dic file per dictionary type. The uploaded file replaces the existing dictionary.
The OSS bucket and the Elasticsearch instance must be in the same region.
File name restrictions: maximum 30 characters (excluding the .dic extension), containing only letters, digits, and underscores.
Review the current configuration to verify your upload.
Click OK.
The system prompts: "After the update, the cluster will be restarted. Continue?"
Click OK to confirm. The cluster begins a rolling restart to apply the configuration.
Monitor the restart progress on the cluster details page. The cluster status changes to Initializing during the restart, then returns to Normal when complete.
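In addition to the console, you can watch recovery from Kibana with the cluster health API; the restart is complete when the status field returns to green (or yellow on clusters without full replica capacity):

```
GET _cluster/health
```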
Verify basic IK tokenization
Verify that the IK analyzer is working correctly by testing tokenization with the _analyze API.
Test that Chinese text is segmented into words, not individual characters:
```
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "计算机汉字输入方法"
}
```

Expected output:
```
{
  "tokens" : [
    {
      "token" : "计算机",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "汉字输入",
      "start_offset" : 3,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "方法",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}
```

What to look for:
Chinese text is segmented into meaningful words (计算机, 汉字输入...)
If you see individual characters (计, 算, 机...), the analyzer is not configured correctly
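You can also test the analyzer bound to a specific field instead of naming an analyzer directly. This sketch assumes the my_index mapping shown earlier:

```
GET /my_index/_analyze
{
  "field": "content",
  "text": "计算机汉字输入方法"
}
```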
Troubleshooting
| Issue | Possible cause | Solution |
| --- | --- | --- |
| Chinese text is tokenized into individual characters | IK analyzer not configured for the field | Verify that the field mapping uses ik_max_word or ik_smart. Use the GET /my_index/_mapping API to check the field configuration. |
| Custom terms are not recognized as single tokens | Dictionary not loaded or update not applied | Wait 1-2 minutes after a dynamic update for all nodes to load the dictionary. For updates requiring restart, verify that the cluster status is Normal. |
| Dynamic update (hot reload) stopped working | File name or file count changed | Dynamic update only supports modifying existing file content. Adding, removing, or renaming files requires an update requiring restart. |
| Dictionary upload fails with an encoding error | Dictionary file not saved as UTF-8 | Save the dictionary file with UTF-8 encoding. Avoid ANSI or other encodings. |
| OSS upload fails | OSS bucket and Elasticsearch instance in different regions | Verify that the OSS bucket is in the same region as your Elasticsearch instance using the console or API. |
| Dictionary changes do not affect existing documents | Updates only apply to new data | Reindex your data to apply dictionary changes to all documents. Use the _reindex API to copy data to a new index with the updated analyzer. |
| Cluster fails to restart after an update | Invalid dictionary file format or content | Check the dictionary file format: UTF-8 encoding, one word per line, no special characters. Review the Elasticsearch logs for specific error messages. |
| Stop words are not filtered | Stop word filter not configured in the analyzer | Add the stop token filter to your custom analyzer configuration (see the sketch after this table), or use a built-in analyzer that includes stop word filtering. |
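As referenced in the last row, here is a sketch of a custom analyzer that chains the ik_max_word tokenizer with the built-in stop token filter (the analyzer name and the stop word list are illustrative):

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik_with_stop": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["chinese_stop"]
        }
      },
      "filter": {
        "chinese_stop": {
          "type": "stop",
          "stopwords": ["的", "了", "吗"]
        }
      }
    }
  }
}
```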
Diagnostic commands
Use these commands to diagnose IK analyzer issues:
Check plugin installation:
```
GET _cat/plugins?v
```

Expected output should include analysis-ik in the plugin list.
Verify analyzer configuration for an index:
```
GET /my_index/_settings
```

Check the analysis section for the IK analyzer configuration.
Check field mapping:
```
GET /my_index/_mapping/field/content
```

Verify the analyzer and search_analyzer settings for the field.
Test analyzer directly:
```
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "测试文本"
}
```

This tests the analyzer without requiring an index.
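For deeper inspection, the _analyze API also accepts an explain parameter that breaks the output down by tokenizer and token filter:

```
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "测试文本",
  "explain": true
}
```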
Best practices
Choosing between ik_smart and ik_max_word
| Scenario | Recommended configuration | Reason |
| --- | --- | --- |
| E-commerce product search | Index: ik_max_word. Search: ik_smart | Product names are often short and need comprehensive matching. Search queries should be precise to avoid irrelevant results. |
| Long-form content (articles, documentation) | Index: ik_max_word. Search: ik_smart | Maximize recall for diverse queries while reducing noise in search results. |
| Exact phrase matching | Both: ik_smart | Avoid over-segmentation that breaks meaningful phrases. |
| Autocomplete and suggestions | ik_smart | Produce cleaner suggestions with semantically meaningful words. |
| Log analysis with Chinese text | ik_max_word | Capture all possible keywords for troubleshooting and pattern detection. |
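With the asymmetric configuration recommended above, no extra query-side work is needed: a match query automatically analyzes the query text with the field's search_analyzer (ik_smart), while documents were indexed with ik_max_word. A minimal example against the earlier my_index mapping:

```
GET /my_index/_search
{
  "query": {
    "match": {
      "content": "汉字输入方法"
    }
  }
}
```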
Dictionary maintenance
Review search analytics: Identify frequently searched terms that are being tokenized incorrectly, then add them to your custom dictionary.
Use dynamic updates for frequent changes: If you add new product names or technical terms regularly, use dynamic updates to avoid restart overhead.
Plan updates requiring restart carefully: Schedule these updates during maintenance windows to minimize user impact.
Version control dictionary files: Store dictionary files in version control (Git) to track changes and enable rollback if needed.
Test before production: Always test dictionary changes in a development or staging environment before applying them to production.
Performance considerations
| Factor | ik_smart | ik_max_word | Impact |
| --- | --- | --- | --- |
| Index size | Smaller | Larger (30-50% increase) | Storage costs, backup size |
| Indexing speed | Faster | Slower (20-40% decrease) | Bulk indexing time |
| Search accuracy | Higher precision | Higher recall | Search result quality |
| Memory usage | Lower | Higher | Node memory requirements |
Recommendation:
For cost-sensitive applications with tight storage budgets, consider using ik_smart for both indexing and searching.
For quality-critical applications (e-commerce, search engines), use the asymmetric configuration: ik_max_word for indexing and ik_smart for searching.
For large-scale deployments (10+ million documents), test performance impact before choosing a mode. Consider shard sizing and node capacity.
Handling mixed Chinese-English content
The IK analyzer handles mixed Chinese-English content gracefully:
Chinese tokens: Analyzed by IK dictionary
English tokens: Passed through to standard tokenization (word boundary detection)
Numbers and punctuation: Treated as separators
Example:
```
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "使用Machine Learning技术"
}
```

Result: 使用, machine, learning, 技术
Note: English tokens are lowercased by default. If you need case-sensitive English tokenization, configure a custom analyzer with appropriate token filters.