
Elasticsearch: Use the IK analyzer plugin for Chinese text search

Last Updated: Feb 26, 2026

The IK analyzer plugin (analysis-ik) provides smart Chinese text tokenization for Elasticsearch, breaking down Chinese text into meaningful terms for accurate search results. This plugin supports two tokenization modes—ik_max_word for comprehensive indexing and ik_smart for precise searching—and enables dynamic dictionary updates without cluster restarts.

When to use this plugin

Use the IK analyzer plugin when:

  • Building Chinese-language search features: E-commerce product search, article search, or any application requiring Chinese text analysis.

  • Improving search quality over default analyzers: The built-in smartcn analyzer provides basic Chinese tokenization, but IK analyzer offers more accurate segmentation and custom dictionary support.

  • Managing domain-specific terminology: Add industry terms, product names, or company-specific vocabulary to improve tokenization accuracy.

  • Maintaining search quality without downtime: Update dictionaries dynamically as your vocabulary evolves.

Prerequisites

Before you begin, ensure that you have:

  • An Alibaba Cloud Elasticsearch instance in Normal status

  • Administrative access to the Elasticsearch console

  • (Optional) An Object Storage Service (OSS) bucket in the same region as your Elasticsearch instance, if using the OSS upload method

  • Basic understanding of Elasticsearch analyzers and index mappings

Important: The IK analyzer plugin comes pre-installed on Alibaba Cloud Elasticsearch instances. You do not need to install it manually.

Note: Console interfaces and available features may vary across Elasticsearch versions. This guide uses version 7.10.0 for examples.

Understand the plugin components

The IK analyzer plugin consists of three main components that work together to tokenize Chinese text:

  1. Tokenizers: The analysis engines that break Chinese text into individual terms

  2. Dictionary files: Word lists that guide tokenization decisions

  3. Update mechanisms: Methods for refreshing dictionaries with new terms

Tokenization modes

The plugin provides two tokenization modes optimized for different purposes:

| Mode | Use case | Behavior | Example (input: 计算机汉字输入方法) |
| --- | --- | --- | --- |
| ik_max_word | Indexing | Fine-grained segmentation. Exhaustively splits text into all possible word combinations to maximize search recall. | 计算机, 计算, 算机, 汉字输入, 汉字, 输入, 方法 |
| ik_smart | Searching | Coarse-grained segmentation. Produces fewer, more semantically meaningful tokens for precise queries. | 计算机, 汉字输入, 方法 |

A common pattern combines both modes in one mapping: tokenize with ik_max_word at index time for recall, and with ik_smart at query time for precision:

{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}

Dictionary types

The plugin includes several dictionary files that control tokenization behavior:

| Dictionary type | File name | Description | Editable | Update methods |
| --- | --- | --- | --- | --- |
| Main dictionary | main.dic | Contains over 270,000 common Chinese words; the primary vocabulary for tokenization. | Yes | Dynamic update, update requiring restart |
| Stop word list | stopword.dic | Words filtered out during analysis. The default list contains English stop words (a, an, and, are, as, at, be, and so on). | Yes | Dynamic update, update requiring restart |
| Preposition dictionary | preposition.dic | Chinese prepositions and function words for specialized filtering. | Yes | Update requiring restart only |
| Quantifier dictionary | quantifier.dic | Chinese measure words and quantifiers (for example, 个, 只, 条). | Yes | Update requiring restart only |
| Surname dictionary | surname.dic | Common Chinese surnames for name recognition. | No | Not supported |
| Suffix dictionary | suffix.dic | Used to tokenize terms with suffixes. | No | Not supported |

Format requirements: All dictionary files must be:

  • Encoded in UTF-8

  • Plain text files with the .dic extension

  • One word per line

  • Named using only letters, digits, and underscores

  • Named with at most 30 characters (excluding the .dic extension)
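The rules above can be checked programmatically before uploading. A minimal sketch in Python; the file name custom_terms.dic and the terms are illustrative, not part of the plugin:

```python
# Sketch: build a custom IK dictionary file that satisfies the documented
# format rules: UTF-8 encoding, one word per line, .dic extension, and a
# base file name of at most 30 characters using only letters/digits/underscores.
import re
from pathlib import Path

MAX_BASENAME_LEN = 30
NAME_PATTERN = re.compile(r"^[A-Za-z0-9_]+$")

def validate_dic_name(filename: str) -> None:
    """Raise ValueError if the file name violates the documented rules."""
    path = Path(filename)
    if path.suffix != ".dic":
        raise ValueError("dictionary files must use the .dic extension")
    if len(path.stem) > MAX_BASENAME_LEN:
        raise ValueError("file name (excluding .dic) must be at most 30 characters")
    if not NAME_PATTERN.match(path.stem):
        raise ValueError("file name may contain only letters, digits, and underscores")

def write_dic(filename: str, terms: list[str]) -> None:
    """Write terms one per line, UTF-8 encoded."""
    validate_dic_name(filename)
    with open(filename, "w", encoding="utf-8") as f:
        f.write("\n".join(terms) + "\n")

# Illustrative terms only; use your own domain vocabulary.
write_dic("custom_terms.dic", ["阿里云", "机器学习"])
```

Validating locally avoids a failed upload, or worse, a failed cluster restart caused by a malformed file.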

Update methods

The plugin supports two methods for updating dictionaries, each suited for different scenarios:

| Method | When to use | Supported dictionaries | Impact | Limitations |
| --- | --- | --- | --- | --- |
| Dynamic update | Runtime updates without downtime | Main dictionary, stop word list | No cluster restart needed. Each node loads the updated dictionary automatically. | Can only modify the content of existing files, not add or remove dictionary files. Changing file names triggers a restart. |
| Update requiring restart | Initial dictionary setup, adding new dictionary files, or updating the preposition or quantifier dictionaries | All dictionary types | Triggers a rolling cluster restart. Downtime depends on cluster size and configuration. | Service interruption during the restart. |

Performance note: During dynamic updates, each node retrieves and loads dictionary files at different times. Changes take effect progressively across the cluster, not simultaneously.

Data impact: Dictionary updates only affect newly indexed or updated documents. To apply changes to all existing data, you must reindex.
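Dictionary changes can be applied to existing data with the _reindex API. A minimal sketch, assuming a new index my_index_v2 has already been created with the updated analyzer (both index names are hypothetical):

```
POST _reindex
{
  "source": { "index": "my_index" },
  "dest": { "index": "my_index_v2" }
}
```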

Configure the IK analyzer

You can configure the IK analyzer in two ways: as the default analyzer for an index, or for specific fields.

Set IK as the default analyzer

This configuration applies IK tokenization to all text fields in the index:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "ik_max_word"
        }
      }
    }
  }
}
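To confirm that the default analyzer is in effect, call the index-scoped _analyze API without naming an analyzer; Elasticsearch then uses the index default:

```
GET /my_index/_analyze
{
  "text": "计算机汉字输入方法"
}
```

The response should list word-level tokens such as 计算机 rather than single characters.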

Configure IK for specific fields

This configuration gives you fine-grained control over which fields use IK tokenization:

PUT /my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "tags": {
        "type": "text",
        "analyzer": "ik_smart"
      }
    }
  }
}
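With this mapping, a match query is automatically analyzed with the field's search_analyzer (ik_smart) at query time; no extra configuration is needed. The query text below is illustrative:

```
GET /my_index/_search
{
  "query": {
    "match": { "title": "汉字输入" }
  }
}
```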

Use IK with the Pinyin plugin

You can combine IK analyzer with the Pinyin analysis plugin to support both Chinese characters and Pinyin input:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik_pinyin_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["my_pinyin", "word_delimiter"]
        }
      },
      "filter": {
        "my_pinyin": {
          "type": "pinyin",
          "keep_first_letter": true,
          "keep_separate_first_letter": false,
          "keep_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "lowercase": true,
          "remove_duplicated_term": true
        }
      }
    }
  }
}

Important: The Pinyin filter runs after Chinese tokenization completes. The analyzer first tokenizes Chinese text with IK, then converts the resulting tokens into Pinyin.
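To apply the combined analyzer, reference it in a field mapping. A sketch, assuming the index was created with the settings above; the title field name is hypothetical:

```
PUT /my_index/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "ik_pinyin_analyzer"
    }
  }
}
```

With this mapping, users can find the same document by typing Chinese characters, full Pinyin, or Pinyin first letters.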

Update dictionaries

You can add custom terms to dictionaries or modify existing dictionaries to improve tokenization accuracy for your domain.

Method 1: Dynamic update (no restart needed)

Use this method to update the main dictionary or stop word list without interrupting service.

When to use:

  • Adding new product names, technical terms, or domain vocabulary

  • Updating stop words based on search analytics

  • Frequent dictionary changes in development or testing

Limitations:

  • Only main dictionary (main.dic) and stop word list (stopword.dic) supported

  • Cannot add or remove dictionary files—only modify existing file content

  • Changing file names or file count triggers a cluster restart

Procedure

  1. Log on to the Alibaba Cloud Elasticsearch console and navigate to your instance:

    • In the left navigation menu, choose Elasticsearch Clusters.

    • Click the name of your target instance.

  2. In the left navigation pane, click Plug-ins.

  3. On the Built-in Plug-ins tab, find analysis-ik in the plug-in list.

  4. Click Rolling Update in the Actions column.

  5. In the Configure IK Dictionaries - Rolling Update panel, update the target dictionary:

    1. Click Edit for the target dictionary type.

    2. Add your custom terms using one of two methods:

      Option A: Upload a local file

      • Select Upload On-premises File.

      • Select your .dic file (UTF-8 encoded, one word per line).

      • Click Save.

      Option B: Upload from OSS

      • Select Upload OSS File.

      • Enter the OSS bucket name.

      • Enter the complete path to your dictionary file, including file name and extension.

      • Click Save.

      Note
      • To update an existing dictionary, download and modify it, remove the original file, and upload the updated one.

      • The OSS bucket and Elasticsearch instance must be in the same region.

      • File name restrictions: Maximum 30 characters (excluding .dic extension), containing only letters, digits, and underscores.

  6. Review your changes and click OK.

    The configuration takes effect after each node loads the updated dictionary, and load times vary by node. Wait about one to two minutes, then verify that the dictionary is updated.
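For example, if your uploaded dictionary added a custom term (the term 云原生数据库 here is purely illustrative), confirm that it now appears as a single token:

```
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "云原生数据库"
}
```

If the full term appears as one token in the response, the dictionary has been loaded on the node that served the request.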

Method 2: Update requiring restart

Use this method for initial dictionary setup, adding new dictionary files, or updating preposition/quantifier dictionaries.

When to use:

  • First-time dictionary configuration

  • Adding new dictionary files (not just modifying existing files)

  • Updating preposition or quantifier dictionaries

  • Major dictionary overhauls

Warning: This method triggers a rolling cluster restart. Your cluster will experience temporary service interruption as nodes restart sequentially. Plan this update during maintenance windows.

Procedure

  1. Log on to the Alibaba Cloud Elasticsearch console.

  2. Navigate to your Elasticsearch instance:

    • In the left navigation menu, choose Elasticsearch Clusters.

    • Click the name of your target instance.

  3. In the left navigation pane, click Plug-ins.

  4. On the Built-in Plug-ins tab, find analysis-ik in the plug-in list.

  5. Click Standard Update in the Actions column.

  6. In the Configure IK Dictionaries - Standard Update panel, click Edit for the target dictionary type, and upload the required dictionary file:

    Option A: Upload a local file

    • Select Upload On-premises File.

    • Select your .dic file (UTF-8 encoded, one word per line).

    • Click Save.

    Option B: Upload from OSS

    • Select Upload OSS File.

    • Enter the OSS bucket name.

    • Enter the complete path to your dictionary file, including file name and extension.

    • Click Save.

    Note
    • You can upload only one .dic file per dictionary type. The uploaded file replaces the existing dictionary.

    • The OSS bucket and Elasticsearch instance must be in the same region.

    • File name restrictions: Maximum 30 characters (excluding .dic extension), containing only letters, digits, and underscores.

  7. Review the current configuration to verify your upload.

  8. Click OK.

  9. The system prompts: "After the update, the cluster will be restarted. Continue?"

  10. Click OK to confirm. The cluster begins a rolling restart to apply the configuration.

  11. Monitor the restart progress on the cluster details page. The cluster status changes to Initializing during the restart, then returns to Normal when complete.

Verify basic IK tokenization

Verify that the IK analyzer is working correctly by testing tokenization with the _analyze API.

Test that Chinese text is segmented into words, not individual characters:

GET _analyze
{
  "analyzer": "ik_smart",
  "text": "计算机汉字输入方法"
}

Expected output:

{
  "tokens" : [
    {
      "token" : "计算机",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "汉字输入",
      "start_offset" : 3,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "方法",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}

What to look for:

  • Chinese text is segmented into meaningful words (计算机, 汉字输入...)

  • If you see individual characters (计, 算, 机...), the analyzer is not configured correctly

Troubleshooting

| Issue | Possible cause | Solution |
| --- | --- | --- |
| Chinese text is tokenized into individual characters | IK analyzer not configured for the field | Verify that the field mapping uses ik_max_word or ik_smart. Use the GET /my_index/_mapping API to check the field configuration. |
| Custom terms are not recognized as single tokens | Dictionary not loaded or update not applied | Wait 1-2 minutes after a dynamic update for all nodes to load the dictionary. For updates requiring restart, verify that the cluster status is Normal. |
| Dynamic update (hot reload) stopped working | File name or file count changed | Dynamic update only supports modifying existing file content. Adding, removing, or renaming files requires an update requiring restart. |
| Dictionary upload fails with an encoding error | Dictionary file not saved as UTF-8 | Save the dictionary file with UTF-8 encoding. Avoid ANSI or other encodings. |
| OSS upload fails | OSS bucket and Elasticsearch instance are in different regions | Verify that the OSS bucket is in the same region as your Elasticsearch instance using the console or API. |
| Dictionary changes do not affect existing documents | Updates only apply to new data | Reindex your data to apply dictionary changes to all documents. Use the _reindex API to copy data to a new index with the updated analyzer. |
| Cluster fails to restart after an update | Invalid dictionary file format or content | Check the dictionary file format: UTF-8 encoding, one word per line, no special characters. Review the Elasticsearch logs for specific error messages. |
| Stop words are not filtered | Stop word filter not configured in the analyzer | Add the stop token filter to your custom analyzer configuration, or use a built-in analyzer that includes stop word filtering. |

Diagnostic commands

Use these commands to diagnose IK analyzer issues:

Check plugin installation:

GET _cat/plugins?v

Expected output should include analysis-ik in the plugin list.

Verify analyzer configuration for an index:

GET /my_index/_settings

Check the analysis section for IK analyzer configuration.

Check field mapping:

GET /my_index/_mapping/field/content

Verify the analyzer and search_analyzer settings for the field.

Test analyzer directly:

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "测试文本"
}

This tests the analyzer without requiring an index.

Best practices

Choosing between ik_smart and ik_max_word

| Scenario | Recommended configuration | Reason |
| --- | --- | --- |
| E-commerce product search | Index: ik_max_word; Search: ik_smart | Product names are often short and need comprehensive matching, while search queries should be precise to avoid irrelevant results. |
| Long-form content (articles, documentation) | Index: ik_max_word; Search: ik_smart | Maximize recall for diverse queries while reducing noise in search results. |
| Exact phrase matching | Both: ik_smart | Avoids over-segmentation that breaks meaningful phrases. |
| Autocomplete and suggestions | ik_smart | Produces cleaner suggestions with semantically meaningful words. |
| Log analysis with Chinese text | ik_max_word | Captures all possible keywords for troubleshooting and pattern detection. |

Dictionary maintenance

  • Review search analytics: Identify frequently searched terms that are being tokenized incorrectly, then add them to your custom dictionary.

  • Use dynamic updates for frequent changes: If you add new product names or technical terms regularly, use dynamic updates to avoid restart overhead.

  • Plan updates requiring restart carefully: Schedule these updates during maintenance windows to minimize user impact.

  • Version control dictionary files: Store dictionary files in version control (Git) to track changes and enable rollback if needed.

  • Test before production: Always test dictionary changes in a development or staging environment before applying them to production.

Performance considerations

| Factor | ik_smart | ik_max_word | Impact |
| --- | --- | --- | --- |
| Index size | Smaller | Larger (30-50% increase) | Storage costs, backup size |
| Indexing speed | Faster | Slower (20-40% decrease) | Bulk indexing time |
| Search accuracy | Good (higher precision) | Better (higher recall) | Search result quality |
| Memory usage | Lower | Higher | Node memory requirements |

Recommendation:

  • For cost-sensitive applications with tight storage budgets, consider using ik_smart for both indexing and searching.

  • For quality-critical applications (e-commerce, search engines), use the asymmetric configuration: ik_max_word for indexing and ik_smart for searching.

  • For large-scale deployments (10+ million documents), test performance impact before choosing a mode. Consider shard sizing and node capacity.

Handling mixed Chinese-English content

The IK analyzer handles mixed Chinese-English content gracefully:

  • Chinese tokens: Analyzed by IK dictionary

  • English tokens: Passed through to standard tokenization (word boundary detection)

  • Numbers and punctuation: Treated as separators

Example:

GET _analyze
{
  "analyzer": "ik_smart",
  "text": "使用Machine Learning技术"
}

Result: 使用, machine, learning, 技术

Note: English tokens are lowercased by default. If you need case-sensitive English tokenization, configure a custom analyzer with appropriate token filters.