The AliNLP tokenization plug-in, also known as analysis-aliws, is a built-in plug-in of Alibaba Cloud Elasticsearch. After you install this plug-in on your Elasticsearch cluster, an analyzer and a tokenizer are integrated into the cluster to implement document analysis and retrieval. The plug-in allows you to upload a custom dictionary file. After the upload, the cluster performs a rolling update to apply the dictionary file without restarting.
The analysis-aliws plug-in adds two components to your cluster:
Analyzer: aliws, which strips function words, function phrases, and symbols from results
Tokenizer: aliws_tokenizer
After installation, use the aliws analyzer to index and search documents, upload a custom dictionary file to extend vocabulary, or build a custom analyzer on top of aliws_tokenizer.
If search results are unexpected, test the analyzer and tokenizer directly to diagnose the issue. See Test the analyzer and Test the tokenizer.
Prerequisites
Before you begin, ensure that you have:
An Elasticsearch cluster with the analysis-aliws plug-in installed. The plug-in is not installed by default. To install it, see Install and remove a built-in plug-in.
Data nodes with at least 8 GiB of memory. If your cluster does not meet this requirement, upgrade the cluster configuration first. See Upgrade the configuration of a cluster.
Limitations
The analysis-aliws plug-in is not supported on:
Elasticsearch V5.X clusters
Elasticsearch V8.X clusters
Elasticsearch Kernel-enhanced Edition clusters
Check whether your cluster supports the plug-in in the Elasticsearch console before proceeding.
Search documents using the aliws analyzer
This section walks through creating an index with the aliws analyzer, indexing a document, and running a search.
The following steps use an Elasticsearch V6.7.0 cluster as an example. The exact steps may differ for other versions; follow the operations shown in your console.
Log on to the Kibana console of your Elasticsearch cluster. For details, see Log on to the Kibana console.
In the left-side navigation pane, click Dev Tools.
On the Console tab, run the command for your cluster version to create an index.

For clusters running a version earlier than V7.0:

PUT /index
{
  "mappings": {
    "fulltext": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "aliws"
        }
      }
    }
  }
}

For clusters running V7.0 or later:

PUT /index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "aliws"
      }
    }
  }
}

Both commands create an index named index with a content field of type text and assign the aliws analyzer to that field. The index type is fulltext in versions earlier than V7.0 and _doc in V7.0 or later. A successful response looks like:

{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "index"
}

Add a document to the index:
Important: The following command applies only to clusters running a version earlier than V7.0. For V7.0 or later, replace fulltext with _doc.

POST /index/fulltext/1
{
  "content": "I like go to school."
}

This adds a document with ID 1 and sets its content field to "I like go to school." A successful response looks like:

{
  "_index": "index",
  "_type": "fulltext",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}

Search for the document:
Important: The following command applies only to clusters running a version earlier than V7.0. For V7.0 or later, replace fulltext with _doc.

GET /index/fulltext/_search
{
  "query": {
    "match": {
      "content": "school"
    }
  }
}

This analyzes the query with the aliws analyzer and returns the documents whose content field contains the token school. Function words such as "to" are excluded from analysis. A successful response looks like:

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "content": "I like go to school."
        }
      }
    ]
  }
}
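The only difference between the pre-V7.0 and V7.0-or-later requests above is the type segment of the document path. A small hypothetical helper (not part of any Elasticsearch client; the function name is illustrative) makes the rule explicit:

```python
def document_path(index: str, doc_id: str, es_major_version: int) -> str:
    """Return the REST path for a document in this walkthrough's index.

    Uses the type name from the example mapping: "fulltext" on clusters
    earlier than V7.0, "_doc" on V7.0 or later.
    """
    doc_type = "fulltext" if es_major_version < 7 else "_doc"
    return f"/{index}/{doc_type}/{doc_id}"
```

For example, `document_path("index", "1", 6)` yields the path used by the POST command above, while passing `7` yields the V7.0+ form with `_doc`.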
Configure a custom dictionary
The analysis-aliws plug-in supports one custom dictionary file named aliws_ext_dict.txt. After you upload the file, all nodes in the cluster load it automatically without a restart — the cluster performs a rolling update instead, which takes about 10 minutes.
No dictionary file is provided after installation. Upload one if you need custom vocabulary.
Dictionary file requirements
Before uploading, make sure your dictionary file meets these requirements:
| Requirement | Details |
|---|---|
| File name | aliws_ext_dict.txt |
| Encoding | UTF-8 |
| Format | One word per line, each line ending with \n (UNIX/Linux line feed). No leading or trailing whitespace. |
| Windows files | Use the dos2unix tool to convert the file before uploading. |
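The requirements above can be checked locally before you upload the file. The following Python sketch is illustrative only (the function name and messages are not part of the plug-in); it flags non-UTF-8 content, Windows line endings, and surrounding whitespace:

```python
def validate_dictionary(path):
    """Check a candidate aliws_ext_dict.txt against the upload requirements.

    Returns a list of problem descriptions; an empty list means the file
    passes every check (UTF-8 encoding, UNIX \n line endings, and no
    leading or trailing whitespace on any line).
    """
    problems = []
    with open(path, "rb") as f:
        raw = f.read()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return ["file is not valid UTF-8"]
    if b"\r" in raw:
        problems.append("contains \\r; convert Windows line endings with dos2unix")
    for lineno, line in enumerate(text.split("\n"), start=1):
        if line != line.strip():
            problems.append(f"line {lineno} has leading or trailing whitespace")
    return problems
```

A file that passes these checks still needs the exact name aliws_ext_dict.txt to be accepted by the console.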
Upload the dictionary file
Log on to the Alibaba Cloud Elasticsearch console.
In the left-side navigation pane, click Elasticsearch Clusters.
In the top navigation bar, select the resource group and region for your cluster. On the Elasticsearch Clusters page, click the cluster ID.
In the left-side navigation pane, choose Configuration and Management > Plug-ins.
On the Built-in Plug-ins tab, find the analysis-aliws plug-in and click Configure Dictionary in the Actions column.
In the Configure Dictionary panel, click Configure in the lower-left corner.
Upload the dictionary file using one of the following methods:
TXT File: Click Upload TXT File and select the file from your local machine.
Add OSS File: Enter the Bucket Name and File Name, then click Add. The Object Storage Service (OSS) bucket must be in the same region as your Elasticsearch cluster. If the OSS file content changes later, upload the dictionary file again through the console — OSS changes are not picked up automatically.
Click Save. The cluster performs a rolling update to apply the dictionary file. The update takes about 10 minutes. The cluster does not restart during this process.
To download the uploaded file, click the download icon next to the file name.

Only one dictionary file is allowed at a time. To replace it, click x next to aliws_ext_dict.txt to delete the current file, then upload a new one.

Test the analyzer
Run the following command to verify how the aliws analyzer tokenizes text:
GET _analyze
{
"text": "I like go to school.",
"analyzer": "aliws"
}

The aliws analyzer strips function words and symbols. For the input "I like go to school.", the expected output is:
{
"tokens" : [
{
"token" : "i",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "like",
"start_offset" : 2,
"end_offset" : 6,
"type" : "word",
"position" : 2
},
{
"token" : "go",
"start_offset" : 7,
"end_offset" : 9,
"type" : "word",
"position" : 4
},
{
"token" : "school",
"start_offset" : 13,
"end_offset" : 19,
"type" : "word",
"position" : 8
}
]
}

The function word "to" and the period "." are excluded from the output.
Test the tokenizer
Run the following command to verify how aliws_tokenizer tokenizes text:
GET _analyze
{
"text": "I like go to school.",
"tokenizer": "aliws_tokenizer"
}

Unlike the aliws analyzer, aliws_tokenizer retains all tokens, including function words, whitespace, and punctuation. For the input "I like go to school.", the expected output is:
{
"tokens" : [
{
"token" : "I",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : " ",
"start_offset" : 1,
"end_offset" : 2,
"type" : "word",
"position" : 1
},
{
"token" : "like",
"start_offset" : 2,
"end_offset" : 6,
"type" : "word",
"position" : 2
},
{
"token" : " ",
"start_offset" : 6,
"end_offset" : 7,
"type" : "word",
"position" : 3
},
{
"token" : "go",
"start_offset" : 7,
"end_offset" : 9,
"type" : "word",
"position" : 4
},
{
"token" : " ",
"start_offset" : 9,
"end_offset" : 10,
"type" : "word",
"position" : 5
},
{
"token" : "to",
"start_offset" : 10,
"end_offset" : 12,
"type" : "word",
"position" : 6
},
{
"token" : " ",
"start_offset" : 12,
"end_offset" : 13,
"type" : "word",
"position" : 7
},
{
"token" : "school",
"start_offset" : 13,
"end_offset" : 19,
"type" : "word",
"position" : 8
},
{
"token" : ".",
"start_offset" : 19,
"end_offset" : 20,
"type" : "word",
"position" : 9
}
]
}

Build a custom analyzer
aliws_tokenizer applies the following filters after tokenization:
| Filter | Effect | Default behavior |
|---|---|---|
| stemmer | Reduces words to their root form | Enabled |
| lowercase | Converts all tokens to lowercase | Enabled |
| porter_stem | Applies the Porter stemming algorithm | Enabled |
| stop | Removes stopwords | Enabled |
To build a custom analyzer, add aliws_tokenizer as the base tokenizer and configure filters to match your requirements. Use the stopwords field to define custom stopwords.
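As a rough illustration, the following Python sketch mimics what the lowercase and stop filters do to the token stream that aliws_tokenizer produces for "I like go to school." (stemming is omitted, and the stopword set here is a minimal stand-in, not the plug-in's actual list):

```python
def apply_filters(tokens, stopwords):
    """Lowercase each token, drop whitespace and punctuation tokens,
    then drop any token that appears in the stopword set."""
    out = []
    for token in tokens:
        token = token.lower()
        if not token.strip() or not token.isalnum():
            continue  # whitespace or punctuation token
        if token in stopwords:
            continue  # stopword
        out.append(token)
    return out

# Token stream produced by aliws_tokenizer for "I like go to school."
tokens = ["I", " ", "like", " ", "go", " ", "to", " ", "school", "."]
print(apply_filters(tokens, stopwords={"to"}))  # → ['i', 'like', 'go', 'school']
```

The result matches the token list shown in the aliws analyzer test above, which is exactly the relationship between the tokenizer and the full analyzer.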
The following example creates a custom analyzer named my_custom_analyzer with a custom stop filter:
PUT my-index-000001
{
"settings": {
"analysis": {
"filter": {
"my_stop": {
"type": "stop",
"stopwords": [
" ",
",",
".",
" ",
"a",
"of"
]
}
},
"analyzer": {
"my_custom_analyzer": {
"type": "custom",
"tokenizer": "aliws_tokenizer",
"filter": [
"lowercase",
"porter_stem",
"my_stop"
]
}
}
}
}
}

To skip filtering entirely, remove the filter block from the analyzer configuration. To verify that the custom analyzer works as expected, run:
GET my-index-000001/_analyze
{
"analyzer": "my_custom_analyzer",
"text": ["I like go to school."]
}

aliws_tokenizer also supports synonym configuration using the same method as the analysis-ik plug-in. For details, see Use synonyms.
FAQ
How do I configure the analysis-aliws plug-in? What format does the dictionary file use?
See Configure a custom dictionary for the upload steps and Dictionary file requirements for the format.
For additional questions on configuring the plug-in, see the FAQ for analysis-aliws.
What are the differences among Elasticsearch synonyms, IK tokens, and AliNLP tokens?
See Elasticsearch synonyms vs. IK tokens vs. AliNLP tokens.
If the OSS dictionary is updated, does the cluster pick up the changes automatically?
No. If the OSS dictionary content changes, upload the file again through the console. See Configure a custom dictionary.
For more details on rolling updates and OSS-based dictionaries, see this FAQ entry.
After tokenization, the trailing `e` is missing from words like `iPhone` (becomes `iphon`) and `Chinese` (becomes `chines`). How do I fix this?
aliws_tokenizer applies stemming filters after tokenization, and these filters strip the trailing e from such words. To disable stemming, create a custom analyzer that uses aliws_tokenizer without any filters:
PUT my-index1
{
"settings": {
"number_of_shards": 1,
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"type": "custom",
"tokenizer": "aliws_tokenizer"
}
}
}
}
}

Verify the result:
GET my-index1/_analyze
{
"analyzer": "my_custom_analyzer",
"text": ["iphone"]
}

The output should now return the token iphone without truncation.
References
Overview of plug-ins — all plug-ins available for Alibaba Cloud Elasticsearch
InstallSystemPlugin — API operation to install a built-in plug-in
UpdateAliwsDict — API operation to update the analysis-aliws dictionary file
ListPlugins — API operation to list installed plug-ins on a cluster