The AliNLP tokenization plug-in, also known as analysis-aliws, is a built-in plug-in of Alibaba Cloud Elasticsearch. This plug-in integrates an analyzer and a tokenizer into Elasticsearch to implement document analysis and retrieval. The plug-in allows you to upload a tailored dictionary file to it. After the upload, the system performs a rolling update for your Elasticsearch cluster to apply the dictionary file. This topic describes how to use the analysis-aliws plug-in.

Background information

After the analysis-aliws plug-in is installed, the following analyzer and tokenizer are integrated into your Elasticsearch cluster:
  • Analyzer: aliws, which does not return function words, function phrases, or symbols
  • Tokenizer: aliws_tokenizer

You can use the analyzer and tokenizer to search for documents. You can also upload a tailored dictionary file to the plug-in. For more information, see Search for a document and Configure dictionaries. If the analysis-aliws plug-in does not deliver the expected results, see Test the analyzer and Test the tokenizer to locate the cause.

Prerequisites

The analysis-aliws plug-in is installed. The plug-in is not installed by default.

If the plug-in is not installed, install it. Make sure that each data node in your Elasticsearch cluster offers at least 4 GiB of memory. If your cluster runs in a production environment, each data node must offer at least 8 GiB of memory. For more information about how to install the analysis-aliws plug-in, see Install and remove a built-in plug-in.

Notice
  • If the memory size of data nodes in your cluster does not meet the preceding requirements, upgrade the configuration of your cluster. For more information, see Upgrade the configuration of a cluster.

Limits

Elasticsearch V5.X clusters do not support the analysis-aliws plug-in.

Search for a document

  1. Log on to the Kibana console of your Elasticsearch cluster.
    For more information, see Log on to the Kibana console.
    Note In this example, an Elasticsearch V6.7.0 cluster is used. Operations on clusters of other versions may differ from those described in this topic.
  2. In the left-side navigation pane of the page that appears, click Dev Tools.
  3. On the Console tab of the page that appears, run one of the following commands to create an index:
    • Command for Elasticsearch clusters of versions earlier than V7.0
      PUT /index
      {
         "mappings": {
              "fulltext": {
                  "properties": {
                      "content": {
                          "type": "text",
                          "analyzer": "aliws"
                      }
                  }
              }
          }
      }
    • Command for Elasticsearch clusters of V7.0 or later
      PUT /index
      {
        "mappings": {
          "properties": {
              "content": {
                  "type": "text",
                  "analyzer": "aliws"
                }
            }
        }
      }

    In this example, an index named index is created. In versions earlier than V7.0, the document type of the index is fulltext. In V7.0 or later, mapping types are removed and the _doc type is used. The index contains a field named content whose type is text, and the aliws analyzer is specified for the field.

    If the command is successfully run, the following result is returned:
    {
      "acknowledged": true,
      "shards_acknowledged": true,
      "index": "index"
    }
  4. Run the following command to add a document:
    Notice The following command applies only to Elasticsearch clusters of versions earlier than V7.0. For an Elasticsearch cluster of V7.0 or later, you must change fulltext to _doc.
    POST /index/fulltext/1
    {
      "content": "I like go to school."
    }

    The preceding command adds a document whose ID is 1 and sets the content field of the document to I like go to school..

    If the command is successfully run, the following result is returned:
    {
      "_index": "index",
      "_type": "fulltext",
      "_id": "1",
      "_version": 1,
      "result": "created",
      "_shards": {
        "total": 2,
        "successful": 2,
        "failed": 0
      },
      "_seq_no": 0,
      "_primary_term": 1
    }
  5. Run the following command to search for the document:
    Notice The following command applies only to Elasticsearch clusters of versions earlier than V7.0. For an Elasticsearch cluster of V7.0 or later, you must change fulltext to _doc.
    GET /index/fulltext/_search
    {
      "query": {
        "match": {
          "content": "school"
        }
      }
    }

    The preceding command searches all documents of the fulltext type and returns the documents whose content field contains school. The aliws analyzer analyzes both the indexed content and the query string.

    If the command is successfully run, the following result is returned:
    {
      "took": 5,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 1,
        "max_score": 0.2876821,
        "hits": [
          {
            "_index": "index",
            "_type": "fulltext",
            "_id": "1",
            "_score": 0.2876821,
            "_source": {
              "content": "I like go to school."
            }
          }
        ]
      }
    }
Note If the analysis-aliws plug-in does not deliver the expected results, see Test the analyzer and Test the tokenizer to locate the cause.
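The two index-creation commands in step 3 differ only in whether the properties are nested under a mapping type. The following sketch is illustrative only (the helper name and version cutoff are assumptions based on the commands above) and builds the appropriate request body for either version:

```python
import json

def build_index_body(es_major_version: int, field: str = "content") -> dict:
    """Build an index-creation body that applies the aliws analyzer to a text field."""
    props = {field: {"type": "text", "analyzer": "aliws"}}
    if es_major_version < 7:
        # Versions earlier than V7.0 nest properties under a mapping type (fulltext here).
        return {"mappings": {"fulltext": {"properties": props}}}
    # V7.0 or later: mapping types are removed, so properties sit directly under mappings.
    return {"mappings": {"properties": props}}

print(json.dumps(build_index_body(7), indent=2))
```

You would send the returned body with a PUT /index request, as shown in step 3.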

Configure dictionaries

The analysis-aliws plug-in allows you to upload a tailored dictionary file named aliws_ext_dict.txt. After you upload the file, all nodes in your Elasticsearch cluster automatically load it. The system does not restart the cluster during this process.
Notice
  • After the analysis-aliws plug-in is installed, no default dictionary file is provided. You must manually upload a tailored dictionary file.
  • Before you upload a tailored dictionary file, you must name the dictionary file aliws_ext_dict.txt.
  1. Log on to the Elasticsearch console.
  2. In the left-side navigation pane, click Elasticsearch Clusters.
  3. Navigate to the desired cluster.
    1. In the top navigation bar, select the resource group to which the cluster belongs and the region where the cluster resides.
    2. In the left-side navigation pane, click Elasticsearch Clusters. On the Elasticsearch Clusters page, find the cluster and click its ID.
  4. In the left-side navigation pane of the page that appears, choose Configuration and Management > Plug-ins.
  5. On the Built-in Plug-ins tab, find the analysis-aliws plug-in and click Dictionary Configuration in the Actions column.
  6. In the Dictionary Configuration panel, click Configure in the lower-left corner.
  7. Select a method to upload the dictionary file. Then, upload the dictionary file based on the following instructions.
    Notice You can upload only one dictionary file, and the name of the dictionary file must be aliws_ext_dict.txt. If you want to update the aliws_ext_dict.txt dictionary file that is uploaded, click x next to the dictionary file name to delete the dictionary file. Then, upload another dictionary file that is named aliws_ext_dict.txt.
    The dictionary file must meet the following requirements:
    • Name: the file must be named aliws_ext_dict.txt.
    • Encoding: UTF-8.
    • Content: each line contains a single word followed by \n (the line feed used in UNIX and Linux), with no leading or trailing whitespace around the word. If the dictionary file is generated in Windows, convert it with the dos2unix tool before you upload it.
    You can use one of the following methods to upload a dictionary file:
    • TXT File: If you select this method, click Upload TXT File and select the file that you want to upload from your on-premises machine.
    • Add OSS File: If you select this method, configure the Bucket Name and File Name parameters, and click Add.

      Make sure that the bucket you specify resides in the same region as your Elasticsearch cluster. If the content of the dictionary that is stored in Object Storage Service (OSS) changes, you must manually upload the dictionary file again.

  8. Click Save.
    The system does not restart your cluster but performs a rolling update to make the uploaded dictionary file take effect. The update requires about 10 minutes.
    Note If you want to download the uploaded dictionary file, click the download icon that corresponds to the file.
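The dictionary file requirements in step 7 can be checked before upload. The following is a minimal sketch (the function name is an assumption; it enforces only the rules listed above: UTF-8 encoding, one word per line, LF line endings, and no surrounding whitespace):

```python
def normalize_dictionary(raw: bytes) -> list[str]:
    """Validate raw dictionary-file bytes and return the list of words."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError if the file is not UTF-8
    if "\r" in text:
        raise ValueError("CRLF line endings found; run dos2unix before uploading")
    words = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        if not line:
            continue  # skip blank lines, including the one after the final \n
        if line != line.strip():
            raise ValueError(f"line {lineno}: leading or trailing whitespace around word")
        words.append(line)
    return words
```

For example, normalize_dictionary(b"hello\nworld\n") returns ["hello", "world"], while a file with Windows line endings raises an error.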

Test the analyzer

Run the following command to test the aliws analyzer:

GET _analyze
{
  "text": "I like go to school.",
  "analyzer": "aliws"
}
If the command is successfully run, the following result is returned:
{
  "tokens" : [
    {
      "token" : "i",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "like",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "go",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "school",
      "start_offset" : 13,
      "end_offset" : 19,
      "type" : "word",
      "position" : 8
    }
  ]
}

Test the tokenizer

Run the following command to test the aliws_tokenizer tokenizer:

GET _analyze
{
  "text": "I like go to school.",
  "tokenizer": "aliws_tokenizer"
}
If the command is successfully run, the following result is returned:
{
  "tokens" : [
    {
      "token" : "I",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : " ",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "like",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : " ",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "go",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : " ",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "to",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : " ",
      "start_offset" : 12,
      "end_offset" : 13,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "school",
      "start_offset" : 13,
      "end_offset" : 19,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : ".",
      "start_offset" : 19,
      "end_offset" : 20,
      "type" : "word",
      "position" : 9
    }
  ]
}
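As a rough illustration of how the two outputs relate: the aliws analyzer result can be approximated by lowercasing the aliws_tokenizer result and dropping whitespace tokens, symbol tokens, and function words. The function-word list below is an assumption inferred from the sample outputs, not the plug-in's actual dictionary:

```python
# Illustration only: approximates the relationship between the aliws_tokenizer
# output and the aliws analyzer output shown in the preceding sections.
FUNCTION_WORDS = {"to"}  # assumption inferred from the sample outputs

def approximate_analyzer(tokenizer_tokens: list[str]) -> list[str]:
    result = []
    for tok in tokenizer_tokens:
        low = tok.lower()
        if not low.strip():                    # drop whitespace-only tokens
            continue
        if not any(c.isalnum() for c in low):  # drop symbol tokens such as "."
            continue
        if low in FUNCTION_WORDS:              # drop function words
            continue
        result.append(low)
    return result

tokens = ["I", " ", "like", " ", "go", " ", "to", " ", "school", "."]
print(approximate_analyzer(tokens))  # ['i', 'like', 'go', 'school']
```

The output matches the token list returned by the aliws analyzer in the Test the analyzer section.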

FAQ