The AliNLP tokenization plug-in, also known as analysis-aliws, is a built-in plug-in of Alibaba Cloud Elasticsearch. This plug-in integrates an analyzer and a tokenizer into Elasticsearch to implement document analysis and retrieval. The plug-in allows you to upload a tailored dictionary file to it. After the upload, the system performs a rolling update for your Elasticsearch cluster to apply the dictionary file. This topic describes how to use the analysis-aliws plug-in.

Background information

After the analysis-aliws plug-in is installed, the following analyzer and tokenizer are integrated into your Elasticsearch cluster:
  • Analyzer: aliws, which does not return function words, function phrases, or symbols
  • Tokenizer: aliws_tokenizer

You can use the analyzer and tokenizer to search for documents. You can also upload a tailored dictionary file to the plug-in. For more information, see Search for a document and Configure dictionaries. If the analysis-aliws plug-in does not deliver the expected results, see Test the analyzer and Test the tokenizer to locate the cause.

Prerequisites

The analysis-aliws plug-in is installed. The plug-in is not installed by default.

If the plug-in is not installed, install it. Make sure that each data node in your Elasticsearch cluster offers at least 4 GiB of memory. If your cluster runs in a production environment, each data node must offer at least 8 GiB of memory. For more information about how to install the analysis-aliws plug-in, see Install and remove a built-in plug-in.

Notice
  • If the memory size of data nodes in your cluster does not meet the preceding requirements, upgrade the configuration of your cluster. For more information, see Upgrade the configuration of a cluster.

Limits

Elasticsearch V5.X clusters do not support the analysis-aliws plug-in.

Search for a document

  1. Log on to the Kibana console of your Elasticsearch cluster.
    For more information, see Log on to the Kibana console.
    Note In this example, an Elasticsearch V6.7.0 cluster is used. Operations on clusters of other versions may differ from those described in this topic.
  2. In the left-side navigation pane of the page that appears, click Dev Tools.
  3. On the Console tab of the page that appears, run one of the following commands to create an index:
    • Command for Elasticsearch clusters of versions earlier than V7.0
      PUT /index
      {
         "mappings": {
              "fulltext": {
                  "properties": {
                      "content": {
                          "type": "text",
                          "analyzer": "aliws"
                      }
                  }
              }
          }
      }
    • Command for Elasticsearch clusters of V7.0 or later
      PUT /index
      {
        "mappings": {
          "properties": {
              "content": {
                  "type": "text",
                  "analyzer": "aliws"
                }
            }
        }
      }

    In this example, an index named index is created. In versions earlier than V7.0, the document type of the index is fulltext. In V7.0 or later, mapping types are removed and the _doc type is used. The index contains a field named content whose type is text, and the aliws analyzer is specified for the field.

    If the command is successfully run, the following result is returned:
    {
      "acknowledged": true,
      "shards_acknowledged": true,
      "index": "index"
    }
  4. Run the following command to add a document:
    Notice The following command applies only to Elasticsearch clusters of versions earlier than V7.0. For an Elasticsearch cluster of V7.0 or later, you must change fulltext to _doc.
    POST /index/fulltext/1
    {
      "content": "I like go to school."
    }

    The preceding command adds a document whose ID is 1 and sets the content field of the document to I like go to school..

    If the command is successfully run, the following result is returned:
    {
      "_index": "index",
      "_type": "fulltext",
      "_id": "1",
      "_version": 1,
      "result": "created",
      "_shards": {
        "total": 2,
        "successful": 2,
        "failed": 0
      },
      "_seq_no": 0,
      "_primary_term": 1
    }
  5. Run the following command to search for the document:
    Notice The following command applies only to Elasticsearch clusters of versions earlier than V7.0. For an Elasticsearch cluster of V7.0 or later, you must change fulltext to _doc.
    GET /index/fulltext/_search
    {
      "query": {
        "match": {
          "content": "school"
        }
      }
    }

    The preceding command searches all documents of the fulltext type and returns the documents whose content field contains school. The aliws analyzer analyzes both the indexed content and the query string.

    If the command is successfully run, the following result is returned:
    {
      "took": 5,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 1,
        "max_score": 0.2876821,
        "hits": [
          {
            "_index": "index",
            "_type": "fulltext",
            "_id": "1",
            "_score": 0.2876821,
            "_source": {
              "content": "I like go to school."
            }
          }
        ]
      }
    }
Note If the analysis-aliws plug-in does not deliver the expected results, see Test the analyzer and Test the tokenizer to locate the cause.
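The two index-creation commands in step 3 differ only in whether the properties are nested under a mapping type. The following sketch is illustrative only (the helper name and version cutoff are assumptions based on the commands above) and builds the appropriate request body for either version:

```python
import json

def build_index_body(es_major_version: int, field: str = "content") -> dict:
    """Build an index-creation body that applies the aliws analyzer to a text field."""
    props = {field: {"type": "text", "analyzer": "aliws"}}
    if es_major_version < 7:
        # Versions earlier than V7.0 nest properties under a mapping type (fulltext here).
        return {"mappings": {"fulltext": {"properties": props}}}
    # V7.0 or later: mapping types are removed, so properties sit directly under mappings.
    return {"mappings": {"properties": props}}

print(json.dumps(build_index_body(7), indent=2))
```

You would send the returned body with a PUT /index request, as shown in step 3.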

Configure dictionaries

The analysis-aliws plug-in allows you to upload a tailored dictionary file named aliws_ext_dict.txt. After you upload the file, all nodes in your Elasticsearch cluster automatically load it. The system does not restart the cluster during this process.
Notice
  • After the analysis-aliws plug-in is installed, no default dictionary file is provided. You must manually upload a tailored dictionary file.
  • Before you upload a tailored dictionary file, you must name the dictionary file aliws_ext_dict.txt.
  1. Log on to the Elasticsearch console.
  2. In the left-side navigation pane, click Elasticsearch Clusters.
  3. Navigate to the desired cluster.
    1. In the top navigation bar, select the resource group to which the cluster belongs and the region where the cluster resides.
    2. In the left-side navigation pane, click Elasticsearch Clusters. On the Elasticsearch Clusters page, find the cluster and click its ID.
  4. In the left-side navigation pane of the page that appears, choose Configuration and Management > Plug-ins.
  5. On the Built-in Plug-ins tab, find the analysis-aliws plug-in and click Dictionary Configuration in the Actions column.
  6. In the Dictionary Configuration panel, click Configure in the lower-left corner.
  7. Select a method to upload the dictionary file. Then, upload the dictionary file based on the following instructions.
    Notice You can upload only one dictionary file, and the name of the dictionary file must be aliws_ext_dict.txt. If you want to update the aliws_ext_dict.txt dictionary file that is uploaded, click x next to the dictionary file name to delete the dictionary file. Then, upload another dictionary file that is named aliws_ext_dict.txt.
    The dictionary file must meet the following requirements:
    • Name: the file must be named aliws_ext_dict.txt.
    • Encoding: UTF-8.
    • Content: each line contains a single word followed by \n (the line feed used in UNIX and Linux), with no leading or trailing whitespace around the word. If the dictionary file is generated in Windows, convert it with the dos2unix tool before you upload it.
    You can use one of the following methods to upload a dictionary file:
    • TXT File: If you select this method, click Upload TXT File and select the file that you want to upload from your on-premises machine.
    • Add OSS File: If you select this method, configure the Bucket Name and File Name parameters, and click Add.

      Make sure that the bucket you specify resides in the same region as your Elasticsearch cluster. If the content of the dictionary that is stored in Object Storage Service (OSS) changes, you must manually upload the dictionary file again.

  8. Click Save.
    The system does not restart your cluster but performs a rolling update to make the uploaded dictionary file take effect. The update requires about 10 minutes.
    Note If you want to download the uploaded dictionary file, click the download icon that corresponds to the file.
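The dictionary file requirements in step 7 can be checked before upload. The following is a minimal sketch (the function name is an assumption; it enforces only the rules listed above: UTF-8 encoding, one word per line, LF line endings, and no surrounding whitespace):

```python
def normalize_dictionary(raw: bytes) -> list[str]:
    """Validate raw dictionary-file bytes and return the list of words."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError if the file is not UTF-8
    if "\r" in text:
        raise ValueError("CRLF line endings found; run dos2unix before uploading")
    words = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        if not line:
            continue  # skip blank lines, including the one after the final \n
        if line != line.strip():
            raise ValueError(f"line {lineno}: leading or trailing whitespace around word")
        words.append(line)
    return words
```

For example, normalize_dictionary(b"hello\nworld\n") returns ["hello", "world"], while a file with Windows line endings raises an error.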

Test the analyzer

Run the following command to test the aliws analyzer:

GET _analyze
{
  "text": "I like go to school.",
  "analyzer": "aliws"
}
If the command is successfully run, the following result is returned:
{
  "tokens" : [
    {
      "token" : "i",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "like",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "go",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "school",
      "start_offset" : 13,
      "end_offset" : 19,
      "type" : "word",
      "position" : 8
    }
  ]
}

Test the tokenizer

Run the following command to test the aliws_tokenizer tokenizer:

GET _analyze
{
  "text": "I like go to school.",
  "tokenizer": "aliws_tokenizer"
}
If the command is successfully run, the following result is returned:
{
  "tokens" : [
    {
      "token" : "I",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : " ",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "like",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : " ",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "go",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : " ",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "to",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : " ",
      "start_offset" : 12,
      "end_offset" : 13,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "school",
      "start_offset" : 13,
      "end_offset" : 19,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : ".",
      "start_offset" : 19,
      "end_offset" : 20,
      "type" : "word",
      "position" : 9
    }
  ]
}
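As a rough illustration of how the two outputs relate: the aliws analyzer result can be approximated by lowercasing the aliws_tokenizer result and dropping whitespace tokens, symbol tokens, and function words. The function-word list below is an assumption inferred from the sample outputs, not the plug-in's actual dictionary:

```python
# Illustration only: approximates the relationship between the aliws_tokenizer
# output and the aliws analyzer output shown in the preceding sections.
FUNCTION_WORDS = {"to"}  # assumption inferred from the sample outputs

def approximate_analyzer(tokenizer_tokens: list[str]) -> list[str]:
    result = []
    for tok in tokenizer_tokens:
        low = tok.lower()
        if not low.strip():                    # drop whitespace-only tokens
            continue
        if not any(c.isalnum() for c in low):  # drop symbol tokens such as "."
            continue
        if low in FUNCTION_WORDS:              # drop function words
            continue
        result.append(low)
    return result

tokens = ["I", " ", "like", " ", "go", " ", "to", " ", "school", "."]
print(approximate_analyzer(tokens))  # ['i', 'like', 'go', 'school']
```

The output matches the token list returned by the aliws analyzer in the Test the analyzer section.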

FAQ