After you specify a tokenization method for a TEXT field, Tablestore tokenizes the values of the field into multiple tokens based on the specified method. Tokenization cannot be configured for fields of non-TEXT types.

You can use match query and match phrase query to query TEXT data. Term query, terms query, prefix query, and wildcard query are also used in a few scenarios.
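
The following snippet is a minimal sketch of a match query against a TEXT field, assuming the Tablestore Java SDK. The placeholder values (<endpoint>, <accessKeyId>, <accessKeySecret>, <instanceName>, <tableName>, and <indexName>) and the field name Col_Text are illustrative and must be replaced with your own values.

```java
import com.alicloud.openservices.tablestore.SyncClient;
import com.alicloud.openservices.tablestore.model.search.SearchQuery;
import com.alicloud.openservices.tablestore.model.search.SearchRequest;
import com.alicloud.openservices.tablestore.model.search.SearchResponse;
import com.alicloud.openservices.tablestore.model.search.query.MatchQuery;

public class MatchQueryExample {
    public static void main(String[] args) {
        SyncClient client = new SyncClient("<endpoint>", "<accessKeyId>",
                "<accessKeySecret>", "<instanceName>");

        // Match query on a TEXT field. The keyword is tokenized with the same
        // analyzer as the field, so with the default single-word tokenization the
        // keywords "hang", "HANG", and "Hang" all match rows that contain "Hang Zhou".
        MatchQuery matchQuery = new MatchQuery();
        matchQuery.setFieldName("Col_Text");
        matchQuery.setText("hang");

        SearchQuery searchQuery = new SearchQuery();
        searchQuery.setQuery(matchQuery);
        searchQuery.setGetTotalCount(true);

        SearchRequest searchRequest = new SearchRequest("<tableName>", "<indexName>", searchQuery);
        SearchResponse response = client.search(searchRequest);
        System.out.println("Total matched rows: " + response.getTotalCount());

        client.shutdown();
    }
}
```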

Tokenization methods

The following tokenization methods are supported (a configuration sketch that shows how to specify each method follows this list):

  • Single-word tokenization (SingleWord)

    This method applies to all natural languages such as Chinese, English, and Japanese. By default, the tokenization method for TEXT fields is single-word tokenization.

    After single-word tokenization is specified, Tablestore performs tokenization based on the following rules:
    • Chinese texts are tokenized based on each character.
    • Letters and digits are tokenized based on spaces and punctuation marks. Uppercase letters are converted to lowercase letters. For example, "Hang Zhou" is tokenized into "hang" and "zhou". To query the rows that contain "Hang Zhou", you can use match query or match phrase query and set the keyword to "hang", "HANG", or "Hang".
    • Alphanumeric strings such as model numbers are also separated by spaces and punctuation marks, but are not further split into smaller tokens. For example, "IPhone6" is kept as a single token. When you use match query or match phrase query, you must specify "iphone6". No results are returned if you specify "iphone".
    Single-word tokenization supports the following parameters:
    • caseSensitive: specifies whether to enable case sensitivity. The default value is false, which means that all letters are converted to lowercase letters. If you do not need Tablestore to convert letters to lowercase letters, set the caseSensitive parameter to true.
    • delimitWord: specifies whether to separate letters from digits in alphanumeric strings. The default value is false. If you set the delimitWord parameter to true, "iPhone6" is tokenized into "iPhone" and "6".

  • Delimiter tokenization (Split)

    Tablestore provides general dictionary-based tokenization. However, some industries require tokenization based on custom dictionaries. To address this need, Tablestore provides delimiter tokenization: you can tokenize the data by using your own method, join the tokens with a delimiter, and then write the data to Tablestore.

    Delimiter tokenization applies to all natural languages such as Chinese, English, and Japanese.

    After the tokenization method is set to delimiter tokenization, Tablestore tokenizes field values based on the specified delimiter and indexes the resulting tokens. For example, if a field value is "badminton,ping pong,rap" and the delimiter is set to a comma (,), the value is tokenized into "badminton", "ping pong", and "rap". When you use match query or match phrase query with the keyword "badminton", "ping pong", "rap", or "badminton,ping pong", the row can be retrieved.

    Delimiter tokenization supports the following parameter:
    • delimiter: the delimiter that is used to tokenize field values. The default value is a whitespace character.
      • When you create a search index, the delimiter that you specify for the field must be the same as the delimiter that is used in the written data. Otherwise, data may not be retrieved.
      • If you set the custom delimiter to a number sign (#), you must escape the delimiter.
  • Minimum semantic unit-based tokenization (MinWord)

    This method applies to the Chinese language in full-text search scenarios.

    After the tokenization method is set to minimum semantic unit-based tokenization, Tablestore tokenizes the TEXT field values into minimum semantic units when Tablestore performs a query.

  • Maximum semantic unit-based tokenization (MaxWord)

    This method applies to the Chinese language in full-text search scenarios.

    After the tokenization method is set to maximum semantic unit-based tokenization, Tablestore tokenizes the TEXT field values into the maximum number of semantic units when Tablestore performs a query. However, the semantic units may overlap. As a result, the total length of the tokens is greater than the length of the original text, and the size of the index increases.

    This method generates more tokens and increases the probability that query results are returned. However, it also significantly increases the size of the index. Match phrase query tokenizes the query text in the same manner. Because the tokens can overlap, match phrase query may fail to retrieve data. Therefore, match query is more suitable for this tokenization method.

  • Fuzzy tokenization

    This method applies to all natural languages such as Chinese, English, and Japanese in scenarios where the text is short, such as titles, movie titles, and book titles.

    This method provides low query latency. However, it significantly increases the size of the index. Therefore, this tokenization method is suitable only for short text.

    After the tokenization method is set to fuzzy tokenization, Tablestore performs tokenization by using n-grams whose lengths range from minChars to maxChars characters. This method is suitable for scenarios such as search suggestions in a drop-down list.

    • Limits
      • You can use fuzzy tokenization to tokenize only TEXT field values that are smaller than 256 characters in length. If a value exceeds the limit, Tablestore truncates and discards the excess characters and retains only the first 256 characters.
        Note If the length of the string that you specify to query data exceeds the limit, you can split the string into multiple substrings whose maximum length is the maxChars value. Then, you can use BoolQuery (mustQueries) to query the rows that meet the conditions, as shown in the sketch after this list.
      • To prevent excessive expansion of index data, the difference between the values of maxChars and minChars must not exceed 6.
    • Parameters
      • minChars: the minimum number of characters in a token. The default value is 1. We recommend that you set this parameter to 2.
      • maxChars: the maximum number of characters in a token. The default value is 3.
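
The following snippet is a minimal configuration sketch that shows how the preceding tokenization methods and their parameters might be specified when a search index is created, assuming the Tablestore Java SDK. The field names (Col_SingleWord, Col_Split, Col_MinWord, Col_MaxWord, and Col_Fuzzy), the placeholder values, and the parameter values are illustrative, and the exact class names, such as SingleWordAnalyzerParameter, may vary with the SDK version.

```java
import java.util.Arrays;

import com.alicloud.openservices.tablestore.SyncClient;
import com.alicloud.openservices.tablestore.model.search.CreateSearchIndexRequest;
import com.alicloud.openservices.tablestore.model.search.FieldSchema;
import com.alicloud.openservices.tablestore.model.search.FieldType;
import com.alicloud.openservices.tablestore.model.search.IndexSchema;
import com.alicloud.openservices.tablestore.model.search.analysis.FuzzyAnalyzerParameter;
import com.alicloud.openservices.tablestore.model.search.analysis.SingleWordAnalyzerParameter;
import com.alicloud.openservices.tablestore.model.search.analysis.SplitAnalyzerParameter;

public class AnalyzerConfigExample {
    public static void main(String[] args) {
        SyncClient client = new SyncClient("<endpoint>", "<accessKeyId>",
                "<accessKeySecret>", "<instanceName>");

        // Single-word tokenization with the optional caseSensitive and delimitWord parameters.
        FieldSchema singleWordField = new FieldSchema("Col_SingleWord", FieldType.TEXT);
        singleWordField.setIndex(true);
        singleWordField.setAnalyzer(FieldSchema.Analyzer.SingleWord);
        singleWordField.setAnalyzerParameter(new SingleWordAnalyzerParameter(false, true));

        // Delimiter tokenization: the delimiter must match the one used in the written data.
        FieldSchema splitField = new FieldSchema("Col_Split", FieldType.TEXT);
        splitField.setIndex(true);
        splitField.setAnalyzer(FieldSchema.Analyzer.Split);
        splitField.setAnalyzerParameter(new SplitAnalyzerParameter(","));

        // Minimum and maximum semantic unit-based tokenization take no parameters.
        FieldSchema minWordField = new FieldSchema("Col_MinWord", FieldType.TEXT);
        minWordField.setIndex(true);
        minWordField.setAnalyzer(FieldSchema.Analyzer.MinWord);

        FieldSchema maxWordField = new FieldSchema("Col_MaxWord", FieldType.TEXT);
        maxWordField.setIndex(true);
        maxWordField.setAnalyzer(FieldSchema.Analyzer.MaxWord);

        // Fuzzy tokenization: n-gram lengths range from minChars to maxChars.
        FieldSchema fuzzyField = new FieldSchema("Col_Fuzzy", FieldType.TEXT);
        fuzzyField.setIndex(true);
        fuzzyField.setAnalyzer(FieldSchema.Analyzer.Fuzzy);
        fuzzyField.setAnalyzerParameter(new FuzzyAnalyzerParameter(2, 3));

        IndexSchema indexSchema = new IndexSchema();
        indexSchema.setFieldSchemas(Arrays.asList(
                singleWordField, splitField, minWordField, maxWordField, fuzzyField));

        CreateSearchIndexRequest request = new CreateSearchIndexRequest("<tableName>", "<indexName>");
        request.setIndexSchema(indexSchema);
        client.createSearchIndex(request);

        client.shutdown();
    }
}
```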
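
The following snippet is a hypothetical sketch of the workaround that is described in the limits of fuzzy tokenization: a query string that is longer than maxChars is split into substrings of at most maxChars characters, and the substrings are combined by using the mustQueries clause of BoolQuery. The use of a match phrase query for each substring, the field name Col_Fuzzy, and the placeholder values are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import com.alicloud.openservices.tablestore.SyncClient;
import com.alicloud.openservices.tablestore.model.search.SearchQuery;
import com.alicloud.openservices.tablestore.model.search.SearchRequest;
import com.alicloud.openservices.tablestore.model.search.SearchResponse;
import com.alicloud.openservices.tablestore.model.search.query.BoolQuery;
import com.alicloud.openservices.tablestore.model.search.query.MatchPhraseQuery;
import com.alicloud.openservices.tablestore.model.search.query.Query;

public class FuzzyLongKeywordExample {
    // Split a long keyword into substrings of at most maxChars characters.
    static List<String> splitKeyword(String keyword, int maxChars) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < keyword.length(); i += maxChars) {
            parts.add(keyword.substring(i, Math.min(i + maxChars, keyword.length())));
        }
        return parts;
    }

    public static void main(String[] args) {
        SyncClient client = new SyncClient("<endpoint>", "<accessKeyId>",
                "<accessKeySecret>", "<instanceName>");

        int maxChars = 3; // Must match the maxChars value of the fuzzy tokenization settings.
        String keyword = "a keyword that is longer than maxChars";

        // Build one match phrase query per substring and combine them with mustQueries.
        List<Query> mustQueries = new ArrayList<>();
        for (String part : splitKeyword(keyword, maxChars)) {
            MatchPhraseQuery phraseQuery = new MatchPhraseQuery();
            phraseQuery.setFieldName("Col_Fuzzy");
            phraseQuery.setText(part);
            mustQueries.add(phraseQuery);
        }
        BoolQuery boolQuery = new BoolQuery();
        boolQuery.setMustQueries(mustQueries);

        SearchQuery searchQuery = new SearchQuery();
        searchQuery.setQuery(boolQuery);
        SearchRequest searchRequest = new SearchRequest("<tableName>", "<indexName>", searchQuery);
        SearchResponse response = client.search(searchRequest);
        System.out.println("Rows returned: " + response.getRows().size());

        client.shutdown();
    }
}
```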

Comparison

The following table compares the preceding tokenization methods (SingleWord, Split, MinWord, MaxWord, and Fuzzy) in different dimensions.
Dimension             SingleWord   Split    MinWord   MaxWord             Fuzzy
Index expansion       Medium       Small    Small     Large               Huge
Relevance             Weak         Weak     Medium    Relatively strong   Relatively strong
Applicable languages  All          All      Chinese   Chinese             All
Size limit            None         None     None      None                256 characters
Recall rate           High         Low      Low       Medium              Medium