After you specify a tokenization method for a TEXT field, Tablestore tokenizes the values of the field into multiple tokens based on the specified method. You cannot specify tokenization methods for non-TEXT fields.

You can use match query (MatchQuery) and match phrase query (MatchPhraseQuery) to query TEXT data. You can also use term query (TermQuery), terms query (TermsQuery), prefix query (PrefixQuery), and wildcard query (WildcardQuery) based on your business scenario.
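For reference, the following sketch shows a match query against a TEXT field. It is based on the Tablestore Java SDK; the endpoint, credentials, table name (exampletable), index name (exampleindex), and field name (Col_Text) are placeholders, and class or method names may differ across SDK versions.

```java
import com.alicloud.openservices.tablestore.SyncClient;
import com.alicloud.openservices.tablestore.model.search.SearchQuery;
import com.alicloud.openservices.tablestore.model.search.SearchRequest;
import com.alicloud.openservices.tablestore.model.search.SearchResponse;
import com.alicloud.openservices.tablestore.model.search.query.MatchQuery;

public class MatchQueryExample {
    public static void main(String[] args) {
        // Placeholder endpoint and credentials for illustration only.
        SyncClient client = new SyncClient("https://instance.region.ots.aliyuncs.com",
                "<accessKeyId>", "<accessKeySecret>", "instance");

        // Match query: the keyword is tokenized with the same method as the field.
        MatchQuery matchQuery = new MatchQuery();
        matchQuery.setFieldName("Col_Text"); // a TEXT field in the search index
        matchQuery.setText("hang");          // keyword to match

        SearchQuery searchQuery = new SearchQuery();
        searchQuery.setQuery(matchQuery);
        searchQuery.setGetTotalCount(true);

        SearchRequest request = new SearchRequest("exampletable", "exampleindex", searchQuery);
        SearchResponse response = client.search(request);
        System.out.println("Total matched rows: " + response.getTotalCount());

        client.shutdown();
    }
}
```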

Tokenization methods

The following tokenization methods are supported: single-word tokenization, delimiter tokenization, minimum semantic unit-based tokenization, maximum semantic unit-based tokenization, and fuzzy tokenization.

Single-word tokenization (SingleWord)

This tokenization method applies to all natural languages such as Chinese, English, and Japanese. By default, the tokenization method for TEXT fields is single-word tokenization.

After single-word tokenization is specified, Tablestore performs tokenization based on the following rules:
  • Chinese text is tokenized character by character. For example, "杭州" is tokenized into "杭" and "州". You can use match query (MatchQuery) or match phrase query (MatchPhraseQuery) and set the keyword to "杭" to query the rows that contain "杭州".
  • Letters and digits are tokenized based on spaces and punctuation marks, and uppercase letters are converted to lowercase letters. For example, "Hang Zhou" is tokenized into "hang" and "zhou". You can use match query (MatchQuery) or match phrase query (MatchPhraseQuery) and set the keyword to "hang", "HANG", or "Hang" to query the rows that contain "Hang Zhou".
  • Strings that mix letters and digits, such as model numbers, are also separated by spaces and punctuation marks, but are not further tokenized into smaller words. For example, "IPhone6" is tokenized only into "iphone6". When you use match query (MatchQuery) or match phrase query (MatchPhraseQuery), you must specify "iphone6" as the keyword. No results are returned if you specify "iphone".
The following table describes the parameters for single-word tokenization. For a configuration sketch, see the example after the table.

Parameter | Description
caseSensitive | Specifies whether to enable case sensitivity. Default value: false. If you set this parameter to false, all letters are converted to lowercase letters. If you do not need Tablestore to convert letters to lowercase letters, set this parameter to true.
delimitWord | Specifies whether to separate letters from digits in alphanumeric strings. Default value: false. If you set this parameter to true, "iphone6" is tokenized into "iphone" and "6".
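The following sketch shows how a TEXT field that uses single-word tokenization with these parameters might be configured when you create a search index. It is based on the Tablestore Java SDK; the table name (exampletable), index name (exampleindex), and field name (Col_Text) are placeholders.

```java
import com.alicloud.openservices.tablestore.SyncClient;
import com.alicloud.openservices.tablestore.model.search.CreateSearchIndexRequest;
import com.alicloud.openservices.tablestore.model.search.FieldSchema;
import com.alicloud.openservices.tablestore.model.search.FieldType;
import com.alicloud.openservices.tablestore.model.search.IndexSchema;
import com.alicloud.openservices.tablestore.model.search.analysis.SingleWordAnalyzerParameter;

import java.util.Collections;

public class CreateSingleWordIndex {
    public static void createIndex(SyncClient client) {
        // TEXT field that uses single-word tokenization with explicit parameters.
        FieldSchema fieldSchema = new FieldSchema("Col_Text", FieldType.TEXT); // placeholder field name
        fieldSchema.setIndex(true);
        fieldSchema.setAnalyzer(FieldSchema.Analyzer.SingleWord);

        SingleWordAnalyzerParameter parameter = new SingleWordAnalyzerParameter();
        parameter.setCaseSensitive(false); // keep lowercase conversion
        parameter.setDelimitWord(true);    // tokenize "iphone6" into "iphone" and "6"
        fieldSchema.setAnalyzerParameter(parameter);

        IndexSchema indexSchema = new IndexSchema();
        indexSchema.setFieldSchemas(Collections.singletonList(fieldSchema));

        CreateSearchIndexRequest request = new CreateSearchIndexRequest();
        request.setTableName("exampletable");
        request.setIndexName("exampleindex");
        request.setIndexSchema(indexSchema);
        client.createSearchIndex(request);
    }
}
```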

Delimiter tokenization (Split)

Tablestore provides general dictionary-based tokenization. However, some industries require custom dictionaries for tokenization. To meet this requirement, Tablestore provides delimiter tokenization: you can tokenize data by using your own method, join the tokens with a delimiter, and then write the data to Tablestore.

Delimiter tokenization applies to all natural languages such as Chinese, English, and Japanese.

After the tokenization method is set to delimiter tokenization, the system tokenizes field values based on the specified delimiter. For example, if a field value is "badminton,ping pong,rap" and the delimiter is a comma (,), the value is tokenized into "badminton", "ping pong", and "rap", and the tokens are indexed. A match query (MatchQuery) or match phrase query (MatchPhraseQuery) for "badminton", "ping pong", "rap", or "badminton,ping pong" returns the row.

The following table describes the parameters for delimiter tokenization. For a configuration sketch, see the example after the table.

Parameter | Description
delimiter | The delimiter. Default value: a whitespace character. You can specify a custom delimiter. Take note of the following items:
  • The delimiter that you specify for field tokenization when you create a search index must be the same as the delimiter that is used in the values of the column in the data table. Otherwise, data may not be matched.
  • If the custom delimiter is a special character such as a number sign (#) or a tilde (~), prefix the delimiter with an escape character (\), for example, \#.
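The following sketch shows how delimiter tokenization might be configured for a field whose values are joined with commas, such as "badminton,ping pong,rap". It is based on the Tablestore Java SDK; the field name Col_Hobbies is a placeholder.

```java
import com.alicloud.openservices.tablestore.model.search.FieldSchema;
import com.alicloud.openservices.tablestore.model.search.FieldType;
import com.alicloud.openservices.tablestore.model.search.analysis.SplitAnalyzerParameter;

public class SplitAnalyzerExample {
    // Builds a TEXT field that uses delimiter tokenization (Split).
    public static FieldSchema splitField() {
        FieldSchema fieldSchema = new FieldSchema("Col_Hobbies", FieldType.TEXT); // placeholder field name
        fieldSchema.setIndex(true);
        fieldSchema.setAnalyzer(FieldSchema.Analyzer.Split);

        // The delimiter must match the one used in the column values,
        // for example a comma (,) in "badminton,ping pong,rap".
        SplitAnalyzerParameter parameter = new SplitAnalyzerParameter();
        parameter.setDelimiter(",");
        fieldSchema.setAnalyzerParameter(parameter);
        return fieldSchema;
    }
}
```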

Minimum semantic unit-based tokenization (MinWord)

This tokenization method applies to the Chinese language in full-text search scenarios.

After the tokenization method is set to minimum semantic unit-based tokenization, Tablestore tokenizes TEXT field values into the minimum number of semantic units.

Maximum semantic unit-based tokenization (MaxWord)

This tokenization method applies to the Chinese language in full-text search scenarios.

After the tokenization method is set to maximum semantic unit-based tokenization, Tablestore tokenizes TEXT field values into the maximum number of semantic units. Different semantic units may overlap, so the total length of the tokens is greater than the length of the original text, which increases the index size.

This tokenization method generates more tokens and increases the probability that rows are matched, but it also greatly increases the index size. Match query (MatchQuery) is more suitable for this tokenization method. You can still use match phrase query (MatchPhraseQuery), but because the keyword is also tokenized based on maximum semantic unit-based tokenization, overlapping tokens may prevent matching rows from being returned.
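Neither semantic unit-based method takes additional analyzer parameters; you only set the analyzer type on the field schema. The following is a minimal sketch based on the Tablestore Java SDK; the field name Col_Text is a placeholder.

```java
import com.alicloud.openservices.tablestore.model.search.FieldSchema;
import com.alicloud.openservices.tablestore.model.search.FieldType;

public class SemanticAnalyzerExample {
    // TEXT field that uses minimum semantic unit-based tokenization (MinWord).
    public static FieldSchema minWordField() {
        FieldSchema schema = new FieldSchema("Col_Text", FieldType.TEXT); // placeholder field name
        schema.setIndex(true);
        schema.setAnalyzer(FieldSchema.Analyzer.MinWord);
        return schema;
    }

    // TEXT field that uses maximum semantic unit-based tokenization (MaxWord).
    // Prefer MatchQuery over MatchPhraseQuery when you query this field.
    public static FieldSchema maxWordField() {
        FieldSchema schema = new FieldSchema("Col_Text", FieldType.TEXT); // placeholder field name
        schema.setIndex(true);
        schema.setAnalyzer(FieldSchema.Analyzer.MaxWord);
        return schema;
    }
}
```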

Fuzzy tokenization

This tokenization method applies to all natural languages such as Chinese, English, and Japanese in scenarios that involve short text content, such as titles, movie names, book titles, file names, and directory names.

The combination of fuzzy tokenization and match phrase query (MatchPhraseQuery) returns query results at low latency and outperforms wildcard query (WildcardQuery). However, the index size is greatly increased.

After the tokenization method is set to fuzzy tokenization, Tablestore performs tokenization by using the n-gram method and generates tokens whose lengths range from minChars to maxChars. This tokenization method is suitable for features such as search suggestions in a drop-down list.

Fuzzy tokenization converts the field values into lowercase letters. Therefore, fuzzy tokenization is case-insensitive and is similar to the LIKE operator in SQL.

To perform a fuzzy query, you must perform a match phrase query (MatchPhraseQuery) on the columns for which fuzzy tokenization is used. If you have other query requirements for the columns, use the virtual column feature. For more information, see Virtual columns.

  • Limits
    • Fuzzy tokenization can tokenize only TEXT field values that are up to 1,024 characters in length. If a field value exceeds 1,024 characters, Tablestore discards the excess characters and tokenizes only the first 1,024 characters.
    • To prevent an excessive increase in index size, the difference between the values of maxChars and minChars must not exceed 6.
  • Parameters
    The following table describes the parameters for fuzzy tokenization. For a configuration sketch, see the example after this list.
    Parameter | Description
    minChars | The minimum number of characters for a token. Default value: 1.
    maxChars | The maximum number of characters for a token. Default value: 7.
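The following sketch shows how fuzzy tokenization with these parameters might be configured. It is based on the Tablestore Java SDK; the field name Col_Title is a placeholder, and class or method names may differ across SDK versions.

```java
import com.alicloud.openservices.tablestore.model.search.FieldSchema;
import com.alicloud.openservices.tablestore.model.search.FieldType;
import com.alicloud.openservices.tablestore.model.search.analysis.FuzzyAnalyzerParameter;

public class FuzzyAnalyzerExample {
    // TEXT field that uses fuzzy tokenization with token lengths from 1 to 7 characters.
    public static FieldSchema fuzzyField() {
        FieldSchema schema = new FieldSchema("Col_Title", FieldType.TEXT); // placeholder field name
        schema.setIndex(true);
        schema.setAnalyzer(FieldSchema.Analyzer.Fuzzy);

        FuzzyAnalyzerParameter parameter = new FuzzyAnalyzerParameter();
        parameter.setMinChars(1); // minimum token length
        parameter.setMaxChars(7); // maximum token length; maxChars - minChars must not exceed 6
        schema.setAnalyzerParameter(parameter);
        return schema;
    }
}
```

Query this field by using match phrase query (MatchPhraseQuery), as described above.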

Comparison

The following table compares the tokenization methods.
Item | Single-word tokenization | Delimiter tokenization | Minimum semantic unit-based tokenization | Maximum semantic unit-based tokenization | Fuzzy tokenization
Index size increase | Small | Small | Small | Medium | Large
Relevance | Weak | Weak | Medium | Relatively strong | Relatively strong
Applicable languages | All | All | Chinese | Chinese | All
Length limit | None | None | None | None | 1,024 characters
Recall rate | High | Low | Low | Medium | Medium