After you specify a tokenization method for a TEXT field, Tablestore tokenizes the values of the field into multiple tokens based on that method. You cannot specify tokenization methods for non-TEXT fields.
Background information
In most cases, you can use match query (MatchQuery) and match phrase query (MatchPhraseQuery) to query a field of the TEXT type. You can also use term query (TermQuery), terms query (TermsQuery), prefix query (PrefixQuery), and wildcard query (WildcardQuery) to query a field of the TEXT type based on your business requirements.
Tokenization methods
The following tokenization methods are supported: single-word tokenization, delimiter tokenization, minimum semantic unit-based tokenization, maximum semantic unit-based tokenization, and fuzzy tokenization.
Single-word tokenization (SingleWord)
This tokenization method applies to all natural languages, such as Chinese, English, and Japanese. Single-word tokenization is the default tokenization method for TEXT fields.
After you specify single-word tokenization for a TEXT field, Tablestore performs tokenization based on the following rules:
Chinese texts are tokenized based on each Chinese character. For example, "杭州" is tokenized into "杭" and "州". You can use match query (MatchQuery) or match phrase query (MatchPhraseQuery) and specify "杭" as the keyword to query the data that contains "杭州".
Strings that consist of letters or digits are tokenized based on spaces and punctuation marks.
If you set the caseSensitive parameter to false, tokens are not case-sensitive. Tablestore converts all letters in tokens into lowercase letters and stores the lowercase tokens. For example, "Hang Zhou" is tokenized into "hang" and "zhou". You can use match query (MatchQuery) or match phrase query (MatchPhraseQuery) and specify "hang", "HANG", or "Hang" as the keyword to query the rows that contain "Hang Zhou".
If you set the caseSensitive parameter to true, tokens are case-sensitive. Tablestore stores the tokens in a case-sensitive manner. For example, "Hang Zhou" is tokenized into "Hang" and "Zhou". You can use match query (MatchQuery) or match phrase query (MatchPhraseQuery) and specify "Hang" or "Zhou" as the keyword to query the rows that contain "Hang Zhou".
Strings that contain both letters and digits, such as model numbers, are also tokenized based on spaces and punctuation marks. However, such a string is not split into smaller tokens unless you set the delimitWord parameter to true. For example, "IPhone6" is tokenized only into "IPhone6". When you use match query (MatchQuery) or match phrase query (MatchPhraseQuery), you must specify "IPhone6" as the keyword. No results are returned if you specify "IPhone" as the keyword.
The following table describes the parameters for single-word tokenization.
Parameter | Description |
caseSensitive | Specifies whether to enable case sensitivity. Default value: false. If you set this parameter to false, all letters are converted into lowercase letters. If you do not want Tablestore to convert letters into lowercase letters, set this parameter to true. |
delimitWord | Specifies whether to tokenize alphanumeric characters. Default value: false. You can set the delimitWord parameter to true to separate letters from digits. This way, "iphone6" is tokenized into "iphone" and "6". |
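The following Java snippet is a minimal sketch of how these parameters might be set when you add a TEXT field to a search index by using Tablestore SDK for Java. The table name, index name, and field name are placeholders, not values from this topic.

```java
import java.util.Collections;

import com.alicloud.openservices.tablestore.SyncClient;
import com.alicloud.openservices.tablestore.model.search.CreateSearchIndexRequest;
import com.alicloud.openservices.tablestore.model.search.FieldSchema;
import com.alicloud.openservices.tablestore.model.search.FieldType;
import com.alicloud.openservices.tablestore.model.search.IndexSchema;
import com.alicloud.openservices.tablestore.model.search.analysis.SingleWordAnalyzerParameter;

public class SingleWordAnalyzerExample {
    public static void createIndex(SyncClient client) {
        // Define a TEXT field that uses single-word tokenization.
        FieldSchema fieldSchema = new FieldSchema("Col_Text", FieldType.TEXT);
        fieldSchema.setIndex(true);
        fieldSchema.setAnalyzer(FieldSchema.Analyzer.SingleWord);
        // caseSensitive = false: tokens are converted into lowercase letters.
        // delimitWord = true: "iphone6" is tokenized into "iphone" and "6".
        fieldSchema.setAnalyzerParameter(new SingleWordAnalyzerParameter(false, true));

        IndexSchema indexSchema = new IndexSchema();
        indexSchema.setFieldSchemas(Collections.singletonList(fieldSchema));

        CreateSearchIndexRequest request = new CreateSearchIndexRequest();
        request.setTableName("exampleTable"); // placeholder table name
        request.setIndexName("exampleIndex"); // placeholder index name
        request.setIndexSchema(indexSchema);
        client.createSearchIndex(request);
    }
}
```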
Delimiter tokenization (Split)
Tablestore provides general dictionary-based tokenization. However, specific industries require custom dictionaries for tokenization. To meet this requirement, Tablestore provides delimiter tokenization: you can tokenize data by using a custom method, join the tokens with a delimiter, and then write the data to Tablestore.
Delimiter tokenization applies to all natural languages, such as Chinese, English, and Japanese.
After you specify delimiter tokenization for a TEXT field, Tablestore tokenizes field values based on the specified delimiter. For example, if a field value is "badminton,ping pong,rap" and you set the delimiter parameter to a comma (,), the value is tokenized into "badminton", "ping pong", and "rap", and the tokens are indexed. When you use match query (MatchQuery) or match phrase query (MatchPhraseQuery) and specify "badminton", "ping pong", "rap", or "badminton,ping pong" as the keyword, the row is returned.
The following table describes the parameters for delimiter tokenization.
Parameter | Description |
caseSensitive | Specifies whether to enable case sensitivity. Default value: false. If you set this parameter to false, all letters are converted into lowercase letters. If you do not want Tablestore to convert letters into lowercase letters, set this parameter to true. Important: Tablestore SDK for Java V5.17.2 or later supports this parameter. |
delimiter | The delimiter. Default value: a whitespace character. You can specify a custom delimiter. |
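As a minimal sketch that builds on the preceding example (the field name and the comma delimiter are assumptions), a Split analyzer might be configured as follows. The field schema is then added to the index schema in the same way as in the single-word example.

```java
import com.alicloud.openservices.tablestore.model.search.FieldSchema;
import com.alicloud.openservices.tablestore.model.search.FieldType;
import com.alicloud.openservices.tablestore.model.search.analysis.SplitAnalyzerParameter;

// Define a TEXT field that uses delimiter tokenization with a comma delimiter,
// so that "badminton,ping pong,rap" is tokenized into "badminton", "ping pong", and "rap".
FieldSchema fieldSchema = new FieldSchema("Col_Hobbies", FieldType.TEXT);
fieldSchema.setIndex(true);
fieldSchema.setAnalyzer(FieldSchema.Analyzer.Split);
fieldSchema.setAnalyzerParameter(new SplitAnalyzerParameter(","));
```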
Minimum semantic unit-based tokenization (MinWord)
This tokenization method applies to the Chinese language in full-text search scenarios.
After you specify minimum semantic unit-based tokenization as the tokenization method for a TEXT field, Tablestore tokenizes the values of the TEXT field into the minimum number of semantic units when Tablestore performs a query.
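Minimum semantic unit-based tokenization takes no analyzer-specific parameters. As a sketch (the field name is a placeholder), configuring it only requires setting the analyzer type on the field schema:

```java
import com.alicloud.openservices.tablestore.model.search.FieldSchema;
import com.alicloud.openservices.tablestore.model.search.FieldType;

// Define a TEXT field that uses minimum semantic unit-based tokenization.
// MinWord has no analyzer parameters, so only the analyzer type is set.
FieldSchema fieldSchema = new FieldSchema("Col_Text", FieldType.TEXT);
fieldSchema.setIndex(true);
fieldSchema.setAnalyzer(FieldSchema.Analyzer.MinWord);
```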
Maximum semantic unit-based tokenization (MaxWord)
This tokenization method applies to the Chinese language in full-text search scenarios.
After you specify maximum semantic unit-based tokenization as the tokenization method for a TEXT field, Tablestore tokenizes the values of the TEXT field into the maximum number of semantic units when Tablestore performs a query. However, different semantic units may contain the same characters. The total length of the tokens is longer than the length of the original text. As a result, the data volume of the index is increased.
This tokenization method generates more tokens and increases the probability that rows are matched. However, the index size is greatly increased. Match query (MatchQuery) is more suitable for this tokenization method than match phrase query (MatchPhraseQuery). If you use match phrase query together with this tokenization method, rows may not be returned because the keyword is also tokenized based on maximum semantic unit-based tokenization and the resulting tokens may overlap.
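The following sketch (table, index, and field names are placeholders) configures maximum semantic unit-based tokenization and queries the field by using match query, which suits this tokenization method:

```java
import com.alicloud.openservices.tablestore.SyncClient;
import com.alicloud.openservices.tablestore.model.search.FieldSchema;
import com.alicloud.openservices.tablestore.model.search.FieldType;
import com.alicloud.openservices.tablestore.model.search.SearchQuery;
import com.alicloud.openservices.tablestore.model.search.SearchRequest;
import com.alicloud.openservices.tablestore.model.search.SearchResponse;
import com.alicloud.openservices.tablestore.model.search.query.MatchQuery;

public class MaxWordExample {
    // Define a TEXT field that uses maximum semantic unit-based tokenization.
    // MaxWord has no analyzer parameters, so only the analyzer type is set.
    public static FieldSchema maxWordField() {
        FieldSchema fieldSchema = new FieldSchema("Col_Text", FieldType.TEXT);
        fieldSchema.setIndex(true);
        fieldSchema.setAnalyzer(FieldSchema.Analyzer.MaxWord);
        return fieldSchema;
    }

    // Query the field with MatchQuery; the keyword is tokenized with the same analyzer.
    public static SearchResponse query(SyncClient client) {
        MatchQuery matchQuery = new MatchQuery();
        matchQuery.setFieldName("Col_Text");
        matchQuery.setText("杭州");
        SearchQuery searchQuery = new SearchQuery();
        searchQuery.setQuery(matchQuery);
        SearchRequest request = new SearchRequest("exampleTable", "exampleIndex", searchQuery);
        return client.search(request);
    }
}
```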
Fuzzy tokenization
This tokenization method applies to all natural languages, such as Chinese, English, and Japanese, in scenarios that involve short text content, such as titles, movie names, book titles, file names, and directory names.
You can use fuzzy tokenization together with match phrase query to return query results at low latency. The combination of fuzzy tokenization and match phrase query outperforms wildcard query (WildcardQuery). However, the index size is greatly increased.
After you specify fuzzy tokenization as the tokenization method for a TEXT field, Tablestore performs tokenization by using n-grams. The number of characters in a token ranges from the value of the minChars parameter to the value of the maxChars parameter. For example, this tokenization method is suitable for suggestion features, such as populating a drop-down list of search suggestions as a user types.
To perform a fuzzy query, you must perform a match phrase query (MatchPhraseQuery) on the field for which fuzzy tokenization is used. If you have additional query requirements on the field, use the virtual column feature. For more information about the virtual column feature, see Virtual columns.
Limits
You can use fuzzy tokenization to tokenize a TEXT field value that is less than or equal to 1,024 characters in length. If the value exceeds 1,024 characters in length, Tablestore discards the excess characters and tokenizes only the first 1,024 characters.
To prevent an excessive increase of index data, the difference between the values of the maxChars and minChars parameters must not exceed 6.
Parameters
Parameter | Description |
minChars | The minimum number of characters for a token. Default value: 1. |
maxChars | The maximum number of characters for a token. Default value: 7. |
caseSensitive | Specifies whether to enable case sensitivity. Default value: false. If you set this parameter to false, all letters are converted into lowercase letters. If you do not want Tablestore to convert letters into lowercase letters, set this parameter to true. Important: Tablestore SDK for Java V5.17.2 or later supports this parameter. |
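Putting these parameters together, the following sketch (placeholder names; the parameter values are only examples) defines a fuzzy-tokenized field and queries it with the required match phrase query:

```java
import com.alicloud.openservices.tablestore.SyncClient;
import com.alicloud.openservices.tablestore.model.search.FieldSchema;
import com.alicloud.openservices.tablestore.model.search.FieldType;
import com.alicloud.openservices.tablestore.model.search.SearchQuery;
import com.alicloud.openservices.tablestore.model.search.SearchRequest;
import com.alicloud.openservices.tablestore.model.search.SearchResponse;
import com.alicloud.openservices.tablestore.model.search.analysis.FuzzyAnalyzerParameter;
import com.alicloud.openservices.tablestore.model.search.query.MatchPhraseQuery;

public class FuzzyAnalyzerExample {
    // Define a TEXT field that uses fuzzy (n-gram) tokenization. Tokens contain
    // between minChars = 1 and maxChars = 7 characters; the difference between
    // the two values must not exceed 6.
    public static FieldSchema fuzzyField() {
        FieldSchema fieldSchema = new FieldSchema("Col_Title", FieldType.TEXT);
        fieldSchema.setIndex(true);
        fieldSchema.setAnalyzer(FieldSchema.Analyzer.Fuzzy);
        fieldSchema.setAnalyzerParameter(new FuzzyAnalyzerParameter(1, 7));
        return fieldSchema;
    }

    // Fields that use fuzzy tokenization must be queried with MatchPhraseQuery.
    public static SearchResponse query(SyncClient client) {
        MatchPhraseQuery matchPhraseQuery = new MatchPhraseQuery();
        matchPhraseQuery.setFieldName("Col_Title");
        matchPhraseQuery.setText("pho"); // matches titles that contain "pho", such as "iphone"
        SearchQuery searchQuery = new SearchQuery();
        searchQuery.setQuery(matchPhraseQuery);
        SearchRequest request = new SearchRequest("exampleTable", "exampleIndex", searchQuery);
        return client.search(request);
    }
}
```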
Comparison
The following table compares the tokenization methods.
Item | Single-word tokenization | Delimiter tokenization | Minimum semantic unit-based tokenization | Maximum semantic unit-based tokenization | Fuzzy tokenization |
Index increase | Small | Small | Small | Medium | Large |
Relevance | Weak | Weak | Medium | Relatively strong | Relatively strong |
Applicable language | All | All | Chinese | Chinese | All |
Length limit | None | None | None | None | 1,024 characters |
Recall rate | High | Low | Low | Medium | High |