Tablestore: Tokenization

Last Updated: Jun 19, 2024

After you specify a tokenization method for a TEXT field, Tablestore tokenizes the field values into tokens based on the method that you configure. Tokenization methods can be specified only for TEXT fields.

Background information

You can use match query (MatchQuery) and match phrase query (MatchPhraseQuery) to query TEXT data. You can also use term query (TermQuery), terms query (TermsQuery), prefix query (PrefixQuery), and wildcard query (WildcardQuery) based on your business scenario.
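
The following example is a minimal sketch, based on the Tablestore SDK for Python, of how a match query can be run against a TEXT field. The endpoint, credentials, table name, index name, and field name are placeholders, not values from this topic.

```python
# -*- coding: utf-8 -*-
from tablestore import (OTSClient, SearchQuery, MatchQuery,
                        ColumnsToGet, ColumnReturnType)

# Placeholder endpoint, credentials, and instance name.
client = OTSClient('https://my-instance.cn-hangzhou.ots.aliyuncs.com',
                   '<ACCESS_KEY_ID>', '<ACCESS_KEY_SECRET>', 'my-instance')

# Match query against a TEXT field named Col_Text. The keyword is
# tokenized with the same tokenization method that is configured for the field.
query = MatchQuery('Col_Text', 'hang zhou')
search_query = SearchQuery(query, limit=100, get_total_count=True)

# Depending on the SDK version, search() returns a SearchResponse object
# or a (rows, next_token, total_count, is_all_succeed) tuple.
result = client.search('my_table', 'my_index', search_query,
                       ColumnsToGet(return_type=ColumnReturnType.ALL))
```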

Tokenization methods

The following tokenization methods are supported:

Single-word tokenization (SingleWord)

This tokenization method applies to all natural languages such as Chinese, English, and Japanese. By default, the tokenization method for TEXT fields is single-word tokenization.

After single-word tokenization is specified, Tablestore performs tokenization based on the following rules:

  • Chinese texts are tokenized based on each Chinese character. For example, "杭州" is tokenized into "杭" and "州". You can use match query (MatchQuery) or match phrase query (MatchPhraseQuery) and set the keyword to "杭" to query the data that contains "杭州".

  • Letters or digits are tokenized based on spaces or punctuation marks. Uppercase letters are converted to lowercase letters. For example, "Hang Zhou" is tokenized into "hang" and "zhou". You can use match query (MatchQuery) or match phrase query (MatchPhraseQuery) and set the keyword to "hang", "HANG", or "Hang" to query the rows that contain "Hang Zhou".

  • Alphanumeric strings, such as model numbers, are also separated by spaces or punctuation marks, but they are not split further into smaller tokens. For example, "IPhone6" is tokenized only into "iphone6". When you use match query (MatchQuery) or match phrase query (MatchPhraseQuery), you must specify "iphone6". No results are returned if you specify "iphone".

The following parameters are supported for single-word tokenization:

  • caseSensitive: specifies whether tokens are case-sensitive. Default value: false. If this parameter is set to false, all letters are converted to lowercase letters. If you do not want Tablestore to convert letters to lowercase letters, set this parameter to true.

  • delimitWord: specifies whether to split alphanumeric strings into separate letter and digit tokens. Default value: false. If you set this parameter to true, letters and digits are separated. For example, "iphone6" is tokenized into "iphone" and "6".
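
The following sketch assumes the Tablestore SDK for Python and its AnalyzerType and SingleWordAnalyzerParameter classes. It shows how a TEXT field might be configured with single-word tokenization and the delimitWord option when a search index is created. The table, index, and field names are placeholders.

```python
from tablestore import (OTSClient, SearchIndexMeta, FieldSchema, FieldType,
                        AnalyzerType, SingleWordAnalyzerParameter)

# Placeholder endpoint, credentials, and instance name.
client = OTSClient('https://my-instance.cn-hangzhou.ots.aliyuncs.com',
                   '<ACCESS_KEY_ID>', '<ACCESS_KEY_SECRET>', 'my-instance')

# TEXT field that uses single-word tokenization.
# case_sensitive=False keeps the default lowercase conversion.
# delimit_word=True splits "iphone6" into "iphone" and "6".
field = FieldSchema('Col_Text', FieldType.TEXT, index=True,
                    analyzer=AnalyzerType.SINGLEWORD,
                    analyzer_parameter=SingleWordAnalyzerParameter(
                        case_sensitive=False, delimit_word=True))

client.create_search_index('my_table', 'my_index', SearchIndexMeta([field]))
```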

Delimiter tokenization (Split)

Tablestore provides general dictionary-based tokenization. However, some industries require custom dictionaries for tokenization. To meet this requirement, Tablestore provides delimiter tokenization: you can tokenize content by using your own method, join the tokens with a delimiter, write the data to Tablestore, and then configure the same delimiter for the field in the search index.

Delimiter tokenization applies to all natural languages such as Chinese, English, and Japanese.

After you configure delimiter tokenization for a TEXT field, Tablestore converts the tokens to lowercase letters. If you use MatchQuery or MatchPhraseQuery to query TEXT fields, Tablestore automatically converts the keywords to lowercase letters, so the keywords are not case-sensitive. However, if you use a query method that is not designed for TEXT fields, such as TermQuery, the keywords are case-sensitive. In this case, you must manually convert the keywords to lowercase letters before you query data, as shown in the sketch that follows.
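
The following sketch (again assuming the Tablestore SDK for Python) lowercases the keyword before a term query is built against a field that uses delimiter tokenization. The field name and keyword are illustrative; the query can be passed to client.search() as shown in the earlier example.

```python
from tablestore import SearchQuery, TermQuery

keyword = 'Ping Pong'

# Tokens are stored in lowercase, so the keyword must be lowercased
# manually before it is used in a term query on the TEXT field.
term_query = TermQuery('Col_Tags', keyword.lower())
search_query = SearchQuery(term_query, limit=100, get_total_count=True)
```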

After the tokenization method is set to delimiter tokenization, Tablestore tokenizes field values based on the specified delimiter. For example, if the delimiter is a comma (,), the field value "badminton,ping pong,rap" is tokenized into "badminton", "ping pong", and "rap", and the tokens are indexed. When you use match query (MatchQuery) or match phrase query (MatchPhraseQuery) with the keyword "badminton", "ping pong", "rap", or "badminton,ping pong", the row is returned.

The following parameter is supported for delimiter tokenization:

  • delimiter: the delimiter that is used for tokenization. By default, a whitespace character is used. You can specify a custom delimiter. Take note of the following items:

    • When you create a search index, the delimiter that you specify for the field must be the same as the delimiter that is used in the values of the column in the data table. Otherwise, data may not be returned.

    • If the custom delimiter is a special character such as a number sign (#) or a tilde (~), prefix the delimiter with an escape character (\). Example: \#.
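
The following sketch assumes the Tablestore SDK for Python with AnalyzerType.SPLIT and SplitAnalyzerParameter. It configures a TEXT field with a comma delimiter to match the example above and then runs a match query. The table, index, and field names are placeholders.

```python
from tablestore import (OTSClient, SearchIndexMeta, FieldSchema, FieldType,
                        AnalyzerType, SplitAnalyzerParameter,
                        SearchQuery, MatchQuery)

# Placeholder endpoint, credentials, and instance name.
client = OTSClient('https://my-instance.cn-hangzhou.ots.aliyuncs.com',
                   '<ACCESS_KEY_ID>', '<ACCESS_KEY_SECRET>', 'my-instance')

# TEXT field that uses delimiter tokenization with a comma (,) delimiter.
# The delimiter must match the one that is used in the column values,
# for example "badminton,ping pong,rap".
field = FieldSchema('Col_Tags', FieldType.TEXT, index=True,
                    analyzer=AnalyzerType.SPLIT,
                    analyzer_parameter=SplitAnalyzerParameter(delimiter=','))
client.create_search_index('my_table', 'my_index', SearchIndexMeta([field]))

# Query for one of the tokens. "ping pong" matches the row in the example.
search_query = SearchQuery(MatchQuery('Col_Tags', 'ping pong'),
                           limit=100, get_total_count=True)
result = client.search('my_table', 'my_index', search_query)
```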

Minimum semantic unit-based tokenization (MinWord)

This tokenization method applies to the Chinese language in full-text search scenarios.

After minimum semantic unit-based tokenization is specified as the tokenization method, Tablestore tokenizes the values of TEXT fields into the minimum number of semantic units when Tablestore performs a query.

Maximum semantic unit-based tokenization (MaxWord)

This tokenization method applies to the Chinese language in full-text search scenarios.

After maximum semantic unit-based tokenization is specified as the tokenization method, Tablestore tokenizes the values of TEXT fields into the maximum number of semantic units when Tablestore performs a query. However, different semantic units may contain the same characters, so the total length of the tokens exceeds the length of the original text and the data volume of the index increases.

This tokenization method generates more tokens and increases the probability that rows are matched. However, it also greatly increases the index size. Match query (MatchQuery) is more suitable for this tokenization method. You can still use match phrase query (MatchPhraseQuery), but because the keyword is also tokenized based on maximum semantic units, overlapping tokens may prevent some matching rows from being returned.
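
The following sketch assumes AnalyzerType.MINWORD and AnalyzerType.MAXWORD in the Tablestore SDK for Python. It shows how TEXT fields might be configured with minimum or maximum semantic unit-based tokenization; neither method takes additional analyzer parameters. The table, index, and field names are placeholders.

```python
from tablestore import (OTSClient, SearchIndexMeta, FieldSchema, FieldType,
                        AnalyzerType)

# Placeholder endpoint, credentials, and instance name.
client = OTSClient('https://my-instance.cn-hangzhou.ots.aliyuncs.com',
                   '<ACCESS_KEY_ID>', '<ACCESS_KEY_SECRET>', 'my-instance')

# Minimum semantic unit-based tokenization: fewest tokens, smaller index.
min_word_field = FieldSchema('Col_Title', FieldType.TEXT, index=True,
                             analyzer=AnalyzerType.MINWORD)

# Maximum semantic unit-based tokenization: more overlapping tokens,
# a larger index, and a higher probability that a match query finds a row.
max_word_field = FieldSchema('Col_Content', FieldType.TEXT, index=True,
                             analyzer=AnalyzerType.MAXWORD)

client.create_search_index('my_table', 'my_index',
                           SearchIndexMeta([min_word_field, max_word_field]))
```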

Fuzzy tokenization

This tokenization method applies to all natural languages such as Chinese, English, and Japanese in scenarios that involve short text content, such as titles, movie names, book titles, file names, and directory names.

The combination of fuzzy tokenization and match phrase query returns results at a low latency and outperforms wildcard query (WildcardQuery). However, the index size is greatly increased.

After the tokenization method is set to fuzzy tokenization, Tablestore performs n-gram tokenization and generates tokens whose lengths range from minChars to maxChars characters. For example, if minChars is 1 and maxChars is 2, "abc" is tokenized into "a", "b", "c", "ab", and "bc". This tokenization method is suitable for scenarios such as search suggestions in a drop-down list.

Fuzzy tokenization converts the field values into lowercase letters. If you use MatchPhraseQuery to query TEXT fields, Tablestore automatically converts the keywords to lowercase letters for querying. Similar to the LIKE operator in SQL, keywords are not case-sensitive.

Important

To perform a fuzzy query, you must perform a match phrase query (MatchPhraseQuery) on the columns for which fuzzy tokenization is used. If you have more query requirements on the column, use the virtual column feature. For more information about the virtual column feature, see Virtual columns.

  • Limits

    • Fuzzy tokenization can tokenize only the first 1,024 characters of a TEXT field value. If a value exceeds 1,024 characters in length, Tablestore truncates and discards the excess characters and tokenizes only the first 1,024 characters.

    • To prevent the index from growing excessively, the difference between the values of maxChars and minChars cannot exceed 6.

  • Parameters

    • minChars: the minimum number of characters in a token. Default value: 1.

    • maxChars: the maximum number of characters in a token. Default value: 7.
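
The following sketch assumes AnalyzerType.FUZZY and FuzzyAnalyzerParameter in the Tablestore SDK for Python. It configures fuzzy tokenization with custom minChars and maxChars values and runs the match phrase query that this tokenization method requires. The table, index, field, and keyword values are placeholders.

```python
from tablestore import (OTSClient, SearchIndexMeta, FieldSchema, FieldType,
                        AnalyzerType, FuzzyAnalyzerParameter,
                        SearchQuery, MatchPhraseQuery)

# Placeholder endpoint, credentials, and instance name.
client = OTSClient('https://my-instance.cn-hangzhou.ots.aliyuncs.com',
                   '<ACCESS_KEY_ID>', '<ACCESS_KEY_SECRET>', 'my-instance')

# Fuzzy tokenization: n-gram tokens of 1 to 5 characters.
# The difference between max_chars and min_chars must not exceed 6,
# and only the first 1,024 characters of a value are tokenized.
field = FieldSchema('Col_FileName', FieldType.TEXT, index=True,
                    analyzer=AnalyzerType.FUZZY,
                    analyzer_parameter=FuzzyAnalyzerParameter(min_chars=1,
                                                              max_chars=5))
client.create_search_index('my_table', 'my_index', SearchIndexMeta([field]))

# Query the column with a match phrase query, which behaves like a
# case-insensitive LIKE '%keyword%' search.
search_query = SearchQuery(MatchPhraseQuery('Col_FileName', 'report 2024'),
                           limit=100, get_total_count=True)
result = client.search('my_table', 'my_index', search_query)
```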

Comparison

The following table compares the tokenization methods. The short names SingleWord, Split, MinWord, MaxWord, and Fuzzy refer to single-word, delimiter, minimum semantic unit-based, maximum semantic unit-based, and fuzzy tokenization.

Item                 SingleWord   Split    MinWord   MaxWord             Fuzzy
Index increase       Small        Small    Small     Medium              Large
Relevance            Weak         Weak     Medium    Relatively strong   Relatively strong
Applicable language  All          All      Chinese   Chinese             All
Length limit         None         None     None      None                1,024 characters
Recall rate          High         Low      Low       Medium              High