AnalyticDB for MySQL provides six built-in analyzers for full-text indexes: AliNLP, IK, Standard, Ngram, Edge_ngram, and Pattern. Each analyzer uses a different tokenization strategy — choose based on your language requirements, search pattern (fuzzy, prefix, or exact match), and whether you need Chinese text processing.
Choose an analyzer
| Analyzer | Best for | Language support | Custom dictionary |
|---|---|---|---|
| AliNLP | Chinese and multilingual NLP-based tokenization | Chinese, English, Indonesian, Malay, Thai, Vietnamese, French, Spanish | Yes |
| IK | Chinese text with configurable granularity | Chinese | Yes |
| Standard | English text with stop-word filtering | Language-specific rules | Yes |
| Ngram | Fuzzy substring search | Any | Yes |
| Edge_ngram | Prefix search and autocomplete | Any | Yes |
| Pattern | Custom delimiter-based tokenization via regular expressions | Any | No |
Default analyzer by cluster version:
Clusters earlier than V3.1.4.15: AliNLP analyzer
Clusters V3.1.4.15 or later: IK analyzer
To check your cluster's minor version, see How do I view the minor version of a cluster?
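If your cluster supports it, you can also query the version directly from a SQL client. The `adb_version()` function is an assumption here and may not be available on every minor version:

```sql
SELECT adb_version();
```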
Specify an analyzer
Syntax
```sql
FULLTEXT INDEX idx_name(`column_name`) [ WITH ANALYZER analyzer_name ] [ WITH DICT tbl_dict_name ];
```
Parameters
| Parameter | Description |
|---|---|
| idx_name | Name of the full-text index. |
| column_name | Name of the column to index. |
| WITH ANALYZER analyzer_name | Analyzer to use. Omit this clause to use the default analyzer. |
| WITH DICT tbl_dict_name | Custom dictionary to apply. For details, see Custom dictionaries for full-text indexes. |
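As a sketch, an index definition that combines both optional clauses might look like the following; the analyzer choice and the dictionary name `my_dict` are illustrative:

```sql
FULLTEXT INDEX fidx_body(`body`) WITH ANALYZER ik WITH DICT my_dict
```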
Example
The following statement creates a table with six full-text indexes, one per analyzer:
```sql
CREATE TABLE `tbl_fulltext_demo` (
  `id` int,
  `content` varchar,
  `content_alinlp` varchar,
  `content_ik` varchar,
  `content_standard` varchar,
  `content_ngram` varchar,
  `content_edge_ngram` varchar,
  FULLTEXT INDEX fidx_c(`content`), -- Default analyzer
  FULLTEXT INDEX fidx_alinlp(`content_alinlp`) WITH ANALYZER alinlp,
  FULLTEXT INDEX fidx_ik(`content_ik`) WITH ANALYZER ik,
  FULLTEXT INDEX fidx_standard(`content_standard`) WITH ANALYZER standard,
  FULLTEXT INDEX fidx_ngram(`content_ngram`) WITH ANALYZER ngram,
  FULLTEXT INDEX fidx_edge_ngram(`content_edge_ngram`) WITH ANALYZER edge_ngram,
  PRIMARY KEY (`id`)
) DISTRIBUTED BY HASH(id);
```
Preview tokenization results
Before committing to an analyzer, test how it tokenizes your text. Each analyzer has a dedicated test function that returns the token list for a given input string.
Prefix all tokenization test queries with the /*+ mode=two_phase*/ hint; without it, they do not execute correctly.
The following examples use the same input, 'Hello world', across all analyzers so you can compare their output directly:
| Analyzer | Test function | Output for 'Hello world' |
|---|---|---|
| AliNLP | fulltext_alinlp_test() | [hello, , world] |
| IK | fulltext_ik_test() | [hello, world, or] |
| Standard | fulltext_standard_test() | [hello, world] |
| Ngram (default token size: 2) | fulltext_ngram_test() | [he, el, ll, lo, o , w, wo, or, rl, ld] |
| Edge_ngram (min: 1, max: 2) | fulltext_edge_ngram_test() | [h, he] |
Usage:
```sql
/*+ mode=two_phase*/ SELECT fulltext_ik_test('Hello world');
```
Result:
[hello, world, or]
The Pattern analyzer does not support SQL-based tokenization testing.
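To compare analyzers side by side, you can call several test functions in one statement. This is a sketch; the exact token lists depend on your cluster version and configuration:

```sql
/*+ mode=two_phase*/ SELECT
  fulltext_standard_test('Hello world')   AS standard_tokens,
  fulltext_ngram_test('Hello world')      AS ngram_tokens,
  fulltext_edge_ngram_test('Hello world') AS edge_ngram_tokens;
```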
AliNLP analyzer
Best for Chinese and multilingual text. The AliNLP analyzer is developed by Alibaba Cloud and DAMO Academy using natural language processing (NLP) technology. It tokenizes consecutive natural-language text into meaningful segments and supports custom dictionaries for user-defined entities and stop words.
Supported languages: Chinese, English, Indonesian, Malay, Thai, Vietnamese, French, Spanish
Tokenization result for English text with the default configuration:
```sql
/*+ mode=two_phase*/ SELECT fulltext_alinlp_test('Hello world');
```
Result:
[hello, , world]
Configuration parameters
| Parameter | Description | Default |
|---|---|---|
| FULLTEXT_SPLIT_GRANULARITY | Segmentation granularity. Integer from 2 to 8. | 2 |
| FULLTEXT_FILTER_ST_CONVERT_ENABLED | Whether stem conversion is enabled (for example, men → man, cars → car). | false |
| FULLTEXT_TOKENIZER_CASE_SENSITIVE | Whether tokenization is case-sensitive. | false |
| FULLTEXT_FILTER_PINYIN_ENABLE | Whether pinyin search is enabled. | |
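For example, to see the effect of stem conversion, you could enable it and rerun the test function. This is a sketch; the resulting token list depends on your cluster version and stop-word settings:

```sql
set adb_config FULLTEXT_FILTER_ST_CONVERT_ENABLED=true;
/*+ mode=two_phase*/ SELECT fulltext_alinlp_test('The cars');
```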
IK analyzer
Best for Chinese text. The IK analyzer is an open-source, lightweight Chinese analyzer. It supports two segmentation modes — coarse-grained and fine-grained — and accepts custom dictionaries for entities and stop words.
Tokenization result for English text with the default configuration:
```sql
/*+ mode=two_phase*/ SELECT fulltext_ik_test('Hello world');
```
Result:
[hello, world, or]
Configuration parameters
| Parameter | Description | Default |
|---|---|---|
| CSTORE_IK_SEGMENTER_USE_SMART_ENABLE | Segmentation mode. true = coarse-grained (ik_smart mode); false = fine-grained (ik_max_word mode). | false |
| CSTORE_IK_SEGMENTER_LETTER_MIN_LENGTH | Minimum segment length. Integer from 2 to 16. | 3 |
| CSTORE_IK_SEGMENTER_LETTER_MAX_LENGTH | Maximum segment length. Integer from 2 to 256. | 128 |
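To switch IK to coarse-grained (ik_smart) segmentation and preview the difference, a sketch:

```sql
set adb_config CSTORE_IK_SEGMENTER_USE_SMART_ENABLE=true;
/*+ mode=two_phase*/ SELECT fulltext_ik_test('Hello world');
```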
Standard analyzer
Best for English text. The Standard analyzer applies language-specific rules: for English, it lowercases text and removes stop words and punctuation before tokenizing; for Chinese, it splits text into individual characters. It supports custom dictionaries.
Tokenization result for English text with the default configuration:
```sql
/*+ mode=two_phase*/ SELECT fulltext_standard_test('Hello world');
```
Result:
[hello, world]
Configuration parameters
| Parameter | Description | Default |
|---|---|---|
| FULLTEXT_MAX_TOKEN_LENGTH | Maximum token length. Integer from 1 to 1,048,576. | 255 |
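To verify the stop-word filtering described above, you can test a sentence containing common English stop words. The stop-word list is defined by the analyzer, so the exact output is not guaranteed here:

```sql
/*+ mode=two_phase*/ SELECT fulltext_standard_test('The quick brown fox');
```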
Ngram analyzer
Best for fuzzy substring search. The Ngram analyzer breaks text into all possible substrings of a fixed length, making it effective for partial-match queries. It supports custom dictionaries.
Tokenization result for English text with the default configuration:
```sql
/*+ mode=two_phase*/ SELECT fulltext_ngram_test('Hello world');
```
Result:
[he, el, ll, lo, o , w, wo, or, rl, ld]
Configuration parameters
| Parameter | Description | Default |
|---|---|---|
| FULLTEXT_NGRAM_TOKEN_SIZE | Token length. Integer from 1 to 8. | 2 |
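Raising the token size changes which substrings can match: with a size of 3, the analyzer emits 3-character grams instead of 2-character grams. A sketch of changing the size and previewing the result:

```sql
set adb_config FULLTEXT_NGRAM_TOKEN_SIZE=3;
/*+ mode=two_phase*/ SELECT fulltext_ngram_test('Hello world');
```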
Edge_ngram analyzer
Best for prefix search and autocomplete (search-as-you-type). The Edge_ngram analyzer generates prefix tokens of increasing length — for example, h, then he — making it ideal for matching words from the beginning. It supports custom dictionaries.
Tip: Edge_ngram works well when the search term appears at the start of words. For fuzzy substring matching anywhere in the text, use the Ngram analyzer instead.
Tokenization result for English text with the default configuration:
```sql
/*+ mode=two_phase*/ SELECT fulltext_edge_ngram_test('Hello world');
```
Result:
[h, he]
Configuration parameters
| Parameter | Description | Default |
|---|---|---|
| FULLTEXT_MIN_GRAM_SIZE | Minimum prefix length. Integer from 1 to 8. | 1 |
| FULLTEXT_MAX_GRAM_SIZE | Maximum prefix length. Integer from 1 to 16. Must be greater than FULLTEXT_MIN_GRAM_SIZE. | 2 |
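To support longer prefixes for autocomplete, you can raise the maximum gram size before creating the index and preview the result. The value 5 is illustrative:

```sql
set adb_config FULLTEXT_MAX_GRAM_SIZE=5;
/*+ mode=two_phase*/ SELECT fulltext_edge_ngram_test('Hello world');
```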
Pattern analyzer
Best for custom delimiters. The Pattern analyzer tokenizes text by splitting on a regular expression pattern you define. It does not support custom dictionaries or SQL-based tokenization testing.
Syntax
```sql
FULLTEXT INDEX fidx_name(`column_name`) WITH ANALYZER pattern_tokenizer("Custom_rule");
```
Custom_rule is the regular expression that defines the split pattern.
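For example, to tokenize a delimiter-separated tag list on commas and semicolons, a pattern index might look like the following sketch; the table and column names are illustrative:

```sql
CREATE TABLE `tbl_tags_demo` (
  `id` int,
  `tags` varchar,
  FULLTEXT INDEX fidx_tags(`tags`) WITH ANALYZER pattern_tokenizer("[,;]"),
  PRIMARY KEY (`id`)
) DISTRIBUTED BY HASH(id);
```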
Configuration parameters
| Parameter | Description | Default |
|---|---|---|
| FULLTEXT_TOKENIZER_CASE_SENSITIVE | Whether tokenization is case-sensitive. | false |
Manage analyzer configuration
Query configuration parameters
Use either method below to check the current value of a configuration parameter.
Method 1: `SHOW adb_config`
Returns both default and modified values.
```sql
show adb_config key <analyzer_param>;
```
Example:
```sql
show adb_config key FULLTEXT_NGRAM_TOKEN_SIZE;
```
Method 2: `SELECT` from `INFORMATION_SCHEMA`
Returns only modified values. If a parameter has never been changed from its default, this query returns null.
```sql
SELECT `key`, `value`, `update_time`
FROM INFORMATION_SCHEMA.kepler_meta_configs
WHERE `key` = '<analyzer_param>';
```
Example:
```sql
SELECT `key`, `value`, `update_time`
FROM INFORMATION_SCHEMA.kepler_meta_configs
WHERE `key` = 'FULLTEXT_NGRAM_TOKEN_SIZE';
```
Modify configuration parameters
```sql
set adb_config <analyzer_param>=<value>;
```
Example:
```sql
set adb_config FULLTEXT_NGRAM_TOKEN_SIZE=3;
```