AnalyticDB for MySQL provides six built-in analyzers for full-text indexes: AliNLP, IK, Standard, Ngram, Edge_ngram, and Pattern. Each analyzer uses a different tokenization strategy — choose based on your language requirements, search pattern (fuzzy, prefix, or exact match), and whether you need Chinese text processing.
Choose an analyzer
| Analyzer | Best for | Language support | Custom dictionary |
|---|---|---|---|
| AliNLP | Chinese and multilingual NLP-based tokenization | Chinese, English, Indonesian, Malay, Thai, Vietnamese, French, Spanish | Yes |
| IK | Chinese text with configurable granularity | Chinese | Yes |
| Standard | English text with stop-word filtering | Language-specific rules | Yes |
| Ngram | Fuzzy substring search | Any | Yes |
| Edge_ngram | Prefix search and autocomplete | Any | Yes |
| Pattern | Custom delimiter-based tokenization via regular expressions | Any | No |
Default analyzer by cluster version:
Clusters earlier than V3.1.4.15: AliNLP analyzer
Clusters V3.1.4.15 or later: IK analyzer
To check your cluster's minor version, see How do I view the minor version of a cluster?
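If your cluster supports it, you can also query the version directly from a SQL client. The `adb_version()` function is an assumption here and may not be available on every minor version:

```sql
SELECT adb_version();
```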
Specify an analyzer
Syntax
```sql
FULLTEXT INDEX idx_name(`column_name`) [ WITH ANALYZER analyzer_name ] [ WITH DICT tbl_dict_name ];
```
Parameters
| Parameter | Description |
|---|---|
| idx_name | Name of the full-text index. |
| column_name | Name of the column to index. |
| WITH ANALYZER analyzer_name | Analyzer to use. Omit this clause to use the default analyzer. |
| WITH DICT tbl_dict_name | Custom dictionary to apply. For details, see Custom dictionaries for full-text indexes. |
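As a sketch, an index definition that combines both optional clauses might look like the following; the analyzer choice and the dictionary name `my_dict` are illustrative:

```sql
FULLTEXT INDEX fidx_body(`body`) WITH ANALYZER ik WITH DICT my_dict
```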
Example
The following statement creates a table with six full-text indexes, one per analyzer:
```sql
CREATE TABLE `tbl_fulltext_demo` (
  `id` int,
  `content` varchar,
  `content_alinlp` varchar,
  `content_ik` varchar,
  `content_standard` varchar,
  `content_ngram` varchar,
  `content_edge_ngram` varchar,
  FULLTEXT INDEX fidx_c(`content`), -- Default analyzer
  FULLTEXT INDEX fidx_alinlp(`content_alinlp`) WITH ANALYZER alinlp,
  FULLTEXT INDEX fidx_ik(`content_ik`) WITH ANALYZER ik,
  FULLTEXT INDEX fidx_standard(`content_standard`) WITH ANALYZER standard,
  FULLTEXT INDEX fidx_ngram(`content_ngram`) WITH ANALYZER ngram,
  FULLTEXT INDEX fidx_edge_ngram(`content_edge_ngram`) WITH ANALYZER edge_ngram,
  PRIMARY KEY (`id`)
) DISTRIBUTED BY HASH(id);
```
Preview tokenization results
Before committing to an analyzer, test how it tokenizes your text. Each analyzer has a dedicated test function that returns the token list for a given input string.
Prefix all tokenization test queries with the /*+ mode=two_phase*/ hint; without it, they do not execute correctly.
The following examples use the same input, 'Hello world', across all analyzers so you can compare their output directly:
| Analyzer | Test function | Output for 'Hello world' |
|---|---|---|
| AliNLP | fulltext_alinlp_test() | [hello, , world] |
| IK | fulltext_ik_test() | [hello, world, or] |
| Standard | fulltext_standard_test() | [hello, world] |
| Ngram (default token size: 2) | fulltext_ngram_test() | [he, el, ll, lo, o , w, wo, or, rl, ld] |
| Edge_ngram (min: 1, max: 2) | fulltext_edge_ngram_test() | [h, he] |
Usage:
```sql
/*+ mode=two_phase*/ SELECT fulltext_ik_test('Hello world');
```
Result:
[hello, world, or]
The Pattern analyzer does not support SQL-based tokenization testing.
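To compare analyzers side by side, you can call several test functions in one statement. This is a sketch; the exact token lists depend on your cluster version and configuration:

```sql
/*+ mode=two_phase*/ SELECT
  fulltext_standard_test('Hello world')   AS standard_tokens,
  fulltext_ngram_test('Hello world')      AS ngram_tokens,
  fulltext_edge_ngram_test('Hello world') AS edge_ngram_tokens;
```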
AliNLP analyzer
Best for Chinese and multilingual text. The AliNLP analyzer is developed by Alibaba Cloud and DAMO Academy using natural language processing (NLP) technology. It tokenizes consecutive natural-language text into meaningful segments and supports custom dictionaries for user-defined entities and stop words.
Supported languages: Chinese, English, Indonesian, Malay, Thai, Vietnamese, French, Spanish
Tokenization result for English text with the default configuration:
```sql
/*+ mode=two_phase*/ SELECT fulltext_alinlp_test('Hello world');
```
Result:
[hello, , world]
Configuration parameters
| Parameter | Description | Default |
|---|---|---|
| FULLTEXT_SPLIT_GRANULARITY | Segmentation granularity. Integer from 2 to 8. | 2 |
| FULLTEXT_FILTER_ST_CONVERT_ENABLED | Whether stem conversion is enabled (for example, men → man, cars → car). | false |
| FULLTEXT_TOKENIZER_CASE_SENSITIVE | Whether tokenization is case-sensitive. | false |
| FULLTEXT_FILTER_PINYIN_ENABLE | Whether pinyin search is enabled. | |
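For example, to see the effect of stem conversion, you could enable it and rerun the test function. This is a sketch; the resulting token list depends on your cluster version and stop-word settings:

```sql
set adb_config FULLTEXT_FILTER_ST_CONVERT_ENABLED=true;
/*+ mode=two_phase*/ SELECT fulltext_alinlp_test('The cars');
```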
IK analyzer
Best for Chinese text. The IK analyzer is an open-source, lightweight Chinese analyzer. It supports two segmentation modes — coarse-grained and fine-grained — and accepts custom dictionaries for entities and stop words.
Tokenization result for English text with the default configuration:
```sql
/*+ mode=two_phase*/ SELECT fulltext_ik_test('Hello world');
```
Result:
[hello, world, or]
Configuration parameters
| Parameter | Description | Default |
|---|---|---|
| CSTORE_IK_SEGMENTER_USE_SMART_ENABLE | Segmentation mode. true = coarse-grained (ik_smart mode); false = fine-grained (ik_max_word mode). | false |
| CSTORE_IK_SEGMENTER_LETTER_MIN_LENGTH | Minimum segment length. Integer from 2 to 16. | 3 |
| CSTORE_IK_SEGMENTER_LETTER_MAX_LENGTH | Maximum segment length. Integer from 2 to 256. | 128 |
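To switch IK to coarse-grained (ik_smart) segmentation and preview the difference, a sketch:

```sql
set adb_config CSTORE_IK_SEGMENTER_USE_SMART_ENABLE=true;
/*+ mode=two_phase*/ SELECT fulltext_ik_test('Hello world');
```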
Standard analyzer
Best for English text. The Standard analyzer applies language-specific rules: for English, it lowercases text and removes stop words and punctuation before tokenizing; for Chinese, it splits text into individual characters. It supports custom dictionaries.
Tokenization result for English text with the default configuration:
```sql
/*+ mode=two_phase*/ SELECT fulltext_standard_test('Hello world');
```
Result:
[hello, world]
Configuration parameters
| Parameter | Description | Default |
|---|---|---|
| FULLTEXT_MAX_TOKEN_LENGTH | Maximum token length. Integer from 1 to 1,048,576. | 255 |
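To verify the stop-word filtering described above, you can test a sentence containing common English stop words. The stop-word list is defined by the analyzer, so the exact output is not guaranteed here:

```sql
/*+ mode=two_phase*/ SELECT fulltext_standard_test('The quick brown fox');
```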
Ngram analyzer
Best for fuzzy substring search. The Ngram analyzer breaks text into all possible substrings of a fixed length, making it effective for partial-match queries. It supports custom dictionaries.
Tokenization result for English text with the default configuration:
```sql
/*+ mode=two_phase*/ SELECT fulltext_ngram_test('Hello world');
```
Result:
[he, el, ll, lo, o , w, wo, or, rl, ld]
Configuration parameters
| Parameter | Description | Default |
|---|---|---|
| FULLTEXT_NGRAM_TOKEN_SIZE | Token length. Integer from 1 to 8. | 2 |
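Raising the token size changes which substrings can match: with a size of 3, the analyzer emits 3-character grams instead of 2-character grams. A sketch of changing the size and previewing the result:

```sql
set adb_config FULLTEXT_NGRAM_TOKEN_SIZE=3;
/*+ mode=two_phase*/ SELECT fulltext_ngram_test('Hello world');
```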
Edge_ngram analyzer
Best for prefix search and autocomplete (search-as-you-type). The Edge_ngram analyzer generates prefix tokens of increasing length — for example, h, then he — making it ideal for matching words from the beginning. It supports custom dictionaries.
Tip: Edge_ngram works well when the search term appears at the start of words. For fuzzy substring matching anywhere in the text, use the Ngram analyzer instead.
Tokenization result for English text with the default configuration:
```sql
/*+ mode=two_phase*/ SELECT fulltext_edge_ngram_test('Hello world');
```
Result:
[h, he]
Configuration parameters
| Parameter | Description | Default |
|---|---|---|
| FULLTEXT_MIN_GRAM_SIZE | Minimum prefix length. Integer from 1 to 8. | 1 |
| FULLTEXT_MAX_GRAM_SIZE | Maximum prefix length. Integer from 1 to 16. Must be greater than FULLTEXT_MIN_GRAM_SIZE. | 2 |
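To support longer prefixes for autocomplete, you can raise the maximum gram size before creating the index and preview the result. The value 5 is illustrative:

```sql
set adb_config FULLTEXT_MAX_GRAM_SIZE=5;
/*+ mode=two_phase*/ SELECT fulltext_edge_ngram_test('Hello world');
```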
Pattern analyzer
Best for custom delimiters. The Pattern analyzer tokenizes text by splitting on a regular expression pattern you define. It does not support custom dictionaries or SQL-based tokenization testing.
Syntax
```sql
FULLTEXT INDEX fidx_name(`column_name`) WITH ANALYZER pattern_tokenizer("Custom_rule");
```
Custom_rule is the regular expression that defines the split pattern.
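For example, to tokenize a delimiter-separated tag list on commas and semicolons, a pattern index might look like the following sketch; the table and column names are illustrative:

```sql
CREATE TABLE `tbl_tags_demo` (
  `id` int,
  `tags` varchar,
  FULLTEXT INDEX fidx_tags(`tags`) WITH ANALYZER pattern_tokenizer("[,;]"),
  PRIMARY KEY (`id`)
) DISTRIBUTED BY HASH(id);
```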
Configuration parameters
| Parameter | Description | Default |
|---|---|---|
| FULLTEXT_TOKENIZER_CASE_SENSITIVE | Whether tokenization is case-sensitive. | false |
Manage analyzer configuration
Query configuration parameters
Use either method below to check the current value of a configuration parameter.
Method 1: `SHOW adb_config`
Returns both default and modified values.
```sql
show adb_config key <analyzer_param>;
```
Example:
```sql
show adb_config key FULLTEXT_NGRAM_TOKEN_SIZE;
```
Method 2: `SELECT` from `INFORMATION_SCHEMA`
Returns only modified values. If a parameter has never been changed from its default, this query returns null.
```sql
SELECT `key`, `value`, `update_time`
FROM INFORMATION_SCHEMA.kepler_meta_configs
WHERE `key` = '<analyzer_param>';
```
Example:
```sql
SELECT `key`, `value`, `update_time`
FROM INFORMATION_SCHEMA.kepler_meta_configs
WHERE `key` = 'FULLTEXT_NGRAM_TOKEN_SIZE';
```
Modify configuration parameters
```sql
set adb_config <analyzer_param>=<value>;
```
Example:
```sql
set adb_config FULLTEXT_NGRAM_TOKEN_SIZE=3;
```