
AnalyticDB: Analyzers for full-text indexes

Last Updated: Mar 28, 2026

AnalyticDB for MySQL provides six built-in analyzers for full-text indexes: AliNLP, IK, Standard, Ngram, Edge_ngram, and Pattern. Each analyzer uses a different tokenization strategy — choose based on your language requirements, search pattern (fuzzy, prefix, or exact match), and whether you need Chinese text processing.

Choose an analyzer

| Analyzer | Best for | Language support | Custom dictionary |
| --- | --- | --- | --- |
| AliNLP | Chinese and multilingual NLP-based tokenization | Chinese, English, Indonesian, Malay, Thai, Vietnamese, French, Spanish | Yes |
| IK | Chinese text with configurable granularity | Chinese | Yes |
| Standard | English text with stop-word filtering | Language-specific rules | Yes |
| Ngram | Fuzzy substring search | Any | Yes |
| Edge_ngram | Prefix search and autocomplete | Any | Yes |
| Pattern | Custom delimiter-based tokenization via regular expressions | Any | No |

Default analyzer by cluster version:

  • Clusters earlier than V3.1.4.15: AliNLP analyzer

  • Clusters V3.1.4.15 or later: IK analyzer

To check your cluster's minor version, see How do I view the minor version of a cluster?

Specify an analyzer

Syntax

FULLTEXT INDEX idx_name(`column_name`) [ WITH ANALYZER analyzer_name ] [ WITH DICT tbl_dict_name ];

Parameters

| Parameter | Description |
| --- | --- |
| idx_name | Name of the full-text index. |
| column_name | Name of the column to index. |
| WITH ANALYZER analyzer_name | Analyzer to use. Omit to use the default analyzer. |
| WITH DICT tbl_dict_name | Custom dictionary to apply. For details, see Custom dictionaries for full-text indexes. |

Example

The following statement creates a table with six full-text indexes: one that uses the default analyzer and one each for the AliNLP, IK, Standard, Ngram, and Edge_ngram analyzers. The Pattern analyzer uses a different syntax and is described in its own section.

CREATE TABLE `tbl_fulltext_demo` (
  `id` int,
  `content` varchar,
  `content_alinlp` varchar,
  `content_ik` varchar,
  `content_standard` varchar,
  `content_ngram` varchar,
  `content_edge_ngram` varchar,
  FULLTEXT INDEX fidx_c(`content`),                                           -- Default analyzer
  FULLTEXT INDEX fidx_alinlp(`content_alinlp`) WITH ANALYZER alinlp,
  FULLTEXT INDEX fidx_ik(`content_ik`) WITH ANALYZER ik,
  FULLTEXT INDEX fidx_standard(`content_standard`) WITH ANALYZER standard,
  FULLTEXT INDEX fidx_ngram(`content_ngram`) WITH ANALYZER ngram,
  FULLTEXT INDEX fidx_edge_ngram(`content_edge_ngram`) WITH ANALYZER edge_ngram,
  PRIMARY KEY (`id`)
) DISTRIBUTED BY HASH(id);

Preview tokenization results

Before committing to an analyzer, test how it tokenizes your text. Each analyzer has a dedicated test function that returns the token list for a given input string.

You must prefix all tokenization test queries with the /*+ mode=two_phase*/ hint; otherwise, they do not execute correctly.

The following examples use the same input — 'Hello world' — across all analyzers so you can compare their output directly:

| Analyzer | Test function | Output for 'Hello world' |
| --- | --- | --- |
| AliNLP | fulltext_alinlp_test() | [hello, , world] |
| IK | fulltext_ik_test() | [hello, world, or] |
| Standard | fulltext_standard_test() | [hello, world] |
| Ngram (default token size: 2) | fulltext_ngram_test() | [he, el, ll, lo, o , w, wo, or, rl, ld] |
| Edge_ngram (min: 1, max: 2) | fulltext_edge_ngram_test() | [h, he] |

Usage:

/*+ mode=two_phase*/ SELECT fulltext_ik_test('Hello world');

Result:

[hello, world, or]

Note: The Pattern analyzer does not support SQL-based tokenization testing.

AliNLP analyzer

Best for Chinese and multilingual text. The AliNLP analyzer is developed by Alibaba Cloud and DAMO Academy using natural language processing (NLP) technology. It tokenizes consecutive natural-language text into meaningful segments and supports custom dictionaries for user-defined entities and stop words.

Supported languages: Chinese, English, Indonesian, Malay, Thai, Vietnamese, French, Spanish

  • Tokenization result for English text:

    /*+ mode=two_phase*/ SELECT fulltext_alinlp_test('Hello world');

    Result:

    [hello,  , world]

The preceding result is based on the default configuration.

Configuration parameters

| Parameter | Description | Default |
| --- | --- | --- |
| FULLTEXT_SPLIT_GRANULARITY | Segmentation granularity. Integer from 2 to 8. | 2 |
| FULLTEXT_FILTER_ST_CONVERT_ENABLED | Specifies whether to enable stem conversion (for example, men → man, cars → car). | false |
| FULLTEXT_TOKENIZER_CASE_SENSITIVE | Specifies whether tokenization is case-sensitive. | false |
| FULLTEXT_FILTER_PINYIN_ENABLE | Specifies whether to enable pinyin search. | false |


IK analyzer

Best for Chinese text. The IK analyzer is an open-source, lightweight Chinese analyzer. It supports two segmentation modes — coarse-grained and fine-grained — and accepts custom dictionaries for entities and stop words.

  • Tokenization result for English text:

    /*+ mode=two_phase*/ SELECT fulltext_ik_test('Hello world');

    Result:

    [hello, world, or]

The preceding result is based on the default configuration.

Configuration parameters

| Parameter | Description | Default |
| --- | --- | --- |
| CSTORE_IK_SEGMENTER_USE_SMART_ENABLE | Segmentation mode. true = coarse-grained (ik_smart mode); false = fine-grained (ik_max_word mode). | false |
| CSTORE_IK_SEGMENTER_LETTER_MIN_LENGTH | Minimum segment length. Integer from 2 to 16. | 3 |
| CSTORE_IK_SEGMENTER_LETTER_MAX_LENGTH | Maximum segment length. Integer from 2 to 256. | 128 |

Standard analyzer

Best for English text. The Standard analyzer applies language-specific rules: for English, it lowercases text and removes stop words and punctuation before tokenizing; for Chinese, it splits text into individual characters. It supports custom dictionaries.

  • Tokenization result for English text:

    /*+ mode=two_phase*/ SELECT fulltext_standard_test('Hello world');

    Result:

    [hello, world]

The preceding result is based on the default configuration.

Configuration parameters

| Parameter | Description | Default |
| --- | --- | --- |
| FULLTEXT_MAX_TOKEN_LENGTH | Maximum token length. Integer from 1 to 1,048,576. | 255 |
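The Standard analyzer's English pipeline (lowercase, strip punctuation, drop stop words, split on whitespace) can be sketched in a few lines of Python. This is an illustrative approximation only, not AnalyticDB's implementation, and the stop-word list here is a hypothetical placeholder rather than the analyzer's actual list:

```python
import re

# Hypothetical stop-word list for illustration; the actual list used by
# the Standard analyzer is not documented here.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "is"}

def standard_tokens(text: str) -> list[str]:
    # Lowercase and keep only alphanumeric runs (drops punctuation),
    # then filter out stop words.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

print(standard_tokens("Hello, the World!"))  # ['hello', 'world']
```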

Ngram analyzer

Best for fuzzy substring search. The Ngram analyzer breaks text into all possible substrings of a fixed length, making it effective for partial-match queries. It supports custom dictionaries.

  • Tokenization result for English text:

    /*+ mode=two_phase*/ SELECT fulltext_ngram_test('Hello world');

    Result:

    [he, el, ll, lo, o ,  w, wo, or, rl, ld]

The preceding result is based on the default configuration.

Configuration parameters

| Parameter | Description | Default |
| --- | --- | --- |
| FULLTEXT_NGRAM_TOKEN_SIZE | Token length. Integer from 1 to 8. | 2 |
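The sliding-window behavior controlled by FULLTEXT_NGRAM_TOKEN_SIZE can be sketched as follows. This Python sketch is an illustrative approximation, not AnalyticDB's implementation; with the default token size of 2 it reproduces the result shown above, including the space-containing tokens:

```python
def ngram_tokens(text: str, token_size: int = 2) -> list[str]:
    # Emit every substring of length token_size; lowercasing mirrors the
    # analyzers' case-insensitive default.
    text = text.lower()
    return [text[i:i + token_size] for i in range(len(text) - token_size + 1)]

print(ngram_tokens("Hello world"))
# ['he', 'el', 'll', 'lo', 'o ', ' w', 'wo', 'or', 'rl', 'ld']
```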

Edge_ngram analyzer

Best for prefix search and autocomplete (search-as-you-type). The Edge_ngram analyzer generates prefix tokens of increasing length — for example, h, then he — making it ideal for matching words from the beginning. It supports custom dictionaries.

Tip: Edge_ngram works well when the search term appears at the start of words. For fuzzy substring matching anywhere in the text, use the Ngram analyzer instead.

  • Tokenization result for English text:

    /*+ mode=two_phase*/ SELECT fulltext_edge_ngram_test('Hello world');

    Result:

    [h, he]

The preceding result is based on the default configuration.

Configuration parameters

| Parameter | Description | Default |
| --- | --- | --- |
| FULLTEXT_MIN_GRAM_SIZE | Minimum prefix length. Integer from 1 to 8. | 1 |
| FULLTEXT_MAX_GRAM_SIZE | Maximum prefix length. Integer from 1 to 16. Must be greater than FULLTEXT_MIN_GRAM_SIZE. | 2 |
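The interaction of FULLTEXT_MIN_GRAM_SIZE and FULLTEXT_MAX_GRAM_SIZE can be sketched with a short Python approximation (again illustrative, not AnalyticDB's implementation). With the defaults of 1 and 2 it reproduces the [h, he] result shown above:

```python
def edge_ngram_tokens(text: str, min_gram: int = 1, max_gram: int = 2) -> list[str]:
    # Emit prefixes of increasing length, from min_gram up to max_gram
    # (capped at the input length).
    text = text.lower()
    return [text[:n] for n in range(min_gram, min(max_gram, len(text)) + 1)]

print(edge_ngram_tokens("Hello world"))  # ['h', 'he']
```

Because only short prefixes are indexed by default, raise FULLTEXT_MAX_GRAM_SIZE if users are expected to type longer prefixes before a match is needed.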

Pattern analyzer

Best for custom delimiters. The Pattern analyzer tokenizes text by splitting on a regular expression pattern you define. It does not support custom dictionaries or SQL-based tokenization testing.

Syntax

FULLTEXT INDEX fidx_name(`column_name`) WITH ANALYZER pattern_tokenizer("Custom_rule") [ WITH DICT `tbl_dict_name` ];

Custom_rule is the regular expression that defines the split pattern.

Configuration parameters

| Parameter | Description | Default |
| --- | --- | --- |
| FULLTEXT_TOKENIZER_CASE_SENSITIVE | Specifies whether tokenization is case-sensitive. | false |
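Conceptually, the Pattern analyzer behaves like a regular-expression split. The following Python sketch is an illustrative approximation (not AnalyticDB's implementation), using a hypothetical Custom_rule that splits on hyphens, commas, and whitespace:

```python
import re

def pattern_tokens(text: str, custom_rule: str) -> list[str]:
    # Split on the user-defined regular expression and drop empty tokens,
    # which arise when delimiters appear at the string boundaries.
    return [t for t in re.split(custom_rule, text) if t]

print(pattern_tokens("2026-03-28,Hello world", r"[-,\s]+"))
# ['2026', '03', '28', 'Hello', 'world']
```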

Manage analyzer configuration

Query configuration parameters

Use either method below to check the current value of a configuration parameter.

Method 1: `SHOW adb_config`

Returns both default and modified values.

show adb_config key <analyzer_param>;

Example:

show adb_config key FULLTEXT_NGRAM_TOKEN_SIZE;

Method 2: `SELECT` from `INFORMATION_SCHEMA`

Returns only modified values. If a parameter has never been changed from its default, this query returns null.

SELECT `key`, `value`, `update_time`
FROM INFORMATION_SCHEMA.kepler_meta_configs
WHERE `key` = '<analyzer_param>';

Example:

SELECT `key`, `value`, `update_time`
FROM INFORMATION_SCHEMA.kepler_meta_configs
WHERE `key` = 'FULLTEXT_NGRAM_TOKEN_SIZE';

Modify configuration parameters

set adb_config <analyzer_param>=<value>;

Example:

set adb_config FULLTEXT_NGRAM_TOKEN_SIZE=3;