When your workload requires keyword search, phrase matching, or relevance-ranked retrieval over text columns, a full-text inverted index provides fast, BM25-scored search that scales far beyond brute-force scans. Hologres supports full-text inverted indexes starting from version 4.0, built on the Tantivy search engine.
How it works
When you write text data to Hologres, the system builds a full-text inverted index for each data file based on your index configuration. A tokenizer splits the text into tokens. The index then records the mapping between each token and its source document, along with position and term frequency information.
At query time, Hologres tokenizes the query string, then uses the BM25 algorithm to compute a relevance score for each document against the query tokens, so matching results are returned quickly and in relevance order.
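As a preview, a relevance-ranked query looks like the sketch below. The TEXT_SEARCH function and a full sample schema are covered later in this topic; the table and column names here are illustrative:
-- Illustrative sketch: rank rows of a hypothetical articles(content) table
-- by BM25 relevance for the query 'spring festival'
SELECT id,
       content,
       TEXT_SEARCH(content, 'spring festival') AS score
FROM articles
WHERE TEXT_SEARCH(content, 'spring festival') > 0
ORDER BY score DESC
LIMIT 10;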
Limits
- Full-text inverted indexes are available only on column-oriented tables and hybrid row-column tables in Hologres version 4.0 and later. Row-oriented tables are not supported.
- Indexes can be created only on columns of type TEXT, CHAR, or VARCHAR.
- Each index covers one column, and each column supports one full-text inverted index. To index multiple columns, create a separate index for each.
- Full-text search runs only on indexed columns. Querying unindexed columns is not supported.
- BM25 scores are calculated per file. For small data volumes, trigger compaction manually to merge files and improve scoring accuracy.
- After you create a full-text inverted index, Hologres builds the index files asynchronously during compaction. This applies to both existing data and new batch-inserted data. Until compaction completes, BM25 relevance scores for that data are zero.
- For real-time writes after an index exists: before version 4.0.8, indexes were built synchronously as data arrived. Starting in version 4.0.8, Hologres refreshes the in-memory index every second, and newly written data becomes searchable only after the next refresh.
- Use Serverless Computing resources for batch data imports. Serverless resources perform compaction and build full-text indexes synchronously during import. See Run Read/Write Tasks with Serverless Computing and Run Compaction Tasks with Serverless Computing.
- Full-text search queries can run using Serverless Computing resources.
Choose a tokenizer
Select a tokenizer based on your text type and search requirements:
| Use case | Tokenizer | Notes |
|---|---|---|
| Keyword extraction from long articles | jieba | Supports new-word discovery and multiple tokenization modes (search and exact). |
| Chinese descriptive text search | ik | Accurately identifies Chinese terms. Available in version 4.0.9 and later. |
| English title or plain text search | simple, whitespace, standard | Simple and efficient. Choose based on your text format. |
| Fuzzy log text search | ngram | No dictionary required. Supports fuzzy matching. Available in version 4.0.9 and later. |
| Pinyin search for Chinese names or products | pinyin | Supports full pinyin, initials, and polyphonic character inference. Available in version 4.0.9 and later. |
For tokenizer parameter details, see Customize tokenizer configuration.
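If you are unsure which tokenizer fits, you can preview each candidate's output with the TOKENIZE function (documented later in this topic) before creating any index. A quick sketch with an illustrative input string (ngram requires version 4.0.9 or later):
-- Compare how two tokenizers split the same text
SELECT TOKENIZE('Hologres full-text search', 'standard');
SELECT TOKENIZE('Hologres full-text search', 'ngram');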
Manage indexes
Create an index
Syntax
CREATE INDEX [ IF NOT EXISTS ] idx_name ON table_name
USING FULLTEXT (column_name [ , ... ])
[ WITH ( storage_parameter [ = value ] [ , ... ] ) ];
Parameters
| Parameter | Description |
|---|---|
| `idx_name` | The index name. |
| `table_name` | The target table. |
| `column_name` | The column to index. Must be of type TEXT, CHAR, or VARCHAR. |
| `storage_parameter` | Index configuration. Supported parameters: `tokenizer`, `analyzer_params`, and `index_options`. See the following sections. |
`tokenizer`
The tokenizer to use. Defaults to jieba. Supported values:
| Tokenizer | Description | Available from |
|---|---|---|
| `jieba` (default) | Chinese tokenizer combining rule-based matching and statistical models. | 4.0 |
| `whitespace` | Splits text on spaces. | 4.0 |
| `standard` | Tokenizes using Unicode Standard Annex #29. | 4.0 |
| `simple` | Splits text on spaces and punctuation. | 4.0 |
| `keyword` | Returns the input unchanged without any processing. | 4.0 |
| `icu` | Multilingual tokenizer. | 4.0 |
| `ik` | Chinese tokenizer based on IK Analyzer. Automatically detects English words, email addresses, URLs (without ://), and IP addresses. | 4.0.9 |
| `ngram` | Character-level sliding window tokenizer. Splits text into continuous n-character sequences to improve recall and fuzzy matching. Ideal for accelerating LIKE and ILIKE queries. | 4.0.9 |
| `pinyin` | Generates pinyin for Chinese characters and words, and infers pinyin for non-Chinese strings. | 4.0.9 |
Each index supports only one tokenizer and one analyzer_params setting.
`analyzer_params`
Tokenizer configuration as a JSON string. Each tokenizer has defaults that work for most use cases—specify only tokenizer and omit analyzer_params unless you need custom behavior. For details, see Customize tokenizer configuration.
`index_options`
Controls how much information the index stores and which query features are available. Available in version 4.1.9 and later.
index_options supports three levels. Higher levels include all information from lower levels: freqs includes docs, and positions includes freqs and docs.
| Value | Index content | Supported features | Typical use case |
|---|---|---|---|
| `positions` (default) | Document ID + term frequency + term position | Full feature support: phrase queries and standard relevance scoring | General full-text search |
| `freqs` | Document ID + term frequency | No phrase queries (position data missing) | Scoring by term frequency without exact phrase matching |
| `docs` | Document ID only | No phrase queries; all matching documents get the same TF score | Existence checks only; storage-sensitive scenarios |
For indexes using the `keyword` tokenizer, the record level is fixed at `docs` regardless of `index_options`.
Examples
Create an index using the default tokenizer (jieba) and default configuration:
CREATE INDEX idx1 ON tbl
USING FULLTEXT (col1);
Use the ik tokenizer with its default configuration:
CREATE INDEX idx1 ON tbl
USING FULLTEXT (col1)
WITH (tokenizer = 'ik');
Use a custom jieba configuration—exact mode with lowercase filter only:
CREATE INDEX idx1 ON tbl
USING FULLTEXT (col1)
WITH (tokenizer = 'jieba',
analyzer_params = '{"tokenizer":{"type":"jieba","mode":"exact"}, "filter":["lowercase"]}');
Set index_options to freqs to save storage at the cost of disabling phrase queries (version 4.1.9 and later):
CREATE INDEX idx1 ON tbl
USING FULLTEXT (col1)
WITH (index_options = 'freqs');
Build indexes after data import
Index files are built during compaction, not immediately after the CREATE INDEX statement. To build indexes right away, use one of the following approaches:
- With Serverless Computing (recommended): Serverless resources perform compaction and build full-text indexes synchronously during import, as shown in the sketch after this list. See Run Read/Write Tasks with Serverless Computing and Run Compaction Tasks with Serverless Computing.
- Without Serverless Computing: Trigger compaction manually after creating the index or finishing a batch import. See Compaction (Beta) for details:
  VACUUM <schema_name>.<table_name>;
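A minimal sketch of the Serverless Computing path, assuming an illustrative table tbl with staged data in staging_tbl; hg_computing_resource is the session-level GUC also used in the best-practice example later in this topic:
-- Create the index first, then import with Serverless Computing so
-- compaction and index builds happen synchronously during the write
CREATE INDEX IF NOT EXISTS idx1 ON tbl USING FULLTEXT (col1);
SET hg_computing_resource = 'serverless';
INSERT INTO tbl SELECT * FROM staging_tbl;
RESET hg_computing_resource;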
Modify an index
Syntax
-- Modify index configuration
ALTER INDEX [ IF EXISTS ] <idx_name> SET ( <storage_parameter> = '<storage_value>' [ , ... ] );
-- Restore default configuration
ALTER INDEX [ IF EXISTS ] <idx_name> RESET ( <storage_parameter> [ , ... ] );
After modifying a full-text inverted index, index files are rebuilt during the next compaction. Run VACUUM <schema_name>.<table_name>; to trigger compaction immediately. See Compaction.
Examples
Change the tokenizer to standard:
ALTER INDEX idx1 SET (tokenizer = 'standard');
Change the tokenizer to ik in ik_max_word mode with lowercase conversion disabled:
ALTER INDEX idx1 SET (
tokenizer = 'ik',
analyzer_params = '{"tokenizer":{"type":"ik","mode":"ik_max_word","enable_lowercase": false}}'
);
Restore the default jieba tokenizer and its default analyzer_params:
ALTER INDEX idx1 RESET (tokenizer, analyzer_params);
Restore index_options to the default (positions):
ALTER INDEX idx1 RESET (index_options);
Change index_options to docs:
ALTER INDEX idx1 SET (index_options = 'docs');
Delete an index
Syntax
DROP INDEX [ IF EXISTS ] <idx_name> [ RESTRICT ];
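For example, to remove the index created in the earlier examples without raising an error if it no longer exists:
DROP INDEX IF EXISTS idx1;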
View indexes
Use the system table hologres.hg_index_properties to list all full-text inverted indexes:
SELECT * FROM hologres.hg_index_properties;
To find the table and column associated with a specific index:
SELECT
t.relname AS table_name,
a.attname AS column_name
FROM pg_class t
JOIN pg_index i ON t.oid = i.indrelid
JOIN pg_class idx ON i.indexrelid = idx.oid
JOIN pg_attribute a ON a.attrelid = t.oid AND a.attnum = ANY(i.indkey)
WHERE t.relnamespace = (SELECT oid FROM pg_namespace WHERE nspname = '<namespace>')
AND idx.relname = '<indexname>'
LIMIT 1;
Replace the placeholders:
- `<namespace>`: The `table_namespace` value from `SELECT * FROM hologres.hg_index_properties;`.
- `<indexname>`: The index name you created.
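For instance, to locate the index ft_idx_1 on the wiki_articles table used in the examples later in this topic, assuming it lives in the public schema:
SELECT
    t.relname AS table_name,
    a.attname AS column_name
FROM pg_class t
JOIN pg_index i ON t.oid = i.indrelid
JOIN pg_class idx ON i.indexrelid = idx.oid
JOIN pg_attribute a ON a.attrelid = t.oid AND a.attnum = ANY(i.indkey)
WHERE t.relnamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'public')
  AND idx.relname = 'ft_idx_1'
LIMIT 1;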
Search with full-text inverted indexes
Hologres supports four search modes. Use the TEXT_SEARCH function to run any of them.
| Search mode | Description |
|---|---|
| Keyword match | Matches documents containing any (OR) or all (AND) of the query keywords. |
| Phrase search | Matches documents where query tokens appear within a specified distance. |
| Natural language search | Supports complex queries with AND/OR logic, required/excluded terms, and embedded phrases. |
| Term search | Exact match on the query string—no tokenization applied. |
TEXT_SEARCH function
TEXT_SEARCH computes a BM25 relevance score for a query against source text.
Syntax
TEXT_SEARCH (
<search_data> TEXT/VARCHAR/CHAR
,<search_expression> TEXT
[ ,<mode> TEXT DEFAULT 'match'
,<operator> TEXT DEFAULT 'OR'
,<tokenizer> TEXT DEFAULT ''
,<analyzer_params> TEXT DEFAULT ''
,<options> TEXT DEFAULT '']
)
Parameters
| Parameter | Required | Description |
|---|---|---|
| `search_data` | Yes | The source column to search. Must be TEXT, VARCHAR, or CHAR, and must have a full-text index. Queries on unindexed columns fail. |
| `search_expression` | Yes | The query string. Must be a constant. Supports TEXT, VARCHAR, and CHAR. |
| `mode` | No | The search mode. Supported values: match (default), phrase, natural_language, term. See below. |
| `operator` | No | Logical operator between keywords. Applies only when mode is match. Supported values: OR (default) or AND. |
| `tokenizer`, `analyzer_params` | No | Tokenizer and configuration to apply to search_expression. If not specified, uses the same tokenizer as the index on search_data. If search_data is a constant, defaults to jieba. |
| `options` | No | Additional search parameters in the format 'key1=v1;key2=v2;'. Currently supports only slop (for phrase mode). Valid values: 0 (default) or any positive integer. Defines the maximum allowed distance between phrase terms. |
`mode` values
| Value | Behavior |
|---|---|
| `match` (default) | Keyword match. Each token in search_expression is a keyword. Use operator to define AND/OR logic between keywords. |
| `phrase` | Phrase search. Tokens must appear within the distance defined by slop. Default slop is 0 (adjacent tokens only). |
| `natural_language` | Complex queries. Supports AND/OR operators, required (+) and excluded (-) terms, and quoted phrases. See Tantivy QueryParser. |
| `term` | Exact match. search_expression is matched as-is against the index, without tokenization. |
`slop` unit by tokenizer
The slop parameter in phrase mode measures distance differently depending on the tokenizer:
- Characters: `jieba`, `keyword`, `icu`
- Tokens: `standard`, `simple`, `whitespace`
Return value
A non-negative FLOAT representing the BM25 relevance score. Higher values indicate greater relevance. A score of 0 means no match.
Examples
Keyword match with AND operator:
-- Recommended: use named parameters
SELECT TEXT_SEARCH(content, 'machine learning', operator => 'AND') FROM tbl;
Phrase search with slop=2:
SELECT TEXT_SEARCH(content, 'machine learning', 'phrase', options => 'slop=2;') FROM tbl;
Natural language search with AND/OR logic:
SELECT TEXT_SEARCH(content, 'machine AND (system OR recognition)', 'natural_language') FROM tbl;
Natural language search with required and excluded terms:
-- Must contain 'learning', must not contain 'machine'
SELECT TEXT_SEARCH(content, '+learning -machine system', 'natural_language') FROM tbl;
Term search (exact match):
SELECT TEXT_SEARCH(content, 'machine learning', 'term') FROM tbl;
TOKENIZE function
TOKENIZE returns the tokens produced by a given tokenizer. Use it to test and debug tokenization behavior before building indexes.
Syntax
TOKENIZE (
<search_data> TEXT
[ ,<tokenizer> TEXT DEFAULT ''
,<analyzer_params> TEXT DEFAULT '']
)
Parameters
- `search_data`: Required. The text to tokenize. Accepts constants.
- `tokenizer`, `analyzer_params`: Optional. Defaults to `jieba` if not specified.
Return value
A TEXT array of tokens.
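For example, with the default jieba configuration (whose filter chain lowercases tokens, removes stop words, and stems English words), the output is roughly:
SELECT TOKENIZE('Machine Learning Systems');
-- e.g., {machin,learn,system} under the default filter chain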
Verify index usage
Run EXPLAIN ANALYZE to confirm whether a query uses the full-text inverted index. If the output contains Fulltext Filter, the index is active.
EXPLAIN ANALYZE SELECT * FROM wiki_articles WHERE text_search(content, 'Yangtze River') > 0;
Example output:
QUERY PLAN
Gather (cost=0.00..1.00 rows=1 width=12)
-> Local Gather (cost=0.00..1.00 rows=1 width=12)
-> Index Scan using Clustering_index on wiki_articles (cost=0.00..1.00 rows=1 width=12)
Fulltext Filter: (text_search(content, search_expression => 'Yangtze River'::text, mode => match, operator => OR, tokenizer => jieba, analyzer_params => {"filter":["removepunct","lowercase",{"stop_words":["_english_"],"type":"stop"},{"language":"english","type":"stemmer"}],"tokenizer":{"hmm":true,"mode":"search","type":"jieba"}}, options => ) > '0'::double precision)
Query Queue: init_warehouse.default_queue
Optimizer: HQO version 4.0.0
The Fulltext Filter line confirms the index is used. It also shows the resolved tokenizer and analyzer_params—useful for verifying that the index configuration matches what you expect.
For more on reading execution plans, see EXPLAIN and EXPLAIN ANALYZE.
Examples
Data preparation
Create a test table, add a full-text index, and insert sample data:
-- Create table
CREATE TABLE wiki_articles (id int, content text);
-- Create index
CREATE INDEX ft_idx_1 ON wiki_articles
USING FULLTEXT (content)
WITH (tokenizer = 'jieba');
-- Insert data
INSERT INTO wiki_articles VALUES
(1, 'The Yangtze River is China''s longest river and the world''s third-longest river, about 6,300 km long.'),
(2, 'Li was born in 1962 in Wendeng County, Shandong.'),
(3, 'He graduated from the department of physics at Shandong University.'),
(4, 'The Spring Festival, also known as the Lunar New Year, is China''s most important traditional festival.'),
(5, 'The Spring Festival usually falls between late January and mid-February on the Gregorian calendar. Major customs include pasting spring couplets, setting off firecrackers, eating reunion dinner, and giving New Year greetings.'),
(6, 'In 2006, the Spring Festival was approved by the State Council as part of China''s first batch of national intangible cultural heritage.'),
(7, 'Shandong has dozens of universities.'),
(8, 'ShanDa is a famous university in Shandong.');
-- Trigger compaction to build index files
VACUUM wiki_articles;
Keyword match
-- OR operator (default): matches documents containing 'shandong' or 'university'
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university') > 0;
-- Result
id | content
----+---------------------------------------------------------------------
2 | Li was born in 1962 in Wendeng County, Shandong.
3 | He graduated from the department of physics at Shandong University.
7 | Shandong has dozens of universities.
8 | ShanDa is a famous university in Shandong.
-- AND operator: matches documents containing both 'shandong' and 'university'
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university', operator => 'AND') > 0;
-- Result
id | content
----+---------------------------------------------------------------------
3 | He graduated from the department of physics at Shandong University.
7 | Shandong has dozens of universities.
8 | ShanDa is a famous university in Shandong.
Phrase search
-- Default slop=0: 'shandong' must be immediately followed by 'university'
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university', mode => 'phrase') > 0;
-- Result
id | content
----+---------------------------------------------------------------------
3 | He graduated from the department of physics at Shandong University.
(1 row)
-- slop=14: 'shandong' and 'university' can be up to 14 characters apart
SELECT * FROM wiki_articles
WHERE TEXT_SEARCH(content, 'shandong university', mode => 'phrase', options => 'slop=14;') > 0;
-- Result
id | content
----+---------------------------------------------------------------------
3 | He graduated from the department of physics at Shandong University.
7 | Shandong has dozens of universities.
(2 rows)
-- slop=23: also matches out-of-order phrases ('university of Shandong' pattern)
SELECT * FROM wiki_articles
WHERE TEXT_SEARCH(content, 'shandong university', mode => 'phrase', options => 'slop=23;') > 0;
-- Result
id | content
----+---------------------------------------------------------------------
3 | He graduated from the department of physics at Shandong University.
7 | Shandong has dozens of universities.
8 | ShanDa is a famous university in Shandong.
(3 rows)
Punctuation is ignored in phrase searches. Even if the source text uses a comma between words and the query uses a period, it still matches.
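For instance, with the sample data above, the comma between 'County' and 'Shandong' in document 2 should not block a phrase match; a sketch of the documented behavior (the default jieba filter chain drops punctuation tokens):
SELECT id, content FROM wiki_articles
WHERE TEXT_SEARCH(content, 'county shandong', mode => 'phrase') > 0;
-- Expected to match document 2: '...Wendeng County, Shandong.'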
Natural language search
-- Keyword match with AND/OR: requires 'shandong' AND 'university', OR contains 'culture'
SELECT * FROM wiki_articles
WHERE TEXT_SEARCH(content, '(shandong AND university) OR culture', 'natural_language') > 0;
-- Result
id | content
----+---------------------------------------------------------------------
8 | ShanDa is a famous university in Shandong.
7 | Shandong has dozens of universities.
3 | He graduated from the department of physics at Shandong University.
6 | In 2006, the Spring Festival was approved by the State Council as part of China's first batch of national intangible cultural heritage.
-- Required (+) and excluded (-) terms, with optional boosting term
-- Must contain 'shandong', must not contain 'physics', and may contain 'famous' (boosts score)
SELECT id,
content,
TEXT_SEARCH(content, '+shandong -physics famous', 'natural_language') as score
FROM wiki_articles
WHERE TEXT_SEARCH(content, '+shandong -physics famous', 'natural_language') > 0
ORDER BY score DESC;
-- Result
id | content | score
----+--------------------------------------------------+----------
8 | ShanDa is a famous university in Shandong. | 2.92376
7 | Shandong has dozens of universities. | 0.863399
2 | Li was born in 1962 in Wendeng County, Shandong. | 0.716338
-- Phrase search using double quotes
SELECT * FROM wiki_articles
WHERE TEXT_SEARCH(content, '"shandong university"', 'natural_language') > 0;
-- Match all documents
SELECT * FROM wiki_articles
WHERE TEXT_SEARCH(content, '*', 'natural_language') > 0;
Term search
-- Exact match: 'Spring Festival' exists verbatim in the index
SELECT * FROM wiki_articles
WHERE TEXT_SEARCH(content, 'Spring Festival', 'term') > 0;
-- Result
id | content
----+----------------------------------------------------------------------------------------------
4 | The Spring Festival, also known as the Lunar New Year, is China's most important traditional festival.
  5 | The Spring Festival usually falls between late January and mid-February on the Gregorian calendar. ...
  6 | In 2006, the Spring Festival was approved by the State Council as part of China's first batch of national intangible cultural heritage.
-- No match: 'shandong university' is split by jieba, so this exact string doesn't exist in the index
-- Pair term search with the keyword tokenizer for exact multi-word matches
SELECT * FROM wiki_articles
WHERE TEXT_SEARCH(content, 'shandong university', 'term') > 0;
-- Result
id | content
----+---------
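A sketch of that pairing with the keyword tokenizer, which indexes each value as a single unmodified token (table and index names are illustrative):
CREATE TABLE tags (id int, tag text);
CREATE INDEX tag_idx ON tags
USING FULLTEXT (tag)
WITH (tokenizer = 'keyword');
INSERT INTO tags VALUES (1, 'shandong university'), (2, 'shandong');
VACUUM tags;  -- build index files
-- Term search now matches only the row whose entire value is 'shandong university'
SELECT * FROM tags WHERE TEXT_SEARCH(tag, 'shandong university', 'term') > 0;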
Complex queries
Combine TEXT_SEARCH with other predicates:
-- Filter by both full-text match and primary key
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university') > 0 AND id = 3;
-- Result
id | content
----+---------------------------------------------------------------------
3 | He graduated from the department of physics at Shandong University.
Return BM25 scores and get the top 3 results:
SELECT id,
content,
TEXT_SEARCH(content, 'shandong university') AS score,
TOKENIZE(content, 'jieba')
FROM wiki_articles
ORDER BY score DESC
LIMIT 3;
-- Result
id | content | score | tokenize
----+---------------------------------------------------------------------+---------+--------------------------------------------------
8 | ShanDa is a famous university in Shandong. | 2.74634 | {shanda,famous,univers,shandong}
7 | Shandong has dozens of universities. | 2.74634 | {shandong,has,dozen,univers}
3 | He graduated from the department of physics at Shandong University. | 2.38178 | {he,graduat,from,depart,physic,shandong,univers}
Use TEXT_SEARCH in both SELECT and WHERE to filter and rank in one pass:
SELECT id,
content,
TEXT_SEARCH(content, 'shandong university') AS score,
TOKENIZE(content, 'jieba')
FROM wiki_articles
WHERE TEXT_SEARCH(content, 'shandong university') > 0
ORDER BY score DESC;
Combine with a JOIN to search within a filtered subset:
-- Find the most relevant documents about 'shandong university' from wiki sources only
CREATE TABLE article_source (id int primary key, source text);
INSERT INTO article_source VALUES (1, 'baike'), (2, 'wiki'), (3, 'wiki'), (4, 'baike'),
(5, 'baike'), (6, 'baike'), (7, 'wiki'), (8, 'paper');
SELECT a.id,
source, content,
TEXT_SEARCH(content, 'shandong university') AS score,
TOKENIZE(a.content, 'jieba')
FROM wiki_articles a
JOIN article_source b ON (a.id = b.id)
WHERE TEXT_SEARCH(a.content, 'shandong university') > 0
AND b.source = 'wiki'
ORDER BY score DESC;
-- Result
id | source | content | score | tokenize
----+--------+---------------------------------------------------------------------+---------+--------------------------------------------------
7 | wiki | Shandong has dozens of universities. | 2.74634 | {shandong,has,dozen,univers}
3 | wiki | He graduated from the department of physics at Shandong University. | 2.38178 | {he,graduat,from,depart,physic,shandong,univers}
2 | wiki | Li was born in 1962 in Wendeng County, Shandong. | 1.09244 | {li,born,1962,wendeng,counti,shandong}
Best practices
Rebuild indexes using Serverless Computing
Modifying certain table properties triggers compaction and index rebuilding, which consumes significant CPU resources. The approach depends on what you're modifying.
For `bitmap_columns`, `dictionary_encoding_columns`, or vector indexes:
Use the REBUILD syntax with Serverless Computing resources. Avoid ALTER TABLE ... SET for these properties. For details, see REBUILD.
ASYNC REBUILD TABLE <table_name>
WITH (
rebuild_guc_hg_computing_resource = 'serverless'
)
SET (
bitmap_columns = '<col1>,<col2>',
dictionary_encoding_columns = '<col1>:on,<col2>:off',
vectors = '{
"<col_vector>": {
"algorithm": "HGraph",
"distance_method": "Cosine",
"builder_params": {
"base_quantization_type": "rabitq",
"graph_storage_type": "compressed",
"max_degree": 64,
"ef_construction": 400,
"precise_quantization_type": "fp32",
"use_reorder": true,
"max_total_size_to_merge_mb" : 4096
}
}
}'
);
For full-text index columns or column-oriented JSONB columns:
The REBUILD syntax is not supported for these types. Instead, create a new table, copy data using Serverless Computing, and swap the tables:
BEGIN;
-- Clean up any potential temporary tables
DROP TABLE IF EXISTS <table_new>;
-- Create a temporary table with the same structure
SET hg_experimental_enable_create_table_like_properties=on;
CALL HG_CREATE_TABLE_LIKE ('<table_new>', 'select * from <table>');
COMMIT;
-- Enable column-oriented storage for the target column (if needed)
ALTER TABLE <table_new> ALTER COLUMN <column_name> SET (enable_columnar_type = ON);
-- Create the full-text index on the new table
CREATE INDEX <idx_name> ON <table_new> USING FULLTEXT (column_name);
-- Copy data using Serverless Computing to build indexes synchronously
SET hg_computing_resource = 'serverless';
INSERT INTO <table_new> SELECT * FROM <table>;
ANALYZE <table_new>;
BEGIN;
-- Swap old and new tables
DROP TABLE IF EXISTS <table>;
ALTER TABLE <table_new> RENAME TO <table>;
COMMIT;
For other properties (distribution_key, clustering_key, segment_key, storage format):
Use the REBUILD syntax with Serverless Computing resources.
Advanced: Customize tokenizer configuration
The default tokenizer configuration works for most use cases. Customize analyzer_params only when your workload requires specific tokenization behavior.
`analyzer_params` requirements
- Must be a valid JSON string.
- Top-level keys are `tokenizer` and `filter`.
`tokenizer` object
Required. A JSON object with the following structure:
- `type` (required): The tokenizer name.
- Additional tokenizer-specific parameters. See the tables below.
`filter` array
Optional. An array of filter configurations applied to each token in order.
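Putting the two keys together, a sketch of a custom configuration that pairs the standard tokenizer with a lowercase filter and an extended stop-word list (table, column, index name, and the extra stop word are illustrative):
CREATE INDEX idx_custom ON tbl
USING FULLTEXT (col1)
WITH (tokenizer = 'standard',
      analyzer_params = '{"tokenizer":{"type":"standard","max_token_length":255},"filter":["lowercase",{"type":"stop","stop_words":["_english_","lorem"]}]}');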
Tokenizer-specific parameters
jieba
| Parameter | Description | Values |
|---|---|---|
| `mode` | Tokenization mode. | `search` (default): lists all possible combinations, including overlapping tokens (e.g., "traditional festival" becomes "tradition", "festival", and "traditional festival"). `exact`: non-redundant tokenization (e.g., "traditional festival" becomes "traditional festival" only). |
| `hmm` | Whether to use a Hidden Markov Model (HMM) to detect out-of-vocabulary (OOV) words. | `true` (default), `false` |
standard
| Parameter | Description | Values |
|---|---|---|
| `max_token_length` | Maximum token length. Tokens exceeding this length are split at the boundary. | Positive integer. Default: 255. |
ik
| Parameter | Description | Values |
|---|---|---|
| `mode` | Tokenization mode. | `ik_max_word` (default): fine-grained; outputs all possible short words. `ik_smart`: coarse-grained; prioritizes longer words to reduce splits. |
| `enable_lowercase` | Whether to convert tokens to lowercase. | `true` (default), `false` |
ngram
| Parameter | Description | Values |
|---|---|---|
| `min_gram` | Minimum token length. | Positive integer. Default: 1. Maximum difference from max_gram is 3. |
| `max_gram` | Maximum token length. | Integer in [1, 255]. Default: 2. Maximum difference from min_gram is 3. |
| `prefix_only` | Whether to generate only prefix n-grams. | `true`, `false` (default) |
If the difference between `max_gram` and `min_gram` exceeds 3, the tokenizer generates too many tokens, increasing resource consumption and index build time. Override the limit with `SET hg_fulltext_index_max_ngram_diff = <value>;`.
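For example, a sketch that widens the gram range beyond the default limit (table and column names are illustrative):
-- Allow a max_gram/min_gram difference of up to 5 for this session
SET hg_fulltext_index_max_ngram_diff = 5;
CREATE INDEX idx_ngram ON logs
USING FULLTEXT (message)
WITH (tokenizer = 'ngram',
      analyzer_params = '{"tokenizer":{"type":"ngram","min_gram":2,"max_gram":7,"prefix_only":false}}');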
pinyin
| Parameter | Description | Values |
|---|---|---|
| `keep_first_letter` | Whether to keep the initial letter of each Chinese character as a combined token (e.g., "Lǐ Míng" becomes "lm"). | `true` (default), `false` |
| `keep_separate_first_letter` | Whether to keep each initial as a separate token (e.g., "Lǐ Míng" becomes "l", "m"). | `true`, `false` (default) |
| `limit_first_letter_length` | Maximum length of the combined initials token. | Integer. Default: 16. |
| `keep_full_pinyin` | Whether to keep full pinyin for each character (e.g., "Lǐ Míng" becomes "li", "ming"). | `true` (default), `false` |
| `keep_joined_full_pinyin` | Whether to join full pinyin syllables (e.g., "李明" becomes "liming"). | `true`, `false` (default) |
| `keep_none_chinese` | Whether to keep non-Chinese letters or numbers. | `true` (default), `false` |
| `keep_none_chinese_together` | Whether to keep consecutive non-Chinese characters as one token (e.g., "DJ李明" becomes "DJ", "li", "ming"). Requires `keep_none_chinese=true`. | `true` (default), `false` |
| `keep_none_chinese_in_first_letter` | Whether to include non-Chinese characters in the initials token (e.g., "李明AT2025" becomes "lmat2025"). | `true` (default), `false` |
| `keep_none_chinese_in_joined_full_pinyin` | Whether to include non-Chinese characters in the joined pinyin token (e.g., "李明AT2025" becomes "limingat2025"). | `true`, `false` (default) |
| `none_chinese_pinyin_tokenize` | Whether to split valid pinyin syllables in non-Chinese sequences (e.g., "limingalibaba2025" becomes "li", "ming", "a", "li", "ba", "ba", "2025"). Requires `keep_none_chinese=true` and `keep_none_chinese_together=true`. | `true` (default), `false` |
| `keep_original` | Whether to keep the original input as a token. | `true`, `false` (default) |
| `lowercase` | Whether to convert non-Chinese letters to lowercase. | `true` (default), `false` |
| `trim_whitespace` | Whether to trim whitespace characters. | `true` (default), `false` |
| `remove_duplicated_term` | Whether to remove duplicate tokens (e.g., "de的" becomes "de"). May affect phrase query results. | `true`, `false` (default) |
| `keep_separate_chinese` | Whether to keep individual Chinese characters as separate tokens (e.g., "李明" becomes "李", "明"). | `true`, `false` (default) |
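With the defaults (keep_full_pinyin and keep_first_letter enabled), tokenizing a two-character Chinese name yields full-pinyin tokens plus a combined-initials token, roughly:
SELECT TOKENIZE('李明', 'pinyin');
-- e.g., {li,ming,lm} under the default pinyin configuration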
Default analyzer_params
| Tokenizer | Default analyzer_params |
|---|---|
| `jieba` (default) | `{"tokenizer":{"type":"jieba","mode":"search","hmm":true},"filter":["removepunct","lowercase",{"type":"stop","stop_words":["_english_"]},{"type":"stemmer","language":"english"}]}` |
| `whitespace` | `{"tokenizer":{"type":"whitespace"}}` |
| `keyword` | `{"tokenizer":{"type":"keyword"}}` |
| `simple` | `{"tokenizer":{"type":"simple"},"filter":["lowercase"]}` |
| `standard` | `{"tokenizer":{"type":"standard","max_token_length":255},"filter":["lowercase"]}` |
| `icu` | `{"tokenizer":{"type":"icu"},"filter":["removepunct","lowercase"]}` |
| `ik` | `{"tokenizer":{"type":"ik","mode":"ik_max_word","enable_lowercase":true},"filter":[{"type":"stop","stop_words":["_english_"]},{"type":"stemmer","language":"english"}]}` |
| `ngram` | `{"tokenizer":{"type":"ngram","min_gram":1,"max_gram":2,"prefix_only":false}}` |
| `pinyin` | `{"tokenizer":{"type":"pinyin","keep_first_letter":true,"keep_separate_first_letter":false,"keep_full_pinyin":true,"keep_joined_full_pinyin":false,"keep_none_chinese":true,"keep_none_chinese_together":true,"none_chinese_pinyin_tokenize":true,"keep_original":false,"limit_first_letter_length":16,"lowercase":true,"trim_whitespace":true,"keep_none_chinese_in_first_letter":true,"keep_none_chinese_in_joined_full_pinyin":false,"remove_duplicated_term":false,"ignore_pinyin_offset":true,"fixed_pinyin_offset":false,"keep_separate_chinese":false}}` |
Filters
Filters apply to each token in the order listed. Supported filters:
| Filter | Description | Format | Example |
|---|---|---|---|
| `lowercase` | Converts tokens to lowercase. | `"lowercase"` | ["Hello", "WORLD"] → ["hello", "world"] |
| `stop` | Removes stop-word tokens. Supports custom stop words and built-in language dictionaries. | `{"type":"stop","stop_words":["_english_","cat"]}` | Built-in dictionaries: _english_, _danish_, _dutch_, _finnish_, _french_, _german_, _hungarian_, _italian_, _norwegian_, _portuguese_, _russian_, _spanish_, _swedish_ |
| `stemmer` | Reduces tokens to their root form. | `{"type":"stemmer","language":"english"}` | ["machine","learning"] → ["machin","learn"]. Supported languages: arabic, danish, dutch, english, finnish, french, german, greek, hungarian, italian, norwegian, portuguese, romanian, russian, spanish, swedish, tamil, turkish. |
| `length` | Removes tokens longer than `max`. | `{"type":"length","max":10}` | ["AI","for","Artificial","Intelligence"] → ["AI","for","Artificial"] |
| `removepunct` | Removes tokens consisting entirely of punctuation (default mode) or containing any punctuation. Available in version 4.0.8 and later. | `"removepunct"` or `{"type":"removepunct","mode":"if_all"}` or `{"type":"removepunct","mode":"if_any"}` | `if_all` (default): removes only all-punctuation tokens. `if_any`: removes tokens with any punctuation. |
| `pinyin` | Pinyin token filter. Uses the same parameters as the `pinyin` tokenizer. | JSON object with pinyin parameters | — |