Hologres: Full-Text Inverted Index

Last Updated: Mar 26, 2026

When your workload requires keyword search, phrase matching, or relevance-ranked retrieval over text columns, a full-text inverted index provides fast, BM25-scored search that scales far beyond brute-force scans. Hologres supports full-text inverted indexes starting from version 4.0, built on the Tantivy search engine.

How it works

When you write text data to Hologres, the system builds a full-text inverted index for each data file based on your index configuration. A tokenizer splits the text into tokens. The index then records the mapping between each token and its source document, along with position and term frequency information.

At query time, Hologres tokenizes the query string, then uses the BM25 algorithm to compute a relevance score for each document against the query tokens—returning results fast and in relevance order.
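
The scoring step above can be made concrete with the textbook Okapi BM25 formula (a sketch of the standard form; Tantivy's exact parameter defaults are not specified in this topic):

```latex
\text{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot
  \frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
```

Here f(t, D) is the frequency of token t in document D, |D| is the document length in tokens, avgdl is the average document length, and k1 and b are tuning constants (commonly k1 ≈ 1.2, b = 0.75). Because IDF and avgdl are corpus statistics, computing them per file rather than per table is why per-file scores can be less accurate on small, fragmented data (see Limits).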

Limits

  • Full-text inverted indexes are available only on column-oriented tables and hybrid row-column tables in Hologres version 4.0 and later. Row-oriented tables are not supported.

  • Indexes can be created only on columns of type TEXT, CHAR, or VARCHAR.

  • Each index covers one column. Each column supports one full-text inverted index. To index multiple columns, create separate indexes for each.

  • Full-text search runs only on indexed columns. Querying unindexed columns is not supported.

  • BM25 scores are calculated per file. For small data volumes, trigger compaction manually to merge files and improve search accuracy.

  • After you create a full-text inverted index, Hologres builds the index files asynchronously during compaction. This applies to both existing data and new batch-inserted data. Until compaction completes, BM25 relevance scores for that data are zero.

  • For real-time writes after creating an index: before version 4.0.8, indexes were built synchronously as data arrived. Starting in version 4.0.8, Hologres refreshes the in-memory index every second. Data is queryable only after each refresh.

  • Use Serverless Computing resources for batch data imports. Serverless resources perform compaction and build full-text indexes synchronously during import. See Run Read/Write Tasks with Serverless Computing and Run Compaction Tasks with Serverless Computing.

  • Full-text search queries can run using Serverless Computing resources.
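
As a minimal sketch of the serverless batch-import pattern described above (table names are placeholders; the session-level GUC is the one used in the Best practices section later in this topic):

```sql
-- Run this session's import on Serverless Computing resources,
-- which perform compaction and build full-text index files synchronously
SET hg_computing_resource = 'serverless';

-- Batch-import into the indexed table
INSERT INTO target_table SELECT * FROM staging_table;

-- Refresh statistics after the bulk load
ANALYZE target_table;
```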

Choose a tokenizer

Select a tokenizer based on your text type and search requirements:

Use case | Tokenizer | Notes
Keyword extraction from long articles | jieba | Supports new-word discovery and complex pattern switching.
Chinese descriptive text search | ik | Accurately identifies Chinese terms. Available in version 4.0.9 and later.
English title or plain text search | simple, whitespace, standard | Simple and efficient. Choose based on your text format.
Fuzzy log text search | ngram | No dictionary required. Supports fuzzy matching. Available in version 4.0.9 and later.
Pinyin search for Chinese names or products | pinyin | Supports full pinyin, initials, and polyphonic character inference. Available in version 4.0.9 and later.

For tokenizer parameter details, see Customize tokenizer configuration.
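
Before deciding, you can inspect tokenizer output directly with the TOKENIZE function (documented later in this topic); for example, to compare three candidates on the same sample string:

```sql
-- Compare tokenizer output on identical input
SELECT TOKENIZE('Hologres full-text search', 'standard')   AS standard_tokens,
       TOKENIZE('Hologres full-text search', 'whitespace') AS whitespace_tokens,
       TOKENIZE('Hologres full-text search', 'simple')     AS simple_tokens;
```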

Manage indexes

Create an index

Syntax

CREATE INDEX [ IF NOT EXISTS ] idx_name ON table_name
       USING FULLTEXT (column_name [ , ... ])
       [ WITH ( storage_parameter [ = value ] [ , ... ] ) ];

Parameters

Parameter | Description
idx_name | The index name.
table_name | The target table.
column_name | The column to index. Must be of type TEXT, CHAR, or VARCHAR.
storage_parameter | Index configuration. Supported parameters: tokenizer, analyzer_params, and index_options. See the following sections.

`tokenizer`

The tokenizer to use. Defaults to jieba. Supported values:

Tokenizer | Description | Available from
jieba (default) | Chinese tokenizer combining rule-based matching and statistical models. | 4.0
whitespace | Splits text on spaces. | 4.0
standard | Tokenizes using Unicode Standard Annex #29. | 4.0
simple | Splits text on spaces and punctuation. | 4.0
keyword | Returns the input unchanged without any processing. | 4.0
icu | Multilingual tokenizer. | 4.0
ik | Chinese tokenizer based on IK Analyzer. Automatically detects English words, email addresses, URLs (without ://), and IP addresses. | 4.0.9
ngram | Character-level sliding window tokenizer. Splits text into continuous n-character sequences to improve recall and fuzzy matching. Ideal for accelerating LIKE and ILIKE queries. | 4.0.9
pinyin | Generates pinyin for Chinese characters and words, and infers pinyin for non-Chinese strings. | 4.0.9

Note: Each index supports only one tokenizer and one analyzer_params setting.

`analyzer_params`

Tokenizer configuration as a JSON string. Each tokenizer has defaults that work for most use cases—specify only tokenizer and omit analyzer_params unless you need custom behavior. For details, see Customize tokenizer configuration.

`index_options`

Controls how much information the index stores and which query features are available. Available in version 4.1.9 and later.

index_options supports three levels. Higher levels include all information from lower levels: freqs includes docs, and positions includes freqs and docs.

Value | Index content | Supported features | Typical use case
positions (default) | Document ID + term frequency + term position | Full feature support: phrase queries and standard relevance scoring | General full-text search
freqs | Document ID + term frequency | No phrase queries (position data missing) | Scoring by term frequency without exact phrase matching
docs | Document ID only | No phrase queries; all matching documents get the same TF score | Existence checks only, storage-sensitive scenarios

Note: For indexes using the keyword tokenizer, the record level is fixed at docs regardless of index_options.

Examples

Create an index using the default tokenizer (jieba) and default configuration:

CREATE INDEX idx1 ON tbl
       USING FULLTEXT (col1);

Use the ik tokenizer with its default configuration:

CREATE INDEX idx1 ON tbl
       USING FULLTEXT (col1)
       WITH (tokenizer = 'ik');

Use a custom jieba configuration—exact mode with lowercase filter only:

CREATE INDEX idx1 ON tbl
       USING FULLTEXT (col1)
       WITH (tokenizer = 'jieba',
             analyzer_params = '{"tokenizer":{"type":"jieba","mode":"exact"}, "filter":["lowercase"]}');

Set index_options to freqs to save storage at the cost of disabling phrase queries (version 4.1.9 and later):

CREATE INDEX idx1 ON tbl
       USING FULLTEXT (col1)
       WITH (index_options = 'freqs');

Build indexes after data import

Index files are built during compaction, not immediately after the CREATE INDEX statement. To build index files right away, either trigger compaction manually with VACUUM <schema_name>.<table_name>;, or perform the import with Serverless Computing resources, which build full-text indexes synchronously (see Limits).
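
For example, after a batch load you can trigger compaction manually so the index files are built at once (schema and table names are placeholders):

```sql
-- Build index files immediately instead of waiting for background compaction
VACUUM my_schema.my_table;
```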

Modify an index

Syntax

-- Modify index configuration
ALTER INDEX [ IF EXISTS ] <idx_name> SET ( <storage_parameter> = '<storage_value>' [ , ... ] );

-- Restore default configuration
ALTER INDEX [ IF EXISTS ] <idx_name> RESET ( <storage_parameter> [ , ... ] );

After modifying a full-text inverted index, index files are rebuilt during the next compaction. Run VACUUM <schema_name>.<table_name>; to trigger compaction immediately. See Compaction.

Examples

Change the tokenizer to standard:

ALTER INDEX idx1 SET (tokenizer = 'standard');

Change the tokenizer to ik in ik_max_word mode with lowercase conversion disabled:

ALTER INDEX idx1 SET (
  tokenizer = 'ik',
  analyzer_params = '{"tokenizer":{"type":"ik","mode":"ik_max_word","enable_lowercase": false}}'
);

Restore the default jieba tokenizer and its default analyzer_params:

ALTER INDEX idx1 RESET (tokenizer, analyzer_params);

Restore index_options to the default (positions):

ALTER INDEX idx1 RESET (index_options);

Change index_options to docs:

ALTER INDEX idx1 SET (index_options = 'docs');

Delete an index

Syntax

DROP INDEX [ IF EXISTS ] <idx_name> [ RESTRICT ];

View indexes

Use the system table hologres.hg_index_properties to list all full-text inverted indexes:

SELECT * FROM hologres.hg_index_properties;

To find the table and column associated with a specific index:

SELECT
    t.relname AS table_name,
    a.attname AS column_name
FROM pg_class t
    JOIN pg_index i ON t.oid = i.indrelid
    JOIN pg_class idx ON i.indexrelid = idx.oid
    JOIN pg_attribute a ON a.attrelid = t.oid AND a.attnum = ANY(i.indkey)
WHERE t.relnamespace = (SELECT oid FROM pg_namespace WHERE nspname = '<namespace>')
    AND idx.relname = '<indexname>'
LIMIT 1;

Replace the placeholders:

  • <namespace>: The table_namespace value from SELECT * FROM hologres.hg_index_properties;.

  • <indexname>: The index name you created.

Search with full-text inverted indexes

Hologres supports four search modes. Use the TEXT_SEARCH function to run any of them.

Search mode | Description
Keyword match | Matches documents containing any (OR) or all (AND) of the query keywords.
Phrase search | Matches documents where query tokens appear within a specified distance.
Natural language search | Supports complex queries with AND/OR logic, required/excluded terms, and embedded phrases.
Term search | Exact match on the query string; no tokenization applied.

TEXT_SEARCH function

TEXT_SEARCH computes a BM25 relevance score for a query against source text.

Syntax

TEXT_SEARCH (
  <search_data> TEXT/VARCHAR/CHAR
  ,<search_expression> TEXT
  [ ,<mode> TEXT DEFAULT 'match'
  ,<operator> TEXT DEFAULT 'OR'
  ,<tokenizer> TEXT DEFAULT ''
  ,<analyzer_params> TEXT DEFAULT ''
  ,<options> TEXT DEFAULT '']
)

Parameters

Parameter | Required | Description
search_data | Yes | The source column to search. Must be TEXT, VARCHAR, or CHAR, and must have a full-text index. Queries on unindexed columns fail.
search_expression | Yes | The query string. Must be a constant. Supports TEXT, VARCHAR, and CHAR.
mode | No | The search mode. Supported values: match (default), phrase, natural_language, term. See below.
operator | No | Logical operator between keywords. Applies only when mode is match. Supported values: OR (default) or AND.
tokenizer, analyzer_params | No | Tokenizer and configuration to apply to search_expression. If not specified, uses the same tokenizer as the index on search_data. If search_data is a constant, defaults to jieba.
options | No | Additional search parameters in the format 'key1=v1;key2=v2;'. Currently supports only slop (for phrase mode). Valid values: 0 (default) or any positive integer. Defines the maximum allowed distance between phrase terms.

`mode` values

Value | Behavior
match (default) | Keyword match. Each token in search_expression is a keyword. Use operator to define AND/OR logic between keywords.
phrase | Phrase search. Tokens must appear within the distance defined by slop. Default slop is 0 (adjacent tokens only).
natural_language | Complex queries. Supports AND/OR operators, required (+) and excluded (-) terms, and quoted phrases. See Tantivy QueryParser.
term | Exact match. search_expression is matched as-is against the index, without tokenization.

`slop` unit by tokenizer

The slop parameter in phrase mode measures distance differently depending on the tokenizer:

  • Characters: jieba, keyword, icu

  • Tokens: standard, simple, whitespace

Return value

A non-negative FLOAT representing the BM25 relevance score. Higher values indicate greater relevance. A score of 0 means no match.

Examples

Keyword match with AND operator:

-- Recommended: use named parameters
SELECT TEXT_SEARCH(content, 'machine learning', operator => 'AND') FROM tbl;

Phrase search with slop=2:

SELECT TEXT_SEARCH(content, 'machine learning', 'phrase', options => 'slop=2;') FROM tbl;

Natural language search with AND/OR logic:

SELECT TEXT_SEARCH(content, 'machine AND (system OR recognition)', 'natural_language') FROM tbl;

Natural language search with required and excluded terms:

-- Must contain 'learning', must not contain 'machine'
SELECT TEXT_SEARCH(content, '+learning -machine system', 'natural_language') FROM tbl;

Term search (exact match):

SELECT TEXT_SEARCH(content, 'machine learning', 'term') FROM tbl;

TOKENIZE function

TOKENIZE returns the tokens produced by a given tokenizer. Use it to test and debug tokenization behavior before building indexes.

Syntax

TOKENIZE (
  <search_data> TEXT
  [ ,<tokenizer> TEXT DEFAULT ''
  ,<analyzer_params> TEXT DEFAULT '']
)

Parameters

  • search_data: Required. The text to tokenize. Accepts constants.

  • tokenizer, analyzer_params: Optional. Defaults to jieba if not specified.

Return value

A TEXT array of tokens.
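
For example, to check how a custom jieba configuration changes the output before creating an index (the sample text is illustrative):

```sql
-- Default jieba configuration
SELECT TOKENIZE('machine learning systems');

-- Exact mode with HMM disabled
SELECT TOKENIZE('machine learning systems', 'jieba',
                '{"tokenizer":{"type":"jieba","mode":"exact","hmm":false}}');
```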

Verify index usage

Run EXPLAIN ANALYZE to confirm whether a query uses the full-text inverted index. If the output contains Fulltext Filter, the index is active.

EXPLAIN ANALYZE SELECT * FROM wiki_articles WHERE text_search(content, 'Yangtze River') > 0;

Example output:

QUERY PLAN
Gather  (cost=0.00..1.00 rows=1 width=12)
  ->  Local Gather  (cost=0.00..1.00 rows=1 width=12)
        ->  Index Scan using Clustering_index on wiki_articles  (cost=0.00..1.00 rows=1 width=12)
              Fulltext Filter: (text_search(content, search_expression => 'Yangtze River'::text, mode => match, operator => OR, tokenizer => jieba, analyzer_params => {"filter":["removepunct","lowercase",{"stop_words":["_english_"],"type":"stop"},{"language":"english","type":"stemmer"}],"tokenizer":{"hmm":true,"mode":"search","type":"jieba"}}, options => ) > '0'::double precision)
Query Queue: init_warehouse.default_queue
Optimizer: HQO version 4.0.0

The Fulltext Filter line confirms the index is used. It also shows the resolved tokenizer and analyzer_params—useful for verifying that the index configuration matches what you expect.

For more on reading execution plans, see EXPLAIN and EXPLAIN ANALYZE.

Examples

Data preparation

Create a test table, add a full-text index, and insert sample data:

-- Create table
CREATE TABLE wiki_articles (id int, content text);

-- Create index
CREATE INDEX ft_idx_1 ON wiki_articles
       USING FULLTEXT (content)
       WITH (tokenizer = 'jieba');

-- Insert data
INSERT INTO wiki_articles VALUES
  (1, 'The Yangtze River is China''s longest river and the world''s third-longest river, about 6,300 km long.'),
  (2, 'Li was born in 1962 in Wendeng County, Shandong.'),
  (3, 'He graduated from the department of physics at Shandong University.'),
  (4, 'The Spring Festival, also known as the Lunar New Year, is China''s most important traditional festival.'),
  (5, 'The Spring Festival usually falls between late January and mid-February on the Gregorian calendar. Major customs include pasting spring couplets, setting off firecrackers, eating reunion dinner, and giving New Year greetings.'),
  (6, 'In 2006, the Spring Festival was approved by the State Council as part of China''s first batch of national intangible cultural heritage.'),
  (7, 'Shandong has dozens of universities.'),
  (8, 'ShanDa is a famous university in Shandong.');

-- Trigger compaction to build index files
VACUUM wiki_articles;

Keyword match

-- OR operator (default): matches documents containing 'shandong' or 'university'
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university') > 0;

-- Result
 id |                               content
----+---------------------------------------------------------------------
  2 | Li was born in 1962 in Wendeng County, Shandong.
  3 | He graduated from the department of physics at Shandong University.
  7 | Shandong has dozens of universities.
  8 | ShanDa is a famous university in Shandong.

-- AND operator: matches documents containing both 'shandong' and 'university'
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university', operator => 'AND') > 0;

-- Result
 id |                               content
----+---------------------------------------------------------------------
  3 | He graduated from the department of physics at Shandong University.
  7 | Shandong has dozens of universities.
  8 | ShanDa is a famous university in Shandong.

Phrase search

-- Default slop=0: 'shandong' must be immediately followed by 'university'
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university', mode => 'phrase') > 0;

-- Result
 id |                               content
----+---------------------------------------------------------------------
  3 | He graduated from the department of physics at Shandong University.
(1 row)

-- slop=14: 'shandong' and 'university' can be up to 14 characters apart
SELECT * FROM wiki_articles
        WHERE TEXT_SEARCH(content, 'shandong university', mode => 'phrase', options => 'slop=14;') > 0;

-- Result
 id |                               content
----+---------------------------------------------------------------------
  3 | He graduated from the department of physics at Shandong University.
  7 | Shandong has dozens of universities.
(2 rows)

-- slop=23: also matches out-of-order phrases ('university of Shandong' pattern)
SELECT * FROM wiki_articles
        WHERE TEXT_SEARCH(content, 'shandong university', mode => 'phrase', options => 'slop=23;') > 0;

-- Result
 id |                               content
----+---------------------------------------------------------------------
  3 | He graduated from the department of physics at Shandong University.
  7 | Shandong has dozens of universities.
  8 | ShanDa is a famous university in Shandong.
(3 rows)

Note: Punctuation is ignored in phrase searches. Even if the source text uses a comma between words and the query uses a period, it still matches.

Natural language search

-- Keyword match with AND/OR: requires 'shandong' AND 'university', OR contains 'culture'
SELECT * FROM wiki_articles
        WHERE TEXT_SEARCH(content, '(shandong AND university) OR culture', 'natural_language') > 0;

-- Result
 id |                               content
----+---------------------------------------------------------------------
  8 | ShanDa is a famous university in Shandong.
  7 | Shandong has dozens of universities.
  3 | He graduated from the department of physics at Shandong University.
  6 | In 2006, the Spring Festival was approved by the State Council as part of China's first batch of national intangible cultural heritage.

-- Required (+) and excluded (-) terms, with optional boosting term
-- Must contain 'shandong', must not contain 'physics', and may contain 'famous' (boosts score)
SELECT id,
       content,
       TEXT_SEARCH(content, '+shandong -physics famous', 'natural_language') as score
FROM wiki_articles
WHERE TEXT_SEARCH(content, '+shandong -physics famous', 'natural_language') > 0
ORDER BY score DESC;

-- Result
 id |                     content                      |  score
----+--------------------------------------------------+----------
  8 | ShanDa is a famous university in Shandong.       |  2.92376
  7 | Shandong has dozens of universities.             | 0.863399
  2 | Li was born in 1962 in Wendeng County, Shandong. | 0.716338

-- Phrase search using double quotes
SELECT * FROM wiki_articles
        WHERE TEXT_SEARCH(content, '"shandong university"', 'natural_language') > 0;

-- Match all documents
SELECT * FROM wiki_articles
        WHERE TEXT_SEARCH(content, '*', 'natural_language') > 0;

Term search

-- Exact match: 'Spring Festival' exists verbatim in the index
SELECT * FROM wiki_articles
         WHERE TEXT_SEARCH(content, 'Spring Festival', 'term') > 0;

-- Result
 id |                                           content
----+----------------------------------------------------------------------------------------------
  4 | The Spring Festival, also known as the Lunar New Year, is China's most important traditional festival.
  5 | The Spring Festival usually falls between late January and mid-February on the Gregorian calendar. ...
  6 | In 2006, the Spring Festival was approved by the State Council as part of China's first batch of national intangible cultural heritage.

-- No match: 'shandong university' is split by jieba, so this exact string doesn't exist in the index
-- Pair term search with the keyword tokenizer for exact multi-word matches
SELECT * FROM wiki_articles
         WHERE TEXT_SEARCH(content, 'shandong university', 'term') > 0;

-- Result
 id | content
----+---------

Complex queries

Combine TEXT_SEARCH with other predicates:

-- Filter by both full-text match and primary key
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university') > 0 AND id = 3;

-- Result
 id |                               content
----+---------------------------------------------------------------------
  3 | He graduated from the department of physics at Shandong University.

Return BM25 scores and get the top 3 results:

SELECT id,
       content,
       TEXT_SEARCH(content, 'shandong university') AS score,
       TOKENIZE(content, 'jieba')
  FROM wiki_articles
ORDER BY score DESC
LIMIT 3;

-- Result
id  |                               content                               |  score  |                     tokenize
----+---------------------------------------------------------------------+---------+--------------------------------------------------
  8 | ShanDa is a famous university in Shandong.                          | 2.74634 | {shanda,famous,univers,shandong}
  7 | Shandong has dozens of universities.                                | 2.74634 | {shandong,has,dozen,univers}
  3 | He graduated from the department of physics at Shandong University. | 2.38178 | {he,graduat,from,depart,physic,shandong,univers}

Use TEXT_SEARCH in both SELECT and WHERE to filter and rank in one pass:

SELECT id,
       content,
       TEXT_SEARCH(content, 'shandong university') AS score,
       TOKENIZE(content, 'jieba')
  FROM wiki_articles
 WHERE TEXT_SEARCH(content, 'shandong university') > 0
ORDER BY score DESC;

Combine with a JOIN to search within a filtered subset:

-- Find the most relevant documents about 'shandong university' from wiki sources only
CREATE TABLE article_source (id int primary key, source text);
INSERT INTO article_source VALUES (1, 'baike'), (2, 'wiki'), (3, 'wiki'), (4, 'baike'),
                                  (5, 'baike'), (6, 'baike'), (7, 'wiki'), (8, 'paper');

SELECT a.id,
       source, content,
       TEXT_SEARCH(content, 'shandong university') AS score,
       TOKENIZE(a.content, 'jieba')
  FROM wiki_articles a
  JOIN article_source b ON (a.id = b.id)
 WHERE TEXT_SEARCH(a.content, 'shandong university') > 0
   AND b.source = 'wiki'
ORDER BY score DESC;

-- Result
id  | source |                               content                               |  score  |                     tokenize
----+--------+---------------------------------------------------------------------+---------+--------------------------------------------------
  7 | wiki   | Shandong has dozens of universities.                                | 2.74634 | {shandong,has,dozen,univers}
  3 | wiki   | He graduated from the department of physics at Shandong University. | 2.38178 | {he,graduat,from,depart,physic,shandong,univers}
  2 | wiki   | Li was born in 1962 in Wendeng County, Shandong.                    | 1.09244 | {li,born,1962,wendeng,counti,shandong}

Best practices

Rebuild indexes using Serverless Computing

Modifying certain table properties triggers compaction and index rebuilding, which consumes significant CPU resources. The approach depends on what you're modifying.

For `bitmap_columns`, `dictionary_encoding_columns`, or vector indexes:

Use the REBUILD syntax with Serverless Computing resources. Avoid ALTER TABLE ... SET for these properties. For details, see REBUILD.

ASYNC REBUILD TABLE <table_name>
WITH (
    rebuild_guc_hg_computing_resource = 'serverless'
)
SET (
    bitmap_columns = '<col1>,<col2>',
    dictionary_encoding_columns = '<col1>:on,<col2>:off',
    vectors = '{
    "<col_vector>": {
        "algorithm": "HGraph",
        "distance_method": "Cosine",
        "builder_params": {
            "base_quantization_type": "rabitq",
            "graph_storage_type": "compressed",
            "max_degree": 64,
            "ef_construction": 400,
            "precise_quantization_type": "fp32",
            "use_reorder": true,
            "max_total_size_to_merge_mb" : 4096
        }
    }
    }'
);

For full-text index columns or column-oriented JSONB columns:

The REBUILD syntax is not supported for these types. Instead, create a new table, copy data using Serverless Computing, and swap the tables:

BEGIN;
-- Clean up any potential temporary tables
DROP TABLE IF EXISTS <table_new>;
-- Create a temporary table with the same structure
SET hg_experimental_enable_create_table_like_properties=on;
CALL HG_CREATE_TABLE_LIKE ('<table_new>', 'select * from <table>');
COMMIT;

-- Enable column-oriented storage for the target column (if needed)
ALTER TABLE <table_new> ALTER COLUMN <column_name> SET (enable_columnar_type = ON);
-- Create the full-text index on the new table
CREATE INDEX <idx_name> ON <table_new> USING FULLTEXT (column_name);

-- Copy data using Serverless Computing to build indexes synchronously
SET hg_computing_resource = 'serverless';
INSERT INTO <table_new> SELECT * FROM <table>;
ANALYZE <table_new>;

BEGIN;
-- Swap old and new tables
DROP TABLE IF EXISTS <table>;
ALTER TABLE <table_new> RENAME TO <table>;
COMMIT;

For other properties (distribution_key, clustering_key, segment_key, storage format):

Use the REBUILD syntax with Serverless Computing resources.

Advanced: Customize tokenizer configuration

The default tokenizer configuration works for most use cases. Customize analyzer_params only when your workload requires specific tokenization behavior.

`analyzer_params` requirements

  • Must be a valid JSON string.

  • Top-level keys are tokenizer and filter.

`tokenizer` object

Required. A JSON object with the following structure:

  • type (required): The tokenizer name.

  • Additional tokenizer-specific parameters. See the table below.

`filter` array

Optional. An array of filter configurations applied to each token in order.

Tokenizer-specific parameters

jieba

Parameter | Description | Values
mode | Tokenization mode. search lists all possible combinations, including overlapping tokens (e.g., "traditional festival" becomes "tradition", "festival", and "traditional festival"); exact produces non-redundant tokenization (e.g., "traditional festival" stays a single token). | search (default), exact
hmm | Whether to use a Hidden Markov Model (HMM) to detect out-of-vocabulary (OOV) words. | true (default), false

standard

Parameter | Description | Values
max_token_length | Maximum token length. Tokens exceeding this length are split at the boundary. | Positive integer. Default: 255.

ik

Parameter | Description | Values
mode | Tokenization mode. ik_max_word is fine-grained and outputs all possible short words; ik_smart is coarse-grained and prioritizes longer words to reduce splits. | ik_max_word (default), ik_smart
enable_lowercase | Whether to convert tokens to lowercase. | true (default), false

ngram

Parameter | Description | Values
min_gram | Minimum token length. Maximum difference from max_gram is 3. | Positive integer. Default: 1.
max_gram | Maximum token length. Maximum difference from min_gram is 3. | Integer in [1, 255]. Default: 2.
prefix_only | Whether to generate only prefix n-grams. | true, false (default)

Note: If the difference between max_gram and min_gram exceeds 3, the tokenizer generates too many tokens, increasing resource consumption and index build time. Override the limit with SET hg_fulltext_index_max_ngram_diff = <value>.
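
To preview the effect of min_gram and max_gram before building an index, you can run TOKENIZE with a custom ngram configuration (a sketch; the input string is illustrative):

```sql
-- Emit 2- and 3-character grams over the input
SELECT TOKENIZE('error404', 'ngram',
                '{"tokenizer":{"type":"ngram","min_gram":2,"max_gram":3}}');
```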

pinyin

Parameter | Description | Values
keep_first_letter | Whether to keep the initial letter of each Chinese character as a combined token (e.g., "Lǐ Míng" becomes "lm"). | true (default), false
keep_separate_first_letter | Whether to keep each initial as a separate token (e.g., "Lǐ Míng" becomes "l", "m"). | true, false (default)
limit_first_letter_length | Maximum length of the combined initials token. | Integer. Default: 16.
keep_full_pinyin | Whether to keep full pinyin for each character (e.g., "Lǐ Míng" becomes "li", "ming"). | true (default), false
keep_joined_full_pinyin | Whether to join full pinyin syllables (e.g., "李明" becomes "liming"). | true, false (default)
keep_none_chinese | Whether to keep non-Chinese letters or numbers. | true (default), false
keep_none_chinese_together | Whether to keep consecutive non-Chinese characters as one token (e.g., "DJ李明" becomes "DJ", "li", "ming"). Requires keep_none_chinese=true. | true (default), false
keep_none_chinese_in_first_letter | Whether to include non-Chinese characters in the initials token (e.g., "李明AT2025" becomes "lmat2025"). | true (default), false
keep_none_chinese_in_joined_full_pinyin | Whether to include non-Chinese characters in the joined pinyin token (e.g., "李明AT2025" becomes "limingat2025"). | true, false (default)
none_chinese_pinyin_tokenize | Whether to split valid pinyin syllables in non-Chinese sequences (e.g., "limingalibaba2025" becomes "li", "ming", "a", "li", "ba", "ba", "2025"). Requires keep_none_chinese=true and keep_none_chinese_together=true. | true (default), false
keep_original | Whether to keep the original input as a token. | true, false (default)
lowercase | Whether to convert non-Chinese letters to lowercase. | true (default), false
trim_whitespace | Whether to trim whitespace characters. | true (default), false
remove_duplicated_term | Whether to remove duplicate tokens (e.g., "de的" becomes "de"). May affect phrase query results. | true, false (default)
keep_separate_chinese | Whether to keep individual Chinese characters as separate tokens (e.g., "李明" becomes "李", "明"). | true, false (default)
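
The pinyin parameters above can likewise be previewed with TOKENIZE before creating an index (a sketch that enables the joined full-pinyin form on top of the defaults):

```sql
-- Default pinyin output plus the joined form ('liming')
SELECT TOKENIZE('李明', 'pinyin',
                '{"tokenizer":{"type":"pinyin","keep_joined_full_pinyin":true}}');
```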

Default analyzer_params

Tokenizer | Default analyzer_params
jieba (default) | {"tokenizer":{"type":"jieba","mode":"search","hmm":true},"filter":["removepunct","lowercase",{"type":"stop","stop_words":["_english_"]},{"type":"stemmer","language":"english"}]}
whitespace | {"tokenizer":{"type":"whitespace"}}
keyword | {"tokenizer":{"type":"keyword"}}
simple | {"tokenizer":{"type":"simple"},"filter":["lowercase"]}
standard | {"tokenizer":{"type":"standard","max_token_length":255},"filter":["lowercase"]}
icu | {"tokenizer":{"type":"icu"},"filter":["removepunct","lowercase"]}
ik | {"tokenizer":{"type":"ik","mode":"ik_max_word","enable_lowercase":true},"filter":[{"type":"stop","stop_words":["_english_"]},{"type":"stemmer","language":"english"}]}
ngram | {"tokenizer":{"type":"ngram","min_gram":1,"max_gram":2,"prefix_only":false}}
pinyin | {"tokenizer":{"type":"pinyin","keep_first_letter":true,"keep_separate_first_letter":false,"keep_full_pinyin":true,"keep_joined_full_pinyin":false,"keep_none_chinese":true,"keep_none_chinese_together":true,"none_chinese_pinyin_tokenize":true,"keep_original":false,"limit_first_letter_length":16,"lowercase":true,"trim_whitespace":true,"keep_none_chinese_in_first_letter":true,"keep_none_chinese_in_joined_full_pinyin":false,"remove_duplicated_term":false,"ignore_pinyin_offset":true,"fixed_pinyin_offset":false,"keep_separate_chinese":false}}

Filters

Filters apply to each token in the order listed. Supported filters:

Filter | Description | Format | Example
lowercase | Converts tokens to lowercase. | "lowercase" | ["Hello", "WORLD"] → ["hello", "world"]
stop | Removes stop-word tokens. Supports custom stop words and built-in language dictionaries. | {"type":"stop","stop_words":["_english_","cat"]} | Built-in dictionaries: _english_, _danish_, _dutch_, _finnish_, _french_, _german_, _hungarian_, _italian_, _norwegian_, _portuguese_, _russian_, _spanish_, _swedish_
stemmer | Reduces tokens to their root form. | {"type":"stemmer","language":"english"} | ["machine","learning"] → ["machin","learn"]. Supported languages: arabic, danish, dutch, english, finnish, french, german, greek, hungarian, italian, norwegian, portuguese, romanian, russian, spanish, swedish, tamil, turkish.
length | Removes tokens longer than max. | {"type":"length","max":10} | ["AI","for","Artificial","Intelligence"] → ["AI","for","Artificial"]
removepunct | Removes tokens consisting entirely of punctuation (if_all mode, the default) or containing any punctuation (if_any). Available in version 4.0.8 and later. | "removepunct", {"type":"removepunct","mode":"if_all"}, or {"type":"removepunct","mode":"if_any"} | if_all (default): removes only all-punctuation tokens. if_any: removes tokens with any punctuation.
pinyin | Pinyin token filter. Uses the same parameters as the pinyin tokenizer. | JSON object with pinyin parameters |
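
Putting the filter table to work, an index that chains several filters after the simple tokenizer might be declared as follows (index, table, and column names are placeholders; the filter objects follow the formats above, and this custom analyzer_params replaces simple's default filter list):

```sql
-- Lowercase, drop English stop words, stem, and cap token length at 20
CREATE INDEX idx_content ON articles
       USING FULLTEXT (content)
       WITH (tokenizer = 'simple',
             analyzer_params = '{"tokenizer":{"type":"simple"},"filter":["lowercase",{"type":"stop","stop_words":["_english_"]},{"type":"stemmer","language":"english"},{"type":"length","max":20}]}');
```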