When your workload requires keyword search, phrase matching, or relevance-ranked retrieval over text columns, a full-text inverted index provides fast, BM25-scored search that scales far beyond brute-force scans. Hologres supports full-text inverted indexes starting from version 4.0, built on the Tantivy search engine.
How it works
When you write text data to Hologres, the system builds a full-text inverted index for each data file based on your index configuration. A tokenizer splits the text into tokens. The index then records the mapping between each token and its source document, along with position and term frequency information.
At query time, Hologres tokenizes the query string, then uses the BM25 algorithm to compute a relevance score for each document against the query tokens, so matching results are returned quickly and in relevance order.
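As a preview, a relevance-ranked query looks like the sketch below. The TEXT_SEARCH function and a full sample schema are covered later in this topic; the table and column names here are illustrative:
-- Illustrative sketch: rank rows of a hypothetical articles(content) table
-- by BM25 relevance for the query 'spring festival'
SELECT id,
       content,
       TEXT_SEARCH(content, 'spring festival') AS score
FROM articles
WHERE TEXT_SEARCH(content, 'spring festival') > 0
ORDER BY score DESC
LIMIT 10;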
Limits
- Full-text inverted indexes are available only on column-oriented tables and hybrid row-column tables in Hologres version 4.0 and later. Row-oriented tables are not supported.
- Indexes can be created only on columns of type TEXT, CHAR, or VARCHAR.
- Each index covers one column, and each column supports one full-text inverted index. To index multiple columns, create a separate index for each.
- Full-text search runs only on indexed columns. Querying unindexed columns is not supported.
- BM25 scores are calculated per file. For small data volumes, trigger compaction manually to merge files and improve scoring accuracy.
- After you create a full-text inverted index, Hologres builds the index files asynchronously during compaction. This applies to both existing data and new batch-inserted data. Until compaction completes, BM25 relevance scores for that data are zero.
- For real-time writes after an index exists: before version 4.0.8, indexes were built synchronously as data arrived. Starting in version 4.0.8, Hologres refreshes the in-memory index every second, and newly written data becomes searchable only after the next refresh.
- Use Serverless Computing resources for batch data imports. Serverless resources perform compaction and build full-text indexes synchronously during import. See Run Read/Write Tasks with Serverless Computing and Run Compaction Tasks with Serverless Computing.
- Full-text search queries can run using Serverless Computing resources.
Choose a tokenizer
Select a tokenizer based on your text type and search requirements:
| Use case | Tokenizer | Notes |
|---|---|---|
| Keyword extraction from long articles | jieba | Supports new-word discovery and multiple tokenization modes (search and exact). |
| Chinese descriptive text search | ik | Accurately identifies Chinese terms. Available in version 4.0.9 and later. |
| English title or plain text search | simple, whitespace, standard | Simple and efficient. Choose based on your text format. |
| Fuzzy log text search | ngram | No dictionary required. Supports fuzzy matching. Available in version 4.0.9 and later. |
| Pinyin search for Chinese names or products | pinyin | Supports full pinyin, initials, and polyphonic character inference. Available in version 4.0.9 and later. |
For tokenizer parameter details, see Customize tokenizer configuration.
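If you are unsure which tokenizer fits, you can preview each candidate's output with the TOKENIZE function (documented later in this topic) before creating any index. A quick sketch with an illustrative input string (ngram requires version 4.0.9 or later):
-- Compare how two tokenizers split the same text
SELECT TOKENIZE('Hologres full-text search', 'standard');
SELECT TOKENIZE('Hologres full-text search', 'ngram');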
Manage indexes
Create an index
Syntax
CREATE INDEX [ IF NOT EXISTS ] idx_name ON table_name
USING FULLTEXT (column_name [ , ... ])
[ WITH ( storage_parameter [ = value ] [ , ... ] ) ];
Parameters
| Parameter | Description |
|---|---|
| `idx_name` | The index name. |
| `table_name` | The target table. |
| `column_name` | The column to index. Must be of type TEXT, CHAR, or VARCHAR. |
| `storage_parameter` | Index configuration. Supported parameters: `tokenizer`, `analyzer_params`, and `index_options`. See the following sections. |
`tokenizer`
The tokenizer to use. Defaults to jieba. Supported values:
| Tokenizer | Description | Available from |
|---|---|---|
| `jieba` (default) | Chinese tokenizer combining rule-based matching and statistical models. | 4.0 |
| `whitespace` | Splits text on spaces. | 4.0 |
| `standard` | Tokenizes using Unicode Standard Annex #29. | 4.0 |
| `simple` | Splits text on spaces and punctuation. | 4.0 |
| `keyword` | Returns the input unchanged without any processing. | 4.0 |
| `icu` | Multilingual tokenizer. | 4.0 |
| `ik` | Chinese tokenizer based on IK Analyzer. Automatically detects English words, email addresses, URLs (without ://), and IP addresses. | 4.0.9 |
| `ngram` | Character-level sliding window tokenizer. Splits text into continuous n-character sequences to improve recall and fuzzy matching. Ideal for accelerating LIKE and ILIKE queries. | 4.0.9 |
| `pinyin` | Generates pinyin for Chinese characters and words, and infers pinyin for non-Chinese strings. | 4.0.9 |
Each index supports only one tokenizer and one analyzer_params setting.
`analyzer_params`
Tokenizer configuration as a JSON string. Each tokenizer has defaults that work for most use cases—specify only tokenizer and omit analyzer_params unless you need custom behavior. For details, see Customize tokenizer configuration.
`index_options`
Controls how much information the index stores and which query features are available. Available in version 4.1.9 and later.
index_options supports three levels. Higher levels include all information from lower levels: freqs includes docs, and positions includes freqs and docs.
| Value | Index content | Supported features | Typical use case |
|---|---|---|---|
| `positions` (default) | Document ID + term frequency + term position | Full feature support: phrase queries and standard relevance scoring | General full-text search |
| `freqs` | Document ID + term frequency | No phrase queries (position data missing) | Scoring by term frequency without exact phrase matching |
| `docs` | Document ID only | No phrase queries; all matching documents get the same TF score | Existence checks only; storage-sensitive scenarios |
For indexes using the `keyword` tokenizer, the record level is fixed at `docs` regardless of `index_options`.
Examples
Create an index using the default tokenizer (jieba) and default configuration:
CREATE INDEX idx1 ON tbl
USING FULLTEXT (col1);
Use the ik tokenizer with its default configuration:
CREATE INDEX idx1 ON tbl
USING FULLTEXT (col1)
WITH (tokenizer = 'ik');
Use a custom jieba configuration—exact mode with lowercase filter only:
CREATE INDEX idx1 ON tbl
USING FULLTEXT (col1)
WITH (tokenizer = 'jieba',
analyzer_params = '{"tokenizer":{"type":"jieba","mode":"exact"}, "filter":["lowercase"]}');
Set index_options to freqs to save storage at the cost of disabling phrase queries (version 4.1.9 and later):
CREATE INDEX idx1 ON tbl
USING FULLTEXT (col1)
WITH (index_options = 'freqs');
Build indexes after data import
Index files are built during compaction, not immediately after the CREATE INDEX statement. To build indexes right away, use one of the following approaches:
- With Serverless Computing (recommended): Serverless resources perform compaction and build full-text indexes synchronously during import, as shown in the sketch after this list. See Run Read/Write Tasks with Serverless Computing and Run Compaction Tasks with Serverless Computing.
- Without Serverless Computing: Trigger compaction manually after creating the index or finishing a batch import. See Compaction (Beta) for details:
  VACUUM <schema_name>.<table_name>;
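A minimal sketch of the Serverless Computing path, assuming an illustrative table tbl with staged data in staging_tbl; hg_computing_resource is the session-level GUC also used in the best-practice example later in this topic:
-- Create the index first, then import with Serverless Computing so
-- compaction and index builds happen synchronously during the write
CREATE INDEX IF NOT EXISTS idx1 ON tbl USING FULLTEXT (col1);
SET hg_computing_resource = 'serverless';
INSERT INTO tbl SELECT * FROM staging_tbl;
RESET hg_computing_resource;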
Modify an index
Syntax
-- Modify index configuration
ALTER INDEX [ IF EXISTS ] <idx_name> SET ( <storage_parameter> = '<storage_value>' [ , ... ] );
-- Restore default configuration
ALTER INDEX [ IF EXISTS ] <idx_name> RESET ( <storage_parameter> [ , ... ] );
After modifying a full-text inverted index, index files are rebuilt during the next compaction. Run VACUUM <schema_name>.<table_name>; to trigger compaction immediately. See Compaction.
Examples
Change the tokenizer to standard:
ALTER INDEX idx1 SET (tokenizer = 'standard');
Change the tokenizer to ik in ik_max_word mode with lowercase conversion disabled:
ALTER INDEX idx1 SET (
tokenizer = 'ik',
analyzer_params = '{"tokenizer":{"type":"ik","mode":"ik_max_word","enable_lowercase": false}}'
);
Restore the default jieba tokenizer and its default analyzer_params:
ALTER INDEX idx1 RESET (tokenizer, analyzer_params);
Restore index_options to the default (positions):
ALTER INDEX idx1 RESET (index_options);
Change index_options to docs:
ALTER INDEX idx1 SET (index_options = 'docs');
Delete an index
Syntax
DROP INDEX [ IF EXISTS ] <idx_name> [ RESTRICT ];
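For example, to remove the index created in the earlier examples without raising an error if it no longer exists:
DROP INDEX IF EXISTS idx1;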
View indexes
Use the system table hologres.hg_index_properties to list all full-text inverted indexes:
SELECT * FROM hologres.hg_index_properties;
To find the table and column associated with a specific index:
SELECT
t.relname AS table_name,
a.attname AS column_name
FROM pg_class t
JOIN pg_index i ON t.oid = i.indrelid
JOIN pg_class idx ON i.indexrelid = idx.oid
JOIN pg_attribute a ON a.attrelid = t.oid AND a.attnum = ANY(i.indkey)
WHERE t.relnamespace = (SELECT oid FROM pg_namespace WHERE nspname = '<namespace>')
AND idx.relname = '<indexname>'
LIMIT 1;
Replace the placeholders:
- `<namespace>`: The `table_namespace` value from `SELECT * FROM hologres.hg_index_properties;`.
- `<indexname>`: The index name you created.
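For instance, to locate the index ft_idx_1 on the wiki_articles table used in the examples later in this topic, assuming it lives in the public schema:
SELECT
    t.relname AS table_name,
    a.attname AS column_name
FROM pg_class t
JOIN pg_index i ON t.oid = i.indrelid
JOIN pg_class idx ON i.indexrelid = idx.oid
JOIN pg_attribute a ON a.attrelid = t.oid AND a.attnum = ANY(i.indkey)
WHERE t.relnamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'public')
  AND idx.relname = 'ft_idx_1'
LIMIT 1;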
Search with full-text inverted indexes
Hologres supports four search modes. Use the TEXT_SEARCH function to run any of them.
| Search mode | Description |
|---|---|
| Keyword match | Matches documents containing any (OR) or all (AND) of the query keywords. |
| Phrase search | Matches documents where query tokens appear within a specified distance. |
| Natural language search | Supports complex queries with AND/OR logic, required/excluded terms, and embedded phrases. |
| Term search | Exact match on the query string—no tokenization applied. |
TEXT_SEARCH function
TEXT_SEARCH computes a BM25 relevance score for a query against source text.
Syntax
TEXT_SEARCH (
<search_data> TEXT/VARCHAR/CHAR
,<search_expression> TEXT
[ ,<mode> TEXT DEFAULT 'match'
,<operator> TEXT DEFAULT 'OR'
,<tokenizer> TEXT DEFAULT ''
,<analyzer_params> TEXT DEFAULT ''
,<options> TEXT DEFAULT '']
)
Parameters
| Parameter | Required | Description |
|---|---|---|
| `search_data` | Yes | The source column to search. Must be TEXT, VARCHAR, or CHAR, and must have a full-text index. Queries on unindexed columns fail. |
| `search_expression` | Yes | The query string. Must be a constant. Supports TEXT, VARCHAR, and CHAR. |
| `mode` | No | The search mode. Supported values: match (default), phrase, natural_language, term. See below. |
| `operator` | No | Logical operator between keywords. Applies only when mode is match. Supported values: OR (default) or AND. |
| `tokenizer`, `analyzer_params` | No | Tokenizer and configuration to apply to search_expression. If not specified, uses the same tokenizer as the index on search_data. If search_data is a constant, defaults to jieba. |
| `options` | No | Additional search parameters in the format 'key1=v1;key2=v2;'. Currently supports only slop (for phrase mode). Valid values: 0 (default) or any positive integer. Defines the maximum allowed distance between phrase terms. |
`mode` values
| Value | Behavior |
|---|---|
| `match` (default) | Keyword match. Each token in search_expression is a keyword. Use operator to define AND/OR logic between keywords. |
| `phrase` | Phrase search. Tokens must appear within the distance defined by slop. Default slop is 0 (adjacent tokens only). |
| `natural_language` | Complex queries. Supports AND/OR operators, required (+) and excluded (-) terms, and quoted phrases. See Tantivy QueryParser. |
| `term` | Exact match. search_expression is matched as-is against the index, without tokenization. |
`slop` unit by tokenizer
The slop parameter in phrase mode measures distance differently depending on the tokenizer:
- Characters: `jieba`, `keyword`, `icu`
- Tokens: `standard`, `simple`, `whitespace`
Return value
A non-negative FLOAT representing the BM25 relevance score. Higher values indicate greater relevance. A score of 0 means no match.
Examples
Keyword match with AND operator:
-- Recommended: use named parameters
SELECT TEXT_SEARCH(content, 'machine learning', operator => 'AND') FROM tbl;
Phrase search with slop=2:
SELECT TEXT_SEARCH(content, 'machine learning', 'phrase', options => 'slop=2;') FROM tbl;
Natural language search with AND/OR logic:
SELECT TEXT_SEARCH(content, 'machine AND (system OR recognition)', 'natural_language') FROM tbl;
Natural language search with required and excluded terms:
-- Must contain 'learning', must not contain 'machine'
SELECT TEXT_SEARCH(content, '+learning -machine system', 'natural_language') FROM tbl;
Term search (exact match):
SELECT TEXT_SEARCH(content, 'machine learning', 'term') FROM tbl;
TOKENIZE function
TOKENIZE returns the tokens produced by a given tokenizer. Use it to test and debug tokenization behavior before building indexes.
Syntax
TOKENIZE (
<search_data> TEXT
[ ,<tokenizer> TEXT DEFAULT ''
,<analyzer_params> TEXT DEFAULT '']
)
Parameters
- `search_data`: Required. The text to tokenize. Accepts constants.
- `tokenizer`, `analyzer_params`: Optional. Defaults to `jieba` if not specified.
Return value
A TEXT array of tokens.
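For example, with the default jieba configuration (whose filter chain lowercases tokens, removes stop words, and stems English words), the output is roughly:
SELECT TOKENIZE('Machine Learning Systems');
-- e.g., {machin,learn,system} under the default filter chain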
Verify index usage
Run EXPLAIN ANALYZE to confirm whether a query uses the full-text inverted index. If the output contains Fulltext Filter, the index is active.
EXPLAIN ANALYZE SELECT * FROM wiki_articles WHERE text_search(content, 'Yangtze River') > 0;
Example output:
QUERY PLAN
Gather (cost=0.00..1.00 rows=1 width=12)
-> Local Gather (cost=0.00..1.00 rows=1 width=12)
-> Index Scan using Clustering_index on wiki_articles (cost=0.00..1.00 rows=1 width=12)
Fulltext Filter: (text_search(content, search_expression => 'Yangtze River'::text, mode => match, operator => OR, tokenizer => jieba, analyzer_params => {"filter":["removepunct","lowercase",{"stop_words":["_english_"],"type":"stop"},{"language":"english","type":"stemmer"}],"tokenizer":{"hmm":true,"mode":"search","type":"jieba"}}, options => ) > '0'::double precision)
Query Queue: init_warehouse.default_queue
Optimizer: HQO version 4.0.0
The Fulltext Filter line confirms the index is used. It also shows the resolved tokenizer and analyzer_params—useful for verifying that the index configuration matches what you expect.
For more on reading execution plans, see EXPLAIN and EXPLAIN ANALYZE.
Examples
Data preparation
Create a test table, add a full-text index, and insert sample data:
-- Create table
CREATE TABLE wiki_articles (id int, content text);
-- Create index
CREATE INDEX ft_idx_1 ON wiki_articles
USING FULLTEXT (content)
WITH (tokenizer = 'jieba');
-- Insert data
INSERT INTO wiki_articles VALUES
(1, 'The Yangtze River is China''s longest river and the world''s third-longest river, about 6,300 km long.'),
(2, 'Li was born in 1962 in Wendeng County, Shandong.'),
(3, 'He graduated from the department of physics at Shandong University.'),
(4, 'The Spring Festival, also known as the Lunar New Year, is China''s most important traditional festival.'),
(5, 'The Spring Festival usually falls between late January and mid-February on the Gregorian calendar. Major customs include pasting spring couplets, setting off firecrackers, eating reunion dinner, and giving New Year greetings.'),
(6, 'In 2006, the Spring Festival was approved by the State Council as part of China''s first batch of national intangible cultural heritage.'),
(7, 'Shandong has dozens of universities.'),
(8, 'ShanDa is a famous university in Shandong.');
-- Trigger compaction to build index files
VACUUM wiki_articles;
Keyword match
-- OR operator (default): matches documents containing 'shandong' or 'university'
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university') > 0;
-- Result
id | content
----+---------------------------------------------------------------------
2 | Li was born in 1962 in Wendeng County, Shandong.
3 | He graduated from the department of physics at Shandong University.
7 | Shandong has dozens of universities.
8 | ShanDa is a famous university in Shandong.
-- AND operator: matches documents containing both 'shandong' and 'university'
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university', operator => 'AND') > 0;
-- Result
id | content
----+---------------------------------------------------------------------
3 | He graduated from the department of physics at Shandong University.
7 | Shandong has dozens of universities.
8 | ShanDa is a famous university in Shandong.
Phrase search
-- Default slop=0: 'shandong' must be immediately followed by 'university'
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university', mode => 'phrase') > 0;
-- Result
id | content
----+---------------------------------------------------------------------
3 | He graduated from the department of physics at Shandong University.
(1 row)
-- slop=14: 'shandong' and 'university' can be up to 14 characters apart
SELECT * FROM wiki_articles
WHERE TEXT_SEARCH(content, 'shandong university', mode => 'phrase', options => 'slop=14;') > 0;
-- Result
id | content
----+---------------------------------------------------------------------
3 | He graduated from the department of physics at Shandong University.
7 | Shandong has dozens of universities.
(2 rows)
-- slop=23: also matches out-of-order phrases ('university of Shandong' pattern)
SELECT * FROM wiki_articles
WHERE TEXT_SEARCH(content, 'shandong university', mode => 'phrase', options => 'slop=23;') > 0;
-- Result
id | content
----+---------------------------------------------------------------------
3 | He graduated from the department of physics at Shandong University.
7 | Shandong has dozens of universities.
8 | ShanDa is a famous university in Shandong.
(3 rows)
Punctuation is ignored in phrase searches. Even if the source text uses a comma between words and the query uses a period, it still matches.
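For instance, with the sample data above, the comma between 'County' and 'Shandong' in document 2 should not block a phrase match; a sketch of the documented behavior (the default jieba filter chain drops punctuation tokens):
SELECT id, content FROM wiki_articles
WHERE TEXT_SEARCH(content, 'county shandong', mode => 'phrase') > 0;
-- Expected to match document 2: '...Wendeng County, Shandong.'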
Natural language search
-- Keyword match with AND/OR: requires 'shandong' AND 'university', OR contains 'culture'
SELECT * FROM wiki_articles
WHERE TEXT_SEARCH(content, '(shandong AND university) OR culture', 'natural_language') > 0;
-- Result
id | content
----+---------------------------------------------------------------------
8 | ShanDa is a famous university in Shandong.
7 | Shandong has dozens of universities.
3 | He graduated from the department of physics at Shandong University.
6 | In 2006, the Spring Festival was approved by the State Council as part of China's first batch of national intangible cultural heritage.
-- Required (+) and excluded (-) terms, with optional boosting term
-- Must contain 'shandong', must not contain 'physics', and may contain 'famous' (boosts score)
SELECT id,
content,
TEXT_SEARCH(content, '+shandong -physics famous', 'natural_language') as score
FROM wiki_articles
WHERE TEXT_SEARCH(content, '+shandong -physics famous', 'natural_language') > 0
ORDER BY score DESC;
-- Result
id | content | score
----+--------------------------------------------------+----------
8 | ShanDa is a famous university in Shandong. | 2.92376
7 | Shandong has dozens of universities. | 0.863399
2 | Li was born in 1962 in Wendeng County, Shandong. | 0.716338
-- Phrase search using double quotes
SELECT * FROM wiki_articles
WHERE TEXT_SEARCH(content, '"shandong university"', 'natural_language') > 0;
-- Match all documents
SELECT * FROM wiki_articles
WHERE TEXT_SEARCH(content, '*', 'natural_language') > 0;
Term search
-- Exact match: 'Spring Festival' exists verbatim in the index
SELECT * FROM wiki_articles
WHERE TEXT_SEARCH(content, 'Spring Festival', 'term') > 0;
-- Result
id | content
----+----------------------------------------------------------------------------------------------
4 | The Spring Festival, also known as the Lunar New Year, is China's most important traditional festival.
  5 | The Spring Festival usually falls between late January and mid-February on the Gregorian calendar. ...
  6 | In 2006, the Spring Festival was approved by the State Council as part of China's first batch of national intangible cultural heritage.
-- No match: 'shandong university' is split by jieba, so this exact string doesn't exist in the index
-- Pair term search with the keyword tokenizer for exact multi-word matches
SELECT * FROM wiki_articles
WHERE TEXT_SEARCH(content, 'shandong university', 'term') > 0;
-- Result
id | content
----+---------
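A sketch of that pairing with the keyword tokenizer, which indexes each value as a single unmodified token (table and index names are illustrative):
CREATE TABLE tags (id int, tag text);
CREATE INDEX tag_idx ON tags
USING FULLTEXT (tag)
WITH (tokenizer = 'keyword');
INSERT INTO tags VALUES (1, 'shandong university'), (2, 'shandong');
VACUUM tags;  -- build index files
-- Term search now matches only the row whose entire value is 'shandong university'
SELECT * FROM tags WHERE TEXT_SEARCH(tag, 'shandong university', 'term') > 0;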
Complex queries
Combine TEXT_SEARCH with other predicates:
-- Filter by both full-text match and primary key
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university') > 0 AND id = 3;
-- Result
id | content
----+---------------------------------------------------------------------
3 | He graduated from the department of physics at Shandong University.
Return BM25 scores and get the top 3 results:
SELECT id,
content,
TEXT_SEARCH(content, 'shandong university') AS score,
TOKENIZE(content, 'jieba')
FROM wiki_articles
ORDER BY score DESC
LIMIT 3;
-- Result
id | content | score | tokenize
----+---------------------------------------------------------------------+---------+--------------------------------------------------
8 | ShanDa is a famous university in Shandong. | 2.74634 | {shanda,famous,univers,shandong}
7 | Shandong has dozens of universities. | 2.74634 | {shandong,has,dozen,univers}
3 | He graduated from the department of physics at Shandong University. | 2.38178 | {he,graduat,from,depart,physic,shandong,univers}
Use TEXT_SEARCH in both SELECT and WHERE to filter and rank in one pass:
SELECT id,
content,
TEXT_SEARCH(content, 'shandong university') AS score,
TOKENIZE(content, 'jieba')
FROM wiki_articles
WHERE TEXT_SEARCH(content, 'shandong university') > 0
ORDER BY score DESC;
Combine with a JOIN to search within a filtered subset:
-- Find the most relevant documents about 'shandong university' from wiki sources only
CREATE TABLE article_source (id int primary key, source text);
INSERT INTO article_source VALUES (1, 'baike'), (2, 'wiki'), (3, 'wiki'), (4, 'baike'),
(5, 'baike'), (6, 'baike'), (7, 'wiki'), (8, 'paper');
SELECT a.id,
source, content,
TEXT_SEARCH(content, 'shandong university') AS score,
TOKENIZE(a.content, 'jieba')
FROM wiki_articles a
JOIN article_source b ON (a.id = b.id)
WHERE TEXT_SEARCH(a.content, 'shandong university') > 0
AND b.source = 'wiki'
ORDER BY score DESC;
-- Result
id | source | content | score | tokenize
----+--------+---------------------------------------------------------------------+---------+--------------------------------------------------
7 | wiki | Shandong has dozens of universities. | 2.74634 | {shandong,has,dozen,univers}
3 | wiki | He graduated from the department of physics at Shandong University. | 2.38178 | {he,graduat,from,depart,physic,shandong,univers}
2 | wiki | Li was born in 1962 in Wendeng County, Shandong. | 1.09244 | {li,born,1962,wendeng,counti,shandong}
Best practices
Rebuild indexes using Serverless Computing
Modifying certain table properties triggers compaction and index rebuilding, which consumes significant CPU resources. The approach depends on what you're modifying.
For `bitmap_columns`, `dictionary_encoding_columns`, or vector indexes:
Use the REBUILD syntax with Serverless Computing resources. Avoid ALTER TABLE ... SET for these properties. For details, see REBUILD.
ASYNC REBUILD TABLE <table_name>
WITH (
rebuild_guc_hg_computing_resource = 'serverless'
)
SET (
bitmap_columns = '<col1>,<col2>',
dictionary_encoding_columns = '<col1>:on,<col2>:off',
vectors = '{
"<col_vector>": {
"algorithm": "HGraph",
"distance_method": "Cosine",
"builder_params": {
"base_quantization_type": "rabitq",
"graph_storage_type": "compressed",
"max_degree": 64,
"ef_construction": 400,
"precise_quantization_type": "fp32",
"use_reorder": true,
"max_total_size_to_merge_mb" : 4096
}
}
}'
);
For full-text index columns or column-oriented JSONB columns:
The REBUILD syntax is not supported for these types. Instead, create a new table, copy data using Serverless Computing, and swap the tables:
BEGIN;
-- Clean up any potential temporary tables
DROP TABLE IF EXISTS <table_new>;
-- Create a temporary table with the same structure
SET hg_experimental_enable_create_table_like_properties=on;
CALL HG_CREATE_TABLE_LIKE ('<table_new>', 'select * from <table>');
COMMIT;
-- Enable column-oriented storage for the target column (if needed)
ALTER TABLE <table_new> ALTER COLUMN <column_name> SET (enable_columnar_type = ON);
-- Create the full-text index on the new table
CREATE INDEX <idx_name> ON <table_new> USING FULLTEXT (column_name);
-- Copy data using Serverless Computing to build indexes synchronously
SET hg_computing_resource = 'serverless';
INSERT INTO <table_new> SELECT * FROM <table>;
ANALYZE <table_new>;
BEGIN;
-- Swap old and new tables
DROP TABLE IF EXISTS <table>;
ALTER TABLE <table_new> RENAME TO <table>;
COMMIT;
For other properties (distribution_key, clustering_key, segment_key, storage format):
Use the REBUILD syntax with Serverless Computing resources.
Advanced: Customize tokenizer configuration
The default tokenizer configuration works for most use cases. Customize analyzer_params only when your workload requires specific tokenization behavior.
`analyzer_params` requirements
- Must be a valid JSON string.
- Top-level keys are `tokenizer` and `filter`.
`tokenizer` object
Required. A JSON object with the following structure:
- `type` (required): The tokenizer name.
- Additional tokenizer-specific parameters. See the tables below.
`filter` array
Optional. An array of filter configurations applied to each token in order.
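Putting the two keys together, a sketch of a custom configuration that pairs the standard tokenizer with a lowercase filter and an extended stop-word list (table, column, index name, and the extra stop word are illustrative):
CREATE INDEX idx_custom ON tbl
USING FULLTEXT (col1)
WITH (tokenizer = 'standard',
      analyzer_params = '{"tokenizer":{"type":"standard","max_token_length":255},"filter":["lowercase",{"type":"stop","stop_words":["_english_","lorem"]}]}');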
Tokenizer-specific parameters
jieba
| Parameter | Description | Values |
|---|---|---|
| `mode` | Tokenization mode. | `search` (default): lists all possible combinations, including overlapping tokens (e.g., "traditional festival" becomes "tradition", "festival", and "traditional festival"). `exact`: non-redundant tokenization (e.g., "traditional festival" becomes "traditional festival" only). |
| `hmm` | Whether to use a Hidden Markov Model (HMM) to detect out-of-vocabulary (OOV) words. | `true` (default), `false` |
standard
| Parameter | Description | Values |
|---|---|---|
| `max_token_length` | Maximum token length. Tokens exceeding this length are split at the boundary. | Positive integer. Default: 255. |
ik
| Parameter | Description | Values |
|---|---|---|
| `mode` | Tokenization mode. | `ik_max_word` (default): fine-grained; outputs all possible short words. `ik_smart`: coarse-grained; prioritizes longer words to reduce splits. |
| `enable_lowercase` | Whether to convert tokens to lowercase. | `true` (default), `false` |
ngram
| Parameter | Description | Values |
|---|---|---|
| `min_gram` | Minimum token length. | Positive integer. Default: 1. Maximum difference from max_gram is 3. |
| `max_gram` | Maximum token length. | Integer in [1, 255]. Default: 2. Maximum difference from min_gram is 3. |
| `prefix_only` | Whether to generate only prefix n-grams. | `true`, `false` (default) |
If the difference between `max_gram` and `min_gram` exceeds 3, the tokenizer generates too many tokens, increasing resource consumption and index build time. Override the limit with `SET hg_fulltext_index_max_ngram_diff = <value>;`.
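For example, a sketch that widens the gram range beyond the default limit (table and column names are illustrative):
-- Allow a max_gram/min_gram difference of up to 5 for this session
SET hg_fulltext_index_max_ngram_diff = 5;
CREATE INDEX idx_ngram ON logs
USING FULLTEXT (message)
WITH (tokenizer = 'ngram',
      analyzer_params = '{"tokenizer":{"type":"ngram","min_gram":2,"max_gram":7,"prefix_only":false}}');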
pinyin
| Parameter | Description | Values |
|---|---|---|
| `keep_first_letter` | Whether to keep the initial letter of each Chinese character as a combined token (e.g., "Lǐ Míng" becomes "lm"). | `true` (default), `false` |
| `keep_separate_first_letter` | Whether to keep each initial as a separate token (e.g., "Lǐ Míng" becomes "l", "m"). | `true`, `false` (default) |
| `limit_first_letter_length` | Maximum length of the combined initials token. | Integer. Default: 16. |
| `keep_full_pinyin` | Whether to keep full pinyin for each character (e.g., "Lǐ Míng" becomes "li", "ming"). | `true` (default), `false` |
| `keep_joined_full_pinyin` | Whether to join full pinyin syllables (e.g., "李明" becomes "liming"). | `true`, `false` (default) |
| `keep_none_chinese` | Whether to keep non-Chinese letters or numbers. | `true` (default), `false` |
| `keep_none_chinese_together` | Whether to keep consecutive non-Chinese characters as one token (e.g., "DJ李明" becomes "DJ", "li", "ming"). Requires `keep_none_chinese=true`. | `true` (default), `false` |
| `keep_none_chinese_in_first_letter` | Whether to include non-Chinese characters in the initials token (e.g., "李明AT2025" becomes "lmat2025"). | `true` (default), `false` |
| `keep_none_chinese_in_joined_full_pinyin` | Whether to include non-Chinese characters in the joined pinyin token (e.g., "李明AT2025" becomes "limingat2025"). | `true`, `false` (default) |
| `none_chinese_pinyin_tokenize` | Whether to split valid pinyin syllables in non-Chinese sequences (e.g., "limingalibaba2025" becomes "li", "ming", "a", "li", "ba", "ba", "2025"). Requires `keep_none_chinese=true` and `keep_none_chinese_together=true`. | `true` (default), `false` |
| `keep_original` | Whether to keep the original input as a token. | `true`, `false` (default) |
| `lowercase` | Whether to convert non-Chinese letters to lowercase. | `true` (default), `false` |
| `trim_whitespace` | Whether to trim whitespace characters. | `true` (default), `false` |
| `remove_duplicated_term` | Whether to remove duplicate tokens (e.g., "de的" becomes "de"). May affect phrase query results. | `true`, `false` (default) |
| `keep_separate_chinese` | Whether to keep individual Chinese characters as separate tokens (e.g., "李明" becomes "李", "明"). | `true`, `false` (default) |
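With the defaults (keep_full_pinyin and keep_first_letter enabled), tokenizing a two-character Chinese name yields full-pinyin tokens plus a combined-initials token, roughly:
SELECT TOKENIZE('李明', 'pinyin');
-- e.g., {li,ming,lm} under the default pinyin configuration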
Default analyzer_params
| Tokenizer | Default analyzer_params |
|---|---|
| `jieba` (default) | `{"tokenizer":{"type":"jieba","mode":"search","hmm":true},"filter":["removepunct","lowercase",{"type":"stop","stop_words":["_english_"]},{"type":"stemmer","language":"english"}]}` |
| `whitespace` | `{"tokenizer":{"type":"whitespace"}}` |
| `keyword` | `{"tokenizer":{"type":"keyword"}}` |
| `simple` | `{"tokenizer":{"type":"simple"},"filter":["lowercase"]}` |
| `standard` | `{"tokenizer":{"type":"standard","max_token_length":255},"filter":["lowercase"]}` |
| `icu` | `{"tokenizer":{"type":"icu"},"filter":["removepunct","lowercase"]}` |
| `ik` | `{"tokenizer":{"type":"ik","mode":"ik_max_word","enable_lowercase":true},"filter":[{"type":"stop","stop_words":["_english_"]},{"type":"stemmer","language":"english"}]}` |
| `ngram` | `{"tokenizer":{"type":"ngram","min_gram":1,"max_gram":2,"prefix_only":false}}` |
| `pinyin` | `{"tokenizer":{"type":"pinyin","keep_first_letter":true,"keep_separate_first_letter":false,"keep_full_pinyin":true,"keep_joined_full_pinyin":false,"keep_none_chinese":true,"keep_none_chinese_together":true,"none_chinese_pinyin_tokenize":true,"keep_original":false,"limit_first_letter_length":16,"lowercase":true,"trim_whitespace":true,"keep_none_chinese_in_first_letter":true,"keep_none_chinese_in_joined_full_pinyin":false,"remove_duplicated_term":false,"ignore_pinyin_offset":true,"fixed_pinyin_offset":false,"keep_separate_chinese":false}}` |
Filters
Filters apply to each token in the order listed. Supported filters:
| Filter | Description | Format | Example |
|---|---|---|---|
| `lowercase` | Converts tokens to lowercase. | `"lowercase"` | ["Hello", "WORLD"] → ["hello", "world"] |
| `stop` | Removes stop-word tokens. Supports custom stop words and built-in language dictionaries. | `{"type":"stop","stop_words":["_english_","cat"]}` | Built-in dictionaries: _english_, _danish_, _dutch_, _finnish_, _french_, _german_, _hungarian_, _italian_, _norwegian_, _portuguese_, _russian_, _spanish_, _swedish_ |
| `stemmer` | Reduces tokens to their root form. | `{"type":"stemmer","language":"english"}` | ["machine","learning"] → ["machin","learn"]. Supported languages: arabic, danish, dutch, english, finnish, french, german, greek, hungarian, italian, norwegian, portuguese, romanian, russian, spanish, swedish, tamil, turkish. |
| `length` | Removes tokens longer than `max`. | `{"type":"length","max":10}` | ["AI","for","Artificial","Intelligence"] → ["AI","for","Artificial"] |
| `removepunct` | Removes tokens consisting entirely of punctuation (default mode) or containing any punctuation. Available in version 4.0.8 and later. | `"removepunct"` or `{"type":"removepunct","mode":"if_all"}` or `{"type":"removepunct","mode":"if_any"}` | `if_all` (default): removes only all-punctuation tokens. `if_any`: removes tokens with any punctuation. |
| `pinyin` | Pinyin token filter. Uses the same parameters as the `pinyin` tokenizer. | JSON object with pinyin parameters | — |