Hologres V4.0 and later supports full-text inverted indexes. This feature is built on Tantivy, a high-performance full-text search engine, and provides powerful search capabilities. It also supports the BM25 similarity scoring algorithm and offers features such as document sorting, keyword searches, and phrase searches.
How it works
When you write source text to Hologres, a full-text inverted index file is built for each data file based on the index configuration. During this process, a tokenizer segments the text into tokens. The index then records the mapping between each token and the source text, including its position, term frequency, and other related information.
When you search the text, the search query is first tokenized into a set of query tokens. The BM25 algorithm then calculates a relevance score for each source document based on these tokens. This process enables high-performance, high-precision full-text searches.
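The two halves of this process map directly onto two functions described later in this topic: TOKENIZE, which shows how a tokenizer segments text, and TEXT_SEARCH, which computes the BM25 relevance score. The following is a minimal sketch; the table name `docs` and its `content` column are hypothetical placeholders.

```sql
-- Inspect how the default jieba tokenizer segments a query string.
SELECT TOKENIZE('shandong university');

-- Score every document in a hypothetical table against the same query.
-- TEXT_SEARCH returns the BM25 relevance score, so ordering by it
-- returns the most relevant documents first.
SELECT id, content, TEXT_SEARCH(content, 'shandong university') AS score
FROM docs
ORDER BY score DESC
LIMIT 10;
```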
Notes
Full-text inverted indexes are supported only for column-oriented tables and row-column hybrid tables in Hologres V4.0 and later. Row-oriented tables are not supported.
You can create full-text inverted indexes only on columns of the TEXT, CHAR, or VARCHAR type.
You can build a full-text inverted index on only a single column. Each column supports only one full-text inverted index. To build indexes on multiple columns, you must create a separate index for each column.
After you create a full-text inverted index, the index file is built asynchronously during the Compaction process. Before the index file is built, the BM25 relevance score for the data is 0.
Full-text searches can be performed only on columns that have a full-text index. Brute-force calculations on columns without an index are not supported.
You can use Serverless Computing resources for batch data imports. Serverless resources synchronously perform Compaction and full-text index building during the data import. For more information, see Use Serverless Computing to execute read and write tasks and Use Serverless Computing to execute Compaction tasks. If you do not use Serverless resources, you must manually run the following command to trigger Compaction after a batch data import or index modification.
VACUUM <schema_name>.<table_name>;

The BM25 search algorithm calculates relevance scores at the file level. If you import a small amount of data, you can manually trigger Compaction as needed to merge files and improve search accuracy.
You can use Serverless Computing resources to execute full-text search queries.
Manage indexes
Create an index
Syntax
CREATE INDEX [ IF NOT EXISTS ] <idx_name> ON <table_name>
USING FULLTEXT (<column_name>)
[ WITH ( <storage_parameter> = '<storage_value>' [ , ... ] ) ];

Parameters
Parameter | Description |
idx_name | The index name. |
table_name | The name of the table on which the index is created. |
column_name | The column on which the full-text inverted index is built. |
storage_parameter | The parameter to configure for the full-text inverted index. Two parameters are available: tokenizer (the tokenizer name) and analyzer_params (the tokenizer configuration, as a JSON string). Note: You can set only one tokenizer and one analyzer_params within the same index. |
storage_value | The value of the full-text inverted index parameter. |
Usage examples
Build a full-text inverted index with the default jieba tokenizer and its configuration.
CREATE INDEX idx1 ON tbl USING FULLTEXT (col1);

Explicitly specify the standard tokenizer and use its default configuration.
CREATE INDEX idx1 ON tbl USING FULLTEXT (col1) WITH (tokenizer = 'standard');
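You can also pass analyzer_params together with the tokenizer at creation time. The following sketch reuses the JSON format shown in the tokenizer examples later in this topic; the index and table names are placeholders.

```sql
-- Hypothetical example: jieba tokenizer in exact mode, which avoids
-- redundant token splits (see the analyzer_params configuration section).
CREATE INDEX idx2 ON tbl USING FULLTEXT (col1)
WITH (
  tokenizer = 'jieba',
  analyzer_params = '{"tokenizer": {"type": "jieba", "mode": "exact"}}'
);
```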
Modify an index
Syntax
-- Modify the index configuration
ALTER INDEX [ IF EXISTS ] <idx_name> SET ( <storage_parameter> = '<storage_value>' [ , ... ] );
-- Reset to the default configuration
ALTER INDEX [ IF EXISTS ] <idx_name> RESET ( <storage_parameter> [ , ... ] );

Parameters
For more information about the parameters, see Parameters.
Usage examples
A modified full-text inverted index is rebuilt asynchronously during the data Compaction process. To trigger Compaction synchronously after the modification, manually run the VACUUM <schema_name>.<table_name>; command. For more information, see Compaction.
Change the index tokenizer to standard.

ALTER INDEX idx1 SET (tokenizer = 'standard');

Reset to the default jieba tokenizer and its default analyzer_params configuration.

ALTER INDEX idx1 RESET (tokenizer);
ALTER INDEX idx1 RESET (tokenizer, analyzer_params);

Reset to the default analyzer_params configuration for the current tokenizer.

ALTER INDEX idx1 RESET (analyzer_params);
Delete an index
Syntax
DROP INDEX [ IF EXISTS ] <idx_name> [ RESTRICT ];

Parameters
For more information about the parameters, see Parameters.
View indexes
Hologres provides the hologres.hg_index_properties system table, which you can use to view created full-text inverted indexes and their locations.
SELECT * FROM hologres.hg_index_properties;

You can run the following SQL statement to view the table and column that correspond to an index.
SELECT
t.relname AS table_name,
a.attname AS column_name
FROM pg_class t
JOIN pg_index i ON t.oid = i.indrelid
JOIN pg_class idx ON i.indexrelid = idx.oid
JOIN pg_attribute a ON a.attrelid = t.oid AND a.attnum = ANY(i.indkey)
WHERE t.relnamespace = (SELECT oid FROM pg_namespace WHERE nspname = '<namespace>')
AND idx.relname = '<indexname>'
LIMIT 1;

Parameters:

namespace: The value of the table_namespace field in the result of the SELECT * FROM hologres.hg_index_properties; command.
indexname: The name of the index you created.
Use an index for full-text search
Hologres supports a variety of search modes, which allow you to flexibly perform full-text searches based on your business logic.
Search mode | Description |
Keyword match | Searches based on the keywords from the tokenized search object. You can define AND/OR relationships between keywords. |
Phrase search | Searches for a phrase from the search object. A match occurs only if the distance between multiple words meets the requirement. |
Natural language search | Lets you flexibly define complex query conditions to achieve your search goals, such as defining AND/OR relationships, required words, excluded words, and phrases. |
Term search | Performs an exact search for the search object. A match occurs only if the index contains the exact query string. |
TEXT_SEARCH search function
The TEXT_SEARCH function calculates the BM25 relevance score for a search source based on a search object.
Function syntax
TEXT_SEARCH (
<search_data> TEXT/VARCHAR/CHAR
,<search_expression> TEXT
[ ,<mode> TEXT DEFAULT 'match'
,<operator> TEXT DEFAULT 'OR'
,<tokenizer> TEXT DEFAULT ''
,<analyzer_params> TEXT DEFAULT ''
,<options> TEXT DEFAULT '']
)

Parameters
Parameter | Required | Description |
search_data | Yes | The search source. The data type can be TEXT, VARCHAR, or CHAR. Only column input parameters are supported. A full-text index must be created on the column. Otherwise, an error is reported. |
search_expression | Yes | The search object. The data type can be TEXT, VARCHAR, or CHAR. Only constants are supported. |
mode | No | The search mode. Default value: match. Valid values: match (keyword match), phrase (phrase search), natural_language (natural language search), and term (term search). |
operator | No | The logical operator between keywords. This parameter takes effect only when mode is set to match. Default value: OR. Valid values: OR and AND. |
tokenizer, analyzer_params | No | The tokenizer and its configuration used for the search_expression. You generally do not need to configure these parameters. |
options | No | Other parameters for full-text search, in the '<key>=<value>;' format (for example, 'slop=2;'). Currently, only the slop parameter is supported. It takes effect only when mode is set to phrase. slop can be 0 (default) or a positive integer. It defines the tolerable distance between words in a phrase. Note slop represents the maximum allowed interval (or transformation overhead) between the words that form a phrase. For tokenizers such as jieba, keyword, and icu, the unit of the interval is the number of characters, not the number of tokens. For tokenizers such as standard, simple, and whitespace, the unit of the interval is the number of tokens. |
Return value description
This function returns a non-negative FLOAT value, which represents the BM25 relevance score between the search source and the search object. The higher the relevance, the larger the score. The score is 0 when the texts are completely irrelevant.
Examples
Use the keyword match mode and change the operator to AND.
-- Specify the parameter name.
SELECT TEXT_SEARCH (content, 'machine learning', operator => 'AND') FROM tbl;
-- If you do not specify the parameter name, you must specify the parameters in order.
SELECT TEXT_SEARCH (content, 'machine learning', 'match', 'AND') FROM tbl;

Use the phrase search mode and set slop to 2.

SELECT TEXT_SEARCH (content, 'machine learning', 'phrase', options => 'slop=2;') FROM tbl;

Use the natural language search mode.

-- Define the token search logic using AND and OR operators.
SELECT TEXT_SEARCH (content, 'machine AND (system OR recognition)', 'natural_language') FROM tbl;
-- Define the token search logic using + (required word) and - (excluded word).
SELECT TEXT_SEARCH (content, '+learning -machine system', 'natural_language') FROM tbl;
TOKENIZE function
The TOKENIZE function outputs tokenization results based on the tokenizer configuration. You can use this function to debug the tokenization effect of a full-text inverted index.
Function syntax
TOKENIZE (
<search_data> TEXT
[ ,<tokenizer> TEXT DEFAULT ''
,<analyzer_params> TEXT DEFAULT '']
)

Parameters
search_data: Required. The target text for tokenization. Constant input parameters are supported.
tokenizer, analyzer_params: Optional. The tokenizer and its configuration used for the search_data target text. By default, the jieba tokenizer is used.
Return value description
This function returns a collection of tokens from the target text. The type is a TEXT array.
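For example, tokenizing one of the sample sentences used later in this topic with the default jieba tokenizer (whose default filter chain includes lowercasing and English stemming) returns an array of stemmed tokens:

```sql
SELECT TOKENIZE('Shandong has dozens of universities.', 'jieba');
-- Returns: {shandong,has,dozen,univers}
```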
Verify index usage
You can check the execution plan to verify that the SQL statement used a full-text inverted index. If Fulltext Filter appears in the plan, the index was used successfully. For more information about execution plans, see EXPLAIN and EXPLAIN ANALYZE.
Sample SQL:
EXPLAIN ANALYZE SELECT * FROM wiki_articles WHERE text_search(content, '长江') > 0;

The following execution plan is returned. It contains the Fulltext Filter field, which indicates that the SQL statement successfully used the full-text inverted index.
QUERY PLAN
Gather (cost=0.00..1.00 rows=1 width=12)
-> Local Gather (cost=0.00..1.00 rows=1 width=12)
-> Index Scan using Clustering_index on wiki_articles (cost=0.00..1.00 rows=1 width=12)
Fulltext Filter: (text_search(content, search_expression => '长江'::text, mode => match, operator => OR, tokenizer => jieba, analyzer_params => {"filter":["removepunct","lowercase",{"stop_words":["_english_"],"type":"stop"},{"language":"english","type":"stemmer"}],"tokenizer":{"hmm":true,"mode":"search","type":"jieba"}}, options => ) > '0'::double precision)
Query Queue: init_warehouse.default_queue
Optimizer: HQO version 4.0.0

Usage examples
Prepare data
You can run the following SQL statements to create a test table and write data to it.
-- Create a table.
CREATE TABLE wiki_articles (id int, content text);
-- Create an index.
CREATE INDEX ft_idx_1 ON wiki_articles
USING FULLTEXT (content)
WITH (tokenizer = 'jieba');
-- Write data.
INSERT INTO wiki_articles VALUES
(1, 'The Yangtze River is the largest river in China and the third longest river in the world, with a total length of about 6,300 kilometers.'),
(2, 'Li was born in 1962 in Wendeng County, Shandong.'),
(3, 'He graduated from the department of physics at Shandong University.'),
(4, 'Spring Festival, also known as the Lunar New Year, is the most important traditional festival in China.'),
(5, 'Spring Festival usually falls between late January and mid-February in the Gregorian calendar. The main customs during the festival include pasting spring couplets, setting off firecrackers, having a New Year''s Eve dinner, and making New Year visits.'),
(6, 'In 2006, Spring Festival was approved by the State Council as one of the first national intangible cultural heritage items.'),
(7, 'Shandong has dozens of universities.'),
(8, 'ShanDa is a famous university of Shandong.');
-- Compaction
VACUUM wiki_articles;
-- Query table data.
SELECT * FROM wiki_articles LIMIT 1;

The following is a sample result:
id | content
---+---------------------------------------------------
1 | The Yangtze River is the largest river in China and the third longest river in the world, with a total length of about 6,300 kilometers.

Different search examples
Keyword match.
-- (K1) Keyword match (the default operator is OR). Documents that contain 'shandong' or 'university' are matched.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university') > 0;
-- The following result is returned:
 id | content
----+---------------------------------------------------------------------
  2 | Li was born in 1962 in Wendeng County, Shandong.
  3 | He graduated from the department of physics at Shandong University.
  7 | Shandong has dozens of universities.
  8 | ShanDa is a famous university of Shandong.

-- (K2) Keyword match (the operator is AND). Documents must contain both 'shandong' and 'university' to be matched.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university', operator => 'AND') > 0;
-- The following result is returned:
 id | content
----+---------------------------------------------------------------------
  3 | He graduated from the department of physics at Shandong University.
  7 | Shandong has dozens of universities.
  8 | ShanDa is a famous university of Shandong.

Phrase search.
-- (P1) Phrase search (default slop = 0). A match occurs only if 'shandong' is immediately followed by 'university'.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university', mode => 'phrase') > 0;
-- Result
 id | content
----+---------------------------------------------------------------------
  3 | He graduated from the department of physics at Shandong University.

-- (P2) Phrase search with slop = 14. The distance between 'shandong' and 'university' cannot exceed 14 characters.
-- This also matches "Shandong has dozens of universities."
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university', mode => 'phrase', options => 'slop=14;') > 0;
-- Result
 id | content
----+---------------------------------------------------------------------
  3 | He graduated from the department of physics at Shandong University.
  7 | Shandong has dozens of universities.

-- (P3) Phrase search supports phrases that are not in order, but the slop is calculated differently and must be larger than for an in-order phrase.
-- For example, 'university of Shandong' can also match the following query, but it is not matched if slop is 22.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university', mode => 'phrase', options => 'slop=23;') > 0;
-- Result
 id | content
----+---------------------------------------------------------------------
  3 | He graduated from the department of physics at Shandong University.
  7 | Shandong has dozens of universities.
  8 | ShanDa is a famous university of Shandong.

-- (P4) Punctuation is ignored (for the jieba tokenizer).
-- This is true even if the text has a comma between '长河' and '全长', while the query string has a period.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, '长河。全长', mode => 'phrase') > 0;
-- Result
 id | content
----+---------------------------------------------------------------------------------------------
  1 | The Yangtze River is the largest river in China and the third longest river in the world, with a total length of about 6,300 kilometers.

Natural language query.
-- (N1) Natural language query: Without any symbols, this is equivalent to a keyword match by default. It is the same as (K1).
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university', 'natural_language') > 0;
 id | content
----+---------------------------------------------------------------------
  7 | Shandong has dozens of universities.
  2 | Li was born in 1962 in Wendeng County, Shandong.
  3 | He graduated from the department of physics at Shandong University.
  8 | ShanDa is a famous university of Shandong.

-- (N2) Natural language query: Keyword match. A match occurs if the document contains ('shandong' AND 'university') OR 'culture'. The AND operator has higher precedence than OR.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, '(shandong AND university) OR culture', 'natural_language') > 0;
-- Equivalent to
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong AND university OR culture', 'natural_language') > 0;
-- Equivalent to
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, '(+shandong +university) culture', 'natural_language') > 0;
-- Result
 id | content
----+---------------------------------------------------------------------
  8 | ShanDa is a famous university of Shandong.
  7 | Shandong has dozens of universities.
  3 | He graduated from the department of physics at Shandong University.
  6 | In 2006, Spring Festival was approved by the State Council as one of the first national intangible cultural heritage items.

-- (N3) Natural language query: Keyword match. Must contain 'shandong', must not contain 'university', and may contain 'culture'.
-- In this query, the keyword 'culture' is not preceded by a + or - symbol. It does not affect which rows are matched, but it does affect the match score. Rows that contain 'culture' have a higher score.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, '+shandong -university culture', 'natural_language') > 0;
 id | content
----+--------------------------------------------------
  2 | Li was born in 1962 in Wendeng County, Shandong.

-- Must contain 'shandong', must not contain 'physics', and may contain 'famous'. The relevance score is higher if it contains 'famous'.
-- Note: This query calculates scores on a single shard. The calculated BM25 scores may vary depending on the number of shards and file organization.
SELECT id, content, TEXT_SEARCH(content, '+shandong -physics famous', 'natural_language') AS score
FROM wiki_articles
WHERE TEXT_SEARCH(content, '+shandong -physics famous', 'natural_language') > 0
ORDER BY score DESC;
-- Result
 id | content                                          | score
----+--------------------------------------------------+----------
  8 | ShanDa is a famous university of Shandong.       | 2.92376
  7 | Shandong has dozens of universities.             | 0.863399
  2 | Li was born in 1962 in Wendeng County, Shandong. | 0.716338

-- (N4) Natural language query: Phrase search. Equivalent to (P1). The phrase must be enclosed in double quotation marks (""). If the phrase contains a double quotation mark, escape it with a backslash (\).
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, '"shandong university"', 'natural_language') > 0;
-- Result
 id | content
----+---------------------------------------------------------------------
  3 | He graduated from the department of physics at Shandong University.

-- (N5) Natural language query: Phrase search. Equivalent to (P2). Supports the ~ syntax to set slop.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, '"shandong university"~23', 'natural_language') > 0;
-- Result
 id | content
----+---------------------------------------------------------------------
  8 | ShanDa is a famous university of Shandong.
  7 | Shandong has dozens of universities.
  3 | He graduated from the department of physics at Shandong University.
-- (N6) Natural language query: Match all documents.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, '*', 'natural_language') > 0;
-- Result
 id | content
----+----------------------------------------------------------------------------------------------
  1 | The Yangtze River is the largest river in China and the third longest river in the world, with a total length of about 6,300 kilometers.
  2 | Li was born in 1962 in Wendeng County, Shandong.
  3 | He graduated from the department of physics at Shandong University.
  4 | Spring Festival, also known as the Lunar New Year, is the most important traditional festival in China.
  5 | Spring Festival usually falls between late January and mid-February in the Gregorian calendar. The main customs during the festival include pasting spring couplets, setting off firecrackers, having a New Year's Eve dinner, and making New Year visits.
  6 | In 2006, Spring Festival was approved by the State Council as one of the first national intangible cultural heritage items.
  7 | Shandong has dozens of universities.
  8 | ShanDa is a famous university of Shandong.
Complex query examples
Query in conjunction with a primary key (PK).
-- Search for text that contains 'shandong' or 'university' and has an id of 3.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university') > 0 AND id = 3;
-- Result
 id | content
----+---------------------------------------------------------------------
  3 | He graduated from the department of physics at Shandong University.

-- Search for text that contains 'shandong' or 'university', or has an id less than 2.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university') > 0 OR id < 2;
-- Result
 id | content
----+---------------------------------------------------------------------
  2 | Li was born in 1962 in Wendeng County, Shandong.
  8 | ShanDa is a famous university of Shandong.
  1 | The Yangtze River is the largest river in China and the third longest river in the world, with a total length of about 6,300 kilometers.
  3 | He graduated from the department of physics at Shandong University.
  7 | Shandong has dozens of universities.

Find the scores and retrieve the top three.
SELECT id, content, TEXT_SEARCH(content, 'shandong university') AS score, TOKENIZE(content, 'jieba')
FROM wiki_articles
ORDER BY score DESC
LIMIT 3;
-- Result
 id | content                                                             | score   | tokenize
----+---------------------------------------------------------------------+---------+--------------------------------------------------
  8 | ShanDa is a famous university of Shandong.                          | 2.74634 | {shanda,famous,univers,shandong}
  7 | Shandong has dozens of universities.                                | 2.74634 | {shandong,has,dozen,univers}
  3 | He graduated from the department of physics at Shandong University. | 2.38178 | {he,graduat,from,depart,physic,shandong,univers}

Use the TEXT_SEARCH function in both the output and WHERE clauses.
SELECT id, content, TEXT_SEARCH(content, 'shandong university') AS score, TOKENIZE(content, 'jieba')
FROM wiki_articles
WHERE TEXT_SEARCH(content, 'shandong university') > 0
ORDER BY score DESC;
-- Result
 id | content                                                             | score   | tokenize
----+---------------------------------------------------------------------+---------+--------------------------------------------------
  7 | Shandong has dozens of universities.                                | 2.74634 | {shandong,has,dozen,univers}
  8 | ShanDa is a famous university of Shandong.                          | 2.74634 | {shanda,famous,univers,shandong}
  3 | He graduated from the department of physics at Shandong University. | 2.38178 | {he,graduat,from,depart,physic,shandong,univers}
  2 | Li was born in 1962 in Wendeng County, Shandong.                    | 1.09244 | {li,born,1962,wendeng,counti,shandong}

Search for documents from the wiki source that are most relevant to 'shandong university'.
-- Source table for the JOIN.
CREATE TABLE article_source (id int primary key, source text);
INSERT INTO article_source VALUES
(1, 'baike'), (2, 'wiki'), (3, 'wiki'), (4, 'baike'), (5, 'baike'), (6, 'baike'),
(7, 'wiki'), (8, 'paper'), (9, 'http_log'), (10, 'http_log'), (11, 'http_log');

SELECT a.id, source, content, TEXT_SEARCH(content, 'shandong university') AS score, TOKENIZE(a.content, 'jieba')
FROM wiki_articles a
JOIN article_source b ON (a.id = b.id)
WHERE TEXT_SEARCH(a.content, 'shandong university') > 0 AND b.source = 'wiki'
ORDER BY score DESC;
-- Result
 id | source | content                                                             | score   | tokenize
----+--------+---------------------------------------------------------------------+---------+--------------------------------------------------
  7 | wiki   | Shandong has dozens of universities.                                | 2.74634 | {shandong,has,dozen,univers}
  3 | wiki   | He graduated from the department of physics at Shandong University. | 2.38178 | {he,graduat,from,depart,physic,shandong,univers}
  2 | wiki   | Li was born in 1962 in Wendeng County, Shandong.                    | 1.09244 | {li,born,1962,wendeng,counti,shandong}
Different tokenizer examples
Use the default jieba tokenizer. The default is search mode, which creates more token variations to improve search results.
SELECT TOKENIZE('他来到北京清华大学', 'jieba');
-- Result
              tokenize
--------------------------------------
 {他,来到,北京,清华,华大,大学,清华大学}

Use a custom jieba tokenizer in exact mode, which does not create extra token variations.
SELECT TOKENIZE('他来到北京清华大学', 'jieba', '{"tokenizer": {"type": "jieba", "mode": "exact"}}');
-- Result
       tokenize
-----------------------
 {他,来到,北京,清华大学}

Tokenizer comparison.
SELECT TOKENIZE('他来到北京清华大学', 'jieba') AS jieba,
       TOKENIZE('他来到北京清华大学', 'keyword') AS keyword,
       TOKENIZE('他来到北京清华大学', 'whitespace') AS whitespace,
       TOKENIZE('他来到北京清华大学', 'simple') AS simple,
       TOKENIZE('他来到北京清华大学', 'standard') AS standard,
       TOKENIZE('他来到北京清华大学', 'icu') AS icu;
-- Result
-[ RECORD 1 ]--------------------------------------
jieba      | {他,来到,北京,清华,华大,大学,清华大学}
keyword    | {他来到北京清华大学}
whitespace | {他来到北京清华大学}
simple     | {他来到北京清华大学}
standard   | {他,来,到,北,京,清,华,大,学}
icu        | {他,来到,北京,清华大学}

Tokenization effect comparison for http_logs.
SELECT TOKENIZE('211.11.9.0 - - [1998-06-21T15:00:01-05:00] \"GET /english/index.html HTTP/1.0\" 304 0', 'jieba') AS jieba,
       TOKENIZE('211.11.9.0 - - [1998-06-21T15:00:01-05:00] \"GET /english/index.html HTTP/1.0\" 304 0', 'keyword') AS keyword,
       TOKENIZE('211.11.9.0 - - [1998-06-21T15:00:01-05:00] \"GET /english/index.html HTTP/1.0\" 304 0', 'whitespace') AS whitespace,
       TOKENIZE('211.11.9.0 - - [1998-06-21T15:00:01-05:00] \"GET /english/index.html HTTP/1.0\" 304 0', 'simple') AS simple,
       TOKENIZE('211.11.9.0 - - [1998-06-21T15:00:01-05:00] \"GET /english/index.html HTTP/1.0\" 304 0', 'standard') AS standard,
       TOKENIZE('211.11.9.0 - - [1998-06-21T15:00:01-05:00] \"GET /english/index.html HTTP/1.0\" 304 0', 'icu') AS icu;
-- Result
-[ RECORD 1 ]-----------------------------------------------------------------------------------------------
jieba      | {211.11,9.0,1998-06,21t15,00,01-05,00,get,english,index,html,http,1.0,304,0}
keyword    | {"211.11.9.0 - - [1998-06-21T15:00:01-05:00] \\\"GET /english/index.html HTTP/1.0\\\" 304 0"}
whitespace | {211.11.9.0,-,-,[1998-06-21T15:00:01-05:00],"\\\"GET",/english/index.html,"HTTP/1.0\\\"",304,0}
simple     | {211,11,9,0,1998,06,21t15,00,01,05,00,get,english,index,html,http,1,0,304,0}
standard   | {211.11.9.0,1998,06,21t15,00,01,05,00,get,english,index.html,http,1.0,304,0}
icu        | {211.11.9.0,1998,06,21t15,00,01,05,00,get,english,index.html,http,1.0,304,0}
Advanced operations: Customize tokenizer configurations
Hologres recommends using the default tokenizer configurations. However, in some real-world scenarios, the default configurations may not meet your business needs. You can customize the tokenizer configuration for more flexible tokenization.
analyzer_params configuration requirements
The configuration requirements for the analyzer_params parameter are as follows:
Only JSON strings are supported.
The top level of the JSON supports two keys, tokenizer and filter, which have the following values:

tokenizer: Required. A JSON object that configures tokenizer properties. The object supports the following keys:

type: Required. The tokenizer name.

mode: Optional. The tokenization mode. This key is supported only by the jieba tokenizer. Valid values:
search (default): Creates multiple token combinations, which allows for redundancy. For example, tokenizing "传统节日" (traditional festival) produces three tokens: "传统" (traditional), "节日" (festival), and "传统节日" (traditional festival).
exact: Prevents redundant splitting during tokenization. For example, tokenizing "传统节日" (traditional festival) produces only one token: "传统节日" (traditional festival).

hmm: Optional. Specifies whether to use a Hidden Markov Model to identify words that are not in the dictionary and improve new word detection. This key is supported only by the jieba tokenizer. Valid values:
true (default): Use the model.
false: Do not use the model.

filter: Optional. A JSON array that configures token filter properties. If you configure multiple token filter properties, they are applied to each token in the specified order.
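For reference, the execution plan shown earlier in this topic prints the default jieba analyzer_params, which combines both keys. A minimal sketch that sets these defaults explicitly (the index name is a placeholder):

```sql
-- Set the documented jieba defaults explicitly: search mode with HMM,
-- then a filter chain that strips punctuation-only tokens, lowercases,
-- removes English stop words, and applies an English stemmer.
ALTER INDEX idx1 SET (analyzer_params = '{
  "tokenizer": {"type": "jieba", "mode": "search", "hmm": true},
  "filter": [
    "removepunct",
    "lowercase",
    {"type": "stop", "stop_words": ["_english_"]},
    {"type": "stemmer", "language": "english"}
  ]
}');
```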
Default analyzer_params configurations
The default analyzer_params configurations for different tokenizers are as follows:
Tokenizer name | Default analyzer_params configuration | Tokenization example |
jieba (default tokenizer) | {"tokenizer":{"type":"jieba","mode":"search","hmm":true},"filter":["removepunct","lowercase",{"type":"stop","stop_words":["_english_"]},{"type":"stemmer","language":"english"}]} | |
whitespace | | |
keyword | | |
simple | | |
standard | | |
icu | | Tokens before filters: ["Chinese", "english", "Chinese.", "english.", "124", "124!=8", ".", ",", ",,", " ..."]; after the removepunct filter: ["Chinese", "english", "Chinese.", "english.", "124", "124!=8"] |
Filter configurations in analyzer_params
Hologres supports the following filters (token filter properties) in analyzer_params.
If you configure multiple token filter properties, they are applied to each token in the specified order.
Property name | Property description | Parameter format | Usage example |
lowercase | Converts uppercase letters in a token to lowercase. | Declare the string "lowercase" in the filter array. | "filter": ["lowercase"] |
stop | Removes stop word tokens. | A JSON object with type "stop" and a stop_words list. | "filter": [{"type": "stop", "stop_words": ["_english_"]}] |
stemmer | Converts a token to its corresponding root form (stem) based on the grammatical rules of the language. | A JSON object with type "stemmer" and a language. | "filter": [{"type": "stemmer", "language": "english"}] |
length | Removes tokens that exceed a specified length. | | |
removepunct | Removes tokens that consist only of punctuation characters. | Declare the string "removepunct" in the filter array. | "filter": ["removepunct"] |
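Putting the pieces together, the following sketch builds an index with a custom filter chain. It assumes the same analyzer_params JSON format as the TOKENIZE examples earlier in this topic; the index and table names are placeholders.

```sql
-- Hypothetical example: a standard tokenizer whose tokens are stripped of
-- punctuation-only entries, lowercased, and stemmed for English.
CREATE INDEX idx_custom ON tbl USING FULLTEXT (col1)
WITH (
  tokenizer = 'standard',
  analyzer_params = '{
    "tokenizer": {"type": "standard"},
    "filter": ["removepunct", "lowercase", {"type": "stemmer", "language": "english"}]
  }'
);
```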