Hologres V4.0 and later support full-text inverted indexes. This feature is built on Tantivy, a high-performance full-text search engine. It provides high-performance retrieval and supports the BM25 similarity scoring algorithm for document ranking, keyword search, and phrase search.
How it works
When you write source text to Hologres, it builds a full-text inverted index file for each data file based on your index configuration. First, a tokenizer breaks the text into tokens. The index then records the mapping between each token and the source text. It also records information such as position and term frequency.
To search the text, the search query is first tokenized into a set of target tokens. The BM25 algorithm then calculates a relevance score for each source text against the target tokens. This process enables fast and accurate full-text search.
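The two steps can be sketched with the `TOKENIZE` and `TEXT_SEARCH` functions described later in this topic. This is an illustrative sketch only; `tbl` and `content` are placeholder names, and the column is assumed to already have a full-text inverted index.

```sql
-- Step 1 (illustrative): tokenize a query string with the default Jieba tokenizer.
SELECT TOKENIZE('machine learning systems');

-- Step 2 (illustrative): score each source text against the tokenized query with BM25.
SELECT content, TEXT_SEARCH(content, 'machine learning systems') AS bm25_score
FROM tbl
ORDER BY bm25_score DESC;
```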
Notes
Full-text inverted indexes are supported only for column-oriented tables and row-column hybrid tables in Hologres V4.0 and later. Row-oriented tables are not supported.
You can create full-text inverted indexes only on columns of the TEXT, CHAR, or VARCHAR data type.
You can build a full-text inverted index on only one column at a time. Each column supports only one full-text inverted index. To index multiple columns, you must create a separate index for each one.
After you create a full-text inverted index, the index files for existing data and newly imported data are built asynchronously during the data compaction process. Until the index file is built, the BM25 relevance score for the data is 0.
For real-time data writes after you create a full-text inverted index: In versions earlier than Hologres V4.0.8, the index was built synchronously with real-time data writes. In V4.0.8 and later, the system asynchronously refreshes the in-memory real-time index every second. This ensures efficient data writes and index building. Data can be queried using the index only after the index is refreshed.
You can run full-text searches only on columns that have a full-text index. Brute-force searches on unindexed columns are not supported.
You can use Serverless Computing resources for batch data imports. Serverless resources complete compaction and full-text index building synchronously during the data import. For more information, see Use Serverless Computing for read and write tasks and Use Serverless Computing for compaction tasks. If you do not use Serverless resources, you must manually run the following command to trigger Compaction after you import data in batches or modify an index.
VACUUM <schema_name>.<table_name>;

The BM25 search algorithm calculates relevance scores at the file level. If you import a small amount of data, you can manually trigger Compaction to merge files and improve search accuracy.
You can use Serverless Computing resources to run full-text search queries.
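As a sketch, assuming the same `hg_computing_resource` GUC shown in the reindexing steps later in this topic also routes queries, a session can be switched to Serverless Computing resources before running a search. `tbl` and `content` are placeholder names.

```sql
-- Assumption: the hg_computing_resource GUC applies to query execution (session level).
SET hg_computing_resource = 'serverless';
SELECT * FROM tbl WHERE TEXT_SEARCH(content, 'keyword') > 0;
```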
You can choose a tokenizer based on the scenarios in the following table:
| Scenario | Tokenizer | Notes |
| --- | --- | --- |
| Keyword extraction from long articles | Jieba | Supports new word discovery and complex pattern switching. |
| Chinese descriptive text search | IK | Accurately identifies Chinese terms. |
| English title text search | Simple, Whitespace, Standard | Simple and efficient. Use as needed based on the target English text. |
| Fuzzy search for log text | Ngram | No dictionary required. Meets the needs of fuzzy text queries. |
| Pinyin search for Chinese product or person names | Pinyin | Supports various Chinese Pinyin scenarios such as full Pinyin, first letter, and polyphone derivation. |
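Before choosing a tokenizer from the table, you can preview how each one splits your text with the `TOKENIZE` function described later in this topic. The sample strings below are illustrative only.

```sql
-- Illustrative comparison of tokenizers on sample strings.
SELECT TOKENIZE('Shandong University', 'standard');
SELECT TOKENIZE('error code 504 gateway timeout', 'ngram');
SELECT TOKENIZE('machine-learning systems', 'whitespace');
```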
Manage indexes
Create an index
Syntax
CREATE INDEX [ IF NOT EXISTS ] idx_name ON table_name
USING FULLTEXT (column_name [ , ... ])
[ WITH ( storage_parameter [ = value ] [ , ... ] ) ];

Parameters
| Parameter | Description |
| --- | --- |
| idx_name | The index name. |
| table_name | The target table name. |
| column_name | The name of the target column on which to build the full-text inverted index. |
| storage_parameter | The parameters for the full-text inverted index. Two parameters are supported: `tokenizer` and `analyzer_params`. **Note** Only one `tokenizer` and one `analyzer_params` setting can be specified for the same index. |
Examples
Create a full-text inverted index that uses the default tokenizer and configuration. The default tokenizer is the Jieba tokenizer.
CREATE INDEX idx1 ON tbl USING FULLTEXT (col1);

Explicitly specify the IK tokenizer and use its default configuration.

CREATE INDEX idx1 ON tbl USING FULLTEXT (col1) WITH (tokenizer = 'ik');

Explicitly specify a custom tokenizer configuration that uses the Jieba tokenizer in `exact` mode and only the `lowercase` filter to convert text to lowercase.
CREATE INDEX idx1 ON tbl USING FULLTEXT (col1) WITH (tokenizer = 'jieba', analyzer_params = '{"tokenizer":{"type":"jieba","mode":"exact"}, "filter":["lowercase"]}');
After you create a full-text inverted index, the index files are built during the Compaction process after data is imported.
You can use Serverless Computing resources for batch data imports. Serverless resources complete compaction and full-text index building synchronously during the data import. For more information, see Use Serverless Computing for read and write tasks and Use Serverless Computing for compaction tasks.
If you do not use Serverless resources, you must manually run the following command to trigger Compaction after you import data in batches or modify an index. For more information, see Compaction (Beta).
VACUUM <schema_name>.<table_name>;
Modify an index
Syntax
-- Modify the index configuration
ALTER INDEX [ IF EXISTS ] <idx_name> SET ( <storage_parameter> = '<storage_value>' [ , ... ] );
-- Reset to the default configuration
ALTER INDEX [ IF EXISTS ] <idx_name> RESET ( <storage_parameter> [ , ... ] );

Parameters
For more information about the parameters, see Parameters.
Examples
After you modify a full-text inverted index, the index files are rebuilt asynchronously during the data compaction process. To trigger Compaction synchronously, manually run the `VACUUM <schema_name>.<table_name>;` command. For more information, see Compaction.
Change the index tokenizer to `standard`.
ALTER INDEX idx1 SET (tokenizer = 'standard');

Change the index tokenizer to `ik` with the `ik_max_word` mode, without converting text to lowercase or adding filters.

ALTER INDEX idx1 SET ( tokenizer = 'ik', analyzer_params = '{"tokenizer":{"type":"ik","mode":"ik_max_word","enable_lowercase": false}}' );

Reset the index to use the default Jieba tokenizer and its default `analyzer_params` configuration.

ALTER INDEX idx1 RESET (tokenizer);
ALTER INDEX idx1 RESET (tokenizer, analyzer_params);

Reset the `analyzer_params` configuration to the default for the current tokenizer.

ALTER INDEX idx1 RESET (analyzer_params);
Delete an index
Syntax
DROP INDEX [ IF EXISTS ] <idx_name> [ RESTRICT ];

Parameters
For more information about the parameters, see Parameters.
View indexes
Hologres provides the hologres.hg_index_properties system table to view created full-text inverted indexes and their locations.
SELECT * FROM hologres.hg_index_properties;

You can run the following SQL statement to view the table and column that correspond to an index.
SELECT
t.relname AS table_name,
a.attname AS column_name
FROM pg_class t
JOIN pg_index i ON t.oid = i.indrelid
JOIN pg_class idx ON i.indexrelid = idx.oid
JOIN pg_attribute a ON a.attrelid = t.oid AND a.attnum = ANY(i.indkey)
WHERE t.relnamespace = (SELECT oid FROM pg_namespace WHERE nspname = '<namespace>')
AND idx.relname = '<indexname>'
LIMIT 1;

Parameters:

namespace: The value of the table_namespace field in the result returned by the `SELECT * FROM hologres.hg_index_properties;` command.
indexname: The name of the index.
Use indexes for full-text search
Hologres supports various search modes that allow you to perform full-text searches flexibly as needed.
| Search mode | Description |
| --- | --- |
| Keyword match | Searches by the keywords from the tokenized search query. Supports defining AND/OR relationships between keywords. |
| Phrase search | Searches by phrases from the search query. A match requires that the distance between multiple words meets the specified requirement. |
| Natural language search | Lets you define complex query conditions to flexibly achieve search goals, such as defining AND/OR relationships, required words, excluded words, and phrases. |
| Term search | Performs an exact search for the search query. A match requires that the index contains the exact query string. |
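A minimal sketch of how the four modes change matching behavior. `tbl`, `content`, and the query strings are placeholders; full worked examples appear later in this topic.

```sql
-- Keyword match (default): rows containing 'deep' or 'learning'.
SELECT * FROM tbl WHERE TEXT_SEARCH(content, 'deep learning') > 0;
-- Phrase search: 'deep' must be adjacent to 'learning' (slop = 0 by default).
SELECT * FROM tbl WHERE TEXT_SEARCH(content, 'deep learning', 'phrase') > 0;
-- Natural language: must contain 'deep', must not contain 'shallow'.
SELECT * FROM tbl WHERE TEXT_SEARCH(content, '+deep -shallow learning', 'natural_language') > 0;
-- Term search: the index must contain the exact query string.
SELECT * FROM tbl WHERE TEXT_SEARCH(content, 'deep learning', 'term') > 0;
```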
TEXT_SEARCH search function
The `TEXT_SEARCH` function calculates the BM25 relevance score for a source text based on a search query.
Function syntax
TEXT_SEARCH (
<search_data> TEXT/VARCHAR/CHAR
,<search_expression> TEXT
[ ,<mode> TEXT DEFAULT 'match'
,<operator> TEXT DEFAULT 'OR'
,<tokenizer> TEXT DEFAULT ''
,<analyzer_params> TEXT DEFAULT ''
,<options> TEXT DEFAULT '']
)

Parameters
| Parameter | Required | Description |
| --- | --- | --- |
| search_data | Yes | The search source. This parameter supports the TEXT, VARCHAR, and CHAR data types. Only column input is supported, and the column must have a full-text index. Otherwise, an error is reported. |
| search_expression | Yes | The search query. This parameter supports the TEXT, VARCHAR, and CHAR data types. Only constants are supported. |
| mode | No | The search mode. Valid values: `match` (default, keyword match), `phrase` (phrase search), `natural_language` (natural language search), and `term` (term search). |
| operator | No | The logical operator between keywords. This parameter takes effect only when mode is set to `match`. Valid values: `OR` (default) and `AND`. |
| tokenizer, analyzer_params | No | The tokenizer and configuration used for the search query search_expression. You do not usually need to configure these parameters. |
| options | No | Other parameters for the full-text index, specified in the `key=value;` format. Currently, only the slop parameter is supported. It takes effect only when mode is set to phrase. You can set slop to 0 (default) or a positive integer. This value defines the maximum distance allowed between terms in a phrase. **Note** The slop value specifies the maximum allowed gap, or transformation overhead, between terms in a phrase. For tokenizers such as jieba, keyword, and icu, the gap is measured in characters, not tokens. For tokenizers such as standard, simple, and whitespace, the gap is measured in tokens. |
Return value
This function returns a non-negative FLOAT value that represents the BM25 relevance score between the source text and the search query. A higher score indicates a higher relevance. A score of 0 indicates that the texts are completely irrelevant.
Examples
Use the keyword match mode and change the operator to AND.
-- Specify the parameter name.
SELECT TEXT_SEARCH (content, 'machine learning', operator => 'AND') FROM tbl;
-- Do not specify the parameter name. You must specify the parameters in order.
SELECT TEXT_SEARCH (content, 'machine learning', 'match', 'AND') FROM tbl;

Use the phrase search mode and set slop to 2.

SELECT TEXT_SEARCH (content, 'machine learning', 'phrase', options => 'slop=2;') FROM tbl;

Use the natural language search mode.

-- Define the token search logic using the AND and OR operators.
SELECT TEXT_SEARCH (content, 'machine AND (system OR recognition)', 'natural_language') FROM tbl;
-- Define the token search logic using + (must include) and - (must exclude).
SELECT TEXT_SEARCH (content, '+learning -machine system', 'natural_language') FROM tbl;

Use the term search mode.

SELECT TEXT_SEARCH (content, 'machine learning', 'term') FROM tbl;
TOKENIZE function
The `TOKENIZE` function outputs tokenization results based on the tokenizer configuration. You can use it to debug the tokenization effect of a full-text inverted index.
Function syntax
TOKENIZE (
<search_data> TEXT
[ ,<tokenizer> TEXT DEFAULT ''
,<analyzer_params> TEXT DEFAULT '']
)

Parameters
search_data: Required. The target text for tokenization. Only constant input is supported.
tokenizer, analyzer_params: Optional. The tokenizer and configuration used for the target text search_data. The default is the Jieba tokenizer.
Return value
This function returns a TEXT array that contains the collection of tokens from the target text.
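A quick way to debug a planned index configuration is to pass `TOKENIZE` the same `analyzer_params` you intend to use for the index. The input string below is illustrative.

```sql
-- Default Jieba tokenizer.
SELECT TOKENIZE('The Spring Festival is a traditional festival');
-- Explicit tokenizer and configuration, mirroring an index definition.
SELECT TOKENIZE('The Spring Festival is a traditional festival', 'jieba',
                '{"tokenizer":{"type":"jieba","mode":"exact"}, "filter":["lowercase"]}');
```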
Verify index usage
You can use an execution plan to determine whether an SQL statement uses a full-text inverted index. If the execution plan includes Fulltext Filter, the full-text inverted index is used. For more information about execution plans, see EXPLAIN and EXPLAIN ANALYZE.
Example SQL statement:
EXPLAIN ANALYZE SELECT * FROM wiki_articles WHERE text_search(content, 'Yangtze River') > 0;

The following execution plan contains the Fulltext Filter field. This indicates that the SQL statement uses a full-text inverted index.
QUERY PLAN
Gather (cost=0.00..1.00 rows=1 width=12)
-> Local Gather (cost=0.00..1.00 rows=1 width=12)
-> Index Scan using Clustering_index on wiki_articles (cost=0.00..1.00 rows=1 width=12)
Fulltext Filter: (text_search(content, search_expression => 'Yangtze River'::text, mode => match, operator => OR, tokenizer => jieba, analyzer_params => {"filter":["removepunct","lowercase",{"stop_words":["_english_"],"type":"stop"},{"language":"english","type":"stemmer"}],"tokenizer":{"hmm":true,"mode":"search","type":"jieba"}}, options => ) > '0'::double precision)
Query Queue: init_warehouse.default_queue
Optimizer: HQO version 4.0.0

Usage examples
Data preparation
Run the following SQL statements to create a test table and insert data into it.
-- Create a table.
CREATE TABLE wiki_articles (id int, content text);
-- Create an index.
CREATE INDEX ft_idx_1 ON wiki_articles
USING FULLTEXT (content)
WITH (tokenizer = 'jieba');
-- Write data.
INSERT INTO wiki_articles VALUES
(1, 'The Yangtze River is the longest river in China and the third longest in the world, with a total length of about 6,300 kilometers.'),
(2, 'Li was born in 1962 in Wendeng County, Shandong.'),
(3, 'He graduated from the department of physics at Shandong University.'),
(4, 'The Spring Festival, also known as the Lunar New Year, is the most important traditional festival in China.'),
(5, 'The Spring Festival usually falls between late January and mid-February in the Gregorian calendar. Main customs during the Spring Festival include pasting spring couplets, setting off firecrackers, having a New Year''s Eve dinner, and making New Year visits.'),
(6, 'In 2006, the Spring Festival was approved by the State Council as one of the first national intangible cultural heritages.'),
(7, 'Shandong has dozens of universities.'),
(8, 'ShanDa is a famous university of Shandong.');
-- Compaction
VACUUM wiki_articles;
-- Query table data.
SELECT * FROM wiki_articles limit 1;

The following result is returned:
id | content
---+---------------------------------------------------
 1 | The Yangtze River is the longest river in China and the third longest in the world, with a total length of about 6,300 kilometers.

Different search examples
Keyword match.
-- (K1) Keyword match (default operator=OR). Documents containing 'shandong' or 'university' are matched.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university') > 0;
-- Result
 id | content
----+---------------------------------------------------------------------
  2 | Li was born in 1962 in Wendeng County, Shandong.
  3 | He graduated from the department of physics at Shandong University.
  7 | Shandong has dozens of universities.
  8 | ShanDa is a famous university of Shandong.
-- (K2) Keyword match (operator=AND). Documents must contain both 'shandong' and 'university' to be matched.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university', operator => 'AND') > 0;
-- Result
 id | content
----+---------------------------------------------------------------------
  3 | He graduated from the department of physics at Shandong University.
  7 | Shandong has dozens of universities.
  8 | ShanDa is a famous university of Shandong.

Phrase search.
-- (P1) Phrase search (default slop = 0). A match occurs only if 'shandong' is immediately followed by 'university'.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university', mode => 'phrase') > 0;
-- Result
 id | content
----+---------------------------------------------------------------------
  3 | He graduated from the department of physics at Shandong University.
(1 row)
-- (P2) Phrase search with slop = 14. The distance between 'shandong' and 'university' cannot exceed 14 characters.
-- This matches "Shandong has dozens of universities."
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university', mode => 'phrase', options => 'slop=14;') > 0;
-- Result
 id | content
----+---------------------------------------------------------------------
  3 | He graduated from the department of physics at Shandong University.
  7 | Shandong has dozens of universities.
(2 rows)
-- (P3) Phrase search supports unordered phrases, but the slop calculation is different and requires a larger value than for ordered phrases.
-- Therefore, 'university of Shandong' can also match the following query, but it will not be matched if slop=22.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university', mode => 'phrase', options => 'slop=23;') > 0;
-- Result
 id | content
----+---------------------------------------------------------------------
  3 | He graduated from the department of physics at Shandong University.
  7 | Shandong has dozens of universities.
  8 | ShanDa is a famous university of Shandong.
(3 rows)
-- (P4) Behavior with a standard tokenizer index. (ALTER INDEX ft_idx_1 SET (tokenizer = 'standard');)
-- For the standard tokenizer, slop is calculated in tokens.
-- As long as there are 0 tokens in between, regardless of the number of spaces, it is considered a phrase match.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university', mode => 'phrase') > 0;
-- Result
 id | content
----+------------------------
  1 | shandong university
  2 | shandong  university
  3 | shandong university
  4 | shandong university
(4 rows)
-- (P5) Punctuation is ignored. (Example with Jieba tokenizer)
-- Even though there is a comma between 'river' and 'with' in the text, and the query string has a period.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'river. with', mode => 'phrase') > 0;
-- Result
 id | content
----+-----------------------------------------------------
  1 | The Yangtze River is the longest river in China and the third longest in the world, with a total length of about 6,300 kilometers.
(1 row)

Natural language search.
-- (N1) Natural language query: Without any symbols, this is equivalent to a keyword match. Same as (K1).
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university', 'natural_language') > 0;
 id | content
----+---------------------------------------------------------------------
  7 | Shandong has dozens of universities.
  2 | Li was born in 1962 in Wendeng County, Shandong.
  3 | He graduated from the department of physics at Shandong University.
  8 | ShanDa is a famous university of Shandong.
-- (N2) Natural language query: Keyword match. Must contain ('shandong' AND 'university') OR 'heritage'. The AND operator has higher precedence than OR.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, '(shandong AND university) OR heritage', 'natural_language') > 0;
-- Equivalent to
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong AND university OR heritage', 'natural_language') > 0;
-- Equivalent to
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, '(+shandong +university) heritage', 'natural_language') > 0;
-- Result
 id | content
----+---------------------------------------------------------------------
  8 | ShanDa is a famous university of Shandong.
  7 | Shandong has dozens of universities.
  3 | He graduated from the department of physics at Shandong University.
  6 | In 2006, the Spring Festival was approved by the State Council as one of the first national intangible cultural heritages.
-- (N3) Natural language query: Keyword match. Must contain 'shandong', must not contain 'university', and may contain 'heritage'.
-- In this query, the 'heritage' keyword does not have a + or - sign, so it does not affect which rows are matched.
-- However, it does affect the match score. Rows with 'heritage' will have a higher score.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, '+shandong -university heritage', 'natural_language') > 0;
 id | content
----+--------------------------------------------------
  2 | Li was born in 1962 in Wendeng County, Shandong.
-- Must contain 'shandong', must not contain 'physics', and may contain 'famous'. The relevance score is higher if it contains 'famous'.
-- Note: This query shows the score calculation result on a single shard. The calculated BM25 score may vary with different numbers of shards and different file organizations.
SELECT id, content, TEXT_SEARCH(content, '+shandong -physics famous', 'natural_language') AS score
FROM wiki_articles
WHERE TEXT_SEARCH(content, '+shandong -physics famous', 'natural_language') > 0
ORDER BY score DESC;
-- Result
 id | content                                          | score
----+--------------------------------------------------+----------
  8 | ShanDa is a famous university of Shandong.       | 2.92376
  7 | Shandong has dozens of universities.             | 0.863399
  2 | Li was born in 1962 in Wendeng County, Shandong. | 0.716338
-- (N4) Natural language query: Phrase search. Equivalent to (P1). The phrase must be enclosed in double quotation marks ("").
-- If the phrase contains a double quotation mark, it must be escaped with a backslash (\).
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, '"shandong university"', 'natural_language') > 0;
-- Result
 id | content
----+---------------------------------------------------------------------
  3 | He graduated from the department of physics at Shandong University.
-- (N5) Natural language query: Phrase search. Equivalent to (P3). Supports setting slop with the ~ syntax.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, '"shandong university"~23', 'natural_language') > 0;
-- Result
 id | content
----+---------------------------------------------------------------------
  8 | ShanDa is a famous university of Shandong.
  7 | Shandong has dozens of universities.
  3 | He graduated from the department of physics at Shandong University.
-- (N6) Natural language query: Match all documents.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, '*', 'natural_language') > 0;
-- Result
 id | content
----+----------------------------------------------------------------------------------------------
  1 | The Yangtze River is the longest river in China and the third longest in the world, with a total length of about 6,300 kilometers.
  2 | Li was born in 1962 in Wendeng County, Shandong.
  3 | He graduated from the department of physics at Shandong University.
  4 | The Spring Festival, also known as the Lunar New Year, is the most important traditional festival in China.
  5 | The Spring Festival usually falls between late January and mid-February in the Gregorian calendar. Main customs during the Spring Festival include pasting spring couplets, setting off firecrackers, having a New Year's Eve dinner, and making New Year visits.
  6 | In 2006, the Spring Festival was approved by the State Council as one of the first national intangible cultural heritages.
  7 | Shandong has dozens of universities.
  8 | ShanDa is a famous university of Shandong.

Term search.
-- (T1) Term query: The tokenization result explicitly contains "Spring Festival", so it can be retrieved.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'Spring Festival', 'term') > 0;
-- Result
 id | content
----+----------------------------------------------------------------------------------------------
  4 | The Spring Festival, also known as the Lunar New Year, is the most important traditional festival in China.
  5 | The Spring Festival usually falls between late January and mid-February in the Gregorian calendar. Main customs during the Spring Festival include pasting spring couplets, setting off firecrackers, having a New Year's Eve dinner, and making New Year visits.
  6 | In 2006, the Spring Festival was approved by the State Council as one of the first national intangible cultural heritages.
-- (T2) Term query: Example that returns no results.
-- Because the content column of the wiki_articles source table uses the Jieba tokenizer by default, which splits "shandong university".
-- We recommend using term queries with the keyword tokenizer. In that case, the target data can be retrieved.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university', 'term') > 0;
-- Result
 id | content
----+---------
Complex query examples
Query with a primary key.
-- Retrieve text that contains 'shandong' or 'university' and has an id of 3.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university') > 0 AND id = 3;
-- Result
 id | content
----+---------------------------------------------------------------------
  3 | He graduated from the department of physics at Shandong University.
-- Retrieve text that contains 'shandong' or 'university', or has an id less than 2.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university') > 0 OR id < 2;
-- Result
 id | content
----+---------------------------------------------------------------------
  2 | Li was born in 1962 in Wendeng County, Shandong.
  8 | ShanDa is a famous university of Shandong.
  1 | The Yangtze River is the longest river in China and the third longest in the world, with a total length of about 6,300 kilometers.
  3 | He graduated from the department of physics at Shandong University.
  7 | Shandong has dozens of universities.

Obtain the score and retrieve the top three results.
SELECT id, content, TEXT_SEARCH(content, 'shandong university') AS score, TOKENIZE(content, 'jieba')
FROM wiki_articles
ORDER BY score DESC
LIMIT 3;
-- Result
 id | content                                                             | score   | tokenize
----+---------------------------------------------------------------------+---------+--------------------------------------------------
  8 | ShanDa is a famous university of Shandong.                          | 2.74634 | {shanda,famous,univers,shandong}
  7 | Shandong has dozens of universities.                                | 2.74634 | {shandong,has,dozen,univers}
  3 | He graduated from the department of physics at Shandong University. | 2.38178 | {he,graduat,from,depart,physic,shandong,univers}

Use the TEXT_SEARCH function in both the SELECT list and the WHERE clause.
SELECT id, content, TEXT_SEARCH(content, 'shandong university') AS score, TOKENIZE(content, 'jieba')
FROM wiki_articles
WHERE TEXT_SEARCH(content, 'shandong university') > 0
ORDER BY score DESC;
-- Result
 id | content                                                             | score   | tokenize
----+---------------------------------------------------------------------+---------+--------------------------------------------------
  7 | Shandong has dozens of universities.                                | 2.74634 | {shandong,has,dozen,univers}
  8 | ShanDa is a famous university of Shandong.                          | 2.74634 | {shanda,famous,univers,shandong}
  3 | He graduated from the department of physics at Shandong University. | 2.38178 | {he,graduat,from,depart,physic,shandong,univers}
  2 | Li was born in 1962 in Wendeng County, Shandong.                    | 1.09244 | {li,born,1962,wendeng,counti,shandong}

Retrieve the documents from the 'wiki' source that are most relevant to 'shandong university'.
-- Source table for JOIN.
CREATE TABLE article_source (id int primary key, source text);
INSERT INTO article_source VALUES (1, 'baike'), (2, 'wiki'), (3, 'wiki'), (4, 'baike'), (5, 'baike'), (6, 'baike'), (7, 'wiki'), (8, 'paper'), (9, 'http_log'), (10, 'http_log'), (11, 'http_log');

SELECT a.id, source, content, TEXT_SEARCH(content, 'shandong university') AS score, TOKENIZE(a.content, 'jieba')
FROM wiki_articles a
JOIN article_source b ON (a.id = b.id)
WHERE TEXT_SEARCH(a.content, 'shandong university') > 0 AND b.source = 'wiki'
ORDER BY score DESC;
-- Result
 id | source | content                                                             | score   | tokenize
----+--------+---------------------------------------------------------------------+---------+--------------------------------------------------
  7 | wiki   | Shandong has dozens of universities.                                | 2.74634 | {shandong,has,dozen,univers}
  3 | wiki   | He graduated from the department of physics at Shandong University. | 2.38178 | {he,graduat,from,depart,physic,shandong,univers}
  2 | wiki   | Li was born in 1962 in Wendeng County, Shandong.                    | 1.09244 | {li,born,1962,wendeng,counti,shandong}
Usage recommendations
Reindex using Serverless resources
If you modify table properties, compaction may be triggered and the index may be rebuilt, which consumes a large amount of CPU resources. To modify the following table properties, perform the steps below:
- Modifying `bitmap_columns`, `dictionary_encoding_columns`, or vector indexes triggers compaction and reindexing. Therefore, do not use the `ALTER TABLE xxx SET` syntax. Instead, run the following command to use the `REBUILD` syntax with Serverless Computing resources. For more information, see REBUILD.
ASYNC REBUILD TABLE <table_name>
WITH (
rebuild_guc_hg_computing_resource = 'serverless'
)
SET (
bitmap_columns = '<col1>,<col2>',
dictionary_encoding_columns = '<col1>:on,<col2>:off',
vectors = '{
"<col_vector>": {
"algorithm": "HGraph",
"distance_method": "Cosine",
"builder_params": {
"base_quantization_type": "rabitq",
"graph_storage_type": "compressed",
"max_degree": 64,
"ef_construction": 400,
"precise_quantization_type": "fp32",
"use_reorder": true,
"max_total_size_to_merge_mb" : 4096
}
}
}'
);
- Modifying column-oriented JSONB columns or full-text index columns also triggers compaction and reindexing. The `REBUILD` syntax is not supported for these modifications. Instead, create a temporary table by following these steps:
BEGIN ;
-- Clean up potential temporary tables.
DROP TABLE IF EXISTS <table_new>;
-- Create a temporary table.
SET hg_experimental_enable_create_table_like_properties=on;
CALL HG_CREATE_TABLE_LIKE ('<table_new>', 'select * from <table>');
COMMIT ;
-- Enable column-oriented storage for JSON-formatted data for the corresponding column.
ALTER TABLE <table_new> ALTER COLUMN <column_name> SET (enable_columnar_type = ON);
-- Create a full-text index on the corresponding column.
CREATE INDEX <idx_name> ON <table_new> USING FULLTEXT (column_name);
-- Insert data into the temporary table, use Serverless resources to run the job, and synchronously complete index building.
SET hg_computing_resource = 'serverless';
INSERT INTO <table_new> SELECT * FROM <table>;
ANALYZE <table_new>;
BEGIN ;
-- Delete the old table.
DROP TABLE IF EXISTS <table>;
-- Rename the temporary table.
ALTER TABLE <table_new> RENAME TO <table>;
COMMIT ;
- To modify other properties, such as `distribution_key`, `clustering_key`, `segment_key`, and storage format, use the `REBUILD` syntax with Serverless Computing resources.
Advanced operations: Customize tokenizer configurations
Hologres recommends using the default tokenizer configurations. However, in some cases, the default configurations may not meet your business needs. You can customize the tokenizer configuration for more flexible tokenization.
analyzer_params configuration requirements
The `analyzer_params` parameter has the following configuration requirements:
Only JSON-formatted strings are supported.
The top level of the JSON supports two keys: `tokenizer` and `filter`.
filter: Optional. The value is a JSON array used to configure token filter properties. If you configure multiple token filter properties, they are applied to each token in the specified order.
tokenizer: Required. The value is a JSON object used to configure tokenizer properties. The JSON object supports the following keys:
type: Required. The tokenizer name.
Other parameters for the tokenizer object. Different tokenizers support different parameters, as shown in the following table:
| Tokenizer | Parameter | Parameter description | Parameter values |
| --- | --- | --- | --- |
| jieba | mode | Defines the tokenization mode. | `search` (default): Lists multiple possible combinations during tokenization, which allows for redundancy. For example, "traditional festival" is tokenized as "traditional", "festival", and "traditional festival".<br>`exact`: Prevents redundant splitting during tokenization. For example, "traditional festival" is tokenized only as "traditional festival". |
| jieba | hmm | Specifies whether to use a Hidden Markov Model (HMM) to identify words that are not in the dictionary, which improves new word recognition. | `true` (default): Enabled.<br>`false`: Disabled. |
| standard | max_token_length | The maximum token length. | A positive integer. The default value is 255. If a token exceeds this length, it is split at intervals of `max_token_length`. |
| ik | mode | Defines the tokenization mode. | `ik_max_word` (default): Performs fine-grained tokenization that outputs all possible short words. For example, "Nanjing City" is tokenized as "Nanjing" and "City".<br>`ik_smart`: Performs coarse-grained tokenization that prioritizes long words and reduces the number of splits. The output tokens do not overlap, and numbers and quantifiers are combined into a single token where possible. For example, "Nanjing City" and "1 kilometer" are not further tokenized. |
| ik | enable_lowercase | Specifies whether to convert tokens to lowercase. | `true` (default)<br>`false` |
| ngram | min_ngram | The minimum character length of a token. | A positive integer. The default value is 1. The difference between `max_ngram` and `min_ngram` cannot exceed 3.<br>Note: You can adjust this limit by setting a GUC, for example `SET hg_fulltext_index_max_ngram_diff = 5;`. |
| ngram | max_ngram | The maximum character length of a token. | An integer in the range [1, 255]. The default value is 2. The difference between `max_ngram` and `min_ngram` cannot exceed 3.<br>Note: If the difference between `max_ngram` and `min_ngram` is too large, the ngram tokenizer generates many tokens. This can increase resource consumption, require more index storage, and lengthen index building times. |
| ngram | prefix_only | Specifies whether to generate tokens only from prefixes of the input. | `true`<br>`false` (default) |
| pinyin | keep_first_letter | Specifies whether to keep the first letter of each Chinese character's Pinyin, joined as a single token. | `true` (default): Keeps the first letters. For example, the tokenization result for "Li Ming" includes "lm".<br>`false`: Does not keep them. |
| pinyin | keep_separate_first_letter | Specifies whether to keep the first letter of each Chinese character's Pinyin as a separate token. | `true`: Keeps the first letters separately. For example, the tokenization result for "Li Ming" includes "l,m".<br>`false` (default): Does not keep them separately. |
| pinyin | limit_first_letter_length | The maximum length of the first-letter token. | An integer that specifies the maximum length of the first-letter string. The default value is 16. |
| pinyin | keep_full_pinyin | Specifies whether to keep the full Pinyin of each Chinese character as separate tokens. | `true` (default): Keeps the full Pinyin. For example, the tokenization result for "Li Ming" includes "li,ming".<br>`false`: Does not keep it. |
| pinyin | keep_joined_full_pinyin | Specifies whether to join the full Pinyin of all Chinese characters into a single token. | `true`: Joins the full Pinyin. For example, the tokenization result for "Li Ming" includes "liming".<br>`false` (default): Does not join it. |
| pinyin | keep_none_chinese | Specifies whether to keep non-Chinese letters or numbers in the tokenization result. | `true` (default): Keeps them.<br>`false`: Does not keep them. |
| pinyin | keep_none_chinese_together | Specifies whether to keep non-Chinese letters or numbers together. | `true` (default): Keeps non-Chinese characters together. For example, the tokenization result for "DJ Li Ming" is "DJ,li,ming", where "DJ" stays together.<br>`false`: Does not keep non-Chinese characters together. For example, the tokenization result for "DJ Li Ming" is "D,J,li,ming".<br>Note: This parameter takes effect only when `keep_none_chinese` is set to `true`. |
| pinyin | keep_none_chinese_in_first_letter | Specifies whether to keep non-Chinese letters or numbers in the first-letter token. | `true` (default): Keeps them. For example, the tokenization result for "Li Ming AT2025" includes "lmat2025".<br>`false`: Does not keep them. |
| pinyin | keep_none_chinese_in_joined_full_pinyin | Specifies whether to keep non-Chinese letters or numbers in the joined full Pinyin token. | `true`: Keeps them. For example, the tokenization result for "Li Ming AT2025" includes "limingat2025".<br>`false` (default): Does not keep them. |
| pinyin | none_chinese_pinyin_tokenize | Specifies whether to split non-Chinese letters into separate Pinyin terms when they form valid Pinyin. | `true` (default): Splits them. For example, the tokenization result for "limingalibaba2025" includes "li,ming,a,li,ba,ba,2025".<br>`false`: Does not split them.<br>Note: This parameter takes effect only when `keep_none_chinese` and `keep_none_chinese_together` are set to `true`. |
| pinyin | keep_original | Specifies whether to keep the original input as a token. | `true`: Keeps the original input.<br>`false` (default): Does not keep it. |
| pinyin | lowercase | Specifies whether to convert non-Chinese letters to lowercase. | `true` (default)<br>`false` |
| pinyin | trim_whitespace | Specifies whether to trim whitespace characters. | `true` (default)<br>`false` |
| pinyin | remove_duplicated_term | Specifies whether to remove duplicate terms. | `true`: Removes duplicate terms. For example, "de的" is then tokenized as a single "de", which affects the results of phrase queries for "de的".<br>`false` (default): Does not remove them. |
| pinyin | keep_separate_chinese | Specifies whether to keep individual Chinese characters as separate tokens. | `true`: Keeps individual characters. For example, the tokenization result for "Li Ming" includes the two tokens "Li" and "Ming".<br>`false` (default): Does not keep them. |
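As an illustration of how the tokenizer parameters above map onto an index definition, the following is a minimal sketch. The table name `doc_table`, column `content`, and index name `idx_content_fulltext` are placeholders, and the `WITH (analyzer = ..., analyzer_params = ...)` option names are an assumption not shown elsewhere in this topic, so verify the exact syntax against the `CREATE INDEX` reference for your Hologres version.

```sql
-- Hypothetical example: a jieba tokenizer in exact mode with HMM enabled.
-- The WITH option names (analyzer, analyzer_params) are assumptions here.
CREATE INDEX idx_content_fulltext ON doc_table USING FULLTEXT (content)
WITH (
  analyzer = 'jieba',
  analyzer_params = '{
    "tokenizer": {
      "type": "jieba",
      "mode": "exact",
      "hmm": true
    }
  }'
);
```

Note that `analyzer_params` must be a single JSON-formatted string, with `tokenizer` as a required top-level key and `type` required inside it, as described in the configuration requirements above.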
Default analyzer_params configurations
The default `analyzer_params` configurations for different tokenizers are as follows:
| Analyzer name | Default analyzer_params configuration | Tokenization example |
| --- | --- | --- |
| jieba (default analyzer) | | |
| whitespace | | |
| keyword | | |
| simple | | |
| standard | | |
| icu | | |
| ik | | |
| ngram | | |
| pinyin | | |
Filter configuration in analyzer_params
Hologres supports the following filters, also known as token filter properties, in `analyzer_params`.
If you configure multiple token filter properties, they are applied to each token in the specified order.
| Property name | Property description | Parameter format | Example |
| --- | --- | --- | --- |
| lowercase | Converts uppercase letters in a token to lowercase. | Declare `lowercase`. | |
| stop | Removes stop-word tokens. | | |
| stemmer | Converts a token to its corresponding root form (stem) based on the grammatical rules of the language. | | |
| length | Removes tokens that exceed a specified length. | | |
| removepunct | Removes tokens that consist only of punctuation characters. | Declare `removepunct`.<br>Note: In V4.0.8 and later, `removepunct` has a new `mode` parameter that defines the removal mode. | |
| pinyin | Pinyin token filter. | | The filter configuration properties are the same as for the pinyin tokenizer. |
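To show how a filter chain fits into `analyzer_params`, the following is an illustrative sketch of a configuration that applies `lowercase` first and then removes stop words. The table above does not specify the parameter format for the `stop` filter, so the object form with a `stop_words` key is an assumption borrowed from common tokenizer conventions; verify it against the documentation for your Hologres version.

```json
{
  "tokenizer": { "type": "standard", "max_token_length": 255 },
  "filter": [
    "lowercase",
    { "type": "stop", "stop_words": ["a", "an", "the"] }
  ]
}
```

Because filters are applied to each token in the order they appear in the array, placing `lowercase` before `stop` ensures that stop-word matching runs against lowercased tokens.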