Hologres V4.0 and later supports full-text inverted indexes. This feature is built on Tantivy, a high-performance full-text search engine, and provides powerful search capabilities. It also supports the BM25 similarity scoring algorithm and offers features such as document sorting, keyword searches, and phrase searches.
How it works
When you write source text to Hologres, a full-text inverted index file is built for each data file based on the index configuration. During this process, a tokenizer segments the text into tokens. The index then records the mapping between each token and the source text, including its position, term frequency, and other related information.
When you search the text, the search query is first tokenized into a set of query tokens. The BM25 algorithm then calculates a relevance score for each source document based on these tokens. This process enables high-performance, high-precision full-text searches.
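The two halves of this process map directly onto two functions described later in this topic: TOKENIZE, which shows how a tokenizer segments text, and TEXT_SEARCH, which computes the BM25 relevance score. The following is a minimal sketch; the table name `docs` and its `content` column are hypothetical placeholders.

```sql
-- Inspect how the default jieba tokenizer segments a query string.
SELECT TOKENIZE('shandong university');

-- Score every document in a hypothetical table against the same query.
-- TEXT_SEARCH returns the BM25 relevance score, so ordering by it
-- returns the most relevant documents first.
SELECT id, content, TEXT_SEARCH(content, 'shandong university') AS score
FROM docs
ORDER BY score DESC
LIMIT 10;
```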
Notes
Full-text inverted indexes are supported only for column-oriented tables and row-column hybrid tables in Hologres V4.0 and later. Row-oriented tables are not supported.
You can create full-text inverted indexes only on columns of the TEXT, CHAR, or VARCHAR type.
You can build a full-text inverted index on only a single column. Each column supports only one full-text inverted index. To build indexes on multiple columns, you must create a separate index for each column.
After you create a full-text inverted index, the index file is built asynchronously during the Compaction process. Before the index file is built, the BM25 relevance score for the data is 0.
Full-text searches can be performed only on columns that have a full-text index. Brute-force calculations on columns without an index are not supported.
You can use Serverless Computing resources for batch data imports. Serverless resources synchronously perform Compaction and full-text index building during the data import. For more information, see Use Serverless Computing to execute read and write tasks and Use Serverless Computing to execute Compaction tasks. If you do not use Serverless resources, you must manually run the following command to trigger Compaction after a batch data import or index modification.
VACUUM <schema_name>.<table_name>;

The BM25 search algorithm calculates relevance scores at the file level. If you import a small amount of data, you can manually trigger Compaction as needed to merge files and improve search accuracy.
You can use Serverless Computing resources to execute full-text search queries.
Manage indexes
Create an index
Syntax
CREATE INDEX [ IF NOT EXISTS ] <idx_name> ON <table_name>
USING FULLTEXT (<column_name>)
[ WITH ( <storage_parameter> = '<storage_value>' [ , ... ] ) ];

Parameters
Parameter | Description |
idx_name | The index name. |
table_name | The name of the table on which the index is created. |
column_name | The column on which the full-text inverted index is built. |
storage_parameter | The parameter to configure for the full-text inverted index. Two parameters are available: tokenizer (the tokenizer name) and analyzer_params (the tokenizer configuration, as a JSON string). Note: You can set only one tokenizer and one analyzer_params within the same index. |
storage_value | The value of the full-text inverted index parameter. |
Usage examples
Build a full-text inverted index with the default jieba tokenizer and its configuration.
CREATE INDEX idx1 ON tbl USING FULLTEXT (col1);

Explicitly specify the standard tokenizer and use its default configuration.
CREATE INDEX idx1 ON tbl USING FULLTEXT (col1) WITH (tokenizer = 'standard');
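You can also pass analyzer_params together with the tokenizer at creation time. The following sketch reuses the JSON format shown in the tokenizer examples later in this topic; the index and table names are placeholders.

```sql
-- Hypothetical example: jieba tokenizer in exact mode, which avoids
-- redundant token splits (see the analyzer_params configuration section).
CREATE INDEX idx2 ON tbl USING FULLTEXT (col1)
WITH (
  tokenizer = 'jieba',
  analyzer_params = '{"tokenizer": {"type": "jieba", "mode": "exact"}}'
);
```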
Modify an index
Syntax
-- Modify the index configuration
ALTER INDEX [ IF EXISTS ] <idx_name> SET ( <storage_parameter> = '<storage_value>' [ , ... ] );
-- Reset to the default configuration
ALTER INDEX [ IF EXISTS ] <idx_name> RESET ( <storage_parameter> [ , ... ] );

Parameters
For more information about the parameters, see Parameters.
Usage examples
A modified full-text inverted index is rebuilt asynchronously during the data Compaction process. To trigger Compaction synchronously after the modification, manually run the VACUUM <schema_name>.<table_name>; command. For more information, see Compaction.
Change the index tokenizer to standard.

ALTER INDEX idx1 SET (tokenizer = 'standard');

Reset to the default jieba tokenizer and its default analyzer_params configuration.

ALTER INDEX idx1 RESET (tokenizer);
ALTER INDEX idx1 RESET (tokenizer, analyzer_params);

Reset to the default analyzer_params configuration for the current tokenizer.

ALTER INDEX idx1 RESET (analyzer_params);
Delete an index
Syntax
DROP INDEX [ IF EXISTS ] <idx_name> [ RESTRICT ];

Parameters
For more information about the parameters, see Parameters.
View indexes
Hologres provides the hologres.hg_index_properties system table, which you can use to view created full-text inverted indexes and their locations.
SELECT * FROM hologres.hg_index_properties;

You can run the following SQL statement to view the table and column that correspond to an index.
SELECT
t.relname AS table_name,
a.attname AS column_name
FROM pg_class t
JOIN pg_index i ON t.oid = i.indrelid
JOIN pg_class idx ON i.indexrelid = idx.oid
JOIN pg_attribute a ON a.attrelid = t.oid AND a.attnum = ANY(i.indkey)
WHERE t.relnamespace = (SELECT oid FROM pg_namespace WHERE nspname = '<namespace>')
AND idx.relname = '<indexname>'
LIMIT 1;

Parameters:

namespace: The value of the table_namespace field in the result of the SELECT * FROM hologres.hg_index_properties; command.
indexname: The name of the index you created.
Use an index for full-text search
Hologres supports a variety of search modes, which allow you to flexibly perform full-text searches based on your business logic.
Search mode | Description |
Keyword match | Searches based on the keywords from the tokenized search object. You can define AND/OR relationships between keywords. |
Phrase search | Searches for a phrase from the search object. A match occurs only if the distance between multiple words meets the requirement. |
Natural language search | Lets you flexibly define complex query conditions to achieve your search goals, such as defining AND/OR relationships, required words, excluded words, and phrases. |
Term search | Performs an exact search for the search object. A match occurs only if the index contains the exact query string. |
TEXT_SEARCH search function
The TEXT_SEARCH function calculates the BM25 relevance score for a search source based on a search object.
Function syntax
TEXT_SEARCH (
<search_data> TEXT/VARCHAR/CHAR
,<search_expression> TEXT
[ ,<mode> TEXT DEFAULT 'match'
,<operator> TEXT DEFAULT 'OR'
,<tokenizer> TEXT DEFAULT ''
,<analyzer_params> TEXT DEFAULT ''
,<options> TEXT DEFAULT '']
)

Parameters
Parameter | Required | Description |
search_data | Yes | The search source. The data type can be TEXT, VARCHAR, or CHAR. Only column input parameters are supported. A full-text index must be created on the column. Otherwise, an error is reported. |
search_expression | Yes | The search object. The data type can be TEXT, VARCHAR, or CHAR. Only constants are supported. |
mode | No | The search mode. Default value: match. Valid values: match (keyword match), phrase (phrase search), natural_language (natural language search), and term (term search). |
operator | No | The logical operator between keywords. This parameter takes effect only when mode is set to match. Default value: OR. Valid values: OR and AND. |
tokenizer, analyzer_params | No | The tokenizer and its configuration used for the search_expression. You generally do not need to configure these parameters. |
options | No | Other parameters for full-text search, in the '<key>=<value>;' format (for example, 'slop=2;'). Currently, only the slop parameter is supported. It takes effect only when mode is set to phrase. slop can be 0 (default) or a positive integer. It defines the tolerable distance between words in a phrase. Note slop represents the maximum allowed interval (or transformation overhead) between the words that form a phrase. For tokenizers such as jieba, keyword, and icu, the unit of the interval is the number of characters, not the number of tokens. For tokenizers such as standard, simple, and whitespace, the unit of the interval is the number of tokens. |
Return value description
This function returns a non-negative FLOAT value, which represents the BM25 relevance score between the search source and the search object. The higher the relevance, the larger the score. The score is 0 when the texts are completely irrelevant.
Examples
Use the keyword match mode and change the operator to AND.
-- Specify the parameter name.
SELECT TEXT_SEARCH (content, 'machine learning', operator => 'AND') FROM tbl;
-- If you do not specify the parameter name, you must specify the parameters in order.
SELECT TEXT_SEARCH (content, 'machine learning', 'match', 'AND') FROM tbl;

Use the phrase search mode and set slop to 2.

SELECT TEXT_SEARCH (content, 'machine learning', 'phrase', options => 'slop=2;') FROM tbl;

Use the natural language search mode.

-- Define the token search logic using AND and OR operators.
SELECT TEXT_SEARCH (content, 'machine AND (system OR recognition)', 'natural_language') FROM tbl;
-- Define the token search logic using + (required word) and - (excluded word).
SELECT TEXT_SEARCH (content, '+learning -machine system', 'natural_language') FROM tbl;
TOKENIZE function
The TOKENIZE function outputs tokenization results based on the tokenizer configuration. You can use this function to debug the tokenization effect of a full-text inverted index.
Function syntax
TOKENIZE (
<search_data> TEXT
[ ,<tokenizer> TEXT DEFAULT ''
,<analyzer_params> TEXT DEFAULT '']
)

Parameters
search_data: Required. The target text for tokenization. Constant input parameters are supported.
tokenizer, analyzer_params: Optional. The tokenizer and its configuration used for the search_data target text. By default, the jieba tokenizer is used.
Return value description
This function returns a collection of tokens from the target text. The type is a TEXT array.
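For example, tokenizing one of the sample sentences used later in this topic with the default jieba tokenizer (whose default filter chain includes lowercasing and English stemming) returns an array of stemmed tokens:

```sql
SELECT TOKENIZE('Shandong has dozens of universities.', 'jieba');
-- Returns: {shandong,has,dozen,univers}
```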
Verify index usage
You can check the execution plan to verify that the SQL statement used a full-text inverted index. If Fulltext Filter appears in the plan, the index was used successfully. For more information about execution plans, see EXPLAIN and EXPLAIN ANALYZE.
Sample SQL:
EXPLAIN ANALYZE SELECT * FROM wiki_articles WHERE text_search(content, '长江') > 0;

The following execution plan is returned. It contains the Fulltext Filter field, which indicates that the SQL statement successfully used the full-text inverted index.
QUERY PLAN
Gather (cost=0.00..1.00 rows=1 width=12)
-> Local Gather (cost=0.00..1.00 rows=1 width=12)
-> Index Scan using Clustering_index on wiki_articles (cost=0.00..1.00 rows=1 width=12)
Fulltext Filter: (text_search(content, search_expression => '长江'::text, mode => match, operator => OR, tokenizer => jieba, analyzer_params => {"filter":["removepunct","lowercase",{"stop_words":["_english_"],"type":"stop"},{"language":"english","type":"stemmer"}],"tokenizer":{"hmm":true,"mode":"search","type":"jieba"}}, options => ) > '0'::double precision)
Query Queue: init_warehouse.default_queue
Optimizer: HQO version 4.0.0

Usage examples
Prepare data
You can run the following SQL statements to create a test table and write data to it.
-- Create a table.
CREATE TABLE wiki_articles (id int, content text);
-- Create an index.
CREATE INDEX ft_idx_1 ON wiki_articles
USING FULLTEXT (content)
WITH (tokenizer = 'jieba');
-- Write data.
INSERT INTO wiki_articles VALUES
(1, 'The Yangtze River is the largest river in China and the third longest river in the world, with a total length of about 6,300 kilometers.'),
(2, 'Li was born in 1962 in Wendeng County, Shandong.'),
(3, 'He graduated from the department of physics at Shandong University.'),
(4, 'Spring Festival, also known as the Lunar New Year, is the most important traditional festival in China.'),
(5, 'Spring Festival usually falls between late January and mid-February in the Gregorian calendar. The main customs during the festival include pasting spring couplets, setting off firecrackers, having a New Year''s Eve dinner, and making New Year visits.'),
(6, 'In 2006, Spring Festival was approved by the State Council as one of the first national intangible cultural heritage items.'),
(7, 'Shandong has dozens of universities.'),
(8, 'ShanDa is a famous university of Shandong.');
-- Compaction
VACUUM wiki_articles;
-- Query table data.
SELECT * FROM wiki_articles LIMIT 1;

The following is a sample result:
id | content
---+---------------------------------------------------
1 | The Yangtze River is the largest river in China and the third longest river in the world, with a total length of about 6,300 kilometers.

Different search examples
Keyword match.
-- (K1) Keyword match (the default operator is OR). Documents that contain 'shandong' or 'university' are matched.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university') > 0;
-- The following result is returned:
 id | content
----+---------------------------------------------------------------------
  2 | Li was born in 1962 in Wendeng County, Shandong.
  3 | He graduated from the department of physics at Shandong University.
  7 | Shandong has dozens of universities.
  8 | ShanDa is a famous university of Shandong.

-- (K2) Keyword match (the operator is AND). Documents must contain both 'shandong' and 'university' to be matched.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university', operator => 'AND') > 0;
-- The following result is returned:
 id | content
----+---------------------------------------------------------------------
  3 | He graduated from the department of physics at Shandong University.
  7 | Shandong has dozens of universities.
  8 | ShanDa is a famous university of Shandong.

Phrase search.
-- (P1) Phrase search (default slop = 0). A match occurs only if 'shandong' is immediately followed by 'university'.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university', mode => 'phrase') > 0;
-- Result
 id | content
----+---------------------------------------------------------------------
  3 | He graduated from the department of physics at Shandong University.

-- (P2) Phrase search with slop = 14. The distance between 'shandong' and 'university' cannot exceed 14 characters.
-- This also matches "Shandong has dozens of universities."
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university', mode => 'phrase', options => 'slop=14;') > 0;
-- Result
 id | content
----+---------------------------------------------------------------------
  3 | He graduated from the department of physics at Shandong University.
  7 | Shandong has dozens of universities.

-- (P3) Phrase search supports phrases that are not in order, but the slop is calculated differently and must be larger than for an in-order phrase.
-- For example, 'university of Shandong' can also match the following query, but it is not matched if slop is 22.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university', mode => 'phrase', options => 'slop=23;') > 0;
-- Result
 id | content
----+---------------------------------------------------------------------
  3 | He graduated from the department of physics at Shandong University.
  7 | Shandong has dozens of universities.
  8 | ShanDa is a famous university of Shandong.

-- (P4) Punctuation is ignored (for the jieba tokenizer).
-- This is true even if the text has a comma between '长河' and '全长', while the query string has a period.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, '长河。全长', mode => 'phrase') > 0;
-- Result
 id | content
----+---------------------------------------------------------------------------------------------
  1 | The Yangtze River is the largest river in China and the third longest river in the world, with a total length of about 6,300 kilometers.

Natural language query.
-- (N1) Natural language query: Without any symbols, this is equivalent to a keyword match by default. It is the same as (K1).
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university', 'natural_language') > 0;
 id | content
----+---------------------------------------------------------------------
  7 | Shandong has dozens of universities.
  2 | Li was born in 1962 in Wendeng County, Shandong.
  3 | He graduated from the department of physics at Shandong University.
  8 | ShanDa is a famous university of Shandong.

-- (N2) Natural language query: Keyword match. A match occurs if the document contains ('shandong' AND 'university') OR 'culture'. The AND operator has higher precedence than OR.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, '(shandong AND university) OR culture', 'natural_language') > 0;
-- Equivalent to
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong AND university OR culture', 'natural_language') > 0;
-- Equivalent to
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, '(+shandong +university) culture', 'natural_language') > 0;
-- Result
 id | content
----+---------------------------------------------------------------------
  8 | ShanDa is a famous university of Shandong.
  7 | Shandong has dozens of universities.
  3 | He graduated from the department of physics at Shandong University.
  6 | In 2006, Spring Festival was approved by the State Council as one of the first national intangible cultural heritage items.

-- (N3) Natural language query: Keyword match. Must contain 'shandong', must not contain 'university', and may contain 'culture'.
-- In this query, the keyword 'culture' is not preceded by a + or - symbol. It does not affect which rows are matched, but it does affect the match score. Rows that contain 'culture' have a higher score.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, '+shandong -university culture', 'natural_language') > 0;
 id | content
----+--------------------------------------------------
  2 | Li was born in 1962 in Wendeng County, Shandong.

-- Must contain 'shandong', must not contain 'physics', and may contain 'famous'. The relevance score is higher if it contains 'famous'.
-- Note: This query calculates scores on a single shard. The calculated BM25 scores may vary depending on the number of shards and file organization.
SELECT id, content, TEXT_SEARCH(content, '+shandong -physics famous', 'natural_language') AS score
FROM wiki_articles
WHERE TEXT_SEARCH(content, '+shandong -physics famous', 'natural_language') > 0
ORDER BY score DESC;
-- Result
 id | content                                          | score
----+--------------------------------------------------+----------
  8 | ShanDa is a famous university of Shandong.       | 2.92376
  7 | Shandong has dozens of universities.             | 0.863399
  2 | Li was born in 1962 in Wendeng County, Shandong. | 0.716338

-- (N4) Natural language query: Phrase search. Equivalent to (P1). The phrase must be enclosed in double quotation marks (""). If the phrase contains a double quotation mark, escape it with a backslash (\).
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, '"shandong university"', 'natural_language') > 0;
-- Result
 id | content
----+---------------------------------------------------------------------
  3 | He graduated from the department of physics at Shandong University.

-- (N5) Natural language query: Phrase search. Equivalent to (P2). Supports the ~ syntax to set slop.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, '"shandong university"~23', 'natural_language') > 0;
-- Result
 id | content
----+---------------------------------------------------------------------
  8 | ShanDa is a famous university of Shandong.
  7 | Shandong has dozens of universities.
  3 | He graduated from the department of physics at Shandong University.
-- (N6) Natural language query: Match all documents.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, '*', 'natural_language') > 0;
-- Result
 id | content
----+----------------------------------------------------------------------------------------------
  1 | The Yangtze River is the largest river in China and the third longest river in the world, with a total length of about 6,300 kilometers.
  2 | Li was born in 1962 in Wendeng County, Shandong.
  3 | He graduated from the department of physics at Shandong University.
  4 | Spring Festival, also known as the Lunar New Year, is the most important traditional festival in China.
  5 | Spring Festival usually falls between late January and mid-February in the Gregorian calendar. The main customs during the festival include pasting spring couplets, setting off firecrackers, having a New Year's Eve dinner, and making New Year visits.
  6 | In 2006, Spring Festival was approved by the State Council as one of the first national intangible cultural heritage items.
  7 | Shandong has dozens of universities.
  8 | ShanDa is a famous university of Shandong.
Complex query examples
Query in conjunction with a primary key (PK).
-- Search for text that contains 'shandong' or 'university' and has an id of 3.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university') > 0 AND id = 3;
-- Result
 id | content
----+---------------------------------------------------------------------
  3 | He graduated from the department of physics at Shandong University.

-- Search for text that contains 'shandong' or 'university', or has an id less than 2.
SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university') > 0 OR id < 2;
-- Result
 id | content
----+---------------------------------------------------------------------
  2 | Li was born in 1962 in Wendeng County, Shandong.
  8 | ShanDa is a famous university of Shandong.
  1 | The Yangtze River is the largest river in China and the third longest river in the world, with a total length of about 6,300 kilometers.
  3 | He graduated from the department of physics at Shandong University.
  7 | Shandong has dozens of universities.

Find the scores and retrieve the top three.
SELECT id, content, TEXT_SEARCH(content, 'shandong university') AS score, TOKENIZE(content, 'jieba')
FROM wiki_articles
ORDER BY score DESC
LIMIT 3;
-- Result
 id | content                                                             | score   | tokenize
----+---------------------------------------------------------------------+---------+--------------------------------------------------
  8 | ShanDa is a famous university of Shandong.                          | 2.74634 | {shanda,famous,univers,shandong}
  7 | Shandong has dozens of universities.                                | 2.74634 | {shandong,has,dozen,univers}
  3 | He graduated from the department of physics at Shandong University. | 2.38178 | {he,graduat,from,depart,physic,shandong,univers}

Use the TEXT_SEARCH function in both the output and WHERE clauses.
SELECT id, content, TEXT_SEARCH(content, 'shandong university') AS score, TOKENIZE(content, 'jieba')
FROM wiki_articles
WHERE TEXT_SEARCH(content, 'shandong university') > 0
ORDER BY score DESC;
-- Result
 id | content                                                             | score   | tokenize
----+---------------------------------------------------------------------+---------+--------------------------------------------------
  7 | Shandong has dozens of universities.                                | 2.74634 | {shandong,has,dozen,univers}
  8 | ShanDa is a famous university of Shandong.                          | 2.74634 | {shanda,famous,univers,shandong}
  3 | He graduated from the department of physics at Shandong University. | 2.38178 | {he,graduat,from,depart,physic,shandong,univers}
  2 | Li was born in 1962 in Wendeng County, Shandong.                    | 1.09244 | {li,born,1962,wendeng,counti,shandong}

Search for documents from the wiki source that are most relevant to 'shandong university'.
-- Source table for the JOIN.
CREATE TABLE article_source (id int primary key, source text);
INSERT INTO article_source VALUES
(1, 'baike'), (2, 'wiki'), (3, 'wiki'), (4, 'baike'), (5, 'baike'), (6, 'baike'),
(7, 'wiki'), (8, 'paper'), (9, 'http_log'), (10, 'http_log'), (11, 'http_log');

SELECT a.id, source, content, TEXT_SEARCH(content, 'shandong university') AS score, TOKENIZE(a.content, 'jieba')
FROM wiki_articles a
JOIN article_source b ON (a.id = b.id)
WHERE TEXT_SEARCH(a.content, 'shandong university') > 0 AND b.source = 'wiki'
ORDER BY score DESC;
-- Result
 id | source | content                                                             | score   | tokenize
----+--------+---------------------------------------------------------------------+---------+--------------------------------------------------
  7 | wiki   | Shandong has dozens of universities.                                | 2.74634 | {shandong,has,dozen,univers}
  3 | wiki   | He graduated from the department of physics at Shandong University. | 2.38178 | {he,graduat,from,depart,physic,shandong,univers}
  2 | wiki   | Li was born in 1962 in Wendeng County, Shandong.                    | 1.09244 | {li,born,1962,wendeng,counti,shandong}
Different tokenizer examples
Use the default jieba tokenizer. The default is search mode, which creates more token variations to improve search results.
SELECT TOKENIZE('他来到北京清华大学', 'jieba');
-- Result
              tokenize
--------------------------------------
 {他,来到,北京,清华,华大,大学,清华大学}

Use a custom jieba tokenizer in exact mode, which does not create extra token variations.
SELECT TOKENIZE('他来到北京清华大学', 'jieba', '{"tokenizer": {"type": "jieba", "mode": "exact"}}');
-- Result
       tokenize
-----------------------
 {他,来到,北京,清华大学}

Tokenizer comparison.
SELECT TOKENIZE('他来到北京清华大学', 'jieba') AS jieba,
       TOKENIZE('他来到北京清华大学', 'keyword') AS keyword,
       TOKENIZE('他来到北京清华大学', 'whitespace') AS whitespace,
       TOKENIZE('他来到北京清华大学', 'simple') AS simple,
       TOKENIZE('他来到北京清华大学', 'standard') AS standard,
       TOKENIZE('他来到北京清华大学', 'icu') AS icu;
-- Result
-[ RECORD 1 ]--------------------------------------
jieba      | {他,来到,北京,清华,华大,大学,清华大学}
keyword    | {他来到北京清华大学}
whitespace | {他来到北京清华大学}
simple     | {他来到北京清华大学}
standard   | {他,来,到,北,京,清,华,大,学}
icu        | {他,来到,北京,清华大学}

Tokenization effect comparison for http_logs.
SELECT TOKENIZE('211.11.9.0 - - [1998-06-21T15:00:01-05:00] \"GET /english/index.html HTTP/1.0\" 304 0', 'jieba') AS jieba,
       TOKENIZE('211.11.9.0 - - [1998-06-21T15:00:01-05:00] \"GET /english/index.html HTTP/1.0\" 304 0', 'keyword') AS keyword,
       TOKENIZE('211.11.9.0 - - [1998-06-21T15:00:01-05:00] \"GET /english/index.html HTTP/1.0\" 304 0', 'whitespace') AS whitespace,
       TOKENIZE('211.11.9.0 - - [1998-06-21T15:00:01-05:00] \"GET /english/index.html HTTP/1.0\" 304 0', 'simple') AS simple,
       TOKENIZE('211.11.9.0 - - [1998-06-21T15:00:01-05:00] \"GET /english/index.html HTTP/1.0\" 304 0', 'standard') AS standard,
       TOKENIZE('211.11.9.0 - - [1998-06-21T15:00:01-05:00] \"GET /english/index.html HTTP/1.0\" 304 0', 'icu') AS icu;
-- Result
-[ RECORD 1 ]-----------------------------------------------------------------------------------------------
jieba      | {211.11,9.0,1998-06,21t15,00,01-05,00,get,english,index,html,http,1.0,304,0}
keyword    | {"211.11.9.0 - - [1998-06-21T15:00:01-05:00] \\\"GET /english/index.html HTTP/1.0\\\" 304 0"}
whitespace | {211.11.9.0,-,-,[1998-06-21T15:00:01-05:00],"\\\"GET",/english/index.html,"HTTP/1.0\\\"",304,0}
simple     | {211,11,9,0,1998,06,21t15,00,01,05,00,get,english,index,html,http,1,0,304,0}
standard   | {211.11.9.0,1998,06,21t15,00,01,05,00,get,english,index.html,http,1.0,304,0}
icu        | {211.11.9.0,1998,06,21t15,00,01,05,00,get,english,index.html,http,1.0,304,0}
Advanced operations: Customize tokenizer configurations
Hologres recommends using the default tokenizer configurations. However, in some real-world scenarios, the default configurations may not meet your business needs. You can customize the tokenizer configuration for more flexible tokenization.
analyzer_params configuration requirements
The configuration requirements for the analyzer_params parameter are as follows:
Only JSON strings are supported.
The top level of the JSON supports two keys, tokenizer and filter, which have the following values:

tokenizer: Required. A JSON object that configures tokenizer properties. The object supports the following keys:

type: Required. The tokenizer name.

mode: Optional. The tokenization mode. This key is supported only by the jieba tokenizer. Valid values:
search (default): Creates multiple token combinations, which allows for redundancy. For example, tokenizing "传统节日" (traditional festival) produces three tokens: "传统" (traditional), "节日" (festival), and "传统节日" (traditional festival).
exact: Prevents redundant splitting during tokenization. For example, tokenizing "传统节日" (traditional festival) produces only one token: "传统节日" (traditional festival).

hmm: Optional. Specifies whether to use a Hidden Markov Model to identify words that are not in the dictionary and improve new word detection. This key is supported only by the jieba tokenizer. Valid values:
true (default): Use the model.
false: Do not use the model.

filter: Optional. A JSON array that configures token filter properties. If you configure multiple token filter properties, they are applied to each token in the specified order.
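For reference, the execution plan shown earlier in this topic prints the default jieba analyzer_params, which combines both keys. A minimal sketch that sets these defaults explicitly (the index name is a placeholder):

```sql
-- Set the documented jieba defaults explicitly: search mode with HMM,
-- then a filter chain that strips punctuation-only tokens, lowercases,
-- removes English stop words, and applies an English stemmer.
ALTER INDEX idx1 SET (analyzer_params = '{
  "tokenizer": {"type": "jieba", "mode": "search", "hmm": true},
  "filter": [
    "removepunct",
    "lowercase",
    {"type": "stop", "stop_words": ["_english_"]},
    {"type": "stemmer", "language": "english"}
  ]
}');
```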
Default analyzer_params configurations
The default analyzer_params configurations for different tokenizers are as follows:
Tokenizer name | Default analyzer_params configuration | Tokenization example |
jieba (default tokenizer) | {"tokenizer":{"type":"jieba","mode":"search","hmm":true},"filter":["removepunct","lowercase",{"type":"stop","stop_words":["_english_"]},{"type":"stemmer","language":"english"}]} | |
whitespace | | |
keyword | | |
simple | | |
standard | | |
icu | | Tokens before filters: ["Chinese", "english", "Chinese.", "english.", "124", "124!=8", ".", ",", ",,", " ..."]; after the removepunct filter: ["Chinese", "english", "Chinese.", "english.", "124", "124!=8"] |
Filter configurations in analyzer_params
Hologres supports the following filters (token filter properties) in analyzer_params.
If you configure multiple token filter properties, they are applied to each token in the specified order.
Property name | Property description | Parameter format | Usage example |
lowercase | Converts uppercase letters in a token to lowercase. | Declare the string "lowercase" in the filter array. | "filter": ["lowercase"] |
stop | Removes stop word tokens. | A JSON object with type "stop" and a stop_words list. | "filter": [{"type": "stop", "stop_words": ["_english_"]}] |
stemmer | Converts a token to its corresponding root form (stem) based on the grammatical rules of the language. | A JSON object with type "stemmer" and a language. | "filter": [{"type": "stemmer", "language": "english"}] |
length | Removes tokens that exceed a specified length. | | |
removepunct | Removes tokens that consist only of punctuation characters. | Declare the string "removepunct" in the filter array. | "filter": ["removepunct"] |
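Putting the pieces together, the following sketch builds an index with a custom filter chain. It assumes the same analyzer_params JSON format as the TOKENIZE examples earlier in this topic; the index and table names are placeholders.

```sql
-- Hypothetical example: a standard tokenizer whose tokens are stripped of
-- punctuation-only entries, lowercased, and stemmed for English.
CREATE INDEX idx_custom ON tbl USING FULLTEXT (col1)
WITH (
  tokenizer = 'standard',
  analyzer_params = '{
    "tokenizer": {"type": "standard"},
    "filter": ["removepunct", "lowercase", {"type": "stemmer", "language": "english"}]
  }'
);
```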