
Hologres: Full-text inverted index

Last Updated: Feb 13, 2026

Hologres V4.0 and later support full-text inverted indexes. This feature is built on Tantivy, a high-performance full-text search engine. It provides high-performance retrieval and supports the BM25 similarity scoring algorithm for relevance-based document ranking, keyword search, and phrase search.

How it works

When you write source text to Hologres, it builds a full-text inverted index file for each data file based on your index configuration. First, a tokenizer breaks the text into tokens. The index then records the mapping between each token and the source text. It also records information such as position and term frequency.

To search the text, the search query is first tokenized into a set of target tokens. The BM25 algorithm then calculates a relevance score for each source text against the target tokens. This process enables fast and accurate full-text search.
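Both stages can be observed directly in SQL using the TOKENIZE and TEXT_SEARCH functions described later in this topic. The following sketch uses hypothetical table and column names (`docs`, `body`); the column is assumed to already have a full-text inverted index.

```sql
-- Stage 1: inspect how the default Jieba tokenizer splits a query string.
SELECT TOKENIZE('machine learning systems');

-- Stage 2: score each document against the tokenized query and rank by relevance.
-- `docs` and `body` are placeholder names; `body` must carry a full-text inverted index.
SELECT body,
       TEXT_SEARCH(body, 'machine learning systems') AS bm25_score
  FROM docs
 ORDER BY bm25_score DESC;
```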

Notes

  • Full-text inverted indexes are supported only for column-oriented tables and row-column hybrid tables in Hologres V4.0 and later. Row-oriented tables are not supported.

  • You can create full-text inverted indexes only on columns of the TEXT, CHAR, or VARCHAR data type.

  • Each full-text inverted index covers exactly one column, and each column can have at most one full-text inverted index. To index multiple columns, you must create a separate index for each one.

  • After you create a full-text inverted index, the index files for existing data and newly imported data are built asynchronously during the data compaction process. Until the index file is built, the BM25 relevance score for the data is 0.

  • For real-time data writes after you create a full-text inverted index: In versions earlier than Hologres V4.0.8, the index was built synchronously with real-time data writes. In V4.0.8 and later, the system asynchronously refreshes the in-memory real-time index every second. This ensures efficient data writes and index building. Data can be queried using the index only after the index is refreshed.

  • You can run full-text searches only on columns that have a full-text index. Brute-force searches on unindexed columns are not supported.

  • You can use Serverless Computing resources for batch data imports. Serverless resources complete compaction and full-text index building synchronously during the data import. For more information, see Use Serverless Computing for read and write tasks and Use Serverless Computing for compaction tasks. If you do not use Serverless resources, you must manually run the following command to trigger Compaction after you import data in batches or modify an index.

    VACUUM <schema_name>.<table_name>;
  • The BM25 search algorithm calculates relevance scores at the file level. If you import a small amount of data, you can manually trigger Compaction to merge files and improve search accuracy.

  • You can use Serverless Computing resources to run full-text search queries.

  • You can choose a tokenizer based on the scenarios in the following table:

    | Scenario | Tokenizer | Notes |
    | --- | --- | --- |
    | Keyword extraction from long articles | Jieba | Supports new word discovery and complex pattern switching. |
    | Chinese descriptive text search | IK | Accurately identifies Chinese terms. |
    | English title text search | Simple, Whitespace, Standard | Simple and efficient. Choose based on the target English text. |
    | Fuzzy search for log text | Ngram | No dictionary required. Meets the needs of fuzzy text queries. |
    | Pinyin search for Chinese product or person names | Pinyin | Supports various Chinese Pinyin scenarios, such as full Pinyin, first letter, and polyphone derivation. |
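As a sketch of the fuzzy-search row in the table above, a log table could use the Ngram tokenizer to accelerate `like`/`ilike`-style fuzzy matching. This requires Hologres V4.0.9 or later; the table and column names here are hypothetical.

```sql
-- Hypothetical log table.
CREATE TABLE app_logs (ts timestamptz, message text);

-- Ngram tokenizer: no dictionary required; suits fuzzy-matching queries on log text.
CREATE INDEX ft_idx_logs ON app_logs
       USING FULLTEXT (message)
       WITH (tokenizer = 'ngram');
```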

Manage indexes

Create an index

Syntax

CREATE INDEX [ IF NOT EXISTS ] idx_name ON table_name
       USING FULLTEXT (column_name [ , ... ])
       [ WITH ( storage_parameter [ = value ] [ , ... ] ) ];

Parameters

| Parameter | Description |
| --- | --- |
| idx_name | The index name. |
| table_name | The target table name. |
| column_name | The name of the column on which to build the full-text inverted index. |
| storage_parameter | The parameters for the full-text inverted index. Two types of parameters are supported, described below. |

  • tokenizer: The tokenizer name. The following tokenizers are supported:

    • jieba (default): A Chinese tokenizer that combines rule-based matching and statistical models.

    • whitespace: A tokenizer that splits text by spaces.

    • standard: A standard tokenizer based on Unicode Standard Annex #29.

    • simple: A simple tokenizer that splits text by spaces and punctuation.

    • keyword: A keyword tokenizer that performs no operations and outputs the original text.

    • icu: A tokenizer for multilingual text processing.

    • ik: A Chinese tokenizer based on the IK Analyzer. It can automatically identify special formats such as English words, email addresses, URLs without `://`, and IP addresses. This tokenizer is supported in Hologres V4.0.9 and later.

    • ngram: A character-based sliding window tokenizer. It splits text into consecutive n characters or words, called n-grams, to improve the recall rate and fuzzy matching capabilities of the search system. This tokenizer is suitable for accelerating `like` and `ilike` fuzzy matching scenarios. It is supported in Hologres V4.0.9 and later.

    • pinyin: A Pinyin tokenizer. It generates Pinyin for Chinese characters and words in the text and supports Pinyin splitting and derivation for non-Chinese strings. This helps you achieve more accurate Pinyin generation and a higher search recall rate. This tokenizer is supported in Hologres V4.0.9 and later.

  • analyzer_params: The tokenizer configuration. Only JSON-formatted strings are supported.

    • Each tokenizer has a default `analyzer_params` configuration. In most cases, use the default configuration. This means you only need to specify the `tokenizer` parameter and do not need to explicitly configure the `analyzer_params` parameter.

    • You can customize some of the configurations. For more information, see Advanced operations: Customize tokenizer configurations.

Note

The same index can have only one `tokenizer` setting and one `analyzer_params` setting.

Examples

  • Create a full-text inverted index that uses the default tokenizer and configuration. The default tokenizer is the Jieba tokenizer.

    CREATE INDEX idx1 ON tbl 
           USING FULLTEXT (col1);
  • Explicitly specify the IK tokenizer and use its default configuration.

    CREATE INDEX idx1 ON tbl 
           USING FULLTEXT (col1)
           WITH (tokenizer = 'ik');
  • Explicitly specify a custom tokenizer configuration that uses the Jieba tokenizer in `exact` mode and only the `lowercase` filter to convert text to lowercase.

    CREATE INDEX idx1 ON tbl 
           USING FULLTEXT (col1)
           WITH (tokenizer = 'jieba',
                 analyzer_params = '{"tokenizer":{"type":"jieba","mode":"exact"}, "filter":["lowercase"]}');
Note
  • After you create a full-text inverted index, the index files are built during the Compaction process after data is imported.

  • You can use Serverless Computing resources for batch data imports. Serverless resources complete compaction and full-text index building synchronously during the data import. For more information, see Use Serverless Computing for read and write tasks and Use Serverless Computing for compaction tasks.

  • If you do not use Serverless resources, you must manually run the following command to trigger Compaction after you import data in batches or modify an index. For more information, see Compaction (Beta).

    VACUUM <schema_name>.<table_name>;

Modify an index

Syntax

-- Modify the index configuration
ALTER INDEX [ IF EXISTS ] <idx_name> SET ( <storage_parameter> = '<storage_value>' [ , ... ] );

-- Reset to the default configuration
ALTER INDEX [ IF EXISTS ] <idx_name> RESET ( <storage_parameter> [ , ... ] );

Parameters

For more information about the parameters, see Parameters.

Examples

Note

After you modify a full-text inverted index, the index files are rebuilt asynchronously during data compaction. To rebuild them promptly, manually run the `VACUUM <schema_name>.<table_name>;` command after the modification to trigger Compaction synchronously. For more information, see Compaction.

  • Change the index tokenizer to `standard`.

    ALTER INDEX idx1 SET (tokenizer = 'standard');
  • Change the index tokenizer to `ik` with the `ik_max_word` mode, without converting text to lowercase or adding filters.

    ALTER INDEX idx1 SET (
      tokenizer = 'ik',
      analyzer_params = '{"tokenizer":{"type":"ik","mode":"ik_max_word","enable_lowercase": false}}'
    );
  • Reset the index to use the default Jieba tokenizer and its default `analyzer_params` configuration.

    ALTER INDEX idx1 RESET (tokenizer);
    ALTER INDEX idx1 RESET (tokenizer, analyzer_params);
  • Reset the `analyzer_params` configuration to the default for the current tokenizer.

    ALTER INDEX idx1 RESET (analyzer_params);

Delete an index

Syntax

DROP INDEX [ IF EXISTS ] <idx_name> [ RESTRICT ];

Parameters

For more information about the parameters, see Parameters.

View indexes

Hologres provides the hologres.hg_index_properties system table to view created full-text inverted indexes and their locations.

SELECT * FROM hologres.hg_index_properties;

You can run the following SQL statement to view the table and column that correspond to an index.

SELECT 
    t.relname AS table_name, 
    a.attname AS column_name
FROM pg_class t
    JOIN pg_index i ON t.oid = i.indrelid
    JOIN pg_class idx ON i.indexrelid = idx.oid
    JOIN pg_attribute a ON a.attrelid = t.oid AND a.attnum = ANY(i.indkey)
WHERE t.relnamespace = (SELECT oid FROM pg_namespace WHERE nspname = '<namespace>')
    AND idx.relname = '<indexname>'
LIMIT 1;

Parameters:

  • namespace: The value of the table_namespace field in the result returned by the SELECT * FROM hologres.hg_index_properties; command.

  • indexname: The name of the index.
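For example, assuming the index ft_idx_1 on the wiki_articles table from the usage examples later in this topic was created in the public schema (the schema name is an assumption here):

```sql
SELECT 
    t.relname AS table_name, 
    a.attname AS column_name
FROM pg_class t
    JOIN pg_index i ON t.oid = i.indrelid
    JOIN pg_class idx ON i.indexrelid = idx.oid
    JOIN pg_attribute a ON a.attrelid = t.oid AND a.attnum = ANY(i.indkey)
WHERE t.relnamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'public')
    AND idx.relname = 'ft_idx_1'
LIMIT 1;
```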

Use indexes for full-text search

Hologres supports various search modes that allow you to perform full-text searches flexibly as needed.

| Search mode | Description |
| --- | --- |
| Keyword match | Searches by the keywords from the tokenized search query. Supports defining AND/OR relationships between keywords. |
| Phrase search | Searches by phrases from the search query. A match requires that the distance between the words does not exceed the specified limit. |
| Natural language search | Lets you define complex query conditions to flexibly achieve search goals, such as AND/OR relationships, required words, excluded words, and phrases. |
| Term search | Performs an exact search for the search query. A match requires that the index contains the exact query string. |

TEXT_SEARCH search function

The `TEXT_SEARCH` function calculates the BM25 relevance score for a source text based on a search query.

Function syntax

TEXT_SEARCH (
  <search_data> TEXT/VARCHAR/CHAR
  ,<search_expression> TEXT
  [ ,<mode> TEXT DEFAULT 'match'
  ,<operator> TEXT DEFAULT 'OR'
  ,<tokenizer> TEXT DEFAULT ''
  ,<analyzer_params> TEXT DEFAULT ''
  ,<options> TEXT DEFAULT '']
)

Parameters

  • search_data (required): The search source. Supports the TEXT, VARCHAR, and CHAR data types. Only column input is supported, and the column must have a full-text inverted index; otherwise, an error is reported.

  • search_expression (required): The search query. Supports the TEXT, VARCHAR, and CHAR data types. Only constants are supported.

  • mode (optional): The search mode. The following modes are supported:

    • match (default): Keyword match. Each token is a keyword. The relationship between multiple keywords is set by the operator parameter, which defaults to OR.

    • phrase: Phrase search. The distance between words in a phrase is configured by specifying slop in the options parameter. The default value is 0, which means the words in the phrase must be adjacent.

    • natural_language: Natural language search. Use natural language to express complex query conditions, such as AND/OR keywords, required words, excluded words, and phrases. For more information, see Tantivy.

    • term: Term search. No tokenization or other processing is performed on search_expression. The function performs an exact match in the index.

  • operator (optional): The logical operator between keywords. This parameter takes effect only when mode is set to `match`. The following values are supported:

    • OR (default): If the search query has multiple tokens, a match is returned if any token matches.

    • AND: If the search query has multiple tokens, a match is returned only if all tokens match.

  • tokenizer, analyzer_params (optional): The tokenizer and configuration used for the search query search_expression. You usually do not need to configure these parameters.

    • If not specified, the same tokenizer and configuration as the full-text inverted index on the search_data source column are used by default. If the source is a constant, the default Jieba tokenizer is used.

    • If specified, the search query search_expression is tokenized using the specified tokenizer and configuration.

  • options (optional): Other parameters for the full-text index, specified in the format 'key1=v1;key2=v2;...;keyN=vN;'. Currently, only the slop parameter is supported, and it takes effect only when mode is set to phrase. You can set slop to 0 (default) or a positive integer, which defines the maximum distance allowed between terms in a phrase.

Note

The slop value specifies the maximum allowed gap—or transformation overhead—between terms in a phrase. For tokenizers such as jieba, keyword, and icu, the gap is measured in characters, not tokens. For tokenizers such as standard, simple, and whitespace, the gap is measured in tokens.

Return value

This function returns a non-negative FLOAT value that represents the BM25 relevance score between the source text and the search query. A higher score indicates a higher relevance. A score of 0 indicates that the texts are completely irrelevant.

Examples

  • Use the keyword match mode and change the operator to AND.

    -- Specify the parameter name.
    SELECT TEXT_SEARCH (content, 'machine learning', operator => 'AND') FROM tbl;
    
    -- Do not specify the parameter name. You must specify the parameters in order.
    SELECT TEXT_SEARCH (content, 'machine learning', 'match', 'AND') FROM tbl;
  • Use the phrase search mode and set slop to 2.

    SELECT TEXT_SEARCH (content, 'machine learning', 'phrase', options => 'slop=2;') FROM tbl;
  • Use the natural language search mode.

    -- Define the token search logic using the AND and OR operators.
    SELECT TEXT_SEARCH (content, 'machine AND (system OR recognition)', 'natural_language') FROM tbl;
    
     -- Define the token search logic using + (must include) and - (must exclude).
    SELECT TEXT_SEARCH (content, '+learning -machine system', 'natural_language') FROM tbl;
  • Use the term search mode.

    SELECT TEXT_SEARCH (content, 'machine learning', 'term') FROM tbl;

TOKENIZE function

The `TOKENIZE` function outputs tokenization results based on the tokenizer configuration. You can use it to debug the tokenization effect of a full-text inverted index.

Function syntax

TOKENIZE (
  <search_data> TEXT
  [ ,<tokenizer> TEXT DEFAULT ''
  ,<analyzer_params> TEXT DEFAULT '']
)

Parameters

  • search_data: Required. The target text for tokenization. Only constant input is supported.

  • tokenizer, analyzer_params: Optional. The tokenizer and configuration used for the target text search_data. The default is the Jieba tokenizer.

Return value

This function returns a TEXT array that contains the collection of tokens from the target text.
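For example, you can compare how different tokenizers would split the same string before choosing one for an index. The exact token output depends on each tokenizer's default filters (such as stemming and lowercasing), so it is not reproduced here.

```sql
-- Default Jieba tokenizer and configuration.
SELECT TOKENIZE('Shandong University is in China');

-- Whitespace tokenizer: splits on spaces only.
SELECT TOKENIZE('Shandong University is in China', 'whitespace');
```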

Verify index usage

You can use an execution plan to determine whether an SQL statement uses a full-text inverted index. If the execution plan includes Fulltext Filter, the full-text inverted index is used. For more information about execution plans, see EXPLAIN and EXPLAIN ANALYZE.

Example SQL statement:

EXPLAIN ANALYZE SELECT * FROM wiki_articles WHERE text_search(content, 'Yangtze River') > 0;

The following execution plan contains the Fulltext Filter field. This indicates that the SQL statement uses a full-text inverted index.

QUERY PLAN
Gather  (cost=0.00..1.00 rows=1 width=12)
  ->  Local Gather  (cost=0.00..1.00 rows=1 width=12)
        ->  Index Scan using Clustering_index on wiki_articles  (cost=0.00..1.00 rows=1 width=12)
              Fulltext Filter: (text_search(content, search_expression => 'Yangtze River'::text, mode => match, operator => OR, tokenizer => jieba, analyzer_params => {"filter":["removepunct","lowercase",{"stop_words":["_english_"],"type":"stop"},{"language":"english","type":"stemmer"}],"tokenizer":{"hmm":true,"mode":"search","type":"jieba"}}, options => ) > '0'::double precision)
Query Queue: init_warehouse.default_queue
Optimizer: HQO version 4.0.0

Usage examples

Data preparation

Run the following SQL statements to create a test table and insert data into it.

-- Create a table.
CREATE TABLE wiki_articles (id int, content text);

-- Create an index.
CREATE INDEX ft_idx_1 ON wiki_articles
       USING FULLTEXT (content)
       WITH (tokenizer = 'jieba');

-- Write data.
INSERT INTO wiki_articles VALUES
  (1, 'The Yangtze River is the longest river in China and the third longest in the world, with a total length of about 6,300 kilometers.'),
  (2, 'Li was born in 1962 in Wendeng County, Shandong.'),
  (3, 'He graduated from the department of physics at Shandong University.'),
  (4, 'The Spring Festival, also known as the Lunar New Year, is the most important traditional festival in China.'),
  (5, 'The Spring Festival usually falls between late January and mid-February in the Gregorian calendar. Main customs during the Spring Festival include pasting spring couplets, setting off firecrackers, having a New Year''s Eve dinner, and making New Year visits.'),
  (6, 'In 2006, the Spring Festival was approved by the State Council as one of the first national intangible cultural heritages.'),
  (7, 'Shandong has dozens of universities.'),
  (8, 'ShanDa is a famous university of Shandong.');

-- Compaction
VACUUM wiki_articles;

-- Query table data.
SELECT * FROM wiki_articles LIMIT 1;

The following result is returned:

id |                       content                       
---+---------------------------------------------------
 1 | The Yangtze River is the longest river in China and the third longest in the world, with a total length of about 6,300 kilometers.

Different search examples

  • Keyword match.

    -- (K1) Keyword match (default operator=OR). Documents containing 'shandong' or 'university' are matched.
    SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university') > 0;
    
    -- Result
     id |                               content                               
    ----+---------------------------------------------------------------------
      2 | Li was born in 1962 in Wendeng County, Shandong.
      3 | He graduated from the department of physics at Shandong University.
      7 | Shandong has dozens of universities.
      8 | ShanDa is a famous university of Shandong.
    
    -- (K2) Keyword match (operator=AND). Documents must contain both 'shandong' and 'university' to be matched.
    SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university', operator => 'AND') > 0;
    
    -- Result
     id |                               content                               
    ----+---------------------------------------------------------------------
      3 | He graduated from the department of physics at Shandong University.
      7 | Shandong has dozens of universities.
      8 | ShanDa is a famous university of Shandong.
  • Phrase search.

    -- (P1) Phrase search (default slop = 0). A match occurs only if 'shandong' is immediately followed by 'university'.
    SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university', mode => 'phrase') > 0;
    
    -- Result
     id |                               content                               
    ----+---------------------------------------------------------------------
      3 | He graduated from the department of physics at Shandong University.
    (1 row)
    
    -- (P2) Phrase search with slop = 14. The distance between 'shandong' and 'university' cannot exceed 14 characters. This matches "Shandong has dozens of universities."
    SELECT * FROM wiki_articles
            WHERE TEXT_SEARCH(content, 'shandong university', mode => 'phrase', options => 'slop=14;') > 0;
            
    -- Result
     id |                               content                               
    ----+---------------------------------------------------------------------
      3 | He graduated from the department of physics at Shandong University.
      7 | Shandong has dozens of universities.
    (2 rows)
    
    -- (P3) Phrase search also matches unordered phrases, but the slop cost is calculated differently
    --      and requires a larger value than for ordered phrases.
    --      Therefore, 'university of Shandong' matches the following query with slop=23 but not with slop=22.
    SELECT * FROM wiki_articles
            WHERE TEXT_SEARCH(content, 'shandong university', mode => 'phrase', options => 'slop=23;') > 0;
            
    -- Result   
     id |                               content                               
    ----+---------------------------------------------------------------------
      3 | He graduated from the department of physics at Shandong University.
      7 | Shandong has dozens of universities.
      8 | ShanDa is a famous university of Shandong.
    (3 rows)
    
    -- (P4) Behavior with a standard tokenizer index (ALTER INDEX ft_idx_1 SET (tokenizer = 'standard');).
    --      For the standard tokenizer, slop is calculated in tokens: as long as zero tokens lie between
    --      the two words, the text is a phrase match regardless of the number of spaces.
    SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university', mode => 'phrase') > 0;
    
    -- Result
     id |        content         
    ----+------------------------
      1 | shandong university
      2 | shandong  university
      3 | shandong   university
      4 | shandong    university
    (4 rows)
    
    -- (P5) Punctuation is ignored (example with the Jieba tokenizer).
    --      The punctuation in the source text and the period in the query string 'river. with'
    --      are removed by the default filters, so the phrase can still match.
    SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'river. with', mode => 'phrase') > 0;
    
    -- Result
     id |                       content                       
    ----+-----------------------------------------------------
      1 | The Yangtze River is the longest river in China and the third longest in the world, with a total length of about 6,300 kilometers.
    (1 row)
  • Natural language search.

    -- (N1) Natural language query: Without any symbols, this is equivalent to a keyword match. Same as (K1).
    SELECT * FROM wiki_articles
            WHERE TEXT_SEARCH(content, 'shandong university', 'natural_language') > 0;
     id |                               content                               
    ----+---------------------------------------------------------------------
      7 | Shandong has dozens of universities.
      2 | Li was born in 1962 in Wendeng County, Shandong.
      3 | He graduated from the department of physics at Shandong University.
      8 | ShanDa is a famous university of Shandong.
    
    -- (N2) Natural language query: Keyword match. Must contain ('shandong' AND 'university') OR 'heritage'. The AND operator has higher precedence than OR.
    SELECT * FROM wiki_articles
            WHERE TEXT_SEARCH(content, '(shandong AND university) OR heritage', 'natural_language') > 0;
    -- Equivalent to
    SELECT * FROM wiki_articles
            WHERE TEXT_SEARCH(content, 'shandong AND university OR heritage', 'natural_language') > 0;
    -- Equivalent to
    SELECT * FROM wiki_articles
            WHERE TEXT_SEARCH(content, '(+shandong +university) heritage', 'natural_language') > 0;
    
    -- Result
     id |                               content                               
    ----+---------------------------------------------------------------------
      8 | ShanDa is a famous university of Shandong.
      7 | Shandong has dozens of universities.
      3 | He graduated from the department of physics at Shandong University.
      6 | In 2006, the Spring Festival was approved by the State Council as one of the first national intangible cultural heritages.
    
    -- (N3) Natural language query: Keyword match. Must contain 'shandong', must not contain 'university', and may contain 'heritage'.
    --      In this query, the 'heritage' keyword does not have a + or - sign, so it does not affect which rows are matched. However, it does affect the match score. Rows with 'heritage' will have a higher score.
    SELECT * FROM wiki_articles
            WHERE TEXT_SEARCH(content, '+shandong -university heritage', 'natural_language') > 0;
     id |                     content                      
    ----+--------------------------------------------------
      2 | Li was born in 1962 in Wendeng County, Shandong.
    
    -- Must contain 'shandong', must not contain 'physics', and may contain 'famous'. The relevance score is higher if it contains 'famous'.
    -- Note: This query shows the score calculation result on a single shard. The calculated BM25 score may vary with different numbers of shards and different file organizations.
    SELECT id,
           content,
           TEXT_SEARCH(content, '+shandong -physics famous', 'natural_language') as score
    FROM wiki_articles
    WHERE TEXT_SEARCH(content, '+shandong -physics famous', 'natural_language') > 0
    ORDER BY score DESC;
    
    -- Result
     id |                     content                      |  score   
    ----+--------------------------------------------------+----------
      8 | ShanDa is a famous university of Shandong.       |  2.92376
      7 | Shandong has dozens of universities.             | 0.863399
      2 | Li was born in 1962 in Wendeng County, Shandong. | 0.716338
    
    -- (N4) Natural language query: Phrase search. Equivalent to (P1). The phrase must be enclosed in double quotation marks (""). If the phrase contains a double quotation mark, it must be escaped with a backslash (\).
    SELECT * FROM wiki_articles
            WHERE TEXT_SEARCH(content, '"shandong university"', 'natural_language') > 0;
            
    -- Result
     id |                               content                               
    ----+---------------------------------------------------------------------
      3 | He graduated from the department of physics at Shandong University.
    
    -- (N5) Natural language query: Phrase search. Equivalent to (P2). Supports setting slop with the ~ syntax.
    SELECT * FROM wiki_articles
            WHERE TEXT_SEARCH(content, '"shandong university"~23', 'natural_language') > 0;
            
    -- Result
     id |                               content                               
    ----+---------------------------------------------------------------------
      8 | ShanDa is a famous university of Shandong.
      7 | Shandong has dozens of universities.
      3 | He graduated from the department of physics at Shandong University.
    
    -- (N6) Natural language query: Match all documents.
    SELECT * FROM wiki_articles                                                                                  
            WHERE TEXT_SEARCH(content, '*', 'natural_language') > 0;
            
    -- Result
     id |                                           content                                            
    ----+----------------------------------------------------------------------------------------------
      1 | The Yangtze River is the longest river in China and the third longest in the world, with a total length of about 6,300 kilometers.
      2 | Li was born in 1962 in Wendeng County, Shandong.
      3 | He graduated from the department of physics at Shandong University.
      4 | The Spring Festival, also known as the Lunar New Year, is the most important traditional festival in China.
      5 | The Spring Festival usually falls between late January and mid-February in the Gregorian calendar. Main customs during the Spring Festival include pasting spring couplets, setting off firecrackers, having a New Year's Eve dinner, and making New Year visits.
      6 | In 2006, the Spring Festival was approved by the State Council as one of the first national intangible cultural heritages.
      7 | Shandong has dozens of universities.
      8 | ShanDa is a famous university of Shandong.
  • Term search.

    -- (T1) Term query: The tokenization result explicitly contains "Spring Festival", so it can be retrieved.
    SELECT * FROM wiki_articles 
             WHERE TEXT_SEARCH(content, 'Spring Festival', 'term') > 0;
    
    -- Result         
     id |                                           content                                            
    ----+----------------------------------------------------------------------------------------------
      4 | The Spring Festival, also known as the Lunar New Year, is the most important traditional festival in China.
      5 | The Spring Festival usually falls between late January and mid-February in the Gregorian calendar. Main customs during the Spring Festival include pasting spring couplets, setting off firecrackers, having a New Year's Eve dinner, and making New Year visits.
      6 | In 2006, the Spring Festival was approved by the State Council as one of the first national intangible cultural heritages.
    
    
    -- (T2) Term query: Example that returns no results.
    -- Because the content column of the wiki_articles source table uses the Jieba tokenizer by default, which splits "shandong university".
    -- We recommend using term queries with the keyword tokenizer. In that case, the target data can be retrieved.
    SELECT * FROM wiki_articles 
             WHERE TEXT_SEARCH(content, 'shandong university', 'term') > 0;
    
    -- Result         
     id | content 
    ----+---------

Complex query examples

  • Query with a primary key.

    -- Retrieve text that contains 'shandong' or 'university' and has an id of 3.
    SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university') > 0 and id = 3;
    
    -- Result
     id |                               content                               
    ----+---------------------------------------------------------------------
      3 | He graduated from the department of physics at Shandong University.
      
    
    -- Retrieve text that contains 'shandong' or 'university', or has an id less than 2.
    SELECT * FROM wiki_articles WHERE TEXT_SEARCH(content, 'shandong university') > 0 OR id < 2;
    
    -- Result
     id |                               content                               
    ----+---------------------------------------------------------------------
      2 | Li was born in 1962 in Wendeng County, Shandong.
      8 | ShanDa is a famous university of Shandong.
      1 | The Yangtze River is the longest river in China and the third longest in the world, with a total length of about 6,300 kilometers.
      3 | He graduated from the department of physics at Shandong University.
      7 | Shandong has dozens of universities.
  • Obtain the score and retrieve the top three results.

    SELECT id,
           content,
           TEXT_SEARCH(content, 'shandong university') AS score,
           TOKENIZE(content, 'jieba')
      FROM wiki_articles
    ORDER BY score DESC
    LIMIT 3;
    
    -- Result
    id  |                               content                               |  score  |                     tokenize                     
    ----+---------------------------------------------------------------------+---------+--------------------------------------------------
      8 | ShanDa is a famous university of Shandong.                          | 2.74634 | {shanda,famous,univers,shandong}
      7 | Shandong has dozens of universities.                                | 2.74634 | {shandong,has,dozen,univers}
      3 | He graduated from the department of physics at Shandong University. | 2.38178 | {he,graduat,from,depart,physic,shandong,univers}
  • Use the TEXT_SEARCH function in both the SELECT list and the WHERE clause.

    SELECT id,
           content,
           TEXT_SEARCH(content, 'shandong university') AS score,
           TOKENIZE(content, 'jieba')
      FROM wiki_articles
     WHERE TEXT_SEARCH(content, 'shandong university') > 0
    ORDER BY score DESC;
    
    -- Result
    id  |                               content                               |  score  |                     tokenize                     
    ----+---------------------------------------------------------------------+---------+--------------------------------------------------
      7 | Shandong has dozens of universities.                                | 2.74634 | {shandong,has,dozen,univers}
      8 | ShanDa is a famous university of Shandong.                          | 2.74634 | {shanda,famous,univers,shandong}
      3 | He graduated from the department of physics at Shandong University. | 2.38178 | {he,graduat,from,depart,physic,shandong,univers}
      2 | Li was born in 1962 in Wendeng County, Shandong.                    | 1.09244 | {li,born,1962,wendeng,counti,shandong}
  • Retrieve the documents from the 'wiki' source that are most relevant to 'shandong university'.

    -- Source table for JOIN.
    CREATE TABLE article_source (id int primary key, source text);
    INSERT INTO article_source VALUES (1, 'baike'), (2, 'wiki'), (3, 'wiki'), (4, 'baike'),
                                      (5, 'baike'), (6, 'baike'), (7, 'wiki'), (8, 'paper'),
                                      (9, 'http_log'), (10, 'http_log'), (11, 'http_log');
                                      
    SELECT a.id,
           source, content,
           TEXT_SEARCH(content, 'shandong university') AS score,
           TOKENIZE(a.content, 'jieba')
      FROM wiki_articles a
      JOIN article_source b
        ON (a.id = b.id)
     WHERE TEXT_SEARCH(a.content, 'shandong university') > 0
       AND b.source = 'wiki'
    ORDER BY score DESC;
    
    -- Result
    id  | source |                               content                               |  score  |                     tokenize                     
    ----+--------+---------------------------------------------------------------------+---------+--------------------------------------------------
      7 | wiki   | Shandong has dozens of universities.                                | 2.74634 | {shandong,has,dozen,univers}
      3 | wiki   | He graduated from the department of physics at Shandong University. | 2.38178 | {he,graduat,from,depart,physic,shandong,univers}
      2 | wiki   | Li was born in 1962 in Wendeng County, Shandong.                    | 1.09244 | {li,born,1962,wendeng,counti,shandong}

Usage recommendations

Reindex using Serverless resources

If you modify table properties, compaction may be triggered and the index may be rebuilt, which consumes a large amount of CPU resources. To modify the following table properties, perform the steps below:

  • Modifying `bitmap_columns`, `dictionary_encoding_columns`, or vector indexes triggers compaction and reindexing. Therefore, do not use the `ALTER TABLE xxx SET` syntax. Instead, use the `REBUILD` syntax with Serverless Computing resources by running the following command. For more information, see REBUILD.

ASYNC REBUILD TABLE <table_name> 
WITH (
    rebuild_guc_hg_computing_resource = 'serverless'
)
SET (
    bitmap_columns = '<col1>,<col2>',
    dictionary_encoding_columns = '<col1>:on,<col2>:off',
    vectors = '{
    "<col_vector>": {
        "algorithm": "HGraph",
        "distance_method": "Cosine",
        "builder_params": {
            "base_quantization_type": "rabitq",
            "graph_storage_type": "compressed",
            "max_degree": 64,
            "ef_construction": 400,
            "precise_quantization_type": "fp32",
            "use_reorder": true,
            "max_total_size_to_merge_mb" : 4096
        }
    }
    }'
);
  • Modifying column-oriented JSONB columns or full-text index columns also triggers compaction and reindexing. The `REBUILD` syntax is not supported for these modifications. Instead, create a temporary table by following these steps:

BEGIN;
-- Clean up potential temporary tables.
DROP TABLE IF EXISTS <table_new>;
-- Create a temporary table.
SET hg_experimental_enable_create_table_like_properties = on;
CALL HG_CREATE_TABLE_LIKE ('<table_new>', 'select * from <table>');
COMMIT;

-- Enable column-oriented storage for JSON-formatted data for the corresponding column.
ALTER TABLE <table_new> ALTER COLUMN <column_name> SET (enable_columnar_type = ON);
-- Create a full-text index on the corresponding column.
CREATE INDEX <idx_name> ON <table_new> USING FULLTEXT (<column_name>);

-- Insert data into the temporary table, use Serverless resources to run the job, and synchronously complete index building.
SET hg_computing_resource = 'serverless';
INSERT INTO <table_new> SELECT * FROM <table>;
ANALYZE <table_new>;

BEGIN;
-- Delete the old table.
DROP TABLE IF EXISTS <table>;
-- Rename the temporary table.
ALTER TABLE <table_new> RENAME TO <table>;
COMMIT;
  • To modify other properties, such as `distribution_key`, `clustering_key`, `segment_key`, and storage format, use the `REBUILD` syntax with Serverless Computing resources.
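
For example, the same `REBUILD` pattern applies when changing the `distribution_key`. The placeholders are illustrative:

ASYNC REBUILD TABLE <table_name> 
WITH (
    rebuild_guc_hg_computing_resource = 'serverless'
)
SET (
    distribution_key = '<col1>'
);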

Advanced operations: Customize tokenizer configurations

Hologres recommends using the default tokenizer configurations. If the default configurations do not meet your business needs, you can customize the tokenizer configuration for more flexible tokenization.

analyzer_params configuration requirements

The `analyzer_params` parameter has the following configuration requirements:

  • Only JSON-formatted strings are supported.

  • The top level of the JSON supports two keys: `tokenizer` and `filter`.

    • filter: Optional. The value is a JSON array used to configure token filter properties. If you configure multiple token filter properties, they are applied to each token in the specified order.

    • tokenizer: Required. The value is a JSON object used to configure tokenizer properties. The JSON object supports the following keys:

      • type: Required. The tokenizer name.

      • Other parameters for the tokenizer object. Different tokenizers support different parameters, as shown in the following table:

        Tokenizer

        Other parameters for the
        tokenizer object

        Parameter description

        Parameter values

        jieba

        mode

        Defines the tokenization mode.

        • search (default): Lists multiple possible combinations during tokenization, which allows for redundancy. For example, the tokenization result for "traditional festival" is "traditional", "festival", and "traditional festival".

        • exact: Prevents redundant splitting during tokenization. For example, the tokenization result for "traditional festival" is only "traditional festival".

        hmm

        Specifies whether to use a Hidden Markov Model (HMM) to identify words that are not in the dictionary. This improves new word recognition.

        • true (default): Enabled

        • false: Disabled.

        standard

        max_token_length

        The maximum token length.

        A positive integer. The default value is 255. If a token exceeds this length, it is split at intervals of `max_token_length`.

        ik

        mode

        Defines the tokenization mode.

        • ik_max_word (default): Performs fine-grained tokenization that outputs all possible short words. For example, the tokenization result for "Nanjing City" is "Nanjing" and "City".

        • ik_smart: Performs coarse-grained tokenization that prioritizes long words and reduces the number of splits. The output tokens do not overlap. This mode attempts to combine numbers and quantifiers into a single token. For example, "Nanjing City" and "1 kilometer" are not further tokenized.

        enable_lowercase

        Specifies whether to convert tokens to lowercase.

        • true (default)

        • false

        ngram

        min_ngram

        The minimum character length of a token.

        A positive integer. The default value is 1. The maximum difference between `min_ngram` and `max_ngram` is 3.

        Note: You can adjust the maximum difference by setting the following GUC. For example, SET hg_fulltext_index_max_ngram_diff = 5.

        max_ngram

        The maximum character length of a token.

        The default value is 2. The value must be in the range of [1, 255]. The maximum difference between `min_ngram` and `max_ngram` is 3.

        Note

        If the difference between `max_ngram` and `min_ngram` is too large, the ngram tokenizer generates many tokens. This can increase resource consumption, require more index storage, and lengthen index building times.

        prefix_only

        Specifies whether to consider only prefixes.

        • true

        • false (default)

        pinyin

        keep_first_letter

        Specifies whether to keep the first letter of each Chinese character as a token.

        • true (default): Keeps the first letter. For example, the tokenization result for "Li Ming" includes "lm".

        • false: Does not keep the first letter.

        keep_separate_first_letter

        Specifies whether to keep the first letter of each Chinese character as a separate token.

        • true: Keeps the first letters separately. For example, the tokenization result for "Li Ming" includes "l,m".

        • false (default): Does not keep separate first letters.

        limit_first_letter_length

        The maximum length of the first-letter token.

        An integer that specifies the maximum length of the first-letter string. The default value is 16.

        keep_full_pinyin

        Specifies whether to keep the full Pinyin of each Chinese character as separate tokens.

        • true (default): Keeps the full Pinyin. For example, the tokenization result for "Li Ming" includes "li,ming".

        • false: Does not keep the full Pinyin.

        keep_joined_full_pinyin

        Specifies whether to join the full Pinyin of each Chinese character into a single token.

        • true: Joins the full Pinyin. For example, the tokenization result for "Li Ming" includes "liming".

        • false (default): Does not join the full Pinyin.

        keep_none_chinese

        Specifies whether to keep non-Chinese letters or numbers in the tokenization result.

        • true (default): Keeps them.

        • false: Does not keep them.

        keep_none_chinese_together

        Specifies whether to keep non-Chinese letters or numbers together.

        • true (default): Keeps non-Chinese characters together. For example, the tokenization result for "DJ Li Ming" is "DJ,li,ming", where "DJ" stays together.

        • false: Does not keep non-Chinese characters together. For example, the tokenization result for "DJ Li Ming" is "D,J,li,ming".

        Note

        This parameter takes effect only when `keep_none_chinese` is set to `true`.

        keep_none_chinese_in_first_letter

        Specifies whether to keep non-Chinese letters or numbers in the first-letter token.

        • true (default): Keeps them. For example, the tokenization result for 李明AT2025 includes "lmat2025".

        • false: Does not keep them.

        keep_none_chinese_in_joined_full_pinyin

        Specifies whether to keep non-Chinese letters or numbers in the joined full Pinyin token.

        • true: Keeps non-Chinese characters. For example, the tokenization result for "Li Ming AT2025" includes "limingat2025".

        • false (default): Does not keep them.

        none_chinese_pinyin_tokenize

        If non-Chinese letters are valid Pinyin, specifies whether to split them into separate Pinyin terms.

        • true (default): Splits the terms. For example, the tokenization result for "limingalibaba2025" includes "li,ming,a,li,ba,ba,2025".

        • false: Does not split the terms.

        Note

        This parameter takes effect only when `keep_none_chinese` and `keep_none_chinese_together` are set to `true`.

        keep_original

        Specifies whether to keep the original input.

        • true: Keeps the original input.

        • false (default): Does not keep the original input.

        lowercase

        Specifies whether to convert non-Chinese letters to lowercase.

        • true (default)

        • false

        trim_whitespace

        Specifies whether to trim whitespace characters.

        • true (default)

        • false

        remove_duplicated_term

        Specifies whether to remove duplicate terms.

        • true: Removes duplicate terms. For example, "de的" is tokenized as only "de", which affects the results of phrase queries for "de的".

        • false (default): Does not remove duplicate terms.

        keep_separate_chinese

        Specifies whether to keep individual Chinese characters as separate tokens.

        • true: Keeps individual characters. For example, the tokenization result for "Li Ming" includes the two tokens "Li" and "Ming".

        • false (default): Does not keep individual characters.

Default analyzer_params configurations

The default `analyzer_params` configurations for different tokenizers are as follows:

Analyzer name

Default analyzer_params configuration

Tokenization example

jieba (default analyzer)

{
  "tokenizer": {
    "type": "jieba", 
    "mode": "search",
    "hmm": true
  }, 
  "filter": [
    "removepunct",
    "lowercase",
    {"type": "stop", "stop_words": ["_english_"]},
    {"type": "stemmer", "language": "english"}
  ]
}
 Spring Festival
 ,
 that is
 ,
 the Lunar New Year
 ,
 is
 China
 's
 most
 important
 traditional
 festival
 traditional festival
 .

whitespace

{
  "tokenizer": {
    "type": "whitespace"
  }
}
Spring Festival, that is the Lunar New Year, is China's most important traditional festival.

keyword

{
  "tokenizer": {
    "type": "keyword"
  }
}
Spring Festival, that is the Lunar New Year, is China's most important traditional festival.

simple

{
  "tokenizer": {
    "type": "simple"
  }, 
  "filter": [
    "lowercase"
  ]
}
Spring Festival
that is the Lunar New Year
is China's most important traditional festival

standard

{
  "tokenizer": {
    "type": "standard",
    "max_token_length": 255
  }, 
  "filter": [
    "lowercase"
  ]
}
Spring
Festival
that
is
the
Lunar
New
Year
is
China
's
most
important
traditional
festival

icu

{
  "tokenizer": {
    "type": "icu"
  }, 
  "filter": [
    "removepunct",
    "lowercase"
  ]
}
Spring Festival
,
 that is
 the Lunar New Year
 ,
 is
 China
 's most
 important
 traditional festival
 .

ik

{
  "tokenizer": {
    "type": "ik",
    "mode": "ik_max_word",
    "enable_lowercase": true
  },
  "filter": [
    {"type": "stop", "stop_words": ["_english_"]},
    {"type": "stemmer", "language": "english"}
  ]
}
Spring Festival
that
is
the Lunar New Year
is
China
's most
most important
important
traditional festival
traditional
festival

ngram

{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 1,
    "max_gram": 2,
    "prefix_only": false
  }
}
Spring
Spring Festival
Festival
Festival,
,
, that
that
that is
is
is the
the
the Lunar
Lunar
Lunar New
New
New Year
Year
Year,
,
, is
is
is China
China
China's
's
's most
most
most important
important
important
important
important
important
traditional
traditional
traditional festival
festival
festival
.

pinyin

{
  "tokenizer": {
    "type": "pinyin",
    "keep_first_letter": true,
    "keep_separate_first_letter": false,
    "keep_full_pinyin": true,
    "keep_joined_full_pinyin": false,
    "keep_none_chinese": true,
    "keep_none_chinese_together": true,
    "none_chinese_pinyin_tokenize": true,
    "keep_original": false,
    "limit_first_letter_length": 16,
    "lowercase": true,
    "trim_whitespace": true,
    "keep_none_chinese_in_first_letter": true,
    "keep_none_chinese_in_joined_full_pinyin": false,
    "remove_duplicated_term": false,
    "ignore_pinyin_offset": true,
    "fixed_pinyin_offset": false,
    "keep_separate_chinese": false
  }
}
chun
cjjnlxnszgzzydct
jie
ji
nong
li
xin
nian
shi
zhong
guo
zui
zhong
yao
de
chuan
tong
jie
ri

Filter configuration in analyzer_params

Hologres supports the following filters, also known as token filter properties, in `analyzer_params`.

Note

If you configure multiple token filter properties, they are applied to each token in the specified order.
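
For example, assuming stop-word matching is exact (case-sensitive), placing `lowercase` before `stop` ensures that a token such as "The" is first lowercased and then removed by the built-in "_english_" dictionary. In the reverse order, "The" would not match the stop list:

"filter": [
  "lowercase",
  {"type": "stop", "stop_words": ["_english_"]}
]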

Property name

Property description

Parameter format

Example

lowercase

Converts uppercase letters in a token to lowercase.

Declare `lowercase`.

"lowercase"
  • Token filter property definition

    "filter": ["lowercase"]
  • Token filter result

    ["Hello", "WORLD"] is converted to ["hello", "world"].

stop

Removes stop-word tokens.


stop_words: A list of stop words. This must be a list that contains only strings. You can define custom stop words or use the built-in stop word dictionaries for specific languages:

"_english_"
"_danish_"
"_dutch_"
"_finnish_"
"_french_"
"_german_"
"_hungarian_"
"_italian_"
"_norwegian_"
"_portuguese_"
"_russian_"
"_spanish_"
"_swedish_"
  • Token filter property definition

    "filter": [{
      "type": "stop",
      "stop_words": ["_english_", "cat"]
    }]
  • Token filter result

    ["the", "cat", "is", "on", "a", "mat"] is converted to ["mat"].

    Note

    "cat" is a custom stop word. "the", "is", "on", and "a" are stop words from the built-in "_english_" dictionary.

stemmer

Converts a token to its corresponding root form (stem) based on the grammatical rules of the language.

language: The language. The following built-in languages are supported:

"arabic",
"danish",
"dutch",
"english",
"finnish",
"french",
"german",
"greek",
"hungarian",
"italian",
"norwegian",
"portuguese",
"romanian",
"russian",
"spanish",
"swedish",
"tamil",
"turkish"
  • Token filter property definition

    "filter": [{
      "type": "stemmer",
      "language": "english"
    }]
  • Token filter result

    ["machine", "learning"] is converted to ["machin", "learn"].

length

Removes tokens that exceed a specified length.

max: The maximum length to keep. This must be a positive integer.

{"type": "length", "max": 10}
  • Token filter property definition

    "filter": [{"type": "length", "max": 10}]
  • Token filter result

    ["AI", "for", "Artificial", "Intelligence"] is converted to ["AI", "for", "Artificial"].

removepunct

Removes tokens that consist only of punctuation characters.

Declare `removepunct`.

"removepunct"
Note

In V4.0.8 and later, `removepunct` has a new `mode` parameter that defines the removal mode. The values are as follows:

  • if_all (default): Removes the token only if all characters in the token are punctuation characters.

  • if_any: Removes the token if it contains any punctuation characters.

  • Token filter property definition

    "filter": ["removepunct"]
    Note

    This is equivalent to "filter": [{"type": "removepunct", "mode": "if_all"}].

  • Token filter result

    ["chinese", "english", "chinese.", "english.", "124", "124!=8", ".", ",", ",,", " ..."] is converted to ["chinese", "english", "chinese.", "english.", "124", "124!=8"].

    "filter": [{"type": "removepunct", "mode": "if_any"}]

    ["chinese", "english", "chinese.", "english.", "124", "124!=8", ".", ",", ",,", " ..."] is converted to ["chinese", "english", "124"].

pinyin

Pinyin token filter.

{
  "type": "pinyin",
  "keep_first_letter": true,
  "keep_separate_first_letter": false,
  "keep_full_pinyin": true,
  "keep_joined_full_pinyin": false,
  "keep_none_chinese": true,
  "keep_none_chinese_together": true,
  "none_chinese_pinyin_tokenize": true,
  "keep_original": false,
  "limit_first_letter_length": 16,
  "lowercase": true,
  "trim_whitespace": true,
  "keep_none_chinese_in_first_letter": true,
  "keep_none_chinese_in_joined_full_pinyin": false,
  "remove_duplicated_term": false,
  "ignore_pinyin_offset": true,
  "fixed_pinyin_offset": false,
  "keep_separate_chinese": false
}

The filter configuration properties are the same as for the Pinyin tokenizer.
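
As a hypothetical sketch (this exact combination is an assumption, not an example taken from this document), the pinyin filter can be declared in the `filter` array after a tokenizer such as jieba, using a subset of the properties listed above:

{
  "tokenizer": {
    "type": "jieba",
    "mode": "search"
  },
  "filter": [
    {
      "type": "pinyin",
      "keep_first_letter": true,
      "keep_full_pinyin": true,
      "lowercase": true
    }
  ]
}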