An inverted index maps tokens to the rows that contain them, letting SelectDB skip irrelevant data entirely instead of scanning every row. Use inverted indexes for full-text search on string columns and for equality or range queries on numeric and date columns.
How it works
Each row in a SelectDB table is treated as a document, and each column is a field in that document. When you create an inverted index, SelectDB tokenizes the column values and builds a token-to-row mapping in a separate inverted file.
For example, given three rows in a comments table:
Row 1: "OLAP databases are fast"
Row 2: "Fast databases enable real-time analytics"
Row 3: "Real-time OLAP is ideal for dashboards"
The index maps tokens to rows like this:
olap → [1, 3]
databases → [1, 2]
fast → [1, 2]
analytics → [2]
real-time → [2, 3]
A search for OLAP resolves instantly by reading the index entry rather than scanning all rows.
The inverted file has only a logical relationship with the main data file (segment), not a physical one. This means adding or dropping an inverted index does not rewrite the main data file, so index management has minimal overhead.
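The token-to-row mapping above can be sketched in a few lines of Python. This is a deliberately simplified model (naive whitespace tokenization, an in-memory dict), not SelectDB's actual on-disk index format:

```python
# Minimal sketch of an inverted index: tokenize each row,
# then map each token to the set of row IDs containing it.
from collections import defaultdict

rows = {
    1: "OLAP databases are fast",
    2: "Fast databases enable real-time analytics",
    3: "Real-time OLAP is ideal for dashboards",
}

index = defaultdict(set)
for row_id, text in rows.items():
    for token in text.lower().split():
        index[token].add(row_id)

# A lookup reads one index entry instead of scanning every row.
print(sorted(index["olap"]))       # [1, 3]
print(sorted(index["databases"]))  # [1, 2]
```

The lookup cost depends on the size of the posting list for the token, not on the total number of rows, which is why index hits are largely insensitive to table size.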
Use cases
Full-text search on string columns
Equality and range filtering (=, !=, >, >=, <, <=) on string, numeric, and datetime columns
Benefits
Push down OR and NOT logic to the index layer
Support arbitrary combinations of AND, OR, and NOT across multiple conditions
Create inverted indexes at table creation time (synchronous)
Add inverted indexes to existing tables (asynchronous)
Delete existing inverted indexes from tables (asynchronous)
Limitations
| Limitation | Workaround |
|---|---|
| FLOAT and DOUBLE are not supported (floating-point precision) | Use DECIMAL, which supports inverted indexes |
| MAP, STRUCT, JSON, HLL, BITMAP, QUANTILE_STATE, and AGG_STATE are not supported | For JSON, use VARIANT instead |
| Numeric columns do not support a tokenizer (parser property is not applicable) | — |
| Aggregate model: inverted indexes are supported only on key columns | — |
| Unique model without Merge-on-Write: inverted indexes are supported only on key columns | Enable Merge-on-Write to support inverted indexes on any column |
| Duplicate model: inverted indexes are supported on any column | — |
Aggregate and Unique (without Merge-on-Write) models require reading all data before merging, so the engine cannot use indexes for early filtering on non-key columns.
Create an inverted index
At table creation (synchronous)
The index is ready as soon as the table is created.
Index coverage depends on the data model:
Duplicate model: any column
Unique model with Merge-on-Write enabled: any column
Aggregate model and Unique model without Merge-on-Write: key columns only
Syntax
CREATE TABLE [IF NOT EXISTS] [db_name.]<table_name>
(
<column_definition_list>,
[<index_definition_list>]
)
table_properties;
index_definition syntax:
INDEX <index_name>(<column_name>) USING INVERTED [PROPERTIES("<key>" = "<value>")] [COMMENT '<comment>']
Parameters
| Parameter | Required | Description |
|---|---|---|
| db_name | No | Name of the target database |
| table_name | Yes | Name of the target table |
| column_definition_list | Yes | List of column definitions. See CREATE TABLE |
| table_properties | Yes | Table properties: data model, partitioning, bucketing. See Data model |
| index_definition_list | No | One or more index definitions, comma-separated |
`index_definition` required parameters
| Parameter | Description |
|---|---|
| index_name | Name of the index |
| column_name | Name of the indexed column |
| index_type | Always USING INVERTED |
`PROPERTIES` (optional)
Use PROPERTIES to control tokenization behavior. Specify one or more key-value pairs in the format "<key>" = "<value>", separated by commas. To preview how a string will be tokenized before creating the index, use the TOKENIZE function. See Tokenization functions.
| Key | Description |
|---|---|
parser | Tokenizer. Not applicable to numeric columns. Options: english (splits on spaces and punctuation, highest performance), chinese (Chinese text, lower performance than english), unicode (multilingual text including mixed Chinese and English, email addresses, IP addresses, and alphanumeric strings). Default: no tokenizer — the entire value is indexed as a single token. |
| parser_mode | Tokenization granularity. All tokenizers default to coarse_grained (prefers longer tokens). When parser=chinese, fine_grained is also available (prefers shorter, overlapping tokens). Example: '武汉市长江大桥' (Wuhan Yangtze River Bridge) → coarse-grained: ['武汉市', '长江大桥']; fine-grained: ['武汉', '武汉市', '市长', '长江', '长江大桥', '大桥']. Note that fine-grained mode emits segmentation candidates such as '市长' ("mayor") that span word boundaries. |
support_phrase | Enables phrase query acceleration for MATCH_PHRASE. Default: false. Set to true to enable (requires additional storage). When false, use MATCH_ALL for multi-keyword queries instead. |
char_filter | Preprocesses strings before tokenization. Only char_replace is supported. char_replace replaces each character in char_filter_pattern with the corresponding character in char_filter_replacement (defaults to a space if not set). |
ignore_above | Maximum length for non-tokenized string indexes (when parser is not set). Strings longer than this value are not indexed; for arrays, the limit applies per element. Default: 256 bytes. |
lower_case | true: converts tokens to lowercase for case-insensitive matching. false: no conversion. |
stopwords | Stopword list used during ingestion and querying. Built-in list includes common words like is, the, and a. Set to none to use an empty list. |
dict_compression | Enables ZSTD dictionary compression for the index dictionary. true: enables compression (reduces storage by approximately 20% for large text and log workloads; works best with inverted_index_storage_format = "V3"). false: disabled (default). Enable for large text or log datasets where storage cost matters. Requires version 4.1.0 or later. |
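The interplay of char_filter, lower_case, and stopwords at index time can be pictured as a small pipeline. The sketch below is a hypothetical simplified model (whitespace splitting stands in for the english parser, and char_replace is reduced to replacing each listed character with one replacement string); SelectDB's real analyzers are more sophisticated:

```python
# Simplified model of the index-time analysis pipeline driven by PROPERTIES:
# char_replace -> tokenize -> lower_case -> stopword removal.
def analyze(text, char_filter_pattern="", char_filter_replacement=" ",
            lower_case=True, stopwords=("is", "the", "a")):
    # char_filter: replace each listed character before tokenization
    for ch in char_filter_pattern:
        text = text.replace(ch, char_filter_replacement)
    tokens = text.split()  # stand-in for the "english" parser
    if lower_case:
        tokens = [t.lower() for t in tokens]
    # stopwords: drop common words during ingestion and querying
    return [t for t in tokens if t not in stopwords]

print(analyze("OLAP is the_fastest", char_filter_pattern="_"))
# ['olap', 'fastest']
```

Because the same pipeline runs at query time, a query keyword like "The" would be normalized and dropped the same way, which is why tokenized-index results can differ slightly from raw LIKE matching.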
COMMENT (optional)
| Parameter | Description |
|---|---|
comment | Description of the index |
Example
-- Creates an inverted index on the comment column using English tokenization.
-- Omit PROPERTIES("parser" = "english") to index the full string without tokenization.
-- Also supports "chinese" and "unicode".
CREATE TABLE hackernews_1m
(
`id` BIGINT,
`deleted` TINYINT,
`type` String,
`author` String,
`timestamp` DateTimeV2,
`comment` String,
`dead` TINYINT,
`parent` BIGINT,
`poll` BIGINT,
`children` Array<BIGINT>,
`url` String,
`score` INT,
`title` String,
`parts` Array<INT>,
`descendants` INT,
INDEX idx_comment (`comment`) USING INVERTED PROPERTIES("parser" = "english") COMMENT 'inverted index for comment'
)
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 10;
Add an index to an existing table (asynchronous)
This is an asynchronous operation. Track progress with SHOW ALTER TABLE COLUMN.
Syntax
ALTER TABLE <table_name> ADD INDEX <index_name>(<column_name>) USING INVERTED [PROPERTIES("<key>" = "<value>")];
Parameters are the same as for table creation.
Examples
-- Add an index without tokenization (treats the full value as a single token)
ALTER TABLE user_tb ADD INDEX index_userId(user_id) USING INVERTED;
-- Add an index with English tokenization
ALTER TABLE user_tb ADD INDEX index_city(city) USING INVERTED PROPERTIES("parser" = "english");
View indexes
View all indexes for a table
SHOW INDEXES FROM <table_name>;
Example:
SHOW INDEXES FROM user_tb;
Check async operation progress
ALTER TABLE ADD INDEX and DROP INDEX are asynchronous. Check progress with:
SHOW ALTER TABLE COLUMN;
Drop an index
Dropping an index is asynchronous. Both syntaxes are equivalent.
Dropping an index may reduce query performance. Proceed with caution.
-- Syntax 1
DROP INDEX <index_name> ON <table_name>;
-- Syntax 2
ALTER TABLE <table_name> DROP INDEX <index_name>;
Examples:
DROP INDEX index_userId ON user_tb;
ALTER TABLE user_tb DROP INDEX index_city;
Query with inverted indexes
Full-text search
SELECT * FROM <table_name> WHERE <column_name> <match_function> '<keywords>';
| Parameter | Required | Description |
|---|---|---|
| table_name | Yes | Name of the target table |
| column_name | Yes | Name of the indexed column |
| match_function | Yes | MATCH_ANY: matches rows containing any of the keywords. MATCH_ALL: matches rows containing all keywords. MATCH_PHRASE: matches rows containing the exact phrase. Combine with AND, OR, and NOT for complex filters. |
| keywords | Yes | One or more keywords, separated by spaces. Example: keyword1 keyword2 keyword3 |
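The semantics of the three match functions can be sketched over tokenized text. This is a conceptual model only (naive whitespace tokenization, no index involved), not how SelectDB evaluates the predicates internally:

```python
# Conceptual semantics of MATCH_ANY, MATCH_ALL, and MATCH_PHRASE.
def tokens(text):
    return text.lower().split()

def match_any(text, keywords):
    # At least one keyword appears in the text
    return any(k in tokens(text) for k in tokens(keywords))

def match_all(text, keywords):
    # Every keyword appears in the text, in any order
    return all(k in tokens(text) for k in tokens(keywords))

def match_phrase(text, keywords):
    # Keywords appear as consecutive tokens, in order
    t, k = tokens(text), tokens(keywords)
    return any(t[i:i + len(k)] == k for i in range(len(t) - len(k) + 1))

doc = "fast OLAP and OLTP engines"
print(match_any(doc, "olap mysql"))        # True
print(match_all(doc, "olap mysql"))        # False
print(match_phrase(doc, "olap and oltp"))  # True
```

The ordering constraint is what distinguishes MATCH_PHRASE from MATCH_ALL, and it is also why MATCH_PHRASE benefits from the support_phrase index property, which stores token positions.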
Examples
-- Rows where logmsg contains "keyword1"
SELECT * FROM log_tb WHERE logmsg MATCH_ANY 'keyword1';
-- Rows where logmsg contains "keyword1" or "keyword2"
SELECT * FROM log_tb WHERE logmsg MATCH_ANY 'keyword1 keyword2';
-- Rows where logmsg contains both "keyword1" and "keyword2"
SELECT * FROM log_tb WHERE logmsg MATCH_ALL 'keyword1 keyword2';
-- Rows where logmsg contains "keyword1" immediately followed by "keyword2"
SELECT * FROM log_tb WHERE logmsg MATCH_PHRASE 'keyword1 keyword2';
Equality and range queries
For numeric and date columns, use standard SQL syntax. SelectDB automatically uses the inverted index when one exists.
-- Equality query
SELECT * FROM user_tb WHERE id = 123;
-- Range query
SELECT * FROM user_tb WHERE ts > '2023-01-01 00:00:00';
-- IN query
SELECT * FROM user_tb WHERE op_type IN ('add', 'delete');
Performance benchmark
The following results use 1,000,000 rows from the hackernews dataset to compare query performance with and without inverted indexes.
Set up the environment
Step 1: Create the table
CREATE DATABASE test_inverted_index;
USE test_inverted_index;
CREATE TABLE hackernews_1m
(
`id` BIGINT,
`deleted` TINYINT,
`type` String,
`author` String,
`timestamp` DateTimeV2,
`comment` String,
`dead` TINYINT,
`parent` BIGINT,
`poll` BIGINT,
`children` Array<BIGINT>,
`url` String,
`score` INT,
`title` String,
`parts` Array<INT>,
`descendants` INT,
INDEX idx_comment (`comment`) USING INVERTED PROPERTIES("parser" = "english") COMMENT 'inverted index for comment'
)
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 10;
Step 2: Import data
Download the dataset.
wget https://qa-build.oss-cn-beijing.aliyuncs.com/regression/index/hacknernews_1m.csv.gz
Import using Stream Load. Find the connection host and port on the instance details page in the ApsaraDB for SelectDB console. For more information, see Stream Load.
curl --location-trusted -u root: -H "compress_type:gz" \
    -T hacknernews_1m.csv.gz \
    http://<host>:<port>/api/test_inverted_index/hackernews_1m/_stream_load
Expected output:
{ "TxnId": 2, "Label": "a8a3e802-2329-49e8-912b-04c800a461a6", "TwoPhaseCommit": "false", "Status": "Success", "Message": "OK", "NumberTotalRows": 1000000, "NumberLoadedRows": 1000000, "NumberFilteredRows": 0, "NumberUnselectedRows": 0, "LoadBytes": 130618406, "LoadTimeMs": 8988, "BeginTxnTimeMs": 23, "StreamLoadPutTimeMs": 113, "ReadDataTimeMs": 4788, "WriteDataTimeMs": 8811, "CommitAndPublishTimeMs": 38 }Verify the import.
SELECT count() FROM hackernews_1m;
+---------+
| count() |
+---------+
| 1000000 |
+---------+
Full-text search results
Tokenized inverted index results may differ slightly from LIKE results because tokenization normalizes input (for example, converting to lowercase).
| Query | LIKE | Inverted index | Speedup |
|---|---|---|---|
comment contains OLAP | 0.18 s | 0.02 s (MATCH_ANY) | 9x |
comment contains OLTP | 0.07 s | 0.01 s (MATCH_ANY) | 7x |
comment contains both OLAP and OLTP | 0.13 s | 0.01 s (MATCH_ALL) | 12x |
comment contains OLAP or OLTP | 0.12 s | 0.01 s (MATCH_ANY) | 12x |
Example queries:
-- LIKE (0.18 s)
SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%';
-- Inverted index (0.02 s)
SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP';
-- LIKE (0.13 s)
SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' AND comment LIKE '%OLTP%';
-- Inverted index (0.01 s)
SELECT count() FROM hackernews_1m WHERE comment MATCH_ALL 'OLAP OLTP';
Equality and range query results
| Query | Before index | After index | Speedup |
|---|---|---|---|
timestamp > '2007-08-23 04:17:00' | 0.03 s | 0.01 s | 2x |
author = 'faster' | 0.03 s | 0.01 s | 2x |
parent = 11189 | 0.01 s | 0.01 s | — |
Steps for the timestamp example:
-- 1. Baseline: 0.03 s
SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00';
-- 2. Create the index
CREATE INDEX idx_timestamp ON hackernews_1m(timestamp) USING INVERTED;
-- 3. Check progress (building 1M rows takes approximately 1 second)
SHOW ALTER TABLE COLUMN;
-- 4. After the index is ready: 0.01 s
SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00';
Steps for the author example:
-- 1. Baseline: 0.03 s
SELECT count() FROM hackernews_1m WHERE author = 'faster';
-- 2. Create the index (no tokenizer — indexes the full string as one token)
ALTER TABLE hackernews_1m ADD INDEX idx_author(author) USING INVERTED;
-- 3. Check progress (building 1M rows takes approximately 1.5 seconds)
SHOW ALTER TABLE COLUMN;
-- 4. After the index is ready: 0.01 s
SELECT count() FROM hackernews_1m WHERE author = 'faster';
Tokenization functions
Use TOKENIZE to preview how a string will be tokenized before creating an index. This helps you choose the right parser and parser_mode settings.
TOKENIZE('<text>', '"parser"="<parser_name>"[,"parser_mode"="<mode>"]')
| Parameter | Description |
|---|---|
| parser | english, chinese, or unicode. See the parser key in the PROPERTIES table above. |
| parser_mode | coarse_grained or fine_grained. Applies only when parser=chinese. |
Examples
-- English tokenization: lowercases and splits on spaces and punctuation
SELECT TOKENIZE('I love CHINA', '"parser"="english"');
+------------------------------------------------+
| tokenize('I love CHINA', '"parser"="english"') |
+------------------------------------------------+
| ["i", "love", "china"] |
+------------------------------------------------+
-- Chinese tokenization, fine-grained: produces shorter tokens with overlapping phrases
SELECT TOKENIZE('武汉长江大桥', '"parser"="chinese","parser_mode"="fine_grained"');
+-----------------------------------------------------------------------------------+
| tokenize('武汉长江大桥', '"parser"="chinese","parser_mode"="fine_grained"') |
+-----------------------------------------------------------------------------------+
| ["武汉", "武汉长江大桥", "长江", "长江大桥", "大桥"] |
+-----------------------------------------------------------------------------------+
-- Chinese tokenization, coarse-grained: produces fewer, longer tokens
SELECT TOKENIZE('武汉市长江大桥', '"parser"="chinese","parser_mode"="coarse_grained"');
+----------------------------------------------------------------------------------------+
| tokenize('武汉市长江大桥', '"parser"="chinese","parser_mode"="coarse_grained"') |
+----------------------------------------------------------------------------------------+
| ["武汉市", "长江大桥"] |
+----------------------------------------------------------------------------------------+
-- Unicode tokenization: handles mixed Chinese and English text
SELECT TOKENIZE('I love CHINA 我爱我的祖国', '"parser"="unicode"');
+-------------------------------------------------------------------+
| tokenize('I love CHINA 我爱我的祖国', '"parser"="unicode"') |
+-------------------------------------------------------------------+
| ["i", "love", "china", "我", "爱", "我", "的", "祖", "国"] |
+-------------------------------------------------------------------+
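The english parser's documented behavior (lowercase, then split on spaces and punctuation) can be approximated in Python. This is an approximation for building intuition, not SelectDB's exact tokenization algorithm:

```python
import re

# Approximation of the "english" parser: lowercase the input, then
# split on any run of characters that is not a letter or digit.
def english_tokenize(text):
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

print(english_tokenize("I love CHINA"))  # ['i', 'love', 'china']
```

Comparing this output with TOKENIZE on real data is a quick way to confirm whether a parser splits a value the way your queries expect (for example, hyphenated terms like real-time split into separate tokens under punctuation-based parsing).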