ApsaraDB for SelectDB: Inverted Index

Last Updated: Mar 28, 2026

An inverted index maps tokens to the rows that contain them, letting SelectDB skip irrelevant data entirely instead of scanning every row. Use inverted indexes for full-text search on string columns and for equality or range queries on numeric and date columns.

How it works

Each row in a SelectDB table is treated as a document, and each column is a field in that document. When you create an inverted index, SelectDB tokenizes the column values and builds a token-to-row mapping in a separate inverted file.

For example, given three rows in a comments table:

Row 1: "OLAP databases are fast"
Row 2: "Fast databases enable real-time analytics"
Row 3: "Real-time OLAP is ideal for dashboards"

The index maps tokens to rows like this:

olap      → [1, 3]
databases → [1, 2]
fast      → [1, 2]
analytics → [2]
real-time → [2, 3]

A search for OLAP is answered by reading a single index entry rather than by scanning every row.
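The mapping above can be sketched with a plain dictionary. This is a conceptual illustration of the token-to-row structure only, not SelectDB's on-disk format; whitespace splitting stands in for real tokenization:

```python
from collections import defaultdict

# The three example rows, keyed by row ID.
rows = {
    1: "OLAP databases are fast",
    2: "Fast databases enable real-time analytics",
    3: "Real-time OLAP is ideal for dashboards",
}

# Build the token-to-row mapping: lowercase, split on whitespace,
# record each row ID once per distinct token.
index = defaultdict(list)
for row_id, text in rows.items():
    for token in dict.fromkeys(text.lower().split()):  # dedupe, keep order
        index[token].append(row_id)

print(index["olap"])       # [1, 3]
print(index["real-time"])  # [2, 3]
```

Looking up a token returns its posting list of row IDs directly, which is why the lookup cost does not grow with the number of non-matching rows.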

The inverted file has only a logical relationship with the main data file (segment), not a physical one. This means adding or dropping an inverted index does not rewrite the main data file, so index management has minimal overhead.

Use cases

  • Full-text search on string columns

  • Equality and range filtering (=, !=, >, >=, <, <=) on string, numeric, and datetime columns

Benefits

  • Push down OR and NOT logic to the index layer

  • Support arbitrary combinations of AND, OR, and NOT across multiple conditions

  • Create inverted indexes at table creation time (synchronous)

  • Add inverted indexes to existing tables (asynchronous)

  • Delete existing inverted indexes from tables (asynchronous)
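The AND/OR/NOT pushdown can be pictured as set algebra over the index's posting lists. A toy sketch with hypothetical row IDs, not SelectDB internals:

```python
# Posting lists (token -> set of row IDs), as an inverted index would store them.
olap, fast, analytics = {1, 3}, {1, 2}, {2}
all_rows = {1, 2, 3}

# The index layer evaluates boolean filters as set operations,
# before any row data is read.
print(olap & fast)                # AND      -> {1}
print(olap | analytics)           # OR       -> {1, 2, 3}
print(all_rows - fast)            # NOT      -> {3}
print((olap | fast) - analytics)  # OR + NOT -> {1, 3}
```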

Limitations

Limitation | Workaround
FLOAT and DOUBLE are not supported (floating-point precision) | Use DECIMAL, which supports inverted indexes
MAP, STRUCT, JSON, HLL, BITMAP, QUANTILE_STATE, and AGG_STATE are not supported | For JSON, use VARIANT instead
Numeric columns do not support a tokenizer (the parser property is not applicable) | None
Aggregate model: inverted indexes are supported only on key columns | None
Unique model without Merge-on-Write: inverted indexes are supported only on key columns | Enable Merge-on-Write to support inverted indexes on any column
Duplicate model: inverted indexes are supported on any column | Not applicable

Note: The Aggregate model and the Unique model without Merge-on-Write must read all data before merging, so the engine cannot use indexes for early filtering on non-key columns.

Create an inverted index

At table creation (synchronous)

The index is ready as soon as the table is created.

Important

Index coverage depends on the data model:

  • Duplicate model: any column

  • Unique model with Merge-on-Write enabled: any column

  • Aggregate model and Unique model without Merge-on-Write: key columns only

Syntax

CREATE TABLE [IF NOT EXISTS] [db_name.]<table_name>
(
  <column_definition_list>,
  [<index_definition_list>]
)
table_properties;

index_definition syntax:

INDEX <index_name>(<column_name>) USING INVERTED [PROPERTIES("<key>" = "<value>")] [COMMENT '<comment>']

Parameters

Parameter | Required | Description
db_name | No | Name of the target database
table_name | Yes | Name of the target table
column_definition_list | Yes | List of column definitions. See CREATE TABLE
table_properties | Yes | Table properties: data model, partitioning, bucketing. See Data model
index_definition_list | No | One or more index definitions, comma-separated

`index_definition` required parameters

Parameter | Description
index_name | Name of the index
column_name | Name of the indexed column
index_type | Always USING INVERTED

`PROPERTIES` (optional)

Use PROPERTIES to control tokenization behavior. Specify one or more key-value pairs in the format "<key>" = "<value>", separated by commas. To preview how a string will be tokenized before creating the index, use the TOKENIZE function. See Tokenization functions.

Key | Description
parser | Tokenizer. Not applicable to numeric columns. Options: english (splits on spaces and punctuation, highest performance); chinese (Chinese text, lower performance than english); unicode (multilingual text, including mixed Chinese and English, email addresses, IP addresses, and alphanumeric strings). Default: no tokenizer; the entire value is indexed as a single token.
parser_mode | Tokenization granularity. All tokenizers default to coarse_grained (prefers longer tokens). When parser = chinese, fine_grained is also available (prefers shorter tokens). Example: '武汉市长江大桥' (Wuhan Yangtze River Bridge) → coarse_grained: ['武汉市', '长江大桥']; fine_grained: ['武汉', '武汉市', '市长', '长江', '长江大桥', '大桥'].
support_phrase | Enables phrase query acceleration for MATCH_PHRASE. Default: false. Set to true to enable (requires additional storage). When false, use MATCH_ALL for multi-keyword queries instead.
char_filter | Preprocesses strings before tokenization. Only char_replace is supported: each character in char_filter_pattern is replaced with the corresponding character in char_filter_replacement (defaults to a space if not set).
ignore_above | Maximum length for non-tokenized string indexes (when parser is not set). Strings longer than this value are not indexed; for arrays, the limit applies per element. Default: 256 bytes.
lower_case | true: converts tokens to lowercase for case-insensitive matching. false: no conversion.
stopwords | Stopword list used during ingestion and querying. The built-in list includes common words such as is, the, and a. Set to none to use an empty list.
dict_compression | Enables ZSTD dictionary compression for the index dictionary. true: enables compression (reduces storage by approximately 20% for large text and log workloads; works best with inverted_index_storage_format = "V3"). false (default): disabled. Enable for large text or log datasets where storage cost matters. Requires version 4.1.0 or later.
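To see how parser, lower_case, and stopwords interact, here is a rough Python stand-in for the english parser. The real tokenizer runs server-side and may differ in detail; use the TOKENIZE function to preview actual behavior:

```python
import re

def english_tokenize(text, lower_case=True, stopwords=("is", "the", "a")):
    """Crude approximation of the english parser: split on non-alphanumeric
    characters, optionally lowercase, then drop stopwords."""
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]
    if lower_case:
        tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in stopwords]

print(english_tokenize("OLAP is the FAST option"))
# ['olap', 'fast', 'option']
print(english_tokenize("OLAP is the FAST option", lower_case=False))
# ['OLAP', 'FAST', 'option']
```

Note that with lower_case disabled, "OLAP" and "olap" would be indexed as distinct tokens, so equality of case matters at query time.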

COMMENT (optional)

Parameter | Description
comment | Description of the index

Example

-- Creates an inverted index on the comment column using English tokenization.
-- Omit PROPERTIES("parser" = "english") to index the full string without tokenization.
-- Also supports "chinese" and "unicode".
CREATE TABLE hackernews_1m
(
    `id`          BIGINT,
    `deleted`     TINYINT,
    `type`        String,
    `author`      String,
    `timestamp`   DateTimeV2,
    `comment`     String,
    `dead`        TINYINT,
    `parent`      BIGINT,
    `poll`        BIGINT,
    `children`    Array<BIGINT>,
    `url`         String,
    `score`       INT,
    `title`       String,
    `parts`       Array<INT>,
    `descendants` INT,
    INDEX idx_comment (`comment`) USING INVERTED PROPERTIES("parser" = "english") COMMENT 'inverted index for comment'
)
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 10;

Add an index to an existing table (asynchronous)

This is an asynchronous operation. Track progress with SHOW ALTER TABLE COLUMN.

Syntax

ALTER TABLE <table_name> ADD INDEX <index_name>(<column_name>) USING INVERTED [PROPERTIES("<key>" = "<value>")];

Parameters are the same as for table creation.

Examples

-- Add an index without tokenization (treats the full value as a single token)
ALTER TABLE user_tb ADD INDEX index_userId(user_id) USING INVERTED;

-- Add an index with English tokenization
ALTER TABLE user_tb ADD INDEX index_city(city) USING INVERTED PROPERTIES("parser" = "english");

View indexes

View all indexes for a table

SHOW INDEXES FROM <table_name>;

Example:

SHOW INDEXES FROM user_tb;

Check async operation progress

ALTER TABLE ADD INDEX and DROP INDEX are asynchronous. Check progress with:

SHOW ALTER TABLE COLUMN;

Drop an index

Dropping an index is asynchronous. Both syntaxes are equivalent.

Important

Dropping an index may reduce query performance. Proceed with caution.

-- Syntax 1
DROP INDEX <index_name> ON <table_name>;

-- Syntax 2
ALTER TABLE <table_name> DROP INDEX <index_name>;

Examples:

DROP INDEX index_userId ON user_tb;
ALTER TABLE user_tb DROP INDEX index_city;

Query with inverted indexes

Full-text search

SELECT * FROM <table_name> WHERE <column_name> <match_function> '<keywords>';

Parameter | Required | Description
table_name | Yes | Name of the target table
column_name | Yes | Name of the indexed column
match_function | Yes | MATCH_ANY matches rows containing any of the keywords; MATCH_ALL matches rows containing all keywords; MATCH_PHRASE matches rows containing the exact phrase. Combine with AND, OR, and NOT for complex filters.
keywords | Yes | One or more keywords, separated by spaces. Example: keyword1 keyword2 keyword3

Examples

-- Rows where logmsg contains "keyword1"
SELECT * FROM log_tb WHERE logmsg MATCH_ANY 'keyword1';

-- Rows where logmsg contains "keyword1" or "keyword2"
SELECT * FROM log_tb WHERE logmsg MATCH_ANY 'keyword1 keyword2';

-- Rows where logmsg contains both "keyword1" and "keyword2"
SELECT * FROM log_tb WHERE logmsg MATCH_ALL 'keyword1 keyword2';

-- Rows where logmsg contains "keyword1" immediately followed by "keyword2"
SELECT * FROM log_tb WHERE logmsg MATCH_PHRASE 'keyword1 keyword2';
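The three match functions differ only in how keyword hits are combined; a conceptual Python sketch, with whitespace splitting standing in for the real parser:

```python
def tokens(text):
    return text.lower().split()

def match_any(doc, keywords):
    hits = set(tokens(doc))
    return any(k in hits for k in tokens(keywords))

def match_all(doc, keywords):
    hits = set(tokens(doc))
    return all(k in hits for k in tokens(keywords))

def match_phrase(doc, keywords):
    # All keywords must appear consecutively, in order.
    d, k = tokens(doc), tokens(keywords)
    return any(d[i:i + len(k)] == k for i in range(len(d) - len(k) + 1))

doc = "keyword1 filler keyword2"
print(match_any(doc, "keyword1 keyword2"))     # True
print(match_all(doc, "keyword1 keyword2"))     # True
print(match_phrase(doc, "keyword1 keyword2"))  # False: tokens not adjacent
```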

Equality and range queries

For numeric and date columns, use standard SQL syntax. SelectDB automatically uses the inverted index when one exists.

-- Equality query
SELECT * FROM user_tb WHERE id = 123;

-- Range query
SELECT * FROM user_tb WHERE ts > '2023-01-01 00:00:00';

-- IN query
SELECT * FROM user_tb WHERE op_type IN ('add', 'delete');
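Conceptually, a range filter over an inverted index walks the sorted term dictionary and unions the posting lists of every matching term. A toy sketch with hypothetical values, not SelectDB's actual implementation:

```python
import bisect

# Sorted term dictionary and posting lists (value -> row IDs); made-up data.
terms = [10, 25, 25_000, 111_890]
postings = {10: [4], 25: [1, 7], 25_000: [2], 111_890: [3, 5]}

def rows_greater_than(low):
    """Rows where value > low: bisect to the cut point, union the postings."""
    start = bisect.bisect_right(terms, low)
    hits = set()
    for term in terms[start:]:
        hits.update(postings[term])
    return sorted(hits)

print(rows_greater_than(20))      # [1, 2, 3, 5, 7]
print(rows_greater_than(30_000))  # [3, 5]
```

Because the dictionary is sorted, locating the boundary term is logarithmic, and only terms inside the range are touched.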

Performance benchmark

The following results use 1,000,000 rows from the hackernews dataset to compare query performance with and without inverted indexes.

Set up the environment

Step 1: Create the table

CREATE DATABASE test_inverted_index;
USE test_inverted_index;

CREATE TABLE hackernews_1m
(
    `id`          BIGINT,
    `deleted`     TINYINT,
    `type`        String,
    `author`      String,
    `timestamp`   DateTimeV2,
    `comment`     String,
    `dead`        TINYINT,
    `parent`      BIGINT,
    `poll`        BIGINT,
    `children`    Array<BIGINT>,
    `url`         String,
    `score`       INT,
    `title`       String,
    `parts`       Array<INT>,
    `descendants` INT,
    INDEX idx_comment (`comment`) USING INVERTED PROPERTIES("parser" = "english") COMMENT 'inverted index for comment'
)
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 10;

Step 2: Import data

  1. Download the dataset.

    wget https://qa-build.oss-cn-beijing.aliyuncs.com/regression/index/hacknernews_1m.csv.gz
  2. Import using Stream Load. Find the connection host and port on the instance details page in the ApsaraDB for SelectDB console. For more information, see Stream Load.

    curl --location-trusted -u root: -H "compress_type:gz" \
      -T hacknernews_1m.csv.gz \
      http://<host>:<port>/api/test_inverted_index/hackernews_1m/_stream_load

    Expected output:

    {
        "TxnId": 2,
        "Label": "a8a3e802-2329-49e8-912b-04c800a461a6",
        "TwoPhaseCommit": "false",
        "Status": "Success",
        "Message": "OK",
        "NumberTotalRows": 1000000,
        "NumberLoadedRows": 1000000,
        "NumberFilteredRows": 0,
        "NumberUnselectedRows": 0,
        "LoadBytes": 130618406,
        "LoadTimeMs": 8988,
        "BeginTxnTimeMs": 23,
        "StreamLoadPutTimeMs": 113,
        "ReadDataTimeMs": 4788,
        "WriteDataTimeMs": 8811,
        "CommitAndPublishTimeMs": 38
    }
  3. Verify the import.

    SELECT count() FROM hackernews_1m;
    +---------+
    | count() |
    +---------+
    | 1000000 |
    +---------+

Full-text search results

Tokenized inverted index results may differ slightly from LIKE results because tokenization normalizes input (for example, converting to lowercase).
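The difference is easy to reproduce: LIKE '%OLAP%' is a case-sensitive substring test, while MATCH_ANY with the english parser compares whole tokens after lowercasing. A small illustration with made-up strings:

```python
docs = ["OLAP rocks", "olap rocks", "COOLAPP demo"]

# LIKE '%OLAP%': case-sensitive substring match.
like_hits = [d for d in docs if "OLAP" in d]

# MATCH_ANY 'OLAP': lowercased whole-token match.
match_hits = [d for d in docs if "olap" in d.lower().split()]

print(like_hits)   # ['OLAP rocks', 'COOLAPP demo'] (substring inside "COOLAPP")
print(match_hits)  # ['OLAP rocks', 'olap rocks']
```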
Query | LIKE | Inverted index | Speedup
comment contains OLAP | 0.18 s | 0.02 s (MATCH_ANY) | 9x
comment contains OLTP | 0.07 s | 0.01 s (MATCH_ANY) | 7x
comment contains both OLAP and OLTP | 0.13 s | 0.01 s (MATCH_ALL) | 12x
comment contains OLAP or OLTP | 0.12 s | 0.01 s (MATCH_ANY) | 12x

Example queries:

-- LIKE (0.18 s)
SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%';

-- Inverted index (0.02 s)
SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP';

-- LIKE (0.13 s)
SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' AND comment LIKE '%OLTP%';

-- Inverted index (0.01 s)
SELECT count() FROM hackernews_1m WHERE comment MATCH_ALL 'OLAP OLTP';

Equality and range query results

Query | Before index | After index | Speedup
timestamp > '2007-08-23 04:17:00' | 0.03 s | 0.01 s | 2x
author = 'faster' | 0.03 s | 0.01 s | 2x
parent = 11189 | 0.01 s | 0.01 s | No change

Steps for the timestamp example:

-- 1. Baseline: 0.03 s
SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00';

-- 2. Create the index
CREATE INDEX idx_timestamp ON hackernews_1m(timestamp) USING INVERTED;

-- 3. Check progress (building 1M rows takes approximately 1 second)
SHOW ALTER TABLE COLUMN;

-- 4. After the index is ready: 0.01 s
SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00';

Steps for the author example:

-- 1. Baseline: 0.03 s
SELECT count() FROM hackernews_1m WHERE author = 'faster';

-- 2. Create the index (no tokenizer — indexes the full string as one token)
ALTER TABLE hackernews_1m ADD INDEX idx_author(author) USING INVERTED;

-- 3. Check progress (building 1M rows takes approximately 1.5 seconds)
SHOW ALTER TABLE COLUMN;

-- 4. After the index is ready: 0.01 s
SELECT count() FROM hackernews_1m WHERE author = 'faster';

Tokenization functions

Use TOKENIZE to preview how a string will be tokenized before creating an index. This helps you choose the right parser and parser_mode settings.

TOKENIZE('<text>', '"parser"="<parser_name>"[,"parser_mode"="<mode>"]')

Parameter | Description
parser | english, chinese, or unicode. See the parser key in the PROPERTIES table above.
parser_mode | coarse_grained or fine_grained. Applies only when parser = chinese.

Examples

-- English tokenization: lowercases and splits on spaces and punctuation
SELECT TOKENIZE('I love CHINA', '"parser"="english"');
+------------------------------------------------+
| tokenize('I love CHINA', '"parser"="english"') |
+------------------------------------------------+
| ["i", "love", "china"]                         |
+------------------------------------------------+

-- Chinese tokenization, fine-grained: produces shorter tokens with overlapping phrases
SELECT TOKENIZE('武汉长江大桥', '"parser"="chinese","parser_mode"="fine_grained"');
+-----------------------------------------------------------------------------------+
| tokenize('武汉长江大桥', '"parser"="chinese","parser_mode"="fine_grained"')       |
+-----------------------------------------------------------------------------------+
| ["武汉", "武汉长江大桥", "长江", "长江大桥", "大桥"]                              |
+-----------------------------------------------------------------------------------+

-- Chinese tokenization, coarse-grained: produces fewer, longer tokens
SELECT TOKENIZE('武汉市长江大桥', '"parser"="chinese","parser_mode"="coarse_grained"');
+----------------------------------------------------------------------------------------+
| tokenize('武汉市长江大桥', '"parser"="chinese","parser_mode"="coarse_grained"')        |
+----------------------------------------------------------------------------------------+
| ["武汉市", "长江大桥"]                                                                 |
+----------------------------------------------------------------------------------------+

-- Unicode tokenization: handles mixed Chinese and English text
SELECT TOKENIZE('I love CHINA 我爱我的祖国', '"parser"="unicode"');
+-------------------------------------------------------------------+
| tokenize('I love CHINA 我爱我的祖国', '"parser"="unicode"')       |
+-------------------------------------------------------------------+
| ["i", "love", "china", "我", "爱", "我", "的", "祖", "国"]        |
+-------------------------------------------------------------------+