Inverted indexes are widely used in information retrieval. Text is tokenized into terms, and the index maps each term to the documents that contain it, so documents that contain specified terms can be retrieved quickly. ApsaraDB for SelectDB supports inverted indexes. You can use inverted indexes to perform full-text searches on data of the TEXT type and equality or range queries on data of the NUMERIC and DATE types, which lets you quickly retrieve qualifying rows from large amounts of data. This topic describes the features of inverted indexes in ApsaraDB for SelectDB and how to create and use them.
Implementation
In the ApsaraDB for SelectDB implementation of inverted indexes, each row in a table corresponds to a document, and each column corresponds to a field in the document. An inverted index can therefore quickly locate the rows that contain specific terms, which accelerates queries with WHERE clauses.
Unlike other indexes in ApsaraDB for SelectDB, inverted indexes are stored in separate files. These files are logically mapped to the segment files but are physically independent of them. As a result, updating or deleting an inverted index does not require rewriting the segment file, which significantly reduces processing overhead.
Scenarios
You want to accelerate full-text searches for data of the STRING type.
You want to accelerate searches for data of the STRING, NUMERIC, or DATETIME type by using the following operators:
=, !=, >, >=, <, and <=.
Benefits
Support logical combinations:
Support index filter pushdown for the OR and NOT operators.
Support a combination of the AND, OR, and NOT operators (see the sketch after this list).
Support flexible and quick management of indexes.
Allow you to create inverted indexes when you create a table.
Allow you to create inverted indexes for an existing table.
Allow you to delete existing inverted indexes from a table.
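The following is a minimal sketch of how full-text search keywords can be combined with logical operators; the log_tb table and logmsg column follow the full-text search examples later in this topic and are illustrative only.
-- Rows whose logmsg contains keyword1 but does not contain keyword2. The NOT condition can be pushed down to the inverted index.
SELECT * FROM log_tb WHERE logmsg MATCH_ANY 'keyword1' AND NOT (logmsg MATCH_ANY 'keyword2');
-- Rows whose logmsg contains keyword1, or contains both keyword2 and keyword3.
SELECT * FROM log_tb WHERE logmsg MATCH_ANY 'keyword1' OR logmsg MATCH_ALL 'keyword2 keyword3';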
Limits
The FLOAT and DOUBLE data types do not support inverted indexes because they are not high-precision types. The DECIMAL data type is high precision and supports inverted indexes; use DECIMAL instead.
Complex data types, including MAP, STRUCT, JSON, HLL, BITMAP, QUANTILE_STATE, and AGG_STATE, do not support inverted indexes. The JSON data type can support inverted indexes after it is converted into the VARIANT data type.
Columns of the NUMERIC type support inverted indexes, but you cannot specify the parser property for them. The parser property (valid values: english, chinese, and unicode) applies only to text columns.
You can create inverted indexes on any column in the DUPLICATE model and in the UNIQUE model with Merge-on-Write (MoW) enabled. In the AGGREGATE model and the UNIQUE model with MoW disabled, you can create inverted indexes only on key columns. These models must read and merge all data, so indexes on other columns cannot be used to filter data in advance.
Create inverted indexes
You can create inverted indexes when you create a table, or add them to columns of an existing table.
Create inverted indexes when you create a table
This operation is synchronous: the inverted indexes are created together with the table.
Inverted indexes have different limits in the following data models:
In the Aggregate key model, you can create inverted indexes only on key columns.
In the Unique key model, you must enable the MoW feature. After you enable MoW, you can create an inverted index on any column (see the sketch after this list).
In the Duplicate key model, you can create an inverted index on any column.
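The following is a minimal sketch of a Unique key table with MoW enabled and an inverted index on a non-key column; the table and column names are hypothetical, and the example assumes that MoW is controlled by the enable_unique_key_merge_on_write table property (verify the property name against your instance version).
CREATE TABLE example_uniq_tb
(
    `id` BIGINT,
    `msg` String,
    -- An inverted index on the non-key msg column, which is allowed because MoW is enabled.
    INDEX idx_msg (`msg`) USING INVERTED PROPERTIES("parser" = "english") COMMENT 'inverted index for msg'
)
UNIQUE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 10
PROPERTIES ("enable_unique_key_merge_on_write" = "true");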
Syntax
CREATE TABLE [IF NOT EXISTS] [db_name.]<table_name>
(
<column_definition_list>,
[<index_definition_list>]
)
table_properties;
Parameters
Parameters that are used to create a table
Parameter | Required | Description |
db_name | No | The name of the database in which you want to create a table. |
table_name | Yes | The name of the table that you want to create. |
column_definition_list | Yes | The list of column definitions. For more information, see CREATE-TABLE. |
table_properties | Yes | The properties of the table, such as the data model and partitioning and bucketing settings. For more information, see Data models. |
index_definition_list | No | The list of index definitions. |
index_definition_list
You can define multiple indexes when you create a table by separating the index definitions with commas: index_definition[, index_definition][, index_definition]....
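For example, the following is a minimal sketch of an index_definition_list that defines two inverted indexes; the column names are hypothetical, and the syntax of each index_definition is described below.
INDEX idx_msg (`msg`) USING INVERTED PROPERTIES("parser" = "english") COMMENT 'inverted index for msg',
INDEX idx_user_id (`user_id`) USING INVERTED COMMENT 'inverted index for user_id'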
Syntax of index_definition
INDEX <index_name>(<column_name>) <index_type> [PROPERTIES("<key>" = "<value>")] [COMMENT '<comment>']
Parameters of index_definition
Required parameters
Parameter | Description |
index_name | The name of the index. |
column_name | The name of the column on which the index is created. |
index_type | The type of the index. Set the value to USING INVERTED. |
Optional parameters
PROPERTIES
The PROPERTIES parameter specifies how the index tokenizes text into terms. The value of the PROPERTIES parameter consists of one or more key-value pairs that are separated by commas (,). Each key-value pair is in the format of "<key>" = "<value>". To view the tokenization result of continuous text, you can call the TOKENIZE function. For more information, see TOKENIZE function.
key | value |
parser | The tokenizer. By default, this key is left empty, which specifies that text is not tokenized into terms. You are not allowed to specify parser for fields of the NUMERIC type. Valid values: english (English tokenizer), chinese (Chinese tokenizer), and unicode (multi-language tokenizer). |
parser_mode | The word tokenization mode, which determines the granularity of word tokenization. Valid values: coarse_grained and fine_grained. By default, the coarse_grained mode is used. In coarse_grained mode, text is tokenized into longer terms. For example, 武汉市长江大桥 is tokenized into 武汉市 and 长江大桥. In fine_grained mode, text is tokenized into shorter terms. For example, 武汉长江大桥 is tokenized into 武汉, 武汉长江大桥, 长江, 长江大桥, and 大桥. For more information about how to use word tokenization, see TOKENIZE function. |
support_phrase | Specifies whether to use the index to accelerate MATCH_PHRASE queries. Valid values: true and false. Default value: false. |
char_filter | The parameters that are used to preprocess strings before text is tokenized into terms. The char_filter_type parameter can be set only to char_replace. If you set the char_filter_type parameter to char_replace, characters in the value of the char_filter_pattern parameter are replaced with the characters in the value of the char_filter_replacement parameter. |
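The following is a minimal sketch of an index definition that combines these properties; the msg column name is hypothetical, and the char_filter settings should be verified against your instance version.
-- Tokenize with the unicode tokenizer, accelerate MATCH_PHRASE queries,
-- and replace the . and _ characters with spaces before tokenization.
INDEX idx_msg (`msg`) USING INVERTED PROPERTIES(
    "parser" = "unicode",
    "support_phrase" = "true",
    "char_filter_type" = "char_replace",
    "char_filter_pattern" = "._",
    "char_filter_replacement" = " "
) COMMENT 'inverted index for msg'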
COMMENT
Parameter | Description |
comment | The description of the index. |
Example
-- Create a table and an inverted index named idx_comment on the comment column.
-- USING INVERTED specifies that the index type is inverted index.
-- PROPERTIES("parser" = "english") specifies that the English tokenizer is used to tokenize text into terms. You can also set the parser parameter to chinese or unicode. If you leave the parser parameter empty, text is not tokenized into terms.
CREATE TABLE hackernews_1m
(
`id` BIGINT,
`deleted` TINYINT,
`type` String,
`author` String,
`timestamp` DateTimeV2,
`comment` String,
`dead` TINYINT,
`parent` BIGINT,
`poll` BIGINT,
`children` Array<BIGINT>,
`url` String,
`score` INT,
`title` String,
`parts` Array<INT>,
`descendants` INT,
INDEX idx_comment (`comment`) USING INVERTED PROPERTIES("parser" = "english") COMMENT 'inverted index for comment'
)
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 10;
Create inverted indexes for an existing table
This operation is asynchronous. You can execute the SHOW ALTER TABLE COLUMN; statement to query the index creation progress.
Syntax
ALTER TABLE <table_name> ADD INDEX <index_name>(<column_name>) <index_type> [PROPERTIES("<key>" = "<value>")];
Parameters
The parameters are the same as those used to create inverted indexes when you create a table.
Examples
Create an inverted index that is not used to split text into terms.
ALTER TABLE user_tb ADD INDEX index_userId(user_id) USING INVERTED;
Create an inverted index that is used to split text into terms by using the English tokenizer.
ALTER TABLE user_tb ADD INDEX index_city(city) USING INVERTED PROPERTIES("parser" = "english");
Query the information about inverted indexes
Query the index change progress
You can execute the ALTER or DROP statement to change an inverted index. This operation is asynchronous. You can execute the following statement to query the change progress:
SHOW ALTER TABLE COLUMN;
Query all indexes that are created for a table
Syntax
SHOW INDEXES FROM <table_name>;
Example
SHOW INDEXES FROM user_tb;
Delete inverted indexes
You can delete indexes in asynchronous mode. For more information about how to query the index deletion progress, see Query the information about inverted indexes.
Deleting inverted indexes from a table may degrade the performance of queries on that table. Proceed with caution.
Syntax
-- Syntax 1
DROP INDEX <index_name> ON <table_name>;
-- Syntax 2
ALTER TABLE <table_name> DROP INDEX <index_name>;
Example
DROP INDEX index_userId ON user_tb;
ALTER TABLE user_tb DROP INDEX index_city;
Use inverted indexes
Full-text searches
Syntax
SELECT * FROM <table_name> WHERE <column_name> <conditional_logic> '<keywords>';
Parameters
Parameter | Required | Description |
table_name | Yes | The name of the table that you want to query. |
column_name | Yes | The name of the column that you want to query. |
conditional_logic | Yes | The conditional logic, which combines full-text search keywords with logical operators. Logical operators: AND, OR, and NOT. Full-text search keywords: MATCH_ANY (matches rows that contain any of the specified terms), MATCH_ALL (matches rows that contain all of the specified terms), and MATCH_PHRASE (matches rows that contain the specified terms as a phrase, in order). |
keywords | Yes | The terms that are used to query data. Separate multiple terms with spaces. Example: 'keyword1 keyword2'. |
Examples
-- Query all rows that contain keyword1 in the logmsg column from the log_tb table.
SELECT * FROM log_tb WHERE logmsg MATCH_ANY 'keyword1';
-- Query all rows that contain keyword1 or keyword2 in the logmsg column from the log_tb table.
SELECT * FROM log_tb WHERE logmsg MATCH_ANY 'keyword1 keyword2';
-- Query all rows that contain keyword1 and keyword2 in the logmsg column from the log_tb table.
SELECT * FROM log_tb WHERE logmsg MATCH_ALL 'keyword1 keyword2';
-- Query all rows that contain keyword1 and keyword2 in the logmsg column from the log_tb table and in which keyword2 follows keyword1.
SELECT * FROM log_tb WHERE logmsg MATCH_PHRASE 'keyword1 keyword2';
Equality or range queries on data of the NUMERIC and DATE types
In these scenarios, the query syntax is the same as the standard SQL syntax.
Example
-- Perform an equality query, a range query, and a query that contains the IN or NOT IN condition.
SELECT * FROM user_tb WHERE id = 123;
SELECT * FROM user_tb WHERE ts > '2023-01-01 00:00:00';
SELECT * FROM user_tb WHERE op_type IN ('add', 'delete');
Compare query performance
In this example, the hackernews_1m table that contains 1 million data entries is used to compare query performance before and after inverted indexes are used.
Environment preparations
Step 1: Create a table
Create a database.
CREATE DATABASE test_inverted_index;
Switch to the created database.
USE test_inverted_index;
Create a table.
-- Create a table and an inverted index named idx_comment on the comment column.
-- USING INVERTED specifies that the index type is inverted index.
-- PROPERTIES("parser" = "english") specifies that the English tokenizer is used to split text into terms. You can also set the parser parameter to chinese or unicode. If you leave the parser parameter empty, text is not split into terms.
CREATE TABLE hackernews_1m
(
    `id` BIGINT,
    `deleted` TINYINT,
    `type` String,
    `author` String,
    `timestamp` DateTimeV2,
    `comment` String,
    `dead` TINYINT,
    `parent` BIGINT,
    `poll` BIGINT,
    `children` Array<BIGINT>,
    `url` String,
    `score` INT,
    `title` String,
    `parts` Array<INT>,
    `descendants` INT,
    INDEX idx_comment (`comment`) USING INVERTED PROPERTIES("parser" = "english") COMMENT 'inverted index for comment'
)
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 10;
Step 2: Import data
Import data to the created table.
Download the data file.
wget https://qa-build.oss-cn-beijing.aliyuncs.com/regression/index/hacknernews_1m.csv.gz
Import data to the table by using Stream Load.
On the Instance Details page of an ApsaraDB for SelectDB instance, you can view the endpoint and port number of the ApsaraDB for SelectDB instance. For more information about Stream Load, see Import data by using Stream Load.
curl --location-trusted -u root: -H "compress_type:gz" -T hacknernews_1m.csv.gz http://<host>:<port>/api/test_inverted_index/hackernews_1m/_stream_load
{
    "TxnId": 2,
    "Label": "a8a3e802-2329-49e8-912b-04c800a461a6",
    "TwoPhaseCommit": "false",
    "Status": "Success",
    "Message": "OK",
    "NumberTotalRows": 1000000,
    "NumberLoadedRows": 1000000,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 130618406,
    "LoadTimeMs": 8988,
    "BeginTxnTimeMs": 23,
    "StreamLoadPutTimeMs": 113,
    "ReadDataTimeMs": 4788,
    "WriteDataTimeMs": 8811,
    "CommitAndPublishTimeMs": 38
}
Call the count() function to check whether the data is imported.
SELECT count() FROM hackernews_1m;
+---------+
| count() |
+---------+
| 1000000 |
+---------+
1 row in set (0.02 sec)
Performance comparison
The results of a query that uses a tokenizing inverted index can differ from the results of the same query without an index. The inverted index splits the values in a column into terms and normalizes the terms by converting them to lowercase, so a query that uses the inverted index may return more rows than one that does not.
The query performance in some examples does not differ significantly because the dataset is small. The larger the dataset, the greater the performance difference.
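As a minimal illustration of this normalization, you can call the TOKENIZE function (described at the end of this topic) on a sample string; the input text is illustrative only.
-- Terms are lowercased by the English tokenizer; this is why MATCH queries can return rows that a case-sensitive LIKE pattern misses.
SELECT TOKENIZE('OLAP and OLTP engines', '"parser"="english"');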
Full-text searches
Count the number of rows that contain OLAP in the comment column.
Count the number of rows that contain OLAP in the comment column based on the LIKE condition. The query takes 0.18 seconds.
SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%';
+---------+
| count() |
+---------+
|      34 |
+---------+
1 row in set (0.18 sec)
Count the number of rows that contain OLAP in the comment column based on the full-text search keyword MATCH_ANY. The query takes 0.02 seconds. The query performance based on the full-text search keyword is eight times higher than that based on the LIKE condition.
SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP';
+---------+
| count() |
+---------+
|      35 |
+---------+
1 row in set (0.02 sec)
Count the number of rows that contain OLTP in the comment column.
Count the number of rows that contain OLTP in the comment column based on the LIKE condition. The query takes 0.07 seconds.
SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLTP%';
+---------+
| count() |
+---------+
|      48 |
+---------+
1 row in set (0.07 sec)
Count the number of rows that contain OLTP in the comment column based on the full-text search keyword MATCH_ANY. The query takes 0.01 seconds. The query performance based on the full-text search keyword is six times higher than that based on the LIKE condition.
SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLTP';
+---------+
| count() |
+---------+
|      51 |
+---------+
1 row in set (0.01 sec)
Count the number of rows that contain both OLAP and OLTP in the comment column.
Count the number of rows that contain both OLAP and OLTP in the comment column based on the LIKE condition. The query takes 0.13 seconds.
SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' AND comment LIKE '%OLTP%';
+---------+
| count() |
+---------+
|      14 |
+---------+
1 row in set (0.13 sec)
Count the number of rows that contain both OLAP and OLTP in the comment column based on the full-text search keyword MATCH_ALL. The query takes 0.01 seconds. The query performance based on the full-text search keyword is 12 times higher than that based on the LIKE condition.
SELECT count() FROM hackernews_1m WHERE comment MATCH_ALL 'OLAP OLTP';
+---------+
| count() |
+---------+
|      15 |
+---------+
1 row in set (0.01 sec)
Count the number of rows that contain OLAP or OLTP in the comment column.
Count the number of rows that contain OLAP or OLTP in the comment column based on the LIKE condition. The query takes 0.12 seconds.
SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' OR comment LIKE '%OLTP%';
+---------+
| count() |
+---------+
|      68 |
+---------+
1 row in set (0.12 sec)
Count the number of rows that contain OLAP or OLTP in the comment column based on the full-text search keyword. The query takes 0.01 seconds. The query performance based on the full-text search keyword is 11 times higher than that based on the LIKE condition.
SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP OLTP';
+---------+
| count() |
+---------+
|      71 |
+---------+
1 row in set (0.01 sec)
Equality or range queries
Compare the performance of column-range queries on data of the DATETIME type.
Query the data whose value is greater than 2007-08-23 04:17:00 in the timestamp column before an inverted index is created. The query takes 0.03 seconds.
SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00';
+---------+
| count() |
+---------+
|  999081 |
+---------+
1 row in set (0.03 sec)
Create an inverted index on the timestamp column.
CREATE INDEX idx_timestamp ON hackernews_1m(timestamp) USING INVERTED;
Query OK, 0 rows affected (0.03 sec)
Query the index creation progress. Based on the difference between the values of the FinishTime and CreateTime parameters, you can see that it takes only about 1 second to create an inverted index on the timestamp column that contains 1 million data entries.
SHOW ALTER TABLE COLUMN;
+-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+
| JobId | TableName     | CreateTime              | FinishTime              | IndexName     | IndexId | OriginIndexId | SchemaVersion | TransactionId | State    | Msg  | Progress | Timeout |
+-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+
| 10030 | hackernews_1m | 2023-02-10 19:44:12.929 | 2023-02-10 19:44:13.938 | hackernews_1m | 10031   | 10008         | 1:1994690496  | 3             | FINISHED |      | NULL     | 2592000 |
+-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+
1 row in set (0.00 sec)
After the inverted index is created, execute the same query statement to query the data whose value is greater than 2007-08-23 04:17:00 in the timestamp column. The query takes 0.01 seconds, 0.02 seconds faster than before the inverted index is created.
SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00';
+---------+
| count() |
+---------+
|  999081 |
+---------+
1 row in set (0.01 sec)
Compare the performance of equality queries on data of the NUMERIC type.
Query the data whose value is 11189 in the parent column before an inverted index is created.
SELECT count() FROM hackernews_1m WHERE parent = 11189;
+---------+
| count() |
+---------+
|       2 |
+---------+
1 row in set (0.01 sec)
Create an inverted index that is not used to split text into terms on the parent column.
-- You do not need to specify the tokenizer for a column of the NUMERIC type when you create an inverted index on the column.
-- ALTER TABLE t ADD INDEX is the second syntax for creating inverted indexes.
ALTER TABLE hackernews_1m ADD INDEX idx_parent(parent) USING INVERTED;
Query OK, 0 rows affected (0.01 sec)
Query the index creation progress.
SHOW ALTER TABLE COLUMN;
+-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+
| JobId | TableName     | CreateTime              | FinishTime              | IndexName     | IndexId | OriginIndexId | SchemaVersion | TransactionId | State    | Msg  | Progress | Timeout |
+-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+
| 10030 | hackernews_1m | 2023-02-10 19:44:12.929 | 2023-02-10 19:44:13.938 | hackernews_1m | 10031   | 10008         | 1:1994690496  | 3             | FINISHED |      | NULL     | 2592000 |
| 10053 | hackernews_1m | 2023-02-10 19:49:32.893 | 2023-02-10 19:49:33.982 | hackernews_1m | 10054   | 10008         | 1:378856428   | 4             | FINISHED |      | NULL     | 2592000 |
+-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+
After the inverted index is created, execute the same query statement to query the data whose value is 11189 in the parent column.
SELECT count() FROM hackernews_1m WHERE parent = 11189;
+---------+
| count() |
+---------+
|       2 |
+---------+
1 row in set (0.01 sec)
Compare the performance of equality queries on data of the STRING type.
Query the data whose value is faster in the author column before an inverted index is created. The query takes 0.03 seconds.
SELECT count() FROM hackernews_1m WHERE author = 'faster';
+---------+
| count() |
+---------+
|      20 |
+---------+
1 row in set (0.03 sec)
Create an inverted index that is not used to split text into terms on the author column.
-- In this example, an inverted index is created on the author column. Each value of the author column is not split into terms and is considered as a single term.
ALTER TABLE hackernews_1m ADD INDEX idx_author(author) USING INVERTED;
Query OK, 0 rows affected (0.01 sec)
Query the index creation progress.
-- It takes only 1.5 seconds to create an inverted index in incremental mode on the author column that contains 1 million data entries.
SHOW ALTER TABLE COLUMN;
+-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+
| JobId | TableName     | CreateTime              | FinishTime              | IndexName     | IndexId | OriginIndexId | SchemaVersion | TransactionId | State    | Msg  | Progress | Timeout |
+-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+
| 10030 | hackernews_1m | 2023-02-10 19:44:12.929 | 2023-02-10 19:44:13.938 | hackernews_1m | 10031   | 10008         | 1:1994690496  | 3             | FINISHED |      | NULL     | 2592000 |
| 10053 | hackernews_1m | 2023-02-10 19:49:32.893 | 2023-02-10 19:49:33.982 | hackernews_1m | 10054   | 10008         | 1:378856428   | 4             | FINISHED |      | NULL     | 2592000 |
| 10076 | hackernews_1m | 2023-02-10 19:54:20.046 | 2023-02-10 19:54:21.521 | hackernews_1m | 10077   | 10008         | 1:1335127701  | 5             | FINISHED |      | NULL     | 2592000 |
+-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+
After the inverted index is created, execute the same query statement to query the data whose value is faster in the author column. The query takes 0.01 seconds, 0.02 seconds faster than before the inverted index is created.
-- After the inverted index is created, the equality query on strings is also significantly accelerated.
SELECT count() FROM hackernews_1m WHERE author = 'faster';
+---------+
| count() |
+---------+
|      20 |
+---------+
1 row in set (0.01 sec)
TOKENIZE function
The TOKENIZE function tokenizes continuous text into separate terms or phrases. Tokenization is central to creating and using inverted indexes: the tokenization method and quality directly affect the quality and performance of the indexes.
You can call the TOKENIZE function to view how a piece of continuous text is tokenized. The function accepts the parser and parser_mode parameters, which are described in the following table.
Parameter | Description |
parser | The tokenizer. By default, this parameter is left empty, which specifies that text is not tokenized into terms. Valid values: english (English tokenizer), chinese (Chinese tokenizer), and unicode (multi-language tokenizer). |
parser_mode | The word tokenization mode, which determines the granularity of word tokenization. Valid values: coarse_grained and fine_grained. By default, the coarse_grained mode is used. In coarse_grained mode, text is tokenized into longer terms. For example, 武汉市长江大桥 is tokenized into 武汉市 and 长江大桥. In fine_grained mode, text is tokenized into shorter terms. For example, 武汉长江大桥 is tokenized into 武汉, 武汉长江大桥, 长江, 长江大桥, and 大桥. |
Example
-- English word tokenization result
SELECT TOKENIZE('I love CHINA','"parser"="english"');
+------------------------------------------------+
| tokenize('I love CHINA', '"parser"="english"') |
+------------------------------------------------+
| ["i", "love", "china"] |
+------------------------------------------------+
1 row in set (0.02 sec)
-- Fine-grained word tokenization result implemented by using the Chinese tokenizer.
SELECT TOKENIZE('武汉长江大桥','"parser"="chinese","parser_mode"="fine_grained"');
+-----------------------------------------------------------------------------------+
| tokenize('武汉长江大桥', '"parser"="chinese","parser_mode"="fine_grained"') |
+-----------------------------------------------------------------------------------+
| ["武汉", "武汉长江大桥", "长江", "长江大桥", "大桥"] |
+-----------------------------------------------------------------------------------+
1 row in set (0.02 sec)
-- Coarse-grained word tokenization result implemented by using the Chinese tokenizer.
SELECT TOKENIZE('武汉市长江大桥','"parser"="chinese","parser_mode"="coarse_grained"');
+----------------------------------------------------------------------------------------+
| tokenize('武汉市长江大桥', '"parser"="chinese","parser_mode"="coarse_grained"') |
+----------------------------------------------------------------------------------------+
| ["武汉市", "长江大桥"] |
+----------------------------------------------------------------------------------------+
1 row in set (0.02 sec)
-- Word tokenization result of multi-language text
SELECT TOKENIZE('I love CHINA 我爱我的祖国','"parser"="unicode"');
+-------------------------------------------------------------------+
| tokenize('I love CHINA 我爱我的祖国', '"parser"="unicode"') |
+-------------------------------------------------------------------+
| ["i", "love", "china", "我", "爱", "我", "的", "祖", "国"] |
+-------------------------------------------------------------------+
1 row in set (0.02 sec)