
ApsaraDB for SelectDB:Inverted indexes

Last Updated:Jan 26, 2025

Inverted indexes are commonly used in information retrieval. Text is tokenized into terms, and the index maps each term to the documents that contain it, which makes it fast to retrieve documents that contain specified terms. ApsaraDB for SelectDB supports inverted indexes: you can use them to perform full-text searches on data of the TEXT type and equality or range queries on data of the NUMERIC and DATE types, which allows you to quickly retrieve rows that meet specific conditions from large datasets. This topic describes the features of inverted indexes in ApsaraDB for SelectDB and how to create and use them.

Implementation

In the ApsaraDB for SelectDB implementation of inverted indexes, each row in a table corresponds to a document, and each column corresponds to a field in the document. Therefore, an inverted index can quickly find the rows that contain specific terms, which accelerates queries that contain a WHERE clause.

Inverted indexes differ from other indexes in ApsaraDB for SelectDB: they are stored in separate files that have a logical mapping to the segment files but are physically independent of them. As a result, the segment files do not need to be rewritten when inverted indexes are updated or deleted, which significantly reduces processing overhead.
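The term-to-row mapping described above can be sketched with a toy Python model (illustrative only; it does not reflect SelectDB's actual on-disk index format):

```python
import re

def tokenize(text):
    # Crude English-style tokenization: split on non-alphanumeric characters
    # and lowercase each term, similar in spirit to the "english" parser.
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]

def build_inverted_index(rows):
    # Map each term to the set of row IDs ("documents") that contain it.
    index = {}
    for row_id, text in enumerate(rows):
        for term in tokenize(text):
            index.setdefault(term, set()).add(row_id)
    return index

def match_any(index, keywords):
    # Rows that contain at least one of the terms (like MATCH_ANY).
    return set().union(*(index.get(t.lower(), set()) for t in keywords.split()))

def match_all(index, keywords):
    # Rows that contain every term (like MATCH_ALL).
    return set.intersection(*(index.get(t.lower(), set()) for t in keywords.split()))

rows = [
    "OLAP engines scan columns",          # row 0
    "OLTP systems favor point lookups",   # row 1
    "Hybrid OLAP and OLTP workloads",     # row 2
]
idx = build_inverted_index(rows)
print(sorted(match_any(idx, "OLAP OLTP")))  # [0, 1, 2]
print(sorted(match_all(idx, "OLAP OLTP")))  # [2]
```

A WHERE clause that matches terms can therefore be answered by set operations on the index, without scanning every row.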

Scenarios

  • You want to accelerate full-text searches for data of the STRING type.

  • You want to accelerate searches for data of the STRING, NUMERIC, or DATETIME type by using the following operators: =, !=, >, >=, <, and <=.

Benefits

  • Support logical combinations of conditions.

    • Support index filter pushdown for the OR and NOT operators.

    • Support a combination of the AND, OR, and NOT operators.

  • Support flexible and quick management of indexes.

    • Allow you to create inverted indexes when you create a table.

    • Allow you to create inverted indexes for an existing table.

    • Allow you to delete existing inverted indexes from a table.
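As an illustration of the first benefit, full-text search predicates can be combined with AND, OR, and NOT while the index filters are still pushed down. The table and column names below are illustrative:

```sql
-- Illustrative query: all three MATCH_ANY predicates can be filtered by the index.
SELECT * FROM log_tb
WHERE logmsg MATCH_ANY 'error'
   OR (logmsg MATCH_ANY 'timeout' AND NOT logmsg MATCH_ANY 'retry');
```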

Limits

  • Inverted indexes are not supported for the FLOAT and DOUBLE data types because these types do not provide high precision. Use the high-precision DECIMAL data type instead, which supports inverted indexes.

  • Inverted indexes are not supported for complex data types such as MAP, STRUCT, JSON, HLL, BITMAP, QUANTILE_STATE, and AGG_STATE. For JSON data, you can convert the column to the VARIANT data type, which supports inverted indexes.

  • Fields of the NUMERIC type support inverted indexes, but you cannot specify the parser property (english, chinese, or unicode) for these fields.

  • You can create inverted indexes on any column in the DUPLICATE model and in the UNIQUE model for which Merge-on-Write (MoW) is enabled. In the AGGREGATE model and in the UNIQUE model for which MoW is disabled, you can create inverted indexes only on key columns. These models must read and merge all data at query time, so indexes on non-key columns cannot filter data in advance.

Create inverted indexes

You can create an inverted index when you create a table, or create an inverted index on a column of an existing table.

Create inverted indexes when you create a table

This operation is synchronous: the inverted indexes are created together with the table.

Important

Inverted indexes have different limits in the following data models:

  • In the Aggregate key model, you can create inverted indexes only on key columns.

  • In the Unique key model, you must enable the MoW feature. After you enable the MoW feature, you can create an inverted index on any column.

  • In the Duplicate key model, you can create an inverted index on any column.

Syntax

CREATE TABLE  [IF NOT EXISTS] [db_name.]<table_name>
(
  <column_definition_list>,
  [<index_definition_list>] 
)
table_properties;

Parameters

Parameters that are used to create a table:

  • db_name: optional. The name of the database in which you want to create the table.

  • table_name: required. The name of the table that you want to create.

  • column_definition_list: required. The list of column definitions. For more information, see CREATE-TABLE.

  • table_properties: required. The properties of the table, such as the data model and the partitioning and bucketing settings. For more information, see Data models.

  • index_definition_list: optional. The list of index definitions.

index_definition_list

You can define multiple indexes when you create a table. Separate the index definitions with commas: index_definition[, index_definition, ...].
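For example, the following statement (with an illustrative schema) defines two inverted indexes in one CREATE TABLE statement:

```sql
CREATE TABLE example_tb
(
    `id` BIGINT,
    `city` VARCHAR(64),
    `comment` String,
    INDEX idx_city (`city`) USING INVERTED,
    INDEX idx_comment (`comment`) USING INVERTED PROPERTIES("parser" = "english")
)
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 10;
```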

Syntax of index_definition

INDEX <index_name>(<column_name>) <index_type> [PROPERTIES("<key>" = "<value>")] [COMMENT '<comment>']

Parameters of index_definition

Required parameters:

  • index_name: The name of the index.

  • column_name: The name of the column on which the index is created.

  • index_type: The type of the index. Set the value to USING INVERTED.

Optional parameters
PROPERTIES

The PROPERTIES parameter specifies whether and how the index tokenizes text into terms. The value of the PROPERTIES parameter consists of one or more key-value pairs that are separated by commas (,). Each key-value pair is in the format of "<key>" = "<value>". To view the tokenization result of a piece of text, you can call the TOKENIZE function. For more information, see TOKENIZE function.

The following keys are supported:

parser

The tokenizer. By default, this key is left empty, which specifies that text is not tokenized into terms. You cannot specify parser for fields of the NUMERIC type. Valid values:

  • english: the English tokenizer, which is suitable for fields that contain English terms. This tokenizer tokenizes text into terms by using spaces and punctuation marks, and provides high performance.

  • chinese: the Chinese tokenizer, which is suitable for fields that contain Chinese terms. This tokenizer provides lower performance than the English tokenizer.

  • unicode: the mixed Unicode tokenizer, which is suitable for fields that contain English and Chinese terms. This tokenizer can tokenize the prefixes and suffixes of email addresses, IP addresses, and the combinations of characters and digits. It can also tokenize Chinese terms by character.

parser_mode

The tokenization mode, which determines the granularity at which text is tokenized.

By default, the coarse_grained mode is used for all tokenizers. In coarse_grained mode, text is tokenized into long terms. For example, 武汉市长江大桥 is tokenized into the following two terms: 武汉市 and 长江大桥.

If the parser=chinese setting is configured, the Chinese tokenizer is used and the fine_grained mode is supported. In fine_grained mode, text is tokenized into short terms. For example, 武汉市长江大桥 is tokenized into the following six terms: 武汉, 武汉市, 市长, 长江, 长江大桥, and 大桥.

For more information about how to use word tokenization, see TOKENIZE function.

support_phrase

Specifies whether to accelerate the MATCH_PHRASE query by using the index. Default value: false.

  • If you set this parameter to true, the MATCH_PHRASE query is accelerated but the index requires more storage capacity.

  • If you set this parameter to false, the MATCH_PHRASE query is not accelerated and the index requires less storage space. You can perform a MATCH_ALL query to match multiple terms at a time.
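For example, an index definition that enables support_phrase together with the English tokenizer might look as follows (the index and column names are illustrative):

```sql
INDEX idx_comment (`comment`) USING INVERTED
    PROPERTIES("parser" = "english", "support_phrase" = "true")
```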

char_filter

The parameters that are used to process strings before text is tokenized into terms. The char_filter_type parameter can be set to char_replace only.

If you set the char_filter_type parameter to char_replace, characters in the value of the char_filter_pattern parameter are replaced with the characters in the value of the char_filter_replacement parameter.

  • char_filter_pattern: the array of characters to be replaced.

  • char_filter_replacement: the array of characters that are used as replacements. If you leave this parameter empty, a space character is used by default.
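For example, the following index definition (illustrative names; verify the exact property keys against your SelectDB version) replaces periods (.) and underscores (_) with spaces before tokenization, so that a value such as a.b_c is tokenized into a, b, and c:

```sql
INDEX idx_comment (`comment`) USING INVERTED PROPERTIES(
    "parser" = "unicode",
    "char_filter_type" = "char_replace",
    "char_filter_pattern" = "._",
    "char_filter_replacement" = " "
)
```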

COMMENT

  • comment: The description of the index.

Example

-- Create a table and an inverted index named idx_comment on the comment column.
-- USING INVERTED specifies that the index type is inverted index.
-- PROPERTIES("parser" = "english") specifies that the English tokenizer is used to tokenize text into terms. You can also set the parser parameter to chinese or unicode. If you leave the parser parameter empty, text is not tokenized into terms.
CREATE TABLE hackernews_1m
(
    `id` BIGINT,
    `deleted` TINYINT,
    `type` String,
    `author` String,
    `timestamp` DateTimeV2,
    `comment` String,
    `dead` TINYINT,
    `parent` BIGINT,
    `poll` BIGINT,
    `children` Array<BIGINT>,
    `url` String,
    `score` INT,
    `title` String,
    `parts` Array<INT>,
    `descendants` INT,
    INDEX idx_comment (`comment`) USING INVERTED PROPERTIES("parser" = "english") COMMENT 'inverted index for comment'
)
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 10;

Create inverted indexes for an existing table

This operation is asynchronous. You can execute the SHOW ALTER TABLE COLUMN statement to query the index creation progress.

Syntax

ALTER TABLE <table_name> ADD INDEX <index_name>(<column_name>) <index_type> [PROPERTIES("<key>" = "<value>")];

Parameters

The parameters are the same as those used to create inverted indexes when you create a table.

Examples

Create an inverted index that is not used to split text into terms.

ALTER TABLE user_tb ADD INDEX index_userId(user_id) USING INVERTED;

Create an inverted index that is used to split text into terms by using the English tokenizer.

ALTER TABLE user_tb ADD INDEX index_city(city) USING INVERTED PROPERTIES("parser" = "english");

Query the information about inverted indexes

Query the index change progress

You can execute the ALTER or DROP statement to change an inverted index. This operation is asynchronous. You can execute the following statement to query the change progress:

SHOW ALTER TABLE COLUMN;

Query all indexes that are created for a table

Syntax

SHOW INDEXES FROM <table_name>;

Example

SHOW INDEXES FROM user_tb;

Delete inverted indexes

This operation is asynchronous. For more information about how to query the index deletion progress, see Query the information about inverted indexes.

Important

If you delete inverted indexes from a table, the performance of queries on the table decreases. Proceed with caution.

Syntax

-- Syntax 1
DROP INDEX <index_name> ON <table_name>;
-- Syntax 2
ALTER TABLE <table_name> DROP INDEX <index_name>;

Example

DROP INDEX index_userId ON user_tb;
ALTER TABLE user_tb DROP INDEX index_city;

Use inverted indexes

Full-text searches

Syntax

SELECT * FROM <table_name> WHERE <column_name> <conditional_logic> '<keywords>';

Parameters

  • table_name: required. The name of the table that you want to query.

  • column_name: required. The name of the column that you want to query.

  • conditional_logic: required. The conditional logic, which combines full-text search keywords with the logical operators AND, OR, and NOT. Full-text search keywords:

    • MATCH_ALL: matches rows that contain all of the specified terms.

    • MATCH_ANY: matches rows that contain at least one of the specified terms.

    • MATCH_PHRASE: matches rows that contain the specified terms as a consecutive phrase.

  • keywords: required. The terms that are used to query data. Separate multiple terms with spaces. Example: keyword1 keyword2 keyword3.

Examples

-- Query all rows that contain keyword1 in the logmsg column from the log_tb table. 
SELECT * FROM log_tb WHERE logmsg MATCH_ANY 'keyword1';

-- Query all rows that contain keyword1 or keyword2 in the logmsg column from the log_tb table. 
SELECT * FROM log_tb WHERE logmsg MATCH_ANY 'keyword1 keyword2';

-- Query all rows that contain keyword1 and keyword2 in the logmsg column from the log_tb table. 
SELECT * FROM log_tb WHERE logmsg MATCH_ALL 'keyword1 keyword2';

-- Query all rows that contain keyword1 and keyword2 in the logmsg column from the log_tb table and in which keyword2 follows keyword1. 
SELECT * FROM log_tb WHERE logmsg MATCH_PHRASE 'keyword1 keyword2';

Equality or range queries on data of the NUMERIC and DATE types

In these scenarios, the query syntax is the same as the standard SQL syntax.

Example

-- Perform an equality query, a range query, and a query that contains the IN or NOT IN condition.
SELECT * FROM user_tb WHERE id = 123;
SELECT * FROM user_tb WHERE ts > '2023-01-01 00:00:00';
SELECT * FROM user_tb WHERE op_type IN ('add', 'delete');

Compare query performance

In this example, the hackernews_1m table that contains 1 million rows is used to compare query performance before and after inverted indexes are used.

Environment preparations

Step 1: Create a table

  1. Create a database.

    CREATE DATABASE test_inverted_index;
  2. Switch to the created database.

    USE test_inverted_index;
  3. Create a table.

    CREATE TABLE hackernews_1m
    (
        `id` BIGINT,
        `deleted` TINYINT,
        `type` String,
        `author` String,
        `timestamp` DateTimeV2,
        `comment` String,
        `dead` TINYINT,
        `parent` BIGINT,
        `poll` BIGINT,
        `children` Array<BIGINT>,
        `url` String,
        `score` INT,
        `title` String,
        `parts` Array<INT>,
        `descendants` INT,
        INDEX idx_comment (`comment`) USING INVERTED PROPERTIES("parser" = "english") COMMENT 'inverted index for comment'
    )
    DUPLICATE KEY(`id`)
    DISTRIBUTED BY HASH(`id`) BUCKETS 10;
    -- Create a table and an inverted index named idx_comment on the comment column.
    -- USING INVERTED specifies that the index type is inverted index.
    -- PROPERTIES("parser" = "english") specifies that the English tokenizer is used to split text into terms. You can also set the parser parameter to chinese or unicode. If you leave the parser parameter empty, text is not split into terms.

Step 2: Import data

Import data to the created table.

  1. Download the data file.

    wget https://qa-build.oss-cn-beijing.aliyuncs.com/regression/index/hacknernews_1m.csv.gz
  2. Import data to the table by using Stream Load.

    On the Instance Details page of an ApsaraDB for SelectDB instance, you can view the endpoint and port number of the ApsaraDB for SelectDB instance. For more information about Stream Load, see Import data by using Stream Load.

    curl --location-trusted -u root: -H "compress_type:gz" -T hacknernews_1m.csv.gz  http://<host>:<port>/api/test_inverted_index/hackernews_1m/_stream_load
    {
        "TxnId": 2,
        "Label": "a8a3e802-2329-49e8-912b-04c800a461a6",
        "TwoPhaseCommit": "false",
        "Status": "Success",
        "Message": "OK",
        "NumberTotalRows": 1000000,
        "NumberLoadedRows": 1000000,
        "NumberFilteredRows": 0,
        "NumberUnselectedRows": 0,
        "LoadBytes": 130618406,
        "LoadTimeMs": 8988,
        "BeginTxnTimeMs": 23,
        "StreamLoadPutTimeMs": 113,
        "ReadDataTimeMs": 4788,
        "WriteDataTimeMs": 8811,
        "CommitAndPublishTimeMs": 38
    }
  3. Call the count() function to check whether the data is imported.

    SELECT count() FROM hackernews_1m;
    +---------+
    | count() |
    +---------+
    | 1000000 |
    +---------+
    1 row in set (0.02 sec)

Performance comparison

Note
  • The results of a query that uses a tokenized inverted index can differ from the results of the same query without the index. The index tokenizes the values in a column into terms and normalizes the terms by converting them to lowercase. Therefore, a query that uses the index may return more rows than a query that does not.

  • The query performance in some examples does not differ significantly because the dataset is small. The larger the dataset, the greater the performance difference.
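The first point can be illustrated with a toy Python comparison (not SelectDB code): a case-sensitive LIKE substring test misses rows that a lowercased term match finds.

```python
rows = [
    "OLAP is fast",          # matched by both queries
    "I like olap engines",   # missed by LIKE '%OLAP%', matched by the normalized index
    "Nothing relevant",
]

# Case-sensitive substring test, analogous to LIKE '%OLAP%'.
like_hits = [r for r in rows if "OLAP" in r]

# Lowercased term match, analogous to MATCH_ANY 'OLAP' on a tokenized index.
match_hits = [r for r in rows if "olap" in r.lower().split()]

print(len(like_hits))   # 1
print(len(match_hits))  # 2
```

This is why the MATCH_ANY counts below (35, 51, ...) are slightly larger than the corresponding LIKE counts (34, 48, ...).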

Full-text searches
  • Count the number of rows that contain OLAP in the comment column.

    • Count the number of rows that contain OLAP in the comment column based on the LIKE condition. The query takes 0.18 seconds.

      SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%';
      +---------+
      | count() |
      +---------+
      |      34 |
      +---------+
      1 row in set (0.18 sec)
    • Count the number of rows that contain OLAP in the comment column based on the full-text search keyword MATCH_ANY. The query takes 0.02 seconds. The query performance based on the full-text search keyword is eight times higher than that based on the LIKE condition.

      SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP';
      +---------+
      | count() |
      +---------+
      |      35 |
      +---------+
      1 row in set (0.02 sec)
  • Count the number of rows that contain OLTP in the comment column.

    • Count the number of rows that contain OLTP in the comment column based on the LIKE condition. The query takes 0.07 seconds.

      SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLTP%';
      +---------+
      | count() |
      +---------+
      |      48 |
      +---------+
      1 row in set (0.07 sec)
    • Count the number of rows that contain OLTP in the comment column based on the full-text search keyword MATCH_ANY. The query takes 0.01 seconds. The query performance based on the full-text search keyword is six times higher than that based on the LIKE condition.

      SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLTP';
      +---------+
      | count() |
      +---------+
      |      51 |
      +---------+
      1 row in set (0.01 sec)
  • Count the number of rows that contain both OLAP and OLTP in the comment column.

    • Count the number of rows that contain both OLAP and OLTP in the comment column based on the LIKE condition. The query takes 0.13 seconds.

      SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' AND comment LIKE '%OLTP%';
      +---------+
      | count() |
      +---------+
      |      14 |
      +---------+
      1 row in set (0.13 sec)
    • Count the number of rows that contain both OLAP and OLTP in the comment column based on the full-text search keyword MATCH_ALL. The query takes 0.01 seconds. The query performance based on the full-text search keyword is 12 times higher than that based on the LIKE condition.

       SELECT count() FROM hackernews_1m WHERE comment MATCH_ALL 'OLAP OLTP';
      +---------+
      | count() |
      +---------+
      |      15 |
      +---------+
      1 row in set (0.01 sec)
  • Count the number of rows that contain OLAP or OLTP in the comment column.

    • Count the number of rows that contain OLAP or OLTP in the comment column based on the LIKE condition. The query takes 0.12 seconds.

      SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' OR comment LIKE '%OLTP%';
      +---------+
      | count() |
      +---------+
      |      68 |
      +---------+
      1 row in set (0.12 sec)
    • Count the number of rows that contain OLAP or OLTP in the comment column based on the full-text search keyword. The query takes 0.01 seconds. The query performance based on the full-text search keyword is 11 times higher than that based on the LIKE condition.

      SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP OLTP';
      +---------+
      | count() |
      +---------+
      |      71 |
      +---------+
      1 row in set (0.01 sec)

Equality or range queries

  • Compare the performance of column-range queries on data of the DATETIME type.

    1. Query the data whose value is greater than 2007-08-23 04:17:00 in the timestamp column before an inverted index is created. The query takes 0.03 seconds.

       SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00';
      +---------+
      | count() |
      +---------+
      |  999081 |
      +---------+
      1 row in set (0.03 sec)
    2. Create an inverted index on the timestamp column.

      CREATE INDEX idx_timestamp ON hackernews_1m(timestamp) USING INVERTED;
      Query OK, 0 rows affected (0.03 sec)
    3. Query the index creation progress. Based on the difference between the values in the FinishTime and CreateTime columns, you can see that it took only about 1 second to create the inverted index on the timestamp column, which contains 1 million rows.

      SHOW ALTER TABLE COLUMN;
      +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+
      | JobId | TableName     | CreateTime              | FinishTime              | IndexName     | IndexId | OriginIndexId | SchemaVersion | TransactionId | State    | Msg  | Progress | Timeout |
      +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+
      | 10030 | hackernews_1m | 2023-02-10 19:44:12.929 | 2023-02-10 19:44:13.938 | hackernews_1m | 10031   | 10008         | 1:1994690496  | 3             | FINISHED |      | NULL     | 2592000 |
      +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+
      1 row in set (0.00 sec)
    4. After the inverted index is created, execute the same statement to query the data whose value is greater than 2007-08-23 04:17:00 in the timestamp column. The query takes 0.01 seconds, which is 0.02 seconds faster than before the index was created.

      SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00';
      +---------+
      | count() |
      +---------+
      |  999081 |
      +---------+
      1 row in set (0.01 sec)
  • Compare the performance of equality queries on data of the NUMERIC type.

    1. Query the data whose value is 11189 in the parent column before an inverted index is created.

      SELECT count() FROM hackernews_1m WHERE parent = 11189;
      +---------+
      | count() |
      +---------+
      |       2 |
      +---------+
      1 row in set (0.01 sec)
    2. Create an inverted index that is not used to split text into terms on the parent column.

      -- You do not need to specify the tokenizer for a column of the NUMERIC type when you create an inverted index on the column.
      -- ALTER TABLE t ADD INDEX is the second syntax for creating inverted indexes.
      ALTER TABLE hackernews_1m ADD INDEX idx_parent(parent) USING INVERTED;
      Query OK, 0 rows affected (0.01 sec)
    3. Query the index creation progress.

      SHOW ALTER TABLE COLUMN;
      +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+
      | JobId | TableName     | CreateTime              | FinishTime              | IndexName     | IndexId | OriginIndexId | SchemaVersion | TransactionId | State    | Msg  | Progress | Timeout |
      +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+
      | 10030 | hackernews_1m | 2023-02-10 19:44:12.929 | 2023-02-10 19:44:13.938 | hackernews_1m | 10031   | 10008         | 1:1994690496  | 3             | FINISHED |      | NULL     | 2592000 |
      | 10053 | hackernews_1m | 2023-02-10 19:49:32.893 | 2023-02-10 19:49:33.982 | hackernews_1m | 10054   | 10008         | 1:378856428   | 4             | FINISHED |      | NULL     | 2592000 |
      +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+
    4. After the inverted index is created, execute the same query statement to query the data whose value is 11189 in the parent column.

      SELECT count() FROM hackernews_1m WHERE parent = 11189;
      +---------+
      | count() |
      +---------+
      |       2 |
      +---------+
      1 row in set (0.01 sec)
  • Compare the performance of equality queries on data of the STRING type.

    1. Query the data whose value is faster in the author column before an inverted index is created. The query takes 0.03 seconds.

      SELECT count() FROM hackernews_1m WHERE author = 'faster';
      +---------+
      | count() |
      +---------+
      |      20 |
      +---------+
      1 row in set (0.03 sec)
    2. Create an inverted index that is not used to split text into terms on the author column.

      -- In this example, an inverted index is created on the author column. Each value of the author column is not split into terms and is considered as a term.
      ALTER TABLE hackernews_1m ADD INDEX idx_author(author) USING INVERTED;
      Query OK, 0 rows affected (0.01 sec)
    3. Query the index creation progress.

      -- It takes only 1.5 seconds to create an inverted index in incremental mode on the author column that contains 1 million data entries.
      SHOW ALTER TABLE COLUMN;
      +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+
      | JobId | TableName     | CreateTime              | FinishTime              | IndexName     | IndexId | OriginIndexId | SchemaVersion | TransactionId | State    | Msg  | Progress | Timeout |
      +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+
      | 10030 | hackernews_1m | 2023-02-10 19:44:12.929 | 2023-02-10 19:44:13.938 | hackernews_1m | 10031   | 10008         | 1:1994690496  | 3             | FINISHED |      | NULL     | 2592000 |
      | 10053 | hackernews_1m | 2023-02-10 19:49:32.893 | 2023-02-10 19:49:33.982 | hackernews_1m | 10054   | 10008         | 1:378856428   | 4             | FINISHED |      | NULL     | 2592000 |
      | 10076 | hackernews_1m | 2023-02-10 19:54:20.046 | 2023-02-10 19:54:21.521 | hackernews_1m | 10077   | 10008         | 1:1335127701  | 5             | FINISHED |      | NULL     | 2592000 |
      +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+
      
    4. After the inverted index is created, execute the same statement to query the data whose value is faster in the author column. The query takes 0.01 seconds, which is 0.02 seconds faster than before the index was created.

      -- After the inverted index is created, the equality query on strings is also significantly accelerated.
      SELECT count() FROM hackernews_1m WHERE author = 'faster';
      +---------+
      | count() |
      +---------+
      |      20 |
      +---------+
      1 row in set (0.01 sec)

TOKENIZE function

The TOKENIZE function tokenizes continuous text into separate terms or phrases. Tokenization is central to creating and using inverted indexes: the tokenization method and quality directly affect the quality and performance of an index.

You can call the TOKENIZE function to view the tokenization result of a piece of text. The function supports the parser and parser_mode parameters, which are described as follows.

parser

The tokenizer. By default, this parameter is left empty, which specifies that text is not tokenized. Valid values:

  • english: the English tokenizer, which is suitable for fields that contain English terms. This tokenizer tokenizes text into terms by using spaces and punctuation marks, and provides high performance.

  • chinese: the Chinese tokenizer, which is suitable for fields that contain Chinese terms. This tokenizer provides lower performance than the English tokenizer.

  • unicode: the mixed Unicode tokenizer, which is suitable for fields that contain English and Chinese terms. This tokenizer can tokenize the prefixes and suffixes of email addresses, IP addresses, and the combinations of characters and digits. It can also tokenize Chinese terms by character.

parser_mode

The tokenization mode, which determines the granularity at which text is tokenized.

By default, the coarse_grained mode is used for all tokenizers. In coarse_grained mode, text is tokenized into long terms. For example, 武汉市长江大桥 is tokenized into the following two terms: 武汉市 and 长江大桥.

If the parser=chinese setting is configured, the Chinese tokenizer is used and the fine_grained mode is supported. In fine_grained mode, text is tokenized into short terms. For example, 武汉市长江大桥 is tokenized into the following six terms: 武汉, 武汉市, 市长, 长江, 长江大桥, and 大桥.

Example

-- English word tokenization result 
SELECT TOKENIZE('I love CHINA','"parser"="english"');
+------------------------------------------------+
| tokenize('I love CHINA', '"parser"="english"') |
+------------------------------------------------+
| ["i", "love", "china"]                         |
+------------------------------------------------+
1 row in set (0.02 sec)

-- Fine-grained word tokenization result implemented by using the Chinese tokenizer. 
SELECT TOKENIZE('武汉长江大桥','"parser"="chinese","parser_mode"="fine_grained"');
+-----------------------------------------------------------------------------------+
| tokenize('武汉长江大桥', '"parser"="chinese","parser_mode"="fine_grained"')       |
+-----------------------------------------------------------------------------------+
| ["武汉", "武汉长江大桥", "长江", "长江大桥", "大桥"]                              |
+-----------------------------------------------------------------------------------+
1 row in set (0.02 sec)

-- Coarse-grained word tokenization result implemented by using the Chinese tokenizer. 
SELECT TOKENIZE('武汉市长江大桥','"parser"="chinese","parser_mode"="coarse_grained"');
+----------------------------------------------------------------------------------------+
| tokenize('武汉市长江大桥', '"parser"="chinese","parser_mode"="coarse_grained"')        |
+----------------------------------------------------------------------------------------+
| ["武汉市", "长江大桥"]                                                                 |
+----------------------------------------------------------------------------------------+
1 row in set (0.02 sec)

-- Word tokenization result of multi-language text 
SELECT TOKENIZE('I love CHINA 我爱我的祖国','"parser"="unicode"');
+-------------------------------------------------------------------+
| tokenize('I love CHINA 我爱我的祖国', '"parser"="unicode"')       |
+-------------------------------------------------------------------+
| ["i", "love", "china", "我", "爱", "我", "的", "祖", "国"]        |
+-------------------------------------------------------------------+
1 row in set (0.02 sec)