Lindorm columnstore index - Lindorm - Alibaba Cloud Documentation Center

Indexes are an important way to accelerate data queries. A columnstore index enhances analytical computing capabilities for massive data in wide tables. It is primarily used for scenarios such as device information statistics in the Internet of Vehicles (IoV) and Internet of Things (IoT), data analytics in e-commerce, and order statistics in the logistics industry. This topic describes the basic and advanced uses of the columnstore index to help you quickly get started with and master this feature.

Prerequisites

The compute engine is enabled. For more information, see Enable services.
LindormDFS is enabled, and its version is 4.0.0 or later.
The wide table engine is enabled, and its version is 2.5.0 or later.

Notes

Columnstore indexes cannot be built synchronously.
Building a columnstore index takes about 15 minutes. If there are many background index building tasks or a large volume of business data, it may take longer.
Building a columnstore index requires reading back data, which generates read operations. If the hot and cold data separation feature is enabled for the instance, be aware of the throttling of cold storage (storage-optimized cloud storage). If read operations on cold storage are throttled, the index building efficiency is directly affected. This can lead to backpressure on write operations.

Enable columnstore index

Log on to the Lindorm console.
In the upper-left corner of the page, select the region where the instance is deployed.
On the Instances page, click the ID of the target instance or click View Instance Details in the Actions column for the instance.
In the navigation pane on the left, choose Wide Table Engine.
Click the Columnar Index tab, and then click Activate Now.
In the dialog box that appears, click OK.

Quick Start

Assume that you need to perform efficient, parallel data analysis on a massive data table named my_tbl. To do this, you must create a columnstore index for the table.

The schema of the sample table my_tbl is as follows:

+------------+-------------+---------+----------------+
| TABLE_NAME | COLUMN_NAME |  TYPE   | IS_PRIMARY_KEY |
+------------+-------------+---------+----------------+
| my_tbl     | pk0         | INT     | true           |
| my_tbl     | pk1         | VARCHAR | true           |
| my_tbl     | pt_d        | VARCHAR | true           |
| my_tbl     | col0        | INT     | false          |
| my_tbl     | col1        | VARCHAR | false          |
| my_tbl     | json_col0   | JSON    | false          |
+------------+-------------+---------+----------------+

The primary key pk0 identifies each row and has a large cardinality. The primary key pt_d is the date on which the data was generated. Data is typically analyzed on a daily basis.

Create a columnstore index. The columnstore index can automatically expand the data stored in JSON fields.

If the schema of my_tbl is stable and does not change frequently, execute the following statement:

CREATE INDEX my_tbl_idx USING COLUMNAR
ON my_tbl(*) 
PARTITION BY ENUMERABLE (pt_d, bucket(128, pk0))
WITH (
  `lindorm_columnar.user.index.database` = 'my_index_db',
  `lindorm_columnar.user.index.table` = 'my_index_tbl',
  `lindorm_columnar.user.syncer.lci.dynamicJsonColumns` = 'json_col0'
  );

The columnstore index creates an index table based on the current table schema and the structure of json_col0.

If the schema of my_tbl may change frequently, execute the following statement:

CREATE INDEX my_tbl_idx USING COLUMNAR
ON my_tbl(*) 
PARTITION BY ENUMERABLE (pt_d, bucket(128, pk0))
WITH (
  `lindorm_columnar.user.index.database` = 'my_index_db',
  `lindorm_columnar.user.index.table` = 'my_index_tbl',
  `lindorm_columnar.user.syncer.lci.dynamicJsonColumns` = 'json_col0',
  `lindorm_columnar.user.syncer.lci.dynamicSchema` = 'true');

The columnstore index creates an index table based on the current table schema and the structure of json_col0. It then dynamically expands the index table based on subsequent changes to the table schema and the content of the JSON field.

View the index status.
```
SHOW INDEX FROM my_tbl;
```
For more information about how to use SHOW INDEX and for a description of the result set, see SHOW INDEX.
Use the columnstore index for data query and analysis. For more information, see Use a columnstore index.

Basic usage

Assume that the schema of the sample table my_tbl is as follows:

+------------+-------------+---------+----------------+
| TABLE_NAME | COLUMN_NAME |  TYPE   | IS_PRIMARY_KEY |
+------------+-------------+---------+----------------+
| my_tbl     | pk0         | INT     | true           |
| my_tbl     | pk1         | VARCHAR | true           |
| my_tbl     | pk2         | VARCHAR | true           |
| my_tbl     | col0        | INT     | false          |
| my_tbl     | col1        | VARCHAR | false          |
+------------+-------------+---------+----------------+

Create a columnstore index

Syntax

CREATE INDEX index_name USING COLUMNAR
ON table_name(column_name(,..))
PARTITION BY ENUMERABLE (column_name(,...), bucket(bucket_num, column_name))
WITH (`lindorm_columnar.user.index.database` = 'columnar_db_name',
      `lindorm_columnar.user.index.table` = 'columnar_tbl_name');

Parameter description

Parameter	Description
index_name	The name of the columnstore index. It can contain uppercase letters, lowercase letters, digits, and underscores (_).
table_name	The name of the wide table.
column_name(,...))	The list of fields for which to create the columnstore index. Separate multiple fields with commas (,). The supported field types include TINYINT, SMALLINT, INTEGER, BIGINT, LONG, FLOAT, DOUBLE, VARCHAR, BINARY, VARBINARY, BOOLEAN, DECIMAL, JSON, DATE, and TIMESTAMP. Note This list must include all primary key fields of the wide table. To create a columnstore index for all fields (primary key and non-primary key), you can use `(*)` as a shorthand.
PARTITION BY ENUMERABLE(column_name(,...), bucket(bucket_num, column_name))	Specifies that the index data is partitioned using the enumeration algorithm to improve retrieval performance during queries. The partition filter expression includes regular partition expressions and bucket partition expressions. The fields in both types of expressions must be primary key fields of the wide table. Regular partition expression You can specify zero or more regular partition expressions. Separate multiple expressions with commas (,). A regular partition expression is a primary key field of the wide table, such as a city or date. The index data is built based on different partition values. This lets you efficiently locate data using partition filter conditions in queries. Bucket partition expression Specify at least one bucket partition expression. `bucket_num` is the number of bucket partitions. `column_name` is the bucket partition field used to calculate the bucket partition number (bucket_index). The bucket_index is calculated by computing a hash value based on the bucket partition field in the expression, and then taking the remainder of the division by `bucket_num`. In the example below, the `bucket_index` is calculated as `hash(pk0)%128`. The bucket partition field must be a primary key field in the wide table. Ensure that the bucket partition field has sufficient discrete features to avoid data skew between different partitions. The regular partition expressions and bucket partition expressions together determine the number of partitions for the index data. Set the data volume of each partition to between 50 MB and 512 MB. For example, assume the regular partition expression is the date field `dt`, the daily data volume of the wide table is 50 GB, and the primary key fields of the wide table are `(id, dt)`. You can configure the partition filter expression as `PARTITION BY ENUMERABLE (dt, bucket(200, id))`. Note When you design a partition policy, be careful not to use high-cardinality fields (fields with many unique values) as regular partition keys. Using them directly as regular partition expressions creates many partitions and small files, which affects the storage system.
WITH(`key` = 'value')	Use the `WITH` keyword to specify the following parameters for the columnstore index: lindorm_columnar.user.index.database: Specifies the name of the database where the columnstore index table is located. lindorm_columnar.user.index.table: Specifies the name of the columnstore index table.

Example

Create a columnstore index for the table my_tbl:

CREATE INDEX my_tbl_idx USING COLUMNAR
ON my_tbl(pk0, pk1, pk2, col0, col1)
PARTITION BY ENUMERABLE (pk1, pk2, bucket(128, pk0))
WITH (`lindorm_columnar.user.index.database` = 'my_index_db',
      `lindorm_columnar.user.index.table` = 'my_index_tbl');

View a columnstore index

After a columnstore index is created, its data is built continuously. The wide table, which is the primary table, continuously synchronizes its data to the columnstore index table. Data synchronization includes historical data and incremental data synchronization. During incremental data synchronization, the index data lags behind the primary table data. This delay is typically less than 1 hour.

You can use the SHOW INDEX statement to view the status of a columnstore index. For more information about how to use SHOW INDEX and for a description of the result set, see SHOW INDEX.

Use a columnstore index

A columnstore index can enhance the analytical computing capabilities for massive data in wide tables. You can specify HINT parameters in your SELECT query statement to route the query request to the compute engine. The compute engine then uses the columnstore index to accelerate the query, which improves the efficiency of big data computing.

Example 1: Big data statistics

SELECT /*+ _use_ldps_(cg0), _columnar_index_ */ COUNT(*), SUM(col0), MIN(col0), MAX(col0)
FROM my_index_db.my_index_tbl
WHERE col0 > 100 AND col0 < 200 OR col0 > 500
GROUP BY pk1;

Example 2: Big data sorting

SELECT /*+ _use_ldps_(cg0), _columnar_index_ */ pk0 + col0, pk1
FROM my_index_db.my_index_tbl
WHERE col0 > 100 AND col0 < 200 OR col0 > 500
ORDER BY pk1
LIMIT 100;

Example 3: Big data association

If you have created columnstore indexes for multiple wide tables, you can also associate data between the wide tables.

SELECT /*+ _use_ldps_(cg0), _columnar_index_ */ *
FROM my_index_db.my_index_tbl0 as t0
  JOIN my_index_db.my_index_tbl1 as t1 
ON t0.pk0 = t1.pk0
AND t0.pk1 = t1.pk1
LIMIT 100;

Delete a columnstore index

You can use the DROP INDEX statement to delete a columnstore index. For more information about how to use DROP INDEX and for examples, see DROP INDEX.

Advanced usage

Complex partition filter expressions

Assume that the schema of the sample table my_ts_tbl is as follows:

+------------+-------------+---------+----------------+
| TABLE_NAME | COLUMN_NAME |  TYPE   | IS_PRIMARY_KEY |
+------------+-------------+---------+----------------+
| my_ts_tbl  | id          | INT     | true           |
| my_ts_tbl  | ts          | LONG    | true           |
| my_ts_tbl  | col0        | VARCHAR | false          |
| my_ts_tbl  | col1        | INT     | false          |
+------------+-------------+---------+----------------+

When you create a columnstore index, if the primary key fields of the wide table cannot be directly used in the partition expression for the columnstore index, you can include calculation logic in the partition expression. The following is an example.

Create a columnstore index for all fields of the wide table and partition the columnstore index data by day based on the timestamp field ts:

CREATE INDEX my_ts_idx USING COLUMNAR ON my_ts_tbl(*)
PARTITION BY ENUMERABLE (ifnull(substring(from_unixtime(ts), 0, 10), 'unknown') AS dt, bucket(128, id))
WITH (`lindorm_columnar.user.index.database` = 'my_ts_index_db',
      `lindorm_columnar.user.index.table` = 'my_ts_index_tbl');

After the columnstore index is created, you can specify filter conditions in your query statement to query data in the columnstore index.
```
SELECT /*+ _use_ldps_ */ COUNT(1)
FROM lindorm_columnar.my_ts_index_db.my_ts_index_tbl
WHERE dt = '2020-06-06';
```

Build a columnstore index for incremental data only

To skip historical data in a wide table and build a columnstore index only for incremental data, specify the lindorm_columnar.user.syncer.skip.fullsync = 'true' parameter. The following is an example:

CREATE INDEX my_tbl_idx USING COLUMNAR ON my_tbl(*)
PARTITION BY ENUMERABLE (pk1, pk2, bucket(128, pk0))
WITH (`lindorm_columnar.user.index.database` = 'my_index_db',
      `lindorm_columnar.user.index.table` = 'my_index_tbl',
      `lindorm_columnar.user.syncer.skip.fullsync` = 'true');

Expand and store JSON fields

The columnstore index supports expanding and storing JSON fields during data synchronization. Two methods are supported: static expansion and storage and dynamic expansion and storage.

Assume that the sample table my_json_tbl has the following table schema:

+-------------+-------------+--------+----------------+
| TABLE_NAME  | COLUMN_NAME |  TYPE  | IS_PRIMARY_KEY |
+-------------+-------------+--------+----------------+
| my_json_tbl | id          | BIGINT | true           |
| my_json_tbl | col1        | INT    | false          |
| my_json_tbl | json_col    | JSON   | false          |
+-------------+-------------+--------+----------------+

Execute the following statement to insert JSON-formatted data:

UPSERT INTO my_json_tbl (id,col1,json_col) VALUES(2,2,'{"a": {"b": {"c": "hello,world", "d": 123}, "e": false }, "f": 3.14}');

json_col has the following structure:

Click to view the JSON column data structure

{
  "a": {
    "b": {
      "c": "hello,world",
      "d": 123
    },
    "e": false
  },
  "f": 3.14
}

Static expansion

When you create a columnstore index, specify `lindorm_columnar.user.syncer.lci.jsonMapping.<JSON_COL>` = '<JSON_MAPPING_RULE>' to define a static mapping between the JSON fields of a wide table and the fields of a column-oriented table. The following is an example:

CREATE INDEX columnar_idx USING COLUMNAR ON my_json_tbl(*) 
PARTITION BY ENUMERABLE (ifnull(id%16, 0) as dt, bucket(16,id)) 
WITH (
  `lindorm_columnar.user.syncer.lci.jsonMapping.json_col` = 'a.b.c VARCHAR, a.e BOOLEAN ,f DOUBLE',
  `lindorm_columnar.user.index.database` = 'my_index_db',
  `lindorm_columnar.user.index.table` = 'my_index_tbl');

Note

lindorm_columnar.user.syncer.lci.jsonMapping.json_col: Specifies the JSON column to statically expand. In this example, the column is json_col.
a.b.c VARCHAR,a.e BOOLEAN ,f DOUBLE: Specifies each expanded field. Separate the fields with a comma (,).
- Field name: The JSON path of the expanded field. Separate path segments with a period (.). For example, a.b.c.
- Field type: Supported data types are BOOLEAN, BYTE, SHORT, INTEGER, LONG, FLOAT, DOUBLE, and VARCHAR.
Use the WITH keyword to specify multiple lindorm_columnar.user.syncer.lci.jsonMapping parameters and create mappings for multiple JSON fields.
Do not define the same JSON field in both static and dynamic expansion mappings.

Dynamic column expansion (Public preview)

When you create a columnstore index, specify `lindorm_columnar.user.syncer.lci.dynamicJsonColumns` = '<JSON_COL1>,<JSON_COL2>' to dynamically expand JSON fields from a wide table and map them to a column-oriented table. The following is an example:

CREATE INDEX columnar_idx USING COLUMNAR ON my_json_tbl(*) 
PARTITION BY ENUMERABLE (ifnull(id%16, 0) as dt, bucket(16,id)) 
WITH (
  `lindorm_columnar.user.syncer.lci.dynamicJsonColumns` = 'json_col',
  `lindorm_columnar.user.index.database` = 'my_index_db',
  `lindorm_columnar.user.index.table` = 'my_index_tbl');

Note

lindorm_columnar.user.syncer.lci.dynamicJsonColumns: Specifies the JSON columns to dynamically expand. In this example, the specified column is json_col. You can specify multiple JSON columns for dynamic expansion. Separate the column names with a comma (,), for example, json_col1,json_col2.
The columnstore index infers data types for the column-oriented table from the actual storage types in the JSON data. The only supported inferred types are BOOLEAN, LONG, DOUBLE, and STRING. If a field contains values of multiple types, the column-oriented table uses the STRING type for storage.
Dynamic expansion of JSON fields is not supported when you build a columnstore index for historical data.
Do not define the same JSON field in both static and dynamic expansion mappings.

When you configure a JSON field for static expansion and storage or dynamic expansion and storage, the original JSON field is not stored in the column-oriented table. Additionally, the names of the expanded JSON fields in the column-oriented table are prefixed with the name of the corresponding JSON field.

For example, when you use `lindorm_columnar.user.syncer.lci.jsonMapping.json_col` = 'a.b.c VARCHAR, a.e BOOLEAN' to specify static JSON expansion for a columnstore index, the column-oriented table will not contain a column named json_col. Instead, it will contain the following columns:

A column named json_col.a.b.c of type STRING that stores the a.b.c value from the json_col field.
A column named json_col.a.e of type BOOLEAN that stores the a.e value from the json_col field.

If you still want to sync the original JSON field, or you do not want the names of the expanded JSON fields in the column-oriented table to be prefixed with the corresponding JSON field names, you can use the following methods:

Sync original JSON fields

To sync the original JSON field, specify `lindorm_columnar.user.syncer.lci.json.syncOriginalJsonContent` = 'true' when you create an index. The column-oriented table will then contain the following columns:

A column named json_col of the STRING type. This column stores the value of the json_col field.
A column named json_col.a.b.c of the STRING type. This column stores the value of a.b.c from the json_col field.
A column named json_col.a.e of the BOOLEAN type. This column stores the value of a.e from the json_col field.

Omit the JSON field name prefix from field names in the column-oriented table

To prevent the names of expanded JSON fields in the column-oriented table from being prefixed with the original JSON field name, specify `lindorm_columnar.user.syncer.lci.json.ignoreJsonMappingPrefix` = 'true' when you create an index. The column-oriented table will then contain the following columns:

A column named a.b.c of the STRING type. This column stores the value of a.b.c from the json_col field.
A column named a.e of the BOOLEAN type. This column stores the value of a.e from the json_col field.

Important

If different JSON fields, such as json_col1 and json_col2, contain the same mapping to expand and store the value of a.b.c, the columnstore index creation fails.

Dynamically sense schema evolution (Public preview)

During data synchronization, the columnstore index can dynamically sense schema changes in the data table and apply them to the schema of the column-oriented table. When you create a columnstore index, you can specify `lindorm_columnar.user.syncer.lci.dynamicSchema` = 'true' to ensure that the schema of the column-oriented table remains consistent with the primary table schema. The following is an example:

CREATE INDEX my_tbl_idx USING COLUMNAR
ON my_tbl(*) 
PARTITION BY ENUMERABLE (pt_d, bucket(128, pk0))
WITH (
  `lindorm_columnar.user.index.database` = 'my_index_db',
  `lindorm_columnar.user.index.table` = 'my_index_tbl',
  `lindorm_columnar.user.syncer.lci.dynamicSchema` = 'true');

Important

If the primary table has dynamic columns or wildcard columns, and you specify lindorm_columnar.user.syncer.lci.dynamicSchema = 'true' when creating the columnstore index, the dynamic and wildcard columns of the primary table are synchronized to the columnstore index.

Add columns to a column-oriented table (Public preview)

After creating a columnstore index, you can use the ALTER INDEX syntax to add fields to the columnstore index without reindexing. You can add regular fields or mapping columns for static expansion of JSON fields.

Assume that the schema of the table my_json_tbl is as follows:

+-------------+-------------+---------+----------------+
|  TABLE_NAME | COLUMN_NAME |  TYPE   | IS_PRIMARY_KEY |
+-------------+-------------+---------+----------------+
| my_json_tbl | id          | BIGINT  | true           |
| my_json_tbl | col1        | INT     | false          |
| my_json_tbl | col2        | VARCHAR | false          |
| my_json_tbl | json_col1   | JSON    | false          |
| my_json_tbl | json_col2   | JSON    | false          |
+--------------+-------------+---------+----------------+

You can use the following SQL statement to create a columnstore index:

CREATE INDEX columnar_idx USING COLUMNAR ON my_json_tbl(id, col1, json_col1) 
PARTITION BY ENUMERABLE (ifnull(id%16, 0) as dt, bucket(16,id)) 
WITH (
  `lindorm_columnar.user.syncer.lci.jsonMapping.json_col1` = 'a.b.c VARCHAR, a.e BOOLEAN',
  `lindorm_columnar.user.index.database` = 'my_index_db',
  `lindorm_columnar.user.index.table` = 'my_index_tbl');

At this point, the column-oriented table does not contain columns related to col2 or json_col2. You can use the following statement to add a regular field to the columnstore index:

ALTER INDEX IF EXISTS columnar_idx ON my_json_tbl ADD COLUMNS(col2);

You can also use the following statement to add a static mapping rule for a JSON field to the columnstore index:

ALTER INDEX IF EXISTS columnar_idx ON my_json_tbl
ADD COLUMNS (
  json_extract_long(json_col2, '$.key1'),
  json_extract_boolean(json_col2, '$.key2'),
  json_extract_double(json_col2, '$.key3.key4')
);

Note

Currently, only the json_extract_boolean, json_extract_long, json_extract_double, and json_extract_string extraction functions are supported.

FAQ

Q: Are there extra fees for creating a columnstore index?
A: Yes, there are. The primary fees include storage costs for the columnstore index data and the cost of CUs used for data synchronization between the primary table and the columnstore index.
Q: Can the partition filter expression contain non-primary key fields?
A: No, it cannot. All fields in the partition filter expression must be primary key fields.
Q: Can a bucket partition expression contain a complex partition filter expression?
A: No, it cannot. A bucket partition expression includes only the bucket_num and the bucket partition field.
Q: What are the effects of having too many or too few partitions?
A: Too many partitions can lead to data bloat, which affects query efficiency. Therefore, ensure that the data volume of a single partition is greater than 50 MB, and the bucket_num in the bucket partition expression is less than 1024. Too few partitions can affect the data read and write throughput or cause data skew. Ensure that the data volume of a single partition is less than 512 MB.
Q: Can I directly access columnstore index data using the Lindorm compute engine?
A: Yes, you can. You must first customize the name of the index table and then access the columnstore index data through the compute engine. For more information, see Access columnstore data.
Important
Exercise caution when you modify a columnstore index table. To modify a columnstore index table, contact Lindorm technical support (DingTalk ID: s0s3eg3).
Q: Can I create multiple columnstore indexes for the same wide table?
A: No, you cannot. You can create only one columnstore index for a wide table.
Q: If data in a wide table is purged because its TTL expires, is the data in the columnstore index automatically purged?
A: No, it is not.
Q: My columnstore index creation failed. Why do I get an error when I try to create it again?
A: You can create only one columnstore index for a wide table, even if a previous attempt to create an index failed. You must first delete the failed columnstore index before you can create a new one. For the syntax to delete a columnstore index, see DROP INDEX.