Indexes are an important way to accelerate data queries. A columnstore index enhances analytical computing capabilities for massive data in wide tables. It is primarily used for scenarios such as device information statistics in the Internet of Vehicles (IoV) and Internet of Things (IoT), data analytics in e-commerce, and order statistics in the logistics industry. This topic describes the basic and advanced uses of the columnstore index to help you quickly get started with and master this feature.
Prerequisites
The compute engine is enabled. For more information, see Enable services.
LindormDFS is enabled, and its version is 4.0.0 or later.
The wide table engine is enabled, and its version is 2.5.0 or later.
Notes
Columnstore indexes cannot be built synchronously.
Building a columnstore index takes about 15 minutes. If there are many background index building tasks or a large volume of business data, it may take longer.
Building a columnstore index requires reading back data, which generates read operations. If the hot and cold data separation feature is enabled for the instance, be aware of the throttling of cold storage (storage-optimized cloud storage). If read operations on cold storage are throttled, the index building efficiency is directly affected. This can lead to backpressure on write operations.
Enable columnstore index
Log on to the Lindorm console.
In the upper-left corner of the page, select the region where the instance is deployed.
On the Instances page, click the ID of the target instance or click View Instance Details in the Actions column for the instance.
In the navigation pane on the left, choose Wide Table Engine.
Click the Columnar Index tab, and then click Activate Now.
In the dialog box that appears, click OK.
Quick Start
Assume that you need to perform efficient, parallel data analysis on a massive data table named my_tbl. To do this, you must create a columnstore index for the table.
The schema of the sample table my_tbl is as follows:
+------------+-------------+---------+----------------+
| TABLE_NAME | COLUMN_NAME | TYPE | IS_PRIMARY_KEY |
+------------+-------------+---------+----------------+
| my_tbl | pk0 | INT | true |
| my_tbl | pk1 | VARCHAR | true |
| my_tbl | pt_d | VARCHAR | true |
| my_tbl | col0 | INT | false |
| my_tbl | col1 | VARCHAR | false |
| my_tbl | json_col0 | JSON | false |
+------------+-------------+---------+----------------+The primary key pk0 identifies each row and has a large cardinality. The primary key pt_d is the date on which the data was generated. Data is typically analyzed on a daily basis.
Create a columnstore index. The columnstore index can automatically expand the data stored in JSON fields.
If the schema of
my_tblis stable and does not change frequently, execute the following statement:CREATE INDEX my_tbl_idx USING COLUMNAR ON my_tbl(*) PARTITION BY ENUMERABLE (pt_d, bucket(128, pk0)) WITH ( `lindorm_columnar.user.index.database` = 'my_index_db', `lindorm_columnar.user.index.table` = 'my_index_tbl', `lindorm_columnar.user.syncer.lci.dynamicJsonColumns` = 'json_col0' );The columnstore index creates an index table based on the current table schema and the structure of
json_col0.If the schema of
my_tblmay change frequently, execute the following statement:CREATE INDEX my_tbl_idx USING COLUMNAR ON my_tbl(*) PARTITION BY ENUMERABLE (pt_d, bucket(128, pk0)) WITH ( `lindorm_columnar.user.index.database` = 'my_index_db', `lindorm_columnar.user.index.table` = 'my_index_tbl', `lindorm_columnar.user.syncer.lci.dynamicJsonColumns` = 'json_col0', `lindorm_columnar.user.syncer.lci.dynamicSchema` = 'true');The columnstore index creates an index table based on the current table schema and the structure of
json_col0. It then dynamically expands the index table based on subsequent changes to the table schema and the content of the JSON field.
View the index status.
SHOW INDEX FROM my_tbl;For more information about how to use
SHOW INDEXand for a description of the result set, see SHOW INDEX.Use the columnstore index for data query and analysis. For more information, see Use a columnstore index.
Basic usage
Assume that the schema of the sample table my_tbl is as follows:
+------------+-------------+---------+----------------+
| TABLE_NAME | COLUMN_NAME | TYPE | IS_PRIMARY_KEY |
+------------+-------------+---------+----------------+
| my_tbl | pk0 | INT | true |
| my_tbl | pk1 | VARCHAR | true |
| my_tbl | pk2 | VARCHAR | true |
| my_tbl | col0 | INT | false |
| my_tbl | col1 | VARCHAR | false |
+------------+-------------+---------+----------------+Create a columnstore index
Syntax
CREATE INDEX index_name USING COLUMNAR
ON table_name(column_name(,..))
PARTITION BY ENUMERABLE (column_name(,...), bucket(bucket_num, column_name))
WITH (`lindorm_columnar.user.index.database` = 'columnar_db_name',
`lindorm_columnar.user.index.table` = 'columnar_tbl_name');Parameter description
Parameter | Description |
index_name | The name of the columnstore index. It can contain uppercase letters, lowercase letters, digits, and underscores (_). |
table_name | The name of the wide table. |
column_name(,...)) | The list of fields for which to create the columnstore index. Separate multiple fields with commas (,). The supported field types include TINYINT, SMALLINT, INTEGER, BIGINT, LONG, FLOAT, DOUBLE, VARCHAR, BINARY, VARBINARY, BOOLEAN, DECIMAL, JSON, DATE, and TIMESTAMP. Note This list must include all primary key fields of the wide table. To create a columnstore index for all fields (primary key and non-primary key), you can use |
PARTITION BY ENUMERABLE(column_name(,...), bucket(bucket_num, column_name)) | Specifies that the index data is partitioned using the enumeration algorithm to improve retrieval performance during queries. The partition filter expression includes regular partition expressions and bucket partition expressions. The fields in both types of expressions must be primary key fields of the wide table.
Note When you design a partition policy, be careful not to use high-cardinality fields (fields with many unique values) as regular partition keys. Using them directly as regular partition expressions creates many partitions and small files, which affects the storage system. |
WITH(`key` = 'value') | Use the
|
Example
Create a columnstore index for the table my_tbl:
CREATE INDEX my_tbl_idx USING COLUMNAR
ON my_tbl(pk0, pk1, pk2, col0, col1)
PARTITION BY ENUMERABLE (pk1, pk2, bucket(128, pk0))
WITH (`lindorm_columnar.user.index.database` = 'my_index_db',
`lindorm_columnar.user.index.table` = 'my_index_tbl');View a columnstore index
After a columnstore index is created, its data is built continuously. The wide table, which is the primary table, continuously synchronizes its data to the columnstore index table. Data synchronization includes historical data and incremental data synchronization. During incremental data synchronization, the index data lags behind the primary table data. This delay is typically less than 1 hour.
You can use the SHOW INDEX statement to view the status of a columnstore index. For more information about how to use SHOW INDEX and for a description of the result set, see SHOW INDEX.
Use a columnstore index
A columnstore index can enhance the analytical computing capabilities for massive data in wide tables. You can specify HINT parameters in your SELECT query statement to route the query request to the compute engine. The compute engine then uses the columnstore index to accelerate the query, which improves the efficiency of big data computing.
Example 1: Big data statistics
SELECT /*+ _use_ldps_(cg0), _columnar_index_ */ COUNT(*), SUM(col0), MIN(col0), MAX(col0)
FROM my_index_db.my_index_tbl
WHERE col0 > 100 AND col0 < 200 OR col0 > 500
GROUP BY pk1;Example 2: Big data sorting
SELECT /*+ _use_ldps_(cg0), _columnar_index_ */ pk0 + col0, pk1
FROM my_index_db.my_index_tbl
WHERE col0 > 100 AND col0 < 200 OR col0 > 500
ORDER BY pk1
LIMIT 100;Example 3: Big data association
If you have created columnstore indexes for multiple wide tables, you can also associate data between the wide tables.
SELECT /*+ _use_ldps_(cg0), _columnar_index_ */ *
FROM my_index_db.my_index_tbl0 as t0
JOIN my_index_db.my_index_tbl1 as t1
ON t0.pk0 = t1.pk0
AND t0.pk1 = t1.pk1
LIMIT 100;Delete a columnstore index
You can use the DROP INDEX statement to delete a columnstore index. For more information about how to use DROP INDEX and for examples, see DROP INDEX.
Advanced usage
Complex partition filter expressions
Assume that the schema of the sample table my_ts_tbl is as follows:
+------------+-------------+---------+----------------+
| TABLE_NAME | COLUMN_NAME | TYPE | IS_PRIMARY_KEY |
+------------+-------------+---------+----------------+
| my_ts_tbl | id | INT | true |
| my_ts_tbl | ts | LONG | true |
| my_ts_tbl | col0 | VARCHAR | false |
| my_ts_tbl | col1 | INT | false |
+------------+-------------+---------+----------------+When you create a columnstore index, if the primary key fields of the wide table cannot be directly used in the partition expression for the columnstore index, you can include calculation logic in the partition expression. The following is an example.
Create a columnstore index for all fields of the wide table and partition the columnstore index data by day based on the timestamp field
ts:CREATE INDEX my_ts_idx USING COLUMNAR ON my_ts_tbl(*) PARTITION BY ENUMERABLE (ifnull(substring(from_unixtime(ts), 0, 10), 'unknown') AS dt, bucket(128, id)) WITH (`lindorm_columnar.user.index.database` = 'my_ts_index_db', `lindorm_columnar.user.index.table` = 'my_ts_index_tbl');After the columnstore index is created, you can specify filter conditions in your query statement to query data in the columnstore index.
SELECT /*+ _use_ldps_ */ COUNT(1) FROM lindorm_columnar.my_ts_index_db.my_ts_index_tbl WHERE dt = '2020-06-06';
Build a columnstore index for incremental data only
To skip historical data in a wide table and build a columnstore index only for incremental data, specify the lindorm_columnar.user.syncer.skip.fullsync = 'true' parameter. The following is an example:
CREATE INDEX my_tbl_idx USING COLUMNAR ON my_tbl(*)
PARTITION BY ENUMERABLE (pk1, pk2, bucket(128, pk0))
WITH (`lindorm_columnar.user.index.database` = 'my_index_db',
`lindorm_columnar.user.index.table` = 'my_index_tbl',
`lindorm_columnar.user.syncer.skip.fullsync` = 'true');Expand and store JSON fields
The columnstore index supports expanding and storing JSON fields during data synchronization. Two methods are supported: static expansion and storage and dynamic expansion and storage.
Assume that the sample table my_json_tbl has the following table schema:
+-------------+-------------+--------+----------------+
| TABLE_NAME | COLUMN_NAME | TYPE | IS_PRIMARY_KEY |
+-------------+-------------+--------+----------------+
| my_json_tbl | id | BIGINT | true |
| my_json_tbl | col1 | INT | false |
| my_json_tbl | json_col | JSON | false |
+-------------+-------------+--------+----------------+Execute the following statement to insert JSON-formatted data:
UPSERT INTO my_json_tbl (id,col1,json_col) VALUES(2,2,'{"a": {"b": {"c": "hello,world", "d": 123}, "e": false }, "f": 3.14}');json_col has the following structure:
Static expansion
When you create a columnstore index, specify `lindorm_columnar.user.syncer.lci.jsonMapping.<JSON_COL>` = '<JSON_MAPPING_RULE>' to define a static mapping between the JSON fields of a wide table and the fields of a column-oriented table. The following is an example:
CREATE INDEX columnar_idx USING COLUMNAR ON my_json_tbl(*)
PARTITION BY ENUMERABLE (ifnull(id%16, 0) as dt, bucket(16,id))
WITH (
`lindorm_columnar.user.syncer.lci.jsonMapping.json_col` = 'a.b.c VARCHAR, a.e BOOLEAN ,f DOUBLE',
`lindorm_columnar.user.index.database` = 'my_index_db',
`lindorm_columnar.user.index.table` = 'my_index_tbl');lindorm_columnar.user.syncer.lci.jsonMapping.json_col: Specifies the JSON column to statically expand. In this example, the column is
json_col.a.b.c VARCHAR,a.e BOOLEAN ,f DOUBLE: Specifies each expanded field. Separate the fields with a comma (,).Field name: The JSON path of the expanded field. Separate path segments with a period (.). For example,
a.b.c.Field type: Supported data types are BOOLEAN, BYTE, SHORT, INTEGER, LONG, FLOAT, DOUBLE, and VARCHAR.
Use the
WITHkeyword to specify multiplelindorm_columnar.user.syncer.lci.jsonMappingparameters and create mappings for multiple JSON fields.Do not define the same JSON field in both static and dynamic expansion mappings.
Dynamic column expansion (Public preview)
When you create a columnstore index, specify `lindorm_columnar.user.syncer.lci.dynamicJsonColumns` = '<JSON_COL1>,<JSON_COL2>' to dynamically expand JSON fields from a wide table and map them to a column-oriented table. The following is an example:
CREATE INDEX columnar_idx USING COLUMNAR ON my_json_tbl(*)
PARTITION BY ENUMERABLE (ifnull(id%16, 0) as dt, bucket(16,id))
WITH (
`lindorm_columnar.user.syncer.lci.dynamicJsonColumns` = 'json_col',
`lindorm_columnar.user.index.database` = 'my_index_db',
`lindorm_columnar.user.index.table` = 'my_index_tbl');lindorm_columnar.user.syncer.lci.dynamicJsonColumns: Specifies the JSON columns to dynamically expand. In this example, the specified column is
json_col. You can specify multiple JSON columns for dynamic expansion. Separate the column names with a comma (,), for example,json_col1,json_col2.The columnstore index infers data types for the column-oriented table from the actual storage types in the JSON data. The only supported inferred types are BOOLEAN, LONG, DOUBLE, and STRING. If a field contains values of multiple types, the column-oriented table uses the STRING type for storage.
Dynamic expansion of JSON fields is not supported when you build a columnstore index for historical data.
Do not define the same JSON field in both static and dynamic expansion mappings.
When you configure a JSON field for static expansion and storage or dynamic expansion and storage, the original JSON field is not stored in the column-oriented table. Additionally, the names of the expanded JSON fields in the column-oriented table are prefixed with the name of the corresponding JSON field.
For example, when you use `lindorm_columnar.user.syncer.lci.jsonMapping.json_col` = 'a.b.c VARCHAR, a.e BOOLEAN' to specify static JSON expansion for a columnstore index, the column-oriented table will not contain a column named json_col. Instead, it will contain the following columns:
A column named
json_col.a.b.cof type STRING that stores thea.b.cvalue from thejson_colfield.A column named
json_col.a.eof type BOOLEAN that stores thea.evalue from thejson_colfield.
If you still want to sync the original JSON field, or you do not want the names of the expanded JSON fields in the column-oriented table to be prefixed with the corresponding JSON field names, you can use the following methods:
Sync original JSON fields
To sync the original JSON field, specify `lindorm_columnar.user.syncer.lci.json.syncOriginalJsonContent` = 'true' when you create an index. The column-oriented table will then contain the following columns:
A column named
json_colof the STRING type. This column stores the value of thejson_colfield.A column named
json_col.a.b.cof the STRING type. This column stores the value ofa.b.cfrom thejson_colfield.A column named
json_col.a.eof the BOOLEAN type. This column stores the value ofa.efrom thejson_colfield.
Omit the JSON field name prefix from field names in the column-oriented table
To prevent the names of expanded JSON fields in the column-oriented table from being prefixed with the original JSON field name, specify `lindorm_columnar.user.syncer.lci.json.ignoreJsonMappingPrefix` = 'true' when you create an index. The column-oriented table will then contain the following columns:
A column named
a.b.cof the STRING type. This column stores the value ofa.b.cfrom thejson_colfield.A column named
a.eof the BOOLEAN type. This column stores the value ofa.efrom thejson_colfield.
If different JSON fields, such as json_col1 and json_col2, contain the same mapping to expand and store the value of a.b.c, the columnstore index creation fails.
Dynamically sense schema evolution (Public preview)
During data synchronization, the columnstore index can dynamically sense schema changes in the data table and apply them to the schema of the column-oriented table. When you create a columnstore index, you can specify `lindorm_columnar.user.syncer.lci.dynamicSchema` = 'true' to ensure that the schema of the column-oriented table remains consistent with the primary table schema. The following is an example:
CREATE INDEX my_tbl_idx USING COLUMNAR
ON my_tbl(*)
PARTITION BY ENUMERABLE (pt_d, bucket(128, pk0))
WITH (
`lindorm_columnar.user.index.database` = 'my_index_db',
`lindorm_columnar.user.index.table` = 'my_index_tbl',
`lindorm_columnar.user.syncer.lci.dynamicSchema` = 'true');If the primary table has dynamic columns or wildcard columns, and you specify lindorm_columnar.user.syncer.lci.dynamicSchema = 'true' when creating the columnstore index, the dynamic and wildcard columns of the primary table are synchronized to the columnstore index.
Add columns to a column-oriented table (Public preview)
After creating a columnstore index, you can use the ALTER INDEX syntax to add fields to the columnstore index without reindexing. You can add regular fields or mapping columns for static expansion of JSON fields.
Assume that the schema of the table my_json_tbl is as follows:
+-------------+-------------+---------+----------------+
| TABLE_NAME | COLUMN_NAME | TYPE | IS_PRIMARY_KEY |
+-------------+-------------+---------+----------------+
| my_json_tbl | id | BIGINT | true |
| my_json_tbl | col1 | INT | false |
| my_json_tbl | col2 | VARCHAR | false |
| my_json_tbl | json_col1 | JSON | false |
| my_json_tbl | json_col2 | JSON | false |
+--------------+-------------+---------+----------------+You can use the following SQL statement to create a columnstore index:
CREATE INDEX columnar_idx USING COLUMNAR ON my_json_tbl(id, col1, json_col1)
PARTITION BY ENUMERABLE (ifnull(id%16, 0) as dt, bucket(16,id))
WITH (
`lindorm_columnar.user.syncer.lci.jsonMapping.json_col1` = 'a.b.c VARCHAR, a.e BOOLEAN',
`lindorm_columnar.user.index.database` = 'my_index_db',
`lindorm_columnar.user.index.table` = 'my_index_tbl');At this point, the column-oriented table does not contain columns related to col2 or json_col2. You can use the following statement to add a regular field to the columnstore index:
ALTER INDEX IF EXISTS columnar_idx ON my_json_tbl ADD COLUMNS(col2);You can also use the following statement to add a static mapping rule for a JSON field to the columnstore index:
ALTER INDEX IF EXISTS columnar_idx ON my_json_tbl
ADD COLUMNS (
json_extract_long(json_col2, '$.key1'),
json_extract_boolean(json_col2, '$.key2'),
json_extract_double(json_col2, '$.key3.key4')
);Currently, only the json_extract_boolean, json_extract_long, json_extract_double, and json_extract_string extraction functions are supported.
FAQ
Q: Are there extra fees for creating a columnstore index?
A: Yes, there are. The primary fees include storage costs for the columnstore index data and the cost of CUs used for data synchronization between the primary table and the columnstore index.
Q: Can the partition filter expression contain non-primary key fields?
A: No, it cannot. All fields in the partition filter expression must be primary key fields.
Q: Can a bucket partition expression contain a complex partition filter expression?
A: No, it cannot. A bucket partition expression includes only the
bucket_numand thebucketpartition field.Q: What are the effects of having too many or too few partitions?
A: Too many partitions can lead to data bloat, which affects query efficiency. Therefore, ensure that the data volume of a single partition is greater than 50 MB, and the
bucket_numin the bucket partition expression is less than 1024. Too few partitions can affect the data read and write throughput or cause data skew. Ensure that the data volume of a single partition is less than 512 MB.Q: Can I directly access columnstore index data using the Lindorm compute engine?
A: Yes, you can. You must first customize the name of the index table and then access the columnstore index data through the compute engine. For more information, see Access columnstore data.
ImportantExercise caution when you modify a columnstore index table. To modify a columnstore index table, contact Lindorm technical support (DingTalk ID: s0s3eg3).
Q: Can I create multiple columnstore indexes for the same wide table?
A: No, you cannot. You can create only one columnstore index for a wide table.
Q: If data in a wide table is purged because its TTL expires, is the data in the columnstore index automatically purged?
A: No, it is not.
Q: My columnstore index creation failed. Why do I get an error when I try to create it again?
A: You can create only one columnstore index for a wide table, even if a previous attempt to create an index failed. You must first delete the failed columnstore index before you can create a new one. For the syntax to delete a columnstore index, see DROP INDEX.