
Lindorm: Secondary indexes

Last Updated: Apr 09, 2024

LindormTable supports the secondary index feature in the Tabular model. In query scenarios where primary key columns are not specified in match conditions, native secondary indexes can help you reduce the complexity of application development, ensure data consistency, and improve write efficiency. This topic describes the secondary index feature in the Lindorm Tabular model and provides usage examples.

Background information

In the Lindorm Tabular model, a wide table has a table schema, and the data types of the columns in the wide table are predefined. The native secondary index feature of Lindorm has been used in Alibaba Cloud for years, and its performance has been proven during multiple Double 11 Shopping Festivals. This feature is suitable for global indexing of large amounts of data.

  • Comparison of response time (RT) between Lindorm and Apache Phoenix

    | Product | RT for single-row write | RT for single-row read | RT for 100-row batch write |
    | --- | --- | --- | --- |
    | Lindorm | 1.409 | 0.492 | 10.447 |
    | Apache Phoenix | 2.264 | 2.594 | 26.224 |

    According to the results, Lindorm takes less time to read and write the same amount of data in indexing scenarios. Compared with the RT of Apache Phoenix, the RT of Lindorm is 38% shorter for single-row write operations, 81% shorter for single-row read operations, and 60% shorter for 100-row write operations.

  • Comparison of throughput between Lindorm and Apache Phoenix

    | Product | BatchPut | Put | Get | Scan |
    | --- | --- | --- | --- | --- |
    | Lindorm | 198174 | 73214 | 156372 | 1283451 |
    | Apache Phoenix | 56923 | 33156 | 25910 | 224085 |

    In indexing scenarios, the throughput of Lindorm is about 3.5 times that of Apache Phoenix for BatchPut operations, 2.2 times that of Apache Phoenix for Put operations, 6 times that of Apache Phoenix for Get operations, and 5.7 times that of Apache Phoenix for Scan operations.

Overview

The Lindorm secondary index feature allows you to create multiple indexes for a single table. Each index is physically stored as a data table that is independent of the base table, so you can specify different storage policies for each index, such as different compression algorithms and hot and cold data separation policies. When you write data to a base table, Lindorm automatically updates all associated index tables and ensures data consistency between the base table and the index tables. When you query data, you only need to query the base table: Lindorm automatically selects an appropriate index, or the base table itself, based on the WHERE condition and the schemas of the base table and the index tables. You can also use hints to influence the behavior of the optimizer. Lindorm secondary indexes provide the following basic features:

  • Multiple indexes can be created for a single base table.

  • Composite indexes are supported. You can use one or more columns in a base table to create an index.

  • Covering indexes are supported. A covering index can include all the columns in a base table. When a new column is added to the base table, the column is automatically added to such an index.

  • Query optimization is provided. Lindorm automatically selects an index based on the WHERE clause for a query and allows you to use hints to intervene in the behavior of the optimizer.

  • Online schema changes are supported. Changes in indexes do not affect the read and write operations that are performed on the associated base tables. You can create, delete, and update indexes for a base table based on your business requirements.

  • The time-to-live (TTL) feature is supported. The TTL value that is specified for a base table automatically takes effect on all associated index tables. When the data in a base table expires, the data in the index tables also expires.

  • Dynamic columns are supported. You can create secondary indexes based on dynamic columns and specify dynamic columns as included columns in a secondary index.

  • Custom data versions are supported. After you specify a custom timestamp for a data record in a base table, the custom timestamp automatically takes effect on the corresponding data records in the associated index tables.

Environment requirements

Terms

  • Strong consistency: Lindorm ensures the data consistency between a base table and the associated index tables. To meet the requirements of data consistency, minimize the overheads of secondary indexes, and improve the write throughput, the following limits related to strong consistency are imposed on the secondary indexes of Lindorm:

    • Snapshot isolation is not supported. When data is being written to a base table, the data may not be available for queries. After the system returns a successful response to notify the client that the data is written, the data becomes available for queries in the base table and associated index tables.

    • If a client timeout error or an I/O error occurs, the data may not be available for queries in the base table or associated index tables. However, eventual consistency is ensured between the data in the base table and the data in the index tables.

  • Mutability options: If a base table has an index, a single write can require four operations: read the existing row from the base table, delete the historical entry from the index table, write the new entry to the index table, and write the row to the base table. This process increases the index maintenance cost. However, this write amplification does not occur in every scenario. For example, in logging scenarios data is only inserted and never updated, so no historical data exists in the index table and the system only needs to write to the index table and to the base table. Therefore, Lindorm introduces the concept of mutability. Mutability classifies the write pattern of a base table so that index data can be organized at the lowest possible cost for different business requirements. The following table describes the mutability types. You can use the Table_Options parameter to configure the mutability attribute for a table when you create or modify the table. For more information, see CREATE TABLE.

    | Type | Constraint | Cost (operations per write) | Description |
    | --- | --- | --- | --- |
    | No index is created for the base table | N/A | 1 | The system directly writes data to the base table. Only one operation is required to complete a write. |
    | IMMUTABLE | The system writes data by rows. Data cannot be updated or deleted. | 2 | The system writes data to the base table and to the index table. This mutability type incurs the lowest cost and delivers the best performance in all scenarios. |
    | IMMUTABLE_ROWS | The system writes data by rows. Data cannot be updated but can be deleted. | 2 to 3 | In normal write scenarios, the system writes data to the base table and to the index table. When data needs to be deleted, an additional read of the base table is required. This mutability type delivers performance that is second only to IMMUTABLE. |
    | MUTABLE_LATEST | Data can be updated and deleted. | 4 | The system reads data from the base table, deletes historical data from the index table, writes data to the index table, and writes data to the base table. |
    | MUTABLE_ALL | No limits are imposed. You can use custom timestamps to write data. | 4 | The system reads data from the base table, deletes historical data from the index table, writes data to the index table, and writes data to the base table. |

    Note
    • If you set the mutability attribute to IMMUTABLE or IMMUTABLE_ROWS, no update operation is performed. As a result, no write amplification occurs, and the costs are minimized. These two mutability types are suitable for high-throughput data write scenarios, such as logging and monitoring.

    • If you set the mutability attribute to IMMUTABLE, no delete operation is performed. This way, you can fully use the multi-zone deployment capability provided by Lindorm to implement active-active data access.

    • If you set the mutability attribute to IMMUTABLE or IMMUTABLE_ROWS, the write latency of tables that have indexes is reduced and the overall throughput is improved. If the IMMUTABLE type does not directly fit your business scenario, you can often add data redundancy to your data model so that writes become insert-only and the IMMUTABLE type can be used.
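
    For illustration, the following is a minimal sketch of how the mutability attribute might be declared when a table is created. The MUTABILITY option name and its key-value placement are assumptions modeled on the index option syntax shown later in this topic; see CREATE TABLE for the authoritative syntax.

    -- Hypothetical sketch: create an insert-only log table and declare it IMMUTABLE.
    -- The 'MUTABILITY' option name is an assumption; confirm it in the CREATE TABLE reference.
    create table device_log (
        device_id varchar,
        ts bigint,
        payload varchar,
    constraint pk primary key(device_id, ts)) 'MUTABILITY'='IMMUTABLE';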

  • Index updates based on custom timestamps: Lindorm allows you to write and update data based on custom timestamps. An update takes effect only if it carries the largest timestamp for the affected data. Custom timestamps are important for managing data TTL and for handling out-of-order data and idempotent operations, and the feature is widely used in HBase databases. Lindorm supports column-level timestamps, so you can write data to base tables based on custom timestamps. In most NoSQL systems that support both secondary indexes and timestamps, it is difficult to update indexes based on custom timestamps: if data is not written in timestamp order, the update and delete operations on index data become hard to manage. Lindorm provides the global secondary index (GSI) feature to resolve this issue. You can use GSIs to update data based on column-level custom timestamps. The following business scenarios are examples of using custom timestamps:

    • Import historical data and update real-time data in one task: If you want to update real-time data and import historical data at the same time, you can specify the current timestamp for real-time data and specify 23:59:59 on the previous day for historical data. This way, the import operation updates the data that has not been updated on the day and does not overwrite the data that has been updated.

    • Clear message accumulation: A business system uses messages to trigger a series of processing logic. When messages accumulate, a system that uses custom timestamps can skip the backlog, directly process the most recent message, and then work through the accumulated historical messages afterward. Alternatively, if an issue in the business logic occurs, the system can skip some messages to bypass the issue and process them after the issue is fixed. In both cases, the business system writes data by using the timestamps of the messages, which ensures that historical messages processed later do not overwrite the data written by more recent messages.

  • Covering indexes that contain all columns in base tables: In most cases, a covering index is created based on specific columns in a base table so that the base table does not need to be scanned after the index is scanned. A common redundancy solution is to create a covering index that contains all the columns in the base table. Lindorm supports three redundancy modes, which are described below and illustrated in the sketch after this list. If the schema of a base table changes or the base table contains dynamic columns, you can still create covering indexes that implement full redundancy.

    • Create a covering index that contains specified columns in the base table: You can explicitly specify the columns in the base table as included columns.

    • Create a covering index that contains all columns in the schema of a base table: You do not need to explicitly specify each column in the CREATE INDEX statement. Instead, you can use a constant to specify that all columns are included in the index. After the covering index is created, any column that is later added to the base table is automatically added to the covering index. You do not need to rebuild the index, and you do not need to scan the base table to query data in the newly added column.

    • Create a covering index that contains dynamic columns in the base table: Lindorm supports fixed schemas and loose schemas. A loose schema refers to dynamic columns. If the DYNAMIC redundancy mode is used, the index table can automatically include all dynamic columns in the base table and all columns in the schema of the base table.
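
    The following sketch illustrates the first two redundancy modes by reusing the index option syntax that appears in the examples later in this topic. The table, column, and index names are placeholders; the option value for the dynamic-column (DYNAMIC) mode is not shown here because its exact constant is documented in CREATE INDEX.

    -- Covering index that explicitly includes the specified columns (placeholder names).
    create index idx_cover_partial on mytable (c1) include (c2, c3);

    -- Covering index that includes all columns in the table schema.
    create index idx_cover_all on mytable (c1) 'INDEX_COVERED_TYPE'='COVERED_ALL_COLUMNS_IN_SCHEMA';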

Syntax reference

For more information about the syntax related to secondary indexes, see the following topics:

  • CREATE INDEX

  • ALTER INDEX

  • BUILD INDEX

  • DROP INDEX

Create a secondary index

After you create a base table in a Lindorm instance, you can create a secondary index based on specified columns in the base table. The following procedure provides an example on how to create a secondary index for a base table:

-- Create a base table.
create table shop_item_relation (
    shop_id varchar,
    item_id varchar,
    status varchar,
constraint pk primary key(shop_id, item_id));

-- Create a covering index based on the primary key column item_id and specify all columns in the base table as included columns.
create index idxs1 on shop_item_relation (item_id) 'INDEX_COVERED_TYPE'='COVERED_ALL_COLUMNS_IN_SCHEMA';

-- Perform a query based on the index table. The index is created based on the item_id column, and the item_id column is specified in the WHERE condition. As a result, the system directly queries data from the index table.
select * from shop_item_relation where item_id = 'item2';
Note
  • You can create indexes in a synchronous or asynchronous manner. If a base table contains a small amount of data, you can use the synchronous method to create an index. In other cases, we recommend that you use the asynchronous method. For more information about the syntax, see CREATE INDEX.

  • When you execute the CREATE INDEX statement on a base table that already contains data, the statement also synchronizes the existing data in the base table to the index table. If the base table contains a large amount of data, the statement can take a long time to complete. The data synchronization task runs on the server, so the task is not affected even if the Lindorm Shell process exits.

View secondary indexes

You can use Lindorm SQL to view the status of all secondary indexes that are created for a base table. The following statement provides an example:

show index from shop_item_relation;
Note

In this example, the statement returns the name and type of each secondary index that is created for the shop_item_relation base table.

Change the status of a secondary index

After a secondary index is created for a base table, you must rebuild the index if the base table contains data. For more information about the syntax, see BUILD INDEX. If the base table does not contain data, you can execute the ALTER INDEX statement to change the status of the secondary index. The following statements provide examples on how to change the status of a secondary index:

alter index idxs1 on shop_item_relation ACTIVE;
alter index idxs1 on shop_item_relation DISABLED;
Note

If you change the status of a secondary index directly from DISABLED to ACTIVE, the index may be missing data that was written to the base table while the index was disabled, which can cause data loss in query results. Therefore, rebuild the index before you activate it.
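
For example, the following sequence brings a disabled index back into service. The BUILD INDEX syntax shown here is a sketch modeled on the other index statements in this topic; confirm it against the BUILD INDEX reference.

-- Rebuild the index to backfill the data that was written while the index was disabled.
-- Syntax is a sketch modeled on the other index statements in this topic; see BUILD INDEX.
build index idxs1 on shop_item_relation;
-- Activate the index after the rebuild is complete.
alter index idxs1 on shop_item_relation ACTIVE;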

Delete a secondary index

You can execute the DROP INDEX statement to delete a secondary index for a specified base table. The following statement provides an example:

drop index idxs1 on shop_item_relation;
Note

Before you delete an index, make sure that your account is granted the Trash permission.

Query optimization

When you perform a query, Lindorm selects a secondary index based on the rule-based optimization (RBO) policy. Lindorm matches the prefixes of the associated index tables based on the query conditions and uses the index table that has the highest match rate as the index for the query. The following sample code provides examples of optimized queries:

-- Create a base table, and then create indexes for the base table.
create table dt (rowkey varchar, c1 varchar, c2 varchar, c3 varchar, c4 varchar, c5 varchar, constraint pk primary key(rowkey));
create index idx1 on dt (c1);
create index idx2 on dt (c2, c3, c4);
create index idx3 on dt (c3) include (c1, c2, c4);
create index idx4 on dt (c5) 'INDEX_COVERED_TYPE'='COVERED_ALL_COLUMNS_IN_SCHEMA';

-- Execute the following statements to run optimized queries:
select rowkey from dt where c1 = 'a';
select rowkey from dt where c2 = 'b' and c4 = 'd';
select * from dt where c2 = 'b' and c3 >= 'c' and c3 < 'f';
select * from dt where c5 = 'c';
Note
  • When Lindorm executes the select rowkey from dt where c1 = 'a' statement, Lindorm selects the idx1 index to run this query.

  • When Lindorm executes the select rowkey from dt where c2 = 'b' and c4 = 'd' statement, Lindorm selects the idx2 index to run this query. The statement retrieves all rows that match the c2 = 'b' condition and then filters the rows based on the c4 = 'd' condition row by row. Although c4 is an index key column of idx2, it cannot be matched as part of the index prefix because column c3 is not specified in the WHERE clause.

  • When Lindorm executes the select * from dt where c2 = 'b' and c3 >= 'c' and c3 < 'f' statement, Lindorm selects the idx2 index to run this query. All the columns in the base table are selected in the query, but the index table does not contain all of these columns. As a result, the base table must be scanned after the index is scanned. During this scan, the system may make several remote procedure calls (RPCs) because the matching rows may be dispersed across the base table. If a large amount of data needs to be scanned, the response time is long.

  • When Lindorm executes the select * from dt where c5 = 'c' statement, Lindorm selects the idx4 index to run this query. This query does not need to scan the base table because the idx4 index contains all the columns in the base table.
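
Based on the prefix-matching and covering-index behavior described above, the following illustrative query (not part of the original examples) would be expected to hit idx3 and avoid scanning the base table, because c3 is the index key of idx3 and the selected columns c1 and c4 are included columns of idx3:

-- Illustrative query against the dt table defined above: c3 is the index key of idx3,
-- and c1 and c4 are included columns, so no scan of the base table should be required.
select c1, c4 from dt where c3 = 'c';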

Limits

  • Indexes that belong to different base tables can have the same name. For example, you can create an index named Idx1 for both the dt table and the foo table. However, the name of each index that is created for the same base table must be unique.

  • You can create indexes only for base tables that store data of one version. You cannot create indexes for base tables that store data of multiple versions.

  • If a TTL value is specified for a base table, the same TTL value takes effect on the associated index tables. You cannot separately specify a TTL value for each index table.

  • An index can contain a maximum of three index key columns.

  • For an index table, the total length of the index key columns and the primary key of the base table cannot exceed 30 KB. We recommend that you do not use a column whose size is larger than 100 bytes as an index key column.

  • You can create a maximum of five indexes for a base table. If you create more indexes, the storage cost is excessively high, and write operations on the base table require a longer period of time to complete.

  • Each query can hit a maximum of one index. Index-merge queries are not supported.

  • When you create an index for a base table, the system synchronizes the data in the base table to the index. The execution of the CREATE INDEX statement requires a long period of time to complete if the base table contains large amounts of data.

  • Secondary indexes do not support batch increment operations.

  • When a query hits a secondary index, the returned data is sorted by the index key columns. This sort order differs from the sort order of the base table.

  • Secondary indexes can be created only for data written by using SQL statements or API operations. Secondary indexes cannot be created for data that is imported to Lindorm by using the bulkload method.

Support

If you have questions related to secondary indexes, you can join the Lindorm group in DingTalk or submit a ticket. For more information, see Technical support.