PolarDB-X global secondary index
Indexes are the basic components of databases. As early as the 1970s, SystemR has supported multi-dimensional queries by adding indexes. In stand-alone databases, indexes are mainly divided into BTree index, Hash index, full-text index, spatial index, etc. according to the purpose and data structure used. Generally, each table contains a primary key index. Indexes other than the primary key index are collectively referred to as secondary indexes.
Distributed databases that use the storage and computing separation and shared nothing architecture have good horizontal scalability. Through data partitions and stateless computing nodes, computing and storage are allowed to expand and shrink independently. A large number of distributed databases use this architecture (Spanner, CockroachDB, YugabyteDB, etc.).
What problems does global indexing solve?
The shared nothing architecture introduces the concept of partition. Data needs to be split according to a fixed partition key, which results in queries containing partition keys being able to quickly locate a specific partition, while other queries require full partition scanning. This situation is similar to that in a stand-alone database, querying by primary key can quickly locate the page where the data is located, while querying by non primary key requires full table scanning.
Unlike stand-alone databases, full partition scanning for distributed databases will not only increase the number of slow queries and reduce system throughput, but also may cause the system to lose its linear scalability. Refer to the example below
Before capacity expansion: two storage nodes (DN) and two data partitions. Assume that the physical QPS that a single DN can carry is 3, and the overall physical QPS is 6. Each query is a full partition scan. Logical QPS: 6/2=3
After capacity expansion, there are three storage nodes and three data partitions. The overall physical QPS is 9. Each query is scanned in full partitions. The logical QPS is 9/3=3. The machine cost has increased by 50%, and the query performance has not improved at all!
The stand-alone database uses a secondary index to avoid full table scanning. Specifically, the secondary index selects non primary key columns as keys, and the value part stores the value of the primary key (it may also be a reference to a row record, and the specific implementation does not affect the solution idea). The query process using the secondary index is: first, locate the page according to the index column of the secondary index, read the value of the primary key, and then return the primary key index to query the whole row of records (this step is called back to the table). In essence, the secondary index avoids full table scanning by redundancy of one data, which belongs to the standard idea of system optimization "space for time"
To eliminate full partition scanning in distributed databases, a similar idea can be adopted, which involves redundant index data, and the index uses a partition key different from the main table. When querying, first locate a partition according to the partition key of the index, then query the partition key and primary key of the primary table from the partition, and get the complete data back to the table. You only need to scan a fixed number of partitions (for example, for point queries, scan at most two partitions).
This index is different from the partition dimension of the primary table. We call it the Global Secondary Index (GSI), which is also often referred to as the global index. The corresponding index with the same partition dimension of the primary table is called the Local Secondary Index (LSI)
Why do I need a global index?
It has been said that full partition scanning will cause the system to be non extensible. If the user can strictly ensure that all SQL contains partition keys, does the global index not need to be used?
Yes, this situation is really unnecessary, but the complexity of the reality determines that this is a small probability event. The more common scenario is:
● The user table needs to support users to log in according to their mobile phone numbers and user IDs. Which partition key should be selected?
● In the e-commerce system, you need to query the order according to the buyer ID and the seller ID. How to select the order table partition key?
● The existing business code is written by an outsourcing company. It is unrealistic to modify SQL in a large scale. What should we do?
For more scenario analysis, you can refer to TPCC and transparent distribution. Conclusion: To provide a "transparent distributed" experience similar to a stand-alone database, you must support global indexing.
What kind of global index experience do users want?
Indexing is a very common component in stand-alone databases, which is highly accepted by users. If the global index can achieve a similar use experience to the stand-alone database index, it can be called a "transparent" index use experience. From the perspective of user use, the following four key features that affect the index use experience are listed
It is not easy to meet these four features. The read, write, and schema change processes need to be designed accordingly. Related problems range from distributed transactions, CBO index selection, and how to implement Asynchronous Online Schema Change, to on update current_ How to handle the columns of the timestamp attribute and how the affected rows are compatible with MySQL behavior need to be considered, and high performance needs to be guaranteed.
The following describes PolarDB-X's technical exploration in the process of implementing a global secondary index that is compatible with the MySQL index experience.
Global secondary index implementation
For the global index in the OLTP system, it is first necessary to ensure that the data is strongly consistent with the main table. To solve this problem, distributed transactions, logical multiple writes, and Asynchronous Online Schema Change (AOSC) are required.
Consistency of data writing
When writing data, since the data of the primary table and GSI may be located in different partitions, distributed transactions are required to ensure atomic commit. At the same time, write conflicts need to be handled due to concurrent writes. For tables without a global index, DML statements can be routed to the partition where the data resides, and the DN completes concurrency control. However, for tables containing GSI, the data to be changed needs to be read and locked first when updating data, and then the main table and index are updated according to the primary key. This method of read before write is called logical multi write.
It doesn't sound difficult to read first and then write. It's just SELECT+UPDATE/DELETE, but the actual situation is more complicated than imagined. First of all, the DML implementation of early DRDS completely relies on push down execution, and lacks the corresponding logical plan. There are about 13 types of MySQL DML syntax, each of which needs to be supported. For scenarios that can be pushed down, the push down execution scheme is still retained; Secondly, many details of MySQL are not introduced in the official documents, and they need to be adapted one by one according to the code, such as type conversion, affected_ Rows, implicit default values, etc. In addition, in order to support globally unique indexes, conflict detection processes need to be added, resulting in four additional execution modes of INSERT statements. The above figure shows the execution process of logical multi write. For detailed introduction, refer to the source code interpretation
Data consistency of index creation
The second aspect of ensuring data consistency is to ensure data consistency during index creation. For example, in the picture on the left below, in a distributed scenario, multiple nodes may have a time difference in their perception of metadata. As shown in the figure, a node is known to have an index, so it inserts the index and writes to the main table and index table at the same time. The other node does not know the existence of the index, so it only deletes the contents of the main table, and does not delete the contents of the index table, which results in an additional piece of data on the index table.
In order to solve this problem, PolarDB-X refers to the solution of Google F1 and ensures the smooth transition of metadata by introducing multiple mutually compatible phases. For detailed implementation, refer to this article. At the same time, as the number of metadata version changes increases in the schema change process, we have also optimized the metadata version evolution on a single CN, so that DDL does not affect the read/write execution at all. Please refer to this article for details.
After introducing the above technologies, our entire DDL framework can create global indexes without blocking. It is mentioned that MySQL has supported atomic DDL since version 8.0. PolarDB-X also has its own implementation. See this article for details
Data consistency of index scanning
Because of the concurrency in the data writing process, it is necessary to handle the write conflict. Similarly, because of the concurrent read and write in the data reading process, it is also necessary to handle the read and write conflict. Modern databases basically use MVCC to resolve read/write conflicts. Before the query starts, obtain a version number from the issuer. Use the version number to determine whether the latest version of the data row is visible to the current transaction, so that the read data meets the specified isolation level. PolarDB-X supports TSO based MVCC implementation, which can ensure that the index table and the main table read the same snapshot during the table return process. Refer to this article for MVCC implementation.
The core goal of index selection is that users do not need to manually specify indexes when using GSI. The solution is based on CBO's automatic index selection. The implementation involves how the optimizer evaluates and selects the execution plan that includes index scanning (especially the index scanning on the secondary index, commonly known as IndexScan, IndexSeek, etc., hereinafter collectively referred to as IndexScan). The method of a stand-alone database is to replace TableScan with IndexScan. If the index cannot cover the required columns, then add a step back to the table operation. The optimization of IndexScan mainly involves column clipping and predicate push down. Independent algorithms are used to calculate the costs of IndexScan and back to the table.
A key problem in cost evaluation is how to evaluate the cost of returning to the table. GSI is also a logical table, and returning to the table is equivalent to joining the index table and the primary table on the primary key. Therefore, we have made engineering optimization to adapt the indexing back to the table to the Project+Join operation, so that we can adapt the entire cost evaluation of indexing to the cost evaluation of common query plans.
In order to incorporate plans containing IndexScan into the execution plan enumeration process, index scanning and table returning operators need to be adapted to the existing CBO framework. The specific implementation is shown in the figure above. The implementation plan using GSI is generated through AccessPathRule, and the most appropriate plan is selected by comparing costs in subsequent iterations. Refer to this article for CBO framework. At the same time, because the table return in the distributed database requires network IO, which is more expensive than the table return in the stand-alone database, PolarDB-X also supports such operations as Join/Limit ahead of the table return, and together with the index scan, they are pushed down to the DN for execution, so as to reduce the amount of data returned to the table and reduce network IO. Please refer to this article for details
Overlay index is a special index that allows users to save more column data in the index, so as to meet the requirements of more query statements for reference columns and avoid returning to the table as much as possible. Overriding indexes is a common optimization method in stand-alone databases. For example, Sql Server has long supported optimizing query performance by overwriting indexes.
For distributed databases, the back table may also affect the horizontal scalability of the system. Referring to the example above, the order table is based on buyer_ ID partition, when using the seller_ Full partition scanning is required for ID query. Create a seller_ The GSI on ID is used for optimization. Because the index table only contains partition keys, primary table partition keys and primary keys by default, there is no content column, and it needs to be returned to the table. As the number of orders sold by the seller increases, the back table operation involves more and more partitions, which will eventually become a full partition scan. The goal of avoiding full partition scanning by adding indexes has not been achieved. To avoid this situation, PolarDB-X supports the creation of "overwriting indexes", and adds specified columns to GSI through the COVERING syntax, making it easier for GSI to achieve index coverage.
In addition to the lack of columns, the lack of historical versions may also lead to back to the table. For example, MySQL does not save version information for the secondary index. It only saves the transaction ID of the last write in each page header of the secondary index, which results in back to the table if the historical version needs to be queried. PolarDB-X records the undo log for GSI separately during the writing process. It can read the historical version of the index. It will not generate additional table back operations because of querying the historical version. It also supports the direct issue of Flashback Query to GSI for execution.
Because distributed transactions and logical multiple writes must be used when writing data, there is additional overhead. Write performance needs to be optimized to ensure system throughput. Specifically, distributed transactions rely on two-phase commit to ensure atomicity. Compared with stand-alone transactions, they add the prepare phase and write commit point steps. At the same time, they rely on TSO to obtain the commit timestamp. The throughput of TSO services may also become a bottleneck. For the optimization of distributed transactions, including one-stage commit optimization, single machine multi partition optimization, TSO Grouping, etc., you can refer to the design of distributed transaction implementation and global timestamp service.
Logical multiple writes need to read data to CN first for two reasons. First, PolarDB-X is compatible with MySQL's pessimistic transaction behavior. The write operation uses the current read. For statements that determine the update range based on predicates, such as UPDATE and DELETE, you need to query and lock the data to be modified first to avoid inconsistent snapshots read by different branch transactions. Secondly, for INSERT statements, if there is a unique constraint on the target table, you need to read the data first to detect unique constraint conflicts.
A similar process also exists in a stand-alone database. For example, when MySQL executes DML, the server layer queries and locks the data to be modified from innodb, and then calls ha_ innobase::write_ Row writes data, and MySQL's unique constraint implementation also requires the unique constraint check before INSERT. The difference is that the interaction between the MySQL server layer and the innodb layer occurs in a single machine and only involves memory and disk IO at a lower cost. In the distributed database, CN and DN interact through the network machine at a higher cost.
When PolarDB-X executes DML, it gives priority to push down execution. For scenarios where logical multiple writes must be used, engineering optimization is performed for "querying and locking data to be modified" and "unique constraint conflict detection" respectively
● Parallel read and write: The idea is very simple. The serial execution process of "read cache write" is transformed into a parallel "read" and "write" process of multiple small batches. A key problem to solve is that MySQL transactions are bound to connections. If multiple read connections are created within a transaction, data visibility problems will occur. To solve this problem, we introduced the concept of transaction group, so that multiple connections can share the same ReadView, thereby solving the problem of binding read and write connections within a transaction, so that different batches of reads and writes can be executed in parallel.
● Push down of unique constraint conflict detection: it mainly solves the performance problem in the data import scenario. In order to continue the breakpoint transmission, data import usually uses the INSERT IGNORE statement, but in fact, almost all inserted data are conflict free, so it is not cost-effective to perform conflict detection on each data. Our optimization method is to adopt the idea of optimistic processing, and use the RETURNING statement plus compensation. The performance of INSERT IGNORE is the same as that of INSERT in the data import scenario.
Good DDL compatibility is a must for transparent global indexes. Imagine that if you need to delete the global index referencing this column before modifying the column type every time, and then rebuild the index after the type change is completed, it is a "big head" thing. PolarDB-X is fully compatible with MySQL DDL statements. Statements related to table, column and partition changes will automatically maintain GSI. DDL execution algorithms are divided into Instant DDL and Online DDL. Instant DDL is mainly used for adding columns. Online DDL is based on AOSC and has detailed design for different DDL statements. The following is a brief introduction to the implementation of representative ADD COLUMN and CHANGE COLUMN
PolarDB-X supports the global clustered secondary index (CSI), which is characterized by maintaining the same structure as the primary table all the time to ensure that all queries do not need to return to the table. Therefore, when adding columns to the main table, you need to add a column to the CSI. In general, following the AOSC process can ensure that the index data is consistent in the process of adding columns. However, if the new column contains ON UPDATE CURRENT_ TIMESTAMP attribute may cause problems. For example, in the following scenario, the physical table is added with columns, but CN does not know the existence of the new columns, so the DN fills in the values in the main table and index table independently, resulting in inconsistent data. To solve this problem, after all CNs perceive the metadata update, we will use the backfill process to refresh the value of the new column on the index again to ensure that the index is consistent with the data in the main table.
Changing the column type is the most complex operation in DDL, which has a great impact on writing. For example, MySQL 8.0 still needs to lock the table when changing the column type. PolarDB-X supports Online Modify Column (OMC). Through the method of "adding columns - copying data - modifying metadata mapping", in combination with Instant Add Column, it supports GSI unlocked table column type changes.
The above figure shows the process of executing CHANGE COLUMN on a table without GSI. It is divided into seven stages. First, add a COL that is invisible to users_ B. Populate COL with the same value during write_ A and COL_ B. Then use COL for the existing data in the table_ The value of A is backfilled to COL_ B. Finally exchange COL_ A and COL_ B's metadata mapping, and delete COL_ A. Complete the column type change. The same process is used for scenarios where GSI exists, except that the GSI is treated the same in each step.
DDL compatibility uses the same underlying technology as GSI creation (AOSC, data backfill, logical multi write, asynchronous DDL engine, etc, For this reason, we have implemented special type conversion logic to simulate the behavior of ALTER TABLE in UPDATE. In general, DDL compatibility seems to only support some syntax, but the workload is actually huge.
The impact of global indexes on read and write performance has a lot to do with specific business scenarios. In essence, it is to sacrifice some of the write performance in exchange for a significant improvement in read performance. The following figure shows the impact of GSI on read and write throughput in the Sysbench scenario.
Distributed databases based on storage and computing separation and shared nothing architecture need to support global secondary indexing to eliminate full partition scanning and ensure linear scalability. The stand-alone database has introduced the secondary index for a long time, and the user experience is highly acceptable. A good global index experience should be similar to that of the stand-alone database. It needs to ensure strong consistency of data, support the creation through DDL statements, and support automatic index selection. At the same time, the existence of the index should not hinder the execution of other DDL statements. PolarDB-X ensures strong consistency of index data through distributed transactions and logical multiple writes; Support online schema change, and the index creation process can be parallel to the writing process; Support overwriting indexes, and solve the problem of full partition scanning caused by table returning; Refined the compatibility of DDL statements containing GSI tables. Provide users with a "transparent" index use experience to reduce the impact of distributed databases.
Knowledge Base Team
Knowledge Base Team
Knowledge Base Team
Knowledge Base Team
Explore More Special Offers
50,000 email package starts as low as USD 1.99, 120 short messages start at only USD 1.00