400x Faster HTAP Real-time Data Analysis with PolarDB

By Beilou

Preface

Recently, analytic databases have been very popular among the business and technology communities, and many innovative products from start-ups have sprung up. On the one hand, this is because enterprises are increasingly relying on data to bring about growth in demand. On the other hand, the development of cloud-native technology has empowered the evolution and transformation of existing technology systems. The success of products such as Snowflake has proved that it is necessary to use cloud native to reengineer the technology system of analytic databases. This may also bring about new market opportunities.

PolarDB MySQL is a database system that came into being due to the cloud. In addition to on-cloud OLTP scenarios, many customers also have requirements for real-time data analysis performance of PolarDB. To meet this demand, the PolarDB technical team proposed an In-Memory Column Index (IMCI) technical solution to achieve hundreds of times of acceleration in complex analysis and query scenarios.

This article introduces the idea of the technical route behind IMCI and the hard choice we've made between several schemes. At the time of writing, the column storage analysis of PolarDB MySQL is yet to be launched publicly, so stay tuned!

1. HTAP Database Solution of MySQL Ecosystem

MySQL is an open-source database designed mainly for OLTP scenarios. The research and development in the open-source community focus on strengthening its transaction processing capabilities, including single-core performance, multi-core extensibility, and cluster capabilities, to improve availability. As for the capabilities required to process complex queries under large data volumes, such as processing subqueries by the optimizer, high-performance operator HashJoin, and SQL parallel execution, the technology communities have always placed them on a relatively low priority. Therefore, the improvement of MySQL data analysis capability is slow.

As MySQL becomes the most popular open-source database system in the world, users store a large amount of data and run key business logic in it. Real-time data analysis has become a growing demand. When the standalone MySQL cannot meet the requirements, users are expecting a better solution.

1.1 Building Block Scheme of MySQL + Dedicated AP Database

There are many dedicated analytical database products to choose from. One feasible option is to use two systems to meet the OLTP and OLAP requirements, respectively, and to synchronize data in real time through data synchronization tools between the two systems. Users can even add a layer of proxy to automatically route TP load to MySQL and analytic load to OLAP database. By doing so, they can mask the application layer from the deployment topology of the underlying database.

Such an architecture is flexible. For example, both TP database and AP database can choose the best scheme, respectively. This architecture has also completed the isolation of TP and AP loads. Its shortcomings are also obvious. First, it is technically necessary to maintain the two sets of database systems. Second, due to the differences in the processing mechanisms of the two systems, it is also very challenging to ensure real-time consistency of upstream and downstream data. Moreover, due to synchronization latency, data stored in downstream AP systems are often outdated, which fails to meet the requirement of real-time analysis.

1.2 Divergent Design Method Based on Multiple Replicas

Many of the emerging database products that have emerged with the rise of the Internet are compatible with the MySQL protocol. So they are alternatives to MySQL. Most of these distributed database products adopt the distributed Share Nothing solution. One of the core features is using the distributed consistency protocol to ensure data consistency between multiple replicas of a single partition. Since one copy of data is completely independent among multiple replicas, it is easy to use different formats for storage on different replicas to serve different query loads. A typical example is TiDB. Starting from version 4.0, TiDB uses column storage (TiFlash) on one of the replicas in a Raft Group to respond to AP loads. TiDB also uses its intelligent routing capability to automatically select data sources. This allows that one database system can serve both OLTP and OLAP loads.

This practice has been applied in many research and industry fields and has increasingly become a de facto standard scheme for integrated HTAP in distributed data fields. However, the premise is that users have the requirement to migrate to the corresponding NewSQL database system. This often brings various compatibility and adaptation issues.

1.3 Integrated Row-Column Hybrid Storage Solution

The solution that is more advanced than the multi-replica divergent design is that one database instance uses row-column hybrid storage to respond to both TP and AP loads. This is a solution adopted by all traditional commercial databases such as Oracle, SQL Server, and DB2.

Oracle Corporation released the Database In-Memory suite in Oracle Database 12c released in 2013. Its core capability is In-Memory Column Store. This improves OLAP performance by providing technologies such as row-column hybrid storage and advanced query optimization (such as materialized expression and JoinGroup).
On SQL Server 2016 SP1, Microsoft began to provide the feature of Column Store Index. Users can flexibly use various modes such as row storage table, column storage table, row-column hybrid storage table, and column storage table + row storage index according to load characteristics.
In Kepler 10.5 released in 2013, IBM added the DB2 BLU Acceleration component. It greatly improved the performance in analysis scenarios by using column storage, in-memory computing, and DataSkipping technology.

Three leading commercial database enterprises have adopted the technical route of row-column hybrid storage combined with in-memory computing at the same time. This has its underlying technical logic: column storage has better I/O efficiency (compression, DataSkipping, column clipping) and CPU computing efficiency (cache-friendly). Therefore, to achieve the highest analysis performance, column storage must be used. However, the index accuracy issue caused by index sparseness of column storage determines that it cannot become the storage format for TP scenarios. Thus, row-column hybrid storage becomes a required solution. However, in the row-column hybrid storage architecture, there is a performance gap between row storage indexes and column storage indexes when handling random updates. The low read-write latency of DRAM must be used to compensate for the low efficiency of column storage updates. Therefore, with the capabilities of low-latency OLTP and high-performance real-time data analysis, hybrid row-column storage combined with in-memory computing is the only choice.

Let's again take a look at the above three methods. From building block scheme to the divergent design method and the integrated row-column hybrid storage, the integration level is getting higher, and the user experience is getting better. However, the challenges in implementing the kernels are also increasing. The role of basic software is to leave the complexity to itself and simplicity to users, so an integrated approach is in line with technological trends.

2. The Evolution of AP Capability of PolarDB MySQL

The capability stack of PolarDB MySQL is similar to that of open-source MySQL. It is good at handling TP scenarios but weak at AP scenarios. PolarDB provides a storage capacity that one instance can store up to 100 TB of data, and its transaction processing capability far exceeds user-created MySQL. Therefore, PolarDB users tend to store more data on a single instance and run some complex aggregate queries on the data. With the write-once-read-many architecture of PolarDB, users can add read-only RO nodes to run complex read-only queries. This avoids the interference of analytic queries on the TP load.

2.1 Defects of MySQL Architecture in AP Scenarios

There are many reasons for the poor performance of MySQL implementation architecture when executing complex queries. Compared with the dedicated OLAP system, its performance bottleneck is reflected in the following aspects:

The SQL execution engine of MySQL is implemented based on the volcano iterator model. This architecture highly depends on a large number of deep-level function nesting and virtual function calls. When processing massive data, this architecture affects the efficiency of modern CPU pipelines, which leads to low CPU cache efficiency. Besides, the iterator execution model cannot give full play to the SIMD instructions provided by modern CPUs to accelerate execution.
Execution engines can only execute serially and cannot give full play to the parallel processing capability of modern multi-core CPUs. Starting from MySQL 8.0, parallel execution capability has been added to some basic queries such as count(*). However, the parallel execution capability of complex SQL still has a long way to go.
The most frequently used storage engines of MySQL adopt row storage. When analyzing massive data by column, serious I/O bandwidth waste occurs when reading data from disks by row. In addition, the row storage format copies a large amount of unnecessary column data when processing massive data. This also has an adverse impact on the read and write efficiency of memory.

2.2 Parallel Query of PolarDB Breaks through CPU Bottleneck

The Parallel Query framework developed by the PolarDB team can automatically start parallel execution when the amount of queried data reaches a certain threshold. The data is distributed to different threads at the storage layer for parallel computing, and the results are aggregated to the main thread. At last, the main thread summarizes and returns the results to users. This framework has improved query efficiency.

The parallel query feature allows PolarDB to break through the limitation of single-core execution performance. By taking advantage of the parallel processing capability of multi-core CPUs, the time consumption of some SQL queries on PolarDB decreases exponentially.

2.3 Why We Need Column Storage

The parallel execution framework breaks through the limitation of CPU extension capability and brings significant performance improvement. However, limited by the efficiency of row storage and row executor, the single-core execution performance has a ceiling. The peak performance still falls behind that of a dedicated OLAP system. To further improve the analysis performance of PolarDB MySQL, we need to introduce column storage:

In analysis scenarios, it is often necessary to access a large number of records in a column. By storing data in different columns, column storage avoids reading unnecessary columns. Besides, column storage saves columns with the same attributes continuously. Its compression efficiency is much higher than that of row storage, usually more than 10 times higher. Moreover, combined with rough index information such as MIN/MAX, the structure of large block storage of column storage can realize a wide range of data filtering. All these practices greatly improve the I/O efficiency. Under the current storage/compute separation architecture, the reduction in the amount of data read through the network can bring an immediate improvement in the response time of query processing.
Column storage can also improve the execution efficiency of CPU when processing data. First, the compact arrangement of column storage can improve the efficiency of CPU accessing memory and reduce the execution pause caused by L1/L2 cache miss. Second, SIMD technology can be applied in column storage to further improve single-core throughput. This is the common technical route for modern high-performance analysis execution engines (such as Oracle, SQL Server, and ClickHouse).

3. PolarDB In-Memory Column Index

In-Memory Column Index brings column storage and in-memory computing capabilities to PolarDB. This allows users to simultaneously run TP and AP hybrid loads on a set of PolarDB databases. This also ensures the current excellent OLTP performance of PolarDB and greatly improves the performance of PolarDB in running complex queries on large data volumes.

In-Memory Column Index uses row-column hybrid storage and combines the shared storage-based write-once-read-many architecture of PolarDB. It includes the following key technical innovations:

Support for columnar indexes is added to the storage engine (InnoDB) of PolarDB. Users can use DDL to create all or some columns of a table as columnar indexes. Columnar index use column compression storage, and its storage space consumption is much smaller than that of row storage. Columnar indexes reside in memory by default to maximize analysis performance, but they also support persistence to the shared storage when the memory capacity is insufficient.
In the SQL executor layer of PolarDB, we have rewritten the column-oriented executor engine framework. It makes full use of the advantages of column storage. For example, the framework is able to access data of the storage layer in a batch of 4,096 rows and use SIMD instructions to improve the throughput of single-core data processing of CPU. All key operators support parallel execution. In column storage, the performance of the new executor is much better than the original row storage executor of MySQL.
There is an optimizer framework that supports row-column hybrid execution. It determines whether to start column execution based on whether the issued SQL can execute covering query on columnar index and whether the functions and operators it depends on can be supported by the column executor. The optimizer estimates the costs of both the row-storage execution plan and the column-storage execution plan and selects the one with lower cost.
Users can use an RO node in a PolarDB cluster as an analytic node and configure column storage index on the RO node. Complex queries run on the column storage index and use the computing power of all available CPUs. Thus, maximum execution performance is obtained without affecting the available memory and CPU resources of the TP load in the cluster.

The combination of several key technologies makes PolarDB a true HTAP database system. Its performance in running complex queries on large amounts of data can equal that of industry-leading commercial database systems such as Oracle and SQL Server.

4. Technical Architecture of In-Memory Column Index

4.1 Row-column Hybrid Optimizer

PolarDB has a set of native row-oriented optimizer components. After the engine layer adds support for the column storage, the optimizer components need to be enhanced. The optimizer needs to be able to determine whether a query should be scheduled to row storage or column storage for execution. We achieve this through a whitelist mechanism and a framework of execution cost calculation. The system ensures acceleration of supported SQL and is compatible with running unsupported SQL.

How to Achieve 100% MySQL Compatibility?

We use a whitelist mechanism to achieve the compatibility goal. The reasons for using the whitelist are as follows: First, considering the limitation of the available system resources (mainly memory resources), columnar indexes are generally not created on all tables. When a query statement needs to use a column that does not exist in the column storage, the statement cannot be executed on the column storage. Second, considering the performance, we have rewritten a set of column-oriented SQL execution engines, involving all physical execution operators and expression computing. The coverage of supported scenarios is inadequate compared with the native row storage of MySQL. When the issued SQL contains some operator fragments or column types that IMCI execution engine cannot support, the execution engine needs to be able to identify, intercept, and switch back to the row storage for execution.

Query Plan Conversion

The purpose of plan conversion is to convert the way of expression AST of the native logical execution plan in MySQL to the logical plan of IMCI. After a logical plan of IMCI is generated, a physical plan is generated after a series of optimizations. Plan conversion is simple and direct. You only need to traverse the execution plan tree and convert AST optimized by MySQL into a tree structure with relation operators as its nodes in IMCI. This resembles a translation process. However, in this process, other operations will be performed. For example, implicit conversion of types is made to be compatible with the flexible type system of MySQL.

Optimizer for Row-Column Hybrid Execution

With two sets of execution engines (row storage and column storage), the optimizer has more choices when selecting the execution plan. It can compare the cost of row-storage execution plan with that of column-storage execution and use the execution plan with the lower cost.

In PolarDB, besides the native row-storage serial execution of MySQL, there is also the parallel query based on row storage that can give full play to multi-core computing power. Therefore, the optimizer actually chooses between row-storage serial execution, row-storage parallel query, and IMCI. In the current iteration phase, the optimizer executes as follows:

Execute the parsing process in SQL and generate logical plans. Then, call the native optimizer of MySQL to perform some optimizations such as join order adjustment. At the same time, the logical execution plan obtained in this stage will be passed to the compilation module of IMCI execution plans. The module tries to generate an execution plan of column storage (here, it may be blocked by the whitelist and falls back to row storage).
PolarDB optimizer calculates the row-oriented execution cost based on the plan of row storage. If this cost exceeds a certain threshold, the optimizer will try to push it down to the IMCI executor for execution with IMCI_Plan.
If IMCI cannot execute this SQL, PolarDB attempts to compile an execution plan of parallel query and execute it. If the execution plan of PQ cannot be generated, it means IMCI and PQ cannot support this SQL, and it falls back to the row storage for execution.

The above strategy is based on such a judgment: in terms of execution performance, row-storage serial execution < row-storge parallel execution < IMCI. In terms of SQL compatibility: IMCI < row-storage parallel execution < row-storage serial execution. However, the actual situation is more complicated. For example, in some cases, parallel Index Join based on row-storage ordered index coverage has a lower cost than Sort Merge join based on column storage. IMCI column-storage execution may be selected under the current strategy.

4.2 Execution Engine for Column Storage

IMCI execution engine is an implementation with column-oriented optimization and is completely independent of the existing MySQL row executor. The purpose of rewriting the executor is to eliminate the two key bottlenecks leading to the low efficiency issue of the existing row-storage execution engine when executing analytic SQL. The bottlenecks are overhead of virtual function access caused by row-based access and inability to execute in parallel.

Operators Supporting BATCH Parallelism

The IMCI executor engine uses the classic volcano model but also uses column storage and vector execution to improve execution performance.

In terms of the volcano model, in the relational algebra corresponding to the syntax tree generated by SQL, each operation will be abstracted into an operator. The execution engine will construct the entire SQL into an operator tree. The query tree will call the Next() interface from top to bottom, and the data will be pulled from bottom to top. The advantage of this method is that its computing model is simple and straightforward. It is achieved by abstracting different physical operators into iterators. Each operator only cares about its own internal logic. This reduces the coupling between operators and makes it easier to write a logically correct execution engine.

In IMCI execution engine, each operator also uses an iterator function to access data. The difference here is that each call to the iterator returns a batch but not a row of data. So we can regard the execution engine as a volcano model that supports batch processing.

The execution capability of serial execution is limited by single-core computing efficiency, memory access latency, I/O latency, and so on. Several key physical operators (such as Scan, Join, and Agg) in the IMCI executor support parallel execution. The physical operators need to support parallelism, and the IMCI optimizer also needs to support the generation of the parallel execution plan. When determining the access mode of a table, the optimizer will decide whether to enable parallel execution according to the amount of data to be accessed. If it is determined to enable parallel execution, it will refer to a series of state data to determine the degree of parallelism. The data include currently available CPU, memory, and I/O resources in the system, task information currently scheduled and queued, statistics, query complexity, and parameters that can be configured by users. Based on these data, a recommended DOP value is calculated for the operators, and one operator will use the same DOP value internally. DOP can also be set by users through Hint.

Vectorization execution solves the problem of single-core execution efficiency, while parallel execution breaks through the single-core computing bottleneck. The combination of the two makes the IMCI execution speed much faster than traditional MySQL row execution.

Acceleration for SIMD Vectorization Computing

In AP scenarios, SQL often contains many computing processes that involve one or more values, operators, or functions. This belongs to the category of expression computing. The evaluation of expressions is a computing-intensive task, so the computing efficiency of expressions is a key factor affecting overall performance.

The traditional expression computing system of MySQL uses row-by-row operation method. It is generally called iterator model implementation. The iterator abstracts the entire table, and the expressions are implemented as a tree structure. Therefore, the implementation code is easy to understand, and the entire process is very clear.

However, this abstraction brings performance loss. This is because, during the iteration of the iterator, the acquisition of each row of data will trigger multiple layers of function calls. The row-by-row acquisition of data brings too much I/O and is not friendly to the cache. MySQL uses a tree iterator model as it is constrained by the access methods of the storage engine. This makes it difficult to optimize complex logical computing.

For column storage, since the data in each column is stored separately and sequentially, the expression computing in a specific column can be performed in batches. For each expression, its unit of input and output is batch. In the batch processing mode, the computing process can be accelerated by using SIMD instructions. The new expression system has two key optimizations.

It makes full use of the advantages of column storage and uses the batch processing model to replace the iterator model. We used SIMD instructions to rewrite the expression kernel implementation of the most common data types. For example, all the basic mathematical operations (+, -, *, /, abs) of all numeric types (int, decimal, double) have corresponding SIMD instruction implementation. With the help of AVX512 instruction set, the performance of single-core operation will be improved by several times.

Its expression implementation is similar to that of Postgres. In SQL compilation and optimization stages, IMCI expressions are stored in a tree structure (similar to that of the existing row iterator model). However, before execution, the expression tree is traversed in a post-order and converted into a one-dimensional array for storage. In subsequent computing, we only need to traverse the one-dimensional array to complete the operation. The computing is more efficient as it eliminated the recursive process in the tree iterator model. Besides, this method provides a concise abstraction of the computing process and separates data from computing, which is naturally suitable for parallel computing.

4.3 Storage Engine That Supports Row-column Hybrid Storage

Transactional applications and analytic applications have completely different requirements for storage engines. The former requires that indexes can be accurately located to each row and support efficient addition, deletion, and modification. The latter requires support for efficient batch scanning and processing. The design requirements for storage engines in these two scenarios are completely different and sometimes even contradictory.

Thus, it is very challenging to design an integrated storage engine that can serve both OLTP and OLAP loads. At present, only several large enterprises that have decades of experience in research and development do well in HTAP storage engines. For example, Oracle has In-Memory Column Store, SQL Server has In-Memory Column Index, and DB2 has BLU. Others, like TiDB, can only meet HTAP requirements by adjusting one replica in a multi-replica cluster to column storage.

The integrated HTAP storage engine generally uses a row-column hybrid storage scheme. In other words, both row storage and column storage exist in the engine at the same time. Row storage serves TP, and column storage serves AP. Compared with the separate deployment of OLTP database and OLAP database to meet business requirements, a single HTAP engine has the following advantages:

Row-stored data and column-stored data are consistent in real time and can meet many high business requirements. All data can be found in analytic queries once written.
HTAP engine has a lower cost. Users can easily specify which columns or even which range of a table is stored in the column storage format for analysis. Full data continues to be stored by row storage.
Management and O&M are convenient. Users do not need to worry about data synchronization and data consistency between the two systems.

PolarDB uses a row-column hybrid storage technology, similar to that of commercial databases such as Oracle and SQL Server. We call it In-Memory Column Index.

When creating a table, you can specify part of the table or some columns to use column storage. You can use the Alter table statement to add column storage attributes to existing tables. Analytic queries automatically use column storage to accelerate queries.
By default, column-stored data is compressed and stored on disks, and In-Memory Columbia Store Area can be used for cache and query acceleration. The traditional row format is still stored in BufferPool for OLTP load to use.
All additions, deletions, and modifications of transactions are reflected in the column storage in real time. This ensures transaction-level data consistency.

It is technically difficult to implement a row-column hybrid storage engine while adding column-storage support to a mature OLTP load-optimized storage engine like InnoDB faces different situations:

The requirements of the OLTP business must be met. Therefore, adding column storage support cannot have too much impact on TP performance. This requires that the maintaining of the column storage must be sufficiently lightweight. If necessary, we need to sacrifice AP performance to ensure TP performance.
The design of column storage does not need to consider the modification of data by transaction concurrency and the unique check of data. These problems have been solved in the row storage system but are very difficult for individual column storage engines such as ClickHouse.
With a proven row storage system, any problem with the column storage system can be dealt with by switching back to the row storage system to respond to query requests.

The effect of the above conditions is mixed and has affected the design of the row-column hybrid storage solution for PolarDB.

Column Storage Reflected in Index

Under the architecture of the plug-in storage engine framework of MySQL, the simplest solution to add column storage support is to implement a separate storage engine, like ColumnStore of Infobright and MariaDB. PolarDB uses the solution of implementing column storage as a secondary index of InnoDB. This is mainly based on the following considerations:

InnoDB natively supports multiple indexes. Insert, Update, and Delete operations apply to the primary index and all secondary indexes at row granularity and guarantee transactions. The solution of implementing column storage as a secondary index can reuse this transaction processing framework.
The column storage implemented as a secondary index can use the same internal format as other row storage indexes do in terms of data encoding format. Memory copy is enough. There is no need to consider information such as charset and collation. This process is completely transparent to the upper executor.
The secondary index operation is very flexible. You can specify the columns in the index when creating a table. You can also use DDL statements to add or delete columns in a secondary index in subsequent operations. For example, users can add the int, float, and Double columns that need to be analyzed to the column index. The text and blob fields that generally only need point queries but take up a lot of space can be kept in the row storage.
The crash recovery process can reuse the Redo transaction log module of InnoDB. It is seamlessly compatible with existing implementations. It also facilitates the physical replication process of PolarDB and supports generating column storage indexes on independent RO nodes or Standby nodes to provide analysis services.
The secondary index has the same lifecycle as the primary table. It is easy for management.

As shown in the above figure, all primary indexes and secondary indexes in PolarDB are implemented as a B+ tree. The column index is an index by definition, but it is a virtual index used to capture the addition, deletion, and modification operations on the covered columns by the index.

For the above table, the primary index contains five columns (C1, C2, C3, C4, C5) of data, and the secondary index contains two columns (C2, C1) of data. In the common secondary index, C2 and C1 are encoded into one row and stored in the B+ tree. The column storage index contains three columns (C2, C3, C4) of data. In actual physical storage, the three columns are split and stored independently. Each column is converted into column storage format according to their writing order.

Another advantage of implementing column storage as a secondary index is that the engineering implementation of the executor is very simple. The concept of covering index already exists in MySQL, that is, all columns required by a query are stored in a secondary index. The data in this secondary index can be directly used to meet the query requirements. Compared with the primary index, the use of a secondary index can greatly reduce the amount of data read and thus improve the query performance. When all columns required for a query are covered by column indexes, the acceleration of column storage can improve query performance by dozens or even hundreds of times.

Organization of Column-Stored Data

For each column in ColumnIndex, its storage uses the unordered and append write format, combined with label deletion and background asynchronous compaction to realize space reclamation. Its specific implementation has the following key points:

Records in a column index are organized according to RowGroup. Different columns in each RowGroup are packaged to form a DataPack.
Each RowGroup uses append write, and the DataPack that belongs to each column also uses append write mode. For a column index, only one active RowGroup is responsible for accepting new writes. When the RowGroup is full, it freezes. All the Datapacks it contains are compressed and saved in the disk, and the statistics of each data block are recorded to facilitate filtering.
Each new row written to the column-stored RowGroup will be assigned a RowID for positioning. All columns belonging to a row can use the RowID to calculate the position. At the same time, the system maintains the mapping index from PK to RowID to support subsequent deletion and modification operations.
Update operations are supported by label deletion. To perform update operations, we first calculate its original position according to RowID and set the deletion label. Then, write the new data in ActiveRowGroup.
When invalid records in a RowGroup exceed a certain threshold, asynchronous compaction operation is triggered in the background. On the one hand, the asynchronous compaction operation is used to reclaim space. On the other hand, it can make effective data storage more compact and improve the efficiency of analytic query orders.

This data organization method meets the requirement of batch scanning and filtering by columns of analytic queries. Besides, the impact on TP transaction operation is very small. The write operation only needs to append write to the memory by column, and the delete operation only needs to set a deletion label. The update operation is a deletion label combined with an append write. Column storage can support transaction-level updates while not reducing the OLTP performance.

Full and Incremental Row-to-Column

The row-to-column operation occurs in two cases. The first case is to use DDL statements to create column indexes for some columns (generally, this is because the business has new analytic requirements for an existing table). You need to scan the data in the entire table to create column indexes. Another case is that, during the transaction operation, row-to-column operations are performed on columns involved in this process in real time.

In the case of the full-table row-to-column, we use parallel scanning to scan primary keys of InnoDB and convert all the columns involved into column storage format in turn. This operation is very fast, only limited by the I/O throughput speed and available CPU resources to the server. This operation is an online-DDL process and does not block the running of online services.

After the column index is established on a table, all updated transactions will update the row-stored and column-stored data simultaneously to ensure the transactional consistency of the two. The following figure demonstrates the difference when IMCI function is disabled and enabled. When the IMCI function is not enabled, updates to all rows by the transaction are locked first. Then, the data page is modified. All locked records are scaled before the transaction is submitted. After the IMCI function is enabled, the transaction system will create an update cache for column storage. When all data pages are modified, the modification operations of the column storage are recorded. Before the transaction submission ends, the update cache will be applied to the column storage system.

Under this implementation, the column storage provides the same level of transaction isolation as the row storage does. For each write operation, each row in RowGroup records the ID of the transaction that modifies the row. For each deletion label operation, the transaction ID of the operation is also recorded. By writing and deleting transaction IDs, AP queries can obtain a globally consistent snapshot in a very lightweight way.

Rough Index of Column Index

As can be seen from the storage format of the preceding columns, all Datapacks in IMCI are unordered and adopt append write method. So, it is impossible to filter out data that does not meet the requirement as accurately as normal ordered indexes of InnoDB. In IMCI, we use statistics to filter data blocks to reduce the unit price of data access.

When each active Datapack finishes writing, it will perform computing in advance. Then, it generates information, including the minimum value, maximum value, the sum of values, the number of null values, and the total number of records, in Datapack. All the information is maintained in the metadata area of DataPacks Meta and resides in memory. Since data deletion operations also exist in the frozen Datapack, the update and maintenance of statistics will be completed in the background.
For query requests, Datapacks are classified into three categories: relevant, irrelevant, and possibly relevant Datapacks based on query conditions. This reduces actual data block access. For some aggregate query operations such as count and sum, we can perform simple operations on pre-computed statistical values. These data blocks do not even need to be decompressed.

The rough index scheme based on statistics is not very friendly for some queries that need to accurately locate some data. However, in a row-column hybrid storage engine, column indexes only need to assist in accelerating queries that involve the scanning of a large amount of data. In this scenario, columns have significant advantages. For SQL that only access a small amount of data, the optimizer usually calculates based on the cost model to get a lower-cost solution based on row store.

Isolation of TP and AP Resources under Row-Column Hybrid Storage

The row-column hybrid storage of PolarDB supports both AP queries and TP queries in one instance. However, many businesses have high OLTP loads, and sudden OLAP loads may interfere with the response latency of TP businesses. Therefore, it is necessary to support load isolation in HTAP databases. With the write-once-read-many architecture of PolarDB, we can easily isolate AP loads and TP loads. Under the technical architecture of PolarDB, we have the following deployment modes:

In the first mode, row-column hybrid storage is enabled on the RW. This mode can support lightweight AP queries, which is suitable when the TP load is mainly used and AP requests are relatively few. You can also use PolarDB to query reports. However, the data comes from the scenario of batch data import.
In the second mode, RW supports OLTP loads and starts an AP-type RO to enable row-column hybrid storage to support queries. In this deployment mode, CPU resources can be 100% isolated, and memory of the AP-type RO node can be 100% allocated to column storage and executor. However, since the same shared storage is used, I/O will be affected. To solve this problem, we will add support for writing column-stored data to external storage such as OSS in the future. This will realize the isolation of I/O resources and improve I/O throughput rate on AP-type RO.
In the third mode, RW/RO supports OLTP loads, and row-column hybrid storage is enabled on a separate standby node to support AP queries. Since the standby node uses an independent shared storage cluster, this mode can also realize the isolation of I/O resources, except for supporting the isolation of CPU and memory resources.

In addition to the above different deployment architectures that support resource isolation, in PolarDB, Automatic Degree of Parallelism (Auto DOP) is supported in some large queries that need to be executed in parallel. This mechanism takes into account the current system load and available CPU and memory resources and limits the resources used by a single query. This aims to avoid a single query consuming too many resources and affecting the processing of other requests.

5. OLAP Performance of PolarDB MySQL IMCI

To verify the effect of IMCI technology, we tested the PolarDB MySQL IMCI in TPC-H scenarios. We compared it with the native execution engine of row storage of MySQL and ClickHouse. ClickHouse has the highest standalone performance among OLAP engines. The test conditions are as follows:

Data volume: TPC-H 100 GB, 22 queries
CPU Intel(R) Xeon(R) CPU E5-2682 2 socket
The memory is 512 GB, and all data is stored in the memory after startup.

5.1 PolarDB IMCI VS Serial MySQL

In the TPC-H scenario, the IMCI processing latency of all 22 queries is dozens to hundreds of times lower than that of native MySQL. As shown in the figure, Q6 is nearly 400 times better. This reflects the great advantage of IMCI.

5.2 PolarDB IMCI VS ClickHouse

The performance of IMCI in TPC-H scenario is basically at the same level as the performance of ClickHouse, currently the hottest analytic database in the technical community. The performance of processing latency for some SQL requests varies. Users can use IMCI to replace ClickHouse, and its data management is more convenient.

Future Work

IMCI is the first step for PolarDB to enter the data analysis market, and its iteration will not stop. In the future, we will dive deep into the following directions to bring a better experience to our customers.

A system of automatic index recommendation. Currently, the creation and deletion of column storage require users to manually specify, which increases the workload of DBA. At present, we are planning to introduce the automatic recommendation feature to automatically create column storage indexes according to the characteristics of users' SQL requests. This will reduce the maintenance burden.
Separate column storage table and OSS storage. IMCI is only an index at present. For purely analytic scenarios, the removal of row storage can further reduce the storage capacity. IMCI executor supports reading from and writing to OSS. This will minimize the storage cost.
Row-column Hybrid execution. Some fragments of an SQL execution plan are executed in row storage while others are executed in column storage. This can maximize the execution acceleration effect.