About Database Kernel | Accelerate HTAP Long Tail Requests and Analyze PolarDB IMCI Row-Column Fusion

By Nanlong Yu (Gushen)

1. Background

Mixed transaction processing (OLTP) and analytical processing (OLAP) workloads are becoming common in today's business systems. Due to the requirements of real-time and easy operation and maintenance, some business systems use the HTAP database to replace the original OLTP-ETL-OLAP architecture. The HTAP database can simultaneously process TP requests and AP requests efficiently in a database system.

However, since the access mode of TP requests and AP requests are different, efficient processing of the two requests depends on different storage formats and computing modes. Therefore, one type of HTAP database would store data in two different formats and choose different modes to compute the corresponding data when processing different requests.

PolarDB (the cloud-native database of Alibaba Cloud ApsaraDB) uses this method to process mixed workloads. Read-only nodes (RO) that enable the In-Memory Column Index index (IMCI) function to maintain an additional column index and use the vectorized method to compute column data when processing AP requests (column computing). However, when processing TP requests, MySQL's original one-tuple-at-a-time method is still used to compute the InnoDB data of the row-based index (row-based computing). Both column index and row-based index playback the writes of the read and write nodes (RW) through physical replication links. In this system, the storage, executor, and optimizer that process the two requests are independent of each other. TP requests and AP requests are separated on the execution path. An SQL statement either selects column or row-based computing.

2. Long Tail Request Problem

From the users' workload, we observed that for most of the requests in the mixed load, the row-column separation computing method has been able to execute with the optimal plan, but neither column nor row-based computing is optimal for a small number of requests.

On a large wide table, filter a few columns with the where or join conditions or find top k. Then, output the details of all columns for the filtered rows. In such a query, table scan or join only involves a few columns, and column computing is optimal. However, the project needs to obtain all columns, but the column index generates read amplification. As a result, it is optimal to use InnoDB primary index to query details through row-based computing.
A fragment of the execution plan is two large tables joining (t1 joins t2). The filter ratio of the where condition on t1 is high, and there is an InnoDB index. The join condition on t2 does not have an InnoDB index, and the join condition only involves a few columns. In such a query, it is optimal to use the InnoDB index to query t1 through row-based computing and then use the column computing to scan t2 for hash join. (As such, the data skipping of t1 on the column index is accurate enough, and pure column computing can be used.)
It is still in scenario 2. The difference is that the join condition on t2 includes the InnoDB index. Then, it is optimal to join the index for the fragment of t1 joining t2 through pure row-based computing. However, the remaining fragments in the execution plan may be optimal using column computing.

To handle this part of long-tail requests in mixed loads more efficiently, we designed and implemented a row-column fusion execution architecture based on two different indexes in the optimizer and executor of IMCI. This article describes the design of the basic components (the cost model of the optimizer, the multi-engine access of executors, and the log playback and transaction processing of storage engines) of the row-column fusion execution architecture of PolarDB and the HybridProject operator designed for the first type of the requests above. Through the basic components, we can easily implement HybridIndexScan and HybridIndexJoin to handle the remaining two types of requests mentioned above.

3. Main Challenges

The main challenges of implementing row-column fusion in a row-column separation system come from the following three aspects:

Estimate Optimizer Cost: The cost models of the MySQL optimizer and the IMCI optimizer are different. As a result, if we add the cost of row-based execution fragments computed by the cost model of MySQL to the cost of column-based execution fragments computed by the cost model of IMCI, we cannot get the comparable cost of the entire execution plan. We need to set a unified CPU and I/O cost unit and then compute comparable costs based on different algorithm complexity and different statistics of row-based and column indexes.
Executors Access Different Indexes: An executor must access both InnoDB and column indexes. MySQL executor executes an SQL request in a single-threaded mode and accesses InnoDB-related interfaces and contexts single-threadedly. But the IMCI executor processes an SQL request in a single leader + multi-worker mode. Therefore, accessing InnoDB in workers requires additional processing.
Implement Strong, Consistent Read for Row-Based and Column Indexes: Column index and InnoDB replay the write of RW asynchronously. The redo log playback locations of two different indexes may be inconsistent. InnoDB uses the LSN-based read view to determine the visibility of data. Column indexes use the LSM storage engine to determine the visibility of data. In asynchronous playback, row-column indexes can only achieve eventual consistent read. However, inconsistent visible data in row and column execution fragments of row-column fusion would lead to errors in execution results.

4. Basic Components of Row-Column Fusion

4.1 Cost Model of Optimizer

The three types of long-tail requests in the row-column separation execution architecture above would be considered based on the cost, in which AP requests are executed by IMCI. Therefore, we choose an IMCI-based optimizer and introduce cost estimation for row-based execution fragments so these long-tail requests can choose the row-column fusion execution plan.

When introducing row-based execution cost estimation, we use the following two principles:

For an execution fragment, relatively overestimate the cost of using row-based execution and relatively underestimate that of column-based execution because long-tail requests are originally executed in column-only mode. This cost estimation ensures that when a long-tail request is executed in row-column fusion mode, the execution plan is not worse than the original execution plan.
Only refer to the ratio between the cost constants of the MySQL optimizer. As shown in the preceding figure, the cost constants of the MySQL optimizer and the cost constants of IMCI have different orders of magnitude. Therefore, we cannot use their absolute values but refer to the proportional relationship between them.

Based on the principles above, we adopt the following design:

In the IMCI optimizer, io_block_read_cost of InnoDB equals pack io cost of column index. Although a page in InnoDB is 16KB, and the pack of a column index is 65536 rows, which is larger than 16KB, we set the same I/O cost constant after considering the first principle and the concurrent requests issued by PolarFS and the stripe scattering granularity.
Based on the second principle, memory_block_read_cost and row_evaluate_cost must keep in the same proportion. Based on the first principle and vectorized execution, the row_evaluate_cost must be larger than the CPU constant of IMCI.

With the comparable cost constants above, we can compute the comparable cost of the row-column fusion execution plan based on the respective algorithms and statistics of MySQL and IMCI.

4.2 Multi-Engine Access for Executors

After the IMCI optimizer selects the row-column fusion execution plan for long-tail requests, we introduce a new Hybrid operator in the IMCI executor to compute row-type execution fragments.

The new Hybrid operator needs to access InnoDB in the IMCI executor. We refer to the implementation of the server layer of MySQL and interact with the storage engine through the handler in TABLE objects. From the execution process shown in the preceding figure, although the table has been opened in the preparation stage, the related objects and interfaces are all implemented for MySQL single-threaded execution. Therefore, in the workers of IMCI executor, we need to open the table again and clone or reference related objects. The main difficulty of this part lies in the compatibility with the engineering complexity of MySQL logic.

4.3 Log Playback and Transaction Processing of Storage Engines

The asynchronous playback of two different indexes is shown in yellow in the preceding figure. InnoDB updates the latest read view after the playback, while the column index updates its last commit seq. The playback runs every other redo (including several redo log entries). The commit status of InnoDB transactions and IMCI transactions in a redo is the same. Therefore, after a redo segment is played back on two different indexes, we call the read view and commit seq above corresponding because the data visibility based on the updated latest read view and last commit seq on the two different indexes is the same. If the query uses the corresponding read view and commit seq to determine data visibility, strong, consistent reads can be achieved.

However, the playback of row and column indexes is asynchronous. Due to the difference in playback speed, the latest read view and the last commit seq may not be corresponding at any time. As such, we need to find the corresponding read view and commit seq updated recently.

To this end, we introduce the process shown in the blue part of the figure above in the asynchronous playback process.

For a redo, InnoDB would add a dummy log object containing the current latest read view to the log queue of the column index after updating the latest read view.
Column index processes log queue in a single-thread order. It first pushes regular log objects to the playback module and computes the commit seq updated after the playback. Then, it processes the dummy log object and sets the commit seq to the corresponding read view.

When a dummy log object is processed, the previous regular log objects may not be played back completely. Therefore, the commit seq set to the read view may not take effect on the column index. After the playback is completed, the read view and the corresponding commit seq can be used.

After this process is introduced, the row-column fusion query will find a read view in which the corresponding commit seq has taken effect in the read views that have not yet been purged to determine data visibility.

5. HybridProject

We used the preceding basic components to introduce HybridProject to handle the first long-tail case. HybridProject uses the primary key to obtain complete rows from the primary index of InnoDB and outputs the rows after computing relevant expressions. In addition to the capabilities provided by the basic components, HybridProject needs its child operators in the execution plan to output the primary key, which can be implemented in the following two ways:

The optimizer supports column pruning. Make the child operator of HybridProject only output the primary key through column pruning.
Force the child operator of the HybridProject to be ComputeScalar, and the rest of the execution fragments adopt delayed materialization. (If the rest fragments adopt direct materialization, there is no need to select the execution plan of row-column fusion.) ComputeScalar obtains the primary key from the column index based on the row ID of the delayed materialized fragment.

As the IMCI optimizer does not support column pruning, we adopt the second way.

6. Performance Verification

On the OnTime dataset provided by ClickHouse, we verified the effect of the row-column fusion execution architecture and HybridProject.

https://clickhouse.com/docs/en/getting-started/example-datasets/ontime

The table of the OnTime dataset contains 110 columns, which is a typical large wide table. We use the following SQL to simulate the first type of long-tail requests.

select * from ontime order by ArrTime limit 1000;

The three execution plans are all cold-start queries. The results are listed below:

The results show PolarDB can achieve optimal performance for long-tail requests in hybrid workloads using the row-column fusion execution architecture and hybrid operators. The performance is improved by orders of magnitude compared with pure column execution or pure row execution.

7. Summary

The row-column fusion execution architecture can select a better execution plan to handle long-tail requests in mixed workloads. In the future, we will introduce HybridIndexScan and HybridIndexJoin operators based on this architecture to comprehensively improve the performance of long-tail requests.

If you have any questions related to the technology and practice mentioned in this article, please email nanlong.ynl@alibaba-inc.com.

About the Author

Nanlong Yu (Gushen), from Alibaba Cloud Apsara Database - PolarDB New Storage Engine Group, is engaged in the research and development of PolarDB in memory column index (IMCI).

Community

About Database Kernel | Accelerate HTAP Long Tail Requests and Analyze PolarDB IMCI Row-Column Fusion

1. Background

2. Long Tail Request Problem

3. Main Challenges

4. Basic Components of Row-Column Fusion

4.1 Cost Model of Optimizer

4.2 Multi-Engine Access for Executors

4.3 Log Playback and Transaction Processing of Storage Engines

5. HybridProject

6. Performance Verification

7. Summary

About the Author

Read previous post:

Read next post:

ApsaraDB

You may also like

Comments

ApsaraDB

Related Products

PolarDB for PostgreSQL

PolarDB for Xscale

PolarDB for MySQL

Database for FinTech Solution