TopK operator implementation in IMCI - PolarDB - Alibaba Cloud Documentation Center

It is a classic problem to obtain the top K results from a large amount of data. The deep paging problem derived therefrom is especially difficult, which poses a great challenge to analytical databases. This topic describes how the in-memory column index (IMCI) feature provided by PolarDB for MySQL rises to this challenge.

Background information

In business systems, a common scenario exists in which users want to filter records to find a specific set of records, sort the results based on a condition, and then display the results by page. For example, users may want to filter products that are for sale to find the products that are sold by a specific merchant, sort the results based on the sales volume, and then display the results by page. The preceding scenario is usually implemented in databases by using a query statement such as ORDER BY column LIMIT n, m.

If 100 records are displayed per page in a business system, you can execute the following statements to display the records on a specific page:

Execute the ORDER BY column LIMIT 0, 100 statement to display the records on page 1.
Execute the ORDER BY column LIMIT 1000000, 100 statement to display the records on page 10,001.

In the case an index does not exist, such queries are implemented by using the classic heap-based TopK algorithm in databases based on the following logic: The system maintains a heap of size K in the memory. The top element in the heap is the smallest one. The system compares the traversed elements with the top element. If an element is larger than the top element, the top element is replaced with the new element and the heap is rebuilt. After all elements are traversed, the elements in the heap are the first K largest elements. If the page number of the page whose results you want to display is small, such as page 1 in the preceding example, K is relatively small. In this case, it is efficient to use the heap-based TopK algorithm in queries.

However, deep paging queries also exist in business scenarios, in which you want to display the records on a page whose page number is very large, such as page 10,001. In this case, K is very large, and the memory may be insufficient to cache a heap of size K. Therefore, the query results cannot be obtained by using the heap-based TopK algorithm. Even if the memory is sufficient, the efficiency of the heap-based TopK algorithm decreases when the size of the heap is very large due to random memory access during heap maintenance, and the final performance is not satisfying.

Initially, the IMCI feature of PolarDB for MySQL uses the preceding method to implement paged queries. If the memory is insufficient to cache a heap of size K, the system sorts all records in a table and caches the records in the specified range. As a result, the performance of an IMCI in deep paging queries is not as good as expected. To address this issue, PolarDB for MySQL redesigns the Sort/TopK operator of IMCIs based on analysis of the characteristics of deep paging scenarios and the issues of traditional solutions. In a test scenario, the redesigned Sort/TopK operator significantly improves the performance of IMCIs in deep paging scenarios.

Existing solutions to TopK queries

How to implement TopK queries is a classic problem in the industry. Many solutions are available to efficiently implement TopK queries. The core of these solutions is to reduce operations on non-result set data. The following three solutions are the main ones in practice.

TopK algorithm based on a priority queue

For more information, see the Background information section of this topic.

Truncation based on an offset and a limit during merge sort

If the memory is insufficient to cache a priority queue of size K, some databases, such as PolarDB, ClickHouse, SQL Server, and DuckDB, use merge sort to handle this issue. The purpose of a TopK algorithm is only to find the records in the range of [offset, offset + limit). Therefore, the system does not need to sort all the data during each merge sort. The system needs to only generate a new sorted run with a length of offset + limit. The truncation during the merge can ensure the accuracy of the results and reduce operations on non-result set data.

Self-sharpening input filter

This solution was first proposed in a paper of Goetz Graefe. ApsaraDB for ClickHouse adopts this solution. During a TopK query that is based on this solution, the system maintains a cutoff value and excludes records that are larger than the cutoff value from the result set of the TopK query. To generate a new sorted run, the system uses the current cutoff value to filter data. After a new sorted run is generated, if K is less than the length of the new sorted run, the current cutoff value is replaced with the Kth record in the new sorted run. Data in the new sorted run is filtered by using the old cutoff value. Therefore, the new cutoff value is always less than or equal to the old cutoff value. This way, the cutoff value is continuously self-sharpened. At last, the system needs to only merge the filtered sorted runs to obtain the result set.

The following example illustrates the preceding algorithm: In the current TopK query, the value of K is 3. After the first batch of data is read, the first sorted run is (1, 2, 10, 15, 21), and the cutoff value is updated to 10. Next, the cutoff value 10 is used to filter the second batch of data. The second sorted run is (2, 3, 5, 6, 8), and the cutoff value is updated to 5. Then, the cutoff value 5 is used to filter the third batch of data. The third sorted run is (1, 2, 3, 3, 3), and the cutoff value is updated to 3. The subsequent sorted runs and cutoff values are obtained in the same way. The continuously sharpened cutoff value is used to filter more data.

If K is greater than the length of a sorted run, the system accumulates enough sorted runs with more than K records and premerges the accumulated sorted runs to obtain a cutoff value. Next, the system uses the cutoff value to filter data and accumulate enough sorted runs. Then, the system premerges these sorted runs to obtain a smaller cutoff value, and so forth. The entire process is similar to the case in which K is less than the length of a sorted run. The difference lies in that enough sorted runs are premerged to obtain a cutoff value.

Issue analysis

Deep paging queries are special TopK queries. In a deep paging query, K is very large, but the actual result set is small. For example, in the ORDER BY column LIMIT 1000000, 100 statement, K is 1,000,100, but the actual result set contains only 100 records. This brings the following challenges to the solutions described in the previous section:

If the memory is sufficient and the system uses the priority queue-based TopK algorithm, a priority queue of a very large size needs to be maintained. Random memory access of queue operations decreases the memory access efficiency, which deteriorates the actual performance of the algorithm.
If the memory is insufficient and the system performs truncation based on an offset and a limit in merge sort, truncation may not be able to be performed during the early stage of the merge sort because the length of a sorted run may be less than offset + limit. As a result, all data is sorted and the effect of the truncation is affected.

Important

In this topic, sufficient memory indicates that the data structure used to manage at least K records in an algorithm can be cached in the memory, not that the input data of a TopK query can be cached in the memory. In fact, in the scenarios described in this topic, the size of the input data of a TopK query is far more larger than the size of the memory used to execute the query.

In addition, from the perspective of system design, the following items must be considered when you design a solution to deep paging:

Are different solutions required to implement queries in deep paging and shallow paging scenarios? If different solutions are used in the two scenarios, how does the system determine the boundary between deep paging and shallow paging?
How does the system adaptively select the memory or disk algorithm based on the size of the memory?

Solution design

Overall design

Combining the research and analysis described in the preceding section, PolarDB for MySQL redesigns the Sort/TopK operator of IMCIs to resolve the deep paging issue based on the existing solutions.

If the memory is sufficient, the following memory algorithm is used:
- The self-sharpening input filter solution is adopted to prevent issues of low memory access efficiency.
- On this basis, single instruction multiple data (SIMD) instructions are used to improve the filtering efficiency.
- This algorithm is used in both deep paging and shallow paging scenarios without the need to determine the boundary between deep paging and shallow paging.
If the memory is insufficient, the following disk algorithm is used:
- Truncation is performed based on an offset and a limit during merge sort.
- On this basis, a ZoneMap is used in the early stage of the merge sort for pruning. This reduces operations on non-result set data as much as possible.
The memory or disk algorithm is dynamically selected. Instead of relying on a fixed threshold, the algorithm is dynamically selected during execution based on the amount of available memory.

The solutions of self-sharpening input filter and truncation based on an offset and a limit during merge sort have been described in the preceding section. The following sections describe why the two solutions are selected. Also, the following sections describe how SIMD instructions are used to improve data filtering efficiency, how a ZoneMap is used for pruning, and how the memory or disk algorithm is dynamically selected.

SIMD-accelerated self-sharpening input filter

If the memory is sufficient, the self-sharpening input filter solution is directly used for two main reasons:

In the self-sharpening input filter solution, no matter whether the system uses a cutoff value to filter data or premerges sorted runs, the memory access is sequential. This prevents the low memory access efficiency of a priority queue.
The solution provides excellent query performance both in deep paging and shallow paging scenarios. You do not need to consider the boundary between deep paging and shallow paging.

In fact, the self-sharpening input filter solution is similar to the priority queue-based algorithm to some extent. The cutoff value is similar to the top element in the heap, both of which are used to filter the read data. The difference is that the priority queue-based algorithm updates the top element in the heap in real time, whereas the self-sharpening input filter solution accumulates data in sorted runs and updates the cutoff value by using a batch of data.

Using the cutoff value to filter data is an important process in the self-sharpening input filter solution. Data comparison is involved in this process, in which the operations are simple and repetitive but are frequently performed. Therefore, SIMD instructions are used to accelerate this process. The cutoff value filter is similar to the predicate filter used in table scan. Therefore, the expression that processes predicate can be directly reused to improve the filtering efficiency and reduce the TopK calculation time.

ZoneMap-based pruning

If the memory is insufficient, merge sort is used and truncation is performed based on an offset and a limit because of the following reasons:

If the system continues to use the self-sharpening input filter solution when the memory is insufficient, the system needs to save the accumulated sorted runs to disks, and uses an external sort algorithm during premerge, which leads to a large number of disk read and write operations. Compared with the self-sharpening input filter solution used in the scenario in which the memory is sufficient, this causes additional overhead. If K is very large, the external sort used during premerge may involve a large amount of non-result set data. This is because a TopK query needs to only obtain the records in the range of [offset, offset + limit) and the records in the range of [0, offset) are irrelevant.
In this case, merge sort can be used. When sorted runs are generated, the system saves only the sorted runs to disks and then uses statistics for prunning. This prevents unnecessary disk read and write operations and also reduces operations on non-result set data as much as possible.

The following figure shows how statistics are used for prunning. In the figure, the arrow indicates the number axis and rectangles indicate sorted runs. The left and right positions of a rectangle in the axis indicate the minimum and maximum values of the sorted run. The barrier indicates the threshold for pruning. ARRIER

A barrier can divide all sorted runs into three types:
- Type A: the sorted runs whose minimum and maximum values are both less than the barrier. Examples: Run1 and Run2.
- Type B: the sorted runs whose minimum value is less than the barrier but the maximum value is greater than the barrier. Example: Run 3.
- Type C: the sorted runs whose minimum and maximum values are both greater than the barrier. Examples: Run4 and Run5.
For a barrier, if the total number of records in all sorted runs of Type A and Type B is less than the offset value of a TopK query, the records in sorted runs of Type A must be within the range of [0, offset). In this case, sorted runs of Type A can be excluded from the subsequent merges.
For a barrier, if the total number of records in all sorted runs of Type A is greater than the sum of offset + limit of a TopK query, the records in sorted runs of Type C must be within the range of [offset + limit, N). In this case, sorted runs of Type C can be excluded from the subsequent merges.

Based on the preceding rules, statistics are used for pruning in the following process:

Build a ZoneMap that contains the minimum and maximum values of sorted runs.
In the ZoneMap, find Barrier 1 that is as large as possible and meets the following condition: The number of records in sorted runs of Type A and Type B is less than the offset of the TopK query.
In the ZoneMap, find Barrier 2 that is as small as possible and meets the following condition: The number of records in sorted runs of Type A is greater than the sum of offset + limit of the TopK query.
Use Barrier1 and Barrier2 to prune related sorted runs.

Dynamic memory or disk algorithm selection

The memory algorithm is different from the disk algorithm. You may want to use a specific threshold as the basis to select the memory or disk algorithm. For example, you can select the memory algorithm if K is less than the threshold and otherwise select the disk algorithm. In this case, if the size of the memory changes, you must specify another threshold. This causes overhead of manual intervention.

To address this issue, PolarDB for MySQL uses a simple fallback mechanism to dynamically select the algorithm to use based on the size of available memory during the execution. The algorithm is selected based on the following process:

No matter how much memory is available, try to use the memory algorithm for the TopK calculation at first.
During the execution of the memory algorithm, use the memory algorithm to complete the entire calculation if the memory is always sufficient.
During the execution of the memory algorithm, execute the fallback mechanism if the memory is insufficient. For example, the available memory may be insufficient to cache enough sorted runs that contain more than K records, or the available memory is insufficient to complete the premerge.
Fallback mechanism: Collect the minimum and maximum values of sorted runs accumulated in the memory so that a ZoneMap can be used to prune the sorted runs, and then save these sorted runs to disks. These sorted runs are used in the execution of the disk algorithm.
After the fallback mechanism is executed, use the disk algorithm to complete the entire calculation.

The same data structure is used in the memory and disk algorithms. Therefore, the fallback mechanism does not need to reorganize data. The overhead is small. In addition, the memory algorithm filters out only non-result set data. No accuracy issues exist if the sorted runs accumulated in the memory algorithm are directly used in the execution of the disk algorithm.

Others

Late materialization

Late materialization is an optimization in engineering implementation. When a sorted run is generated, only row IDs and the expressions or columns related to ORDER BY are materialized. After the result set of a TopK query is calculated, the output columns are obtained from the storage based on the row IDs in the result set. Compared with materializing the output columns when sorted runs are generated, late materialization has the following two benefits:

Materialized row IDs occupy less space. If the available memory is the same, you can use the memory algorithm to process more data.
The order of data needs to be adjusted during the TopK calculation, which involves the Copy and Swap operations on data. If all output columns are materialized when sorted runs are generated, the Copy or Swap operations for each record must be performed on each column, which causes large overhead. However, if only the row IDs are materialized, the cost of Copy and Swap operations can be reduced.

The disadvantage of late materialization is that some random I/O operations may be required to obtain the output columns from the storage based on the row IDs. However, analysis shows that in deep paging scenarios, although K is very large, the actual result set is small. Therefore, the overhead of the random I/O operations caused by late materialization is small.

Computing pushdown

During the execution of the self-sharpening input filter algorithm, the system pushes the continuously updated cutoff value down to the table scan operator. The cutoff value can serve as a new predicate in SQL statements. When the table scan operator obtains data, the operator can reuse a pruner to filter data in a pack or row group based on the new predicate.

Computing pushdown can improve the performance of TopK queries in two ways:

Reduce I/O operations: During table scan, the system does not read the pack or row group that contains only non-result set data.
Reduce computing: The data in the filtered pack or row group is no longer used for subsequent computing of the upper-layer operators.

Test results

A simple verification of the solution is performed on a TPC-H dataset of 100 GB in size.

select
    l_orderkey,
    sum(l_quantity)
from
    lineitem
group by
    l_orderkey
order by
    sum(l_quantity) desc
limit
    1000000, 100;

The following table describes the test results.

PolarDB IMCI	ClickHouse	MySQL
7.72 sec	23.07 sec	353.15 sec