Community Blog

Breaking Performance Bottlenecks: Deep Dive into PolarDB-X's Columnar Pagination Query Principles

This article introduces how PolarDB‑X accelerates Top‑K and pagination queries using a multi‑level parallel Top‑K engine plus threshold‑based early‑termination of file scans.

By Junqi

Introduction to Pagination Queries

In data processing systems, pagination (Top-K) queries are a common and important operation used to extract the most representative or highest-priority K records from large datasets. Specifically, a Top-K query sorts a data source (for example, a table) by a chosen criterion, such as a numeric value, timestamp, or relevance score, and returns the top K records. Common use cases include:

• Leaderboards (for example, ranking by score or sales)

• Recommendation systems (for example, returning the top K relevant items)

• Real-time queries (for example, the latest K news items or most recent K transactions)

Data for these queries is read from data sources (typically table files), processed through multiple stages (such as filtering, joins, and projection), and finally reduced to the top or bottom K values.

When the data source consists of many files, traditional Top-K approaches usually scan and filter every file completely and then perform time-consuming computations, causing heavy I/O and CPU overhead. This inefficiency is especially pronounced for pagination query scenarios where users page through different segments of the Top-K results.

Existing solutions often process files serially in a single thread, lacking parallel processing and therefore forming Top-K thresholds slowly. They also typically lack a dynamic threshold update mechanism, so they cannot quickly exclude records that will not appear in the final Top-K, which further increases I/O and compute work. PolarDB for Xscale (PolarDB-X) overcomes these limitations by using multi-threaded cooperation, dynamic threshold updates, and optimized file scan policies. These techniques enable early termination of reads and computation, substantially improving pagination efficiency for complex queries.

Syntax of Top-K Queries

In SQL, Top-K queries are typically expressed using the ORDER BY and LIMIT keywords. Any query can append the following clause to the end of the statement to perform a Top-K query:

ORDER BY col1 ASC|DESC, col2 ASC|DESC, ...
LIMIT offset, len

This query sorts the result set by the specified columns (col1, col2, …, colx) in ascending (ASC) or descending (DESC) order, and then returns len rows starting from the offset position.

For example, the SQL statement below filters tables R, S, and T, joins them, and then paginates the final result ordered by R.col1 ASC and R.col2 DESC.

select * from R inner join S on S.A = R.A inner join T on T.B = R.B
where R.condition_1 and S.condition_2 and T.condition_3
order by R.col1 asc , R.col2 desc limit offset, len;

Part 1: Parallel Top-K Implementation

Classical Top-K Algorithm Implementation

Below is the implementation of a classical Top-K algorithm suitable for single-threaded, small-scale scenarios.

Min-heap-based Top-K Algorithm

A min-heap is a binary heap where each node's value is less than or equal to its child nodes' values. The top of the heap (root node) always contains the smallest element in the heap.

A simple Top-K algorithm using a heap can be implemented in the following way:

1.  Initialize a min-heap H of size K.

2.  Scan the dataset A and insert each element A[i] into the heap until H contains K elements. At this point, the top of the heap H[0] is the smallest among them.

3.  Continue scanning the remaining elements in the dataset, compare them with the top element of the heap and perform replacement if needed:

• If the current element is greater than the heap's top element, pop the top element from the heap and insert the current element. The heap will then automatically adjust itself to maintain the heap property.

• Otherwise, skip the current element and proceed to the next.

4.  After the scan completes, the K elements in the min-heap are the dataset's Top-K largest elements.
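The four steps above can be sketched in Python with the standard-library heapq module (a minimal, single-threaded illustration, not PolarDB-X code):

```python
import heapq

def top_k(dataset, k):
    """Top-K largest elements via a min-heap of size k (steps 1-4 above)."""
    heap = []  # min-heap; heap[0] is the smallest of the kept elements
    for x in dataset:
        if len(heap) < k:
            heapq.heappush(heap, x)      # step 2: fill the heap to k elements
        elif x > heap[0]:
            heapq.heapreplace(heap, x)   # step 3: pop the top, insert x
        # else: skip x (step 3, second bullet)
    return sorted(heap, reverse=True)    # step 4: the Top-K largest

top_k([5, 1, 9, 3, 7, 8, 2], 3)  # → [9, 8, 7]
```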

Comparison and Replacement

The core principle behind the algorithm, and specifically Step 3 ("comparison and replacement"), is:

While finding the Top-K elements of a set, for any subset of size ≥ K, an element that is smaller than that subset's Kth largest element cannot belong to the final Top-K set.

In the simple Top-K algorithm above, the heap H contains K elements:

H = {h_1, h_2, ..., h_K}

The Kth largest element in H is the heap top, h_K = min(H).

For each element A[i] in the dataset A, form a new set by combining A[i] with the current heap elements:

S = H ∪ {A[i]}

If A[i] > h_K, then, considering the principle discussed earlier, there exists a new size-K set H' = (H \ {h_K}) ∪ {A[i]} whose Kth largest element is larger than h_K.

Therefore, h_K can be safely discarded because it cannot be in the final Top-K.

Applying to Top-K Queries

When a database executes an SQL statement containing a Top-K clause, such as: ORDER BY col1 ASC|DESC, col2 ASC|DESC,... LIMIT offset, len, the most straightforward approach is:

1.  Solve the Top-K problem with K = offset + len by constructing a min-heap of size K.

2.  Build a comparator for all comparison operations including the heap sort: If the query specifies ascending order (ASC), the comparator must use reverse ordering. If the query specifies descending order (DESC), the comparator can use normal ordering.

Pseudocode of the comparator:

comparator(a, b) =
    (s1 * cmp(a.col1, b.col1) != 0) ? s1 * cmp(a.col1, b.col1) :
    (s2 * cmp(a.col2, b.col2) != 0) ? s2 * cmp(a.col2, b.col2) :
    ...
    (sN * cmp(a.colN, b.colN))

Where:

• cmp(x, y) is the basic comparison function (returns -1 if x < y, 1 if x > y, and 0 if x == y).

• si = -1 when the i-th column is sorted in ASC order.

• si = 1 when the i-th column is sorted in DESC order.

All subsequent comparisons are performed using this comparator.

3.  After obtaining the Top-K results through the min-heap-based algorithm, the final SQL output is produced by selecting len rows starting from the offset-th record in the sorted Top-K set.
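As an illustration, the comparator pseudocode and the final selection can be sketched in Python (the sign convention follows the description above; the function names are ours, not PolarDB-X APIs):

```python
from functools import cmp_to_key

def cmp(x, y):
    """Basic comparison: -1 if x < y, 1 if x > y, 0 if x == y."""
    return (x > y) - (x < y)

def make_comparator(signs):
    """signs[i] is -1 if the i-th ORDER BY column is ASC, 1 if DESC."""
    def comparator(a, b):
        for s, x, y in zip(signs, a, b):
            r = s * cmp(x, y)
            if r != 0:
                return r
        return 0
    return comparator

# ORDER BY col1 ASC, col2 DESC LIMIT 0, 2: keep the Top-K "largest" rows
# under this comparator, then emit them in descending comparator order.
rows = [(2, 'x'), (1, 'a'), (1, 'c')]
key = cmp_to_key(make_comparator([-1, 1]))
sorted(rows, key=key, reverse=True)[:2]  # → [(1, 'c'), (1, 'a')]
```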

Parallel Top-K Optimization in PolarDB-X

PolarDB-X introduces a parallel Top-K processing solution using a three-level Top-K operator structure: L1, L2, and L3. Each level handles Top-K aggregation and optimization at a different stage, reducing computational complexity compared with simple heap-merging methods. Multiple threads collaboratively update the global Kth largest value, accelerating threshold convergence and improving data filtering efficiency. Tight coordination between upstream and downstream operators further enhances query processing. The internal execution flow of each Top-K operator has also been redesigned to include heap-top cleanup and dynamic filtering steps.

(1) DAG Execution Structure of Parallel Top-K

In the computer cluster of a PolarDB distributed database, there are N compute nodes, with one master node acting as the query initiator and coordinator, while also performing worker tasks. The remaining nodes are worker nodes.

When a user initiates a Top-K query, the database instance builds a directed acyclic graph (DAG) for execution, based on the total number of compute nodes N and total parallelism m. Assuming parallel operators are evenly distributed across the nodes, the execution proceeds in the following way:

  1. Create m Scan operators across the N compute nodes, with m/N operators per node, each responsible for fetching data from its respective partition.
  2. Each node's Scan operators output data to m/N downstream first-level Top-K operators (L1 Top-K).
  3. Each node's m/N L1 Top-K operators maintain a local min-heap to compute the Top-K values of their input data.
  4. After computing their Top-K results, the m/N L1 Top-K operators on each node immediately output to a single second-level Top-K operator (L2 Top-K) within the node. Each node has only one L2 Top-K operator.
  5. The L2 Top-K operator on each node computes the node-level Top-K results after receiving all L1 Top-K outputs. These results are then transferred through a downstream Exchange operator to the master node's third-level Top-K operator (L3 Top-K).
  6. The master node has only one L3 Top-K operator, which aggregates the Top-K results from all worker nodes to compute the final Top-K result.
  7. The L3 Top-K operator sends results to the Limit operator. For example, for a query like "SELECT * FROM table ORDER BY xx LIMIT offset, len," K = offset + len, and the Limit operator outputs len rows starting from the offset position.

At all three levels (L1, L2, and L3), each Top-K operator maintains an internal min-heap of size K.
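The three-level aggregation can be illustrated with a minimal single-process Python sketch, where heapq.nlargest stands in for the per-operator min-heaps; parallel scheduling, the Exchange operator, and network transfer are omitted:

```python
import heapq
from itertools import chain

def local_topk(rows, k):            # L1: per-operator Top-K over one partition
    return heapq.nlargest(k, rows)

def node_topk(l1_results, k):       # L2: per-node merge of all L1 outputs
    return heapq.nlargest(k, chain(*l1_results))

def global_topk(l2_results, k):     # L3: master-node merge of all L2 outputs
    return heapq.nlargest(k, chain(*l2_results))

# Two nodes, two L1 operators each (m = 4, N = 2), K = 3
node_a = [local_topk([4, 9, 1], 3), local_topk([7, 2], 3)]
node_b = [local_topk([8, 3], 3), local_topk([6, 5], 3)]
l2 = [node_topk(node_a, 3), node_topk(node_b, 3)]
global_topk(l2, 3)  # → [9, 8, 7]
```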

The diagram below illustrates the parallel scheduling workflow: Scan operators on each node feed data to L1 Top-K operators; L1 results are aggregated to the node-level L2 Top-K operator; L2 results are transferred through the Exchange operator to the master node's L3 Top-K operator, achieving fully parallel Top-K computation.

(Figure 1: parallel scheduling workflow of the three-level Top-K DAG)

• Rationale for designing L1, L2, and L3 Top-K operators

  • L1 Top-K operator: designed to fully exploit parallel computing capabilities, leveraging multi-core performance to optimize overall Top-K efficiency. Detailed design specifics are provided below.
  • L2 Top-K operator: aggregates the results from the L1 Top-K operators within a single node to produce the node-level Top-K results. This step avoids sending all L1 Top-K results across the network to the master node for final aggregation, significantly reducing network traffic and overhead.
  • L3 Top-K operator: consolidates the Top-K results from all nodes to compute the final global Top-K result.

• Performance advantages over single-core Top-K processing

  • For a total data size of n, the time complexity of solving the Top-K problem on a single core is O(n·logK).
  • Within a single node, assuming the total data size is n and there are m threads, each thread processes approximately n/m rows. Each thread maintains its own min-heap of size K to compute its local Top-K. Additionally, one heap is used to merge the Top-K results from all the m threads to produce the final Top-K result. Time complexity analysis:
  1. Each thread processes n/m rows, maintaining a min-heap of size K for Top-K computation. The time complexity of a single heap operation is O(logK). Therefore, the total time complexity per thread is O((n/m)·logK).
  2. Finally, we need to merge m min-heaps of size K each to obtain the global Top-K. Merging the m heaps (each of size K) takes O(m·K·logK).
  3. Therefore, the overall complexity for multi-threaded Top-K computation within a single node is O((n/m)·logK + m·K·logK).

The speedup of multi-core Top-K compared to single-core Top-K can be approximated as:

speedup = n·logK / ((n/m)·logK + m·K·logK)

In our scenario, the total data size n is sufficiently large compared to m·K (that is, n >> m²·K), so the merge term is negligible and the speedup can be approximated as m.

(Figure 2: speedup of multi-core vs. single-core Top-K)

(2) Collaboration among L1 Top-K Operators

This section describes how L1 Top-K operators collaborate both within a node and across nodes to accelerate the convergence toward the true global Kth largest value, reduce heap insertions, and improve the efficiency of Top-K computation.

All parallel operators jointly maintain a shared variable Kth, which represents the currently known global Kth largest value. This value could be a single-column value (val_col) or a vector composed of multiple columns ({val_col1, val_col2, ..., val_colx}). Through the collaboration of different parallel operators, the value of Kth gradually converges toward the true global Kth largest value Kth'. The faster Kth approaches Kth', the higher the data filtering rate, the fewer heap insertions are required, and the better the overall performance.

1. Initial state: At the time of L1 Top-K operator initialization, the total number of global elements is zero, so the Kth largest value (Kth) cannot yet be determined.

2. Element count updates: Within each node, every time a row is written into an L1 Top-K operator, the total number of elements written across all L1 Top-K operators on that node is updated. This update is atomic, thread-safe, and concurrent.

3. Initialization of Kth value: Once the global total number of elements reaches K, each L1 Top-K operator maintains a min-heap of size K. The heap top elements are then collected from all local heaps, and the smallest among them, min_i(H_i[0]), is selected as the current global Kth. This completes the initialization of the Kth value.

  • Rationale of this step: Suppose there are m threads across multiple nodes, each maintaining its own heap. When the total number of elements across all m heaps equals K, the union of these heaps can be considered a complete Top-K candidate set. Let min_i represent the heap top (minimum value) of the i-th thread.
  • Then, the global Kth largest value is the minimum among these heap tops, Kth = min(min_1, min_2, ..., min_m), which corresponds to the Kth largest value in the overall Top-K candidate set.

4. Concurrent Kth updates within a node: After the Kth value is initialized, all parallel operators concurrently and safely update the global Kth. Whenever a value is written into a Top-K operator, the operator compares and potentially updates the global Kth, aiming to raise its value and complete convergence as quickly as possible. This facilitates filtering, reducing unnecessary heap insertions and improving performance. The process works in the following way:

  1. When a Top-K operator on a thread receives an input element a, it compares a with the current global Kth.
  2. If a < Kth, the element a is discarded because it cannot appear in the final Top-K result.
  3. If a >= Kth, the operator updates the global Kth with a, using a thread-safe operation. a is then inserted into the local min-heap of the thread's L1 Top-K operator.

The underlying principle behind this step is that heap insertion takes O(logK) time, whereas comparing with the Kth value takes only O(1). Therefore, maximizing the use of the global Kth for early filtering can significantly reduce costly heap operations. While comparing with the local heap's root (top element) can also reduce insertions, the global Kth tends to be greater than any single thread's local heap root, and it is updated more frequently, thus converging faster. The relationship between a new value a and the global Kth can be summarized as:

Kth ← max(Kth, a)

To ensure thread safety during updates to Kth, the following pseudocode shows the update logic per thread, using a Compare-And-Swap (CAS) mechanism to guarantee atomicity:

Function update_Kth_value(a):
    Loop forever:
        needUpdate = cmp(a, Kth_value) > 0
        If not needUpdate OR cas_compare_and_set_Kth_value(Kth_value, a):
            Break from loop
    End Loop
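A runnable illustration of this update loop follows. Python has no hardware CAS on arbitrary objects, so this sketch emulates compare-and-set with a lock (a real engine would use an atomic primitive such as Java's AtomicReference); the names are ours, not PolarDB-X APIs:

```python
import threading

class AtomicRef:
    """Lock-based stand-in for an atomic reference with compare-and-set."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def get(self):
        return self._value

    def compare_and_set(self, expected, new):
        # Atomically: if the value is still `expected`, replace it with `new`.
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

kth = AtomicRef(float('-inf'))   # shared Kth value (single column here)

def update_kth_value(a):
    # Retry until `a` no longer exceeds Kth, or our CAS wins the race.
    while True:
        current = kth.get()
        if not a > current or kth.compare_and_set(current, a):
            break

update_kth_value(5)
kth.get()  # → 5
update_kth_value(3)              # smaller than Kth: no update
kth.get()  # → 5
```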

5. Cross-node Kth value update: After the Kth values are initialized and continuously updated within each node, each node periodically (usually every few tens of milliseconds) broadcasts its current Kth_i value to other nodes. Each node then computes the global Kth largest value and updates its local Kth accordingly: (N = Total number of nodes).

Kth = max(Kth_1, Kth_2, ..., Kth_N)

This step further accelerates the convergence of the global Kth value across the entire compute cluster.

(3) Collaboration between Upstream and Downstream Operators

This section describes the collaborative workflow between the L1 Top-K, L2 Top-K, L3 Top-K, and Scan operators.

In the parallel scheduling section, the basic execution flow of the DAG is described in detail. In addition to inputting upstream data and outputting data to downstream operators, there are additional collaborative relationships between upstream and downstream operators. The design of these relationships is one of the key focuses of this approach and plays a critical role in improving the overall Top-K query processing performance.

1. Runtime-filter for filtering data in the Scan operator: The PolarDB distributed database query engine supports the runtime-filter mechanism, which is a dynamic expression built by upstream operators and passed to the Scan operator as a predicate condition to filter scanned data. In this approach, if the user's Top-K query uses the format "ORDER BY col1, col2, ..., colx LIMIT offset, len", in conjunction with the Kth value maintained by the L1 Top-K operator, a predicate condition can be formed as (col1, col2, ..., colx) >= Kth, where Kth is dynamically updated and corresponds to a vector of values across x columns: Kth = {val_col1, val_col2, ..., val_colx}. The process works in the following way:

  1. During the creation of the DAG execution structure, the Kth object shared by the L1 Top-K operator is wrapped as a runtime-filter func={(col1, col2, ..., colx) >= Kth} and passed to all Scan operators.
  2. At the beginning of execution, when Kth has not yet been initialized, the runtime-filter is inactive, and the Scan operator ignores the runtime-filter.
  3. Once Kth is initialized and continuously updated, the runtime-filter becomes active. The Scan operator will recognize the activation of the runtime-filter and the dynamic changes in Kth. Each time, the latest Kth value is used to form the predicate condition (col1, col2, ..., colx) >= Kth to filter the data in the Scan operator, thus improving overall performance.

2. L2 and L3 Top-K operators utilize the order of upstream chunk outputs: Within the L2 Top-K and L3 Top-K operators, the upstream operators are L1 Top-K and L2 Top-K, respectively. Their outputs are ordered sequences of {Chunk_ij}, where Chunk represents the basic data unit in the database executor, typically containing 1,024 rows of data. Here, i represents the thread or operator from which the chunk originates, and j represents the index of the chunk within the i-th thread or operator. Although there is no inherent order between Chunk_ij from different threads, within each Chunk_ij, the rows are ordered in descending order based on the sorting columns. Therefore, when L2 and L3 Top-K operators encounter a row in Chunk_ij that is smaller than the current Kth value, all subsequent rows in that chunk will also be smaller than the Kth value and can be discarded directly, improving processing efficiency.

3. L2 Top-K's Kth_parent for filtering upstream L1 Top-K operators: In the L2 Top-K operator, the current Kth largest value Kth_parent is monotonically increasing and visible to the L1 Top-K operator. The L1 Top-K operator can check the value of Kth_parent to help filter input elements, and the principle is the same as the Kth value filtering. Since the input element sequences for L1 Top-K and L2 Top-K may not be in the same order, the convergence speed of Kth_parent and Kth may differ. Therefore, there is a probability that Kth_parent converges faster and reaches a higher value, resulting in better filtering performance.
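The first two mechanisms above (the runtime filter and the sorted-chunk early discard) can be sketched as follows. The names are hypothetical, ASC/DESC direction handling is omitted, and rows are assumed normalized so that larger tuples rank higher:

```python
class RuntimeFilter:
    """Sketch of the predicate (col1, ..., colx) >= Kth.
    Inactive until Kth is initialized by the L1 Top-K operators."""

    def __init__(self):
        self.kth = None                    # vector of sort-column values

    def accept(self, sort_cols):
        # Python tuples compare lexicographically, mirroring the
        # multi-column predicate described above.
        return self.kth is None or sort_cols >= self.kth

def filter_sorted_chunk(chunk, kth):
    """Chunks emitted by L1/L2 Top-K are sorted descending on the sort
    columns, so the first row below Kth lets us drop the rest of the chunk."""
    kept = []
    for row in chunk:
        if row < kth:
            break                          # all later rows are smaller still
        kept.append(row)
    return kept

rf = RuntimeFilter()
rf.accept((3, 1))        # → True (filter not yet active)
rf.kth = (5, 0)          # Kth initialized/updated by L1 Top-K
rf.accept((3, 1))        # → False
filter_sorted_chunk([(9,), (7,), (5,), (3,)], (5,))  # → [(9,), (7,), (5,)]
```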

The diagram below illustrates the collaboration between parallel operators and the upstream-downstream operators during the parallel Top-K query processing execution:

1. Red line: represents the shared and updated Kth value by the L1 Top-K operator, which is also used to filter data for the Scan operator.

2. Blue line: represents the Scan operator using a runtime filter {col >= Kth} to filter data, with the Kth value being updated and maintained by the L1 Top-K operator as part of the predicate condition.

3. Green line: represents the L2 Top-K operator maintaining the Kth value, which is used by the L1 Top-K operator for additional filtering.

(Figure 3: collaboration among parallel operators and between upstream and downstream operators)

(4) Internal Execution Flow of Top-K Operators

1. Operator initialization.

  1. A comparator is created based on the data types of the ORDER BY columns. The logic of the comparator is consistent with the description in the "Classical Top-K Algorithm – Applying to Top-K Queries" section.
  2. A min-heap is constructed inside the operator, with a fixed capacity of K, corresponding to the Top-K query value.
  3. For L1, L2, and L3 Top-K operators, each level within a compute node independently constructs a shared global Kth object, which is accessible and updated across all parallel Top-K operators at that level.

2. Data write to each parallel Top-K operator.

L1, L2, and L3 Top-K operators share a unified execution logic, with internal flags distinguishing their levels.

1. Heap top cleanup: Before processing the chunk, each Top-K operator begins by cleaning its min-heap:

  1. Repeatedly check the heap's top element: If the current level's Kth object has been initialized, use the comparator to compare the heap top with the Kth value. If the heap top is less than the current Kth, remove it from the heap (namely, perform a pop), and let the heap re-select its top element.
  2. If the top element is not filtered out by the current level's Kth, attempt filtering using the upper-level Kth object using the same comparison logic.
  3. Repeat this process until no more elements are removed, or the heap top is no longer less than any applicable Kth value.

2. Kth-based row filtering: Check each row in the input chunk:

  1. If the current level's Kth is initialized, compare the row against Kth. If it is smaller, the row cannot be in the final Top-K result and is discarded.
  2. If not filtered by the current level, attempt filtering using the upper-level Kth object using the same method.
  3. If the row is filtered out by either step above and the operator has been informed that the input chunk is sorted, all subsequent rows in the chunk can be safely discarded. (For more information, see the "Collaboration between Upstream and Downstream Operators" section.)

3. Data write to the min-heap: If a row passes Kth filtering, wrap it as a heap element a.

  1. If the heap size is less than K, insert a directly.
  2. If the heap already has K elements, compare a with the heap top using the comparator. If a is smaller, discard it. If a is greater than or equal to the heap top, pop the heap top and insert a. The heap will then reselect the top.

4. Kth value update: If the element a was successfully inserted into the heap and the heap size is now ≥ K, retrieve the heap top and update the Kth object with it. For more information, see the "Collaboration among L1 Top-K Operators – Concurrent Kth updates within a node" section.

5. Total element count update: Count how many elements were successfully inserted into the heap, and update the global total element count. Once this count reaches or slightly exceeds K, the Kth value is initialized. At the moment of initialization, the previous step cannot yet have updated Kth, because each operator's heap size may still be less than K and thus not yet eligible for a Kth update. For more information, see the "Collaboration among L1 Top-K Operators – Initialization of Kth value" section.

3. Operator completion and output:

For each level of Top-K operator (L1/L2/L3), once all input data has been fully processed, the operator enters a completed state. At this point, the total number of elements in the heap is less than or equal to K, as the processes of heap top cleanup, comparison, and filtering using the Kth value help identify and discard elements that cannot be part of the final Top-K result. The operator then repeatedly pops elements from the heap to form a descending sorted result sequence using the comparator. Let S_sorted = {a1, a2, ..., an}, where n ≤ K, represent the sorted result. This sequence is split into chunks and passed downstream, in the following way:

  1. Each L1 Top-K operator, upon finishing reading and processing its assigned partition, completes its Top-K computation and immediately outputs its sorted result as chunks to the L2 Top-K operator through in-memory transfer.
  2. Each L2 Top-K operator, upon receiving data from all L1 Top-K operators on the node, completes its Top-K computation and initiates network transfer using an Exchange operator to forward data to the L3 Top-K operator.
  3. The L3 Top-K operator on the master node, upon receiving data from all L2 Top-K operators, completes the final Top-K computation and builds the final sorted sequence of size K by continuously popping heap elements. It then returns the result segment starting from the offset position and containing len elements, which corresponds to the final result of the Top-K query: ORDER BY ... LIMIT offset, len
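Putting the write path and the completion phase together, the following is a simplified single-threaded sketch of one Top-K operator. The shared state is reduced to a plain dict, and upper-level Kth filtering, chunk formatting, and thread safety are omitted:

```python
import heapq

class TopKOperator:
    """Simplified sketch of one Top-K operator's write path (steps 1-5)."""

    def __init__(self, k, shared):
        self.k = k
        self.heap = []                      # min-heap of surviving rows
        self.shared = shared                # e.g. {'kth': None, 'count': 0}

    def process_chunk(self, chunk, chunk_sorted_desc=False):
        kth = self.shared['kth']
        # Step 1: heap-top cleanup against the current Kth.
        while self.heap and kth is not None and self.heap[0] < kth:
            heapq.heappop(self.heap)
        inserted = 0
        for row in chunk:
            kth = self.shared['kth']
            # Step 2: Kth-based row filtering.
            if kth is not None and row < kth:
                if chunk_sorted_desc:
                    break                   # remaining rows are smaller still
                continue
            # Step 3: write to the min-heap.
            if len(self.heap) < self.k:
                heapq.heappush(self.heap, row)
            elif row >= self.heap[0]:
                heapq.heapreplace(self.heap, row)
            else:
                continue
            inserted += 1
            # Step 4: once the heap is full, raise the shared Kth.
            if len(self.heap) >= self.k:
                top = self.heap[0]
                if self.shared['kth'] is None or top > self.shared['kth']:
                    self.shared['kth'] = top
        # Step 5: total element count update (Kth initialization omitted).
        self.shared['count'] += inserted

    def finish(self):
        # Completion: pop the heap into a descending sorted sequence (n <= K).
        return sorted(self.heap, reverse=True)

shared = {'kth': None, 'count': 0}
op = TopKOperator(2, shared)
op.process_chunk([1, 5, 3, 9])
op.finish()  # → [9, 5]; shared is now {'kth': 5, 'count': 4}
```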

Part 2: Top-K Execution Scheduling and Early Termination

The specific implementation of PolarDB-X's parallel Top-K optimization has been introduced in the previous section. Now, for complex queries involving multi-table joins and pagination, where the row count of tables R, T, and S is in the billions, how can we obtain the execution results within 100 milliseconds?

select * from R inner join S on S.A = R.A inner join T on T.B = R.B
where R.condition_1 and S.condition_2 and T.condition_3
order by R.col1 asc , R.col2 desc limit offset, len;

Clearly, the process of executing the complex query first, materializing the results, and then computing the Top-K is insufficient to meet performance requirements. The database must be designed with scheduling and scanning mechanisms that automatically identify the correct stopping points, so that once the pagination output requirement is met, the scan is immediately terminated. We need to address several technical challenges:

  1. We need to identify pipeline patterns like scan-join-topk and establish the relationship between the Top-K query and the sort key order.
  2. Column-oriented storage can have only one sort key and cannot support multiple covering indexes. Therefore, a universal solution is needed, where a single sort key can accommodate various ORDER BY column combinations and both ascending and descending order variations.
  3. Column-oriented storage lacks global or partition-level ordering and relies on the internal order of each file. Therefore, precise control over file scheduling and scanning is required.

Next, we will discuss PolarDB-X's optimization for early termination in pagination for complex queries.

(Figure 4: early termination for pagination in complex queries)

1. Execution Segment Matching

(1) Match Conditions

To apply the early termination policy, the system first identifies and matches eligible execution segments. An execution segment must meet all of the following conditions:

1.  It must be capable of forming a pipeline.

2.  The pipeline must start with a Scan operator and end with a Top-K operator.

3.  The ORDER BY columns of the Top-K operator (order_by_cols = (order_by_col1 ASC|DESC, order_by_col2 ASC|DESC, order_by_col3 ASC|DESC, …)) and the sort key of the underlying table scanned by the Scan operator (sort_key_cols = (sort_key_col1 ASC|DESC, sort_key_col2 ASC|DESC, sort_key_col3 ASC|DESC, …)) must satisfy one of the following four relationships:

  1. Full ascending match (TOTAL_ORDER_ASC): order_by_cols exactly matches sort_key_cols in both column order and sort direction.
  2. Full descending match (TOTAL_ORDER_DESC): order_by_cols is exactly the reverse of sort_key_cols in both column order and sort direction.
  3. Ascending prefix match (PREFIX_ORDER_ASC): order_by_cols and sort_key_cols share a common prefix, and the sort direction of that prefix is the same in both.
  4. Descending prefix match (PREFIX_ORDER_DESC): order_by_cols and sort_key_cols share a common prefix, and the sort direction of that prefix is the reverse between the two.

Example:

(Figure 5: examples of the four match relationships)
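The four relationships can be checked with a small helper like the one below (our own sketch, not PolarDB-X code; it assumes "common prefix" means the ORDER BY columns are a prefix of the sort key columns, and interprets a "reverse" match as a per-column ASC/DESC flip):

```python
def match_type(order_by_cols, sort_key_cols):
    """Both arguments are lists of (column_name, 'ASC'|'DESC') tuples.
    Returns one of the four match kinds above, or None if no match."""
    if not order_by_cols or len(order_by_cols) > len(sort_key_cols):
        return None
    n = len(order_by_cols)
    flip = {'ASC': 'DESC', 'DESC': 'ASC'}
    same = all(order_by_cols[i] == sort_key_cols[i] for i in range(n))
    rev = all(order_by_cols[i][0] == sort_key_cols[i][0] and
              order_by_cols[i][1] == flip[sort_key_cols[i][1]]
              for i in range(n))
    full = n == len(sort_key_cols)
    if same:
        return 'TOTAL_ORDER_ASC' if full else 'PREFIX_ORDER_ASC'
    if rev:
        return 'TOTAL_ORDER_DESC' if full else 'PREFIX_ORDER_DESC'
    return None

match_type([('c1', 'ASC')], [('c1', 'ASC'), ('c2', 'DESC')])
# → 'PREFIX_ORDER_ASC'
```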

(2) Matching Process

For an SQL query initiated by a user, the query is first parsed by the parser, which performs syntax analysis and converts it into an intermediate representation, typically an abstract syntax tree (AST). The AST is then passed to the query optimizer, which selects the execution plan with the lowest cost and generates an execution plan or execution tree. The execution plan provides a detailed description of how the database engine will execute the query, including the order of operations, the algorithms used, and the data structures involved. The database execution engine then executes the query according to the execution plan and returns the results to the user.

For example, in the case of the query discussed in this section, the following tree structure is generated, which means:

  1. Scan the R, S, and T tables, and apply filters based on condition_1, condition_2, and condition_3, respectively.
  2. The filtered data from the S table serves as the build side for the join (R INNER JOIN S ON S.A = R.A), whereas the filtered data from the T table serves as the build side for the join (R INNER JOIN T ON T.B = R.B).
  3. The filtered data from the R table serves as the probe side, and the two join operations are executed sequentially. Finally, the results of these operations are passed through a Top-K computation, where K = offset + len.

(Figure 6: execution plan tree for the example query)

In database systems, a pipeline execution framework (also called an execution pipeline or query pipeline) refers to a policy for organizing and executing the components or operators of a query execution plan in a pipelined manner. Query processing involves multiple steps and a variety of operators, such as join, selection, and projection, each of which may need to process large volumes of data.

Operators can be executed serially, meaning one operator begins only after its predecessor finishes. However, this can be inefficient because downstream operators may remain idle while waiting. By contrast, a pipeline execution framework enables an operator to start processing the upstream operator's output as soon as partial results are available, rather than waiting for the upstream operator to complete processing all input. This approach can significantly improve overall query performance.

The concept of pipeline execution is analogous to instruction pipelining in computer architecture: A downstream stage does not have to wait for the upstream stage to finish processing the entire input before it begins working.

Within a pipeline execution framework, operators are generally classified by their execution behavior into two categories:

1. Blocking execution: An operator must wait until all of its input data has been processed before it can begin execution.

2. Non-blocking execution: An operator can start processing data as soon as it receives output from its upstream operator, without waiting for all input data to be processed.

Pipeline execution substantially improves resource utilization and the degree of parallelism in query processing, because different stages of a query can run concurrently on multiple cores or processors. Most modern database systems, such as PostgreSQL, MySQL, and Oracle, adopt some form of pipeline execution policy to optimize query performance.
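The two categories can be illustrated with Python generators, where non-blocking operators stream rows one at a time while a blocking sort must consume all of its input first (a toy model, not the actual executor):

```python
def scan(rows):                     # non-blocking: emits rows as produced
    yield from rows

def filter_op(rows, pred):          # non-blocking: streams row by row
    for r in rows:
        if pred(r):
            yield r

def sort_op(rows):                  # blocking: must consume all input first
    yield from sorted(rows)

pipeline = sort_op(filter_op(scan([3, 1, 2]), lambda r: r > 1))
list(pipeline)  # → [2, 3]
```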

(3) Matching Pipelines That Meet the Conditions

  1. Identify pipelines that start with a Scan operator and end with a Top-K operator.
  2. Collect sorting information, including the ordering columns of the Top-K operator (order_by_cols = (order_by_col1 ASC|DESC, order_by_col2 ASC|DESC, order_by_col3 ASC|DESC, …)) and the sort key columns of the table scanned by the Scan operator (sort_key_cols = (sort_key_col1 ASC|DESC, sort_key_col2 ASC|DESC, sort_key_col3 ASC|DESC, …)).
  3. Determine the relationship between the Top-K ordering columns and the Scan operator's sort key columns, identifying whether the relationship is a full ascending match, a full descending match, an ascending prefix match, or a descending prefix match.

2. Global Top-K Threshold Maintenance

This step follows the same concept as in the parallel Top-K implementation. Let's first recall the definition: the Top-K threshold is a value such that, at a given moment, at least K of the elements seen so far are greater than or equal to it. The threshold is therefore less than or equal to the final Kth largest value (Kth). It evolves dynamically and gradually converges toward Kth over time. Any element smaller than the current threshold cannot appear in the final Top-K result set. Therefore, the Top-K threshold can be used by the Top-K operator during data filtering to exclude data that does not meet the criteria as early as possible.

All parallel operators jointly maintain a shared variable Vk, which represents the currently known global Kth largest value. This value could be a single-column value (val_col) or a vector composed of multiple columns ({val_col1, val_col2, ..., val_colx}). Through the collaboration of different parallel operators, the value of Vk gradually converges toward the true global Kth largest value Kth.

1. Initial state: At the time of Top-K operator initialization, the total number of global elements is zero, so the Kth largest value (Kth) cannot yet be determined.

2. Element count update: Within each node, every time a row is written into a Top-K operator, the total number of elements written across all Top-K operators on that node is updated. This update is atomic, thread-safe, and concurrent.

3. Initialization of Vk value: Once the global total number of elements reaches K, each Top-K operator maintains a min-heap of size at most K. The heap top elements are then collected from all local heaps, and the smallest among them, MIN, is selected as the current global Vk. This completes the initialization of the Vk value.

  • Rationale of this step: Suppose there are m threads across multiple nodes, each maintaining its own heap. When the total number of elements across all m heaps equals K, the union of these heaps can be considered a complete Top-K candidate set. Let MIN_i represent the heap top (minimum value) of the i-th thread's heap. Then the global Vk is initialized as min(MIN_1, MIN_2, ..., MIN_m), which is the minimum of the candidate set and therefore its Kth largest value.

4. Concurrent Vk updates within a node: After the Vk value is initialized, all parallel operators concurrently and safely update the global Vk. Whenever a value is written into a Top-K operator, the operator compares and potentially updates the global Vk, aiming to raise its value and complete convergence as quickly as possible. The process works in the following way:

  1. When a Top-K operator on thread i receives an input element a, it compares a with the current global Vk.
  2. If a < Vk, the element a is discarded because it cannot appear in the final Top-K result.
  3. Otherwise, the element a is inserted into the local heap maintained by the Top-K operator on that thread. If the heap contains fewer than K elements, a is inserted directly. Otherwise, a is compared with the current heap top and replaces it only when a is greater, after which the heap re-heapifies and yields a new heap top h.
  4. The updated heap top h is then used to perform a thread-safe comparison and update of the global Vk:

Vk = max(Vk, h)
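Steps 1-4 can be condensed into a small single-process sketch. This is illustrative only: `GlobalTopK` and `offer` are invented names, and the lock stands in for the CAS-style atomic max update a real engine would use.

```python
import heapq
import threading

class GlobalTopK:
    """Workers share a threshold vk that only rises toward the true Kth."""
    def __init__(self, k: int):
        self.k = k
        self.vk = None                  # shared threshold, None until known
        self.lock = threading.Lock()

    def _raise_vk(self, candidate: int) -> None:
        # Stand-in for an atomic max / CAS loop: vk only ever increases.
        with self.lock:
            if self.vk is None or candidate > self.vk:
                self.vk = candidate

    def offer(self, local_heap: list, value: int) -> None:
        vk = self.vk
        if vk is not None and value < vk:
            return                      # pruned: cannot reach the Top-K
        if len(local_heap) < self.k:
            heapq.heappush(local_heap, value)
        elif value > local_heap[0]:
            heapq.heapreplace(local_heap, value)
        if len(local_heap) == self.k:
            # A full local heap's top is a valid lower bound on the global
            # Kth largest value, so it is allowed to raise vk.
            self._raise_vk(local_heap[0])
```

Note that only a full (size-K) local heap may raise the shared threshold; a partially filled heap's top is not yet a safe lower bound.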

In a single-threaded Top-K process, once the heap reaches K elements, its top element naturally serves as the current threshold, and any change to the heap top represents a threshold update. In contrast, under the multi-threaded global Top-K threshold maintenance mechanism described above, multiple threads collaboratively maintain Vk, allowing the threshold to converge toward the final Kth largest value (Kth) more rapidly and to enable earlier pruning of elements that do not meet the Top-K criteria.

For the TPC-H 100GB standard benchmark dataset, experiments were conducted to record the variation of the Top-K threshold during query execution. The query used for testing was SELECT * FROM lineitem ORDER BY l_partkey DESC LIMIT 10000, 10. The threshold's evolution was compared between the following configurations: single-threaded execution, in which the thread maintains its threshold locally, and multi-threaded execution with 2, 4, and 8 threads, in which all threads collaboratively maintain a global threshold. The diagram below shows the results.

(Figure: variation of the Top-K threshold over time under 1, 2, 4, and 8 threads)

(If the SQL query is ORDER BY l_partkey ASC LIMIT 10000, 10, the curve will gradually decrease from the maximum value and converge toward a smaller Kth value.)

3. Threshold-based Early Termination of File Reads


(1) File Read Task Management

To fully leverage the Top-K threshold for early termination during file reads, each thread maintains a work pool to manage its scan tasks. Let the file set be:

F = {f1, f2, ..., fn}

Each thread retrieves scan tasks from its own work pool. Within a thread, an interleaved file read policy is employed to avoid processing a single file sequentially. This prevents large amounts of data that would not contribute to Join or Filter results from being read and subsequently forming Top-K inputs and thresholds.

The following specific policy is used:

1.  Iterate over the file set F in a loop. After completing one round, start another iteration.

2.  For each file encountered during iteration, extract the next I/O unit sequentially for execution, positioned immediately after the previous I/O unit in the file. Specifically:

  • If the relationship between the Top-K ordering columns and the file sort keys is a full ascending match or an ascending prefix match, extract I/O units in order from the beginning of the file.
  • If the relationship is a full descending match or a descending prefix match, extract I/O units in reverse order from the end of the file.

3.  For each file, only one I/O unit or task is selected per iteration before moving on to the next file.

4.  When a file has no remaining I/O tasks, remove it from the file set F.

5.  Continue until there are no files left in the set, at which point the iteration terminates.

This interleaved read policy ensures that, when processing multiple files, a single thread can terminate reading for each file as early as possible based on the current Top-K threshold.
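The round-robin dispatch described above might be sketched like this (a toy model; `interleaved_schedule` and the file-to-unit mapping are illustrative):

```python
from collections import deque

def interleaved_schedule(files: dict) -> list:
    """files maps a file name to its list of I/O units, already ordered in
    the required scan direction. Returns the dispatch order."""
    queue = deque(files.items())
    dispatched = []
    while queue:
        name, units = queue.popleft()
        dispatched.append((name, units.pop(0)))  # one unit per round
        if units:                                # keep the file for the next round
            queue.append((name, units))
    return dispatched
```

Each pass over the pool takes exactly one I/O unit per surviving file, so no single file can monopolize the thread before the threshold has had a chance to form.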

Design of I/O units:

  1. A row-group is the smallest unit of I/O. Each row-group contains a fixed number of r rows, as determined by the storage layer.
  2. Each I/O unit consists of m row-groups.
  3. To ensure that each I/O unit collects a sufficient number of Top-K rows, the total number of rows per I/O unit is designed to be slightly larger than the target K. Accordingly, m is chosen as the smallest integer such that m × r equals or slightly exceeds K.

By appropriately setting m and r, each I/O unit can approximately meet the target row count, thereby improving the efficiency of Top-K data collection.
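Under these assumptions, choosing m reduces to a one-line calculation (`io_unit_row_groups` is a hypothetical helper; the K and r values below are made up for illustration):

```python
import math

def io_unit_row_groups(k: int, r: int) -> int:
    """Smallest m such that m * r >= k (always at least one row-group)."""
    return max(1, math.ceil(k / r))

# e.g. target K = 10000 rows, r = 4096 rows per row-group
m = io_unit_row_groups(10000, 4096)   # -> 3, i.e. 12288 rows per I/O unit
```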

Each row-group contains multiple chunks, and each chunk holds 1,024 rows across multiple columns. To extract the values of the ordering columns from a chunk, the I/O unit containing the row-group must first be read into memory, after which the column data is parsed.

(2) Threshold-based Early Termination of File Reads

By leveraging the global threshold, early termination during file scanning can be achieved to reduce unnecessary I/O operations and computational overhead. The following specific policy is used:

1. Early Termination Condition for Ascending Scans

If the Top-K ordering columns (order_by_cols) and the file sort keys (sort_key_cols) have a full ascending match or an ascending prefix match relationship, extraction within each row-group starts from the first chunk, in sequential order:

C1, C2, ..., CM

Where Cj represents the j-th chunk in the row-group, and each chunk contains 1,024 rows. The early termination condition is:

CMP(Cj[0], Vk) > 0

Where Cj[0] represents the first row in the j-th chunk.

The following comparison algorithm is defined:

• Full ascending match relationship: Compare each column sequentially until the first non-zero comparison result is obtained, and then return that result.

• Ascending prefix match relationship: Compare only the columns in the prefix sequentially until the first non-zero comparison result is obtained, and then return that result.

Once the termination condition is met, all remaining chunks within the row-group, as well as all subsequent row-groups in the file, are skipped. No further I/O, extraction, or computation is performed. The file read task is then terminated and removed from the work pool.
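The ascending case can be sketched as follows. All names here are illustrative: rows are modeled as tuples of ordering-column values, and vk is the current global threshold; per-row pruning of a surviving chunk is left to the downstream Top-K operator.

```python
from typing import List, Tuple

def cmp_prefix(row: Tuple, vk: Tuple, prefix_len: int) -> int:
    """Column-by-column comparison over the matched prefix; returns the
    first non-zero result, or 0 when the prefixes are equal."""
    for a, b in zip(row[:prefix_len], vk[:prefix_len]):
        if a != b:
            return -1 if a < b else 1
    return 0

def scan_ascending(chunks: List[List[Tuple]], vk: Tuple,
                   prefix_len: int) -> List[Tuple]:
    emitted = []
    for chunk in chunks:
        # Early termination: the chunk's first row (its minimum under the
        # ascending sort) is already past the threshold, so every later
        # row in this file is too.
        if cmp_prefix(chunk[0], vk, prefix_len) > 0:
            break
        emitted.extend(chunk)
    return emitted
```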

The diagram below illustrates, for the TPC-H 100GB test dataset under full ascending match and ascending prefix match scenarios, the following during pagination execution (SELECT * FROM partsupp ORDER BY ps_partkey LIMIT 10000, 5): the variation of the Top-K threshold over time and the first row of each chunk (chunk[0]) produced by different files within the thread over time. If the value of chunk[0] exceeds the current Top-K threshold Vk, the reading process for that file is immediately terminated.

(Figure: Top-K threshold and per-file chunk[0] values over time for the ascending-scan test)

2. Early Termination Condition for Descending Scans

If the Top-K ordering columns (order_by_cols) and the file sort keys (sort_key_cols) have a full descending match or a descending prefix match relationship, extraction within each row-group starts from the last chunk, in reverse order:

CM, CM-1, ..., C1

Where Cj represents the j-th chunk in the row-group, and each chunk contains 1,024 rows. The early termination condition is:

CMP(Cj[end], Vk) < 0

Where Cj[end] represents the last row in the j-th chunk.

The following comparison algorithm is defined:

• Full descending match relationship: Compare each column sequentially until the first non-zero comparison result is obtained, and then return that result.

• Descending prefix match relationship: Compare only the columns in the prefix sequentially until the first non-zero comparison result is obtained, and then return that result.

Once the termination condition is met, all remaining chunks within the row-group, as well as all subsequent row-groups in the file, are skipped. No further I/O, extraction, or computation is performed. The file read task is then terminated and removed from the work pool.

The diagram below illustrates, for the TPC-H 100GB test dataset under full descending match and descending prefix match scenarios, the following during pagination execution (SELECT * FROM partsupp ORDER BY ps_partkey DESC LIMIT 10000, 5;): the variation of the Top-K threshold over time and the last row of each chunk (chunk[end]) produced by different files within the thread over time. If the value of chunk[end] is less than the current Top-K threshold Vk, the reading process for that file is immediately terminated.

(Figure: Top-K threshold and per-file chunk[end] values over time for the descending-scan test)

4. Overall Process

(Figure: overall execution flow of a pagination query)

Based on the steps described above, we can summarize how the PolarDB-X executor handles complex query pagination:

  1. First, for the incoming user SQL, the executor builds a DAG to execute the scan-join-Top-K task.
  2. At each parallelism level of the DAG, a work pool is created, and columnar data files are arranged in round-robin order.
  3. For each file in sequence, an I/O unit is extracted, which contains multiple row-groups, and each row-group contains multiple chunks. The scanning direction of each I/O unit, row-group, and chunk is determined according to the ascending/descending relationship.
  4. The DAG uses the Scan operators to pull data from the work pool and performs hash probing on multiple Join operators.
  5. The Top-K operator in the DAG receives the Join output and concurrently propagates its local min-heap's Top-K threshold into the shared global threshold.
  6. File read tasks at each parallelism level construct dynamic predicate conditions in real time based on the current value of the shared threshold.
  7. Within a file, because rows are ordered on the sort keys, once a chunk's boundary row fails the dynamic predicate condition (for example, Chunk.row[i] < Kth in a descending scan), the reading process for that file is immediately terminated, and the file is removed from the work pool.
  8. The next round of file scheduling proceeds by extracting the next I/O unit from the remaining files in the work pool.
  9. When all files have been removed from the work pool, the query terminates, producing the final result.
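Putting the pieces together for the ascending case (ORDER BY ... ASC), the whole loop might look like this toy single-threaded sketch. `paginated_topk_asc` and the chunk layout are illustrative assumptions; a real execution is multi-threaded, with Join operators between the scan and the Top-K.

```python
import heapq
from collections import deque

def paginated_topk_asc(files: dict, k: int) -> list:
    """Return the k smallest values (ORDER BY col ASC LIMIT k) from files
    whose chunks are each sorted ascending, scanned front-to-back."""
    heap = []                      # max-heap of the k smallest, via negation
    queue = deque(files.items())   # the per-thread work pool
    while queue:
        name, chunks = queue.popleft()
        chunk = chunks.pop(0)      # one I/O unit per file per round
        vk = -heap[0] if len(heap) == k else None
        if vk is not None and chunk[0] > vk:
            continue               # threshold hit: file leaves the pool
        for v in chunk:
            if len(heap) < k:
                heapq.heappush(heap, -v)
            elif v < -heap[0]:
                heapq.heapreplace(heap, -v)
        if chunks:
            queue.append((name, chunks))
    return sorted(-x for x in heap)
```

The descending case is symmetric: chunks are taken from the end of each file, a min-heap replaces the max-heap, and a file is dropped when its chunk's last row falls below the threshold.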

Test Environment

• TPC-H 100GB test dataset

• Specification: 2×16-core and 64 GB memory

Pagination Query of Aggregation Results

• SQL: select sum(l_quantity),l_orderkey from lineitem group by l_orderkey order by sum(l_quantity) desc limit 1000000, 10;

• Test results

PolarDB-X: 8.21 sec
ClickHouse: 23.07 sec
MySQL: 353.15 sec


Multi-table Joins and Pagination Queries

The diagram below presents test results of six different multi-table join query scenarios.

No. SQL
Q1 select l_shipdate,l_orderkey from lineitem, supplier where lineitem.l_suppkey = supplier.s_suppkey order by l_shipdate limit 20000,10;
Q2 select l_shipdate,l_orderkey from lineitem, supplier where lineitem.l_suppkey = supplier.s_suppkey order by l_shipdate, l_orderkey limit 20000,10;
Q3 select l_shipdate,l_orderkey from lineitem, supplier where lineitem.l_suppkey = supplier.s_suppkey order by l_shipdate, l_orderkey desc limit 20000,10;
Q4 select l_shipdate,l_orderkey from lineitem, supplier where lineitem.l_suppkey = supplier.s_suppkey order by l_shipdate desc limit 20000,10;
Q5 select l_shipdate,l_orderkey from lineitem, supplier where lineitem.l_suppkey = supplier.s_suppkey order by l_shipdate desc, l_orderkey limit 20000,10;
Q6 select l_shipdate,l_orderkey from lineitem, supplier where lineitem.l_suppkey = supplier.s_suppkey order by l_shipdate desc, l_orderkey desc limit 20000,10;

Test results

(Figure: execution times of Q1-Q6)

Summary

PolarDB-X employs a series of designs to enhance the performance of Top-K queries and pagination queries of large-scale data. The key points described in this article include:

  • Architecture of multi-level parallel Top-K operators

    • L1 Top-K operator: deployed on each compute node, responsible for preliminary Top-K filtering on local data, fully leveraging multi-core parallelism.
    • L2 Top-K operator: aggregates results from multiple L1 operators within the same node, reducing cross-node data transfer and optimizing network overhead.
    • L3 Top-K operator: centrally aggregates all L2 results across nodes to compute the global Top-K, ensuring the accuracy of the final query results.
  • Multi-threaded collaborative Kth value updates

    • Global Kth value maintenance: All threads jointly maintain a global Kth value (the Kth largest value) and update it dynamically through thread-safe and concurrent mechanisms, such as CAS, accelerating its convergence.
    • Filtering efficiency improvement: Rapidly converging Kth values can filter out non-qualifying data more effectively, reduce heap write operations, and improve overall query efficiency.
  • Tight coordination between upstream and downstream operators

    • Runtime-filter mechanism: Dynamically updated Kth values are used as filtering conditions and passed in real time to the Scan operator, reducing the scanned data volume.
    • Ordered data block processing: L2 and L3 operators exploit the ordering of upstream operator outputs to discard unnecessary data blocks in advance, further optimizing the processing workflow.
  • Internal execution flow of Top-K operators

    • Heap top cleanup: Before writing new data, the heap top is cleaned to ensure that the heap always maintains the best K elements.
    • Dynamic filtering and updates: Input data is dynamically filtered based on the current Kth value while simultaneously updating the heap and the Kth value, achieving efficient data processing.
    • Early termination of file reads: Based on the global threshold Vk, the operator dynamically decides whether to continue reading a file, significantly reducing I/O overhead.
  • Threshold-based early termination of file scheduling policies

    • Ascending and descending scans: The scanning direction and termination conditions are determined according to the relationship between the ordering columns and the file storage order, enabling efficient data filtering.
    • Interleaved file read policy: Avoid concentrating on a single file by repeatedly reading I/O units from multiple files in a round-robin manner, improving the balance and efficiency of data processing.

These designs enable PolarDB-X to return Top-K results efficiently and with low latency to meet performance requirements when dealing with complex queries and large-scale datasets.


ApsaraDB

562 posts | 178 followers
