All Products
Search
Document Center

PolarDB:Implementation of the TopK operator in the columnstore index

Last Updated:Mar 30, 2026

Deep paging queries — where the page number is large but the result set is small — cause severe performance degradation in analytical databases. PolarDB for MySQL redesigns the Sort/TopK operator in its in-memory column index (IMCI) feature to handle deep paging efficiently, using a SIMD-accelerated self-sharpening input filter for in-memory execution and ZoneMap-based pruning for disk-based fallback. On a 100 GB TPC-H dataset, the redesigned operator completes a deep paging query in 7.72 seconds, compared to 23.07 seconds for ClickHouse and 353.15 seconds for MySQL.

The deep paging problem

A common query pattern in business systems is to filter records, sort by a column, and paginate the results. In SQL:

ORDER BY column LIMIT offset, count

For 100 records per page:

Page

Query

Page 1

ORDER BY column LIMIT 0, 100

Page 10,001

ORDER BY column LIMIT 1000000, 100

In the second query, K (offset + count) equals 1,000,100, even though only 100 records are returned. This asymmetry between K and the result set size defines deep paging — and it exposes the weak points of standard TopK algorithms.

Industry approaches and their trade-offs

Three main approaches are used in practice. Each reduces operations on data outside the result set, but each fails differently under deep paging.

Priority queue (heap-based TopK)

The system maintains a max-heap of size K. For each record scanned, it checks whether the record belongs in the top K: if so, it replaces the top element and rebalances the heap. After a full scan, the heap contains the K largest records.

This works well at small K. At large K — such as 1,000,100 — two problems appear:

  • Memory pressure: the heap may not fit in memory.

  • Cache inefficiency: heap operations require random memory access. At large K, this causes frequent cache misses and degrades throughput.

Truncation during merge sort

Used by PolarDB, ClickHouse, SQL Server, and DuckDB. The system generates sorted runs during sorting and truncates each merged run to offset + limit records. Only records in [offset, offset + limit) matter, so there is no need to sort all data.

Truncation based on an offset and a limit during merge sort

This scales to disk when memory is insufficient. However, truncation only helps once a sorted run is long enough. For deep paging with large K, sorted runs early in the merge are often shorter than offset + limit, so the full dataset must be sorted before truncation takes effect.

Self-sharpening input filter

First proposed by Goetz Graefe and adopted by ApsaraDB for ClickHouse. The system maintains a cutoff value — an upper bound on the records that can appear in the top K result. Records above the cutoff are excluded before they enter a sorted run. After each sorted run is built, if the run length is greater than K, the Kth record becomes the new cutoff. The new cutoff is always less than or equal to the old cutoff, so the filter continuously sharpens.

Example (K = 3):

Batch

Sorted run

Cutoff

1

(1, 2, 10, 15, 21)

10

2

(2, 3, 5, 6, 8) (pre-filtered at 10)

5

3

(1, 2, 3, 3, 3) (pre-filtered at 5)

3

Unlike heap operations, the self-sharpening filter accesses memory sequentially during both cutoff filtering and sorted run accumulation. This avoids the random-access penalty of heap maintenance.

Why deep paging breaks both approaches

In LIMIT 1000000, 100, K is 1,000,100 but only 100 records are returned. This exposes the limits of each approach:

  • Heap-based: maintaining 1,000,100 heap entries generates severe random memory access overhead even when memory is available.

  • Merge sort truncation: a sorted run rarely exceeds 1,000,100 records early in the sort, so truncation cannot take effect and the full dataset is sorted.

Note

"Sufficient memory" here means the data structure that manages K records fits in memory — not that the full input dataset fits. In the scenarios described in this topic, the input data far exceeds available memory.

Two additional design requirements apply:

  • A unified algorithm should handle both shallow and deep paging without a hard threshold between them.

  • The system should select memory or disk execution dynamically based on available memory, not a static configuration.

Solution design

PolarDB IMCI's redesigned Sort/TopK operator combines the strengths of existing approaches while addressing their failures under deep paging.

Memory algorithm: SIMD-accelerated self-sharpening input filter

When memory is sufficient, IMCI uses the self-sharpening input filter instead of a priority queue. The reason not to use a priority queue at large K:

  • Random memory access during heap maintenance degrades performance at large K.

  • The heap size must equal K, so memory pressure grows linearly with K.

The self-sharpening filter avoids both problems:

  • Both cutoff filtering and sorted run accumulation access memory sequentially.

  • The filter works correctly at any K, covering shallow and deep paging without a boundary condition.

SIMD acceleration: Cutoff filtering is simple, repetitive, and frequent — comparing each record against the current cutoff. IMCI accelerates this using single instruction multiple data (SIMD) instructions, which apply the same comparison to multiple records in parallel. The filter reuses the same expression evaluation infrastructure as the table scan predicate filter, so no additional code path is needed.

Disk algorithm: ZoneMap-based pruning with merge sort

When memory is insufficient, IMCI uses merge sort with truncation. The reason not to use the self-sharpening filter on disk:

  • Saving accumulated sorted runs to disk and running external sort during premerge generates large amounts of disk I/O.

  • For large K, the premerge may process a significant amount of data outside [offset, offset + limit) before a useful cutoff is established.

Merge sort with truncation avoids these problems, and ZoneMap-based pruning further eliminates I/O by using the min/max statistics of sorted runs to skip runs that cannot contribute to the result.

How ZoneMap pruning works:

Each sorted run stores its minimum and maximum values. A barrier value divides all sorted runs into three types:

Type

Condition

Examples

Type A

min and max both < barrier

Run1, Run2

Type B

min < barrier, max > barrier

Run3

Type C

min and max both > barrier

Run4, Run5

Sorted run types divided by a barrier

Two pruning rules eliminate sorted runs that cannot affect the result:

  • If total records in Type A + Type B < offset, all Type A records fall in [0, offset). Exclude Type A from subsequent merges.

  • If total records in Type A > offset + limit, all Type C records fall in [offset + limit, N). Exclude Type C from subsequent merges.

Pruning process:

  1. Build a ZoneMap containing the min and max values of each sorted run.

  2. Find Barrier 1 (as large as possible) where records in Type A + Type B < offset.

  3. Find Barrier 2 (as small as possible) where records in Type A > offset + limit.

  4. Use Barrier 1 and Barrier 2 to exclude the corresponding sorted runs from subsequent merges.

Dynamic algorithm selection

Rather than using a fixed K threshold, IMCI selects the algorithm dynamically using a fallback mechanism:

  1. Always start with the memory algorithm.

  2. If memory remains sufficient throughout, complete the calculation in memory.

  3. If memory runs out — for example, not enough space to cache enough sorted runs that contain more than K records, or not enough space to complete the premerge — trigger the fallback:

    • Collect the min/max values of in-memory sorted runs and build a ZoneMap.

    • Save those sorted runs to disk.

    • Switch to the disk algorithm for the rest of the calculation.

  4. Complete the calculation using the disk algorithm.

Both algorithms use the same data structures, so the fallback requires no data reorganization. The sorted runs accumulated during the memory phase are used directly by the disk algorithm without accuracy loss.

Engineering optimizations

Late materialization

When building sorted runs, IMCI materializes only row IDs and the columns or expressions referenced in ORDER BY. Output columns are fetched from storage after the TopK result set is determined.

This reduces overhead in two ways:

  • Row IDs are compact, so more records fit in the same memory budget, extending the range where the memory algorithm applies.

  • Records are reordered frequently during TopK calculation via Copy and Swap operations. Materializing only row IDs minimizes the per-record cost of these operations.

The trade-off: fetching output columns by row ID after sorting may require random I/O. In deep paging scenarios, the actual result set is small (for example, 100 records), so this random I/O is negligible.

Computing pushdown

During self-sharpening filter execution, the current cutoff value is pushed down to the table scan operator as a new predicate. The table scan can then apply this predicate using the existing pruner infrastructure, filtering at the pack or row group level before data reaches the TopK operator.

This reduces overhead in two ways:

  • I/O reduction: packs or row groups that contain only records above the cutoff are skipped entirely.

  • Compute reduction: filtered packs or row groups are not processed by upper-layer operators.

Test results

The following query was run against a 100 GB TPC-H dataset:

SELECT
    l_orderkey,
    sum(l_quantity)
FROM
    lineitem
GROUP BY
    l_orderkey
ORDER BY
    sum(l_quantity) DESC
LIMIT
    1000000, 100;

System

Execution time

PolarDB IMCI

7.72 sec

ClickHouse

23.07 sec

MySQL

353.15 sec