By Hongwei
PolarDB for Xscale (PolarDB-X), as a cloud-native distributed database, provides Hybrid Transactional and Analytical Processing (HTAP), compute-storage separation, global secondary indexes, and other key features. For HTAP workloads, PolarDB-X implements a vectorized execution engine and supports querying and caching column-oriented data. Previous TPC-H reports have already demonstrated that its query performance ranks among the industry's best (reference: PolarDB-X TPC-H 100G White Paper).

ClickBench, developed by ClickHouse, Inc., is a benchmark designed to evaluate Online Analytical Processing (OLAP) systems in large wide table scenarios. The dataset consists of a single 70 GB table with 105 columns and 99,997,497 rows (approximately 100 million), and includes 43 SQL queries. Since ClickBench queries cover a wide variety of large wide table scenarios, many OLAP vendors around the globe have competed on this benchmark in recent years.
This article presents PolarDB-X's exploration and optimization practices on the ClickBench dataset. At the optimizer level, new rules were added to rewrite SQL queries for more efficient execution plans. At the executor level, the DISTINCT operator was optimized using HashSet, and an adaptive two-phase aggregation was implemented to automatically determine whether pre-aggregation is beneficial based on the workload. Because ClickBench places high demands on low-level implementation details, additional optimizations were applied to the NoGroupBy aggregate functions and the Group By Long, Long scenarios.
After these optimizations, PolarDB-X achieved top-tier performance on large wide table workloads. Specific performance results can be found in the ClickBench performance white paper: on a 4-node setup with 8 cores and 32 GB of memory per node, the queries completed in just 23 seconds.
In the final section, we analyze the bottlenecks in the ClickBench dataset and discuss two potential optimization directions: Swiss Hash Table and HyperScan. We believe these improvements will further enhance PolarDB-X's performance in the future.
For Q36, although the GROUP BY clause contains four columns, all of them are expressed as ClientIP minus a constant. Once ClientIP is determined, the values of ClientIP, ClientIP - 1, ClientIP - 2, and ClientIP - 3 are uniquely determined as well, so every row produces the same groupId regardless of which expression is used. We can therefore safely reduce the grouping columns to just GROUP BY ClientIP and eliminate the unnecessary computation.

The same logic applies to Q35: in the GROUP BY 1, URL clause, the first grouping item is the constant 1, so rewriting it as GROUP BY URL does not affect the final result.
SELECT 1, URL, COUNT(*) AS c FROM hits GROUP BY 1, URL ORDER BY c DESC LIMIT 10;

PolarDB-X's optimizer is implemented based on the Cascade model. Thanks to its highly extensible rule framework, adding a new Rule-Based Optimization (RBO) rule is straightforward. For Q36, the simplified logic is shown below.
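A minimal, hypothetical sketch of that logic follows (the GroupItem and SimplifyGroupByRule names are illustrative only, not PolarDB-X's actual rule classes): it drops constant grouping items, as in Q35, and keeps one grouping item per underlying column when the other items are functionally determined by that column, as in Q36.
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SimplifyGroupByRule {
    /** A grouping item: either a constant, or an expression of the form "column +/- constant". */
    record GroupItem(String column, boolean isConstant) {}

    static List<GroupItem> simplify(List<GroupItem> items) {
        List<GroupItem> kept = new ArrayList<>();
        Set<String> seenColumns = new HashSet<>();
        for (GroupItem item : items) {
            if (item.isConstant()) {
                continue;                 // constants never influence grouping (Q35: GROUP BY 1, URL)
            }
            if (seenColumns.add(item.column())) {
                kept.add(item);           // first expression driven by this column (Q36: keep ClientIP)
            }
            // otherwise the item is functionally determined by a column we already group on
        }
        return kept;
    }
}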

In ClickBench, some SQL statements can be rewritten by the optimizer's RBO rules without altering their original semantics. For example, Q30 computes 90 sums of the form SUM(ResolutionWidth + k) for constants k. In the executor, this would normally require maintaining 90 separate aggregation states. However, by applying the distributive law, the SQL statement can be rewritten in the following way:
# Q30 before optimization
SELECT SUM(ResolutionWidth), SUM(ResolutionWidth + 1), SUM(ResolutionWidth + 2),
       SUM(ResolutionWidth + 3), SUM(ResolutionWidth + 4), ..., SUM(ResolutionWidth + 89)
FROM hits;
# Q30 after optimization
SELECT SUM(ResolutionWidth), SUM(ResolutionWidth) + 1 * COUNT(*),
       SUM(ResolutionWidth) + 2 * COUNT(*),
       SUM(ResolutionWidth) + 3 * COUNT(*),
       SUM(ResolutionWidth) + 4 * COUNT(*),
       ..., SUM(ResolutionWidth) + 89 * COUNT(*)
FROM hits;
This allows the executor to compute only the SUM(ResolutionWidth) column and the COUNT(*) column, significantly reducing the computational overhead.

In ClickBench, many queries use COUNT(DISTINCT ...), such as Q5.
SELECT COUNT(DISTINCT UserID) FROM hits;
In PolarDB-X, DISTINCT is implemented by inserting the groupID together with the DISTINCT column into a hash table to remove duplicates.
In the original implementation of PolarDB-X, DISTINCT and GROUP BY shared the same aggOpenHashMap.
However, since DISTINCT only needs to check for duplicates and does not require tracking a groupId, we designed a dedicated HashSet for DISTINCT. The hash map needs both a key array and a value array: the key array maps a hash slot to its group-by key, and the value array maps the slot to its groupId. In contrast, the HashSet requires only the key array.
Because the HashSet only checks whether each row is distinct, it does not need to compute or store any groupId-related information, which reduces both computation and memory overhead.
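Since the original side-by-side diagram is not reproduced here, the following simplified sketch (hypothetical classes, not the actual PolarDB-X hash structures) contrasts the two: the distinct HashSet keeps only the key array, while the group-by hash map additionally maintains a value array of groupIds.
// Simplified open-addressing structures, for illustration only.
// Resizing is omitted for brevity; both tables assume a load factor well below 1.
final class DistinctHashSet {
    private final long[] keys;              // key array only
    private final boolean[] used;

    DistinctHashSet(int capacityPow2) {
        keys = new long[capacityPow2];
        used = new boolean[capacityPow2];
    }

    /** Returns true if the key was not present before, i.e. the row is distinct. */
    boolean add(long key) {
        int mask = keys.length - 1;
        int slot = (int) mix(key) & mask;
        while (used[slot]) {
            if (keys[slot] == key) {
                return false;               // duplicate row
            }
            slot = (slot + 1) & mask;       // linear probing
        }
        used[slot] = true;
        keys[slot] = key;
        return true;
    }

    private static long mix(long k) {
        k *= 0x9E3779B97F4A7C15L;
        return k ^ (k >>> 32);
    }
}

final class GroupByHashMap {
    private final long[] keys;              // key array: group-by key per slot
    private final int[] groupIds;           // value array: groupId per slot
    private final boolean[] used;
    private int nextGroupId;

    GroupByHashMap(int capacityPow2) {
        keys = new long[capacityPow2];
        groupIds = new int[capacityPow2];
        used = new boolean[capacityPow2];
    }

    /** Returns the groupId of the key, assigning a new one for a key seen for the first time. */
    int putIfAbsent(long key) {
        int mask = keys.length - 1;
        int slot = (int) mix(key) & mask;
        while (used[slot]) {
            if (keys[slot] == key) {
                return groupIds[slot];
            }
            slot = (slot + 1) & mask;
        }
        used[slot] = true;
        keys[slot] = key;
        return groupIds[slot] = nextGroupId++;
    }

    private static long mix(long k) {
        k *= 0x9E3779B97F4A7C15L;
        return k ^ (k >>> 32);
    }
}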

For example, in ClickBench Q5 (SELECT COUNT(DISTINCT UserID) FROM hits;), using a HashSet instead of a HashMap achieves a 20% performance improvement.
Two-phase aggregation accelerates the majority of SQL queries in ClickBench. For example, in Q10 with GROUP BY RegionID, the cardinality of RegionID is only 150,000, so pre-aggregation can significantly reduce the amount of shuffled data.
However, for Q32 and Q33, which perform GROUP BY on WatchID (Long) and ClientIP (Int), the combined key cardinality is close to the table size. In these cases, pre-aggregation barely reduces the amount of data shuffled and only adds hash table overhead; in our tests, two-phase aggregation made Q33 twice as slow as single-phase aggregation.
In the original PolarDB-X implementation, enabling ENABLE_PREFER_AGG lets the optimizer decide whether to use two-phase aggregation based on table cardinality statistics. However, these estimates are often inaccurate, especially for multi-column cardinalities, which can deviate from reality by one to two orders of magnitude.
Therefore, we implemented adaptive two-phase aggregation at the executor level, which decides at runtime, based on the data actually observed during pre-aggregation, whether the pre-aggregation phase is worthwhile.
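As a rough illustration only (hypothetical code showing one common way to implement this kind of adaptivity, not necessarily the exact PolarDB-X steps), the partial aggregation operator can sample the ratio of distinct groups to input rows over the first batches and switch to pass-through mode when pre-aggregation is not reducing the data:
// Hypothetical sketch (not the actual PolarDB-X operator): decide adaptively whether
// partial (pre-)aggregation is worthwhile, based on the reduction observed so far.
final class AdaptivePartialAggDecision {
    private static final long SAMPLE_ROWS = 100_000;   // observe this many rows before deciding
    private static final double MIN_REDUCTION = 0.5;   // keep pre-aggregation only if groups/rows <= 0.5

    private long inputRows;
    private long outputGroups;
    private Boolean keepPreAggregation;                 // null = still sampling

    /** Called after each input batch with its row count and the hash table's current group count. */
    void onBatch(int batchRows, long totalGroups) {
        inputRows += batchRows;
        outputGroups = totalGroups;
        if (keepPreAggregation == null && inputRows >= SAMPLE_ROWS) {
            double reduction = (double) outputGroups / inputRows;
            // High-cardinality keys (e.g. GROUP BY WatchID, ClientIP) barely reduce the data,
            // so shuffling raw rows is cheaper than maintaining the partial hash table.
            keepPreAggregation = reduction <= MIN_REDUCTION;
        }
    }

    /** True if the operator should bypass pre-aggregation and shuffle rows directly. */
    boolean passThrough() {
        return keepPreAggregation != null && !keepPreAggregation;
    }
}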

In our tests, the adaptive two-phase aggregation correctly determines whether to enable two-phase aggregation for each SQL query. While ensuring that the performance of other queries remains unaffected, it reduced the execution time of Q33 from 5.5 seconds to 3.4 seconds.
For Q34, the SQL statement is:
SELECT URL, COUNT(*) AS c FROM hits GROUP BY URL ORDER BY c DESC LIMIT 10
The URL column is relatively large (about 100 bytes on average), and we observed that network I/O during the shuffle between nodes became a bottleneck: on a 2-node setup with 16 cores and 64 GB of memory per node, the measured shuffle volume reached 2.4 GB. To address this, we apply LZ4 compression to chunks during network transmission.
However, when a chunk compresses poorly, the shuffle volume barely decreases while the decompression overhead is still paid. Therefore, we evaluate the compression ratio of each chunk dynamically and transmit the compressed form only when the compressed size falls below a threshold fraction of the original size; otherwise the chunk is sent uncompressed.
The main workflow is shown in the code snippet below.

With compression enabled, the network transmission data for Q34 was reduced from 2.4 GB to 300 MB, achieving a 2-second improvement in execution time.
private SerializedChunk serializeForce(Chunk page) {
    int maxCompressedLength = maxCompressedLength(serializationBuffer.size());
    byte[] compressionBuffer = new byte[maxCompressedLength];
    int actualCompressedLength = compressor.get()
        .compress(serializationBuffer.slice().getBytes(), 0, serializationBuffer.size(),
            compressionBuffer, 0, maxCompressedLength);
    // MINIMUM_COMPRESSION_RATIO is 0.8 by default: if compression saves less than 20%,
    // send the chunk uncompressed.
    if (((1.0 * actualCompressedLength) / serializationBuffer.size()) > MINIMUM_COMPRESSION_RATIO) {
        return new SerializedChunk(serializationBuffer.slice(), ChunkCompression.UNCOMPRESSED,
            page.getPositionCount(), serializationBuffer.size());
    }
    return new SerializedChunk(
        Slices.copyOf(Slices.wrappedBuffer(compressionBuffer, 0, actualCompressedLength)),
        ChunkCompression.COMPRESSED,
        page.getPositionCount(),
        serializationBuffer.size());
}
Applying type-specific optimizations to the input chunks during GroupBy can significantly boost aggregation performance. In the GROUP BY Long, Long scenario, the two 64-bit values can be packed into a single 128-bit key; since Java does not natively support Int128, the key must be simulated with long arrays.
There are two possible implementations: a single long array storing the low and high 64 bits interleaved, or two separate long arrays holding the low and high halves respectively.
Testing results show that the two-array approach slightly outperforms the single-array approach.
In the implementation, the serializedInputBlock only needs to maintain references to the two long arrays in the chunk, without additional memory allocation. (However, a new key array must still be created, since the hash table must remain in memory.)
// only maintain reference
serializedBlock = new Int128Array(longBlock1, longBlock2);
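As a rough, hypothetical sketch (not the actual Int128Array / Int128GroupBy implementation), the two-array layout lets the hash table hash and compare a 128-bit key as a pair of longs, without materializing any per-row objects:
// Hypothetical two-long-array view of a 128-bit key column, for illustration only.
final class Int128Columns {
    final long[] low;     // e.g. the first GROUP BY column
    final long[] high;    // e.g. the second GROUP BY column

    Int128Columns(long[] low, long[] high) {
        this.low = low;
        this.high = high;
    }

    /** Hash of the 128-bit key at the given row, mixing both halves. */
    long hash(int row) {
        long h = low[row] * 0x9E3779B97F4A7C15L;
        h ^= Long.rotateLeft(high[row] * 0xC2B2AE3D27D4EB4FL, 31);
        return h ^ (h >>> 32);
    }

    /** Compares the key at `row` with the key stored at `slot` in the hash table's key arrays. */
    boolean equalsAt(int row, long[] keyLow, long[] keyHigh, int slot) {
        return low[row] == keyLow[slot] && high[row] == keyHigh[slot];
    }
}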
Take ClickBench Q9 (SELECT RegionID, COUNT(DISTINCT UserID) AS u FROM hits GROUP BY RegionID ORDER BY u DESC LIMIT 10;) as an example: compared to the default GroupBy implementation, Int128GroupBy achieves a 3× performance improvement.
ClickBench Q3: SELECT MIN(EventDate), MAX(EventDate) FROM hits;
For aggregate functions without a GROUP BY clause, there is no need to compute groupId values or maintain per-group accumulation. Therefore, dedicated aggregation logic can be implemented for this case.
Specifically for Q3, we can implement a specialized DateMaxMinNoGroupByAccumulator, which simply iterates over each input row to track the minimum and maximum values.
@Override
public void accumulate(Chunk chunk, Chunk inputChunk) {
    Block block = chunk.getBlock(0);
    if (block instanceof DateBlock) {
        int positionCount = block.getPositionCount();
        // Seed the running value with the first row of the first chunk visited.
        if (!isVisited) {
            beforeValue = ((DateBlock) block).getPackedLong(0);
            isVisited = true;
        }
        for (int i = 0; i < positionCount; i++) {
            long newValue = ((DateBlock) block).getPackedLong(i);
            boolean greater = newValue > beforeValue;
            // Keep the smaller value for MIN, the larger value for MAX.
            if ((isMin && !greater) || (!isMin && greater)) {
                beforeValue = newValue;
            }
        }
    }
}
ClickBench Q4 involves computing the AVG function:
SELECT AVG(UserID) FROM hits;
In PolarDB-X, the AVG function is calculated as SUM(UserID) / COUNT(UserID).
Since the UserID column is of the Long type, SUM(UserID) may exceed LONG_MAX, so it must be stored using the Decimal type.
In a naive decimal implementation, a byte array represents the digits internally, and addition or subtraction requires handling carries and borrows, which is relatively time-consuming.
To optimize this, PolarDB-X introduces the Decimal64 and Decimal128 types: Decimal64 is backed by a single long variable, whereas Decimal128 uses two long variables to represent the low and high 64 bits, respectively.
When computing SUM, Decimal64 is used by default. If it is about to overflow, the system switches to Decimal128. If Decimal128 approaches overflow, it falls back to the naive decimal implementation.
@Override
public void accumulate(int groupId, Chunk inputChunk, int[] groupIdSelection, int selSize) {
    Block inputBlock = inputChunk.getBlock(0);
    DecimalBlock decimalBlock = inputBlock.cast(DecimalBlock.class);
    // Prepare result array and execute summary in vectorization mode.
    results[0] = results[1] = results[2] = 0;
    decimalBlock.cast(Block.class).sum(groupIdSelection, selSize, results);
    // Check sum result state and try to directly append sum result.
    if (results[2] == E_DEC_DEC64) {
        rescale(decimalBlock.getScale());
        long sumResult = results[0];
        accumulateDecimal64(groupId, sumResult);
    } else if (results[2] == E_DEC_DEC128) {
        rescale(decimalBlock.getScale());
        long decimal128Low = results[0];
        long decimal128High = results[1];
        accumulateDecimal128(groupId, decimal128Low, decimal128High);
    } else {
        // Fall back to row-by-row mode
        for (int i = 0; i < selSize; i++) {
            int position = groupIdSelection[i];
            accumulate(groupId, inputBlock, position);
        }
    }
}
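To make the overflow-triggered promotion concrete, here is a minimal, self-contained sketch (a hypothetical class, not PolarDB-X's Decimal64/Decimal128 implementation) of an unscaled sum that starts with a single long and promotes to a two-long, 128-bit representation when a 64-bit addition would overflow:
import java.math.BigInteger;

// Hypothetical sketch: 64-bit sum that promotes itself to a 128-bit (low/high) representation.
final class OverflowAwareSum {
    private boolean is128;          // false: Decimal64-style mode, true: Decimal128-style mode
    private long sum64;             // used while in 64-bit mode
    private long low, high;         // two's-complement 128-bit value, used after promotion

    void add(long value) {
        if (!is128) {
            try {
                sum64 = Math.addExact(sum64, value);
                return;
            } catch (ArithmeticException overflow) {
                // Promote the current 64-bit sum to 128 bits, then retry the addition below.
                low = sum64;
                high = sum64 < 0 ? -1L : 0L;      // sign extension
                is128 = true;
            }
        }
        // 128-bit addition: add into the low word and carry into the high word.
        long newLow = low + value;
        long carry = ((low & value) | ((low | value) & ~newLow)) >>> 63;  // unsigned carry of the low word
        high += (value < 0 ? -1L : 0L) + carry;                           // sign-extend the addend
        low = newLow;
    }

    BigInteger value() {
        if (!is128) {
            return BigInteger.valueOf(sum64);
        }
        return BigInteger.valueOf(high).shiftLeft(64)
            .add(new BigInteger(Long.toUnsignedString(low)));
    }
}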
After continuous optimization, PolarDB-X now delivers excellent performance on the ClickBench dataset. The current performance bottlenecks are mainly in Q29's REGEXP_REPLACE operations and the GROUP BY STRING operations in Q33 and Q34.
For Q29, Q33, and Q34, our team already has internal solutions, which are currently under testing and development.
For Q33, the performance bottleneck lies in the GROUP BY STRING operator, whose underlying implementation is an open-addressing hash table for String types:
SELECT URL, COUNT(*) AS c FROM hits GROUP BY URL ORDER BY c DESC LIMIT 10;
The industry also offers specialized algorithmic optimizations for this scenario. For instance, ClickHouse's paper "SAHA: A String Adaptive Hash Table for Analytical Databases" proposes a dispatch mechanism that handles strings of different lengths separately.
As a distributed HTAP database targeting general-purpose workloads, PolarDB-X recognizes the value of type-specific optimizations, but we also aim to focus on higher-level, more broadly applicable improvements.
During internal research, we found the Swiss Hash Table, introduced at CppCon 2017, particularly inspiring. This hash table is not tied to any particular data type and fully leverages CPU-level SIMD acceleration. Combined with PolarDB-X's prior explorations of SIMD instructions, we believe this algorithm can deliver the greatest performance benefit in general-purpose scenarios.
Specifically, the Swiss Hash Table offers the following advantages: it keeps per-slot metadata in a compact control-byte array, it can probe many slots in parallel with a single SIMD comparison over those control bytes, and its open-addressing layout keeps probes cache-friendly with low memory overhead.
Due to space constraints, we will not provide a detailed introduction to the Swiss Hash Table here. Instead, we present a practical design approach for implementing it in PolarDB-X.
The design of the Swiss Hash Table is divided into three layers:
1. ControlTable: used for rough filtering during probing based on the hash values of elements to be matched.
2. RecordTable: stores fixed-length data.
3. VariableData: stores variable-length data.
The main probing process works in the following way: the probe key's hash is first checked against a group of ControlTable bytes, using SIMD instructions to compare a short hash fragment against many slots at once; only the slots whose fragment matches are then verified against the full keys stored in the RecordTable (and, for variable-length fields, in VariableData).
This method filters out a large number of invalid entries in the first step, significantly accelerating the probing process.
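As a rough, hypothetical sketch (simplified to fixed-length long keys with no VariableData layer, and with a scalar loop standing in for the SIMD comparison of a 16-byte control group), the probing flow looks like this:
// Simplified Swiss-table sketch: fixed-length long keys, scalar stand-in for SIMD,
// no resizing (assumes the table is sized well below capacity).
final class SwissTableSketch {
    private static final int GROUP = 16;          // control bytes scanned per probing step
    private static final byte EMPTY = (byte) 0x80;

    private final byte[] control;                 // ControlTable: one byte per slot (tag or EMPTY)
    private final long[] keys;                    // RecordTable: fixed-length key per slot
    private final int mask;

    SwissTableSketch(int capacityPow2) {
        control = new byte[capacityPow2];
        java.util.Arrays.fill(control, EMPTY);
        keys = new long[capacityPow2];
        mask = capacityPow2 - 1;
    }

    /** Probes for the key; inserts it if absent. Returns true if the key was newly inserted. */
    boolean putIfAbsent(long key) {
        long h = hash(key);
        byte tag = (byte) (h & 0x7F);             // 7-bit fragment stored in the control byte
        int slot = (int) (h >>> 7) & mask;        // remaining bits pick the starting slot
        while (true) {
            // A real implementation compares all GROUP control bytes with one SIMD instruction.
            for (int i = 0; i < GROUP; i++) {
                int idx = (slot + i) & mask;
                byte c = control[idx];
                if (c == tag && keys[idx] == key) {
                    return false;                 // tag matched and the full key verified: present
                }
                if (c == EMPTY) {
                    control[idx] = tag;           // empty slot terminates the probe: insert here
                    keys[idx] = key;
                    return true;
                }
            }
            slot = (slot + GROUP) & mask;         // move on to the next group
        }
    }

    private static long hash(long key) {
        key *= 0x9E3779B97F4A7C15L;               // cheap mixing
        return key ^ (key >>> 32);
    }
}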

For Q29, the performance bottleneck lies in the implementation of REGEXP_REPLACE. Regular expressions can generally be implemented using two approaches: Nondeterministic Finite Automata (NFA) and Deterministic Finite Automata (DFA). The key difference is whether multiple paths are allowed from one state to reach the next.
In the Java ecosystem, three regular expression libraries are commonly used to implement REGEXP_REPLACE:
1. JONI: Originally developed in the JRuby project and evolved from Ruby's Oniguruma library, JONI uses an NFA-based matching strategy and performs well in most cases. However, because NFA matching is driven by the regex pattern, backtracking occurs on the input string when a match fails, which can lead to exponential time complexity in extreme cases.
2. Re2J: Evolved from Google's Re2 project, Re2J uses a DFA-based matching approach. It is driven by the input string, filtering out invalid regex fragments during traversal. This ensures time complexity linear to the input string. However, Re2J is not fully mature, lacks some matching features, and in most scenarios performs worse than JONI.
3. HyperScan-Java: Based on Intel's HyperScan project, first presented in the NSDI 2019 paper "HyperScan: A Fast Multi-pattern Regex Matcher for Modern CPUs," HyperScan has two major advantages over JONI and Re2J: it can compile many patterns into a single database and match them simultaneously (multi-pattern matching), and it makes heavy use of SIMD instructions on modern CPUs to accelerate matching.
Mainstream OLAP engines, such as ClickHouse, use HyperScan or its variants (such as VectorScan).
However, HyperScan-Java is currently inefficient due to the high CPU cost of Java-to-C++ object conversions. Moving forward, we plan to integrate PolarDB-X's vectorized expression capabilities and modify HyperScan-Java to perform batch regex replacements, further accelerating performance.
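As a small, hedged illustration (assuming the com.google.re2j dependency; this is not part of PolarDB-X's code), RE2/J mirrors the java.util.regex API, so a REGEXP_REPLACE-style call is a near drop-in replacement while guaranteeing matching time linear in the input length:
import com.google.re2j.Matcher;
import com.google.re2j.Pattern;

public final class Re2jReplaceExample {
    public static void main(String[] args) {
        // Equivalent of REGEXP_REPLACE(url, 'https?://', '') using the DFA-based RE2/J engine.
        Pattern pattern = Pattern.compile("https?://");
        Matcher matcher = pattern.matcher("https://example.com/path?a=1");
        System.out.println(matcher.replaceAll(""));   // prints: example.com/path?a=1
    }
}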
Through the optimizations described above, PolarDB-X has achieved significant performance improvements on the ClickBench large wide table. Enhancements to the optimizer rules have substantially reduced unnecessary computation, whereas executor-level optimizations, such as the independent HashSet, adaptive two-phase aggregation, and efficient compression mechanisms, have greatly improved the efficiency of DISTINCT operations, adaptive aggregation, and network data transmission. Although bottlenecks remain in regex processing and string-based GROUP BY operations, the PolarDB-X team has developed solutions, including the Swiss Hash Table and HyperScan optimizations, which are expected to further enhance system performance. Overall, these continuous optimizations and innovations ensure that PolarDB-X remains competitive in the distributed HTAP database space, providing users with high-performance, reliable analytical processing capabilities.