Implementation of the GroupJoin operator in In-Memory Column Index - PolarDB

When GroupJoin applies

GroupJoin requires all of the following conditions to be true:

The query performs EQUAL JOIN and GROUP BY.
The GROUP BY key matches the join key on one side of the JOIN (or is functionally dependent on it).
Each aggregate function references only one side of the JOIN — not both. For example, SUM(t1.a + t2.a) is not supported.

In PolarDB for MySQL IMCI-enabled operations, GroupJoin consistently outperforms the sequential combination of HASH JOIN + HASH GROUP BY, with one exception: RIGHT OUTER JOIN combined with GROUP BY on the right side.

Background

Consider this query:

SELECT
  key1,
  SUM(sales) AS total_sales
FROM
  fact_table LEFT JOIN dimension_table ON fact_table.key1 = dimension_table.key1
GROUP BY
  fact_table.key1
ORDER BY
  total_sales
LIMIT 100;

Without GroupJoin, this query requires two separate operations:

HASH JOIN — builds a hash table from dimension_table.key1, then probes it with fact_table.key1.
HASH GROUP BY — builds a second hash table from fact_table.key1 to aggregate results.

Both operations use key1 to build hash tables, which is redundant. For an M-row fact table and an N-row dimension table where key1 is the unique key, HASH JOIN outputs M rows and HASH GROUP BY then builds another M-row hash table.

GroupJoin eliminates the second hash table by combining join and aggregation into a single pass: it builds an N-row hash table from the dimension table and aggregates directly into it, resulting in lower memory consumption and fewer intermediate rows.

How it works

GroupJoin executes in two phases:

Build phase — the smaller left-side table builds a hash table. Aggregate functions run on left-side data during this phase (equivalent to HashGroupBy left_table).
Probe phase — the larger right-side table probes the hash table. Matching rows are aggregated into existing hash table entries. Non-matching rows are omitted.

Limitations

The GROUP BY key must match the join key on one side. In some cases, functional dependency on part of the join key is sufficient.
For RIGHT OUTER JOIN combined with GROUP BY on the right side, the right-side join key must be unique. If uniqueness cannot be guaranteed, convert the join to LEFT OUTER JOIN or group by the left side instead. If neither is possible, GroupJoin cannot be used.
Each aggregate function can reference only one table at a time. GroupJoin does not apply if an aggregate function spans both tables, such as SUM(t1.a + t2.a).

Behavior by JOIN type

INNER JOIN, GROUP BY left side

l_table INNER JOIN r_table
ON l_table.key1 = r_table.key1
GROUP BY l_table.key1

The following descriptions assume SQL statements execute in the order shown, and the join object is not dynamically switched.

Build the hash table from the left table, running aggregate functions on left-side data. A repeat count tracks how many right-side rows match each hash table entry.
Probe the hash table with the right table. For each matching right-side row, increment the repeat count by 1 and run aggregate functions on right-side data. Discard non-matching rows.
After the join, output only matched hash table entries. Discard unmatched entries.
The final aggregation result equals SUM(expr) × repeat count. For example, if SUM(expr) is 200 and the repeat count is 5, the result is 1,000.

INNER JOIN, GROUP BY right side

l_table INNER JOIN r_table
ON l_table.key1 = r_table.key1
GROUP BY r_table.key1

Because l_table.key1 equals r_table.key1 in an INNER JOIN, this scenario is equivalent to INNER JOIN with GROUP BY on the left side.

LEFT OUTER JOIN, GROUP BY left side

l_table LEFT OUTER JOIN r_table
ON l_table.key1 = r_table.key1
GROUP BY l_table.key1

Build the hash table from the left table and run aggregate functions on left-side data. Use a repeat count for right-side aggregation.
Probe the hash table with the right table. Matching entries increment the repeat count and run right-side aggregate functions. Non-matching right-side rows are discarded.
After the join, output matched entries. Collect unmatched hash table entries into a separate group with NULL inputs for all right-side aggregate functions.

LEFT OUTER JOIN, GROUP BY right side

l_table LEFT OUTER JOIN r_table
ON l_table.key1 = r_table.key1
GROUP BY r_table.key1

Build the hash table from the left table and run aggregate functions on left-side data. Use a repeat count for right-side aggregation.
Probe the hash table with the right table. Matching entries increment the repeat count and run right-side aggregate functions. Non-matching right-side rows are discarded.
After the join, output matched entries. Collect unmatched entries into a separate group with NULL inputs for all right-side aggregate functions.

RIGHT OUTER JOIN, GROUP BY left side

l_table RIGHT OUTER JOIN r_table
ON l_table.key1 = r_table.key1
GROUP BY l_table.key1

Build the hash table from the left table and run aggregate functions on left-side data. Use a repeat count for right-side aggregation.
Probe the hash table with the right table. For matching entries, increment the repeat count and run right-side aggregate functions. Collect all unmatched right-side rows into a separate group with NULL results for left-side aggregate functions.
After the join, output matched hash table entries. Discard unmatched entries.

RIGHT OUTER JOIN, GROUP BY right side

l_table RIGHT OUTER JOIN r_table
ON l_table.key1 = r_table.key1
GROUP BY r_table.key1

r_table.key1 must be unique. If uniqueness cannot be guaranteed, use the optimizer to convert the join to LEFT OUTER JOIN or change the GROUP BY to the left side.

Build the hash table from the left table and run aggregate functions on left-side data. Use a repeat count for right-side aggregation.
Probe the hash table with the right table. For matching entries, output both left-side and right-side aggregation results. For non-matching right-side rows, output results with NULL values for left-side aggregates.
After the join, GroupJoin is complete. No further hash table entry management is needed.

Data spilling

GroupJoin handles data spilling the same way as the partition-based execution of HASH JOIN and HASH GROUP BY:

GroupJoin uses a partition-based approach.
During the build phase, some partitions are held in memory. Others spill to temporary files on disk. Incremental writes to those partitions also spill to the corresponding temporary files. A Bloom filter is created for each on-disk partition to efficiently filter out non-matching right-side rows during the probe phase.
During the probe phase, in-memory partitions are probed first using the logic described above. For on-disk partitions, the Bloom filter checks whether the queried data belongs to that partition. If yes, the data spills to the corresponding temporary file. If no, the data is omitted or output.
After in-memory partitions are processed, the system processes on-disk partitions sequentially. At least one partition fits on disk without splitting into sub-partitions, and the same join and aggregation logic applies.

Related papers

Two academic papers provide the theoretical and practical foundation for GroupJoin in PolarDB.

Paper 1 (2011): *Accelerating Queries with Group-By and Join by Groupjoin*

This paper establishes the theoretical feasibility of GroupJoin across different query types. It describes applicable scenarios and aggregate function constraints, but the treatment is abstract and light on implementation details.

Paper 2 (2021): *A Practical Approach to Groupjoin and Nested Aggregates*

From the HyPer database team at Ludwig Maximilian University of Munich, this paper covers efficient GroupJoin implementation in in-memory databases. Key contributions:

1. GroupJoin in correlated subquery decorrelation

When a correlated subquery includes GROUP BY, MagicSet can deduplicate tables and introduce a JOIN + GROUP BY combination to decorrelate the subquery — creating a context where GroupJoin applies. PolarDB supports similar decorrelation in IMCI-enabled operations, though execution plans with shared child objects are not yet generated.

2. Eager aggregation

Aggregate left-side data as the hash table is built, rather than retaining payloads per entry for later aggregation. PolarDB supports eager aggregation in IMCI-enabled operations.

3. Memoizing for concurrent hash table contention

In extreme cases, all right-side rows match the same hash table entry during a probe. Repeatedly running an aggregate function like SUM() on the same entry creates contention — a performance bottleneck even for atomic add operations.

The paper's solution: run a CAS (Compare-And-Swap) command on each entry to assign an owner thread. Threads that fail to acquire ownership create a local hash table for their calculations, then merge results into the global hash table.

4. When GroupJoin is not the best choice

When selectivity on the left side is low (most left-side rows are filtered out after the probe), a tradeoff emerges:

With eager aggregation: no payloads to retain, saving memory — but much of the pre-aggregation work is wasted on rows that will not be selected.
Without eager aggregation: memory usage grows, offsetting the benefit of eager aggregation.

The paper recommends that when selectivity is low, defer aggregation: after the join completes, use HASH GROUP BY for local aggregation on the small resulting groups. Selecting the right strategy requires an optimizer to estimate selectivity and cardinality.

PolarDB's position on this tradeoff

PolarDB experts disagree with the paper's conclusions in two areas:

Throughput measurements: The paper uses tuples/second. PolarDB's experiments, using the same 1 GB dataset, show different results because of differences in implementation. PolarDB's data shows GroupJoin consistently improves throughput except for the RIGHT OUTER JOIN + GROUP BY RIGHT combination:

GroupJoin does not apply to Q17 in IMCI-enabled scenarios.

Query	HASH JOIN + HASH GROUP BY	GroupJoin
Q3	130 MB	152 MB
Q13	11 MB	33 MB
Q18	315 MB	1 GB

Memoizing effectiveness: PolarDB's experiments show minimal hash table entry contention in TPC-H queries. Local hash tables created by memoizing are rarely needed in practice, which means memoizing delivers performance similar to HASH JOIN + HASH GROUP BY in PolarDB's environment. The paper's comparisons do not demonstrate a clear advantage for memoizing.

For IMCI-enabled operations in PolarDB, the selectivity tradeoff is less significant: PolarDB typically uses the small table to build the hash table. When selectivity is low and eager aggregation is used, the cost is wasted time on a small table — a minor penalty. This is why GroupJoin is almost always preferable to HASH JOIN + HASH GROUP BY in PolarDB IMCI scenarios.

GroupJoin in TPC-H queries

TPC-H is a standard benchmark for testing analytical processing (AP) capabilities. GroupJoin applies to many of its 22 queries, but most require query modifications to enable it.

Q13

GroupJoin applies directly to Q13 without modification:

SELECT
    c_count,
    count(*) AS custdist
FROM
    (
        SELECT
            c_custkey,
            count(o_orderkey) AS c_count
        FROM
            customer
            LEFT OUTER JOIN orders ON c_custkey = o_custkey
            AND o_comment NOT LIKE '%pending%deposits%'
        GROUP BY
            c_custkey
    ) c_orders
GROUP BY
    c_count
ORDER BY
    custdist DESC,
    c_count DESC;

Execution plan without GroupJoin:

Execution plan with GroupJoin:

Q3

Q3 requires a GROUP BY rewrite before GroupJoin can be used:

SELECT
    l_orderkey,
    sum(l_extendedprice * (1 - l_discount)) AS revenue,
    o_orderdate,
    o_shippriority
FROM
    customer,
    orders,
    lineitem
WHERE
    c_mktsegment = 'BUILDING'
    AND c_custkey = o_custkey
    AND l_orderkey = o_orderkey
    AND o_orderdate < DATE '1995-03-15'
    AND l_shipdate > DATE '1995-03-15'
GROUP BY
    l_orderkey,
    o_orderdate,
    o_shippriority
ORDER BY
    revenue DESC,
    o_orderdate
LIMIT 10;

Original execution plan in IMCI:

The GROUP BY keys (l_orderkey, o_orderdate, o_shippriority) differ from all join keys, so GroupJoin does not apply directly. The following chain of functional dependencies enables a rewrite:

The join predicate l_orderkey = o_orderkey means that in any result row, l_orderkey and o_orderkey are equal.
Therefore, GROUP BY l_orderkey, o_orderdate, o_shippriority is equivalent to GROUP BY o_orderkey, o_orderdate, o_shippriority.
Since o_orderkey is the primary key of the orders table, o_orderdate and o_shippriority are functionally dependent on o_orderkey.
Therefore, GROUP BY o_orderkey, o_orderdate, o_shippriority reduces to GROUP BY o_orderkey.

Replacing the GROUP BY clause with GROUP BY o_orderkey enables GroupJoin:

The MySQL optimizer can partially derive functional dependencies but cannot fully reduce this to GROUP BY o_orderkey automatically. SQL Server can perform the full deduction in theory, though this has not been fully validated in IMCI-enabled practice. The same functional dependency deduction applies to Q3, Q4, Q10, Q13, Q18, Q20, and Q21. When the optimizer can perform the full deduction, GROUP BY keys are shortened and aggregation becomes faster.

Q10

Q10 requires two modifications before GroupJoin can be used:

SELECT
    c_custkey,
    c_name,
    sum(l_extendedprice * (1 - l_discount)) AS revenue,
    c_acctbal,
    n_name,
    c_address,
    c_phone,
    c_comment
FROM
    customer,
    orders,
    lineitem,
    nation
WHERE
    c_custkey = o_custkey
    AND l_orderkey = o_orderkey
    AND o_orderdate >= DATE '1993-10-01'
    AND o_orderdate < DATE '1993-10-01' + INTERVAL '3' MONTH
    AND l_returnflag = 'R'
    AND c_nationkey = n_nationkey
GROUP BY
    c_custkey,
    c_name,
    c_acctbal,
    c_phone,
    n_name,
    c_address,
    c_comment
ORDER BY
    revenue DESC
LIMIT 20;

Replace the grouping keys with c_custkey (the primary key of the customer table), using the same functional dependency reasoning as Q3.
Adjust the join order so that the customer table join is outermost.

The first modification always improves performance. The second sometimes introduces side effects.

Q17

Q17 contains a correlated subquery:

SELECT
    sum(l_extendedprice) / 7.0 AS avg_yearly
FROM
    lineitem,
    part
WHERE
    p_partkey = l_partkey
    AND p_brand = 'Brand#44'
    AND p_container = 'WRAP PKG'
    AND l_quantity < (
        SELECT
            0.2 * avg(l_quantity)
        FROM
            lineitem
        WHERE
            l_partkey = p_partkey
    );

Two decorrelation algorithms produce execution plans that do not use GroupJoin:

Using MagicSet for decorrelation introduces an intermediate state that is compatible with GroupJoin:

This approach is also shown in Figure 3 of paper 2:

PolarDB partially supports MagicSet as a decorrelation method in IMCI-enabled scenarios, but does not yet generate execution plans with shared child objects. As a result, GroupJoin cannot be used for Q17 in IMCI-enabled scenarios.

Q18

Q18 applies GroupJoin after simplifying GROUP BY. The example below omits the IN subquery and ORDER BY clause for clarity:

SELECT
    c_name,
    c_custkey,
    o_orderkey,
    o_orderdate,
    o_totalprice,
    sum(l_quantity)
FROM
    customer,
    orders,
    lineitem
WHERE
    c_custkey = o_custkey
    AND o_orderkey = l_orderkey
GROUP BY
    c_name,
    c_custkey,
    o_orderkey,
    o_orderdate,
    o_totalprice

Apply the following reductions:

c_custkey is the primary key of customer, so c_name is functionally dependent on it. o_orderkey is the primary key of orders, so o_orderdate and o_totalprice are functionally dependent on it. Therefore, GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice reduces to GROUP BY c_custkey, o_orderkey.
The join predicate c_custkey = o_custkey means results only contain rows where c_custkey = o_custkey. Therefore, GROUP BY c_custkey, o_orderkey reduces to GROUP BY o_custkey, o_orderkey.
Since o_orderkey uniquely determines o_custkey, GROUP BY o_custkey, o_orderkey reduces to GROUP BY o_orderkey.

The rewritten query:

SELECT
    ANY_VALUE(c_name),
    ANY_VALUE(c_custkey),
    o_orderkey,
    ANY_VALUE(o_orderdate),
    ANY_VALUE(o_totalprice),
    sum(l_quantity)
FROM
    customer,
    orders,
    lineitem
WHERE
    c_custkey = o_custkey
    AND o_orderkey = l_orderkey
GROUP BY
    o_orderkey

Execution plan without GroupJoin:

Execution plan with GroupJoin:

Shortening the GROUP BY keys also benefits regular execution plans that do not use GroupJoin.

Q20

Q20 contains correlated subqueries similar to Q17. MagicSet decorrelation introduces an intermediate state compatible with GroupJoin before MagicSet is removed:

...
AND ps_availqty > (
    SELECT
        0.5 * sum(l_quantity)  /* scalar aggregation */
    FROM
        lineitem
    WHERE
        l_partkey = ps_partkey         /* correlated predicate 1 */
        AND l_suppkey = ps_suppkey     /* correlated predicate 2 */
        AND l_shipdate >= '1993-01-01'
        AND l_shipdate < DATE_ADD('1993-01-01', INTERVAL '1' YEAR)
)

Other queries

According to paper 1 and paper 2, GroupJoin also applies to Q5, Q9, Q16, and Q21 after modification, but the specific modification steps are not described in those papers. Execution plans for GroupJoin on these queries are also not available in the HyPer WebInterface (https://hyper-db.de/interface.html#).

Query performance

Many TPC-H queries combine JOIN and GROUP BY, making them candidates for GroupJoin optimization.

Paper 1 compared query performance for Q3, Q5, Q9, Q10, Q13, Q16, Q17, Q20, and Q21 using 1 GB of data. GroupJoin reduced total latency from 1,932 ms to 1,295 ms across those queries.

Paper 2 compared Q3, Q13, Q17, and Q18 using 1 GB of data, with three strategies:

Separate — standard JOIN followed by GROUP BY
Eager — GroupJoin with eager aggregation
Memoizing — GroupJoin with memoizing optimization

The paper's line charts show two patterns across Q3, Q13, Q17, and Q18:

Memoizing performance is similar to HASH JOIN + HASH GROUP BY.
Eager aggregation outperforms the other methods only in Q13.

These results support the paper's conclusion that the optimal GroupJoin strategy depends on query characteristics and that optimizers should select the execution plan based on selectivity and cardinality estimates.

PolarDB's experiments reach different conclusions. With the same 1 GB dataset, PolarDB shows that GroupJoin consistently improves throughput (see the table in Related papers), and that minimal contention exists in TPC-H queries. Local hash tables created by memoizing are rarely needed, which is why memoizing delivers performance comparable to HASH JOIN + HASH GROUP BY in PolarDB's implementation.

Conclusion

GroupJoin eliminates redundant work by combining HASH JOIN and HASH GROUP BY into a single operator. In IMCI-enabled operations on PolarDB for MySQL, it consistently improves query performance except when RIGHT OUTER JOIN is combined with GROUP BY on the right side.

GroupJoin is not universally applicable. It requires EQUAL JOIN + GROUP BY with matching keys on one side, constraints on aggregate functions, and specific implementation conditions. These requirements make GroupJoin execution plans narrower in scope and more complex to maintain.

To maximize the benefit of GroupJoin, generalize its execution plans to cover common query patterns — particularly those involving functional dependency simplification of GROUP BY keys, as demonstrated by the TPC-H examples above.