All Products
Search
Document Center

PolarDB:Use GroupJoin to perform IMCI-enabled operations

Last Updated:Dec 04, 2023

This topic describes the limits and methods of using GroupJoin to perform IMCI-enabled operations in PolarDB databases, and introduces papers related to GroupJoin. Before you read this topic, we recommend that you have an understanding of HASH JOIN and HASH GROUP BY.

Background information

SELECT
  key1,
  SUM(sales) as total_sales
FROM
  fact_table LEFT JOIN dimension_table ON fact_table.key1 = dimension_table.key1
GROUP BY
  fact_table.key1
ORDER BY
  total_sales
LIMIT 100;

Traditionally, if you perform the preceding IMCI-enabled query in a PolarDB database, you must perform HASH JOIN and then HASH GROUP BY. In both HASH JOIN and HASH GROUP BY, key1 is used to create a hash table. The following paragraphs describe how key1 is used. Take note that fact_table.key1 is the same as dimension_table.key1.

  1. HASH JOIN: uses dimension_table.key1 to create a hash table, and uses fact_table.key1 to query the hash table and output data.

  2. HASH GROUP BY: uses fact_table.key1 to create another hash table, and aggregates data in this new table.

To improve efficiency, a method is required to integrate the execution of HASH JOIN and HASH GROUP BY into one operator. The operator may use dimension_table.key1 to create a hash table and aggregate data, and fact_table.key1 to query the table and aggregate data. If this operator comes into use, there is no need to create another hash table. Following this idea, the GroupJoin operator is introduced.

GroupJoin helps eliminate the effort of creating another hash table and reduces the data size of intermediate results. The JOIN operation may produce a result set that contains a large number of results. This is because a row of a table may match multiple rows of another table. In the worst cases, the joined result of an N-row table and an M-row table is a result set that contains N × M rows of results. When HASH JOIN and HASH GROUP BY are performed in sequence, an N-row hash table is created, N × M × S rows of data are output, and then a new hash table is created for data aggregation. The procedure can cause a waste of resources. In the formula for calculating data rows, S indicates selectivity and can be a value from 0 to 1. In the preceding code, LEFT OUTER JOIN is performed on an M-row fact table (large table) and an N-row dimension table (small table), and key1 is the unique key. In this case, an N-row hash table is generated, M rows of data are output, and then another hash table is created to contain M rows of aggregated data. However, if GroupJoin is performed on the same two tables, data is joined and aggregated both in the N-row hash table. This costs less time and consumes fewer memory resources.

To exploit the advantages of GroupJoin, PolarDB for MySQL provides GroupJoin for IMCI-enabled operations.

How it works

Overview

In IMCI-enabled operations, GroupJoin is performed as a combined execution of HASH JOIN and HASH GROUP BY.

  1. When GroupJoin is executed, the small left-side table is first used to create a hash table. The data of the left-side table is aggregated when the hash table is created, which produces the same effect as HashGroupby left_table.

  2. Then, the large right-side table is used to match data of the created hash table. Data that meets the matching conditions is aggregated in the hash table, whereas data that does not meet the matching conditions is omitted. This aggregation is considered the aggregation of data related to the right-side table.

This section provides an overview of GroupJoin. The following sections provide details.

Limits

To reduce the complexity of executing GroupJoin, PolarDB for MySQL sets the following limits on the execution:

  1. The GROUP BY key must be the same as the join key used on one side. In some cases, the GROUP BY key is functionally dependent on only a part of the join key.

  2. In scenarios of RIGHT JOIN + GROUP BY RIGHT, the right keys must be unique. Otherwise, the operation may be changed into LEFT JOIN or GROUP BY LEFT, or GroupJoin cannot be used.

  3. An aggr function can reference only the left-side table or the right-side table at a time. GroupJoin does not apply if the aggr function of a GROUP BY operator references both tables at the same time, like in the case of SUN(t1.a+t2.a).

Sample code

INNER JOIN/GROUP BY LEFT

Sample code:

l_table INNER JOIN r_table
ON l_table.key1 = r_table.key1
GROUP BY l_table.key1
Note

The following descriptions are provided based on the following assumption: The SQL statements are executed in sequence as described, and the object of the JOIN operation is not dynamically switched to another table.

  1. Use the left-side table to create a hash table. When the hash table is created, the aggr function is run to aggregate data of the left-side table. When the aggr function is run to aggregate data of the right-side table, a repeat count is used to count the number of payloads of a hash table entry.

  2. When the tables are joined, the right-side table is used to match entries of the hash table. If no entry matches a row in the right-side table, the row is omitted. If one entry matches, a score of 1 is added to the repeat count of the aggr context for the left-side table, and the aggr function is run for the right-side table.

  3. After JOIN is executed, the aggregation results of only hash table entries that are matched are output, and those that are not matched are omitted.

  4. The aggregation result equals the result of SUM(expr) multiplied by the repeat count. For example, if the result of SUM(expr) is 200 and the repeat count is 5, the aggregation result is 1000.

INNER JOIN/GROUP BY RIGHT

Sample code:

l_table INNER JOIN r_table
ON l_table.key1 = r_table.key1
GROUP BY r_table.key1

Because l_table.key1 is the same as r_table.key1, the sample code describes the scenario of INNER JOIN and GROUP BY LEFT.

LEFT OUTER JOIN/GROUP BY LEFT

Sample code:

l_table LEFT OUTER JOIN r_table
ON l_table.key1 = r_table.key1
GROUP BY l_table.key1
  1. Use the left table to create a hash table. When the hash table is created, the aggr function is run to aggregate data of the left-side table. Use a repeat count when the data of the right-side table is aggregated.

  2. When the tables are joined, the right-side table is used to match entries of the hash table. If no entry matches a row in the right-side table, the row is omitted. If one entry matches, a score of 1 is added to the repeat count of the aggr context for the left-side table, and the aggr function is run for the right-side table.

  3. This scenario is different from INNER JOIN in the following aspect: After JOIN is executed, the unmatched entries in the hash table are listed in a separate group, and the inputs of the corresponding aggr functions for the right-side table are all NULL.

LEFT OUTER JOIN/GROUP BY RIGHT

Sample code:

l_table LEFT OUTER JOIN r_table
ON l_table.key1 = r_table.key1
GROUP BY r_table.key1
  1. Use the left-side table to create a hash table. When the hash table is created, the aggr function is run to aggregate data of the left-side table. When the aggr function is run to aggregate data of the right-side table, a repeat count is used.

  2. When the tables are joined, the right-side table is used to match entries of the hash table. If no entry matches a row in the right-side table, the row is omitted. If one entry matches, a score of 1 is added to the repeat count of the aggr context for the left-side table, and the aggr function is run for the right-side table.

  3. After JOIN is executed, the matched entries in the hash table are output, the unmatched entries are listed in a separate group, and the inputs of the corresponding aggr functions for the right-side table are all NULL.

RIGHT OUTER JOIN/GROUP BY LEFT

Sample code:

l_table RIGHT OUTER JOIN r_table
ON l_table.key1 = r_table.key1
GROUP BY l_table.key1
  1. Use the left-side table to create a hash table. When the hash table is created, the aggr function is run to aggregate data of the left-side table. When the aggr function is run to aggregate data of the right-side table, a repeat count is used.

  2. When the tables are joined, the right-side table is used to match entries of the hash table. If an entry matches a row in the right-side table, a score of 1 is added to the repeat count of the aggr context for the left-side table, and the aggregation results for the right-side table are calculated. If no entry matches, all the unmatched rows in the right-side table are listed in a separate group, and the corresponding aggregation results for the left-side table are all NULL.

  3. After JOIN is executed, the matched entries in the hash table are output, and the unmatched entries are omitted.

RIGHT OUTER JOIN/GROUP BY RIGHT

Sample code:

l_table RIGHT OUTER JOIN r_table
ON l_table.key1 = r_table.key1
GROUP BY r_table.key1

Limits

r_table.key1 must be unique. Otherwise, the join operation is invalid. If you are not sure whether r_table.key1 is unique, convert JOIN to LEFT OUTER JOIN and GROUP BY to GROUP BY LEFT by using an optimizer.

Procedure

  1. Use the left table to create a hash table. When the hash table is created, the aggr function is run to aggregate data of the left-side table. Use a repeat count when the data of the right-side table is aggregated.

  2. When the tables are joined, the right-side table is used to match entries of the hash table. If an entry matches a row in the right-side table, the aggregation results of the left-side tables and those of the right-side tables are output. If no entry matches, the aggregation results are also output, but the aggregation results of the left-side table are all NULL.

  3. After JOIN is executed, GroupJoin is complete and no hash table entries need to be managed.

Data spilling when GroupJoin is used

When GroupJoin runs, data is spilled in the same way as data spilling for the partition-style execution of HASH JOIN and HASH GROUP BY. The following points describe the details:

  1. The GroupJoin algorithm uses the partition style.

  2. When the left-side table is used to create a hash table, some data partitions are stored in memories. The statements for creating a hash table are provided in the Sample code section of this topic.

  3. Some other data partitions are spilled into temporary files in disks. Incremental data written to these partitions is also spilled into the corresponding temporary files. A bloom filter is created for these partitions to help efficiently filter out unmatched data in the right-side table.

  4. After the hash table is created, use the data of the right-side table to match data of the hash table.

    1. If the data partitions to query exist in the memories, the statements in the Sample code section are executed. If the data partitions are not in the memories, the bloom filter is used to check whether the queried data is in the specified partition. If yes, the data is spilled into the temporary file corresponding to the partition. If no, the data is omitted or output.

    2. After the data partitions in the memories are queried, the system queries the data partitions in disks in sequence. In this example, at least one data partition can be put into disks and does not need to be split into minor partitions. Then, the statements provided in the Sample code section of this topic are executed.

Related papers

In 2011, a paper named "Accelerating Queries with Group-By and Join by Groupjoin" (hereinafter referred to as paper_1) was published. This paper illustrates the feasibility of GroupJoin in different queries from the theoretical perspective, but does not provide enough details about the implementation of GroupJoin. The paper describes the limits of using GroupJoin and the scenarios where GroupJoin applies, such as when various aggr functions need to be run. The description in the paper is relatively abstract and not easy to understand.

In 2021, a paper named "A Practical Approach to Groupjoin and Nested Aggregates" (hereinafter referred to as paper_2) was published. This paper describes how to efficiently implement GroupJoin in in-memory databases. This paper provides the following highlights:

1. How to use GroupJoin in the algorithm of subquery decoupling

image.png

If a subquery such as GROUP BY exits, you can use MagicSet to deduplicate tables and add the combination of JOIN and GROUP BY to decouple correlated subqueries. In this scenario, GroupJoin can be used. Similar decoupling algorithms are used in IMCI-enabled operations of PolarDB. However, such execution plans in which child objects are shared cannot be generated.

2. How to perform eager aggregation

When the hash table is created, the data of the left-side table is aggregated instead of retaining the payloads for each hash table entry and aggregating data later. Eager aggregation can also be performed in IMCI-enabled operations of PolarDB.

3. How to use memoizing to resolve convention during concurrent queries and aggregation for hash table entries

In an extreme case, during a hash probe, all rows of the right-side table match the same entry in the hash table. In this case, aggregation such as SUM(2 x col) must be performed for the entry, and the aggr context must be repeatedly used in the aggregation. For example, if SUM() is run, the same sum_value is used to repeatedly sum up values. The operation performance is limited by contention even in the add-up operation for atomic variables, let alone the use of general aggr functions. To resolve the limitation, this paper introduces a solution. A CAS command is run for each entry to set an owner thread ID. If a thread fails to own an entry, a local hash table is created to perform calculations. Then, the calculated results in all local tables are aggregated into the global hash table.

4. Why GroupJoin does not apply to all scenarios

In some scenarios, the combination of JOIN and GROUP BY is more suitable than GroupJoin. The following case provides an example: The selectivity on the left side is set low. When the hash probe is complete, most rows on the left side are not selected. Therefore, a dilemma occurs.

  • If eager aggregation is performed when the hash table is built, no payloads need to be retained. This helps save memory resources, but most rows will not be selected because of low selectivity, which causes a waste in the pre-performed aggregation.

  • If no aggregation is performed in advance, more time must be spent on memory use, which could have been prevented by using eager aggregation.

Therefore, if the selectivity is low, the following method is recommended: After JOIN is complete, a few groups are generated and HASH BY GROUP is used to perform local aggregation. This paper describes how to implement this method in various scenarios. To figure out the query types suitable for specific scenarios, an optimizer is required to calculate selectivity and cardinality. This paper also recommends some calculation methods used by an optimizer.

For IMCI-enabled operations in PolarDB, the preceding dilemma is not troublesome. This is because:

  • For IMCI-enabled operations in PolarDB, the small table is usually used to create the hash table.

  • If selectivity is low and eager aggregation is performed, memory resources are saved only at the expense of wasting time on a small table.

Therefore, apart from the scenario of using RIGHT JOIN and GROUP BY RIGHT that is mentioned earlier, GroupJoin is always more efficient than the combination of HASH JOIN and HASH GROUP BY for IMCI-enabled operations in PolarDB.

Judging from the authors of and the tests in the papers, the two papers that are mentioned in this topic are both from the hyper database team of the Ludwig Maximilian University of Munich. Apart from hyper databases, no other types of databases are introduced to implement GroupJoin. However, the "shared hash table" operation can be performed in other ways, which will be discussed in the future.

Use GroupJoin in TPC-H queries

TPC-H is a benchmark that is commonly used to test the analytical processing (AP) capabilities of an AP system. GroupJoin applies to many of the 22 queries in TPC-H. However, except for TPC-H Q13, the statements of other queries must be modified to make themselves suitable for GroupJoin.

Q13

GroupJoin can be directly used in TPC-H Q13. Sample code of Q13:

select
    c_count,
    count(*) as custdist
from
    (
        select
            c_custkey,
            count(o_orderkey) as c_count
        from
            customer
            left outer join orders on c_custkey = o_custkey
            and o_comment not like '%pending%deposits%'
        group by
            c_custkey
    ) c_orders
group by
    c_count
order by
    custdist desc,
    c_count desc;
  • The following execution plan is used if GroupJoin is not used in IMCI-enabled operations:

    image.png

  • The following execution plan is used if GroupJoin is used in IMCI-enabled operations:

    image.png

Q3

The following changes must be made before GroupJoin can be used in TPC-H Q3. Sample code of Q3:

select
    l_orderkey,
    sum(l_extendedprice * (1 - l_discount)) as revenue,
    o_orderdate,
    o_shippriority
from
    customer,
    orders,
    lineitem
where
    c_mktsegment = 'BUILDING'
    and c_custkey = o_custkey
    and l_orderkey = o_orderkey
    and o_orderdate < date '1995-03-15'
    and l_shipdate > date '1995-03-15'
group by
    l_orderkey,
    o_orderdate,
    o_shippriority
order by
    revenue desc,
    o_orderdate
limit
    10;

The following execution plan for Q3 can be used in IMCI-enabled operations:DERKEY,TEMPTABLE

image.png

In this statement, the grouping keys are l_orderkey, o_orderdate, and o_shippriority. The grouping keys are different from all join keys, and therefore make GroupJoin not applicable. After a series of deductions, the following conclusions can be drawn:

  1. INNER JOIN is performed on the lineitem table and the orders table. The JOIN predicate is l_orderkey=o_orderkey. Therefore, in the result set of the JOIN operation, l_orderkey equals o_orderkey.

  2. Because l_orderkey equals o_orderkey, GROUP BY l_orderkey, o_orderdate, o_shippriority is equivalent to GROUP BY o_orderkey, o_orderdate, o_shippriority.

  3. Because o_orderkey is the primary key of the orders table, o_orderdate and o_shippriority are functionally dependent on each o_orderkey.

  4. Because of the previous conclusion, GROUP BY o_orderkey, o_orderdate, o_shippriority is equivalent to GROUP BY o_orderkey.

Therefore, after you replace the GROUP BY clause in Q3 with GROUP BY o_orderkey, GroupJoin can be used. Sample code: KEY,TEMPTABLE3.SUM(LINETTEM.EXTENDEDPRTCE*1.00-LUNETEM._DLSCOL

image.png

The deduction about functional dependency brings requirements on optimizers. The MySQL optimizer can help deduct part of the functional dependency logic, but cannot help draw the conclusion of using GROUP BY o_orderkey as an equivalence. SQL SERVER can help deduct the use of GROUP BY o_orderkey, which is sufficiently proved in theory but not completely proved in IMCI-enabled practices. This is also the case for Q3, Q4, Q10, Q13, Q18, Q20, and Q21 of TPC-H. If the deduction can be fully made, the grouping keys of GROUP BY are shortened and data aggregation speeds up.

Q10

GroupJoin cannot be directly used in TPC-H Q10. Sample code of Q10:

select
    c_custkey,
    c_name,
    sum(l_extendedprice * (1 - l_discount)) as revenue,
    c_acctbal,
    n_name,
    c_address,
    c_phone,
    c_comment
from
    customer,
    orders,
    lineitem,
    nation
where
    c_custkey = o_custkey
    and l_orderkey = o_orderkey
    and o_orderdate >= date '1993-10-01'
    and o_orderdate < date '1993-10-01' + interval '3' month
    and l_returnflag = 'R'
    and c_nationkey = n_nationkey
group by
    c_custkey,
    c_name,
    c_acctbal,
    c_phone,
    n_name,
    c_address,
    c_comment
order by
    revenue desc
limit
    20;

You must make the following modifications before you use GroupJoin in Q10.

  1. Replace grouping keys with c_custkey, which is the primary key of the customer table. The procedure is similar to that in the Q3 section.

  2. The join order must be adjusted so that the JOIN operation of the customer table is at the outermost layer.

The first modification always has positive effects, but the second modification sometimes has side effects.

Q17

TPC-H Q17 contains correlated subqueries. Sample code of Q17:

select
    sum(l_extendedprice) / 7.0 as avg_yearly
from
    lineitem,
    part
where
    p_partkey = l_partkey
    and p_brand = 'Brand#44'
    and p_container = 'WRAP PKG'
    and l_quantity < (
        select
            0.2 * avg(l_quantity)
        from
            lineitem
        where
            l_partkey = p_partkey
    );

Several methods can be used to decouple subqueries. The following two figures show the two execution plans that can be used based on the two decoupling algorithms for scalar aggr in IMCI-enabled scenarios.

image.pngRE

image.png

These execution plans do not apply to the GroupJoin operator. If you use MagicSet to decouple the subqueries, an intermediate state can be entered before you remove MagicSet. The state is suitable for GroupJoin.

image.png

The process is also shown in Figure 3 of paper_2. NERALNESTING:DEEORRELATIONOFDEPENDENTSUB-

image.png

In this case, GroupJoin can be used. PolarDB partly supports MagicSet as a decoupling method in IMCI-enabled scenarios. The support is not complete, because execution plans in which child objects are shared cannot be generated. Therefore, GroupJoin cannot be used for TPC-H Q17 in IMCI-enabled scenarios.

Q18

GroupJoin can be used in TPC-H Q18 only if the execution plan is modified. In this example, the IN subquery and the ORDER BY clause are omitted to simplify and generalize the illustration. Sample code of Q18:

select
    c_name,
    c_custkey,
    o_orderkey,
    o_orderdate,
    o_totalprice,
    sum(l_quantity)
from
    customer,
    orders,
    lineitem
where
    c_custkey = o_custkey
    and o_orderkey = l_orderkey
group by
    c_name,
    c_custkey,
    o_orderkey,
    o_orderdate,
    o_totalprice

For TPC-H Q18, the following modifications can be made:

  1. Because c_custkey is the primary key of the customer table, c_name can be functionally dependent on c_custkey. Similarly, o_orderkey is the primary key of the orders table, so o_orderdate and o_totalprice can be functionally dependent on o_orderkey. Therefore, the GROUP BY clause in Q18 can be equivalently converted to GROUP BY c_custkey, o_orderkey.

  2. The JOIN predicate for the customer table and the orders table is c_custkey=o_custkey, so the joined results are rows where c_custkey equals o_custkey.

  3. Because c_custkey equals o_custkey, the GROUP BY clause can be equivalently converted to GROUP BY o_custkey, o_orderkey.

  4. Because o_orderkey uniquely corresponds to o_custkey, GROUP BY o_custkey, o_orderkey can be converted to GROUP BY o_orderkey.

The following sample code provides an example on the query after being modified:

select
    ANY_VALUE(c_name),
    ANY_VALUE(c_custkey),
    o_orderkey,
    ANY_VALUE(o_orderdate),
    ANY_VALUE(o_totalprice),
    sum(l_quantity)
from
    customer,
    orders,
    lineitem
where
    c_custkey = o_custkey
    and o_orderkey = l_orderkey
group by
    o_orderkey
  • The following execution is used if GroupJoin is not used:

    image.png

  • The following execution is used if GroupJoin is used:

    image.png

The modified query can be used to formulate regular execution plans because the grouping keys of GROUP BY are shortened.

Q20

The correlated subqueries in TPC-H Q20 are similar to those in Q17. MagicSet is used to decouple subqueries. Before MagicSet is removed, an intermediate state that is suitable for GroupJoin is entered.

select
...
and ps_availqty > (
    select
        0.5 * sum(l_quantity) < ! --- scalar aggr --->
    from
        lineitem
    where
        l_partkey = ps_partkey         < ! --- Correlated item 1 --->
        and l_suppkey = ps_suppkey     < ! --- Correlated item 2 --->
        and l_shipdate >= '1993-01-01'
        and l_shipdate < date_add('1993-01-01', interval '1' year)
)

Other queries

According to paper_1 and paper_2, GroupJoin can be used in Q5, Q9, Q16, and Q21 after the queries are modified. However, the modification methods for the four queries are not described in the papers. Execution plans for GroupJoin are also not provided on the HyPer WebInterface webpage of the hyper database team (https://hyper-db.de/interface.html#).

Query performance

Many TPC-H queries contain JOIN and GROUP BY. Such queries can be optimized by using GroupJoin. In paper_1, the authors compare the query performance before and after using GroupJoin in the following queries: Q3, Q5, Q9, Q10, Q13, Q16, Q17, Q20, and Q21.

image.png

In the comparison, 1GB of data is used. In general, the latency of TPC-H queries is reduced from 1,932 ms to 1,295 ms after GroupJoin is used.

In paper_2, the authors use 1 GB of data to compare the query performance before and after using GroupJoin in Q3, Q13, Q17, and Q18.

image.png

The following information describes the meaning of lines in the preceding line charts:

  1. The separate line represents the scenario where JOIN and GROUP BY are used.

  2. The eager line represents the scenario where eager aggregation is performed.

  3. The memoizing line represents the scenario where memoizing is used for optimization, which is described in a previous section. The line charts show the following phenomena in Q3, Q13, Q17, and Q18:

    1. The query performance after the use of memoizing is similar to the performance when HASH JOIN and HASH GROUP BY are used.

    2. Eager aggregation delivers the best query performance in only Q13.

The line charts prove that the performance of a query method varies in different query scenarios. This agrees with the idea from the paper that statistics must be provided for optimizers to select the optimal execution plan for GroupJoin. According to the paper, optimizers help ensure that GroupJoin and the corresponding algorithms are properly used.

However, the experts of PolarDB disagree with the idea from the paper in the following aspects:

  1. The paper uses throughput (tuples/s) to measure the quality of algorithms. The indicator is also used in the experiment of PolarDB for IMCI-enabled scenarios, but produces a different result. The following table describes the throughput of GroupJoin in Q3, Q13, and Q18.

    Query

    HASH JOIN + HASH GROUP BY

    GroupJoin

    Q3

    130 MB

    152 MB

    Q13

    11 MB

    33 MB

    Q18

    315 MB

    1 GB

    Note

    GroupJoin does not apply to Q17 in IMCI-enabled scenarios.

    The volume of data used in the experiments of the paper and PolarDB is the same, but the throughput for each query is different in the two experiments. This may be caused by the differences in implementation methods. The experiment of PolarDB shows that GroupJoin improves query performance except when the combination of RIGHT JOIN and GROUP BY RIGHT is used.

  2. In terms of the conclusion described in 3.a, the experiment of PolarDB shows that only a small amount of contention exists in TPC-H queries. Therefore, local hash tables are seldom used in actual scenarios, which makes the memoizing-enabled method deliver similar query performance as the HASH JOIN + HASH GROUP BY method. The comparison in paper_2 cannot demonstrate the effectiveness of the memoizing-enabled method. In the experiment of PolarDB, contention is measured when a direct lock is performed.

Conclusion

GroupJoin helps eliminate repeated work during runtime, and therefore improves query performance in some scenarios. This effect is verified in actual use. From this perspective, GroupJoin can be useful.

However, GroupJoin is not suitable for all query scenarios. GroupJoin is applicable only when the following conditions are met: Both EQUAL JOIN and GROUP BY are performed, the grouping keys are the same as the join keys on one side, and specific limits are set on aggr functions and implementation methods. This makes GroupJoin execution plans hard to implement and maintain. For developers, they may prefer a method that can be widely used to improve SQL query efficiency in all scenarios. From this perspective, GroupJoin is not as effective as expected.

When you use GroupJoin, we recommend that you generalize the execution plans of GroupJoin to let them apply to common scenarios. This way, the usage of GroupJoin can be maximized.