This topic describes how to use JOINs and subqueries. A JOIN is a process of joining multiple tables based on one or more columns to retrieve associated data. These tables are associated with each other based on the same column. A subquery is a query in which another SELECT statement is nested within the WHERE or HAVING clause of the parent query.

Basic concepts

JOIN operations are common in SQL queries. Logically, the semantics of a JOIN operation is equivalent to producing the Cartesian product of two tables and then retaining the data that meets filter conditions. In most cases, JOIN operations are those performed based on equality conditions, that is, Equi-Join. An Equi-Join operation joins the data of two tables based on the value of the specified column.

A subquery is a query block that is nested within an SQL statement. The result of the subquery is populated to an outer query as the input to compute the result of the outer query. The subquery can appear in various positions in an SQL statement. For example, the subquery can serve as the output data in the SELECT clause, an input view in the FROM clause, and a filter condition in the WHERE clause.

This topic describes JOIN operators that are not pushed down. If a JOIN operator is pushed down to LogicalView, the storage layer MySQL determines how to execute the JOIN operator.

JOIN types

DRDS supports three general JOIN types: Inner Join, Left Outer Join, and Right Outer Join.

JOIN types

The following examples show different types of JOIN:

/* Inner Join */
SELECT * FROM A, B WHERE A.key = B.key;
/* Left Outer Join */
SELECT * FROM A LEFT JOIN B ON A.key = B.key;
/* Right Outer Join */
SELECT * FROM A RIGHT OUTER JOIN B ON A.key = B.key;

In addition, DRDS supports Semi-Join and Anti-Join. Semi Join and Anti Join cannot be directly expressed in SQL statements, but are usually obtained by converting EXISTS or IN subqueries that contain correlated items.

The following examples show Semi-Join and Anti-Join:

/* Semi Join - 1 */
SELECT * FROM Emp WHERE Emp.DeptName IN (
   SELECT DeptName FROM Dept
)
 /* Semi Join - 2 */
SELECT * FROM Emp WHERE EXISTS (
  SELECT * FROM Dept WHERE Emp.DeptName = Dept.DeptName
)
/* Anti Join - 1 */
SELECT * FROM Emp WHERE Emp.DeptName NOT IN (
   SELECT DeptName FROM Dept
)
 /* Anti Join - 2 */
SELECT * FROM Emp WHERE NOT EXISTS (
  SELECT * FROM Dept WHERE Emp.DeptName = Dept.DeptName
)

JOIN algorithms

DRDS supports JOIN algorithms, such as Nested-Loop Join, Hash Join, Sort-Merge Join, and Lookup Join (BKAJoin).

Nested-Loop Join (NLJoin)

Nested-Loop Join is usually used for non-equi joins. Nested-Loop Join works in the following ways:

  1. Pull all the data from an inner table (right table that usually has a smaller amount of data) and cache the data in the memory.
  2. Traverse the data of an outer table. For each row in the outer table:
    • For each record of inner table data that is cached in the memory:
    • Construct a result row and check whether JOIN conditions are met. If the conditions are met, return the result.

The following example shows Nested-Loop Join:

> EXPLAIN SELECT * FROM partsupp, supplier WHERE ps_suppkey < s_suppkey;

NlJoin(condition="ps_suppkey < s_suppkey", type="inner")
  Gather(concurrent=true)
    LogicalView(tables="partsupp_[0-7]", shardCount=8, sql="SELECT * FROM `partsupp` AS `partsupp`")
  Gather(concurrent=true)
    LogicalView(tables="supplier_[0-7]", shardCount=8, sql="SELECT * FROM `supplier` AS `supplier`")

Generally, Nested-Loop Join is the least efficient JOIN operation. It is used only when JOIN conditions do not include equality conditions (as shown in the preceding example) or the inner table contains an extremely small amount of data.

You can use the following Hint to force DRDS to use Nested-Loop Join and determine the JOIN order:

/*+TDDL:NL_JOIN(outer_table, inner_table)*/ SELECT ...

where, inner_table and outer_table can also be the JOIN result of multiple tables. The following example is provided:

/*+TDDL:NL_JOIN((outer_table_a, outer_table_b), (inner_table_c, inner_table_d))*/ SELECT ...

The following other Hints work in the same way.

Hash Join

Hash Join is one of the most general JOIN algorithms for Equi-Join. Hash Join works in the following ways:

  • Pull all the data from the inner table (right table that usually has a smaller amount of data) and write the data into a hash table in the memory.
  • Traverse the data of the outer table. For each row in the outer table:
    • Query the hash table based on the JOIN Key that is specified in the equality condition to retrieve 0 to N matched rows that have the same JOIN Key.
    • Construct a result row and check whether JOIN conditions are met. If the conditions are met, return the result.

The following example shows Hash Join:

> EXPLAIN SELECT * FROM partsupp, supplier WHERE ps_suppkey = s_suppkey;

HashJoin(condition="ps_suppkey = s_suppkey", type="inner")
  Gather(concurrent=true)
    LogicalView(tables="partsupp_[0-7]", shardCount=8, sql="SELECT * FROM `partsupp` AS `partsupp`")
  Gather(concurrent=true)
    LogicalView(tables="supplier_[0-7]", shardCount=8, sql="SELECT * FROM `supplier` AS `supplier`")

Hash Join is often used in complex queries that join large amounts of data and cannot be improved by index lookup. In this case, Hash Join is the optimal choice. For example, HashJoin is suitable for the preceding example because a full table scan is performed on the partsupp and supplier tables and a large amount of data is scanned.

The inner table of Hash Join must be used to construct a hash table in the memory. Therefore, the inner table usually contains a smaller amount of data than the outer table. In most cases, the optimizer can automatically select the optimal JOIN order. If manual control is required, you can use the following Hint.

You can use the following Hint to force DRDS to use Hash Join and determine the JOIN order:

/*+TDDL:HASH_JOIN(table_outer, table_inner)*/ SELECT ...

Lookup Join (BKAJoin)

Lookup Join is another general JOIN algorithm for Equi-Join and is usually used in scenarios where a small amount of data is involved. Lookup Join works in the following ways:

  1. Traverse the outer table (left table that usually has a smaller amount of data). For each batch (for example, every 1,000 rows) of data in the outer table:
  2. Splice JOIN Keys of this batch of data into an IN (....) condition, and add the condition to the query on the inner table.
  3. Run the query on the inner table to obtain the rows that match the JOIN condition.
  4. For each row in the outer table, find the matched row in the inner table based on the hash table. Combine and return the results.

The following example shows Lookup Join (BKAJoin):

> EXPLAIN SELECT * FROM partsupp, supplier WHERE ps_suppkey = s_suppkey AND ps_partkey = 123;

BKAJoin(condition="ps_suppkey = s_suppkey", type="inner")
  LogicalView(tables="partsupp_3", sql="SELECT * FROM `partsupp` AS `partsupp` WHERE (`ps_partkey` = ?)")
  Gather(concurrent=true)
    LogicalView(tables="supplier_[0-7]", shardCount=8, sql="SELECT * FROM `supplier` AS `supplier` WHERE (`s_suppkey` IN ('?'))")

Lookup Join is usually used when the outer table contains a relatively small amount of data. For example, in the preceding example, only a few rows of data is pulled from the left table named partsupp due to the filter condition ps_partkey = 123. In addition, the s_suppkey IN ( ... ) query on the right table hits the primary key index. This further reduces the query cost of Lookup Join.

You can use the following Hint to force DRDS to use LookupJoin and determine the JOIN order:

/*+TDDL:BKA_JOIN(table_outer, table_inner)*/ SELECT ...
Note The inner table for Lookup Join can be only a single table rather than the JOIN result of multiple tables.

Sort-Merge Join

Sort-Merge Join is another JOIN algorithm for Equi-Join. Sort-Merge Join depends on the order of inputs from the left and right sides. The inputs must be sorted by JOIN Key. Sort-Merge Join works in the following ways:

  1. Before Sort-Merge Join starts, the inputs must be sorted by using MergeSort or MemSort.
  2. Compare the input rows in the left table and the right table, and continuously consume the inputs from the left and right sides in the following way:
    • If the JOIN Key in the left table is smaller, consume the next data entry in the left table.
    • If the JOIN Key in the right table is smaller, consume the next data entry in the right table.
    • If JOIN Keys in the left table and the right table are equal, one or more data entries are matched. Check whether the JOIN conditions are met and return the result.

The following example shows Sort-Merge Join:

> EXPLAIN SELECT * FROM partsupp, supplier WHERE ps_suppkey = s_suppkey ORDER BY s_suppkey;

SortMergeJoin(condition="ps_suppkey = s_suppkey", type="inner")
  MergeSort(sort="ps_suppkey ASC")
    LogicalView(tables="QIMU_0000_GROUP,QIMU_0001_GROUP.partsupp_[0-7]", shardCount=8, sql="SELECT * FROM `partsupp` AS `partsupp` ORDER BY `ps_suppkey`")
  MergeSort(sort="s_suppkey ASC")
    LogicalView(tables="QIMU_0000_GROUP,QIMU_0001_GROUP.supplier_[0-7]", shardCount=8, sql="SELECT * FROM `supplier` AS `supplier` ORDER BY `s_suppkey`")

Pay attention to the MergeSort operator in the preceding execution plan and ORDER BY that is pushed down. This ensures that the inputs from the left and right sides in Sort-Merge Join are sorted by JOIN Key, that is, s_suppkey (ps_suppkey).

Generally, Sort-Merge Join is not optimal because Sort-Merge Join requires additional sorting steps. However, in some cases where client queries require sorting by JOIN Key (as shown in the preceding example), Sort-Merge Join is a good choice.

You can use the following Hint to force DRDS to use Sort-Merge Join:

/*+TDDL:SORT_MERGE_JOIN(table_a, table_b)*/ SELECT ...

JOIN orders

In multi-table join scenarios, one of the most important tasks for the optimizer is to determine the join order of tables. This is because the sizes of intermediate result sets vary based on the join orders and the overall cost of the plan is affected.

For example, for a JOIN on four tables (pushdown is not considered), JOIN Tree has the following three forms, and 4! = 24 sort orders are available for the tables. In this case, a total of 72 possible JOIN orders are available.

JOIN orders

For a given JOIN on N tables,DRDS generates an optimal JOIN plan by using adaptive policies:

  • If the value of N (not pushed down) is relatively small, an optimal plan is selected from all the JOIN orders by using the enumeration strategy based on Bushy.
  • If the number of tables that are not pushed down is relatively large, an optimal execution plan for Zig-Zag or Left-Deep is selected by using the enumeration strategy based on Zig-Zag (zigzag) or Left-Deep (left-deep tree). This way, the times and cost of enumerations can be reduced.

DRDS uses a Cost-Based Optimizer (CBO) to select the JOIN order that has the minimum total cost. For more information, see Introduction to the query optimizer.

In addition, different JOIN algorithms have different preferences for inputs from the left and right sides. For example, in Hash Join, the right table is used as the inner table to create a hash table. Therefore, you must put the table that has a smaller amount of data on the right side. The CBO also takes these preferences into account.

Subqueries

Subqueries can be classified into non-correlated subqueries and correlated subqueries based on whether a correlated item exists. A non-correlated subquery is executed independently of the variables of its outer query and is usually computed only once. However, a correlated subquery includes variables that are referenced from an outer query and logically requires input of the corresponding variables each time. Therefore, such a subquery is computed multiple times.

/* Example: non-correlated subquery */
SELECT * FROM lineitem WHERE l_partkey IN (SELECT p_partkey FROM part);

/* Example: correlated subquery for which l_suppkey is the correlated item */
SELECT * FROM lineitem WHERE l_partkey IN (SELECT ps_partkey FROM partsupp WHERE ps_suppkey = l_suppkey);

DRDS supports most syntax of subqueries. For more information, see SQL limits.

For most general subquery forms,DRDS can rewrite these forms into efficient SemiJoin or similar JOIN-based compute ways. The benefits are obvious. If a large amount of data exists, different parameters do not need to be specified for loop iterations. This significantly reduces the execution cost. This query rewrite technology is called subquery unnesting.

The following two examples show subquery unnesting. You can see that subqueries are replaced with JOIN in the execution plans.

> EXPLAIN SELECT p_partkey, (
      SELECT COUNT(ps_partkey) FROM partsupp WHERE ps_suppkey = p_partkey
      ) supplier_count FROM part;
Project(p_partkey="p_partkey", supplier_count="CASE(IS NULL($10), 0, $9)", cor=[$cor0])
  HashJoin(condition="p_partkey = ps_suppkey", type="left")
    Gather(concurrent=true)
      LogicalView(tables="part_[0-7]", shardCount=8, sql="SELECT * FROM `part` AS `part`")
    Project(count(ps_partkey)="count(ps_partkey)", ps_suppkey="ps_suppkey", count(ps_partkey)2="count(ps_partkey)")
      HashAgg(group="ps_suppkey", count(ps_partkey)="SUM(count(ps_partkey))")
        Gather(concurrent=true)
          LogicalView(tables="partsupp_[0-7]", shardCount=8, sql="SELECT `ps_suppkey`, COUNT(`ps_partkey`) AS `count(ps_partkey)` FROM `partsupp` AS `partsupp` GROUP BY `ps_suppkey`")
> EXPLAIN SELECT p_partkey, (
      SELECT COUNT(ps_partkey) FROM partsupp WHERE ps_suppkey = p_partkey
      ) supplier_count FROM part;

Project(p_partkey="p_partkey", supplier_count="CASE(IS NULL($10), 0, $9)", cor=[$cor0])
  HashJoin(condition="p_partkey = ps_suppkey", type="left")
    Gather(concurrent=true)
      LogicalView(tables="part_[0-7]", shardCount=8, sql="SELECT * FROM `part` AS `part`")
    Project(count(ps_partkey)="count(ps_partkey)", ps_suppkey="ps_suppkey", count(ps_partkey)2="count(ps_partkey)")
      HashAgg(group="ps_suppkey", count(ps_partkey)="SUM(count(ps_partkey))")
        Gather(concurrent=true)
          LogicalView(tables="partsupp_[0-7]", shardCount=8, sql="SELECT `ps_suppkey`, COUNT(`ps_partkey`) AS `count(ps_partkey)` FROM `partsupp` AS `partsupp` GROUP BY `ps_suppkey`")

In rare cases,DRDS cannot unnest subqueries. In this case, an iterative execution method is used. If a large amount of data is to be queried for the outer query, iterative execution may be very slow.

In the following example, the subquery cannot be unnested due to the condition OR l_partkey < 50. Therefore, the iterative execution method is used.

> EXPLAIN SELECT * FROM lineitem WHERE l_partkey IN (SELECT ps_partkey FROM partsupp WHERE ps_suppkey = l_suppkey) OR l_partkey IS NOT

Filter(condition="IS(in,[$1])[29612489] OR l_partkey < ? 0")
  Gather(concurrent=true)
    LogicalView(tables="QIMU_0000_GROUP,QIMU_0001_GROUP.lineitem_[0-7]", shardCount=8, sql="SELECT * FROM `lineitem` AS `lineitem`")

>> individual correlate subquery : 29612489
Gather(concurrent=true)
  LogicalView(tables="QIMU_0000_GROUP,QIMU_0001_GROUP.partsupp_[0-7]", shardCount=8, sql="SELECT * FROM (SELECT `ps_partkey` FROM `partsupp` AS `partsupp` WHERE (`ps_suppkey` = `l_suppkey`)) AS `t0` WHERE (((`l_partkey` = `ps_partkey`) OR (`l_partkey` IS NULL)) OR (`ps_partkey` IS NULL))")

In this case, we recommend that you rewrite the SQL statement to delete the OR condition from the subquery.