You can use semijoins to optimize subqueries. Semijoins can be used to reduce the number of queries and improve query performance. This topic provides basic information about semijoins. This topic also describes how to use semijoins in parallel queries.

Prerequisites

Your PolarDB cluster is an cluster of PolarDB for MySQL that runs MySQL 8.0 and the revision version of your cluster is V8.0.1.1.2 or later. For more information about how to view the version, see Query the kernel version number.

Background information

MySQL 5.6.5 and later support semijoins. A semijoin returns a row from an outer table if the row in the outer table matches at least one row in an inner table. If the row in the outer table matches more than one row in the inner table, the semijoin returns the row only once. For a query that contains subqueries and does not use semijoins, subqueries are performed for each tuple that matches the conditions in the outer table. This reduces query efficiency. To optimize subqueries, you can use semijoins to reduce the number of subqueries and improve query performance. When semijoins are used, a subquery is converted to a join and the inner table is pulled out to the outer query. This way, the inner and outer tables are in parallel. After the system finds a tuple match between the inner table and the outer table, the system returns the result. This significantly improves query efficiency.

1

Semijoin strategies

The following semijoin strategies are used:

  • DuplicateWeedout Strategy

    A temporary table that uses the row IDs as the unique IDs is created. The system runs the semijoin in the same manner as if the system ran a join and removes duplicates by using the row IDs.

    2
  • Materialization Strategy

    Nested tables are materialized into an indexed temporary table that is used to perform a join, and the index is used to remove duplicates. The index can also be used later for lookups when the system joins the temporary table with the outer tables.

    3
  • Firstmatch Strategy

    When the system scans the inner tables for row combinations and multiple instances of a given value group exist, you must select one value rather than returning all values. This way, all values are scanned and duplicates are eliminated.

    4
  • LooseScan Strategy

    The system scans a subquery table by using an index based on which a single value can be selected from each value group of the subquery.

    5

Syntax

In most cases, a semijoin uses the IN or EXISTS clause as the join condition.

  • IN
    SELECT *
    FROM Employee
    WHERE DeptName IN (
      SELECT DeptName
      FROM Dept
    )
  • EXISTS
    SELECT *
    FROM Employee
    WHERE EXISTS (
      SELECT 1
      FROM Dept
      WHERE Employee.DeptName = Dept.DeptName
    )

Run semijoins in parallel to improve performance

PolarDB supports all semijoin strategies and accelerates the parallel queries that use semijoin strategies. In this way, each semijoin task is split into multiple subtasks and a multi-threading model is used to run the subtasks in parallel. This improves the capabilities of removing duplicates and the query performance. TPC Benchmark™H (TPC-H) provides 22 SQL queries. In the following example, the twentieth SQL query is used.

SELECT
   s_name,
   s_address 
FROM
   supplier,
   nation 
WHERE
   s_suppkey IN 
   (
      SELECT
         ps_suppkey 
      FROM
         partsupp 
      WHERE
         ps_partkey IN 
         (
            SELECT
               p_partkey 
            FROM
               part 
            WHERE
               p_name LIKE '[COLOR]%' 
         )
         AND ps_availqty > ( 
         SELECT
            0.5 * SUM(l_quantity) 
         FROM
            lineitem 
         WHERE
            l_partkey = ps_partkey 
            AND l_suppkey = ps_suppkey 
            AND l_shipdate >= date('[DATE]') 
            AND l_shipdate < date('[DATE]') + interval '1' year ) 
   )
   AND s_nationkey = n_nationkey 
   AND n_name = '[NATION]' 
ORDER BY
   s_name;

In the example, the system first materializes the nested tables at a degree of parallelism (DOP) of 32. The materialized tables are used in the subsequent process. This ensures full use of the CPU processing capabilities and maximizes the parallel processing capabilities of outer queries. In the following execution plan, 1 GB of data represented by TPC-H scale factor 1 is used. After parallel semijoins are enabled, the parallel processing capabilities are achieved in the materialization and semijoin processes.

Note This implementation of TPC-H is derived from the TPC-H Benchmark and is not comparable to published TPC-H Benchmark results, as this implementation does not comply with all the requirements of the TPC-H Benchmark.
2

The following figure shows the amount of time consumed to execute the statements in a serialized manner.

1

The following figure shows the amount of time consumed to execute the statements in parallel.

2

For example, parallel semijoins are used to execute the following statements and max_parallel_degree is set to 32. In this example, the amount of time consumed to execute the statements is reduced from 2.59s to 0.34s.

mysql> SELECT c1,d1 FROM t1 WHERE c1 IN ( SELECT t2.c1 FROM t2 WHERE t2.c1 = 'f'      OR t2.c2 < 'y' ) AND t1.c1 AND d1 > '1900-1-1' LIKE "R1%" ORDER BY t1.c1 DESC, t1.d1 DESC;
Empty set, 1024 warnings (0.34 sec)
mysql> SET max_parallel_degree=0;
Query OK, 0 rows affected (0.00 sec)
mysql>  SELECT c1,d1 FROM t1 WHERE c1 IN ( SELECT t2.c1 FROM t2 WHERE t2.c1 = 'f'      OR t2.c2 < 'y' ) AND t1.c1 AND d1 > '1900-1-1' LIKE "R1%" ORDER BY t1.c1 DESC, t1.d1 DESC;
Empty set, 65535 warnings (2.69 sec)
mysql> EXPLAIN SELECT c1,d1 FROM t1 WHERE c1 IN ( SELECT t2.c1 FROM t2 WHERE t2.c1 = 'f'      OR t2.c2 < 'y' ) AND t1.c1 AND d1 > '1900-1-1' LIKE "R1%" ORDER BY t1.c1 DESC, t1.d1 DESC;
+----+--------------+-------------+------------+--------+---------------+------------+---------+----------+--------+----------+---------------------------------------------------------+
| id | select_type  | table       | partitions | type   | possible_keys | key        | key_len | ref      | rows   | filtered | Extra                                                   |
+----+--------------+-------------+------------+--------+---------------+------------+---------+----------+--------+----------+---------------------------------------------------------+
|  1 | SIMPLE       | <gather1>   | NULL       | ALL    | NULL          | NULL       | NULL    | NULL     |  33464 |   100.00 | Merge sort                                              |
|  1 | SIMPLE       | t1          | NULL       | ALL    | NULL          | NULL       | NULL    | NULL     |  62802 |    30.00 | Parallel scan (32 workers); Using where; Using filesort |
|  1 | SIMPLE       | <subquery2> | NULL       | eq_ref | <auto_key>    | <auto_key> | 103     | sj.t1.c1 |      1 |   100.00 | NULL                                                    |
|  2 | MATERIALIZED | t2          | p0,p1      | ALL    | c1,c2         | NULL       | NULL    | NULL     | 100401 |    33.33 | Using where                                             |
+----+--------------+-------------+------------+--------+---------------+------------+------