You can use semijoins to optimize subqueries. This reduces the number of queries and improves query performance. This topic describes the information about semijoins and how to use semijoins in parallel queries.

Prerequisites

The version of the PolarDB cluster is ApsaraDB PolarDB MySQL-compatible edition 8.0 and the revision version is 8.0.1.1.2 or later. For more information about how to check the version, see Query the kernel version number.

Background information

MySQL 5.6.5 supports semijoins. If a row in the outer table matches one or more rows in the inner table, the row of the outer table is returned. Only the row that exists in the outer table is returned even if multiple rows are matched in the inner table. For the queries with subqueries, one subquery is performed for each tuple of the outer table that matches the conditions. This reduces query efficiency. In this case, you can use semijoins to optimize subqueries. This reduces the number of queries and improves query performance. When semijions are used, a subquery is converted to a join, the inner table is pulled out to the outer query. This way, the inner and outer tables are in parallel. The result is returned if one tuple match is found between the inner table and the outer table. This greatly improves query efficiency.

1

Semijoin strategies

Semijoins use the following strategies:

  • DuplicateWeedout Strategy

    A temporary table that uses the row ID as the unique ID is created. The system runs the semijoin as if it were a join and removes duplicate records by using the row ID.

    2
  • Materialization Strategy

    The nested tables are materialized into an indexed temporary table that is used to perform a join, where the index is used to remove duplicates. The index might also be used later for lookups when joining the temporary table with the outer tables.

    3
  • Firstmatch Strategy

    When the system scans the inner tables for row combinations and there are multiple instances of a given value group, choose one rather than returning them all. This shortcuts scanning and eliminates duplicates.

    4
  • LooseScan Strategy

    The system scans a subquery table by using an index that enables a single value to be chosen from each value group of the subquery.

    5

Syntax

Semijoins use the IN or EXISTS clause as join conditions.

  • IN
    SELECT *
    FROM Employee
    WHERE DeptName IN (
      SELECT DeptName
      FROM Dept
    )
  • EXISTS
    SELECT *
    FROM Employee
    WHERE EXISTS (
      SELECT 1
      FROM Dept
      WHERE Employee.DeptName = Dept.DeptName
    )

Improve parallel performance for semijoins

PolarDB supports all semijoin strategies and accelerates the parallel queries that use semijoin strategies. PolarDB splits semijoin tasks and uses a multi-threading model to run task sets in parallel. This enhances capabilities to remove duplicates and improves query performance. TPC Benchmark™H (TPC-H) provides 22 SQL queries. In the following example, the twentieth SQL query is used.

SELECT
   s_name,
   s_address 
FROM
   supplier,
   nation 
WHERE
   s_suppkey IN 
   (
      SELECT
         ps_suppkey 
      FROM
         partsupp 
      WHERE
         ps_partkey IN 
         (
            SELECT
               p_partkey 
            FROM
               part 
            WHERE
               p_name LIKE '[COLOR]%' 
         )
         AND ps_availqty > ( 
         SELECT
            0.5 * SUM(l_quantity) 
         FROM
            lineitem 
         WHERE
            l_partkey = ps_partkey 
            AND l_suppkey = ps_suppkey 
            AND l_shipdate >= date('[DATE]') 
            AND l_shipdate < date('[DATE]') + interval '1' year ) 
   )
   AND s_nationkey = n_nationkey 
   AND n_name = '[NATION]' 
ORDER BY
   s_name;

In the example, the materialization is first processed and a degree of parallelism (DOP) of 32 is used. The materialized tables are used in the subsequent process. This makes full use of the processing capabilities of the CPUs and maximizes the parallel processing capabilities of outer queries. In the following execution plan, 1 GB of data that is represented by TPC-H scale factor 1 is used. The parallel processing capabilities are achieved in the materialization and semijoin processes after parallel scans are used.

2

The following figure shows the execution time of the statement when the sequential query is performed.

1

The following figure shows the execution time of the statement when the parallel query is used.

2

In the following custom SQL statements, parallel semijoins are used and max_parallel_degree is set to 32. In this case, the execution time of the statements is reduced from 2.59 seconds to 0.34 seconds.

mysql> SELECT c1,d1 FROM t1 WHERE c1 IN ( SELECT t2.c1 FROM t2 WHERE t2.c1 = 'f'      OR t2.c2 < 'y' ) AND t1.c1 AND d1 > '1900-1-1' LIKE "R1%" ORDER BY t1.c1 DESC, t1.d1 DESC;
Empty set, 1024 warnings (0.34 sec)
mysql> SET max_parallel_degree=0;
Query OK, 0 rows affected (0.00 sec)
mysql>  SELECT c1,d1 FROM t1 WHERE c1 IN ( SELECT t2.c1 FROM t2 WHERE t2.c1 = 'f'      OR t2.c2 < 'y' ) AND t1.c1 AND d1 > '1900-1-1' LIKE "R1%" ORDER BY t1.c1 DESC, t1.d1 DESC;
Empty set, 65535 warnings (2.69 sec)
mysql> EXPLAIN SELECT c1,d1 FROM t1 WHERE c1 IN ( SELECT t2.c1 FROM t2 WHERE t2.c1 = 'f'      OR t2.c2 < 'y' ) AND t1.c1 AND d1 > '1900-1-1' LIKE "R1%" ORDER BY t1.c1 DESC, t1.d1 DESC;
+----+--------------+-------------+------------+--------+---------------+------------+---------+----------+--------+----------+---------------------------------------------------------+
| id | select_type  | table       | partitions | type   | possible_keys | key        | key_len | ref      | rows   | filtered | Extra                                                   |
+----+--------------+-------------+------------+--------+---------------+------------+---------+----------+--------+----------+---------------------------------------------------------+
|  1 | SIMPLE       | <gather1>   | NULL       | ALL    | NULL          | NULL       | NULL    | NULL     |  33464 |   100.00 | Merge sort                                              |
|  1 | SIMPLE       | t1          | NULL       | ALL    | NULL          | NULL       | NULL    | NULL     |  62802 |    30.00 | Parallel scan (32 workers); Using where; Using filesort |
|  1 | SIMPLE       | <subquery2> | NULL       | eq_ref | <auto_key>    | <auto_key> | 103     | sj.t1.c1 |      1 |   100.00 | NULL                                                    |
|  2 | MATERIALIZED | t2          | p0,p1      | ALL    | c1,c2         | NULL       | NULL    | NULL     | 100401 |    33.33 | Using where                                             |
+----+--------------+-------------+------------+--------+---------------+------------+------