Subqueries that use IN or EXISTS clauses can cause redundant execution: without optimization, the database evaluates the subquery once per matching row in the outer table. Semijoins eliminate this by converting the subquery into a join that returns each outer row at most once, regardless of how many inner rows match. PolarDB extends this with parallel execution, running each semijoin strategy concurrently across multiple threads and nodes.
Prerequisites
Before you begin, ensure that you have:
A PolarDB for MySQL 8.0 cluster with a revision version of 8.0.1.0.5 or later, or 8.0.2.2.7 or later
To check your cluster's engine version, see the Query the engine version section of the "Engine versions" topic.
How semijoins work
Consider a query that finds all employees who work in a department listed in the Dept table:
SELECT * FROM Employee
WHERE DeptName IN (SELECT DeptName FROM Dept);Without semijoin optimization, the database evaluates the subquery for every row in Employee that matches the outer condition. If many employees share the same department, the subquery runs repeatedly.
An INNER JOIN approach has a different problem: it returns one row per match. If an employee's department matches three rows in Dept, that employee appears three times in the result.
A semijoin fixes both problems. It returns a row from the outer table (Employee) if it matches at least one row in the inner table (Dept). Even when multiple matches exist, the outer row appears only once.
To achieve this, the optimizer converts the subquery into a join and pulls the inner table up to the outer query level so both tables are evaluated together. As soon as a match is found, the system returns the result—no repeated subquery execution, no duplicate rows.

MySQL 5.6.5 and later support semijoins.
Semijoin strategies
PolarDB supports all four MySQL semijoin strategies. Each uses a different approach to eliminate duplicates.
DuplicateWeedout
Creates a temporary table and uses unique row IDs to filter out duplicate results before returning them.

Materialization
Materializes the subquery result into an indexed temporary table, then uses that table to perform the join. The index removes duplicates and supports efficient lookups when joining with the outer table.

FirstMatch
Executes the subquery and short-circuits as soon as the first match is found, skipping remaining rows in the inner table. This avoids producing duplicates at the source.

LooseScan
Groups inner table data by index, selects one representative row per value group in the subquery, then joins that row with the outer table. This produces a duplicate-free result set without a temporary table.

Syntax
Semijoins are triggered by IN or EXISTS subqueries in the WHERE or ON clause.
IN:
SELECT * FROM Employee
WHERE DeptName IN (
SELECT DeptName FROM Dept
);EXISTS:
SELECT * FROM Employee
WHERE EXISTS (
SELECT 1 FROM Dept
WHERE Employee.DeptName = Dept.DeptName
);Parallel semijoin execution
PolarDB splits each semijoin task into subtasks and runs them concurrently using a multi-threading model, accelerating all four semijoin strategies.
In PolarDB 8.0.2.2.7 and later, multi-phase parallel query for semijoin Materialization is also supported. The inner subquery first builds materialized tables in parallel across multiple workers; then the outer query runs in parallel against that materialized table. This approach takes full advantage of CPU resources and maximizes throughput for complex subquery workloads.
Example: Q20 with multi-phase parallel query
The following example uses Q20 from the TPC-H benchmark to show how multi-phase parallel query works with semijoin Materialization.
This test is based on the TPC-H benchmark but does not meet all TPC-H benchmark requirements. Results may differ from published TPC-H benchmark figures.
SELECT
s_name,
s_address
FROM
supplier,
nation
WHERE
s_suppkey IN (
SELECT ps_suppkey
FROM partsupp
WHERE ps_partkey IN (
SELECT p_partkey
FROM part
WHERE p_name LIKE '[COLOR]%'
)
AND ps_availqty > (
SELECT 0.5 * SUM(l_quantity)
FROM lineitem
WHERE l_partkey = ps_partkey
AND l_suppkey = ps_suppkey
AND l_shipdate >= date('[DATE]')
AND l_shipdate < date('[DATE]') + interval '1' year
)
)
AND s_nationkey = n_nationkey
AND n_name = '[NATION]'
ORDER BY s_name;Both the inner subquery and the outer query run as parallel queries with a degree of parallelism (DOP) of 32.
Query plan
The following query plan is generated when elastic parallel query is enabled. The test uses a standard TPC-H hot data scenario at 100 GB scale.
-> Sort: <temporary>.s_name (cost=5014616.15 rows=100942)
-> Stream results
-> Nested loop inner join (cost=127689.96 rows=100942)
-> Gather (slice: 2; workers: 64; nodes: 2) (cost=6187.68 rows=100928)
-> Nested loop inner join (cost=1052.43 rows=1577)
-> Filter: (nation.N_NAME = 'KENYA') (cost=2.29 rows=3)
-> Table scan on nation (cost=2.29 rows=25)
-> Parallel index lookup on supplier using SUPPLIER_FK1 (S_NATIONKEY=nation.N_NATIONKEY), with index condition: (supplier.S_SUPPKEY is not null), with parallel partitions: 863 (cost=381.79 rows=631)
-> Single-row index lookup on <subquery2> using <auto_distinct_key> (ps_suppkey=supplier.S_SUPPKEY)
-> Materialize with deduplication
-> Gather (slice: 1; workers: 64; nodes: 2) (cost=487376.70 rows=8142336)
-> Nested loop inner join (cost=73888.70 rows=127224)
-> Filter: (part.P_NAME like 'lime%') (cost=31271.54 rows=33159)
-> Parallel table scan on part, with parallel partitions: 6244 (cost=31271.54 rows=298459)
-> Filter: (partsupp.PS_AVAILQTY > (select #4)) (cost=0.94 rows=4)
-> Index lookup on partsupp using PRIMARY (PS_PARTKEY=part.P_PARTKEY) (cost=0.94 rows=4)
-> Select #4 (subquery in condition; dependent)
-> Aggregate: sum(lineitem.L_QUANTITY)
-> Filter: ((lineitem.L_SHIPDATE >= DATE'1994-01-01') and (lineitem.L_SHIPDATE < <cache>((DATE'1994-01-01' + interval '1' year)))) (cost=4.05 rows=1)
-> Index lookup on lineitem using LINEITEM_FK2 (L_PARTKEY=partsupp.PS_PARTKEY, L_SUPPKEY=partsupp.PS_SUPPKEY) (cost=4.05 rows=7)Key operators in this plan:
| Operator | Role |
|---|---|
Gather (slice: 1; workers: 64; nodes: 2) | Phase 1: collects results from 64 parallel workers across 2 nodes to build the materialized subquery table. This is the parallel inner-subquery phase. |
Materialize with deduplication | Materializes the inner subquery result into a temporary table and removes duplicates using an index—this is the Materialization semijoin strategy. |
Gather (slice: 2; workers: 64; nodes: 2) | Phase 2: collects results from 64 parallel workers for the outer query, which looks up matching rows in the materialized table. |
Parallel table scan on part, with parallel partitions: 6244 | Splits the part table into 6,244 partitions, each scanned by a separate worker in parallel. |
The two Gather steps represent the two phases of multi-phase parallel query: phase 1 builds the materialized table in parallel, and phase 2 runs the outer join in parallel against that table.
Performance result

With elastic parallel query enabled, execution time drops from 43.52 seconds to 2.29 seconds—a 19x improvement.