All Products
Search
Document Center

PolarDB:Hash joins in parallel queries

Last Updated:Mar 25, 2024

Hash join is a new join execution method introduced in MySQL Community Edition 8.0. It can significantly improve the execution performance of analytic queries. PolarDB for MySQL 8.0 supports the parallel execution of hash joins. More parallel execution policies are under development and will be released in the future. This topic describes how to use hash joins for parallel queries in PolarDB.

Simple hash join

Prerequisites

A PolarDB for MySQL 8.0 Cluster Edition cluster is used and the revision version is 8.0.2.1.0 or later. For more information about how to check the version, see Query the engine version.

Parallel execution policy并行策略

In the preceding execution plan, the degree of parallelism (DOP) is set to 4, which indicates that PolarDB uses four workers to perform parallel queries. First, four workers scan table t1 in parallel. Each worker uses only part of the data in table t1 to create a hash table. Then, a JOIN operation is performed between the four hash tables and table t2.

Usage

  • Syntax:

    In PolarDB, you can use only the EXPLAIN FORMAT=TREE statement to check whether hash joins are used.

  • Examples:

    In the following example, two tables are created and data is inserted into the tables.

    CREATE TABLE t1 (c1 INT, c2 INT);
    CREATE TABLE t2 (c1 INT, c2 INT);
    INSERT INTO t1 VALUES (1,1),(2,2),(3,3),(5,5);
    INSERT INTO t2 VALUES (1,1),(2,2),(3,3),(5,5);

    The following example shows the execution plan for these two tables when hash joins are used:

    EXPLAIN FORMAT=TREE
     EXPLAIN
      -> Gather (slice: 1; workers: 4)  (cost=10.82 rows=4)
        -> Parallel inner hash join (t2.c2 = t1.c1)  (cost=0.57 rows=1)
            -> Parallel table scan on t2, with parallel partitions: 1  (cost=0.03 rows=1)
            -> Parallel hash
                -> Parallel table scan on t1, with parallel partitions: 1  (cost=0.16 rows=1)

    In the preceding execution plan, the DOP is set to 4, which indicates that PolarDB uses four workers to perform parallel queries. First, four workers scan table t1 in parallel. Each worker uses only part of the data in table t1 to create a hash table. Then, a JOIN operation is performed between the four hash tables and table t2. Finally, a leader gathers the joined results to generate the final query results.

Parallel hash join

Prerequisites

A PolarDB for MySQL 8.0 Cluster Edition cluster is used and the revision version is 8.0.2.2.0 or later. For more information about how to check the version, see Query the engine version.

Parallel execution policy并行策略

In the preceding execution plan, the DOP is set to 4, which indicates that PolarDB uses four workers to perform parallel queries. First, four workers scan table t1 in parallel. Each worker uses only part of the data in table t1 data, but writes the build results to the same lock-free shared hash table. After the build phase is complete, the four workers scan table t2 table to scan the t2 and probe the data in parallel. Finally, a leader gathers the joined results to generate the final query results.

Usage

  • Syntax:

    In PolarDB, you can use only the EXPLAIN FORMAT=TREE statement to check whether hash joins are used.

  • Examples:

    In the following example, two tables are created and data is inserted into the tables.

    CREATE TABLE t1 (c1 INT, c2 INT);
    CREATE TABLE t2 (c1 INT, c2 INT);
    INSERT INTO t1 VALUES (1,1),(2,2),(3,3),(5,5);
    INSERT INTO t2 VALUES (1,1),(2,2),(3,3),(5,5);

    The following example shows the execution plan for these two tables when hash joins are used:

    EXPLAIN FORMAT=TREE
    
    EXPLAIN
    | -> Gather (slice: 1; workers: 1)  (cost=12.46 rows=4)
        -> Parallel inner hash join (t2.c1 = t1.c1)  (cost=2.06 rows=4)
            -> Parallel table scan on t2, with parallel partitions: 1  (cost=0.09 rows=4)
            -> Parallel hash
                -> Parallel table scan on t1, with parallel partitions: 1  (cost=0.65 rows=4)

Shuffle Hash Join

Prerequisites

A PolarDB for MySQL 8.0 Cluster Edition cluster is used and the revision version is 8.0.2.2.0 or later. For more information about how to check the version, see Query the engine version.

Parallel execution policy并行策略

Parallel hash join implements parallel execution in both the build and probe phases. However, if the shared hash table is too large, I/O operations are performed to dump the data to a disk, which affects the parallel query efficiency. To resolve this issue, partition hash join is used. In the preceding execution plan, four workers scan table t1 in parallel, partition and distribute the data to workers of the next phase based on join keys, and build a small hash table in each partition. After the build phase is complete, the four workers scan table t2, partition the data based on join keys, and distribute the data to workers that have completed the building of small hash tables. After the probe operation is complete, a leader gathers the joined results to generate the final query results.

Usage

  • Syntax:

    PolarDB allows you to execute only the EXPLAIN FORMAT=TREE statement to check whether hash joins are used.

  • Examples:

    In the following example, two tables are created and data is inserted into the tables.

    CREATE TABLE t1 (c1 INT, c2 INT);
    CREATE TABLE t2 (c1 INT, c2 INT);
    INSERT INTO t1 VALUES (1,1),(2,2),(3,3),(5,5);
    INSERT INTO t2 VALUES (1,1),(2,2),(3,3),(5,5);

    The following example shows the execution plan for these two tables when hash joins are used:

    EXPLAIN FORMAT=TREE
    
    EXPLAIN
    | -> Gather (slice: 1; workers: 2)  (cost=33.38 rows=4)
        -> Inner hash join (t2.c1 = t1.c1)  (cost=23.08 rows=2)
            -> Repartition (hash keys: t2.c1; slice: 2; workers: 1)  (cost=11.35 rows=2)
                -> Parallel table scan on t2, with parallel partitions: 1  (cost=0.65 rows=4)
            -> Hash
                -> Repartition (hash keys: t1.c1; slice: 3; workers: 1)  (cost=11.35 rows=2)
                    -> Parallel table scan on t1, with parallel partitions: 1  (cost=0.65 rows=4)