By default, the In-Memory Column Index (IMCI) executor of PolarDB uses row numbers to represent intermediate results. If the data required by a large query cannot be fully held in memory, a large number of random and repeated I/O operations may occur and degrade execution efficiency. To solve this issue, the IMCI executor implements a set of operators that materialize intermediate results. This topic uses hash joins as an example to describe the implementation of the Hash Match operator, which materializes intermediate results.
Execution plan
The Hash Match operator works in two phases: the build phase and the probe phase. In the build phase, the join keys defined by the join predicates are used to insert each row of the left table into a hash table. In the probe phase, the right table is traversed and each row is matched against the hash table based on the join predicates. The result set is then output according to the matching results and the join type.
In the build phase, a single hash table could hold all the data of the left table. However, such a hash table becomes very large: hash conflicts become serious, and the table has to be resized repeatedly during the build process. Therefore, in the build phase, the data of the left table is partitioned based on specific rules, and an independent hash table is built for each partition. In the probe phase, each row of the right table is looked up only in the hash table of the partition to which the row maps.
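The partitioning rule can be as simple as taking a few bits of the join-key hash, as in the following minimal C++ sketch. The names (kNumPartitions, PartitionOf, Row, BuildPartitioned) are illustrative only and are not part of the IMCI code.
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative partitioned build: each left-table row is routed to a
// partition by the hash of its join key, and one hash table is built
// per partition.
constexpr size_t kNumPartitions = 64;

size_t PartitionOf(uint64_t hash_code) {
  // Use the high bits for partitioning so that the low bits remain usable
  // for bucketing inside the per-partition hash table.
  return static_cast<size_t>(hash_code >> 56) % kNumPartitions;
}

struct Row { std::string join_key; /* other columns ... */ };

void BuildPartitioned(
    const std::vector<Row> &left,
    std::vector<std::unordered_multimap<std::string, size_t>> &tables) {
  tables.assign(kNumPartitions, {});
  for (size_t i = 0; i < left.size(); ++i) {
    uint64_t h = std::hash<std::string>{}(left[i].join_key);
    tables[PartitionOf(h)].emplace(left[i].join_key, i);  // value: row position
  }
}
During the probe phase, a right-table row whose join-key hash is h only needs to search tables[PartitionOf(h)].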
Build phase
In the IMCI executor, the build logic of the Hash Match operator is implemented in doOpen and consists of two phases: doBuild and doMerge. Each phase is processed concurrently by a thread group.
DoBuild
In the doBuild phase, worker threads in a thread group build independent hash tables for each partition by using the data in the left table.
Worker\Partition | Partition0 | Partition1 | ... | PartitionN |
Worker0 | HashMap00 | HashMap01 | ... | HashMap0N |
Worker1 | HashMap10 | HashMap11 | ... | HashMap1N |
... | ... | ... | ... | ... |
WorkerM | HashMapM0 | HashMapM1 | ... | HashMapMN |
A hash table is built for each partition by each worker thread. In addition, a group of chunks is generated to store the materialized results. The value stored in the hash table is a UInt64 that only records the position, in the chunks, of the tuple that corresponds to the current key. The UInt64 value is split by bit into three parts: UInt16, UInt16, and UInt32. The first UInt16 identifies the worker thread to which the chunk belongs, the second UInt16 identifies the offset of the chunk, and the UInt32 identifies the index within the chunk. Worker threads retrieve tuples from the left table in parallel and, based on the partition rules, insert each tuple without locking into the hash table that the worker thread owns for the target partition. This process is repeated until all the data of the left table has been retrieved.
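A UInt64 position of this kind can be packed and unpacked with plain bit operations, as in the following sketch. The field order and the helper names are illustrative assumptions; the actual IMCI layout may differ.
#include <cstdint>

// Hypothetical packing of a tuple position into one UInt64:
// [16-bit worker id][16-bit chunk offset][32-bit index inside the chunk]
struct TuplePos {
  uint16_t worker_id;
  uint16_t chunk_offset;
  uint32_t index;
};

inline uint64_t PackPos(const TuplePos &p) {
  return (static_cast<uint64_t>(p.worker_id) << 48) |
         (static_cast<uint64_t>(p.chunk_offset) << 32) |
         static_cast<uint64_t>(p.index);
}

inline TuplePos UnpackPos(uint64_t v) {
  return TuplePos{static_cast<uint16_t>(v >> 48),   // worker id
                  static_cast<uint16_t>(v >> 32),   // chunk offset
                  static_cast<uint32_t>(v)};        // index inside the chunk
}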
DoMerge
After the doBuild phase is complete, each partition has multiple hash tables, one built by each worker thread. The hash tables built in the build phase are used in the probe phase to determine whether a lookup matches. Because the data is partitioned in the build phase, the probe phase must search the hash tables of the corresponding partition based on the same partition rules. The per-worker hash tables produced in the doBuild phase could be searched one by one in the probe phase, but doing so is inconvenient and hurts query performance. Therefore, in the doMerge phase, the Hash Match operator merges all hash tables of a partition into one.
Build\Partition | Partition0 | Partition1 | ... | PartitionN |
Merge | HashMap0 | HashMap1 | ... | HashMapN |
The doMerge phase is also performed by a thread group. To avoid unnecessary lock synchronization, the hash tables of a partition are merged by a single worker thread. Because the number of partitions is greater than the number of worker threads, the workload of the worker threads in the doMerge phase is roughly balanced.
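Conceptually, each worker thread claims partitions, for example through a shared atomic counter, and merges the per-worker hash tables of every claimed partition into a single table, as in the following sketch. The types and names are placeholders, not the IMCI implementation.
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative doMerge: hash_maps[worker][partition] are the tables produced
// by doBuild; merged[partition] is the single table used by the probe phase.
// Each partition is merged by exactly one worker, so no locking is needed.
using HashTable = std::unordered_multimap<std::string, uint64_t>;

void DoMerge(std::vector<std::vector<HashTable>> &hash_maps,  // [worker][partition]
             std::vector<HashTable> &merged,                  // [partition]
             std::atomic<size_t> &next_partition) {           // shared work queue
  const size_t num_partitions = merged.size();
  for (size_t p = next_partition.fetch_add(1); p < num_partitions;
       p = next_partition.fetch_add(1)) {
    for (auto &per_worker : hash_maps) {
      merged[p].merge(per_worker[p]);  // moves nodes; the source table becomes empty
    }
  }
}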
Data spilling in the build phase
The workload of the worker threads is roughly balanced. Therefore, we can assume that each worker thread processes about the same amount of data, and an even share of the total memory can be taken as the memory quota of each worker thread.
Because the memory size is limited, the Hash Match operator may not be able to keep the hash tables and chunks of all partitions in memory. Hash tables and chunks are both organized by partition. Therefore, if memory is insufficient, excess data is spilled to disk partition by partition.
This way, the build phase and the probe phase can still be performed normally on the partitions that remain in memory. Currently, the Hash Match operator spills entire partitions starting from the partition with the highest partition number until the partitions that remain in memory can be processed normally. If no partition can be processed in memory, an OutOfMemoryError exception is thrown.
In the doBuild phase, if a key-value pair cannot be added to a hash table or the data of a chunk cannot be stored because the memory quota of the worker thread is exhausted, the data of the highest in-memory partition of that worker thread is spilled to disk: the chunks are written to a temporary file, the memory of the chunks is released, and the hash table is deleted. When the partition needs to be processed later, the chunks are read back from the temporary file and the hash table of the partition is rebuilt from them. The highest in-memory partition number of a worker thread is visible to the other worker threads. If a worker thread finds that its own number is higher than that of another worker thread, it lowers its number accordingly and releases the memory of the affected partitions. In the doBuild phase, a worker thread does not build hash tables for partitions that have already been marked for spilling; the data of such partitions is only appended to chunks, and each chunk is spilled to disk once it is full.
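The policy can be summarized as a shared spill boundary that moves down one partition at a time under memory pressure. The following sketch is a simplified illustration under that assumption; SpillState and its members are hypothetical names.
#include <atomic>
#include <cstddef>

// Illustrative build-side spilling: partitions whose number is greater than
// or equal to the shared boundary are kept on disk only. The boundary starts
// past the last partition and moves down as memory pressure grows.
struct SpillState {
  std::atomic<size_t> boundary;  // first spilled partition number
  explicit SpillState(size_t num_partitions) : boundary(num_partitions) {}

  bool IsSpilled(size_t partition) const {
    return partition >= boundary.load(std::memory_order_acquire);
  }

  // Called by a worker that failed to allocate memory for a hash table or a
  // chunk: spill one more partition (the current highest in-memory one) and
  // make the new boundary visible to all workers.
  size_t SpillOneMore() {
    size_t b = boundary.load(std::memory_order_acquire);
    while (b > 0 && !boundary.compare_exchange_weak(b, b - 1,
                                                    std::memory_order_acq_rel)) {
      // b was refreshed with the value set by another worker; retry
    }
    return b > 0 ? b - 1 : 0;  // if b == 0, everything is already spilled
  }
};
After SpillOneMore returns, the caller writes the chunks of that partition to its temporary file, releases their memory, and drops the partition's hash table, as described above.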
Probe phase
In the build phase, the left table is read and used to build hash tables. In the probe phase, the data of the right table is read and output based on the matching result of the hash tables that are built in the build phase. The data is partitioned in the build phase. Therefore, the operations in the probe phase must be performed based on the same partition rules.
DoFetch
The probe phase is also processed by a thread group and is driven by the Fetch calls of the parent node. In the doFetch phase, the worker threads of the Hash Match operator fetch data from the right table and look up the fetched tuples in the hash table of the corresponding partition. The tuples are then processed based on the matching results. This process is repeated until all the data of the right table has been fetched.
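Ignoring spilled partitions for now (they are covered in the next subsection), the loop of one probe worker looks roughly like the following sketch. The helper types and callbacks (RightRow, FetchRightChunk, EmitMatch) are placeholders, not IMCI interfaces.
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative doFetch loop of one worker: fetch a chunk of right-table rows,
// route every row to its partition, and look it up in the merged hash table
// that the build phase produced for that partition.
using HashTable = std::unordered_multimap<std::string, uint64_t>;
struct RightRow { std::string join_key; /* payload ... */ };

void ProbeWorker(
    const std::vector<HashTable> &merged,                       // [partition]
    const std::function<bool(std::vector<RightRow> &)> &FetchRightChunk,
    const std::function<void(const RightRow &, uint64_t)> &EmitMatch) {
  std::vector<RightRow> chunk;
  while (FetchRightChunk(chunk)) {               // until the right table is exhausted
    for (const RightRow &row : chunk) {
      uint64_t h = std::hash<std::string>{}(row.join_key);
      const HashTable &table = merged[static_cast<size_t>(h >> 56) % merged.size()];
      auto range = table.equal_range(row.join_key);
      for (auto it = range.first; it != range.second; ++it) {
        EmitMatch(row, it->second);              // it->second: packed left-tuple position
      }
    }
  }
}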
Data spilling in the probe phase
If memory was insufficient to hold all partitions in the build phase, the probe phase must also distinguish between in-memory partitions and on-disk partitions.
In the doFetch phase, after a worker thread fetches data from the right table, it directly searches the hash table of a tuple's partition if that partition is in memory. If the partition is on disk, the tuple is appended to the chunk that the worker thread keeps for that partition. When the chunk is full, its data is flushed to disk and the memory of the chunk is released. After all worker threads have fetched the data of the right table, the in-memory partitions are fully processed and their data can be released.
After all partitions in the memory are processed, the partitions in the disk are processed. The data of the partitions in the disk is stored in different temporary files by partition. Therefore, to prevent lock synchronization, in the probe phase, each disk partition is independently processed by a worker thread. The number of partitions is greater than the number of worker threads. Therefore, the workload of each worker thread is basically the same.
The processing of partitions in the disk by using workers also consists of two phases: the build phase and the probe phase.
In the build phase, worker threads read data from the temporary files of left-table partitions and deserialize the chunks. The worker threads then build hash tables from the data of the chunks. This process is repeated until all the data of the left-table partitions has been read.
In the probe phase, worker threads read data from the temporary files of right-table partitions and deserialize the chunks. The worker threads then look up the tuples and process them based on the matching results. This process is repeated until all the data of the right-table partitions has been read.
After all partitions are processed, the work of the Hash Match operator is complete. Although the processing of in-memory partitions and on-disk partitions is described separately in this topic, the implementation shares a single code path.
Probe-related process
The probe-related process consists of the following steps: ProbeMem, ProbeLeft, and ProbeDisk. All steps are performed by using the probe function.
In the ProbeMem step, data is read from the right table and is either processed in memory or routed to disk based on the partition of the data. Data that belongs to an on-disk partition is not probed immediately but is stored in temporary files so that, in the ProbeDisk step, the data of each disk partition can be loaded and processed by using the probe function.
In the ProbeLeft step, LEFT JOIN operations such as LEFT OUTER JOIN, LEFT SEMI JOIN, and LEFT ANTI SEMI JOIN are handled. All key-value pairs in the hash table are traversed, and the matched or unmatched tuples are picked out as required by the join type.
In the ProbeDisk step, the probe operations are performed on the disk partitions, one partition at a time. When a disk partition is processed, the chunks in its temporary file are loaded and processed by the probe function. If a LEFT JOIN is involved, ProbeLeft must also be performed for the partition.
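The overall sequence can be summarized by the following illustrative driver. The function names mirror the steps above, but the declarations are placeholders rather than the actual IMCI interfaces.
#include <cstddef>
#include <vector>

// Placeholder declarations; in the real operator these are methods of the
// Hash Match operator and its Prober.
void ProbeMem();                        // probe in-memory partitions, spill the rest
void ProbeLeft();                       // handle unmatched left tuples (in memory)
void ProbeLeft(size_t partition);       // handle unmatched left tuples of one disk partition
bool IsLeftJoin();
std::vector<size_t> SpilledPartitions();
void LoadAndBuild(size_t partition);    // rebuild the hash table from the left-table temp file
void ProbeDisk(size_t partition);       // probe with the right-table temp file of the partition

// Illustrative driver for the probe-related process described above.
void RunProbe() {
  ProbeMem();                           // rows of spilled partitions go to temporary files
  if (IsLeftJoin()) ProbeLeft();
  for (size_t p : SpilledPartitions()) {
    LoadAndBuild(p);
    ProbeDisk(p);
    if (IsLeftJoin()) ProbeLeft(p);
  }
}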
Join logic
The Hash Match operator supports INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, LEFT SEMI JOIN, LEFT ANTI SEMI JOIN, RIGHT SEMI JOIN, RIGHT ANTI SEMI JOIN, and the PostFilter feature. All JOIN operations consist of the build phase and probe phase. The build phase is basically the same except for the processing of a NULL value. The probe phase is different. The following section is an overview of the processing logic of different types of JOIN operations.
Inner
If a tuple in the right table is non-NULL and matches the hash table that is created based on the left table, the tuples in the left and right tables are output.
LeftOuter
If a tuple in the right table is non-NULL and matches the hash table that is built from the left table, the tuples of the left and right tables are output. All tuples in the left table that are not matched are also output, and the corresponding positions in the right table are set to NULL. If LEFT OUTER JOIN carries a PostFilter, the matches found in the hash table built from the left table must also pass the PostFilter check.
RightOuter
If a tuple in the right table is non-NULL and matches the hash table that is built from the left table, the tuples of the left and right tables are output. Otherwise, the right-table tuple is output and the corresponding positions in the left table are set to NULL. If RIGHT OUTER JOIN carries a PostFilter, the matches found in the hash table built from the left table must also pass the PostFilter check.
LeftSemi
LEFT SEMI JOIN is similar to LEFT OUTER JOIN, except that the tuples of the left and right tables are not output together. Based on the following truth table, either the tuples in the left table are output together with a NULL, TRUE, or FALSE value, or only the tuples in the left table are output, or nothing is output. If LEFT SEMI JOIN carries a PostFilter, the matches found in the hash table built from the left table must also pass the PostFilter check.
//+------------------------------+--------------+----------------+
//| matched                      | semi_probe_  | ! semi_probe_  |
//+------------------------------+--------------+----------------+
//| normal true                  | (left, TRUE) | (left, ONLY)   |
//+------------------------------+--------------+----------------+
//+------------------------------+--------------+----------------+
//| ! matched                    | semi_probe_  | ! semi_probe_  |
//+------------------------------+--------------+----------------+
//|NULL v.s. (empty set)         |              |                |
//|e.g., NULL IN (empty set)     | (left, FALSE)| NO_OUTPUT      |
//+------------------------------+--------------+----------------+
//|NULL v.s. (set)               |              |                |
//|e.g., NULL IN (1, 2, 3)       | (left, NULL) | NO_OUTPUT      |
//+------------------------------+--------------+----------------+
//|left_row v.s. (set with NULL) |              |                |
//|e.g., 10 IN (1, NULL, 3)      | (left, NULL) | NO_OUTPUT      |
//+------------------------------+--------------+----------------+
//|normal false                  |              |                |
//|e.g., 10 IN (1, 2, 3)         | (left, FALSE)| NO_OUTPUT      |
//+------------------------------+--------------+----------------+
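The NULL handling in the table above follows SQL's three-valued semantics for IN. A minimal sketch of computing the mark for one left tuple, using the hypothetical helper below (not IMCI code), could look like this:
#include <optional>

// Illustrative three-valued result of `left_key IN (right-side set)` that is
// used as the semi-join mark; std::nullopt stands for the SQL NULL mark.
std::optional<bool> SemiMark(bool matched, bool left_key_is_null,
                             bool right_set_empty, bool right_set_has_null) {
  if (matched) return true;                  // normal true
  if (right_set_empty) return false;         // x IN (empty set) is FALSE, even for NULL x
  if (left_key_is_null || right_set_has_null)
    return std::nullopt;                     // NULL IN (...) or x IN (..., NULL, ...)
  return false;                              // normal false
}
When semi_probe_ is not set, the mark is dropped: as the table above shows, a TRUE result outputs only the left tuple and every other case produces no output.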
LeftAntiSemi
LEFT ANTI SEMI JOIN is similar to LEFT OUTER JOIN, except that the tuples of the left and right tables are not output together. Based on the following truth table, either only the tuples in the left table are output or nothing is output. If LEFT ANTI SEMI JOIN carries a PostFilter, the matches found in the hash table built from the left table must also pass the PostFilter check.
//+------------------------------+----------------+
//| ! matched                    | ! semi_probe_  |
//+------------------------------+----------------+
//|NULL v.s. (empty set)         |                |
//|e.g., NULL NOT IN (empty set) | (left, ONLY)   |
//+------------------------------+----------------+
//|NULL v.s. (set)               |                |
//|e.g., NULL NOT IN (1, 2, 3)   | (left, ONLY)   |
//+------------------------------+----------------+
//|left_row v.s. (set with NULL) |                |
//|e.g., 10 NOT IN (1, NULL, 3)  | (left, ONLY)   |
//+------------------------------+----------------+
//|normal false                  |                |
//|e.g., 10 NOT IN (1, 2, 3)     | (left, ONLY)   |
//+------------------------------+----------------+
RightSemi
RIGHT SEMI JOIN is similar to RIGHT OUTER JOIN, except that the tuples of the left and right tables are not output together. Based on the following truth table, either the tuples in the right table are output together with a NULL, TRUE, or FALSE value, or only the tuples in the right table are output, or nothing is output. If RIGHT SEMI JOIN carries a PostFilter, the matches found in the hash table built from the left table must also pass the PostFilter check.
//+------------------------------+--------------+----------------+
//| matched                      | semi_probe_  | ! semi_probe_  |
//+------------------------------+--------------+----------------+
//| normal true                  | (right, TRUE)| (right, ONLY)  |
//+------------------------------+--------------+----------------+
//+------------------------------+--------------+----------------+
//| ! matched                    | semi_probe_  | ! semi_probe_  |
//+------------------------------+--------------+----------------+
//|NULL v.s. (empty set)         |              |                |
//|e.g., NULL IN (empty set)     |(right, FALSE)| NO_OUTPUT      |
//+------------------------------+--------------+----------------+
//|NULL v.s. (set)               |              |                |
//|e.g., NULL IN (1, 2, 3)       |(right, NULL) | NO_OUTPUT      |
//+------------------------------+--------------+----------------+
//|left_row v.s. (set with NULL) |              |                |
//|e.g., 10 IN (1, NULL, 3)      |(right, NULL) | NO_OUTPUT      |
//+------------------------------+--------------+----------------+
//|normal false                  |              |                |
//|e.g., 10 IN (1, 2, 3)         |(right, FALSE)| NO_OUTPUT      |
//+------------------------------+--------------+----------------+
RightAntiSemi
RIGHT ANTI SEMI JOIN is similar to RIGHT OUTER JOIN, except that the tuples of the left and right tables are not output together. Based on the following truth table, either only the tuples in the right table are output or nothing is output. If RIGHT ANTI SEMI JOIN carries a PostFilter, the matches found in the hash table built from the left table must also pass the PostFilter check.
//+------------------------------+----------------+
//| ! matched                    | ! semi_probe_  |
//+------------------------------+----------------+
//|NULL v.s. (empty set)         |                |
//|e.g., NULL NOT IN (empty set) | (right, ONLY)  |
//+------------------------------+----------------+
//|NULL v.s. (set)               |                |
//|e.g., NULL NOT IN (1, 2, 3)   | (right, ONLY)  |
//+------------------------------+----------------+
//|left_row v.s. (set with NULL) |                |
//|e.g., 10 NOT IN (1, NULL, 3)  | (right, ONLY)  |
//+------------------------------+----------------+
//|normal false                  |                |
//|e.g., 10 NOT IN (1, 2, 3)     | (right, ONLY)  |
//+------------------------------+----------------+
Implementation of the Hash Match operator
The Hash Match operator defines a unified process that covers in-memory and on-disk processing, the different types of JOIN operations, and the PostFilter feature.
HashMap
The HashMap used by the Hash Match operator provides a write operation and a query operation:
size_t PutValue(uint64_t hash_code, const char *key_buf, uint64_t key_len, const uint64_t tuple);
ValueIterator FindValue(uint64_t hash_code, const char *key_data, const uint64_t key_len, const bool need_mark = false);
Two iterators are provided to traverse the entire hash table:
enum IteratorType { Normal = 0, NoneMark = 1, Mark = 2, END };
class TableIterator {
public:
void Next();
bool IsValid() const { return valid_; }
ValueIterator GetIterator(IteratorType type);
private:
IteratorType type_ = IteratorType::END;
};
class ValueIterator {
public:
struct Listener {
virtual void BlockEvent() {}
};
void SetListener(Listener *listener) { listener_ = listener; }
void Next();
bool IsValid() const { return valid_; }
private:
IteratorType type_ = IteratorType::Normal;
Listener *listener_ = nullptr;
};
The TableIterator iterator traverses all key-value pairs in a hash table, and the ValueIterator iterator traverses all data blocks of a key-value pair. Both iterators support three iteration modes that serve different JOIN operations: Normal, NoneMark, and Mark.
The TableIterator iterator traverses all hash tables and is mainly applied to LEFT_OUTER, LEFT_ANTI_SEMI, and LEFT_SEMI.
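For example, on the probe side a lookup returns a ValueIterator that walks all matches of a key. The following usage sketch is inferred from the interfaces above rather than taken from the operator's code; the assumption that need_mark records the match for later use by ProbeLeft is based on the description of TraverseIterator below.
// Hypothetical usage of FindValue and ValueIterator (inferred, not taken from
// the operator's code): look up one key and walk all of its values.
void ProbeOneKey(HashMap &table, uint64_t hash_code,
                 const char *key_data, uint64_t key_len, bool is_left_join) {
  ValueIterator it = table.FindValue(hash_code, key_data, key_len,
                                     /*need_mark=*/is_left_join);
  for (; it.IsValid(); it.Next()) {
    // Each value is a packed UInt64 position that locates a left tuple in the
    // chunks; the caller materializes the tuple and emits the join result.
  }
}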
Info
HMInfo stores global data shared by all workers, such as memory partition numbers and partition objects. A partition stores the merged hash table of the current partition, the chunk set of the left and right tables, and the temporary files of the left and right tables. The Hash Match operator generates a temporary file for each partition. Workers can perform atomic read and write operations by using the pread function, pwrite function, and the offset atomic variable.
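The temporary-file access pattern described above can be sketched as follows. The wrapper type and its members are illustrative (plain POSIX pread/pwrite plus an atomic offset), not the actual HMInfo code.
#include <atomic>
#include <cstddef>
#include <sys/types.h>
#include <unistd.h>

// Illustrative lock-free append/read on a partition's temporary file:
// the atomic offset hands out disjoint file ranges, and pwrite/pread access
// them without any further synchronization between workers.
struct PartitionTempFile {
  int fd = -1;                       // file descriptor of the temporary file
  std::atomic<off_t> write_offset{0};

  // Reserve a range and write one serialized chunk into it.
  off_t AppendChunk(const char *buf, size_t len) {
    off_t off = write_offset.fetch_add(static_cast<off_t>(len));
    ::pwrite(fd, buf, len, off);     // error handling omitted in this sketch
    return off;                      // caller remembers where the chunk lives
  }

  // Read a chunk back from a previously reserved range.
  ssize_t ReadChunk(char *buf, size_t len, off_t off) const {
    return ::pread(fd, buf, len, off);
  }
};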
Local Info
HMLocalInfo stores the private data of the current worker, such as the memory partition number of the current worker and the left and right HMLocalPartitions. Each HMLocalPartition stores the hash table of the current partition of the worker, the chunk set, and the chunk object that is being written.
Fetcher
The Hash Match operator supports multiple fetch methods, including fetching data from the left table, fetching data from the right table, reading chunk objects from the chunk sets of Info or LocalInfo in memory, and reading and deserializing chunk objects from temporary files. These methods can all be used to fetch chunk data in the build and probe phases.
class HashMatchFetcher final {
bool Fetch(Context &context, TupleChunk *&mat_chunk);
// fetch from left or right child
bool FetchMem(Context &context);
// fetch from info chunks (include load from temp files)
bool FetchDisk(Context &context, TupleChunk *&mat_chunk);
size_t part_index_ = 0;
TupleChunk chunk_;
};
Builder
Builder processes the memory partition and disk partition in a centralized manner in the doBuild phase.
MemBuilder: The data of the left table is fetched and stored in the chunk set. If a tuple belongs to an in-memory partition, it is also written to the hash table. If a tuple belongs to a disk partition, the chunk data is spilled to disk.
DiskBuilder: The chunk set is read from a temporary file and is used to build the hash table of the partition.
class HashMatchBuilder {
void Build();
virtual void ChunkResult(const size_t offset, const bool is_null,
const size_t part_index, const uint64_t hash_val,
const char *key_data, const size_t key_len) = 0;
virtual void ChunkDone() = 0;
HashMatchFetcher fetcher_;
};
class HashMatchMemBuilder final : public HashMatchBuilder {
void ChunkResult(const size_t offset, const bool is_null,
const size_t part_index, const uint64_t hash_val,
const char *key_data, const size_t key_len) override;
void ChunkDone() override;
TupleChunk origin_chunk_;
};
class HashMatchDiskBuilder final : public HashMatchBuilder {
void ChunkResult(const size_t offset, const bool is_null,
const size_t part_index, const uint64_t hash_val,
const char *key_data, const size_t key_len) override;
void ChunkDone() override;
const size_t part_index_ = 0;
};
Prober
The ProbeMem, ProbeLeft, and ProbeDisk operations in the probe phase are all performed by using the probe feature of Prober.

class HashMatchProber final {
public:
void ProbeResult(TupleChunk *tpchunk, size_t &chunk_off, const size_t chunk_size);
bool ProbeIter(Context &context, TupleChunk *tpchunk, size_t &chunk_off, const size_t chunk_size);
bool Probe(Context &context, TupleChunk *tpchunk, size_t &chunk_off, const size_t chunk_size, const bool disk);
private:
const HashMatch &join_;
HMInfo *info_ = nullptr;
HMLocalInfo *local_info_ = nullptr;
size_t part_index_ = 0;
PostFilter filter_;
LeftIterator lit_;
RightIterator rit_;
TraverseIterator tit_; // used for probe left
};
HashMatchProber::PostFilter processes the probes of JOIN operations that carry the PostFilter feature. The result set fetched in the probe phase must be verified by the PostFilter.
struct PostFilter final {
bool Evaluate();
bool Probe(TupleChunk *tpchunk, size_t &chunk_off, const size_t chunk_size);
const HashMatchProber &prober_;
const RTExprTreePtr &post_expr_;
const HashMatchExpr &left_expr_;
const HashMatchExpr &right_expr_;
std::shared_ptr<Expressions::ExprEnv> post_env_ = nullptr;
};
Prober supports the following iterators: LeftIterator, RightIterator, and TraverseIterator.
struct Iterator {
virtual void InitExpr() {}
virtual void FiniExpr() {}
virtual void Init(const size_t part_index);
virtual void Fini();
virtual bool Valid(Context &context) { return false; }
virtual void Next() = 0;
HashMatchProber &prober_;
PostFilter &filter_;
};
HashMatchProber::LeftIterator uses HashMap::ValueIterator to traverse all values of a specific key in a hash table and to locate the corresponding chunk tuples based on those values. This way, all tuples of the specified key are available for handling the different types of JOIN operations and the PostFilter feature.
// for Probe
struct LeftIterator final : public Iterator, public ValueIterator::Listener {
void BlockEvent() override;
bool Valid(Context &context) override;
void Next() override;
bool Find(const size_t part_index, const uint64_t hash_val,
const char *key_data, const uint64_t key_len);
ValueIterator it_;
};
HashMatchProber::RightIterator continuously uses HashMatchFetcher to fetch chunk sets from the right table or from temporary files and provides the ability to traverse all chunks and all tuples. All JOIN operations use RightIterator in the ProbeMem or ProbeDisk step to fetch chunks and then search the hash table of the corresponding partition for each key. If a match exists, a LeftIterator is created to traverse all tuples of the key. In addition, LEFT JOINs also need to be processed by using ProbeLeft.
// for Probe
struct RightIterator : public Iterator {
bool Valid(Context &context) override;
void Next() override;
HashMatchFetcher fetcher_;
TupleChunk origin_chunk_;
size_t chunk_size_ = 0;
};
HashMatchProber::TraverseIterator is mainly used for LEFT JOINs, such as LEFT OUTER JOIN, LEFT SEMI JOIN, and LEFT ANTI SEMI JOIN. It uses HashMap::TableIterator to traverse the entire hash table and pick out the matched or unmatched keys, and then uses LeftIterator to traverse all tuples of each key. LEFT JOINs use ProbeMem or ProbeDisk to search the hash tables and perform the matching. If a key matches, the key-value pair in the hash table is marked. ProbeLeft then uses TraverseIterator to pick out the matched or unmatched key-value pairs.
// for ProbeLeft
struct TraverseIterator final : public Iterator {
bool Valid(Context &context) override;
void Next() override;
TableIterator tit_;
LeftIterator lit_;
IteratorType it_type_ = IteratorType::END;
};
Test
A test is run to compare the performance of the Hash Join operator and the Hash Match operator. In the test, both operators are used to run Q14 of TPC Benchmark H (TPC-H) on 1 TB of data. The two operators use similar algorithms; the difference is whether the intermediate results are materialized.
select
    100.00 * sum(case
        when p_type like 'PROMO%'
            then l_extendedprice * (1 - l_discount)
        else 0
    end) / sum(l_extendedprice * (1 - l_discount)) as promo_revenue
from
    lineitem,
    part
where
    l_partkey = p_partkey
    and l_shipdate >= date '1995-09-01'
    and l_shipdate < date '1995-09-01' + interval '1' month;
The query is performed with a Least Recently Used (LRU) cache and executor memory of 100 GB.
Query (TPC-H 1 TB) | HashJoin | HashMatch |
Q14 | 23.96 seconds | 12.56 seconds |
The query is performed with an LRU cache and executor memory of 32 GB.
Query (TPC-H 1 TB) | HashJoin | HashMatch |
Q14 | More than 10 minutes | 35.73 seconds |