×
Community Blog OceanBase Rewriting Series: Subquery Promotion

OceanBase Rewriting Series: Subquery Promotion

Part 2 of the OceanBase Rewriting Series focuses on the strategy of rewriting subqueries in existential determination predicates into SEMI/ANTI JOIN.

Columnist Introduction

The OceanBase Optimizer Team, led by Xifeng (OceanBase Senior Technical Expert) and Shanwen (Technical Expert), is dedicated to building the world's leading distributed query optimizer.

Preface

In the previous article, we overviewed the OceanBase query rewriting practice. We briefly introduced the rewriting methods, the types of rewriting algorithms, the main issues when implementing each rewriting algorithm, and how multiple rewriting rules are organized in a rewriting framework. This article is the second in the SQL Rewriting Series, which mainly explains the subquery promotion. Using subqueries allows users to write complex SQL statements with clear meaning in a concise and clear manner. This feature is user-friendly but not database-friendly. From a database perspective, processing subqueries is inefficient. Database systems often try to rewrite SQL to improve the processing of subqueries and eliminate subqueries.

Subqueries typically have two types of usage:

  1. It is used in some existential predicates, such as IN, EXISTS, ANY/ALL comparison, etc. The result of a subquery in such scenarios is a set.
  2. It is used to calculate a specific value. For example, it appears in SELECT to calculate a projection item. In such scenarios, the result of a subquery is a specific value.

OceanBase uses different methods to rewrite these two subqueries.

This article will first introduce the rewrite strategy for the first type of subquery, which is the subquery that appears in the existential predicate.

Existential determination predicates are common in SQL statements. It is mainly used to determine whether there are records that meet the conditions in a set or not. Let’s take the movie table MOVIES and the schedule table PLAY as examples. The following SQL Q1 uses a typical existential determination predicate EXISTS (SELECT ...).

-- Movie table
MOVIES(movie_id, movie_name, rate);
-- Schedule table
PLAY(movie_id, time, price)

-- Q1: Find out the movies that are scheduled in 19: 00. 
SELECT movie_name
FROM   MOVIES
WHERE  EXISTS (SELECT timeFROM   PLAY
               WHERE  PLAY.movie_id = MOVIES.movie_id
                      AND PLAY.time = '19:00'); 

If the calculation form of the subquery is maintained, you need to scan the schedule table to check whether there is a '19:00' film schedule for each movie in the above query. This process is similar to the NEST-LOOP JOIN process. The number of times a subquery is executed is equal to the number of rows read from the MOVIES table. OceanBase rewrites the preceding subquery into a join to enrich the space selected by the plan.

Subquery Promotion

OceanBase rewrites the preceding Q1 query to a format similar to Q2. This change is equivalent. The semantics of semi-join in Q2 is – for each row in the MOVIES, if it can be successfully joined with any row in the PLAY according to the PLAY.movie_id = MOVIES.movie_id AND PLAY.time = '19:00', the row will be output. Otherwise, it will not be output. Intuitively, the semantics of Q1 and Q2 are the same.

-- Q2: Promote to SEMI JOIN.
SELECT movie_name
FROM   MOVIES 
       SEMI JOIN PLAY
                 ON PLAY.movie_id = MOVIES.movie_id
                    AND PLAY.time = '19:00'; 

Rewriting a subquery into a SEMI JOIN has the following benefits:

  • MOVIES and PLAY can use richer join algorithms. For example, use merge-join or hash-join. In the original subquery, the PLAY table is scanned multiple times. If there is no suitable index, the scanning cost may be high. After the rewrite, when you use merge-join or hash-join, only the PLAY needs to be scanned once.
  • SEMI JOIN can also be converted into INNER JOIN under certain circumstances. The query above can be transformed into the form of Q3. In Q3, the query optimizer can select V as the driver table and MOVIES as the NEST LOOP JOIN of the driven table. When the number of rows after PLAY.time = '19:00' filtering is small, and the number of MOVIES rows is large, this execution plan works well.
-- Q3: The SEMI JOIN is rewritten to the INNER JOIN.
SELECT movie_name
FROM MOVIES, (SELECT DISTINCT move_id FROM PLAY WHERE PLAY.time ='19:00') V
WHERE V.movie_id = MOVIES.movie_id;

The EXISTS/ANY comparison predicate can be rewritten to SEMI JOIN. This type of predicate mainly determines whether a record exists that meets the filter conditions. Similarly, the NOT EXISTS/ALL comparison predicates are widely used in businesses. These predicates are used to determine non-existence. OceanBase attempts to rewrite such predicates to ANTI JOIN. For example, Q4 uses a NOT IN predicate (equal to != ALL), which can be rewritten to Q5.

-- Q4: Find the movies that are not scheduled at 19:00.
SELECT movie_name
FROM MOVIES M
WHERE movie_id NOT IN
  (SELECT movie_id
   FROM PLAY P
   WHERE P.time = '19:00');

-- Q5: The NOT IN predicate is rewritten to ANTI JOIN.
SELECT movie_name
FROM MOVIES M
ANTI JOIN PLAY P ON (M.movie_id = P.movie_id
                     OR M.movie_id IS NULLOR P.movie_id IS NULL)
                    AND P.time = '19:00'

The semantics of ANTI JOIN here is – for a row in table M, if it can be joined with any row in table P, the row is filtered. Otherwise, the row is output.

It is worth noting that Q5 has a special ANTI JOIN join predicate: M.movie_id = P.movie_id OR M.movie_id IS NULL OR P.movie_id IS NULL. This is mainly to handle the case where the movie_id value is NULL. Let’s consider the following scenario: given the m1 of a row in the M table, where the m1.movie_id value is NULL, and the result set of the subquery is not empty. In Q4, the determination result of NOT IN is unknown, and the m1 is filtered out. In Q5, the OR join predicate is determined to be true. The ANTI JOIN can be successful, and the m1 is also filtered out. Most of the time, we may mistakenly believe that Q4 can be directly rewritten into the form of Q7 below. However, the semantics of the two are different. In Q7, due to m1.movie_id is NULL, the join condition of ANTI JOIN is unknown. If the connection is not successful, the m1 will not be filtered out. Therefore, Q4 and Q7 are not equivalent. Similarly, you can analyze the query behavior when the P.movie_id value is NULL. This semantic feature is worth mentioning. The particularity of NOT IN predicate when dealing with NULL values is easily ignored. We will analyze this further below.

During the Semantic and Performance Differences between NOT EXISTS and NOT IN in the process of business support, we noticed that many businesses did not realize the semantic difference between NOT EXISTS and NOT IN when processing NULL values. In some scenarios, NOT EXISTS can be used, but NOT IN is used, resulting in slow query performance or high resource overhead. Q6 uses the NOT EXISTS predicate, which is close to Q4. A big difference between the two is that the result set of Q6 will retain the result with the value of M.movie_id as NULL. Q4 will not.

Comparing Q5 with Q7, which are the rewrite results of Q4 and Q6 (respectively), we can further analyze the performance difference between the two predicates. In Q5, ANTI JOIN has no equivalent join predicate, while Q7 has an equivalent join predicate. When there is no equivalent join predicate, the query optimizer can only select nest loop join to join each row in M and P in pairs. If an equivalent join predicate exists, the query optimizer can select merge join/hash join or push the equivalent join predicate down to a filter predicate on the right table by nest loop join. In comparison, when there is an equivalent join predicate, the query can choose a richer join algorithm for better performance.

-Q6: Find movies that are not scheduled at 19:00.
SELECT movie_name
FROM MOVIES M
WHERE NOT EXISTS
  (SELECT movie_id
   FROM PLAY P
   WHERE P.time = '19:00' AND M.movie_id = P.movie_id);
   
-- Q7: The NOT EXISTS predicate rewrites the ANTI JOIN.
SELECT movie_name
FROM MOIVES M 
ANTI JOIN PLAY P ON P.time = '19:00' AND M.movie_id = P.movie_id;

Finally, we can discuss the optimization of Q5 additionally. OceanBase has designed other strategies for Q5 in terms of optimization and execution to improve query performance. Two mechanisms are briefly introduced here:

  1. If there is a non-empty constraint on movie_id in the MOVIES and PLAY tables, when rewriting, we will remove the movie_id is null determination and directly obtain the rewriting result of Q7.
  2. When the final rewrite result of Q4 is Q5, OceanBase implements the null-aware hash join algorithm to avoid generating pure NEST LOOP JOIN. The optimizer can choose null-aware hash join to join M and P to avoid repeated scans of the P table.

Summary

This article mainly introduces the strategy of rewriting subqueries in existential determination predicates into SEMI/ANTI JOIN.

This type of subquery mainly appears in the existential determination predicate (EXISTS/NOT EXISTS, ANY/ALL comparison). Its typical feature is that the result of the subquery is a set. Rewriting such subqueries into join allows the query optimizer to select a richer join algorithm. It can work with other rewriting strategies to generate a richer join order. On the whole, this is a rule-based rewriting strategy. In addition to such subqueries, many subqueries are used to calculate a single value in SQL. The next article will explain how to rewrite this type of subquery.

0 1 0
Share on

OceanBase

16 posts | 0 followers

You may also like

Comments

OceanBase

16 posts | 0 followers

Related Products