Learning about Distributed Systems - Part 20: Rule-Based Optimization (RBO)

Detail Is a Burden, but Getting Rid of It Has a Price

In the previous articles, we mentioned that after SQL was introduced to MapReduce and Spark, Hive and Spark SQL were created, thus significantly improving the development performance.

Then, inspired by the performance optimization of Shuffle, we optimized the performance of SQL Join.

When talking about the performance optimization in Join, I didn't say it clearly, but you may feel a bad trend.

A declarative language like SQL reduces development costs by hiding implementation details. However, this prevents automatic performance optimization.

To a large extent, the performance of SQL is no longer under the control of developers.

This is acceptable. After all, the level of developers varies. Some of them can write a good MapReduce program and perform RDD operations well, but most of them probably need a lot of time to accumulate experience.

Therefore, it is a good choice to give the initiative to the SQL engine and then brainstorm to implement a common optimization experience in a unified way.

Detail is a burden, and getting rid of it has a price, but we can make up for the price with experience.

In SQL, experience is summarized as rules, and this type of performance optimization is done in the Query Optimizator, which is what we will talk about as Rule-Based Optimization (RBO).

Learn from Old Technology

The fact is that many of the new things in the computer industry, especially in the Internet industry, are just a combination of old technology and new scenarios. Then, developers give these new things cool names, the so-called old wine in a new bottle.

Although we have to solve many of the problems mentioned in our previous articles, SQL in the distributed field is essentially the implementation of SQL of traditional relational databases under the multi-machine expansion scenario.

Fortunately, this means many years of experience in traditional relational databases can be used in distributed SQL. RBO is also included.

The preceding figure shows some RBO paths supported by Oracle. Oracle parses and matches query statements based on the supported rules and then executes the query statement with the highest priority. The smaller the number, the higher the priority.

Take Path 1 and Path 15 as an example. When we query a piece of data through Rowid, since the primary key has an index, it is clear that using the index will be much faster than the full table scan, so Path 1 will be automatically selected instead of Path 15.

As long as these tried-and-tested rules are implemented again, distributed SQL will have a basic RBO.

Let's take a look at Hive and Spark SQL we focused on earlier.

Since RBO is fixed, Hive, as one of the open-source projects of Apache, does not choose to reinvent the wheel. Instead, it uses the famous Apache Calcite as an optimizer (including CBO, which will be discussed in the next article). I have written several articles about Calcite previously. Links will be posted at the end of the article, so I will not discuss it here.

The following figure shows some rules currently supported by Calcite. You can check the implementation if interested. You can find it in Calcite's HepPlanner and org.apache.calcite.plan.hep packages.

As for Spark, as I said in the previous article, it is a very ambitious project. Instead of using the ready-made Calcite, it builds a wheel called Catalyst.

In the org.apache.spark.sql.catalyst.optimizer.Optimizer Spark SQL class, we can see the following rules defined:

With these rules, Hive and Spark SQL can implement a lot of common optimizations. With the most basic performance guarantee available, the performance of SQL written by the worst novice writer can be guaranteed.

How Do the Rules Come into Effect?

From the compilation principle, we know that after a parser processes a program, it will generate an abstract syntax tree (AST). SQL is also a language that eventually has to be compiled into something the underlying execution engine can handle.

Therefore, the general process of RBO can be understood as the equivalent conversion of AST by applying different rules.

I discussed Join in the previous article. Let's look at an example of Join and match it with the most conventional rule, Predicate Pushdown.

First, there are two tables:

create table user(id int, name string);
create table order(id int, uid int, item_id int);

The query statement is listed below:

select
  u.id as user_id, u.name as user_name, o.id as order_id
from
  user u
join
  order o
on
  u.id = o.uid
where
  u.id > 100;

The initial parsed execution plan is to scan the two tables, Join to match data, and filter out the data with u.id > 100.

Predicate pushdown, as the name implies, is to push down judgment conditions to the storage engine to filter out unnecessary data in advance. This process can significantly improve performance.

The following diagram shows this process:

Let's use the EXPLAIN statement to see the actual execution plan.

In Hive, we can see that the Filter operation is performed on both tables before they are joined.

The process can be seen more clearly in the following figure of Spark SQL:

Filter is performed after Join in the Analyzed Logical Plan, but the two tables are filtered first and then joined in the Optimized Logical Plan.

There are many simple and practical rules like Predicate Pushdown. For example, Projection Pushdown removes irrelevant columns when scanning tables, and Filter Combine combines multiple Filters and then filters them. Due to space limitations, no more examples will be given.

This way, through the application of a series of rules, equivalent transformations are continuously made to the AST until the optimal logical plan is found. Then, the optimal logical plan can be converted into a physical execution plan and handed over to the execution engine to execute.

Limitations of RBO

RBO can play a very good optimization effect because it brings together a lot of experience and integrates it into rules that can be executed. However, these rules lead to limitations.

Since the rules are rigid:

Rules cannot be exhaustive. Although they can be gradually improved, there are always scenarios that cannot be covered.
Rules may be contradictory. Independent rules will not consider the relationship with other rules. Once they are contradictory, it is difficult to choose.
It is difficult to determine the priority of rules. Different priorities may lead to different optimization effects.
Rules can contribute to a huge gap in the effect of different data.

For example, there is a very simple rule. If a column is indexed, when it is used as a query condition, the index should be used instead of a full table scan.

However, suppose a company has a personnel table with an indexed field of "gender." If we need to check the regional distribution of male employees and find that over 90% of the employees are male, using the index to find all male employees and then checking their regional distribution would result in much poorer performance compared to a direct full table scan.

RBO does not solve all performance optimization problems, and it may even be more of a hindrance than a help.

The root cause is that any rules, not just those in SQL optimization, as the accumulation of experience, cannot solve all problems because experience is the generalization of common things. There must be insufficient samples, information loss, and overfitting in this process. It is the limitation of experience, which is reflected in the appearance of inflexibility or occasional malfunction.

TL;DR

This article offers a rough understanding of what cost-based optimization is all about. Let's briefly summarize the main points:

SQL improves development efficiency by hiding implementation details but loses the possibility of automatic performance optimization.
Fortunately, there are some common performance optimization methods that can be implemented uniformly: rule-based optimization (RBO).
As traditional relational databases (such as Oracle) have accumulated a wealth of RBO experience, distributed SQL only needs to do the same.
RBO is essentially an equivalent transformation of the AST by applying rules until the optimal plan is found.
RBO is experience-based, and any experience is flawed. For example, different data content will produce disparate results.

Therefore, we need another idea to continue SQL performance optimization, one not as indirect as RBO and one that is more flexible.

In the next article, let's learn about cost-based optimization (CBO).

This is a carefully conceived series of 20-30 articles. I hope to give everyone a core grasp of the distributed system in a storytelling way. Stay tuned for the next one!

Community

Learning about Distributed Systems - Part 20: Rule-Based Optimization (RBO)

Detail Is a Burden, but Getting Rid of It Has a Price

Learn from Old Technology

How Do the Rules Come into Effect?

Limitations of RBO

TL;DR

Read previous post:

Read next post:

Alibaba Cloud_Academy

You may also like

Comments

Alibaba Cloud_Academy

Related Products

Hybrid Cloud Distributed Storage

Storage Capacity Unit

Quick Starts

Hybrid Cloud Storage