Disclaimer: This is a translated work of Qinxia's 漫谈分布式系统. All rights reserved to the original author.
In the previous articles, we mentioned that after SQL was introduced to MapReduce and Spark, Hive and Spark SQL were created, thus significantly improving the development performance.
Then, inspired by the performance optimization of Shuffle, we optimized the performance of SQL Join.
When talking about the performance optimization in Join, I didn't say it clearly, but you may feel a bad trend.
A declarative language like SQL reduces development costs by hiding implementation details. However, this prevents automatic performance optimization.
To a large extent, the performance of SQL is no longer under the control of developers.
This is acceptable. After all, the level of developers varies. Some of them can write a good MapReduce program and perform RDD operations well, but most of them probably need a lot of time to accumulate experience.
Therefore, it is a good choice to give the initiative to the SQL engine and then brainstorm to implement a common optimization experience in a unified way.
Detail is a burden, and getting rid of it has a price, but we can make up for the price with experience.
In SQL, experience is summarized as rules, and this type of performance optimization is done in the Query Optimizator, which is what we will talk about as Rule-Based Optimization (RBO).
The fact is that many of the new things in the computer industry, especially in the Internet industry, are just a combination of old technology and new scenarios. Then, developers give these new things cool names, the so-called old wine in a new bottle.
Although we have to solve many of the problems mentioned in our previous articles, SQL in the distributed field is essentially the implementation of SQL of traditional relational databases under the multi-machine expansion scenario.
Fortunately, this means many years of experience in traditional relational databases can be used in distributed SQL. RBO is also included.
The preceding figure shows some RBO paths supported by Oracle. Oracle parses and matches query statements based on the supported rules and then executes the query statement with the highest priority. The smaller the number, the higher the priority.
Take Path 1 and Path 15 as an example. When we query a piece of data through Rowid, since the primary key has an index, it is clear that using the index will be much faster than the full table scan, so Path 1 will be automatically selected instead of Path 15.
As long as these tried-and-tested rules are implemented again, distributed SQL will have a basic RBO.
Let's take a look at Hive and Spark SQL we focused on earlier.
Since RBO is fixed, Hive, as one of the open-source projects of Apache, does not choose to reinvent the wheel. Instead, it uses the famous Apache Calcite as an optimizer (including CBO, which will be discussed in the next article). I have written several articles about Calcite previously. Links will be posted at the end of the article, so I will not discuss it here.
The following figure shows some rules currently supported by Calcite. You can check the implementation if interested. You can find it in Calcite's HepPlanner and
As for Spark, as I said in the previous article, it is a very ambitious project. Instead of using the ready-made Calcite, it builds a wheel called Catalyst.
org.apache.spark.sql.catalyst.optimizer.Optimizer Spark SQL class, we can see the following rules defined:
With these rules, Hive and Spark SQL can implement a lot of common optimizations. With the most basic performance guarantee available, the performance of SQL written by the worst novice writer can be guaranteed.
From the compilation principle, we know that after a parser processes a program, it will generate an abstract syntax tree (AST). SQL is also a language that eventually has to be compiled into something the underlying execution engine can handle.
Therefore, the general process of RBO can be understood as the equivalent conversion of AST by applying different rules.
I discussed Join in the previous article. Let's look at an example of Join and match it with the most conventional rule, Predicate Pushdown.
First, there are two tables:
create table user(id int, name string); create table order(id int, uid int, item_id int);
The query statement is listed below:
select u.id as user_id, u.name as user_name, o.id as order_id from user u join order o on u.id = o.uid where u.id > 100;
The initial parsed execution plan is to scan the two tables, Join to match data, and filter out the data with u.id > 100.
Predicate pushdown, as the name implies, is to push down judgment conditions to the storage engine to filter out unnecessary data in advance. This process can significantly improve performance.
The following diagram shows this process:
Let's use the EXPLAIN statement to see the actual execution plan.
In Hive, we can see that the Filter operation is performed on both tables before they are joined.
The process can be seen more clearly in the following figure of Spark SQL:
Filter is performed after Join in the Analyzed Logical Plan, but the two tables are filtered first and then joined in the Optimized Logical Plan.
There are many simple and practical rules like Predicate Pushdown. For example, Projection Pushdown removes irrelevant columns when scanning tables, and Filter Combine combines multiple Filters and then filters them. Due to space limitations, no more examples will be given.
This way, through the application of a series of rules, equivalent transformations are continuously made to the AST until the optimal logical plan is found. Then, the optimal logical plan can be converted into a physical execution plan and handed over to the execution engine to execute.
RBO can play a very good optimization effect because it brings together a lot of experience and integrates it into rules that can be executed. However, these rules lead to limitations.
Since the rules are rigid:
For example, there is a very simple rule. If a column is indexed, when it is used as a query condition, the index should be used instead of a full table scan.
However, suppose a company has a personnel table with an indexed field of "gender." If we need to check the regional distribution of male employees and find that over 90% of the employees are male, using the index to find all male employees and then checking their regional distribution would result in much poorer performance compared to a direct full table scan.
RBO does not solve all performance optimization problems, and it may even be more of a hindrance than a help.
The root cause is that any rules, not just those in SQL optimization, as the accumulation of experience, cannot solve all problems because experience is the generalization of common things. There must be insufficient samples, information loss, and overfitting in this process. It is the limitation of experience, which is reflected in the appearance of inflexibility or occasional malfunction.
This article offers a rough understanding of what cost-based optimization is all about. Let's briefly summarize the main points:
Therefore, we need another idea to continue SQL performance optimization, one not as indirect as RBO and one that is more flexible.
In the next article, let's learn about cost-based optimization (CBO).
This is a carefully conceived series of 20-30 articles. I hope to give everyone a core grasp of the distributed system in a storytelling way. Stay tuned for the next one!
Alibaba Cloud_Academy - August 25, 2023
Alibaba Cloud_Academy - July 24, 2023
Alibaba Cloud_Academy - August 25, 2023
Alibaba Cloud MaxCompute - September 18, 2019
Alibaba EMR - April 21, 2022
ApsaraDB - July 26, 2023
Provides scalable, distributed, and high-performance block storage and object storage services in a software-defined manner.Learn More
Plan and optimize your storage budget with flexible storage servicesLearn More
Block-level data storage attached to ECS instances to achieve high performance, low latency, and high reliabilityLearn More
Deploy custom Alibaba Cloud solutions for business-critical scenarios with Quick Start templates.Learn More
More Posts by Alibaba Cloud_Academy