The Past and Present Lives of PolarDB Parallel Query
Background
1. PolarDB
The rise of the cloud has brought new development opportunities to the old and stubborn database market. According to Gartner's prediction, by 2022, 75% of all databases will be deployed or migrated to the cloud platform. Cloud service providers provide an excellent opportunity to overtake on a corner. If you look at Babelfish released by AWS at Re:invent 2020, you can smell how ambitious it is in the database market.
The paper [1] on Aurora published by AWS in 2017 has led the development trend of cloud-native relational databases. As the earliest manufacturer of cloud computing in China, Alibaba Cloud also launched its own cloud-native relational database in 2018. The database PolarDB is consistent with the concept of Aurora. PolarDB deeply integrates the infrastructure on the cloud. The goal is to provide customers with cloud-specific scalability, elasticity, and high availability while having lower response latency and higher concurrency. Throughput, its basic architecture is as follows:
The underlying distributed shared storage breaks through the limitation of single-machine storage capacity, and can automatically expand elastically as the user's data volume grows. The computing layer is a typical topology of one write multiple reads, using the high-speed remote access capabilities provided by RDMA to offset computing storage Additional network overhead caused by separation.
2 challenges
As can be seen from the above figure, the storage layer will allow data capacity much larger than that of a single machine (currently 128T), and even some users will appear online, and the capacity of a single table will reach the level of xx T, which is based on the traditional deployment of MySQL master-slave replication is unimaginable. At the same time, a large number of users will demand real-time analysis of business data, such as statistics, reports, etc., but everyone's intuitive impression of MySQL is: small transaction processing is fast, concurrency is strong, and analysis ability is weak. For these real-time analysis queries, how should How to deal with it?
3 options
The first thing to say is that with the development of the Internet and the explosion of data volume, certain data analysis capabilities and heterogeneous data processing capabilities have become standard configurations for transactional databases. The MySQL community also supports its own query processing in version 8.0 Capabilities have been enhanced, including transformation, hash join, and window function support for subqueries. At the same time, the PolarDB MySQL optimizer team has also done a lot of work to improve the processing capabilities of complex queries, such as enhanced statistical information and more subqueries. Such transformation, query cache, etc.
Parallel query (Parallel Query) is a query acceleration function equipped with PolarDB MySQL at the beginning of its launch. In essence, it solves a core problem: MySQL query execution is single-threaded, which cannot make full use of modern multi-core large memory hardware resources. . Through multi-threaded parallel execution to reduce the processing time including IO and CPU calculation, the response time is greatly reduced. After all, for users, if a query can be completed with 10 cores in 1 minute, it is more meaningful than 1 core in 10 minutes. In addition, all mature commercial databases also have the ability to query in parallel.
Two parallel query introduction
1 Features
Parallel query can be said to be the most important and complex functional component of PolarDB MySQL at the computing layer. With the launch of PolarDB, it has been running stably online for many years and has been continuously evolving. It has the following features:
Based entirely on the MySQL codebase, the native MySQL is 100% compatible, including
syntax compatible
type compatible
behavior compatible
0 additional cost, features shipped with product release
No additional storage resources required
No additional compute nodes required
0 maintenance cost, there is no difference between the use and ordinary query, but the response is faster
Deployed with the cluster, out of the box
No intrusion to business
Single configuration parameter (degree of parallelism)
Real-time analysis, a native part of PolarDB, benefits from the low latency of REDO physical replication
Unify underlying transactional data
submit to see
Extreme performance, with the continuous improvement of PQ, the ability to support analytical operators and complex query structures has been continuously improved
full operator parallelism
Efficient assembly line
Complex SQL structure support
Stable and reliable, as an enterprise-level feature, this is beyond doubt
Extend MySQL test system
Years of online accumulation
Complete diagnostic system
The above sounds like advertising slogans, but it is indeed the core competitiveness of parallel query.
2 evolution
The function of parallel query is continuously accumulated. From the initial PQ1.0 to PQ2.0, it has entered the research and development stage of cross-node parallelism and will be released online soon. Here we will not introduce the cross-node parallelism capability, but only focus on the existing online situation.
PQ1.0
The basic idea of the first released parallel query capability is to push down calculations, distributing as many calculations as possible to multiple workers to complete in parallel, so that heavy operations such as IO can be performed at the same time, but it is different from the general share- The nothing distributed database is different. Due to the underlying shared storage, data fragmentation in PolarDB parallelism is logical rather than physical. Each worker can see the full amount of table data. The executor part will be introduced later on logical fragmentation.
The typical form of the plan for parallel splitting is as follows:
It can be seen that there are several characteristics:
The execution mode is simple scatter-gather, that is, there is only one plan slice, and multiple workers complete the same function and aggregate it to the leader
Push down operators to workers as much as possible
The leader is responsible for completing calculations that cannot be pushed down
This solution can solve many online slow query problems and get a good acceleration effect, but it also has certain limitations.
The plan form is single, resulting in a single parallel mode of operators, such as group by + aggregation, which can only be completed through two-stage aggregation: the worker performs partial aggregation first, and the leader performs final aggregation
Once the aggregation operation is completed on the leader, if there are distinct / window function / order by, etc., it can only be completed on the leader, forming a single-point bottleneck
If there is data skew, some workers will have no work to do, resulting in poor parallel scalability
In addition, there are still some areas to be improved in the implementation, for example, a small number of operators do not support parallelism, and some complex query nesting structures do not support parallelism
Generally speaking, the parallel form of PQ1.0 is similar to the solution of the PostgreSQL community, and there is still room for improvement. After all, the parallel form of all commercial databases must be more flexible and complex.
PQ2.0
PQ2.0 makes up for the limitations mentioned above, aligns SQL Server from the execution mode, and realizes more powerful multi-stage parallelism.
Typical plans are as follows:
The first change you see is that there are multiple worker groups here. The execution plan of PQ2.0 is multi-stage, and the plan will be split into several pieces (plan slices). Each slice is completed in parallel by a group of workers. The intermediate results are passed between the slices through the exchange data channel, and the pipeline execution of the subsequent slices is triggered. Some of these enhancements include:
The new Cost-based parallel optimizer determines the optimal plan form based on statistical information and cost
Parallel support for all operators, including the complex multi-layer nested structure mentioned above, can also be completely parallel
Introduce the exchange operator, that is, support data distribution operations such as shuffle/broadcast
Introduce certain self-adaptive capabilities, even if parallel optimization is completed, dynamic adjustments can be made according to resource load conditions, such as rolling back serial or reducing parallelism
What do these changes mean? Let's look at a simple and practical example:
For the above simple query, after optimization, PQ1.0 will generate the execution plan in the figure.
In the join table set, look for a table that can be used for logical sharding. If the three tables are not enough to split enough shards, then choose the one with the most. For example, t2 is selected here, it may Twelve shards were split, but the parallelism requirement of 16 was still not met, resulting in 4 workers idling because they could not read the data.
The aggregation operation first performs partial aggregation on the worker, and aggregates on the leader. If the aggregation of groups on each worker is not good, the leader will still receive a large number of groups from below, and the leader will still have heavy aggregation calculations. If the leader calculates slowly, it will be too late to receive worker data, thereby counter-pressing the worker's execution speed and causing the overall query to slow down.
The implementation plan of PQ2.0 is as follows
Although data sharding can only be done on t2, 12 workers only need to complete the operation of t1 join t2. After the join is completed, the amount of data will generally expand, and more intermediate results will be distributed to subsequent ones through Shuffle (Repartition). slice, so as to complete the join with t3 with a higher degree of parallelism
After each worker completes the local aggregation, if there are still many groups, you can do a Shuffle based on the group by key to disperse the data to the next slice. The next group of workers will complete the heavy aggregation operation in parallel, and the subsequent order by partial Sorting, the final leader only needs to do a summary of merge sort
This solves the scalability problems caused by single-point bottlenecks and insufficient data volume, and achieves linear acceleration.
Why is linear scaling so important?
As can be seen from the above figure, as the parallelism increases, the E2E response time decreases linearly, which has two important effects on customers:
As the business grows and the data continues to expand, use matching computing resources by increasing the degree of parallelism accordingly to continuously obtain stable and predictable query performance
Consistently fast time-to-analysis drives fast business decisions, keeping businesses competitive in a rapidly changing market environment
The perfect linear acceleration is, Parallel RT = Serial RT / CPU cores, of course, this is not realistic
3 Architecture
The overall architecture of the parallel query component is as follows
The core part is included in 3 layers, from top to bottom are:
Cost-based Parallel Optimizer, embedded in the MySQL optimizer framework, completes the parallel optimization part
Parallel Plan Generator, based on the abstract parallel plan description, generates a physical execution plan that can be executed by workers
Parallel Executor, a parallel executor component, including some parallel functions within operators and data distribution functions, etc.
The implementation of each component will be described in detail later.
4 performance
Since it is a personal article, the specific execution time is hidden here (you can search it online), mainly look at the query acceleration capability of PQ2.0, here the parallelism is 32 (some students may wonder why the speedup ratio of Q6/Q12 exceeds 32 , will be mentioned in detail later)
The total number is: 100% of SQL can be accelerated, and the total speedup ratio is 18.8 times.
5 usage
From the perspective of ease of use, users only need to set one parameter to enable parallel query:
If you want to view the parallel execution plan, you only need to execute EXPLAIN / EXPLAIN FORMAT=TREE like a normal query.
Explain has made necessary enhancements to display parallel-related information, including cost, parallel mode, distribution method, etc.
Three parallel query implementation
The above is some general content, without any technical details, and the following chapters will dive into each module in turn.
1 Parallel optimizer
In PQ2.0, since the planning forms will become more diverse, it is difficult to obtain the optimal solution if the split plan only relies on simple rules and simple statistics. Therefore, we have re-implemented a set of parallel optimizers based entirely on cost.
The basic process is to further perform parallel splitting after MySQL serial optimization. Here, some students may wonder why it is not integrated like Greenplum, that is, the serial/parallel execution strategy is uniformly considered in the optimization process. The reason is that in the optimization process of MySQL, there is no clear boundary between sub-steps, and the deeply recursive join ordering algorithm and the semi-join optimization strategy selection embedded in it all make the code logic and structure more complicated, making it difficult to Integrated optimization can be achieved without a large amount of intrusion into the native code. Once the community code is seriously damaged, it is impossible to follow the community's subsequent version iterations and enjoy the community dividend.
Therefore, a two-step optimization process is adopted, which is also a common method in the industry. For example, Spark, CockroachDB, SQL Server PDW, Oceanbase, etc. have adopted similar solutions.
Enhancements to the cost model
Since it is cost-based optimization, it is necessary to be able to obtain the cost information of parallel execution of each operator in the process. For this reason, PolarDB has also done a lot of statistical information enhancement work:
Stats are automatically updated
In the serial optimization process, make reinforcements for parallel execution, such as modifying the table scanning method, etc., which is why Q6/Q12 in the above performance data will have a super-linear speedup ratio
Full operator statistical information derivation + cost calculation, supplemented with a series of cost formula and cardinality estimation derivation mechanism
Here we can only show the effect of statistical information enhancement. The benefit is not only parallel query, but also serial execution.
Adaptive Execution Policy
In earlier versions, there was a certain coupling between serial optimization and parallel optimization, and between parallel optimization and parallel plan generation. Too many, it will occupy a lot of worker threads at the same time and cause the CPU to explode. The new parallel optimizer solves this problem.
Serial optimization is decoupled from parallel optimization. Parallel optimization will rebuild the abstract operator tree and start enumeration with this as input
Parallel optimization is decoupled from parallel plan generation, and the result of optimization is an abstract description of plan sub-fragments, which are output as plan generation
In this way, the flexibility of the execution strategy is possible, and it is allowed to either return to the serial, or reduce the degree of parallelism, or enter the scheduling queue to queue up resources in the case of insufficient resources.
Cost-based exhaustive enumeration
This is a relatively large topic. Generally speaking, parallel optimization is a bottom-up, exhaustive enumeration process based on dynamic programming. The implementation idea refers to the SQL Server PDW paper[2]. Sub, enumerate possible parallel execution methods and data distribution methods, and construct physical equivalence classes based on the phsical property (distribution + order) of the output data, so as to do local pruning, obtain the optimal solution of local sub-problems and pass them to the upper layer , and finally go to the root operator to obtain the global optimal solution.
The following figure is a brief example of the enumeration process for the operator t1 NLJ t2:
After the overall enumeration is completed, a series of physical operator trees with data distribution Exchange Enforcers will be generated in the plan space, and the optimal tree can be selected based on the cost. Then, a sub-plan can be constructed with the Enforcer as the split point of the sub-plan. The abstract description of the execution plan of the series is output to the plan generator.
2 Parallel plan generation
From the perspective of engineering implementation, parallel plan generation can be said to be the part with the highest complexity and the most pitfalls in the entire component. The mechanism of physical plan clone is adopted here, that is, according to the parallel plan description generated by the optimizer, the physical execution plan of each plan fragment is cloned from the original serial plan.
Why use this method? It is still related to the mechanism of MySQL itself. MySQL optimization and execution are coupled together, and there is no clear boundary, that is, related execution structures are built during the optimization process. Therefore, there is no way to directly construct each physical execution structure based on an independent plan description, and it can only be "clone" from the serial plan, which can be said to be the root of all complexity.
The execution structure of MySQL is very complex. The cross-reference of expression (Item) and query block (SELECT_LEX), the association of inner and outer queries (Item_ref), etc., all make this task more difficult. During the process, the team also gained a deep understanding of the optimized execution structure of MySQL, and found many bugs in the community...
The simple query in the above figure is an example
Although the community has refactored the executor based on the Iterator model, in essence, the physical execution plan is still a sequence composed of QEP_TAB, in which group by+aggr is completed by a tmp table1, and order by is completed by a tmp table2.
When doing plan generation, there are two core operations:
According to the description of the serial physical plan and sub-slices, the corresponding structure is cloned into each worker thread, as shown in the lower right part of the figure above, the t1 join t2 and push-down aggregation operations performed on the worker are cloned.
refix
The original serial plan needs to be converted into a leader plan, so it is necessary to replace unnecessary execution structures and adjust some reference relationships, as shown in the upper right part of the figure above, since the t1 join t2 and some aggregation operations have been pushed down, the leader needs to remove unnecessary structure, and replace it with reading the data passed by the worker from a collector table. At the same time, it is necessary to convert the structure of the t1/t2 table referenced in the subsequent steps to the corresponding structure referencing the collector table.
Here is just the simplest example, without involving subquery and multi-stage plan, the actual project implementation cost is much higher.
3 parallel executors
PQ implements a series of intra-operator parallel mechanisms, such as logical partitioning and parallel scanning of tables, parallel hash join, etc., to make parallel execution possible or further improve performance, as well as a variety of subquery processing mechanisms, etc. Here Select some representative ones to introduce.
parallel scan
PolarDB is a shared storage, and all data is visible to all nodes. This is different from the sharding distributed system. Which part of the data is processed by different workers cannot be determined in advance, so the logical partitioning scheme is adopted:
At the btree level, the data is divided into many small fragments, and different workers are responsible for different fragments to trigger parallel execution. Here are some optimization points:
Try to do fine-grained segmentation, so that the number of shards>>the number of workers, and then the workers "grab" the shards to execute through round robin. This is a natural advantage of the shared storage system.
When splitting, it is not necessary to dive to the leaf node, that is, to use the page as the smallest partition unit to speed up the initial partition speed.
Hash join is a function introduced by community 8.0 to speed up analytical queries, and supports semi hash/anti hash/left hash join as the version evolves. PolarDB also introduces these patches to realize the complete hash join function, and Various parallel execution strategies are implemented.
Parallel hash join is supported in both phases of build/probe
In the build phase, multiple workers insert data into the same shared lock-free hash table.
In the probe phase, multiple workers search the hash table in parallel.
There is no overlap between the two stages, so that the parallelism of the whole stage is realized, but parallel hash join also has its own problems, for example, the shared hash table is too large to cause the spill to disk problem, although parallel insert is lock-free, there are still "synchronization" primitives Bring cache invalidation.
partition hash join
Partition hash join can avoid the above problems, but at the cost of introducing data shuffle overhead:
Both sides of the build/probe do shuffle according to the join key, and distribute the data to the target partition;
In each partition, the build side builds a small hash table;
In each partition, the probe side searches for the corresponding hash table;
In this way, the co-located join is completed in each partition, and each hash table is smaller to avoid disk placement. In addition, there is no concurrency problem in the build.
Which of the above two options is better? Determined by the parallel optimizer based on Cost.
Subquery parallelism - pushdown exec
Here the subquery is part of the expression and can exist in clauses such as select list / where / having.
For correlated subqueries, the only way to parallelize them is to push down the data (tables) that the outer layer depends on to the workers and execute them completely in each worker. However, since the outer layer is parallelized, the number of subquery executions in each worker is still ok decrease in equal proportion.
The EXISTS subquery is completely cloned into each worker, and is triggered repeatedly with the evaluation of the WHERE condition.
Subquery parallelism - pushdown shared
This parallel subquery can be part of an expression or a derived table.
Roughly speaking, this parallel method is suitable for non-correlated subqueries, so it can be materialized in parallel in advance to form a temporary result table. The subsequent outer layer is in parallel, and each worker can directly read from the table in parallel when referencing the subquery. Get the result data.
In addition, in the report query of online users, a very common query mode is the multi-layer nesting of derived table. For this type of SQL, the pushdown shared strategy can improve the performance of parallel execution, such as the following example:
The squares of each color in the above figure represent a layer of query block, which constitutes the nesting logic of multi-layer derived tables. Some layers are summarized by UNION ALL, and some layers are multiple tables (including derived table ) join, for such a query, MySQL will do the necessary materialization for each derived table, and form a temporary result table in the outer layer to participate in subsequent calculations, and PQ2.0 provides more general support for this common query mode, Now the execution of each layer of queries is done in parallel, striving to achieve a linear acceleration effect.
Exchanges
To generate efficient and flexible execution plans, data distribution components are essential. Currently, PolarDB supports three distribution methods: Shuffle/Broadcast/Gather, and uses lock-free shared ring buffer to achieve efficient data transmission in pipeline mode.
The following figure shows the basic form of Shuffle (Repartition)
So far, the function and implementation of the online version of parallel query have been roughly introduced.
As a mature enterprise-level functional feature, the team has also implemented a complete set of auxiliary tools to improve the usability of the product and realize the functions that can be monitored, intervened, and feedbacked, but the space here is already too large , I will not introduce it first.
Four future plans
The future plan here is not accurate because the team has done a lot of work on cross-node parallelism and entered the end of the development cycle. Cross-node parallelism will improve the complex query capabilities for massive data to another level:
* Open up computing resources between nodes to achieve higher computing parallelism
*Break through the bottleneck of a single node on IO/CPU, and make full use of the high throughput of distributed storage
*Combine global node management and resource view to balance and dispatch global computing resources to achieve load balancing while ensuring query performance
*Combined with the global consistency view to ensure the correct reading of transactional data
1. PolarDB
The rise of the cloud has brought new development opportunities to the old and stubborn database market. According to Gartner's prediction, by 2022, 75% of all databases will be deployed or migrated to the cloud platform. Cloud service providers provide an excellent opportunity to overtake on a corner. If you look at Babelfish released by AWS at Re:invent 2020, you can smell how ambitious it is in the database market.
The paper [1] on Aurora published by AWS in 2017 has led the development trend of cloud-native relational databases. As the earliest manufacturer of cloud computing in China, Alibaba Cloud also launched its own cloud-native relational database in 2018. The database PolarDB is consistent with the concept of Aurora. PolarDB deeply integrates the infrastructure on the cloud. The goal is to provide customers with cloud-specific scalability, elasticity, and high availability while having lower response latency and higher concurrency. Throughput, its basic architecture is as follows:
The underlying distributed shared storage breaks through the limitation of single-machine storage capacity, and can automatically expand elastically as the user's data volume grows. The computing layer is a typical topology of one write multiple reads, using the high-speed remote access capabilities provided by RDMA to offset computing storage Additional network overhead caused by separation.
2 challenges
As can be seen from the above figure, the storage layer will allow data capacity much larger than that of a single machine (currently 128T), and even some users will appear online, and the capacity of a single table will reach the level of xx T, which is based on the traditional deployment of MySQL master-slave replication is unimaginable. At the same time, a large number of users will demand real-time analysis of business data, such as statistics, reports, etc., but everyone's intuitive impression of MySQL is: small transaction processing is fast, concurrency is strong, and analysis ability is weak. For these real-time analysis queries, how should How to deal with it?
3 options
The first thing to say is that with the development of the Internet and the explosion of data volume, certain data analysis capabilities and heterogeneous data processing capabilities have become standard configurations for transactional databases. The MySQL community also supports its own query processing in version 8.0 Capabilities have been enhanced, including transformation, hash join, and window function support for subqueries. At the same time, the PolarDB MySQL optimizer team has also done a lot of work to improve the processing capabilities of complex queries, such as enhanced statistical information and more subqueries. Such transformation, query cache, etc.
Parallel query (Parallel Query) is a query acceleration function equipped with PolarDB MySQL at the beginning of its launch. In essence, it solves a core problem: MySQL query execution is single-threaded, which cannot make full use of modern multi-core large memory hardware resources. . Through multi-threaded parallel execution to reduce the processing time including IO and CPU calculation, the response time is greatly reduced. After all, for users, if a query can be completed with 10 cores in 1 minute, it is more meaningful than 1 core in 10 minutes. In addition, all mature commercial databases also have the ability to query in parallel.
Two parallel query introduction
1 Features
Parallel query can be said to be the most important and complex functional component of PolarDB MySQL at the computing layer. With the launch of PolarDB, it has been running stably online for many years and has been continuously evolving. It has the following features:
Based entirely on the MySQL codebase, the native MySQL is 100% compatible, including
syntax compatible
type compatible
behavior compatible
0 additional cost, features shipped with product release
No additional storage resources required
No additional compute nodes required
0 maintenance cost, there is no difference between the use and ordinary query, but the response is faster
Deployed with the cluster, out of the box
No intrusion to business
Single configuration parameter (degree of parallelism)
Real-time analysis, a native part of PolarDB, benefits from the low latency of REDO physical replication
Unify underlying transactional data
submit to see
Extreme performance, with the continuous improvement of PQ, the ability to support analytical operators and complex query structures has been continuously improved
full operator parallelism
Efficient assembly line
Complex SQL structure support
Stable and reliable, as an enterprise-level feature, this is beyond doubt
Extend MySQL test system
Years of online accumulation
Complete diagnostic system
The above sounds like advertising slogans, but it is indeed the core competitiveness of parallel query.
2 evolution
The function of parallel query is continuously accumulated. From the initial PQ1.0 to PQ2.0, it has entered the research and development stage of cross-node parallelism and will be released online soon. Here we will not introduce the cross-node parallelism capability, but only focus on the existing online situation.
PQ1.0
The basic idea of the first released parallel query capability is to push down calculations, distributing as many calculations as possible to multiple workers to complete in parallel, so that heavy operations such as IO can be performed at the same time, but it is different from the general share- The nothing distributed database is different. Due to the underlying shared storage, data fragmentation in PolarDB parallelism is logical rather than physical. Each worker can see the full amount of table data. The executor part will be introduced later on logical fragmentation.
The typical form of the plan for parallel splitting is as follows:
It can be seen that there are several characteristics:
The execution mode is simple scatter-gather, that is, there is only one plan slice, and multiple workers complete the same function and aggregate it to the leader
Push down operators to workers as much as possible
The leader is responsible for completing calculations that cannot be pushed down
This solution can solve many online slow query problems and get a good acceleration effect, but it also has certain limitations.
The plan form is single, resulting in a single parallel mode of operators, such as group by + aggregation, which can only be completed through two-stage aggregation: the worker performs partial aggregation first, and the leader performs final aggregation
Once the aggregation operation is completed on the leader, if there are distinct / window function / order by, etc., it can only be completed on the leader, forming a single-point bottleneck
If there is data skew, some workers will have no work to do, resulting in poor parallel scalability
In addition, there are still some areas to be improved in the implementation, for example, a small number of operators do not support parallelism, and some complex query nesting structures do not support parallelism
Generally speaking, the parallel form of PQ1.0 is similar to the solution of the PostgreSQL community, and there is still room for improvement. After all, the parallel form of all commercial databases must be more flexible and complex.
PQ2.0
PQ2.0 makes up for the limitations mentioned above, aligns SQL Server from the execution mode, and realizes more powerful multi-stage parallelism.
Typical plans are as follows:
The first change you see is that there are multiple worker groups here. The execution plan of PQ2.0 is multi-stage, and the plan will be split into several pieces (plan slices). Each slice is completed in parallel by a group of workers. The intermediate results are passed between the slices through the exchange data channel, and the pipeline execution of the subsequent slices is triggered. Some of these enhancements include:
The new Cost-based parallel optimizer determines the optimal plan form based on statistical information and cost
Parallel support for all operators, including the complex multi-layer nested structure mentioned above, can also be completely parallel
Introduce the exchange operator, that is, support data distribution operations such as shuffle/broadcast
Introduce certain self-adaptive capabilities, even if parallel optimization is completed, dynamic adjustments can be made according to resource load conditions, such as rolling back serial or reducing parallelism
What do these changes mean? Let's look at a simple and practical example:
For the above simple query, after optimization, PQ1.0 will generate the execution plan in the figure.
In the join table set, look for a table that can be used for logical sharding. If the three tables are not enough to split enough shards, then choose the one with the most. For example, t2 is selected here, it may Twelve shards were split, but the parallelism requirement of 16 was still not met, resulting in 4 workers idling because they could not read the data.
The aggregation operation first performs partial aggregation on the worker, and aggregates on the leader. If the aggregation of groups on each worker is not good, the leader will still receive a large number of groups from below, and the leader will still have heavy aggregation calculations. If the leader calculates slowly, it will be too late to receive worker data, thereby counter-pressing the worker's execution speed and causing the overall query to slow down.
The implementation plan of PQ2.0 is as follows
Although data sharding can only be done on t2, 12 workers only need to complete the operation of t1 join t2. After the join is completed, the amount of data will generally expand, and more intermediate results will be distributed to subsequent ones through Shuffle (Repartition). slice, so as to complete the join with t3 with a higher degree of parallelism
After each worker completes the local aggregation, if there are still many groups, you can do a Shuffle based on the group by key to disperse the data to the next slice. The next group of workers will complete the heavy aggregation operation in parallel, and the subsequent order by partial Sorting, the final leader only needs to do a summary of merge sort
This solves the scalability problems caused by single-point bottlenecks and insufficient data volume, and achieves linear acceleration.
Why is linear scaling so important?
As can be seen from the above figure, as the parallelism increases, the E2E response time decreases linearly, which has two important effects on customers:
As the business grows and the data continues to expand, use matching computing resources by increasing the degree of parallelism accordingly to continuously obtain stable and predictable query performance
Consistently fast time-to-analysis drives fast business decisions, keeping businesses competitive in a rapidly changing market environment
The perfect linear acceleration is, Parallel RT = Serial RT / CPU cores, of course, this is not realistic
3 Architecture
The overall architecture of the parallel query component is as follows
The core part is included in 3 layers, from top to bottom are:
Cost-based Parallel Optimizer, embedded in the MySQL optimizer framework, completes the parallel optimization part
Parallel Plan Generator, based on the abstract parallel plan description, generates a physical execution plan that can be executed by workers
Parallel Executor, a parallel executor component, including some parallel functions within operators and data distribution functions, etc.
The implementation of each component will be described in detail later.
4 performance
Since it is a personal article, the specific execution time is hidden here (you can search it online), mainly look at the query acceleration capability of PQ2.0, here the parallelism is 32 (some students may wonder why the speedup ratio of Q6/Q12 exceeds 32 , will be mentioned in detail later)
The total number is: 100% of SQL can be accelerated, and the total speedup ratio is 18.8 times.
5 usage
From the perspective of ease of use, users only need to set one parameter to enable parallel query:
If you want to view the parallel execution plan, you only need to execute EXPLAIN / EXPLAIN FORMAT=TREE like a normal query.
Explain has made necessary enhancements to display parallel-related information, including cost, parallel mode, distribution method, etc.
Three parallel query implementation
The above is some general content, without any technical details, and the following chapters will dive into each module in turn.
1 Parallel optimizer
In PQ2.0, since the planning forms will become more diverse, it is difficult to obtain the optimal solution if the split plan only relies on simple rules and simple statistics. Therefore, we have re-implemented a set of parallel optimizers based entirely on cost.
The basic process is to further perform parallel splitting after MySQL serial optimization. Here, some students may wonder why it is not integrated like Greenplum, that is, the serial/parallel execution strategy is uniformly considered in the optimization process. The reason is that in the optimization process of MySQL, there is no clear boundary between sub-steps, and the deeply recursive join ordering algorithm and the semi-join optimization strategy selection embedded in it all make the code logic and structure more complicated, making it difficult to Integrated optimization can be achieved without a large amount of intrusion into the native code. Once the community code is seriously damaged, it is impossible to follow the community's subsequent version iterations and enjoy the community dividend.
Therefore, a two-step optimization process is adopted, which is also a common method in the industry. For example, Spark, CockroachDB, SQL Server PDW, Oceanbase, etc. have adopted similar solutions.
Enhancements to the cost model
Since it is cost-based optimization, it is necessary to be able to obtain the cost information of parallel execution of each operator in the process. For this reason, PolarDB has also done a lot of statistical information enhancement work:
Stats are automatically updated
In the serial optimization process, make reinforcements for parallel execution, such as modifying the table scanning method, etc., which is why Q6/Q12 in the above performance data will have a super-linear speedup ratio
Full operator statistical information derivation + cost calculation, supplemented with a series of cost formula and cardinality estimation derivation mechanism
Here we can only show the effect of statistical information enhancement. The benefit is not only parallel query, but also serial execution.
Adaptive Execution Policy
In earlier versions, there was a certain coupling between serial optimization and parallel optimization, and between parallel optimization and parallel plan generation. Too many, it will occupy a lot of worker threads at the same time and cause the CPU to explode. The new parallel optimizer solves this problem.
Serial optimization is decoupled from parallel optimization. Parallel optimization will rebuild the abstract operator tree and start enumeration with this as input
Parallel optimization is decoupled from parallel plan generation, and the result of optimization is an abstract description of plan sub-fragments, which are output as plan generation
In this way, the flexibility of the execution strategy is possible, and it is allowed to either return to the serial, or reduce the degree of parallelism, or enter the scheduling queue to queue up resources in the case of insufficient resources.
Cost-based exhaustive enumeration
This is a relatively large topic. Generally speaking, parallel optimization is a bottom-up, exhaustive enumeration process based on dynamic programming. The implementation idea refers to the SQL Server PDW paper[2]. Sub, enumerate possible parallel execution methods and data distribution methods, and construct physical equivalence classes based on the phsical property (distribution + order) of the output data, so as to do local pruning, obtain the optimal solution of local sub-problems and pass them to the upper layer , and finally go to the root operator to obtain the global optimal solution.
The following figure is a brief example of the enumeration process for the operator t1 NLJ t2:
After the overall enumeration is completed, a series of physical operator trees with data distribution Exchange Enforcers will be generated in the plan space, and the optimal tree can be selected based on the cost. Then, a sub-plan can be constructed with the Enforcer as the split point of the sub-plan. The abstract description of the execution plan of the series is output to the plan generator.
2 Parallel plan generation
From the perspective of engineering implementation, parallel plan generation can be said to be the part with the highest complexity and the most pitfalls in the entire component. The mechanism of physical plan clone is adopted here, that is, according to the parallel plan description generated by the optimizer, the physical execution plan of each plan fragment is cloned from the original serial plan.
Why use this method? It is still related to the mechanism of MySQL itself. MySQL optimization and execution are coupled together, and there is no clear boundary, that is, related execution structures are built during the optimization process. Therefore, there is no way to directly construct each physical execution structure based on an independent plan description, and it can only be "clone" from the serial plan, which can be said to be the root of all complexity.
The execution structure of MySQL is very complex. The cross-reference of expression (Item) and query block (SELECT_LEX), the association of inner and outer queries (Item_ref), etc., all make this task more difficult. During the process, the team also gained a deep understanding of the optimized execution structure of MySQL, and found many bugs in the community...
The simple query in the above figure is an example
Although the community has refactored the executor based on the Iterator model, in essence, the physical execution plan is still a sequence composed of QEP_TAB, in which group by+aggr is completed by a tmp table1, and order by is completed by a tmp table2.
When doing plan generation, there are two core operations:
According to the description of the serial physical plan and sub-slices, the corresponding structure is cloned into each worker thread, as shown in the lower right part of the figure above, the t1 join t2 and push-down aggregation operations performed on the worker are cloned.
refix
The original serial plan needs to be converted into a leader plan, so it is necessary to replace unnecessary execution structures and adjust some reference relationships, as shown in the upper right part of the figure above, since the t1 join t2 and some aggregation operations have been pushed down, the leader needs to remove unnecessary structure, and replace it with reading the data passed by the worker from a collector table. At the same time, it is necessary to convert the structure of the t1/t2 table referenced in the subsequent steps to the corresponding structure referencing the collector table.
Here is just the simplest example, without involving subquery and multi-stage plan, the actual project implementation cost is much higher.
3 parallel executors
PQ implements a series of intra-operator parallel mechanisms, such as logical partitioning and parallel scanning of tables, parallel hash join, etc., to make parallel execution possible or further improve performance, as well as a variety of subquery processing mechanisms, etc. Here Select some representative ones to introduce.
parallel scan
PolarDB is a shared storage, and all data is visible to all nodes. This is different from the sharding distributed system. Which part of the data is processed by different workers cannot be determined in advance, so the logical partitioning scheme is adopted:
At the btree level, the data is divided into many small fragments, and different workers are responsible for different fragments to trigger parallel execution. Here are some optimization points:
Try to do fine-grained segmentation, so that the number of shards>>the number of workers, and then the workers "grab" the shards to execute through round robin. This is a natural advantage of the shared storage system.
When splitting, it is not necessary to dive to the leaf node, that is, to use the page as the smallest partition unit to speed up the initial partition speed.
Hash join is a function introduced by community 8.0 to speed up analytical queries, and supports semi hash/anti hash/left hash join as the version evolves. PolarDB also introduces these patches to realize the complete hash join function, and Various parallel execution strategies are implemented.
Parallel hash join is supported in both phases of build/probe
In the build phase, multiple workers insert data into the same shared lock-free hash table.
In the probe phase, multiple workers search the hash table in parallel.
There is no overlap between the two stages, so that the parallelism of the whole stage is realized, but parallel hash join also has its own problems, for example, the shared hash table is too large to cause the spill to disk problem, although parallel insert is lock-free, there are still "synchronization" primitives Bring cache invalidation.
partition hash join
Partition hash join can avoid the above problems, but at the cost of introducing data shuffle overhead:
Both sides of the build/probe do shuffle according to the join key, and distribute the data to the target partition;
In each partition, the build side builds a small hash table;
In each partition, the probe side searches for the corresponding hash table;
In this way, the co-located join is completed in each partition, and each hash table is smaller to avoid disk placement. In addition, there is no concurrency problem in the build.
Which of the above two options is better? Determined by the parallel optimizer based on Cost.
Subquery parallelism - pushdown exec
Here the subquery is part of the expression and can exist in clauses such as select list / where / having.
For correlated subqueries, the only way to parallelize them is to push down the data (tables) that the outer layer depends on to the workers and execute them completely in each worker. However, since the outer layer is parallelized, the number of subquery executions in each worker is still ok decrease in equal proportion.
The EXISTS subquery is completely cloned into each worker, and is triggered repeatedly with the evaluation of the WHERE condition.
Subquery parallelism - pushdown shared
This parallel subquery can be part of an expression or a derived table.
Roughly speaking, this parallel method is suitable for non-correlated subqueries, so it can be materialized in parallel in advance to form a temporary result table. The subsequent outer layer is in parallel, and each worker can directly read from the table in parallel when referencing the subquery. Get the result data.
In addition, in the report query of online users, a very common query mode is the multi-layer nesting of derived table. For this type of SQL, the pushdown shared strategy can improve the performance of parallel execution, such as the following example:
The squares of each color in the above figure represent a layer of query block, which constitutes the nesting logic of multi-layer derived tables. Some layers are summarized by UNION ALL, and some layers are multiple tables (including derived table ) join, for such a query, MySQL will do the necessary materialization for each derived table, and form a temporary result table in the outer layer to participate in subsequent calculations, and PQ2.0 provides more general support for this common query mode, Now the execution of each layer of queries is done in parallel, striving to achieve a linear acceleration effect.
Exchanges
To generate efficient and flexible execution plans, data distribution components are essential. Currently, PolarDB supports three distribution methods: Shuffle/Broadcast/Gather, and uses lock-free shared ring buffer to achieve efficient data transmission in pipeline mode.
The following figure shows the basic form of Shuffle (Repartition)
So far, the function and implementation of the online version of parallel query have been roughly introduced.
As a mature enterprise-level functional feature, the team has also implemented a complete set of auxiliary tools to improve the usability of the product and realize the functions that can be monitored, intervened, and feedbacked, but the space here is already too large , I will not introduce it first.
Four future plans
The future plan here is not accurate because the team has done a lot of work on cross-node parallelism and entered the end of the development cycle. Cross-node parallelism will improve the complex query capabilities for massive data to another level:
* Open up computing resources between nodes to achieve higher computing parallelism
*Break through the bottleneck of a single node on IO/CPU, and make full use of the high throughput of distributed storage
*Combine global node management and resource view to balance and dispatch global computing resources to achieve load balancing while ensuring query performance
*Combined with the global consistency view to ensure the correct reading of transactional data
Related Articles
-
A detailed explanation of Hadoop core architecture HDFS
Knowledge Base Team
-
What Does IOT Mean
Knowledge Base Team
-
6 Optional Technologies for Data Storage
Knowledge Base Team
-
What Is Blockchain Technology
Knowledge Base Team
Explore More Special Offers
-
Short Message Service(SMS) & Mail Service
50,000 email package starts as low as USD 1.99, 120 short messages start at only USD 1.00