The Road to Transparency with PolarDB-X
Introduction: PolarDB-X began as TDDL, the sharding (sub-database and sub-table) middleware used by Taobao (2007, in the form of a Java library). It was first offered on Alibaba Cloud under the DRDS brand (development started in 2012, launched in 2014, in the form of sharding middleware plus a MySQL proxy), and in 2019 it was officially transformed into the distributed database PolarDB-X, becoming a member of the PolarDB brand. From middleware to distributed database, we have spent more than 10 years building a distributed database with MySQL as storage. We have accumulated a lot of technology and taken some detours along the way, and we will keep moving forward.
The development of PolarDB-X falls into two main stages: middleware (DRDS) and database (PolarDB-X). The two stages differ enormously. The author has worked on PolarDB-X for exactly ten years and has been through the entire process. Today I will talk about some interesting things in the development and transformation of PolarDB-X.
Middleware Era (2012~2019)
The development idea in the DRDS period was actually very simple: meet a few main demands of users:
• The maximum storage space of a single RDS MySQL instance on Alibaba Cloud is limited (for example, only 2 TB in the early days).
Compatibility with stand-alone indexes
In addition, indexes in a stand-alone database have some very natural behaviors that a global index needs to be compatible with.
For example:
• Indexes can be created directly through DDL statements, without relying on a variety of peripheral tools.
• Prefix queries. In a stand-alone database, an index supports prefix queries well. How should a global index handle them?
• Hotspots (the big key problem). In a stand-alone database, an index with low selectivity (for example, an index on gender) is not a serious problem beyond a slight waste of resources. In a distributed database, however, such a low-selectivity index becomes a hotspot, and a few hot nodes turn into the bottleneck of the entire cluster. A global index needs a corresponding way to solve this.
Index creation speed, the performance of looking up the base table from the index, functional limitations of the index, clustered indexes, the storage cost of the index, and so on also greatly affect the experience of using global indexes. They are not expanded on here.
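The hotspot problem described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `shard_of` hash routing, the 8-shard cluster, the sample rows, and the idea of spreading entries by adding the primary key to the partition key are all hypothetical and are not taken from PolarDB-X itself.

```python
import hashlib
from collections import Counter


def shard_of(key: str, shards: int) -> int:
    """Map a partition key to a shard with a stable hash (illustrative only)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shards


def index_distribution(rows, shards, key_fn):
    """Count how many index entries land on each shard."""
    counts = Counter({s: 0 for s in range(shards)})
    for row in rows:
        counts[shard_of(key_fn(row), shards)] += 1
    return counts


# Hypothetical data: 10,000 users whose gender alternates between two values.
rows = [{"id": i, "gender": "F" if i % 2 == 0 else "M"} for i in range(10_000)]
SHARDS = 8

# Index partitioned by the indexed column only: with two distinct values,
# at most two shards can ever receive index entries.
by_gender = index_distribution(rows, SHARDS, lambda r: r["gender"])

# Assumed mitigation: partition by (gender, id) so the entries spread out.
by_gender_and_id = index_distribution(
    rows, SHARDS, lambda r: f'{r["gender"]}:{r["id"]}'
)

busy_shards_naive = sum(1 for c in by_gender.values() if c > 0)
busy_shards_spread = sum(1 for c in by_gender_and_id.values() if c > 0)
print("shards holding data (gender only):", busy_shards_naive)
print("shards holding data (gender + id):", busy_shards_spread)
```

The spread version removes the hot shards in this toy setup, but it has a cost: a lookup by gender alone can no longer be routed to a single shard. A real system has to weigh that trade-off.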
Number of indexes
These requirements on global indexes essentially stem from the number of global indexes.
In a database with good transparency, every index is a global index, so the number of global indexes is very large (comparable to the number of secondary indexes on a single table in a stand-alone database). The more there are, the higher the requirements.
However, in distributed databases that are not fully prepared, even where global indexes exist, you will find that the recommended usage still depends heavily on the partition key.
They make creating a global index an optional, special thing. As a result, the business becomes very cautious about using global indexes, and naturally their number stays very limited.
When the number and usage scenarios of global indexes are strictly limited, the drawbacks of not doing the above well become less important.
Query Optimizer for Index Selection
We know that the core working mechanism of the database optimizer is:
1. Enumerate possible execution plans
2. Find the least expensive of these execution plans
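The two steps can be sketched as follows. This is a toy model, not the PolarDB-X optimizer: the table names, row counts, the single `JOIN_SELECTIVITY` value, and the cost formula (the sum of intermediate result sizes) are all assumptions made for illustration.

```python
from itertools import permutations

# Hypothetical row counts and one assumed selectivity for every join.
ROW_COUNTS = {"orders": 1_000_000, "users": 10_000, "regions": 10}
JOIN_SELECTIVITY = 0.001


def plan_cost(order):
    """Toy cost of a left-deep join order: sum of intermediate result sizes."""
    rows = ROW_COUNTS[order[0]]
    cost = 0.0
    for table in order[1:]:
        rows = rows * ROW_COUNTS[table] * JOIN_SELECTIVITY
        cost += rows
    return cost


def choose_plan(tables):
    """Step 1: enumerate candidate plans. Step 2: return the cheapest one."""
    candidates = list(permutations(tables))
    return min(candidates, key=plan_cost), len(candidates)


best_order, plan_count = choose_plan(ROW_COUNTS)
print("plans enumerated:", plan_count)
print("cheapest join order under the toy model:", best_order)
```

Under this toy cost model the largest table ends up being joined last, because joining the small tables first keeps the intermediate results small. A real cost-based optimizer relies on statistics rather than fixed constants.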
For example, suppose a SQL statement involves three tables and only left-deep trees are considered:
• Without global indexes, the execution plan space can be understood simply as the JOIN order of the three tables, so its size is roughly 3 × 2 × 1 = 6. The space is relatively small, and it is much easier for the optimizer to judge the cost of these 6 execution plans. (Of course, the optimizer still has plenty of other work, such as partition pruning. These optimizations are needed with or without indexes, so they are not discussed further here.)
• With global indexes, the situation is more complicated. Assuming each table has 3 global indexes, the size of the execution plan space becomes roughly (3 × 3) × (2 × 3) × (1 × 3) = 162, and the complexity rises sharply. The requirements on the optimizer rise accordingly: it needs to consider more statistics to choose a better execution plan, and it needs to prune more aggressively to finish query optimization in a shorter time.
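The two plan-space sizes above can be checked by brute-force enumeration. The sketch below assumes, as the text does, that each table contributes a fixed number of access paths (1 without global indexes, 3 with them). The function name `count_left_deep_plans` and the table names are hypothetical.

```python
from itertools import permutations, product


def count_left_deep_plans(tables, access_paths_per_table):
    """Count (join order, access path per table) combinations."""
    total = 0
    for order in permutations(tables):
        # Every table in the join order independently picks one access path.
        choices = [range(access_paths_per_table) for _ in order]
        total += sum(1 for _ in product(*choices))
    return total


tables = ["t1", "t2", "t3"]
no_index = count_left_deep_plans(tables, 1)
with_gsi = count_left_deep_plans(tables, 3)
print("plans without global indexes:", no_index)
print("plans with 3 global indexes per table:", with_gsi)
```

The count is 3! × k³ for k access paths per table, so it grows quickly as indexes are added, which is why pruning matters.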
So we can see that in "distributed databases" without global indexes, and in some middleware products, the optimizer is very weak. Most of them are rule-based optimizers (RBO). They do not need a powerful optimizer at all, because most of the optimization is actually left to the stand-alone optimizer.