The Transparent Road of PolarDB-X

PolarDB-X, formerly known as TDDL, a database and table middleware used internally by Taobao (in 2007, in the form of a Java library), used to provide services on Alibaba Cloud under the brand of DRDS (developed in 2012, launched in 2014, in the form of database and table middleware+MySQL proxy). Later (2019), PolarDB-X was formally transformed into a distributed database (officially became a member of the PolarDB brand). From middleware to distributed database, we have been building a distributed database using MySQL as storage for more than 10 years. We have accumulated a lot of technology and made some detours in the process, and we will continue to do so firmly in the future.

The development process of PolarDB-X is mainly divided into two stages: middleware (DRDS) and database (PolarDB-X). There are huge differences between the two stages. The author has just participated in the development of PolarDB-X for ten years, and has experienced the whole development process. Today, I will talk about some interesting things in the development and transformation of PolarDB-X.

Middleware era (2012~2019)

In fact, the development idea in the DRDS period is very simple, meeting several main demands of users:

1. The maximum storage space of the RDS MySQL single instance provided by Alibaba Cloud is limited (for example, it was only 2T in the early days)

2. The share storage database can solve the problem of disk capacity, but it is still limited by the single CPU memory, and cannot solve the problem of write scalability

3. The use of open source middleware can solve the above problems, but it is very troublesome and complicated to do operations such as capacity expansion

Against this background, we have added a MySQL proxy (actually the network layer of the Cobar) to the middleware TDDL, which is deployed on Alibaba Cloud and becomes the earliest DRDS.

The beginning of going to the cloud - using the cloud also serves the cloud

It is worth mentioning that the DRDS cloud mode is very fashionable now.

Like ordinary users of Alibaba Cloud, it also has an Alibaba Cloud account (only this account has a credit line of more than one trillion yuan). Use the AK/SK of this account to call the Open API of various Alibaba Cloud products for various operations.

For example, when creating an instance, you will purchase ECS to deploy DRDS nodes; They will purchase SLBs to load balance; Will purchase SLS services to store the SQL audit of this instance; It will connect the DRDS node to the user RDS network.

This form of management and control architecture is currently widely used, making full use of the advantages of the cloud. DRDS hardly needs to pay attention to resource issues, nor does it need to maintain its own inventory; For problems like machine downtime, ECS can also automatically migrate (even the IP address will not change), which is very convenient. Let the R&D team of DRDS focus more on improving the capability of the product itself.

On the one hand, DRDS provides services for Alibaba Cloud users, and on the other hand, as an "ordinary user" of Alibaba Cloud, it is very interesting to enjoy the benefits of cloud technology.

During the DRDS period, we focused on the following technical capabilities in the kernel:

SQL semantic compatibility with MySQL

TDDL only serves internal users, while Taobao's R&D specifications are relatively strict. The SQL used in applications is relatively simple, so there is very little SQL processing. To put it simply, it does not even need to understand the semantics of SQL. It just forwards. However, the demands of cloud users are diverse, and there are a large number of stock applications migrated to the cloud, which requires much higher SQL compatibility. This requires us to provide a complete SQL engine.

DRDS has two more key components compared with TDDL and a lot of sub database and sub table middleware on the market: query optimizer and executor with complete operator system. Its goal is to understand the semantics of SQL correctly and execute the correct results no matter how complex the SQL is.

There is a lot of work that needs to be accumulated for a long time. Here are some examples:

• Any built-in function supported by MySQL may be calculated based on a result that cannot be pushed down, which requires DRDS to support all built-in functions of MySQL, and the goal is consistent with the behavior of MySQL. We have implemented almost all these functions in DRDS( https://github.com/ApsaraDB/galaxysql/tree/main/polardbx-optimizer/src/main/java/com/alibaba/polardbx/optimizer/core/function )。 In the early days, two of our classmates spent years doing this and have polished it up to now.

• MySQL supports a large number of charsets and collations. Different combinations will bring different sorting results. If we want to use the merge and sort operator to merge the partially ordered results of the MySQL layer, we need to ensure that the DRDS uses the same sort behavior as MySQL. In fact, DRDS is required to support charset and collaboration systems consistent with MySQL behavior, such as utf8mb4_ general_ ci: https://github.com/ApsaraDB/galaxysql/blob/main/polardbx-common/src/main/java/com/alibaba/polardbx/common/collation/Utf8mb4GeneralCiCollationHandler.java 。

There are many similar works, such as type system( https://zhuanlan.zhihu.com/p/374130246 )、sql_ Mode, time zone system, default value, etc. are tedious but necessary. These works have been well extended to PolarDB-X.

Extreme pushdown optimization

It is the simplest principle to ensure the performance to push down the calculation to the place closest to the data.

MySQL, as a storage engine of distributed database, actually has strong computing power. Especially, compared with many distributed databases that use KV as the storage engine at present, most of them can only implement filter and function push down. However, MySQL supports complete SQL execution. It is a key for DRDS to ensure high performance to push down as many pieces of JOIN, sub query, aggregation and other operations as possible to MySQL.

The following table briefly compares some optimization options of products in the industry. The information comes from public documents:



Collection of partition clipping conditions

Predicate push down

Predicate push down

Predicate push down

Predicate Move-Round

Pushing too much will lead to incorrect results, and pushing too little will fail to achieve optimal performance. The optimizer of DRDS accumulates a large number of push down optimization strategies. Most of these optimization strategies can't be imagined out of thin air. Only through actual scenario cases can they be accumulated. See:

https://zhuanlan.zhihu.com/p/366312701 。

The Enrichment of Physical Operators and the Execution Engine of MPP

Physical operators refer to various algorithms of the actuator. For example, for Join, we support HybridHashJoin, LookupJoin, NestedLoopJoin, SortMergeJoin, MaterializedSemiJoin and other algorithms.

DRDS initially only supports single thread SQL execution, but this execution is not enough for complex SQL. We first built a stand-alone parallel engine (SMP), and then developed to the current MPP engine, https://zhuanlan.zhihu.com/p/346320114 。

At the same time, the execution engine also supports spill out (the ability to drop intermediate results). Even if there is only 15 MB of memory, it can run through the TPCH 1G test. See the following for details: https://zhuanlan.zhihu.com/p/363435372 。

The accumulation and breakthrough of these capabilities have greatly improved PolarDB-X's computing power in the face of complex SQL.

Rough distributed transactions

Distributed transaction is an unavoidable problem.

For middleware products, we have a very basic assumption: use standard MySQL to avoid invasive changes to MySQL; Even if it is modified, it should be plug-in.

Without modifying MySQL, we haven't implemented distributed transactions for a long time.

Some detours we have gone through:

1. Like traditional middleware, distributed transactions are prohibited. However, the transformation cost of this application is too high.

2. Using flexible transactions, we have used third-party components such as GTS (formerly TXC) to implement distributed transactions for a long time. This scheme needs to implement rollback statements for different SQL according to semantics. SQL compatibility is poor.

3. Scheme of using GTM. GTM is a single point in nature, and there is a lot of data interaction between GTM and the coordinator. The performance is too poor to be used as a default transaction strategy. So let's look at the "database" of the GTM scheme. It must have very strict conditions for use (for example, the application is required to avoid distributed transactions as much as possible and close strong consistency by default).

4. XA transactions. Early MySQL supported XA weakly and had many BUGs (in fact, MySQL still has many BUGs for XA). For example, it is easy for XA to hang up the downtime recovery process. In addition, XA transactions cannot solve the problem of read visibility and are incompatible with the behavior of stand-alone transactions.

The transaction system is closely related to the storage layer. From PolarDB-X's exploration, it is impossible to make a distributed transaction with satisfactory performance and functions without making deep modifications to MySQL. This is a problem that all middleware products cannot solve, and it is also a fundamental difference between middleware and database.

The partition key that cannot be bypassed

Since the first user of DRDS, I have always had to answer the question: How do I select the partition key for my table?

From the perspective of "high throughput" and "high concurrency" business systems, it is very reasonable to require that tables and SQL have partition keys with business characteristics. All of them are pushed down to the storage layer to avoid cross machine queries and transactions, which can ensure the best performance. This is the ceiling of performance.

The problem is that although the upper limit is very high (even the peak of Taobao's business at 11:00 on the Double Ninth Day can be very smooth)

1. The cost of this transformation is very high. In many cases, the partition key is difficult to select. For example, the order table of many e-commerce systems has two query dimensions: seller and buyer. Which is the partition key

2. Not all business systems (or not all tables and SQL) are worth the cost of transformation. Only the core logic in the core system needs such a detailed transformation

3. If the split key is selected incorrectly, the lower limit will be extremely low. For databases, it is equally important to provide a higher upper limit and a lower limit.

Naturally, we want to know what kind of technology can make you "forget" the partition key.

Database Age (2019~)

The Road to Transparent Distribution

One of the key differences between middleware and database is whether it is necessary to enforce the concept of partition key.

Partition key and global index

The broad concept of "partition key" is not unique to distributed databases.

In a stand-alone database, such as MySQL, data is stored as B trees. If a table has only a primary key, it has only one B tree, for example:

CREATE TABLE t1(

id INT,

name CHAR(32),

addr TEXT,

PRIMARY KEY (id)

)

The unique B tree of this table is sorted by primary key (id). If our query criteria include an equivalent condition with an ID, such as where id=14, we can quickly locate the record corresponding to this ID in the tree; On the contrary, full table scanning is required.

The key used for sorting in the B-tree can be located to a leaf node through binary search; The partition key can be located to a partition through hash or range binary search. It can be seen that they are all designed to locate data quickly.

If we want to query the above table with where name='Megan ', we do not need to set name as the primary key in MySQL. A more natural way is to create a secondary index on name:

CREATE INDEX idx ON t1(name)

Each secondary index is an independent B Tree in MySQL. The key used for sorting is the column of the secondary index.

That is to say, there are two B Trees in the current table t1, one primary key and one idx. They are:

id->name,addr

name->id

When you use where name='Megan 'to query, you will first access the B tree of idx, locate the leaf node according to name='Megan', obtain the value of the id, and then use the value of the id to the B tree of the primary key to find the complete record.

In fact, the secondary index gains the query performance by redundant data, using space and increasing the cost of writing.

At the same time, the maintenance cost of secondary indexes is not very high. Generally, you can create several secondary indexes on a table with confidence.

Similarly, in a distributed database, the only way to "forget" the partition key is to use a distributed secondary index, also known as the Global Index. And this global index needs to be efficient, cheap, and highly compatible with traditional secondary indexes.

The global secondary index is also a kind of data redundancy. For example, when executing an SQL:

INSERT INTO t1 (id,name,addr) VALUES (1,"meng","hz");
If there is a seller in the orders table_ The global secondary index, ID, can be simply understood as:_ ID The insert is executed in two global indexes, and two records are written:

INSERT INTO t1 (id,name,addr) VALUES (1,"meng","hz");
INSERT INTO idx (id,name) VALUES (1,"meng");
The partition key of t1 primary key index is id, and the partition key of idx is name.

At the same time, since the approximate rate of the two records will not be on the same DN, in order to ensure the consistency of the two records, we need to encapsulate the two writes into a distributed transaction (this is similar to that in a stand-alone database, where the secondary index is written through a stand-alone transaction).

When all our DML operations maintain the global index through distributed transactions, the secondary index and the primary key index can remain consistent.

Does global indexing sound simple? In fact, it is not.

Global Index and Distributed Transaction
The index must be strong and consistent, for example:
• Failed to write the index table, but failed to write the primary table, resulting in data shortage in the index table
• Read the index table and the main table at the same time. The records you see should be the same. You cannot read the results submitted while not submitted
...
The consistency requirements for indexes here are actually the requirements for distributed transactions.
Due to the introduction of global indexes, 100% of transactions will be distributed transactions. The requirements for distributed transactions are completely different from those for "distributed databases that strongly depend on partition key types". The requirements become higher:

1. At least the isolation level above SNAPSHOT ISOLATION must be achieved. Otherwise, the behavior will be very different from that of the stand-alone MySQL, and there will be a very large data consistency risk. At present, the common solutions are HLC, TrueTime, TSO and GTM. If a database does not use these technologies, you need to carefully identify them.
2. 100% distributed transactions have higher performance requirements than 10% distributed transactions in TPCC model. HLC, TSO, and TrueTime schemes can achieve relatively large transaction capacity. GTM is relatively heavier, and its upper limit is far lower than that of TSO in the same single point scheme (although TSO is a single point, it can do a lot with Grouping optimization).
3. Even if TSO/HLC and other schemes are used, optimization should also be in place, such as typical 1PC, Async Commit and other optimizations. Otherwise, the response time increased by index maintenance will be difficult to accept.

Compatibility with stand-alone indexes

In addition, in stand-alone databases, indexes have some very natural behaviors that need to be compatible.

For example:
• Indexes can be created directly through DDL statements, rather than requiring various peripheral tools.
• Prefix query. In a stand-alone database, the index can well support prefix query. How should the global index solve this problem?
• Hot issues (Big Key issues). In a stand-alone database, if an index is not highly selective (for example, an index is created on gender), there will be no serious problems except a slight waste of resources; However, for distributed databases, the index with low selectivity will become a hotspot, causing some hot nodes in the whole cluster to become the bottleneck of the whole system. Global indexes need to have corresponding methods to solve such problems.
The creation speed of the index, the performance of the index back to the table, the functional limitations of the index, clustered indexes, and the storage cost of the index also greatly affect the use experience of the global index. In view of the space, we will not continue to expand.

Number of indexes

These requirements for global indexes are essentially derived from the number of global indexes.

For a database with good transparency, all indexes will be global indexes, and the number of global indexes will be very large (just like the number of secondary indexes of one table and one database in a stand-alone database). The requirements will be higher only when the quantity is more.
However, even if there is a global index, you will find that the usage of these incomplete distributed databases is still strongly dependent on the partition key.

They will make the creation of global indexes an optional and special matter. In this way, businesses will become very cautious when using global indexes. Naturally, the number of global indexes will become very limited.

When the number of global indexes and usage scenarios are strictly limited, the disadvantages of the above will become less important.

Index selection oriented query optimizer

We know that the core working mechanism of the database optimizer is:
1. Enumerate possible execution plans
2. Find the lowest cost of these execution plans

For example, three tables are involved in an SQL statement. When only the left deep tree is considered:

• When there is no global index, it can be simply understood that the execution plan space is mainly reflected in the JOIN order of the three tables, and its space size is about 3x2x1=6. The space for execution plans is relatively small, and it will be much easier for the optimizer to judge the cost of these six execution plans. (Of course, the optimizer still has a lot of work to do, such as partition pruning, and so on. I won't say much about whether these optimizations have indexes or not.).
• When there is a global index, the situation is much more complicated. Assuming that each table has three global indexes, the size of the execution plan space will roughly become (3x3) x (2x3) x (1x3)=162, and the complexity will rise sharply. Accordingly, the requirements for the optimizer will be much higher. The optimizer needs to consider more statistics to select a better execution plan; More pruning is needed to complete query optimization in a shorter time.

So we can see that in the "distributed database" or some middleware products without global indexes, the optimizers are very weak, most of them are RBO, they do not need a powerful optimizer at all, and more optimization content is actually replaced by the stand-alone optimizer.

Implement strong consistency and high-performance distributed transactions on MySQL

In order to build a transparent distributed database, the key is the global index and the distributed transactions that the global index depends on. The exploration in the middleware era has told us that to make strong, consistent and high-performance distributed transactions, we must make deep modifications to the storage (MySQL).

We choose the scheme of using global MVCC (TSO)+2PC (XA).
Start is included in the stand-alone MVCC of MySQL_ timestamp (that is, trx_id in MySQL). In order to achieve global MVCC, several core things need to be done:
• Provide a global timestamp generator TSO, https://zhuanlan.zhihu.com/p/360160666 。
• Replace stand-alone trx with TSO generated global timestamp_ id。
• Introduce commit_ Timestamp (also generated by TSO), using strat_ timestamp and commit_ It is very efficient to judge the visibility by timestamp, https://zhuanlan.zhihu.com/p/355413022 。 The costly scheme of exchanging active transaction lists or GTMs between nodes is not used.
Transaction flow in PolarDB-X:
The modification of the record format in InnoDB is called the Lizard transaction system. See the following for details:

We also have some other articles to introduce the implementation of PolarDB-X distributed transactions:
• PolarDB-X Strong Consistent Distributed Transaction Principle
• Implementation of PolarDB-X distributed transaction (I)
With distributed transactions and global indexes, PolarDB-X has officially transformed from a middleware into a distributed database.
Transparent distribution of PolarDB-X
PolarDB-X implements excellent distributed transactions and global indexes, which meet the requirements for global indexes mentioned above and achieve transparent distribution.
In the transparent distributed mode (mode='auto' is specified in CREATE DATABASE), all indexes are global indexes, and the application does not need to care about the partition key.
For example, our table creation statement is completely consistent with the stand-alone MySQL, and it does not need to specify the partition key:
create table orders (
id bigint,
buyer_ Id varchar (128) comment 'Buyer',
seller_ Id varchar (128) comment 'Seller',
primary key(id),
index sdx(seller_id),
index bdx(buyer_id)
)
Creating a global index is also consistent with the experience of creating a secondary index on a stand-alone MySQL. The whole process is online:
CREATE INDEX idx_ seller_ id ON orders (seller_id);
PolarDB-X's global index is strongly consistent, and its data consistency experience is not significantly different from that of stand-alone MySQL. It provides an isolation level between RC and RR that conforms to MySQL semantics.
At the same time, PolarDB-X has also done a lot of work on index maintenance and optimizer to ensure that the index can be created and maintained efficiently, and the optimizer can correctly generate the execution plan using the index.
The partition algorithm of PolarDB-X can also deal with hot spots, data skew and other problems in the index. Refer to: https://zhuanlan.zhihu.com/p/424174858
Automatically (transparently) determine the lower limit and manually determine the upper limit
Divide common distributed databases on the market by transparency and manual:
• Typical representatives of transparent distributed databases: TiDB, CockroachDB.
• Typical representatives of manual distributed databases: OceanBase, YugabyteDB.
Is transparent distributed data better than manual data?
For databases that only provide transparent usage, the migration cost will be lower and the initial experience will be better. However, after entering the deep-water area, due to the inevitable use of a large number of distributed transactions, the performance in the core scenarios often fails to meet the requirements (or the same performance requires higher costs), and there is a lack of optimization methods such as eliminating distributed transactions and more sufficient computing push down.

For databases that only provide manual usage, although well-designed partitioning keys enable optimal performance in theory, the threshold for use will increase significantly (10% of core tables need to be designed for partitioning keys, and the remaining 90% of non core tables need to be designed for partitioning keys).

We believe that both transparent and manual distributed databases can not meet the requirements of businesses for both cost and performance.
PolarDB-X not only provides a transparent mode, but also fully supports the syntax of partitioned tables. It also provides Join Group/Table Group, partition online change and other tools to enable applications to push down more transactions and calculations to storage nodes when they need extreme performance.

PolarDB-X is the only distributed database on the market that can provide both transparent and manual modes. We recommend that most scenarios use the transparent mode, then perform pressure testing on core business scenarios, and manually optimize these scenarios using partition table syntax to achieve the highest performance.

Use Paxos protocol to achieve RPO=0
MySQL connected by middleware generally uses the active/standby architecture. The biggest problem with this method is that data will be lost, even if it takes a long time (walking along the river, you can't help wetting your shoes).

After years of science popularization in the database circle, we all know that some consistency protocols, such as Paxos and Raft, are required to avoid data loss. In fact, these agreements are no secret. Even the database circle has a joke that "students enrolled in schools should be able to hack Paxos".
The threshold is not the protocol itself, but the stability and performance of MySQL. Stability can only be obtained after large-scale verification and enough pits.

The Paxos protocol used by PolarDB-X is derived from Alibaba's internal X-Paxos. It can be said that there is no active/standby mode for Alibaba's internal MySQL databases, and 100% of them use X-Paxos. This means that it has experienced tens of thousands of MySQL clusters and various rapid verifications, and has high reliability.

Related Articles

Explore More Special Offers

  1. Short Message Service(SMS) & Mail Service

    50,000 email package starts as low as USD 1.99, 120 short messages start at only USD 1.00

phone Contact Us