Disclaimer: This is a translated work of Qinxia's 漫谈分布式系统, all rights reserved to the original author.
In the last article, we gained a basic understanding of the consistency problem in distributed systems and looked in detail at single-master synchronous replication. Even on a best-effort basis, that mechanism still cannot guarantee the strong consistency we want.
Then, by analyzing the essence of data replication, we found that transactions might solve the data consistency problem.
Transactions can be used to solve consistency problems during data replication, but that was not their original purpose, or at least not their only one. It is therefore worth taking a good look at distributed transactions.
Source of inconsistency
Replica data becomes inconsistent because of machine failures, network failures, and so on: either the replicas do not know that they differ from each other, or they know but cannot resolve the difference.
Essentially, this is because each replica has only local information and cannot make correct decisions.
The remedy follows directly: if every replica knew all the information, each could make the right decision.
However, the cost of synchronizing that much information, together with the higher chance of errors in an increasingly complex process, makes this infeasible in practice.
As a compromise, not every replica needs the full picture. It is enough for one replica (or a new role that takes over this function) to have all the information and coordinate the others.
This idea is already reflected in the single-master, multi-slave architecture. As a special replica, the master already acts as a coordinator, but it replicates data to each of the other replicas independently. Now these separate coordination efforts need to be considered together.
Another way to troubleshoot
In a complex distributed system, there are too many places where things can go wrong:
• Servers may go down, and may either recover after a restart or never recover.
• Network failures are further divided into occasional jitter and long-term failures.
• Services may fail, possibly in several roles at the same time.
We also need to design a response mechanism for each possible failure, taking the replica problem as an example:
• In order to cope with failures, there must be replicas.
• To deal with IDC-level failures, replicas must be distributed across different data centers.
• To deal with switch-level failures, the replicas must not all sit behind the same switch.
Not only is it difficult to cover every possible fault this way, but the system design and implementation inevitably become more and more complex, which in turn hurts the reliability of the system.
Can we change the way we think about this? We know that failures will occur, that they come in many varieties, and that sometimes it is impossible to tell which failure is currently happening.
But we do not have to handle the problem at such a fine granularity. We can be a little rough: just attempt the operation and hope it succeeds. If it fails, whatever the reason, restore the previous state and try again later.
Combining the above two ideas, there is the so-called 2PC (Two Phase Commit). 2PC is also one of the typical implementations of distributed transactions.
We give the coordinating role a name, Coordinator, and call the other roles Participants.
As the core, the coordinator has the global information and both the ability and the authority to make decisions. Participants only need to look after themselves and do their own work.
It is called 2PC because the entire transaction commit is divided into two phases: Prepare and Commit.
The main execution process is as follows:
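As a rough illustration of the two phases, here is a minimal Python sketch. The names `Participant` and `two_phase_commit` are hypothetical, the participants live in memory, and failures, retries, and persistence are deliberately left out.

```python
from enum import Enum


class Vote(Enum):
    YES = "yes"
    NO = "no"


class Participant:
    """Hypothetical in-memory participant; a real one would persist its state."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Messages 1 and 2: the coordinator asks, the participant votes.
        if self.can_commit:
            self.state = "prepared"
            return Vote.YES
        self.state = "aborted"
        return Vote.NO

    def commit(self):
        # Messages 3 and 4: the coordinator instructs, the participant acks.
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1 (Prepare): collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    decision = "commit" if all(v is Vote.YES for v in votes) else "abort"
    # Phase 2 (Commit): broadcast the single global decision.
    for p in participants:
        if decision == "commit":
            p.commit()
        else:
            p.abort()
    return decision
```

A single NO vote in the first phase is enough to abort the whole transaction, which is why nothing is actually committed until the second phase.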
This seems to meet our needs well, but to judge whether 2PC is good enough we have to analyze the execution process carefully, along several dimensions:
• Process: the two phases involve 4 message transfers in total, and a failure can occur before, during, or after each of them.
• Points of failure: There are servers (participants, coordinators) and networks that can fail, and can be subdivided into single failures and simultaneous multiple failures.
• Failure types: recoverable machine failure (fail-recover), unrecoverable machine failure (fail-dead), network jitter (occasional packet loss), network partition (long-term network failure).
• Impact: Short-term blocking affects performance, permanent blocking affects availability, and data consistency issues.
I have tried to combine all of the above dimensions, but it is too complicated. Let's first find some rules and commonalities, exclude some combinations, and only focus on key situations.
For the process:
• Since the first stage does not actually commit data, the entire transaction can be canceled after a failure, and there will be no side effects, but the process may be blocked.
• The second stage actually commits the data, so a partial commit may cause data consistency problems. More specifically, when only a participant's reply is lost, the participant has already executed the instruction and there are no side effects, so we focus only on partial delivery of the messages sent by the coordinator.
For failure points and failure types:
• Fail-recover failures only cause blocking. Because the coordinator and the participants persist the transaction state locally and messages are retried, such a failure blocks the current transaction, or blocks other transactions that wait for resources the current transaction holds (such as locks), but it does not cause data consistency issues.
• Fail-dead failures lose the local transaction state and can therefore lead to data inconsistency. A fail-dead participant can be rebuilt by copying data from other participants, but a fail-dead coordinator has nowhere to copy from and deserves special attention.
• Network jitter can be resolved by message retries. It only leads to blocking and does not cause consistency problems (strictly speaking, the data is inconsistent until the retry succeeds).
• Network partitions, as the CAP analysis in the previous article showed, are an important cause of data consistency problems and deserve attention.
(BTW, many of these problems seem to come from partial delivery of messages. It looks as if the coordinator needs to send the commits to multiple participants as a transaction. But designing transactions is exactly what we are doing here, and infinite nesting is not allowed!)
From the above analysis, we can draw a first conclusion: short-term blocking can happen anytime and anywhere. This is the nature of synchronous operations and an unavoidable disadvantage of 2PC.
We then focus on the following dimensions that can cause permanent system blocking and data consistency issues:
• For the process, we focus on the second stage, and in particular on the case where only part of the second-stage messages are delivered, since that is what has a real impact.
• For failure points and failure types, we focus on coordinator fail-dead and network partitions.
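| No. | Process | Coordinator | Participant | Consistency Hazard | May Block Permanently |
|-----|--------------|-------------|--------------|--------------------|-----------------------|
| 1 | commit/abort | fail-dead | ok | no | no |
| 2 | commit/abort | fail-dead | fail-dead | yes | yes |
| 3 | commit/abort | fail-dead | fail-recover | no | no |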
Let's go through them one by one:
• No. 1: the coordinator dies after sending the commit/abort message; some participants received it and some did not. After a new coordinator is elected, it asks all participants for the status of the transaction. Some replies carry the instruction and some do not, but that is enough to determine whether the transaction was decided as commit or abort, so the new coordinator only needs to resend the instruction to the participants that did not receive it.
• No. 2 and 3: the coordinator dies after sending the commit/abort message; some participants received it and some did not. After a new coordinator is elected, it again asks all participants for the status of the transaction. Suppose exactly one participant does not reply, and the others all answer: either they report the instruction they received (all commit or all abort), or they all report that no instruction arrived. If any reply carries the instruction, the new coordinator can determine the previous decision; if none does, it cannot. If the silent participant is fail-recover, its status becomes known once it recovers, so the transaction merely blocks for a while. If it is fail-dead, however, the decision is lost forever and the transaction blocks permanently; moreover, the participant may have completed the commit before dying, which leads to inconsistency.
The analysis is tedious, but the conclusion is simple: when the coordinator and some participants fail-dead at the same time, permanent blocking and data consistency problems may result.
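The reasoning of the new coordinator in these three cases can be sketched in a few lines of Python. This is only an illustration: `recover_decision` and the reply values are hypothetical, and the choice to abort when every participant is reachable and none has seen an instruction is an assumption of this sketch (nobody can have committed yet), not something the analysis above prescribes.

```python
def recover_decision(replies):
    """Decide what a newly elected coordinator can conclude.

    `replies` maps a participant name to one of:
      "commit" / "abort"  the instruction that participant received,
      "none"              reachable, but no instruction received,
      None                unreachable (fail-recover or fail-dead).
    """
    instructions = {r for r in replies.values() if r in ("commit", "abort")}
    if len(instructions) > 1:
        raise ValueError("conflicting instructions; 2PC should never produce this")
    if instructions:
        # At least one survivor knows the decision: resend it to the others.
        return instructions.pop()
    if any(r is None for r in replies.values()):
        # The silent participant may be the only one that saw the decision.
        return "blocked"
    # Assumption: everyone is reachable and nobody saw an instruction,
    # so no commit can have happened and aborting is safe.
    return "abort"
```

For example, `recover_decision({"a": "commit", "b": "none"})` yields the original decision, while `recover_decision({"a": "none", "b": None})` stays blocked until participant `b` comes back, which never happens if `b` is fail-dead.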
For network partitions, suppose the network partitions while the coordinator is sending the commit/abort message, so that some participants receive it and some do not. The partition without a coordinator elects a new one. If the participants that received the message and those that did not happen to fall into different partitions, the coordinators reach different decisions, and the data becomes inconsistent across partitions.
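| No. | Process | Network Partition | Consistency Hazard | May Block Permanently |
|-----|--------------|-------------------|--------------------|-----------------------|
| 1 | commit/abort | yes | yes | no |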
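The partition case can be illustrated with a small simulation. The function names and the participant names are hypothetical, and the sketch assumes that a partition-local coordinator treats unreachable participants as gone and aborts when none of its local participants has seen an instruction.

```python
def decide_in_partition(local_states):
    """Decision a partition-local coordinator reaches from what it can see.

    Assumption of this sketch: if no local participant received a commit,
    the coordinator gives up on the transaction and aborts.
    """
    if "commit" in local_states.values():
        return "commit"
    return "abort"


def simulate_partition(states, partitions):
    """Apply each partition's local decision to its own members."""
    outcome = {}
    for members in partitions:
        local = {name: states[name] for name in members}
        decision = decide_in_partition(local)
        for name in members:
            outcome[name] = decision
    return outcome


# Only p1 received the commit before the network split.
states = {"p1": "commit", "p2": "none", "p3": "none", "p4": "none"}
outcome = simulate_partition(states, [["p1", "p2"], ["p3", "p4"]])
```

Here `p1` and `p2` end up committed while `p3` and `p4` end up aborted: no partition blocks, but the two sides disagree.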
The two types of failure, fail-dead and network partition, were analyzed separately above. When they occur together, the effects combine much like a bitwise OR.
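To sum up, 2PC mainly produces three problems of two types (blocking and data inconsistency):
• Short-term blocking, which costs performance.
• Permanent blocking and data inconsistency when the coordinator and some participants fail-dead at the same time.
• Data inconsistency after a network partition.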
In the last article, we introduced distributed transactions precisely to solve the data consistency problem. Yet 2PC, once designed, turns out to have consistency problems of its own and may even block unrecoverably. The cure suffers from the very problem it was meant to solve, so we have to find another way.
Think about it again: the coordinator and a participant fail-dead at the same time. After a new coordinator is elected, why can't it tell whether the current transaction should commit or abort? We defined the coordinator role precisely so that it could make decisions. Why can't it decide in this case?
The key lies in a sentence from earlier, when we were reducing the dimensions of the problem: the coordinator and the participants persist the transaction state locally.
It is precisely because the transaction state is locally persisted on each machine that we can ensure that failures such as fail-recover will not lead to inability to make decisions.
But in the fail-dead case, the transaction state is lost. If every machine that persisted the decision dies, the decision is lost completely. In the example above, the coordinator fails-dead right after issuing the first commit command, the participant that received it also fails-dead, and the remaining participants have no way to learn the decision.
The source of the problem has been found. Since the problem is caused by losing the transaction state (mainly the decision derived from the first-stage votes), we first send the decision to all participants and only then perform the real commit. This way, as long as one machine is still alive, the decision survives (a total failure has to be avoided through node placement, for example across multiple racks).
This is the so-called 3PC (Three Phase Commit) idea.
3PC inserts a step between the two phases of 2PC that is dedicated to synchronizing the decision. Only if this step succeeds does the protocol enter the final phase; otherwise it retries or aborts. The three phases are:
• Can-Commit, similar to the Prepare phase in 2PC.
• Pre-Commit, a new phase in which the coordinator synchronizes the decision with the participants.
• Do-Commit, similar to the Commit phase in 2PC.
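The three phases, and the reason the extra one helps, can be sketched as follows. This is a minimal in-memory illustration with hypothetical names (`Participant`, `three_phase_commit`, `recover_from_survivors`); it assumes no timeouts and no network partitions.

```python
class Participant:
    """Hypothetical in-memory participant for a 3PC sketch."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def can_commit_request(self):
        # Can-Commit: vote without committing anything.
        self.state = "ready" if self.can_commit else "aborted"
        return self.can_commit

    def pre_commit(self):
        # Pre-Commit: learn the decision, still without committing.
        self.state = "pre-committed"

    def do_commit(self):
        # Do-Commit: actually commit.
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def three_phase_commit(participants):
    votes = [p.can_commit_request() for p in participants]
    if not all(votes):
        for p in participants:
            p.abort()
        return "abort"
    for p in participants:
        p.pre_commit()
    for p in participants:
        p.do_commit()
    return "commit"


def recover_from_survivors(survivors):
    """What a new coordinator can conclude from the surviving participants.

    Do-Commit only starts after every participant is pre-committed, so a
    survivor that is still "ready" proves that nobody has committed yet.
    """
    states = {p.state for p in survivors}
    if states & {"pre-committed", "committed"}:
        return "commit"
    return "abort"
```

If the coordinator and the first participant to receive Do-Commit both die, the survivors are still in the pre-committed state, so the decision is not lost the way it is in 2PC.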
3PC solves the second problem of 2PC nicely, but it still has no answer to the third: data inconsistency after a network partition. The first problem of 2PC, the performance loss caused by short-term blocking, is common to all synchronous solutions, and 3PC cannot help there either.
In addition, without a standby coordinator, 2PC blocks the entire system for a long time whenever the coordinator fails, so it is classified as a blocking algorithm.
Besides addressing possible consistency problems, the extra phase of 3PC also addresses this blocking problem. To alleviate blocking further, 3PC follows the coordinator's example and introduces a timeout mechanism on the participant side as well: after Pre-Commit, a participant that receives no Do-Commit command commits automatically once the timeout expires.
As a result, 3PC can just about be called a nonblocking algorithm (blocking here means permanent blocking, not the short-term blocking caused by synchronous operations), but it increases the possibility of data inconsistency. In a later article, we will discuss specifically how timeouts seem reliable and practical but are in fact full of uncertainty.
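A tiny sketch shows how the participant-side timeout can produce inconsistency. The function `on_timeout` and the state names are hypothetical. The rule that a pre-committed participant commits on timeout comes from the description above; the rule that a participant which never received Pre-Commit aborts on timeout is an assumption of this sketch.

```python
def on_timeout(state):
    """Default action a participant takes when the coordinator goes silent."""
    if state == "pre-committed":
        return "committed"  # no Do-Commit arrived: commit anyway
    if state == "ready":
        return "aborted"    # assumption: never saw Pre-Commit, so give up
    return state            # already committed or aborted: nothing to do


# The Pre-Commit message reached p1 but not p2 before contact was lost.
states = {"p1": "pre-committed", "p2": "ready"}
after_timeout = {name: on_timeout(s) for name, s in states.items()}
```

After the timeout, `p1` has committed and `p2` has aborted: neither participant blocks, but the two no longer agree.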
Although 3PC is better than 2PC at the algorithm level, the additional round of messages makes the already poor performance worse, the pursuit of non-blocking introduces new inconsistencies, and network partitions are still not handled. So in practice, it has not been adopted more widely than 2PC, as one might have expected.
By contrast, 2PC basically achieves the goal of distributed transactions and has been standardized as XA (eXtended Architecture), which is widely adopted by databases such as PostgreSQL, MySQL, and Oracle and supported by various languages and APIs.
This is also a reflection of the different trade-offs between theory and practice.
In addition to 2PC and 3PC, there is another distributed transaction implementation called TCC (Try-Confirm-Cancel).
TCC follows a similar idea to 2PC/3PC, but couples the application layer into the whole process, so I won't go into details here.
• The root of inconsistency is that each replica has only local information and cannot make correct decisions. Some role needs to have all the information.
• Rather than passively handling failures one by one, consider trying first and rolling back on failure, an approach known as a transaction.
• 2PC is a typical implementation of distributed transactions. It is divided into two stages, Prepare and Commit: coordinate first, then operate.
• 2PC has blocking and data consistency issues in some cases.
• 3PC synchronizes the decision by inserting one more round of messages, which solves the blocking problem when the coordinator and participants fail at the same time.
• Although 3PC alleviates blocking and solves some data inconsistencies, it also sacrifices performance and introduces new possibilities for data inconsistency.
• Neither 2PC nor 3PC provides partition tolerance.
In the previous article, we divided solutions to the data consistency problem into two categories: prevention, and pollute first and clean up later. We then introduced one preventive solution, single-master synchronous replication.
In this article, we gave a rough introduction to several implementations of distributed transactions as the second kind of preventive consistency solution.
However, these two solutions are powerless when encountering network partitions.
And as we said when we introduced CAP, network partitioning is an unavoidable problem. Therefore, in the next article, let's see if there is any preventive consistency algorithm that can provide partition tolerance.
This is a carefully conceived series of 20-30 articles. I hope to give everyone a basic grasp of the core of distributed systems in a storytelling way. Stay tuned for the next one!