By Qinxia
Disclaimer: This is a translated work of Qinxia's 漫谈分布式系统. All rights reserved to the original author.
In the previous article, we examined consistency problems in distributed systems and introduced the single-master synchronous data replication mechanism in detail. However, that mechanism still cannot guarantee the strong consistency we need.
By analyzing data replication, we then found that the data consistency problem can be solved with transactions.
Transactions can solve the consistency problem in data replication, but that is not their only purpose. We therefore need a good understanding of distributed transactions.
Data inconsistency among replicas is caused by machine or network failure. As a result, either the replicas do not know that they are different from each other, or they know they are different, but the problem cannot be solved.
In essence, each replica only holds partial information, so correct decisions cannot be made.
The right decision can be made if each replica holds all the information.
However, the cost of keeping all that information synchronized, plus the higher risk of errors from a more complex process, makes this solution infeasible.
The compromise is that only one replica (with the relevant functions abstracted and given a new role) holds all the information. Then, we can use it to coordinate the other replicas.
This idea is already reflected in the single-master, multiple-slave architecture. As a special replica, the master acts as a coordinator and replicates data to each of the other replicas separately and independently. Now we need to integrate these separate coordination efforts.
There are too many possible failures in a complex distributed system:
We also need to design a response mechanism for each possible failure. Let’s take the replica problem as an example:
This makes it difficult to cover all possible failures and leads to more complex system design and implementation, which affects the reliability of the system.
We can think differently. We know failures may occur, and that they come in many kinds. Sometimes, we don't even know which failures are currently occurring.
So we won't deal with failures in such a fine-grained way. We ignore the possibility of failure and simply process the task. If it works, good; if it fails, whatever the reason, we restore the original state and try again later.
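The "restore and retry" idea can be sketched in a few lines. This is only an illustration under simple assumptions: the state is an in-memory dictionary, and `run_atomically` and `flaky_transfer` are hypothetical names that do not appear in the original article.

```python
import copy

def run_atomically(state, task, max_attempts=3):
    """Run task(state); on any failure, restore the snapshot and retry."""
    for attempt in range(1, max_attempts + 1):
        snapshot = copy.deepcopy(state)      # remember the original state
        try:
            task(state)
            return attempt                   # success: number of attempts used
        except Exception:
            # The cause is deliberately not inspected; partial changes are undone.
            state.clear()
            state.update(snapshot)
    raise RuntimeError("task failed after %d attempts" % max_attempts)

calls = {"n": 0}

def flaky_transfer(accounts):
    accounts["a"] -= 10                      # a partial change happens first
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("simulated failure in the middle of the task")
    accounts["b"] += 10

accounts = {"a": 100, "b": 0}
attempts = run_atomically(accounts, flaky_transfer)
```

The first attempt fails halfway and is rolled back, so the half-finished debit never becomes visible; the second attempt completes the whole task.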
Combining the two ideas above gives us Two-Phase Commit (2PC), one of the typical implementations of distributed transactions.
We give the coordinating role a new name, Coordinator, and call the other participating roles Participants.
As the core, the coordinator holds global information and has the ability and right to make decisions. Participants only need to focus on themselves and their jobs.
This solution is called 2PC because the transaction commit is divided into two stages: Prepare and Commit.
The main execution process is listed below:
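In its textbook form, the flow can be sketched roughly as follows. This is a simplified in-memory illustration, not a production implementation: the class names, method names, and state strings are hypothetical, and messaging, persistence, and timeouts are all left out.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit   # whether this node is able to commit
        self.state = "init"

    def prepare(self):
        # Phase 1: do the work tentatively and vote yes or no.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


class Coordinator:
    def __init__(self, participants):
        self.participants = participants

    def run(self):
        # Phase 1 (Prepare): collect a vote from every participant.
        votes = [p.prepare() for p in self.participants]
        decision = "commit" if all(votes) else "abort"
        # Phase 2 (Commit): send the single global decision to everyone.
        for p in self.participants:
            if decision == "commit":
                p.commit()
            else:
                p.abort()
        return decision


ok = [Participant("p1"), Participant("p2")]
result_ok = Coordinator(ok).run()

mixed = [Participant("p1"), Participant("p2", can_commit=False)]
result_mixed = Coordinator(mixed).run()
```

Only the coordinator sees every vote, so only it can make the global decision; a single "no" vote aborts the transaction on all participants.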
It seems that 2PC meets the requirements well, but we have to analyze the execution process carefully to determine whether it is really good enough. The execution process can be analyzed along several dimensions:
I've tried to consider all of these dimensions in combination, but it's too complicated. Let's first find some rules and commonalities, rule out some combinations, and focus only on the key situations.
For the process:
For fault points and fault types:
(Incidentally, many of these problems come from the partial delivery of messages: the coordinator would need to send the commit command to multiple participants as a transaction, yet a transaction is exactly what we are trying to design.)
From the analysis above, we can draw a first conclusion: short-term blocking can occur anytime and anywhere. This is the nature of synchronous operations and an unavoidable shortcoming of 2PC.
Then, we focus on the following dimensions that may cause permanent blocking and data consistency problems:
| No. | Process | Coordinator | Participants | Consistency Hazard | Permanent Blocking |
| --- | --- | --- | --- | --- | --- |
| 1 | commit/abort | fail-dead | ok | no | no |
| 2 | commit/abort | fail-dead | fail-dead | yes | yes |
| 3 | commit/abort | fail-dead | fail-recover | no | no |
The explanation by number is listed below:
The analysis is tedious, but the conclusion is simple: when the coordinator and some participants fail-dead at the same time, permanent blocking and data consistency problems may arise.
As for network partitions, suppose a partition occurs while the coordinator is sending the commit/abort message, so that some participants receive the message and others do not. A new coordinator will be elected in each partition that lacks one. If the participants that received the message and those that did not end up in different partitions, the coordinators will make different decisions, resulting in data inconsistency between partitions.
| No. | Process | Network Partition | Consistency Hazard | Permanent Blocking |
| --- | --- | --- | --- | --- |
| 1 | commit/abort | yes | yes | no |
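The split decision can be illustrated with a small sketch. The decision rule below is an assumption made only for illustration (a coordinator commits if any participant in its partition has seen the commit message, and aborts otherwise), and `partition_decision` is a hypothetical name.

```python
def partition_decision(states):
    """Decision a coordinator makes from the states visible in its own partition."""
    if "committed" in states:
        return "commit"   # someone here saw the original commit message
    return "abort"        # assumed rule: no evidence of a commit, so abort

partition_a = ["committed", "committed"]   # participants that received the message
partition_b = ["prepared", "prepared"]     # participants the message never reached

decisions = {partition_decision(partition_a), partition_decision(partition_b)}
inconsistent = len(decisions) > 1
```

Each coordinator acts correctly on the information it has, yet the two partitions end up with opposite outcomes for the same transaction.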
The two types of failures (fail-dead and network partition) are analyzed separately above. When they occur together, their effects combine much like a bitwise OR.
To sum up, 2PC has three main problems, which fall into two types:
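The bitwise OR analogy can be made concrete with a flag type. The `Hazard` enum and the variable names below are hypothetical; the values simply mirror the yes/no columns of the two tables above.

```python
from enum import Flag

class Hazard(Flag):
    NONE = 0
    INCONSISTENCY = 1
    PERMANENT_BLOCKING = 2

# Coordinator and participants both fail-dead (row 2 of the first table).
fail_dead_both = Hazard.INCONSISTENCY | Hazard.PERMANENT_BLOCKING
# Only the coordinator fails-dead (row 1 of the first table).
fail_dead_coordinator_only = Hazard.NONE
# Network partition (the second table).
network_partition = Hazard.INCONSISTENCY

combined_worst = fail_dead_both | network_partition
combined_mild = fail_dead_coordinator_only | network_partition
```

A hazard is present in the combined case whenever it is present in either individual case.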
It was the data consistency problem from the previous article that led us to distributed transactions in this one. Yet 2PC itself still has consistency problems, and it may block permanently. We must find a way to solve these problems.
Let's think about it carefully. When the coordinator and some participants fail-dead at the same time and a new coordinator is elected, why can't the new coordinator decide whether the current transaction should be committed or aborted? We defined the coordinator role precisely so that it could make decisions. Why can't it make one in this case?
The key lies in a sentence mentioned above, when we reduced the dimensions of the problem: the coordinator and participants persist the transaction status locally.
It is precisely because each machine persists the transaction status locally that a fail-recover failure does not prevent a decision from being made.
However, in the fail-dead case, that locally persisted status is lost. If every machine that persisted the status is dead, the status is gone for good. In the example above, the coordinator fails-dead right after sending the first commit command, and the participant that received the command also fails-dead, so the remaining participants cannot tell what was decided.
We have found the source of the problem. The transaction status is lost because the decision that results from the first-stage vote is lost, so we send the decision to all participants before performing the real commit. This way, as long as one machine is still alive (all machines dying at once has to be prevented by multi-rack deployment and other node distribution schemes), the decision will not be lost.
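A small sketch shows why the survivors are stuck. It assumes three simplified participant states ("prepared", "committed", "aborted"); `recover_2pc` is a hypothetical helper, not part of any real system.

```python
def recover_2pc(surviving_states):
    """Try to reconstruct the 2PC decision from the states of surviving participants."""
    if "committed" in surviving_states:
        return "commit"    # a survivor saw the commit command
    if "aborted" in surviving_states:
        return "abort"     # a survivor voted no or saw the abort command
    # Everyone left is merely "prepared": the dead participant may have
    # committed or aborted, so neither outcome is safe to choose.
    return None

# The coordinator and the only participant that received "commit" are both dead.
stuck = recover_2pc(["prepared", "prepared"])

# If a participant that received the command survives, the decision can be recovered.
recovered = recover_2pc(["prepared", "committed"])
```

With only "prepared" survivors, committing risks contradicting an abort on the dead node and aborting risks contradicting a commit, so the transaction stays blocked.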
This is the idea behind Three-Phase Commit (3PC).
Between the two phases of 2PC, 3PC inserts a step dedicated to synchronizing the decision result. The system moves on to the next step only if this step succeeds; otherwise, it retries or aborts the transaction.
3PC solves the second problem of 2PC very well. However, it still cannot solve the third problem: data inconsistency after a network partition. The first problem of 2PC, the performance loss caused by short-term blocking, is common to all synchronous schemes, and 3PC can't do anything about it either.
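The three steps can be sketched as follows. This is a simplified in-memory illustration: the class, method, and state names are hypothetical, the name of the voting step is an assumption, and a failed synchronization step is simply aborted here rather than retried.

```python
class Participant3PC:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def vote(self):
        # Step 1: vote on whether this node is able to commit.
        self.state = "voted_yes" if self.can_commit else "aborted"
        return self.can_commit

    def pre_commit(self):
        # Step 2: record the global decision before anything is really committed.
        self.state = "pre_committed"
        return True                       # acknowledgement

    def do_commit(self):
        # Step 3: perform the real commit.
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def run_3pc(participants):
    def abort_all():
        for p in participants:
            p.abort()
        return "abort"

    if not all([p.vote() for p in participants]):
        return abort_all()
    # The decision is "commit"; synchronize it to everyone first.
    if not all([p.pre_commit() for p in participants]):
        return abort_all()
    for p in participants:
        p.do_commit()
    return "commit"


nodes = [Participant3PC("p1"), Participant3PC("p2")]
outcome = run_3pc(nodes)

refusing = [Participant3PC("p1"), Participant3PC("p2", can_commit=False)]
outcome_refused = run_3pc(refusing)
```

Because every participant records the decision before any real commit happens, a single surviving node is enough for a new coordinator to learn what was decided.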
In addition, if 2PC has no standby coordinator, the entire system is blocked for a long time whenever the coordinator fails, so 2PC is considered a blocking algorithm.
The phase added by 3PC addresses the blocking and consistency problems described above. A timeout mechanism, modeled on the coordinator's, is also introduced on the participant side to alleviate blocking: if a participant has completed the Pre-Commit phase but receives no Do-Commit command, it commits automatically once the timeout expires.
This way, 3PC can just barely be called a non-blocking algorithm (here, blocking means permanent blocking and excludes the short-term blocking caused by synchronous operations), but the possibility of data inconsistency increases. In a future article, we will look at how the timeout mechanism seems reliable but is full of uncertainty.
Although 3PC is better than 2PC at the algorithm level, the additional round of message synchronization makes the already poor performance worse, and the pursuit of non-blocking introduces a new possibility of inconsistency. It also has no good answer to network partitions. As a result, 3PC is not used as widely as 2PC in practice.
By contrast, 2PC has achieved the goal of distributed transactions and has been standardized as eXtended Architecture (XA), which is widely supported by PostgreSQL, MySQL, Oracle, and other databases, as well as by various languages and APIs.
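The participant-side timeout can be sketched with a logical clock so that the behavior is deterministic. `TimedParticipant` and its methods are hypothetical names; a real system would rely on timers and on messages arriving over the network.

```python
class TimedParticipant:
    def __init__(self, timeout):
        self.timeout = timeout
        self.state = "init"
        self.pre_commit_at = None

    def on_pre_commit(self, now):
        # Remember when the decision was received.
        self.state = "pre_committed"
        self.pre_commit_at = now

    def on_do_commit(self):
        self.state = "committed"

    def tick(self, now):
        # No Do-Commit arrived within the timeout: commit on our own.
        waiting = self.state == "pre_committed"
        if waiting and now - self.pre_commit_at >= self.timeout:
            self.state = "committed"


p = TimedParticipant(timeout=5)
p.on_pre_commit(now=0)
p.tick(now=3)
state_before_timeout = p.state
p.tick(now=5)
state_after_timeout = p.state
```

The participant never waits forever, but it commits based on a guess: if the coordinator actually changed its mind and the message was lost, the replicas diverge.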
This is also a reflection of the different trade-offs between theory and practice.
In addition to 2PC and 3PC, there is a distributed transaction implementation called Try-Confirm-Cancel (TCC).
TCC is similar to 2PC/3PC, but it couples the application layer into the whole process, so it is not discussed here.
In the previous article, we divided the solutions to the data consistency problem into prevention methods and treatment methods. We then introduced a preventive solution to data inconsistency: single-master synchronous replication.
This article briefly introduces several implementations of distributed transactions as a second preventive solution to data inconsistency.
However, neither of these two solutions can help when it comes to network partition.
When we introduced CAP, we said network partition is an unavoidable problem. So, in the next article, let's see if any preventive consistency algorithms can provide partition tolerance.
This is a carefully conceived series of 20-30 articles. I hope to give everyone a core grasp of distributed systems in a storytelling way. Stay tuned for the next one!