Learning about Distributed Systems – Part 10: An Exploration of Distributed Transactions

By Qinxia

Solve Data Consistency Problems with Distributed Transactions

In the previous article, we explained the consistency problems in distributed systems and introduced the single-master synchronous data replication mechanism in detail. However, that mechanism still cannot guarantee the strong consistency we need.

Then, through the analysis of data replication, we find that the data consistency problem can be solved with transactions.

While transactions can be used to solve the problem of consistency in data replication, they are not only intended for this purpose. Therefore, we must have a good understanding of distributed transactions.

Root Causes of Inconsistency

Data inconsistency among replicas is caused by machine or network failure. As a result, either the replicas do not know that they are different from each other, or they know they are different, but the problem cannot be solved.

In essence, each replica only holds partial information, so correct decisions cannot be made.

The right decision can be made if each replica holds all the information.

However, the cost of synchronizing all the information to every replica, together with the higher risk of errors in a more complex process, makes this solution infeasible.

The compromise is that only one replica (with the relevant functions abstracted and given a new role) holds all the information. Then, we can use it to coordinate the other replicas.

This idea is already reflected in the single-master, multiple-slave architecture. As a special replica, the master acts as a coordinator and replicates data to each of the other replicas separately and independently. Now, we need to integrate these separate coordination efforts.

Another Solution to the Problem

There are too many possible failures in a complex distributed system:

A server may go down, and there are two cases: the server recovers after a restart, or it never recovers.
Network failures are also divided into occasional jitter and long-term failures.
If a service fails, we need to confirm whether different roles of the service fail at the same time.

We also need to design a response mechanism for each possible failure. Let’s take the replica problem as an example:

Replicas are necessary to cope with failures.
Replicas must be distributed in different data centers to cope with IDC-level failures.
Replicas cannot all sit under the same switch if we want to cope with switch-level failures.

This makes it difficult to cover all possible failures and leads to more complex system design and implementation, which affects the reliability of the system.

We can think differently. We know failures may occur, and they come in many forms. Sometimes, we don't even know which failure is currently occurring.

So we stop dealing with failures in such a fine-grained way. We ignore the failure and continue processing the task. If it works, good. If it fails, whatever the reason, we restore the original state and try again later.
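
The "try first, restore on failure" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_atomically`, the in-memory dictionary, and `max_attempts` are hypothetical names, and a real system would restore from an undo log rather than from a deep copy.

```python
import copy


def run_atomically(state, steps, max_attempts=3):
    """Apply every step to state, or leave state exactly as it was."""
    for attempt in range(1, max_attempts + 1):
        snapshot = copy.deepcopy(state)      # remember the scene before we start
        try:
            for step in steps:
                step(state)
            return attempt                   # all steps succeeded
        except Exception:
            # We do not care why it failed: restore the scene and try again.
            state.clear()
            state.update(snapshot)
    raise RuntimeError("transaction still failing after retries")


# Demo: a transfer whose second step fails once (simulated network error).
failures_left = [1]


def debit(accounts):
    accounts["alice"] -= 10


def credit(accounts):
    if failures_left[0] > 0:
        failures_left[0] -= 1
        raise ConnectionError("simulated failure")
    accounts["bob"] += 10


accounts = {"alice": 100, "bob": 0}
attempts_used = run_atomically(accounts, [debit, credit])
```

In the demo, the first attempt debits one account and then fails, so the debit is undone before the second attempt runs. No half-finished transfer is ever left behind.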

2PC

The combination of the two ideas above gives us Two-Phase Commit (2PC). 2PC is also one of the typical implementations of distributed transactions.

We give the coordinating role a new name, Coordinator, and call the other participating roles Participants.

As the core, the coordinator holds global information and has the ability and right to make decisions. Participants only need to focus on themselves and their jobs.

This solution is called 2PC because the transaction commit is divided into two stages: Prepare and Commit.

The main execution process is listed below:

After receiving the transaction request from the application, the coordinator sends the Prepare command to all participants.
Participants need to make preparations locally to ensure the success of the transaction (such as acquiring locks) and record redo logs and undo logs for performing the redo or rollback operations (similar to stand-alone transactions). Participants will return Yes to the coordinator if the preparation work is done successfully. Otherwise, No is returned.
After the coordinator receives responses from all participants, it summarizes, checks, and records the responses in the local transaction log. If all responses are Yes, the coordinator sends a Commit command to all participants. Otherwise, the coordinator sends an Abort command.
Participants receive the command from the coordinator. If it is a Commit command, participants formally commit the transaction. If it is an Abort command, participants perform a rollback operation based on the undo log.
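
The four steps above can be sketched as follows. This is a minimal in-memory sketch that assumes no failures at all. The `Participant` class, its `can_prepare` flag, and the `coordinator_log` list are hypothetical stand-ins for real nodes and persistent logs.

```python
from dataclasses import dataclass, field
from enum import Enum


class Vote(Enum):
    YES = "yes"
    NO = "no"


@dataclass
class Participant:
    """Hypothetical in-memory participant; a real one would persist its logs."""
    name: str
    can_prepare: bool = True          # simulates whether local preparation succeeds
    state: str = "init"               # init -> prepared -> committed / aborted
    log: list = field(default_factory=list)

    def prepare(self) -> Vote:
        if not self.can_prepare:
            self.state = "aborted"
            return Vote.NO
        # Acquire locks and write redo/undo records before voting Yes.
        self.log.append("redo+undo written")
        self.state = "prepared"
        return Vote.YES

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        # Roll back based on the undo log, if anything was prepared.
        self.state = "aborted"


def two_phase_commit(participants, coordinator_log):
    # Phase 1 (Prepare): collect one vote from every participant.
    votes = [p.prepare() for p in participants]
    decision = "commit" if all(v is Vote.YES for v in votes) else "abort"
    # Record the decision locally before sending it out.
    coordinator_log.append(decision)
    # Phase 2 (Commit/Abort): send the same command to every participant.
    for p in participants:
        if decision == "commit":
            p.commit()
        else:
            p.abort()
    return decision
```

A single No vote is enough to abort the whole transaction. Everything that makes 2PC hard in practice (lost messages, crashes between the two loops) is deliberately left out here and is analyzed next.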

It seems that 2PC meets the requirements well, but we have to analyze the execution process carefully to determine whether it is really sound. The execution process can be analyzed along the following dimensions:

Process: Messages are delivered four times across the two phases, and each delivery can be subdivided into before, during, and after transmission.
Fault Point: Faults may occur in the server (participants and the coordinator) and network, which can be divided into a single fault and multiple faults.
Fault Type: The fault type includes fail-recover, fail-dead, network jitter (accidental packet loss), and network partition (network failure for a long time).
Impact: Short-term blocking affects performance, permanent blocking affects availability, and data consistency problems may occur.

I've tried to consider all of these dimensions in combination, but it's too complicated. Let's first find some rules and commonalities, rule out some combinations, and focus only on the key situations.

For the process:

In the first phase, since the data will not be committed, there will be no side effects if the entire transaction is canceled after a failure occurs, but blocking may occur during the process.
In the second phase, the data will be committed. Once a partial commit occurs, data consistency problems may appear. More specifically, when the response message from the participant is lost, the transaction has been executed, so no side effect is generated. Therefore, we only need to focus on the case of partial delivery of the message sent by the coordinator.

For fault points and fault types:

For fail-recover faults, the coordinator and participants persist the transaction status locally and messages are retried, so the current transaction is only blocked until the node recovers. Other transactions may also be blocked because the current transaction holds resources (such as locks), but no data consistency problem arises.
For fail-dead faults, data may become inconsistent because the local transaction status is lost. If the fail-dead fault occurs in a participant, the participant can copy data from other participants. However, if it occurs in the coordinator, there is nowhere to copy the coordinator's data from. So, we need to focus on the coordinator.
Network jitter failure can be resolved through message retry, and it only causes blocking and does not cause data consistency problems (strictly speaking, data is inconsistent before the retry is successful).
As the analysis of CAP in the previous article showed, network partition is an important cause of data consistency problems, so we need to focus on it.

(By the way, it seems that many problems are caused by partial delivery of messages, so the coordinator would need to make sending the commit command to multiple participants a transaction in itself. But a transaction is exactly what we are in the process of designing.)

From the analysis above, we can draw a first conclusion: short-term blocking can occur anytime and anywhere. This is the nature of synchronous operations and an unavoidable shortcoming of 2PC.

Then, we focus on the following dimensions that may cause permanent blocking and data consistency problems:

For the process, we focus on the second phase, and specifically on partially successful message delivery in that phase, as this is the case that has an actual impact.
For fault points and fault types, we focus on the fail-dead fault in the coordinator and the network partition.

No.	Process	Coordinator	Participants	Consistency Hazard	Permanent Blocking
1	commit/abort	fail-dead	ok	no	no
2	commit/abort	fail-dead	fail-dead	yes	yes
3	commit/abort	fail-dead	fail-recover	no	no

The explanation by number is listed below:

In number 1, the coordinator dies after sending the commit/abort message. Some participants receive it, and some do not. After the new coordinator is selected, it asks all participants about the status of the relevant transaction. Some responses carry the command, and some do not. This is sufficient to determine whether the transaction is to be committed or aborted, so the new coordinator only needs to send the command again to the participants that have not received it.
In numbers 2 and 3, the coordinator dies after sending the commit/abort message. Some participants receive it, and some do not. After the new coordinator is selected, it asks all participants about the status of the relevant transaction. Assume that exactly one participant does not respond. The others respond either with the command they received (all commit or all abort) or by reporting that they received no command. If at least one response carries a command, the new coordinator can determine the previous decision. If none does, it cannot, because the silent participant may be the only one that received the command. If the silent participant has a fail-recover fault, we will learn the status from it once it recovers, although the transaction is blocked in the meantime. However, if it has a fail-dead fault, the decision is lost forever, and the transaction is blocked forever. What's more, the participant may have completed the commit before it died, which leads to data inconsistency.

The process is troublesome, but the conclusion is quite simple. When the fail-dead fault occurs in the coordinator and some participants at the same time, permanent blocking and data consistency problems may arise.
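
The reasoning of the new coordinator in the three cases can be written down as a small decision function. This is a sketch under stated assumptions: `recover_decision_2pc` and the report values are hypothetical, and aborting when every participant answers without having seen a command is an assumption that is safe only because nobody can have committed in that situation.

```python
from typing import List, Optional


def recover_decision_2pc(reports: List[Optional[str]]) -> str:
    """Decide what a newly selected coordinator can do.

    reports holds one entry per participant:
      "commit" / "abort"  the command that participant received
      "prepared"          it voted Yes but received no command
      None                it did not respond (fail-recover or fail-dead)
    """
    for report in reports:
        if report in ("commit", "abort"):
            # Any surviving copy of the decision is enough: resend it to the rest.
            return report
    if any(report is None for report in reports):
        # The silent participant may be the only one that received the command,
        # and it may already have committed. The decision cannot be determined.
        return "blocked"
    # Everyone answered and nobody received a command, so nobody committed.
    return "abort"
```

With reports `["commit", "prepared", "prepared"]` the decision is recovered (number 1). With `["prepared", None, "prepared"]` the function returns `"blocked"`: the block is temporary if the silent participant later recovers (number 3) and permanent if it is dead (number 2).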

For network partition, suppose a partition occurs while the coordinator is sending the commit/abort message. Some participants receive the message, and some do not. New coordinators will be selected in the partitions that have no coordinator. If the participants that received the message and those that did not end up in different partitions, the coordinators may make different decisions, resulting in data inconsistency between partitions.

No.	Process	Network Partition	Consistency Hazard	Permanent Blocking
1	commit/abort	yes	yes	no

The two types of failures above (fail-dead and network partition) were analyzed separately. When the two are combined, the effect is similar to a bitwise OR: a hazard exists if either failure alone produces it.

To sum up, 2PC has three main problems, which fall into two types (blocking and consistency):

Short-term blocking issues that affect performance or short-term availability (the coordinator usually prepares standby nodes to avoid long-term blocking caused by a single point of failure, which can also be classified as such).
If the fail-dead fault occurs in the coordinator and some participants at the same time, the system may be permanently blocked and data inconsistency may occur.
Consistency issues after network partitioning.

3PC

It was the data consistency problem from the previous article that led us to distributed transactions in this one. Yet with 2PC, consistency problems still appear, and unrecoverable blocking may occur as well. We must find a way to solve these problems.

Let's think about it carefully. When the fail-dead fault occurs in the coordinator and a participant at the same time, why can't the newly elected coordinator decide whether the current transaction should be committed or aborted? We defined the coordinator role precisely so that it could make decisions. Why can't it make one in this case?

The key lies in a sentence from above, where we narrowed down the dimensions of the problem: The coordinator and participants will persist the transaction status locally.

It is because of the local persistence of the transaction status in each machine that we can ensure the failure of the fail-recover type will not lead to the failure of decision-making.

However, in the case of fail-dead, the transaction status is lost. If all the machines that have locally persisted the transaction status are dead, the status is completely lost. In the example above, the coordinator dies right after sending the first commit command, and the participant that received that command dies too, so the remaining participants hold no useful information about the decision.

We have found the source of the problem. Since the trouble comes from losing the decision result produced by the first-stage vote, we send the decision result to all participants before performing the real commit operation. This way, as long as one machine is still alive (the situation where all machines are dead needs to be avoided by multi-rack and other node distribution schemes), the decision result will not be lost.

This is the idea of Three-Phase Commit (3PC).

Between the two phases of 2PC, we insert a step dedicated to synchronizing the decision result. The system proceeds to the next step only if this step succeeds. Otherwise, it retries or aborts the transaction. This gives three phases:

Can-Commit is similar to the Prepare phase in 2PC.
Pre-Commit is a new stage where the decision maker synchronizes the decision result with the participants.
Do-Commit is similar to the Commit phase in 2PC.
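
The three phases, and the reason the extra round helps, can be sketched as follows. This is a simplified in-memory sketch that ignores network partition. `Participant3PC`, the `do_commit_reaches` knob (how many participants receive Do-Commit before the coordinator dies), and `recover_decision_3pc` are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Participant3PC:
    name: str
    vote_yes: bool = True
    state: str = "init"   # init -> ready -> pre-committed -> committed, or aborted

    def can_commit(self) -> bool:
        self.state = "ready" if self.vote_yes else "aborted"
        return self.vote_yes

    def pre_commit(self) -> None:
        self.state = "pre-committed"   # the decision result is now stored here too

    def do_commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def three_phase_commit(participants, do_commit_reaches=None):
    # Phase 1 (Can-Commit): ask everyone; the list forces every vote to be collected.
    if not all([p.can_commit() for p in participants]):
        for p in participants:
            p.abort()
        return "abort"
    # Phase 2 (Pre-Commit): everyone learns the decision before anyone commits.
    for p in participants:
        p.pre_commit()
    # Phase 3 (Do-Commit): may be cut short by a simulated coordinator death.
    reached = participants if do_commit_reaches is None else participants[:do_commit_reaches]
    for p in reached:
        p.do_commit()
    return "commit"


def recover_decision_3pc(surviving_states):
    # Do-Commit is only sent after every participant has stored Pre-Commit,
    # so one surviving pre-committed or committed participant proves "commit".
    if any(s in ("pre-committed", "committed") for s in surviving_states):
        return "commit"
    # Otherwise nobody can have committed yet, so aborting is safe.
    return "abort"
```

If the coordinator dies after sending Do-Commit to only the first participant, and that participant dies as well, the survivors are still in the pre-committed state and the new coordinator can recover the decision. In 2PC the same double failure left the decision unknown.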

3PC solves the second problem of 2PC very well. However, there is still no way to solve the third problem, data inconsistency after network partitioning. The first problem of 2PC, the performance loss caused by short-term blocking, is common to all synchronous schemes, and 3PC can't do anything about it.

In addition, if 2PC does not have a standby coordinator, the entire system will be blocked for a long time once the coordinator fails, so 2PC is considered a blocking algorithm.

The added phase of 3PC addresses the permanent blocking and consistency problem described above. In addition, 3PC introduces a timeout mechanism on the participant side, modeled on the coordinator's own timeout, to alleviate blocking. After the Pre-Commit phase, a participant that receives no Do-Commit command commits automatically once the timeout expires.

This way, 3PC can, with some reluctance, be called a non-blocking algorithm (here, blocking refers to permanent blocking and excludes short-term blocking due to synchronous operations), but the possibility of data inconsistency increases. In a future article, we will focus on how the timeout mechanism seems reliable but is full of uncertainties.
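
A small sketch shows both the timeout rule and the inconsistency it can introduce. It uses logical time instead of real clocks. `TimedParticipant`, the 5.0 timeout value, and the scenario in which the coordinator aborts after Pre-Commit (for example, because another participant's acknowledgement never arrived) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TimedParticipant:
    pre_commit_at: float            # logical time at which Pre-Commit was received
    timeout: float = 5.0            # hypothetical timeout, in the same time unit
    state: str = "pre-committed"

    def on_command(self, command: str) -> None:
        # A command from the coordinator only matters while we are still waiting.
        if self.state == "pre-committed":
            self.state = "committed" if command == "do-commit" else "aborted"

    def on_tick(self, now: float) -> None:
        # No Do-Commit arrived in time: commit on our own instead of blocking.
        if self.state == "pre-committed" and now - self.pre_commit_at >= self.timeout:
            self.state = "committed"


# Two participants that both received Pre-Commit at time 0.
p1 = TimedParticipant(pre_commit_at=0.0)
p2 = TimedParticipant(pre_commit_at=0.0)

# The coordinator decides to abort, but only p2 receives the message.
p2.on_command("abort")
p1.on_tick(now=6.0)   # p1 hears nothing and times out
```

Here `p1` commits by timeout while `p2` aborts, so the same transaction ends in two different states. The participant never blocks, but it pays for that with a guess that can be wrong.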

Although 3PC is better than 2PC at the algorithm level, the additional round of message synchronization makes the already poor performance worse. The pursuit of non-blocking introduces a new possibility of inconsistency. Also, there is still no good way to handle network partition. As a result, 3PC has not seen wider use than 2PC in practice.

By contrast, 2PC has achieved the goal of distributed transactions and formed the basis of a standard called eXtended Architecture (XA), which is widely adopted by PostgreSQL, MySQL, Oracle, and other databases and is supported by various languages and APIs.

This is also a reflection of the different trade-offs between theory and practice.

In addition to 2PC and 3PC, there is a distributed transaction implementation called Try-Confirm-Cancel (TCC).

TCC is similar to 2PC/3PC, but it couples the application layer into the whole process. We do not discuss it here.

TL;DR

The root of inconsistency is that each replica only has local information and cannot make correct decisions. A role that holds all the information is required.
Instead of passively solving problems one by one, consider trying first and then rolling back if it fails (i.e., transactions).
2PC is a typical implementation of distributed transactions. It is divided into two phases, Prepare and Commit: coordinate all participants first, then perform the operation.
2PC may cause blocking and data consistency problems in some cases.
3PC solves the blocking problem when the coordinator and the participant crash at the same time by inserting a round of messages to synchronize decision results.
Although 3PC alleviates blocking and solves some data inconsistency problems, it deteriorates the performance and introduces new data inconsistency problems.
Neither 2PC nor 3PC provides partition tolerance.

Conclusion

In the previous article, we divided the solutions to the data consistency problem into the prevention method and the treatment method. Then, we introduced a preventive solution to data inconsistency: single-master synchronous replication.

This article briefly introduces several implementations of distributed transactions as a second preventive solution to data inconsistency.

However, neither of these two solutions can help when it comes to network partition.

When we introduced CAP, we said network partition is an unavoidable problem. So, in the next article, let's see if any preventive consistency algorithms can provide partition tolerance.

This is a carefully conceived series of 20-30 articles. I hope to give everyone a core grasp of the distributed system in a storytelling way. Stay tuned for the next one!
