In a distributed system that reads and writes data on multiple nodes, if you still need to ensure the atomicity, consistency, isolation, durability (ACID) features, you must implement distributed transactions. The key to this implementation is an appropriate commit protocol. At present, the two-phase commit protocol (2PC) is the most concise and most widely used protocol for this purpose.
Standalone systems implement local transactions through transaction managers (TMs). Distributed systems coordinate TMs of multiple nodes to collectively commit success or failure. Therefore, transaction coordinators (TCs) are required. A distributed transaction system mainly consists of these two subsystems, which are also called the participant and the coordinator based on their roles in transaction execution.
A local TM implements functions such as concurrency control of local transactions and recovery from exceptions. A TC enables transactions, divides transactions into multiple sub-transactions and distributes them to corresponding nodes for execution, and coordinates transactions until completion (collectively commits success or failure). You can either implement TM and TC in the same process or deploy them on different nodes.
The process of two-phase commit is simple. When a distributed transaction T is completed, that is, all the nodes that have executed the transaction inform the TC of the execution completion, the TC starts the process of two-phase commit.
Phase 1 - Prepare:
<Prepare T>log and persists it. The TC sends a "Prepare T" message to all participants.
<Ready T>log and persists it, and then sends a "Ready T" message to the TC.
<Abort T>log and persists it, and sends an "Abort T" message to the TC. Then, the TM enters the transaction abortion process locally.
Phase 2 - Commit:
1. When the TC has received responses from all nodes or the waiting timer times out, the TC decides whether to commit or abort the transaction.
<Commit T>log and persists it, and then sends a "Commit T" message to all participants.
<Abort T>log, and then sends an "Abort T" message to all participants.
2. After the participants receive the message from the TC, they write
<Commit T> or
<Abort T> logs and persist them.
The 2PC protocol can ensure a key point in the execution of distributed transactions: A participant can decide whether to abort the transaction at any time before it sends a "Ready T" message to the TC. Once this message is sent, the transaction enters the ready state, in which commit and abortion are completely controlled by the TC. When a participant sends a "Ready T" message, it essentially sends a formal and irreversible commitment to the TC. To ensure this commitment, the participant must persist all the necessary information before it sends a "Ready T" message. Otherwise, if the participant crashes after it sends a "Ready T" message, it may be unable to keep the preceding commitment after a restart. In Phase 2, when the coordinator has written a
<Commit T> or
<Abort T> log, the result of the transaction is decided, that is, it will not change anymore.
To optimize the performance of 2PC, it is crucial to reduce the persistence of critical paths and the number of remote procedure calls (RPCs). An optimization method for typical 2PC is as follows:
The TC is stateless and does not persist logs. However, to facilitate transaction state recovery after a restart that follows a crash, the TC must send a list of the participants in the transaction to each participant and persist this list. If this is the case, even if the TC goes down, each participant can easily communicate with other participants to query the transaction state. This method means that each participant performs the task of querying the transaction state during the downtime of the TC.
When all participants have prepared, the transaction will be successfully committed. Therefore, to reduce the delay in committing, the TC can return a success message to the clients on receipt of "Ready" from all participants. However, read requests may need to wait until the commit is completed, therefore increasing the delay of read requests. On the other hand, if the TC returns a success message after it confirms that all participants have committed, there will be a longer delay in committing and shorter delay of read requests.
The normal 2PC process is simple. However, various exceptions in distributed systems must be considered, such as node failures and network partitioning.
1. If the TC detects a participant failure:
2. If a participant fails during transaction committing (TC), the local transaction status is determined based on the content of participant logs during the recovery process.
<Commit T>log, it indicates that the transaction has been committed, and the participant runs the REDO(T) command.
<Abort T>log, it indicates that the transaction has failed, and the participant runs the UNDO(T) command.
<Ready T>log, the participant P communicates with other nodes to query the current transaction state.
3. If the TC fails during the transaction commit process, each participant needs to decide the local behavior based on the global transaction state by communicating with other participants. A decision on the transaction status has been made:
<Commit T>logs) have committed the transaction, it indicates that the transaction T must be committed.
<Abort T>logs) have aborted the transaction, it indicates that the transaction T must be aborted.
A decision on the transaction status has not been made:
<Ready T>logs) are not in the ready state, it indicates that no global consensus has been reached on whether to commit the transaction or not. There are two options: (1) The participants wait for the TC to recover. (2) The participants abort the transaction. The latter option is selected in most cases to decrease the duration of resource occupation.
<Abort T>logs, no participant knows the current transaction state, so the participants must wait for the TC to recover. In fact, even if there are
<Abort T>logs, log query is time-consuming, and log collection is another consideration. Again, this is one of the worst scenarios in the 2PC protocol.
If all participants are in the ready state and are waiting for the next command from the TC, and an exception occurs on the TC at this point, the participants will keep occupying system resources. If the system has implemented lock-based concurrency control, the participants will also keep holding the locks, and therefore keep other transactions waiting. If this issue exist for a long time, it will have a significant impact on the system. Therefore, the biggest problem for the 2PC protocol is in this scenario: If the TC fails, transactions may be blocked, and final state of the pending transactions cannot be decided until the TC recovers. Meanwhile, if a participant restarts after a crash and plays back these pending transactions, the recovery process will also be blocked due to waiting.
The three-phase commit protocol (3PC) is an extension of the 2PC protocol to solve the blocking problem, but it has also introduced other problems. It addresses the blocking problem by introducing a timeout mechanism: If a participant has sent a PreCommit request but does not receive the final DoCommit request from the TC, the participant must wait for the transaction to be automatically committed upon timeout. Obviously, this will cause consistency problems. For example, if the TC goes down when it receives a PreCommit failure from a participant and intends to send an abort request to other participants, the corresponding distributed transaction is supposed to fail. However, some participants may commit it due to timeout.
To address this problem, an additional phase called "CanCommit" has been introduced to 3PC, which is the first phase. In this phase, the TC asks all participants if they can commit, and the participants that are in the normal state respond with a "CanCommit" message. However, these participants do not occupy any system resources at this point. If the TC receives the "CanCommit" message from all participants, it considers all participants to be normal and supposes the later commit to be successful. However, there is still a small probability of failure: After a participant fails to execute operations in the PreCommit phase, both the TC and the participant go down, but other participants commit the transaction upon timeout, resulting in inconsistency.
Therefore, another optimization has been made in 3PC: If the TC goes down, the system quickly selects a new TC to resume the process and try to ensure that no participant will commit the transaction upon timeout. However, in case of exceptions such as network partitioning, the new TC is unable to reach the participants, so consistency problems may still occur.
3PC improves the availability at the cost of reduced consistency and increased network overhead, which is unacceptable for many online transaction processing (OLTP) systems.
However, with high availability of the TC, the blockage time can be significantly reduced. There are high-availability solutions based on consistency protocols such as Paxos or Raft. These solutions allow multiple nodes to reach consensus on whether to commit or abort a transaction before they inform the participants. When an exception occurs on the TC, the system will quickly select a new coordinator to proceed the transaction until completion.
Stone Doyle - January 28, 2021
ApsaraDB - July 3, 2019
Alibaba Clouder - January 30, 2019
Alibaba Cloud New Products - June 2, 2020
Alibaba Clouder - April 11, 2019
Alipay Technology - November 12, 2019
Mitigate the scalability problem of single machine relational databases for large-scale online databases.Learn More
A ledger database that provides powerful data audit capabilities.Learn More
A financial-grade distributed relational database that features high stability, high scalability, and high performance.Learn More
Leverage cloud-native database solutions dedicated for FinTech.Learn More
More Posts by ApsaraDB