Nodes often fail while a distributed system is running, so you must be able to add, remove, and replace nodes on demand.
Membership change is an important topic in distributed systems, especially in consensus-based systems, because it improves O&M capabilities and service availability.
The two-phase Joint Consensus method proposed in Raft is the mainstream membership change method in the industry and has greatly advanced the engineering use of membership change. However, Joint Consensus requires two phases, and two logs must be proposed for each change, which is inconvenient in some systems. Raft also proposed a single-step membership change method, but it can add or remove only one member at a time, which is restrictive and error-prone. Therefore, the single-step method is generally not recommended.
It is natural to wonder whether Joint Consensus membership change can be implemented in only one step. This article discusses this topic.
Membership change refers to a change in the set of nodes that participate in the consensus protocol while the cluster is running, such as adding, removing, or replacing a node. The process of membership change shouldn't affect system availability.
Membership change is itself a consensus problem: all nodes must agree on the member configuration. However, it has one particularity: the set of members participating in the vote changes during the membership change.
Figure 1 – At some point of membership change, two disjoint majorities exist simultaneously in Cold and Cnew
If membership change is treated as an ordinary consensus problem, each node may switch from Cold to Cnew at a different time. At some moment, two disjoint majorities may exist simultaneously, one in Cold and one in Cnew. This is the double Quorum problem, and it destroys consensus.
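The double Quorum problem can be reproduced with a small enumeration. The following is a minimal sketch, not part of any Raft implementation: the node names A to E, the expansion from three to five nodes, and the helper names `majorities` and `disjoint_majorities` are assumptions chosen for illustration.

```python
from itertools import combinations


def majorities(members):
    """Return every minimal majority of a member set as a frozenset."""
    size = len(members) // 2 + 1
    return [frozenset(c) for c in combinations(sorted(members), size)]


def disjoint_majorities(c_old, c_new):
    """Return all pairs (majority of c_old, majority of c_new) that share no node."""
    return [
        (q_old, q_new)
        for q_old in majorities(c_old)
        for q_new in majorities(c_new)
        if not (q_old & q_new)
    ]


# Hypothetical change: grow a 3-node cluster to 5 nodes in one direct switch.
c_old = {"A", "B", "C"}
c_new = {"A", "B", "C", "D", "E"}

pairs = disjoint_majorities(c_old, c_new)
print(len(pairs))  # 3
for q_old, q_new in pairs:
    print(sorted(q_old), sorted(q_new))
```

One of the reported pairs is {A, B} in Cold and {C, D, E} in Cnew: if A and B have not switched yet while C, D, and E have, each group can form its own majority.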
To solve this problem, Raft uses Joint Consensus, a two-phase method for membership change.
Joint Consensus avoids the double Quorum problem by adding a joint member configuration, Cold,new, as a transition configuration. Cold,new is the combination of Cold and Cnew. Every Quorum of Cold,new intersects every Quorum of Cold, and every Quorum of Cold,new also intersects every Quorum of Cnew. Membership change starts with the switch from Cold to Cold,new. After Cold,new is committed, the cluster switches to Cnew. This process guarantees that Cold and Cnew never make decisions independently at the same time, so the double Quorum problem is avoided and safety is guaranteed.
Figure 2 – The relationship between the Quorum of Cold, Cold,new, and Cnew
Joint Consensus uses two logs to complete a membership change. After receiving the change request, the Leader replicates a Cold,new log to the nodes in Cold and Cnew. From then on, every log must be acknowledged by a majority of Cold and a majority of Cnew. The Cold,new log itself can be committed only after both majorities have acknowledged it. Then, the Leader replicates a log that contains only Cnew to Cold and Cnew. From then on, logs need only the acknowledgment of a majority of Cnew. The Cnew log is committed once a majority of Cnew has acknowledged it. At this point, the membership change is complete, and members not included in Cnew are disabled automatically.
Figure 3 – Membership change process of Joint Consensus
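The commit rule in each phase can be sketched as a majority check per member set. This is a minimal illustration rather than code from a Raft library: the function names `has_majority` and `is_committed`, the node names, and the representation of a configuration as a tuple of member sets are assumptions.

```python
def has_majority(acks, members):
    """True if more than half of `members` appear in `acks`."""
    return len(acks & members) > len(members) // 2


def is_committed(acks, config):
    """Check the commit rule for a configuration.

    `config` is a tuple of member sets (an assumed representation):
    (c_old,) or (c_new,) for a plain configuration, and (c_old, c_new)
    for the joint configuration Cold,new, which needs both majorities.
    """
    return all(has_majority(acks, members) for members in config)


c_old = frozenset({"A", "B", "C"})
c_new = frozenset({"C", "D", "E"})
joint = (c_old, c_new)

print(is_committed({"A", "B"}, joint))            # False: no majority of Cnew
print(is_committed({"A", "B", "D", "E"}, joint))  # True: majority of both
print(is_committed({"D", "E"}, (c_new,)))         # True: only Cnew counts now
```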
If a Failover occurs during membership change and the old Leader goes down, any node in Cold,new may become the new Leader. If the new Leader does not have the Cold,new log, it continues to use Cold. If a Follower holds the Cold,new log, the new Leader truncates it, the Follower rolls back to Cold, and the membership change fails. If the new Leader has the Cold,new log, it continues the unfinished membership change.
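This behavior follows from the Raft rule that a node always uses the latest configuration in its log, whether or not that log is committed. A minimal sketch, assuming a hypothetical log made of dictionaries with `kind` and `config` fields and plain string labels for the configurations:

```python
def effective_config(log, initial_config):
    """Return the most recent configuration found in the log.

    Falls back to `initial_config` when the log holds no configuration
    entry. The entry layout is an assumption made for this sketch.
    """
    for entry in reversed(log):
        if entry["kind"] == "config":
            return entry["config"]
    return initial_config


# A Follower that already received the (uncommitted) Cold,new log.
follower_log = [
    {"kind": "op", "data": "x=1"},
    {"kind": "config", "config": "C_old,new"},
]
print(effective_config(follower_log, "C_old"))  # C_old,new

# A new Leader without that log truncates the Follower's log to match its own.
truncated_log = follower_log[:1]
print(effective_config(truncated_log, "C_old"))  # C_old
```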
Joint Consensus needs two phases because it makes no assumptions about the relationship between Cold and Cnew. The two-phase scheme avoids the double Quorum problem that arises when a Quorum of Cold and a Quorum of Cnew are disjoint.
If we tighten the restriction on membership change so that every Quorum of Cold intersects every Quorum of Cnew, the double Quorum problem cannot occur. The membership change can then be simplified into one phase.
The key to single-step membership change is therefore to restrict Cold and Cnew so that their Quorums always intersect. How can we do that? The method is to add or remove only one member in each membership change.
Figure 4 – Quorum of Cold and Cnew when adding or removing one member
The case of adding or removing one member, shown in figure 4, can be proved rigorously. As long as only one member is added or removed at a time, no two disjoint Quorums can form in Cold and Cnew. Cold can therefore be switched to Cnew directly, and no transition member configuration is needed to implement single-step membership change.
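The claim can also be checked by brute force for small clusters. The sketch below is an illustration under assumptions: the helper names and the size limit of six nodes are arbitrary, and nodes are represented by integers. The underlying argument is that when Cold has n members and Cnew has n+1, the two majority sizes add up to n+2, which exceeds the n+1 distinct nodes available, so any two majorities must share a node.

```python
from itertools import combinations


def majorities(members):
    """Return every minimal majority of a member set as a frozenset."""
    size = len(members) // 2 + 1
    return [frozenset(c) for c in combinations(sorted(members), size)]


def quorums_always_intersect(c_old, c_new):
    """True if every majority of c_old shares a node with every majority of c_new."""
    return all(q_old & q_new for q_old in majorities(c_old) for q_new in majorities(c_new))


def check_single_step(max_size=6):
    """Check every add-one and remove-one change for clusters of 1..max_size nodes."""
    for n in range(1, max_size + 1):
        c_old = set(range(n))
        # Adding one new node (named n here).
        if not quorums_always_intersect(c_old, c_old | {n}):
            return False
        # Removing any one existing node (skipped for a single-node cluster).
        if n >= 2:
            for removed in c_old:
                if not quorums_always_intersect(c_old, c_old - {removed}):
                    return False
    return True


print(check_single_step())  # True
# Adding two members at once breaks the property:
print(quorums_always_intersect({"A", "B", "C"}, {"A", "B", "C", "D", "E"}))  # False
```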
Single-step membership change can change only one member at a time. To change or replace multiple members, perform single-step membership change multiple times.
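A larger change can be broken down into such steps mechanically. The sketch below is one possible decomposition, not a prescribed one: the function name `single_step_plan` and the choice to add all new members before removing old ones are assumptions. In a real system, each step would have to be committed before the next one is proposed.

```python
def single_step_plan(c_old, c_new):
    """Return the intermediate configurations from c_old to c_new.

    Each configuration differs from the previous one by exactly one member.
    Additions come first, then removals (an assumed ordering).
    """
    plan = []
    current = set(c_old)
    for node in sorted(set(c_new) - set(c_old)):
        current = current | {node}
        plan.append(frozenset(current))
    for node in sorted(set(c_old) - set(c_new)):
        current = current - {node}
        plan.append(frozenset(current))
    return plan


# Hypothetical replacement of B and C with D and E.
plan = single_step_plan({"A", "B", "C"}, {"A", "D", "E"})
for step in plan:
    print(sorted(step))
# ['A', 'B', 'C', 'D']
# ['A', 'B', 'C', 'D', 'E']
# ['A', 'C', 'D', 'E']
# ['A', 'D', 'E']
```

Replacing two members therefore takes four single-step changes and four logs, compared with two logs for one Joint Consensus change.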
Although the theory of single-step membership change is simple, it causes many problems in practice. A previous article, Raft Engineering Practices and the Cluster Membership Change, describes that topic in detail.
Joint Consensus membership change is more versatile, but it involves two phases, and one membership change requires two logs to be committed. Single-step membership change requires only one log per change, but only one member can be changed at a time. Can the advantages of the two be combined? Can Joint Consensus membership change be implemented in a single step?
During Joint Consensus membership change, once the Cold,new log is committed, the nodes have already reached consensus on the Cnew configuration. What, then, is the role of the Cnew log? Could nodes switch from Cold,new to Cnew as soon as the Cold,new log is committed? If so, the Cnew log would no longer be necessary, and a single-step implementation would be achieved.
Consider the function of the Cnew log in Joint Consensus membership change. After the Cold,new log is committed, a Cnew log is proposed. When a node receives and persists the Cnew log, it switches from the Cold,new configuration to the Cnew configuration. After the Cnew log is committed, members not in the Cnew configuration are disabled. The Cnew log therefore serves two functions: it notifies nodes to switch from Cold,new to Cnew, and its commit marks the point at which members not in Cnew can be disabled safely.
If the work of the Cnew log can be finished without using it, doesn't it mean the two-phase Joint Consensus membership change can be achieved in a single step? This approach has been explored systematically.
ZooKeeper has supported membership change based on Zab since Version 3.5.0. ZooKeeper provides the Primary Order property, which Joint Consensus membership change with two logs cannot guarantee. To make membership change general without losing the Primary Order property, ZooKeeper proposed its own membership change method in the paper Dynamic Reconfiguration of Primary/Backup Clusters and applied it. ZooKeeper did this earlier than Raft.
Figure 5 shows the ZooKeeper membership change protocol. In the figure, the old member configuration is represented by S, and the new member configuration is represented by S', with P being the Leader node. Figure 5 shows the process of replacing nodes B1 and B2 with nodes B3 and B4:
Figure 5 – ZooKeeper membership change protocol
Initialization: To obtain the latest data, the new nodes B3 and B4 in the new member configuration S' first connect to the current primary node P, and P transmits its current state to them as their initial state. In the Zab protocol, this transfer occurs automatically when a secondary node connects to the primary node, and the secondary node then continues to receive all subsequent operation logs (such as Op1 and Op2 in the figure) from the primary node P. During this process, nodes B3 and B4 do not participate in voting.
If a Failover occurs during membership changes, the following situations may occur:
If the Failover occurs after the COP log is sent and before ACTIVATE, any node in the old or new member configuration may become the new Leader. If the new Leader does not have the COP log, the membership change fails. If the new Leader has the COP log, it resumes the unfinished membership change.
If the Failover occurs after ACTIVATE, the membership change has been completed, but there is no guarantee that the new Leader is in the new member configuration. Until that is guaranteed, the nodes that are not in the new member configuration cannot be disabled. Therefore, a no-op log must be committed in the new member configuration after the ACTIVATE message is sent. Once the no-op log is committed, any new Leader is guaranteed to be in the new member configuration, and the nodes not included in the new member configuration can be disabled safely.
ZooKeeper uses an asynchronous Commit message, namely the ACTIVATE message, to notify nodes to switch from the old member configuration to the new member configuration. An asynchronous no-op log allows nodes that are not in the new member configuration to be disabled safely. Together, the ACTIVATE message and the asynchronous no-op log of ZooKeeper play the role of the Cnew log in Joint Consensus membership change.
The ZooKeeper membership change protocol is not as concise as the Joint Consensus membership change protocol. Joint Consensus uses two phases to ensure the safety of membership change without imposing many restrictions. Can the ZooKeeper membership change protocol be improved?
The asynchronous ACTIVATE message and no-op log exist in the ZooKeeper membership change protocol to perform the function of the Cnew log. With this in mind, the Cnew log of Joint Consensus membership change can be turned into an asynchronous log. After the Cold,new log is committed, the membership change is considered complete, and the Cnew log can be committed asynchronously. Once the Cold,new log is committed, the new member configuration has been agreed on and will never be rolled back to the old member configuration. The remaining steps, including committing the Cnew log, then complete in the background.
Another improvement is to keep the ACTIVATE message but drop the no-op log. How, then, can we ensure that a node that has switched to the new member configuration is elected preferentially? For election safety, the node with the most up-to-date log has priority in an election. Building on this, when candidates' logs are equally up to date, nodes vote preferentially for candidates that have switched to the new member configuration. In this way, nodes that have switched to the new member configuration have priority in elections. After a majority of nodes have switched to the new member configuration, nodes that are not in the new member configuration can be disabled safely.
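One way to read this voting preference is as a tie-breaker added to the usual log comparison. The sketch below is only an interpretation of the idea, not ZooKeeper or Raft code: the `NodeState` fields, the `rank` ordering, and the `grants_vote` rule are assumptions, and term checks and vote bookkeeping are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeState:
    last_log_term: int
    last_log_index: int
    uses_new_config: bool  # True once the node has switched to the new configuration


def rank(state):
    """Order nodes by log freshness first, then by the configuration switch."""
    return (state.last_log_term, state.last_log_index, state.uses_new_config)


def grants_vote(voter, candidate):
    """A voter supports a candidate that ranks at least as high as itself."""
    return rank(candidate) >= rank(voter)


switched = NodeState(last_log_term=3, last_log_index=10, uses_new_config=True)
not_switched = NodeState(last_log_term=3, last_log_index=10, uses_new_config=False)
fresher_log = NodeState(last_log_term=3, last_log_index=11, uses_new_config=False)

print(grants_vote(switched, not_switched))  # False: same log, candidate has not switched
print(grants_vote(not_switched, switched))  # True: same log, candidate has switched
print(grants_vote(switched, fresher_log))   # True: a more up-to-date log still wins
```

Because the log comparison stays first in the ordering, the tie-breaker never overrides the rule that the most up-to-date log wins.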
The Joint Consensus membership change method has greatly facilitated the engineering use of membership change. It is simple and versatile, but it uses two phases, and two logs must be committed for each change. This article discusses how to implement two-phase Joint Consensus membership change in a single step and proposes several improvements, providing more options for the engineering application of membership change.