Raft Engineering Practices and the Cluster Membership Change

This article discusses issues that arise when implementing joint consensus and some problems with Raft's one-step cluster membership change.

Introduction

Because servers in a distributed system break down frequently, the system must be able to add and remove servers dynamically without affecting availability.

Cluster membership change refers to a change in the servers running the consensus protocol during cluster operation, such as adding, removing, and replacing servers.

A cluster membership change is itself a consensus problem: all servers must agree on the new set of members. However, it is a special case, because the set of members that votes changes during the process.

If a cluster membership change is treated as a general consensus problem, the change request is sent directly to the leader. The leader replicates a membership changelog entry and commits it once a majority has acknowledged it. Each server then switches its membership configuration from the old one (C-old) to the new one (C-new) after committing the membership changelog entry.

Because each server commits the membership changelog entry at a different time, each server may switch from C-old to C-new at a different time. At some point, a majority of C-old and a majority of C-new may be disjoint. As a result, two leaders could be elected and reach different decisions, which violates safety.

Figure 1: At some point in the cluster membership change, two disjoint majorities in C-old and C-new exist at the same time.

As shown in Figure 1, the cluster grows from three servers to five. If the servers switch directly, server 1 and server 2 may form a majority of C-old while server 3, server 4, and server 5 form a majority of C-new. Because the two majorities are disjoint, they may reach conflicting decisions.
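The situation in Figure 1 can be checked by brute force. The following is a minimal sketch, not part of any Raft library: the helper names `majority` and `disjoint_majorities` are made up for illustration, and servers are represented by plain integers.

```python
from itertools import combinations


def majority(n: int) -> int:
    """Smallest number of servers that forms a majority of n servers."""
    return n // 2 + 1


def disjoint_majorities(c_old: set, c_new: set):
    """Return one pair of non-overlapping minimal majorities, or None.

    Checking minimal majorities is enough: if two larger majorities were
    disjoint, minimal subsets of them would be disjoint as well.
    """
    for q_old in combinations(sorted(c_old), majority(len(c_old))):
        for q_new in combinations(sorted(c_new), majority(len(c_new))):
            if not set(q_old) & set(q_new):
                return set(q_old), set(q_new)
    return None


# Figure 1: growing directly from three servers to five.
print(disjoint_majorities({1, 2, 3}, {1, 2, 3, 4, 5}))
# Growing by a single server leaves no disjoint pair.
print(disjoint_majorities({1, 2, 3}, {1, 2, 3, 4}))
```

For the three-to-five case the search finds servers 1 and 2 on one side and servers 3, 4, and 5 on the other, which is exactly the split described above.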

Because of this, a cluster membership change cannot be handled as a general consensus problem. To solve it, Raft provides joint consensus, a two-step method for membership change.

Cluster Membership Change of the Joint Consensus

With joint consensus, the cluster first switches from C-old to a transitional configuration, C-old,new, which combines C-old and C-new. Once the C-old,new entry is committed, the cluster switches to C-new.

Figure 2: Cluster membership change of the joint consensus

After receiving the request, the leader replicates a C-old,new log entry to the servers in both C-old and C-new. From that point on, every log entry must be acknowledged by a majority of C-old and a majority of C-new, and the C-old,new log entry itself is committed only when both majorities have acknowledged it. The leader then replicates a log entry containing only C-new to the servers in C-old and C-new. The C-new log entry is committed as soon as a majority of C-new has acknowledged it, at which point the membership change is complete and servers that are not in C-new shut down automatically.
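The two commit rules can be sketched as follows. This is an illustrative fragment under simple assumptions: configurations and acknowledgments are plain sets of server ids, and the function names are hypothetical rather than taken from an existing implementation.

```python
def has_majority(acks: set, config: set) -> bool:
    """True if the acknowledging servers include a majority of config."""
    return len(acks & config) >= len(config) // 2 + 1


def joint_entry_committed(acks: set, c_old: set, c_new: set) -> bool:
    """Entries in the C-old,new phase need both majorities."""
    return has_majority(acks, c_old) and has_majority(acks, c_new)


def new_entry_committed(acks: set, c_new: set) -> bool:
    """The C-new entry needs only a majority of C-new."""
    return has_majority(acks, c_new)


c_old = {1, 2, 3}
c_new = {1, 2, 3, 4, 5}

# Servers 3, 4, 5 are a majority of C-new but not of C-old.
print(joint_entry_committed({3, 4, 5}, c_old, c_new))  # False
# Servers 1, 2, 4 are a majority of both configurations.
print(joint_entry_committed({1, 2, 4}, c_old, c_new))  # True
```

Because a joint entry always needs a majority of C-old, a majority of C-new alone can never decide anything while C-old is still able to decide on its own.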

If the leader crashes and a failover occurs, any server in C-old,new could become the new leader. If the new leader does not have the C-old,new log entry, it continues to use C-old. A follower that holds a C-old,new log entry has that entry truncated by the new leader and reverts to C-old, so the membership change fails. However, if the new leader has the C-old,new log entry, it continues the unfinished change.

Joint consensus is more general and easy to understand, but its implementation is complex. Because the change is split into two steps, it makes no assumptions about the relationship between C-old and C-new. The two steps exist to prevent C-old and C-new from forming disjoint majorities and electing two leaders.

If membership changes are restricted further so that every majority of C-old intersects every majority of C-new, the change can be simplified to one step, because C-old and C-new can then never form disjoint majorities.

One-Step Cluster Membership Change

The key to a one-step membership change is to restrict C-old and C-new so that any majority of C-old and any majority of C-new intersect. The way to do this is to add or remove only one server per change.

Figure 3: Add or remove one server

As shown in Figure 3, it can be proven mathematically that, as long as only one server is added or removed at a time, C-old and C-new cannot form two disjoint majorities. Therefore, the cluster can switch directly from C-old to C-new without any transitional configuration, which enables one-step membership changes.
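A short counting argument shows why a single-server change is safe. The symbols n, s, Q_old, and Q_new are introduced here for illustration: n is the size of C-old, s is the added server, and Q_old and Q_new are arbitrary majorities of C-old and C-new.

```latex
\begin{aligned}
&|C_{old}| = n, \qquad C_{new} = C_{old} \cup \{s\}, \qquad |C_{new}| = n + 1 \\
&|Q_{old}| \ge \lfloor n/2 \rfloor + 1, \qquad |Q_{new}| \ge \lfloor (n+1)/2 \rfloor + 1 \\
&|Q_{old}| + |Q_{new}| \ge \lfloor n/2 \rfloor + \lfloor (n+1)/2 \rfloor + 2 = n + 2 > n + 1 = |C_{new}|
\end{aligned}
```

Both majorities are subsets of C-new, which has only n + 1 servers, so they must share at least one server. Removing a server is the same argument with the roles of C-old and C-new swapped.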

Note: Only one server can be changed at a time. In the case where multiple servers need to be changed, it can be achieved by performing multiple one-step membership changes.

Although the theory is simple, many problems exist in practice.

Problems in One-Step Cluster Membership Change

The best known problem of Raft's one-step membership change is the correctness problem. In addition, the one-step cluster membership change has a potential availability problem.

The Correctness Problem

The correctness problem emerges when a leader change occurs during a one-step change, which can cause committed log entries to be overwritten. Diego Ongaro, the inventor of Raft, discovered this problem in 2015 and described it in a post to the raft-dev mailing list.

The following example shows how a one-step change can fail. The initial configuration has four servers: a, b, c, and d. Server u and server v are added to the cluster, each by a different leader, and the leader changes in between cause committed log entries to be lost.

Figure 4: The correctness problem of the one-step cluster membership change

t₀: C₀ is the initial configuration with servers a, b, c, and d
t₁: Server a is elected leader in term 0 with the votes of servers b and c
t₂: Server a replicates the changelog entry C_u only to servers a and u, so the entry is not committed
t₃: Server a crashes
t₄: Server d is elected leader in term 1 with the votes of servers b and c
t₅: Server d replicates the changelog entry C_v to servers c, d, and v, and the entry is committed
t₆: Server d replicates the normal log entry E to servers c, d, and v, and the entry is committed
t₇: Server d crashes
t₈: Server a is elected leader in term 2 with the votes of servers u and b
t₉: Server a replicates its local log entry C_u to all servers, which overwrites the committed entries C_v and E
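The timeline relies on two majorities that never meet. The sketch below only restates the steps as sets, assuming C_u is C₀ plus server u and C_v is C₀ plus server v; the helper `is_majority` is a made-up name.

```python
def is_majority(servers: set, config: set) -> bool:
    """True if servers include a majority of config."""
    return len(servers & config) >= len(config) // 2 + 1


c0 = {"a", "b", "c", "d"}
c_u = c0 | {"u"}  # configuration proposed by server a at t2
c_v = c0 | {"v"}  # configuration committed by server d at t5

acks_for_cv_and_e = {"c", "d", "v"}  # t5 and t6
votes_for_a = {"a", "u", "b"}        # t8

print(is_majority(acks_for_cv_and_e, c_v))        # True
print(is_majority(votes_for_a, c_u))              # True
print(votes_for_a.isdisjoint(acks_for_cv_and_e))  # True
```

Each of C_u and C_v differs from C₀ by one server, but they differ from each other by two, so the one-server guarantee does not hold between them.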

Why does the problem occur? The root cause is that the previous leader failed before its changelog entry was replicated to a majority. The newly elected leader does not know about that entry, so it starts a different membership change from the same base configuration and commits log entries under its own new configuration. The two new configurations, C_u and C_v, differ from each other by two servers, so their majorities can be disjoint. When the previous leader is elected again by a majority of its own configuration, it overwrites the committed log entries, and data is lost.

After finding this problem, Ongaro proposed a simple fix that resembles Raft's log commitment rule: a new leader must commit a log entry in its current term before it is allowed to replicate a membership changelog entry.

The simplest implementation is for a newly elected leader to commit a no-op log entry first and only then replicate the membership changelog entry. Committing the no-op entry guarantees that the servers holding it intersect, in at least one server, with any majority that could elect the previous leader on the basis of its uncommitted membership changelog entry. Those servers see that the previous leader's log is out of date and refuse to vote for it, which prevents the previous leader from being re-elected and prevents disjoint majorities.

In the example above, after the new leader L₁ (server d) is elected, it must commit a no-op log entry before it replicates C_v and E. The log of L₂ (server a) is then out of date compared with the servers that hold the no-op entry, so server a cannot be elected leader again.
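The effect of the no-op entry on the election at t₈ can be sketched with Raft's log comparison rule, in which a voter grants its vote only if the candidate's last log term and index are at least as large as its own. The log positions below are hypothetical values chosen to match the timeline, and `LogTail`, `grants_vote`, and `votes_for` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogTail:
    last_term: int
    last_index: int


def grants_vote(voter: LogTail, candidate: LogTail) -> bool:
    """Compare last terms first, then last indexes."""
    return (candidate.last_term, candidate.last_index) >= (
        voter.last_term,
        voter.last_index,
    )


def votes_for(candidate: str, logs: dict) -> int:
    """Count the live servers (including the candidate) that would vote."""
    return sum(grants_vote(tail, logs[candidate]) for tail in logs.values())


# Live servers of C_u at t8 (server d has crashed). Majority of 5 is 3.
without_noop = {
    "a": LogTail(0, 1),  # holds C_u from term 0
    "u": LogTail(0, 1),  # holds C_u from term 0
    "b": LogTail(0, 0),  # received nothing in term 1
    "c": LogTail(1, 2),  # holds C_v and E from term 1
}
with_noop = {
    "a": LogTail(0, 1),
    "u": LogTail(0, 1),
    "b": LogTail(1, 1),  # holds the term-1 no-op committed by b, c, d
    "c": LogTail(1, 3),  # holds the no-op, C_v, and E
}

print(votes_for("a", without_noop))  # 3: server a wins
print(votes_for("a", with_noop))     # 2: server a cannot win
```

With the no-op committed under C₀, server b already holds a term-1 entry, so it rejects server a and the vote count stays below the majority of three.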

Another solution is joint consensus, which does not have this correctness problem.

The Availability Problem

The one-step cluster membership change can only add or remove one server at a time, so replacing a server requires two changes: the first adds the new server, and the second removes the old one. If a network partition occurs between the two changes, the service may become unavailable.

abc -> abcd -> bcd

Suppose servers a, b, and c are deployed in three data centers, and server a has broken down and must be replaced with server d in the same data center. With one-step membership changes, abc must first be changed to abcd and then to bcd.

abc -> a | bc -> cannot change to ad | bc

In the intermediate state abcd, a two-way network partition (ad | bc) makes the entire cluster unavailable, because a four-server cluster needs three servers for a majority and neither side has them. Because server a and server d are located in the same data center, this partition is a realistic case that cannot be ignored.
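The availability loss in the intermediate state can be verified with a small quorum check. This is a sketch with a hypothetical helper, `sides_with_quorum`, that reports which sides of a partition can still form a majority of a given configuration.

```python
def majority(n: int) -> int:
    return n // 2 + 1


def sides_with_quorum(config: set, partition: list) -> list:
    """Return the partition sides that hold a majority of config."""
    return [side for side in partition if len(side & config) >= majority(len(config))]


partition = [{"a", "d"}, {"b", "c"}]

# Intermediate state abcd: neither side has 3 of 4 servers.
print(sides_with_quorum({"a", "b", "c", "d"}, partition))
# Before the change (abc) and after it (bcd), side bc keeps a majority.
print(sides_with_quorum({"a", "b", "c"}, partition))
print(sides_with_quorum({"b", "c", "d"}, partition))
```

Only the four-server configuration is left without any side that can make progress.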

abc -> a | bc -> bc -> bc | d

One way to solve this problem is to remove the old server first and then add the new server. For example, abc first becomes bc and then bcd, which avoids the state abcd.

abc -> a | bc -> a | bc U bc | d -> bc | d

Another way is joint consensus: abc becomes abc U bcd and then bcd, which also avoids the state abcd.
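Under the joint configuration abc U bcd, a side of the partition can make progress if it holds a majority of abc and a majority of bcd. The sketch below checks this for the same ad | bc partition; the function names are illustrative only.

```python
def has_majority(servers: set, config: set) -> bool:
    return len(servers & config) >= len(config) // 2 + 1


def sides_with_joint_quorum(c_old: set, c_new: set, partition: list) -> list:
    """Return the sides that hold a majority of both C-old and C-new."""
    return [
        side
        for side in partition
        if has_majority(side, c_old) and has_majority(side, c_new)
    ]


c_old = {"a", "b", "c"}
c_new = {"b", "c", "d"}
partition = [{"a", "d"}, {"b", "c"}]

print(sides_with_joint_quorum(c_old, c_new, partition))
```

Side bc holds two of three servers in both configurations, so the cluster stays available throughout the replacement even when a and d are cut off together.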

Raft Engineering Practices and the Cluster Membership Change

Although the Raft membership change theory is simple, many aspects need to be considered during engineering implementation. We recommend applying the joint consensus because of the correctness and availability problems of the one-step membership change. This article mainly discusses some issues that must be considered in the implementation of the joint consensus.

Two Different Choices for Engineering Implementation

Raft applies log entries in strict order, but a new server starts with no data, so it must catch up on the data after joining the cluster before it can work properly. There are two choices for engineering implementation. One is to let the new server join first and then synchronize the data. The other is the opposite: synchronize the data to the new server first and let it join after synchronization is completed. Each approach has advantages and disadvantages.

Table 1: The advantages and disadvantages of two different choices for engineering implementation

	Advantages	Disadvantages
New servers join before data synchronization	It is simple and fast, and servers that do not exist yet can join.	It may reduce service availability.
New servers synchronize data before joining	Service availability is not affected.	It is complex and slow, and servers that do not exist yet cannot join.

With the first approach, new servers join the cluster first and then synchronize the data, so the membership change completes immediately. Because the change only needs the agreement of a majority, it is even possible to add servers that do not exist yet and synchronize the data later. However, new servers cannot serve requests before data synchronization is completed, and adding them may increase the size of the majority while they are still effectively unavailable. If a failure occurs at this point, the cluster may no longer have a majority of working members, which leaves the service unavailable. In short, joining first and synchronizing later simplifies the membership change process but may reduce service availability.

With the second approach, the membership change is performed asynchronously in the background. The new server is first added as a learner, which only synchronizes data: it has no voting rights and does not increase the size of the majority. When data synchronization is completed, the new server joins as a regular member and can start working immediately, so service availability is not affected. However, the membership change process is more complex, and because the new server must synchronize data first, a server that does not exist yet cannot be added.
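The learner approach can be sketched as a small bookkeeping structure. Everything here is an assumption made for illustration: the `Membership` class, the `max_lag` threshold, and the idea of promoting directly in memory. In a real system the promotion would itself be a membership change that goes through the log.

```python
from dataclasses import dataclass, field


@dataclass
class Membership:
    voters: set
    learners: set = field(default_factory=set)
    match_index: dict = field(default_factory=dict)

    def quorum(self) -> int:
        """Only voters count toward the majority."""
        return len(self.voters) // 2 + 1

    def add_learner(self, server: str) -> None:
        """A learner receives log entries but has no vote."""
        self.learners.add(server)
        self.match_index[server] = 0

    def promote_if_caught_up(
        self, server: str, leader_last_index: int, max_lag: int = 0
    ) -> bool:
        """Turn a learner into a voter once its log is close enough."""
        lag = leader_last_index - self.match_index.get(server, 0)
        if server in self.learners and lag <= max_lag:
            self.learners.remove(server)
            self.voters.add(server)
            return True
        return False


m = Membership(voters={"a", "b", "c"})
m.add_learner("d")
print(m.quorum())                          # 2: the learner does not count
print(m.promote_if_caught_up("d", 100))    # False: d has no data yet
m.match_index["d"] = 100                   # d finishes synchronizing
print(m.promote_if_caught_up("d", 100))    # True
print(m.quorum())                          # 3
```

The majority grows only at the moment the new server is able to contribute to it, which is why this approach does not reduce availability.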

Configuration of Membership Changelog Entries

Membership changelog entries carry the change of membership configuration. Therefore, the choice of which configuration each changelog entry is committed under is critical.

Table 2: Configuration for membership changelog entries of the joint consensus

Membership Changelog Entry	Configuration
C-old,new log entry	C-old,new
C-new log entry	C-new

For a joint consensus membership change, the configuration used by each membership changelog entry is fixed. The C-old,new log entry uses the joint configuration (C-old,new), which requires acknowledgment from majorities of both C-old and C-new before the entry is committed. The C-new log entry uses C-new and needs only a majority of C-new to be committed. However, the C-new log entry is also replicated to C-old so that servers in C-old that are not in C-new can exit automatically.

Effective Time of Membership Changelog Entries

Membership changes are carried out through membership changelog entries, which let servers agree on configurations. However, unlike normal log entries, membership changelog entries do not wait until they are committed to take effect.

Table 3: Effective time of membership changelog entries

	Leader	Follower
Membership changes of the joint consensus	Before replicating the membership changelog entry	After the membership changelog entry is persisted

For a joint consensus membership change, the time at which a membership changelog entry takes effect is fixed. On the leader, the entry takes effect before it is replicated. On a follower, the entry takes effect once it has been persisted. For this reason, a membership changelog entry may be rolled back after a leader change.
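One way to picture this rule is that a server always uses the latest configuration entry in its log, committed or not. The sketch below assumes a log made of plain dictionaries with hypothetical `kind` and `members` fields; it is not the data model of any particular implementation.

```python
def effective_config(log: list, initial_config):
    """Return the latest configuration found in the log, committed or not."""
    for entry in reversed(log):
        if entry["kind"] == "config":
            return entry["members"]
    return initial_config


c_old = frozenset({"a", "b", "c"})
c_joint = ("joint", c_old, frozenset({"b", "c", "d"}))

# A follower persists the C-old,new entry: it takes effect at once.
follower_log = [
    {"kind": "data", "term": 1},
    {"kind": "config", "term": 1, "members": c_joint},
]
print(effective_config(follower_log, c_old) == c_joint)  # True

# A new leader without that entry truncates it: the follower rolls back.
del follower_log[1:]
print(effective_config(follower_log, c_old) == c_old)    # True
```

Deriving the configuration from the log in this way makes the rollback automatic, because truncating the entry also removes the configuration it introduced.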

Whether Logs Need to Be Committed in Strict Order during Membership Changes

Consider a change that removes servers and therefore shrinks the majority. A smaller majority is easier to reach, so log entries appended after the membership change may reach a majority before earlier log entries do.

According to the rule for advancing commitIndex in the Raft paper:

If there exists an N such that N > commitIndex, a majority of matchIndex[i] ≥ N, and log[N].term == currentTerm:
set commitIndex = N

When a log entry reaches a majority, the leader advances the commit index to that entry. Earlier log entries that have not yet reached a majority of C-old are committed together with it.

Is this a problem? No. Because the log entries have been committed under C-new after the membership change, servers that are not in C-new cannot be elected leader and cannot overwrite the earlier log entries. Therefore, log entries can be committed safely even if they have not reached a majority of C-old.
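The commitIndex rule quoted above can be sketched as a function evaluated against the configuration currently in effect. The data layout is an assumption for illustration: `log_terms[n]` is the term of entry n with index 0 unused, and `match_index` includes the leader's own last index.

```python
def advance_commit_index(
    commit_index: int,
    match_index: dict,
    voters: set,
    log_terms: list,
    current_term: int,
) -> int:
    """Return the largest N > commit_index that a majority of voters hold
    and whose entry was created in current_term, else commit_index."""
    quorum = len(voters) // 2 + 1
    for n in range(len(log_terms) - 1, commit_index, -1):
        replicated = sum(1 for s in voters if match_index.get(s, 0) >= n)
        if replicated >= quorum and log_terms[n] == current_term:
            return n
    return commit_index


log_terms = [0, 1, 1, 1]  # entries 1..3, all from term 1
match_index = {"a": 3, "b": 3, "c": 0, "d": 0, "e": 0}

c_old = {"a", "b", "c", "d", "e"}
c_new = {"a", "b", "c"}

# Under C-old, two of five servers are not a majority: nothing commits.
print(advance_commit_index(0, match_index, c_old, log_terms, 1))  # 0
# Under C-new, two of three servers are a majority: entries 1..3 commit.
print(advance_commit_index(0, match_index, c_new, log_terms, 1))  # 3
```

In the second call, entries 1 and 2 are committed together with entry 3 although they never reached three of the five servers in C-old.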

The hashicorp/raft implementation commits log entries in strict order: each log entry is committed only after it has reached a majority.

Restoring the Service When Only a Minority of Servers Survive

Raft works normally only when a majority of its servers are alive. In practice, a situation may arise where only a minority of servers survive, and restoring the service then becomes a problem.

With only a minority alive, no majority can be reached, so data cannot be written and normal membership changes cannot be made. An interface is therefore needed that forcibly changes the server configuration by setting the member list on each surviving server, which makes recovery from a majority failure possible.

For example, if only server S1 survives, forcibly set its member list to {S1} so that the cluster consists of S1 alone. S1 can then continue to provide read and write services, and other servers can be added afterwards through membership changes. Forcibly modifying the member list in this way provides a maximum availability mode.
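A forced configuration interface might look like the following sketch. The `Node` class and its `force_set_members` method are hypothetical names; the point is only that the member list is overwritten locally instead of being agreed on through the log.

```python
class Node:
    def __init__(self, name: str, members: set):
        self.name = name
        self.members = set(members)

    def has_quorum(self, alive: set) -> bool:
        """True if the live servers include a majority of the member list."""
        return len(self.members & set(alive)) >= len(self.members) // 2 + 1

    def force_set_members(self, members: set) -> None:
        """Overwrite the member list without going through consensus."""
        if self.name not in members:
            raise ValueError("the surviving server must stay in the member list")
        self.members = set(members)


s1 = Node("S1", {"S1", "S2", "S3"})
alive = {"S1"}

print(s1.has_quorum(alive))   # False: one of three servers
s1.force_set_members({"S1"})
print(s1.has_quorum(alive))   # True: one of one server
```

Because this path bypasses the normal commit rules, the sketch assumes it is exposed only as an operator action for the case where a majority is already lost.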

Summary

Raft provides two methods for membership changes, joint consensus and one-step cluster membership change, which makes membership changes practical in real projects. This article summarizes some problems of Raft's one-step cluster membership change and engineering practices for membership changes. Joint consensus is general and hard to get wrong, whereas the one-step membership change has many problems. We recommend using joint consensus for membership changes wherever possible.

Appendix: Raft Engineering Practices and the One-Step Cluster Membership Change

Although one-step membership change is not recommended for projects, here are some engineering practices for one-step membership change.

Configuration of One-Step Membership Changelog Entries

For a one-step cluster membership change, the question is whether the membership changelog entry should be committed under C-new or C-old. Either choice works: because any majority of C-new and any majority of C-old share at least one server, committing the entry under C-new or under C-old cannot produce disjoint majorities. Each choice has its own advantages and disadvantages.

Table 4: Advantages and disadvantages of C-old and C-new for one-step member changelog entries

	Advantages	Disadvantages
C-old	The correctness problem of one-step membership changes is avoided, and a smaller majority may be required when adding servers.	A larger majority may be required when removing servers.
C-new	A smaller majority may be required when removing servers.	The correctness problem of one-step membership changes must be solved, and a larger majority may be required when adding servers.

Using C-old avoids the correctness problem of the one-step membership change, so the no-op log entry after a leader is elected can be omitted. In addition, a smaller majority may be required when adding servers, but a larger majority may be required when removing servers.

Using C-new requires the leader to commit a no-op log entry to avoid the correctness problem. A smaller majority may be required when removing servers, but a larger majority may be required when adding servers.
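The majority sizes behind this trade-off are easy to compute. The helper below is a made-up illustration in which `delta` is +1 when a server is added and -1 when one is removed.

```python
def majority(n: int) -> int:
    return n // 2 + 1


def quorum_for_change(old_size: int, delta: int, commit_under: str) -> int:
    """Majority needed to commit a one-step changelog entry.

    commit_under is "C-old" or "C-new"; delta is +1 (add) or -1 (remove).
    """
    if commit_under not in ("C-old", "C-new"):
        raise ValueError("commit_under must be 'C-old' or 'C-new'")
    size = old_size if commit_under == "C-old" else old_size + delta
    return majority(size)


# Adding a server to a three-server cluster.
print(quorum_for_change(3, +1, "C-old"), quorum_for_change(3, +1, "C-new"))  # 2 3
# Removing a server from a four-server cluster.
print(quorum_for_change(4, -1, "C-old"), quorum_for_change(4, -1, "C-new"))  # 3 2
# Adding a server to a four-server cluster: both choices need three.
print(quorum_for_change(4, +1, "C-old"), quorum_for_change(4, +1, "C-new"))  # 3 3
```

The last case shows why the text says a smaller or larger majority "may" be required: the difference only appears when the change crosses from an odd cluster size to an even one.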

Whether the one-step membership changelog entry is committed under C-new or C-old, it is best to replicate it to all servers in both configurations. New servers then learn promptly that they have been added, and removed servers also receive the entry and can exit automatically.

For the configuration of one-step membership changelog entries, Raft chooses C-new, while etcd chooses C-old.

Effective Time of One-Step Membership Changelog Entries

Table 5: Effective time of one-step membership changelog entries

	Leader	Follower
C-new	Before replicating the membership changelog entry	After the membership changelog entry is persisted
C-old	In theory, it only needs to take effect before the next membership change starts. In practice, it generally takes effect after the membership changelog entry is committed.

For a one-step membership change, if the membership changelog entry uses C-new, the timing is the same as for joint consensus: on the leader, the entry takes effect before it is replicated, and on a follower, it takes effect once it has been persisted. If the membership changelog entry uses C-old, in theory it only needs to take effect before the next membership change starts. In practice, however, it generally takes effect after the membership changelog entry is committed so that new servers can start serving as soon as possible.

For the effective time of one-step membership changelog entries, Raft chooses C-new, which takes effect after the membership changelog entry is persisted, while etcd chooses C-old, which takes effect after the entry is committed.

Xiangguang
