Data Consistency Problems - Part 9 of About Distributed Systems

The combination of two characteristics of replication, the master-slave arrangement and the timeliness of synchronization, creates data consistency risks.

Disclaimer: This is a translated work of Qinxia's 漫谈分布式系统, all rights reserved to the original author.

In the last blog, we finally arrived at the second core problem of distributed systems: availability. We also mentioned that replication is the only way to achieve high availability.

As mentioned at the end of the last article, besides providing high availability, replication can also have serious side effects.
For example, when data is replicated asynchronously, network jitter may delay replication to a slave replica. If a read request then hits that slave, it cannot see the latest data.
For another example, in a multi-master scenario, two masters may receive requests to modify the same data at the same time. Both writes return success to their clients, but when each write is then replicated to the other master, the update may fail because the data conflicts.

Similar problems can be collectively referred to as data consistency problems.

In the last article, we mainly focused on two features of replication: master-slave and timeliness. Combined, the possible data consistency risks are as follows:

| Timeliness | Masters | Consistency Risk |
| --- | --- | --- |
| sync | single master | none |
| sync | multi-master | yes |
| async | single master | yes |
| async | multi-master | yes |

Obviously, both multi-master and asynchronous replication can bring data consistency risks. (Leaderless replication, with no master at all, can be regarded as every node being a master, a bit of a case of extremes meeting.)
Asynchrony brings replication lag, resulting in untimely data synchronization.
Multiple masters bring concurrent writes, resulting in data conflicts.
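To make replication lag concrete, here is a minimal toy sketch (the class and method names are invented for illustration) of a single leader with one asynchronously updated follower. A read routed to the follower before replication catches up returns stale data:

```python
class AsyncReplicaStore:
    """Toy single-leader store with one asynchronously updated follower."""

    def __init__(self):
        self.leader = {}
        self.follower = {}
        self.pending = []  # writes not yet applied to the follower

    def write(self, key, value):
        self.leader[key] = value           # applied on the leader immediately
        self.pending.append((key, value))  # replicated to the follower later

    def read_from_follower(self, key):
        return self.follower.get(key)      # may lag behind the leader

    def replicate(self):
        # Drain the queue, e.g. once network jitter subsides.
        for key, value in self.pending:
            self.follower[key] = value
        self.pending.clear()

store = AsyncReplicaStore()
store.write("my_tickets", 1)                    # user buys a ticket
stale = store.read_from_follower("my_tickets")  # a refresh hits the lagging follower
store.replicate()                               # replication finally catches up
fresh = store.read_from_follower("my_tickets")
print(stale, fresh)  # None 1
```

The same pattern underlies the ticket and comment examples below: the write succeeded, but the replica serving the next read had not yet heard about it.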

Once data consistency is not guaranteed, the system behaves as if it has a split personality, and outside the system this causes many practical problems. For example:

  1. A user just paid for a concert ticket and then refreshed the page. Because the replica serving the request had not been updated in time, the account showed no tickets.
  2. A user receives a push notification that their article has a new comment, but after clicking through, the accessed replica has not been updated, so no new comment appears.
  3. One user asks a question and another answers it, but on a certain replica the answer may be replicated before the question, so a third user sees the strange phenomenon of the answer appearing before the question.

These practical problems make the system untrustworthy at the application layer. Clearly, the cost of losing trust is severe.

Therefore, solving the consistency problem has become a major issue for distributed systems.

There are two main types of solutions:
The prevention type, which tries to avoid consistency problems in the first place and provides the strongest consistency guarantees.
The pollute-first-then-clean-up type, which allows inconsistencies to occur and provides weaker consistency guarantees.

From the point of view of convergence, the first type of method enforces real-time convergence of inconsistent data, while the second type allows inconsistent data to diverge first and then gradually converge.

From the perspective of message order, the strong consistency of the prevention type ensures that, on any node, data generated earlier is never overtaken by data generated later due to replication lag or similar problems. That is, the system maintains global linearizability of messages. The second type of method is non-linearizable. (Message order is very important and will be covered in a later article.)

Prevention-Type Consistency Solutions

As the saying goes, prevention is better than cure; avoiding problems at the source is naturally the most ideal goal.

This is especially true for data consistency problems, which are serious and hard to solve: avoid them if at all possible.

So let's first look at the first category, prevention-type consistency.

Single-Master Synchronous Replication

The most straightforward way is the single-master synchronous mode mentioned earlier: single leader + synchronous replication.
A single leader ensures that all data is processed by only one node, avoiding write conflicts.
Synchronous replication ensures that all replicas have updated the data before the write returns to the client, avoiding data loss caused by a single machine failure.

In this way, the strong consistency we want is achieved, and the entire distributed system looks like a single-machine system, as if the copies did not exist. Accessing the system from anywhere at any time gives a consistent experience. That is why some people call this consistency single-copy consistency.
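As a rough illustration (a sketch with invented names, not any particular system's implementation), the single-leader synchronous write path looks like this: the leader acknowledges a write only after every follower has persisted it, and if any follower is unreachable the write cannot complete at all:

```python
class Follower:
    def __init__(self):
        self.data = {}
        self.alive = True

    def apply(self, key, value):
        if not self.alive:
            raise ConnectionError("follower unreachable")
        self.data[key] = value  # persist before acknowledging
        return "ACK"

class SyncLeader:
    """Acknowledges a write only after all followers have persisted it."""

    def __init__(self, followers):
        self.data = {}
        self.followers = followers

    def write(self, key, value):
        self.data[key] = value      # persist locally first
        for f in self.followers:
            f.apply(key, value)     # block until each follower ACKs
        return "OK"                 # only now reply to the client

followers = [Follower(), Follower()]
leader = SyncLeader(followers)
leader.write("balance", 100)
print([f.data["balance"] for f in followers])  # [100, 100] -- every copy agrees

followers[0].alive = False          # one follower goes down
try:
    leader.write("balance", 99)
except ConnectionError:
    print("write blocked")          # consistency kept, availability lost
```

Note the second half of the example: when any replica is unreachable, the write blocks, which foreshadows the availability trade-off discussed later with CAP.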

But looking deeper, there seem to still be some corner cases. Consider the following example A:

  1. The master receives the client request, persists the data, and forwards it to the slave
  2. The slave receives the forwarded request, persists the data, and returns an ACK to the master
  3. The master receives the ACK from the slave but crashes before replying to the client

In this case, the client believes the system has not successfully processed the request, but in fact both the master and the slave have persisted the data: the client's and the server's views are inconsistent.

Now take example B:

  1. The master is misjudged as offline due to network jitter
  2. The system fails over, and a slave becomes the new master
  3. The network recovers, and the original master comes back to normal

At this point there are two masters, the so-called split-brain phenomenon, and even the premise of a single master is destroyed.

The system has unexpectedly become multi-leader. Even a carefully designed multi-leader system can hardly guarantee strong consistency, let alone one that falls into this state by accident.

Finally, look at example C:

  1. With three replicas, the master synchronizes a write, say debiting an account by 1 yuan, to the other two replicas
  2. One replica receives the data, persists it locally, and sends an ACK back to the master
  3. The other replica also persists the result, but its ACK is lost due to network jitter
  4. The master never receives the ACK from the second replica, judges the write to have failed, and resends it

Now the data between replicas is inconsistent: the account on the first replica is debited 1 yuan, while on the second it is debited 2 yuan.
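Example C can be simulated in a few lines (a toy sketch; the `Replica` class and its ACK handling are invented for illustration). The replica whose ACK was lost has the debit replayed and ends up 1 yuan short:

```python
class Replica:
    def __init__(self, drop_first_ack=False):
        self.balance = 10
        self._drop_next_ack = drop_first_ack

    def debit(self, amount):
        self.balance -= amount     # the state change always happens...
        if self._drop_next_ack:
            self._drop_next_ack = False
            return None            # ...but this ACK is lost to network jitter
        return "ACK"

def replicate_debit(replicas, amount, max_retries=1):
    """Leader-side loop: resend to any replica whose ACK did not arrive."""
    pending = list(replicas)
    for _ in range(1 + max_retries):
        pending = [r for r in pending if r.debit(amount) != "ACK"]
        if not pending:
            break

first = Replica()
second = Replica(drop_first_ack=True)
replicate_debit([first, second], 1)
print(first.balance, second.balance)  # 9 8 -- the second replica was debited twice
```

The root cause is that the debit is not idempotent, so a blind resend mutates state that was already persisted.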

Therefore, single-master synchronous replication does not provide absolute strong consistency; it only provides a best-effort guarantee under normal circumstances.

(The corner cases above are also related to the so-called exactly-once problem, which will be discussed in follow-up articles in this series and not expanded on here.)


The corner cases above, such as the split-brain problem, look like very special cases, but there may be a very general fact behind them.

Just imagine: what could cause a node to be misjudged as dead, leading to split brain?
Network jitter.
GC pauses that stall the program.

What these causes have in common is that communication between nodes becomes unreachable, or at least appears unreachable for a short time.

To use a more professional term, this is the emergence of a network partition: the cluster is divided into several areas with no network connectivity between them.

This leads to the famous CAP theorem.

Consistency, Availability, and Partition Tolerance: at most two can be satisfied at the same time.

We have already said a lot about C and A; it was precisely because we wanted A that the replica mechanism was introduced, which triggered the crisis in C. Now there is yet another thing, the network partition, to deal with.

And the CAP theorem actually tells us: don't bother trying to handle all three, because you can't.

Is it really that hopeless? I don't believe it!

Then let's reason it through.

Suppose we require C and A. If a network partition occurs, single-master synchronous replication cannot complete data replication, so P is lost.
Suppose we require C and P. If a network partition occurs, then to keep data consistent only one partition can continue to work while the others must be suspended; those partitions become completely unavailable, so A is lost.
Suppose we require A and P. If a network partition occurs and every partition keeps working, data cannot be synchronized between partitions during writes. After communication is restored, there may be irresolvable data conflicts, so C is lost.

So it is indeed true that all three cannot be satisfied at once.

In addition, in the derivation above, every case was analyzed with "if a network partition occurs" as the initial condition, which hints at P's special status.

In fact, C, A, and P are not at the same level. C and A are goals; P, although partition tolerance is phrased as a goal too, really reflects the fact that partitions are an unavoidable precondition. Countless production accidents have taught us that network partitions can happen anytime, anywhere.

Therefore, a system that abandons P does not have true high availability.

Applying CAP to the design of production-grade distributed systems therefore comes down to a choice between C and A under the premise of P.
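As a simple sketch of choosing C over A under a partition (the majority rule used by quorum-based systems; the function name here is invented):

```python
def can_accept_writes(reachable, cluster_size):
    """A CP-leaning system serves writes only in the partition that can
    reach a strict majority of nodes. At most one partition can hold a
    majority, so at most one partition makes progress: no split brain."""
    return reachable > cluster_size // 2

# A 5-node cluster partitioned into groups of 3 and 2:
print(can_accept_writes(3, 5))  # True  -- the majority side keeps working
print(can_accept_writes(2, 5))  # False -- the minority side gives up availability
```

An AP-leaning system would instead let both sides accept writes and reconcile conflicts after the partition heals, which is exactly the pollute-first-then-clean-up approach.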


To recap:
We introduced the replica mechanism for high availability, but its side effect is data consistency problems:
Replication lag may cause data synchronization to be untimely.
Multi-master concurrent writes may lead to data conflicts.
Data consistency problems cause many practical problems at the application level and make the system untrustworthy to the outside world, so they must be solved.
Solutions to the data consistency problem fall into two categories: prevention, and pollute-first-then-clean-up.
The most basic prevention approach is single-master synchronous replication, but it can only offer a best-effort guarantee and cannot handle certain corner cases.
Behind these corner cases lies a more fundamental constraint, the so-called CAP theorem.


Returning to example C above from another angle: copying data from the master to multiple slaves can be regarded as several independent events, each writing data to a different node. It is the partial success of these events that leads to data inconsistency.

If all of them fail, no harm done, just retry. But if some succeed and some fail, a retry may apply an operation a second time.

To avoid partial success of multiple events, that is, to maintain their atomicity (either all succeed or all fail), there is already a mature solution: the transaction. (You see how important it is to grasp the essence of a problem.)

However, what we need is a distributed transaction.

In the next blog, we will learn about distributed transactions together.

This is a carefully conceived series of 20-30 articles. I hope to give everyone a basic but solid grasp of distributed systems in a story-telling way. Stay tuned for the next one!
