Learning about Distributed Systems - Part 14: Causes of Inconsistency

Inconsistency is such a prominent problem that we have tried every means to solve it. It arises because we want high availability on top of scalability.

Disclaimer: This is a translated work of Qinxia's 漫谈分布式系统, all rights reserved to the original author.

Starting with the 8th article in this series, we introduced high availability and data replication, and then dug into the problem of data consistency. This chapter already spans 6 articles.

This material involves many concepts and theories, and it is one of the core challenges of distributed systems, so it is worth summarizing here first.

First, the only way to achieve high availability is data replication. Data replication also brings other benefits, such as better performance.

For choosing leaders and followers in data replication, we introduced 3 methods:

  • Single leader replication. One leader accepts writes, and multiple followers serve as backups.
  • Multi-leader replication. Multiple leaders can accept writes.
  • Leaderless replication. Without a leader, it is equivalent to everyone being a leader.

For the timeliness of data replication, we also introduced 3 methods:

  • Synchronous replication. Slow, but safe.
  • Asynchronous replication. Fast, but with replication lag.
  • Semi-synchronous replication. A compromise between the two that balances safety and speed.

We then found that concurrent writes to multiple leaders may cause data conflicts, and that the replication lag introduced by asynchronous replication may make the latest data unreadable. In other words, both lead to data consistency problems.

Data consistency problems make the system untrustworthy and cause many practical problems that must be solved.

Methods for addressing data consistency fall into two broad categories:

  • Prevention. Prevent divergence as much as possible and provide strong consistency, so that the system looks like a single-copy system to the outside world, even at the expense of availability and performance (performance is itself a facet of availability).
  • Pollute first, clean up later. Allow divergence and then converge, providing weak (eventual) consistency. The system appears to the outside world as a multi-node distributed system, sacrificing some consistency to preserve availability.

The prevention methods are subdivided into three categories:

  • Single-leader synchronous replication. It looks like a strong guarantee, but it cannot handle many corner cases and is only a best-effort guarantee. Behind these corner cases lies the more general CAP theorem.
  • Multi-phase commit. Data replication can be regarded as a kind of transaction, so it can be handled with distributed transactions. Typical protocols are 2PC and 3PC, but neither can cope with network partitions.
  • Consensus. To cope with network partitions, a family of algorithms rooted in Paxos emerged, including Raft, ZAB, and others. These algorithms aim to solve the so-called consensus problem, and data consistency can be regarded as consensus seen from another angle.

The pollute-first, clean-up-later methods fall into two categories:

  • Eventual consistency, typified by Dynamo, which provides probabilistic guarantees. For events with a causal relationship, causal consistency can be provided by means such as vector clocks. For other concurrently written events the order cannot be determined, but in many cases that order does not matter.
  • Strong eventual consistency, typified by CRDTs. For certain special data types and the operations they expose, the order in which events are written does not affect the final result; the union operation on sets is one example. Such data structures can accept concurrent writes freely, without negotiation between nodes, and still converge automatically to a consistent state. Of course, these data structures are relatively limited and cannot meet the needs of every scenario.
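To make the set-union example concrete, here is a minimal sketch of a grow-only set (G-Set), one of the simplest CRDTs. The class and method names are illustrative assumptions and are not taken from any particular library.

```python
class GSet:
    """Grow-only set: elements can be added but never removed."""

    def __init__(self, items=()):
        self.items = frozenset(items)

    def add(self, item):
        # A local write needs no coordination with other replicas.
        return GSet(self.items | {item})

    def merge(self, other):
        # Union is commutative, associative, and idempotent, so replicas
        # converge regardless of the order in which they exchange state.
        return GSet(self.items | other.items)


# Two replicas accept concurrent writes independently.
replica_a = GSet().add("x").add("y")
replica_b = GSet().add("y").add("z")

# Merging in either direction yields the same state.
merged_ab = replica_a.merge(replica_b)
merged_ba = replica_b.merge(replica_a)
print(sorted(merged_ab.items), sorted(merged_ba.items))
```

Because the merge is a plain union, neither replica has to know in which order the other applied its writes. The price is the limitation noted above: this structure cannot express removal.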

Distributed transactions can be used to solve data consistency problems, but they are also a relatively independent and very important application in their own right. Since data consistency comes in strong and weak forms, distributed transactions can likewise be summarized as two implementations:

  • ACID on 2PC. With 2PC as the typical distributed transaction protocol, it provides the standard ACID guarantees. From the CAP perspective, it provides strong consistency, but its availability (including performance) is not good enough (and in some scenarios even consistency cannot be guaranteed).
  • BASE. BASE is an alternative to ACID (the names play on base and acid in chemistry). It can implement distributed transactions on top of Dynamo or a message queue (MQ), pursues only eventual consistency, and gives up being strongly consistent at all times in exchange for performance and availability.

At this point, we can summarize the consistency models at a theoretical level. In essence, these models are the promises and guarantees that a distributed system makes to the outside world, so that external systems can rely on those guarantees rather than hold unrealistic expectations or treat the distributed system as a complete black box.

There are so many categories and implementations of consistency models that we cannot and do not need to list them all. Only the main models covered in our previous articles are summarized here.

Strong consistency models

  • Linearizable consistency (linearizability) guarantees a global order. It is the strongest consistency model and makes the system look like a stand-alone system. It is mainly implemented with consensus algorithms, of which Paxos is typical.
  • Sequential consistency ensures that all nodes observe the same order, which is not necessarily the real global order. Sequential consistency plus real-time ordering equals linearizable consistency. (Be careful not to confuse this with serializability in transactions.)

Weak consistency models

  • Client-centric consistency models do not require the servers to be fully consistent; they only pursue consistency from each client's own point of view.

    • Read-after-write consistency. A client can immediately read what it has just written. The main implementation is to route the client's requests to a fixed replica.
    • Monotonic reads. A client never reads data older than what it has already read. The main implementation is client-side caching.
  • Eventual consistency models do not pursue consistency at every moment, but guarantee that the replicas will agree after some indeterminate time. No particular mechanism is required; it is the natural state of the system. More attention should go to operational efficiency and conflict resolution.
  • Causal consistency models guarantee order for events that have a logical cause-and-effect relationship and guarantee nothing otherwise. The main implementation is vector clocks.

Roots of Consistency Problems

From the summary above, it is easy to see that the consistency problem is extremely important and far-reaching. We have tried every means to solve it.

As mentioned earlier, consistency problems arise because we want high availability on top of scalability. A server may go down or even fail permanently, so we must keep multiple copies, and the data across those copies must be synchronized to stay identical. If this is implemented poorly, inconsistency occurs.

So server failures would seem to be the root cause of consistency problems. Is that really so?

Partially, but not entirely.

Let's step back to the original stand-alone system and see whether we can get to the root of the problem.


In a stand-alone system, a program that receives a specific input produces only one of two results:

  • It returns a specific output.
  • It returns an error when a failure is encountered.

For a series of inputs, a stand-alone system processes them in chronological order and produces outputs in the same order as the inputs.

Therefore, the response of a stand-alone system to a specific input is deterministic. This determinism shows in two aspects:

  • Output content. When a failure occurs, such as disk corruption, the system would rather crash than return a different output.
  • Output order. When a failure occurs, the system would rather stop and resume after recovery than return outputs in a different order.

This determinism is very important. It is a strong guarantee that the stand-alone system provides to the outside world, so external systems can use it with confidence.

This determinism is also a one-to-one relationship, which in turn lets the outside world infer the state of the system from the results it returns.

In a distributed system, however, the system as a whole is allowed to keep running after a partial failure in order to preserve availability. The number, location, and duration of partial failures are all uncertain, which leaves the system in a nondeterministic state.

The same input from the outside world is no longer guaranteed to produce the same output, and the state of the system can no longer be inferred from the output.

This judgment is correct: the uncertainty caused by partial failure is the root cause of data consistency problems. But it is still somewhat abstract. Digging deeper, what exactly causes this uncertainty?

Unreliable Network


Take a look at the simplified network topology diagram above. Once their roles are stripped away, a distributed system is just a graph of many nodes.

These nodes are like islands in the Pacific Ocean: they exist in isolation, know almost nothing about the outside world, and can explore it in only one way -- through a peer-to-peer network.

If you send a message over the network and get a response, you learn more about the other party. But if you get no response, you cannot even conclude that the other party is offline.

That is because the actual topology looks more like this:


The nodes are not directly connected; they are linked through complex network devices such as switches. These devices (and even the network cables) can also fail, become congested, and so on.

Therefore, for a node that issues a network request, there are actually two variables: the peer node and the network.

This works like a logical AND. If the result is 1, both are known to be healthy, but if the result is 0, there are three possibilities:

  • The peer node is abnormal but the network is normal.
  • The peer node is normal but the network is abnormal.
  • Both the peer node and the network are abnormal.

So, unlike the one-to-one relationship of the stand-alone system, this is a one-to-many relationship, and the system state cannot be inferred from the returned result.
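The one-to-many relationship can be enumerated directly. The sketch below models a request as a logical AND of two hypothetical boolean inputs, `peer_ok` and `network_ok`, and groups every combination by what the sender observes.

```python
from itertools import product


def request_succeeds(peer_ok: bool, network_ok: bool) -> bool:
    # A response arrives only if both the peer and the network are healthy.
    return peer_ok and network_ok


# Group every (peer_ok, network_ok) combination by the observed outcome.
causes = {True: [], False: []}
for peer_ok, network_ok in product([True, False], repeat=2):
    causes[request_succeeds(peer_ok, network_ok)].append((peer_ok, network_ok))

for outcome, states in causes.items():
    print(f"response={outcome}: {len(states)} possible state(s) {states}")
```

A successful response maps back to exactly one state, while a missing response maps back to three, which is why the sender cannot tell them apart.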

Moreover, an abnormal node can be either truly dead, as in a crash, or only apparently dead, as in a long GC pause. The only way to tell the two apart is to probe with a timeout. But what should the timeout be? Different systems and environments have their own empirical values, and none of them is 100% reliable. This is the so-called unbounded timeout problem.
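The dilemma of choosing a timeout can be shown with a toy failure detector. This is a minimal sketch; the function name, the 4-second pause, and both timeout values are made-up numbers for illustration.

```python
def is_suspected(last_heartbeat: float, now: float, timeout: float) -> bool:
    """Suspect a peer if no heartbeat arrived within `timeout` seconds."""
    return now - last_heartbeat > timeout


# Hypothetical scenario: the peer is alive but stalled for 4 s in a GC pause.
last_heartbeat = 100.0
now = 104.0

# A short timeout wrongly declares the paused (but alive) peer dead.
short_timeout_verdict = is_suspected(last_heartbeat, now, timeout=3.0)

# A long timeout tolerates the pause, but a real crash would also go
# unnoticed for up to 10 s.
long_timeout_verdict = is_suspected(last_heartbeat, now, timeout=10.0)

print(short_timeout_verdict, long_timeout_verdict)
```

No single value is right for every pause length, so any fixed timeout trades false suspicions against slow detection.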

That's one of the big problems with distributed systems -- unreliable networks.

Unreliable Clock

As for the order of a series of messages, there is no longer a single clock to determine it, as there is in a stand-alone system.

Usually a local clock provides local ordering, and a globally synchronized clock provides overall ordering. But global clock synchronization is hard to make accurate and efficient enough, whether with standard NTP or with a custom synchronization protocol.

This also makes it difficult to achieve a total order accurately and efficiently in a distributed system.

This is another big problem faced by distributed systems -- unreliable clocks.

Beyond that, think about it carefully: what is time, and what is it for? Does time even really exist?

This question can be viewed from many angles, such as philosophy and physics.

Let's look at it from a computer's point of view.

Most modern computers and programming languages follow the Unix convention: starting from 00:00:00 on January 1, 1970, the counter is incremented by 1 or 1000 for every second that elapses (depending on the data type and precision). This is the so-called timestamp.

Therefore, computers describe time as a count. A datetime, which shows the year, month, and day and is more readable to humans, is just a converted display form of that count.

And since time is a count, subtracting two timestamps is also meaningful: the result is the familiar time interval (duration).

So, in fact, time has three dimensions to us:

  • Order. Describes the qualitative precedence relationship.
  • Interval (duration). Describes the quantitative precedence relationship.
  • Interpretation. Describes the distance from the time base (the epoch) in a human-friendly form.
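The three dimensions can be seen with nothing more than two integers. The timestamps below are arbitrary values chosen for illustration.

```python
from datetime import datetime, timezone

# Two Unix timestamps: seconds elapsed since 1970-01-01 00:00:00 UTC.
t1 = 1_700_000_000
t2 = 1_700_000_090

# Order: a qualitative comparison of two counts.
t1_is_first = t1 < t2

# Interval: a quantitative difference between two counts.
elapsed_seconds = t2 - t1

# Interpretation: the same count rendered in a human-friendly form.
readable = datetime.fromtimestamp(t1, tz=timezone.utc).isoformat()

print(t1_is_first, elapsed_seconds, readable)
```

Only the last step involves calendars and time zones; order and interval are plain integer arithmetic.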

These map onto a distributed system as follows:

  • Order is used to establish the sequence of events and thus ensure data consistency.
  • Intervals are used to bound expected events, such as heartbeats for liveness probing and delays in event processing.
  • Interpretation, as in other scenarios, is used by humans to relate events to the real world.

Once a global clock cannot be guaranteed in a distributed system, errors in interpretation are tolerable, since humans are not that sensitive to time. However, strong data consistency can no longer be guaranteed, and node liveness detection may be misjudged (it is affected by both the network and the clock), with potentially serious consequences.

Unreliable networks and unreliable clocks are the two root causes of data consistency problems in distributed systems.

Addressing the Root Cause of Consistency Problems

With the root causes identified, it seems that solving these two problems would fundamentally solve the consistency problem.

Addressing the Unreliable Network

There is no simple and fundamental solution to an unreliable network.

As for the network itself, whether it is a transit device such as a switch or a transmission medium such as optical fiber or a telephone line, it can physically fail just like a server, and this cannot be avoided entirely.

Improving hardware stability and performance and improving operational efficiency can certainly raise network quality, but they cannot eliminate the problem.

Therefore, we can only probe by sending requests and then infer the peer's status from the results.

And whether you tune the detection timeout or add multi-step verification, you can only offer a best-effort guarantee.

In the final analysis, a node is an island, and in the ocean of a distributed system it can hardly grasp the overall situation on its own.

So nodes need to work together and collaborate. This is the approach taken by the various quorum algorithms covered in previous articles, so I won't repeat them here.

Addressing the Unreliable Clock

The first method: since keeping clocks globally consistent is so difficult, bypass the problem and do not use a clock at all.

After all, as mentioned in the previous section, a clock is essentially just a counter. At worst, we simply switch to a different counter.

When it comes to data consistency, what we actually pursue is the first attribute of time: order. An auto-incrementing ID can then be used to identify the sequence, which amounts to a logical clock.

The first person to propose this method was Leslie Lamport, the author of the famous Paxos, so this kind of logical clock is also called a Lamport timestamp.
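A Lamport timestamp fits in a few lines. The sketch below uses illustrative class and method names and follows the usual rules: increment the counter on every local event or send, and on receipt jump past the larger of the local and remote counters.

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event or message send: advance the counter."""
        self.time += 1
        return self.time

    def receive(self, remote_time):
        """On receipt, move past both the local and the remote counter."""
        self.time = max(self.time, remote_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
a.tick()                    # a: 1
sent = a.tick()             # a: 2, the timestamp attached to the message
b.tick()                    # b: 1
received = b.receive(sent)  # b: max(1, 2) + 1 = 3
print(sent, received)
```

The receive event always gets a larger timestamp than the send event that caused it, without any wall clock involved.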

However, even with auto-incrementing IDs, nodes still need to negotiate the IDs with one another, just as with a clock, or rely on a central service that hands out IDs, and either option drags down performance.

So take a step back and give up strong consistency. Causal consistency is sufficient in most application scenarios.

Following this reasoning, the answer is obvious: the version numbers and vector clocks we discussed earlier!
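As a brief reminder of how vector clocks separate causally ordered writes from concurrent ones, here is a minimal comparison sketch. Clocks are represented as plain dictionaries from node name to counter; the function name and the node names `n1` and `n2` are illustrative assumptions.

```python
def compare(vc_a, vc_b):
    """Return how vc_a relates to vc_b: before, after, equal, or concurrent."""
    nodes = vc_a.keys() | vc_b.keys()
    # A missing entry counts as zero.
    a_le_b = all(vc_a.get(n, 0) <= vc_b.get(n, 0) for n in nodes)
    b_le_a = all(vc_b.get(n, 0) <= vc_a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


write_1 = {"n1": 1}            # n1 writes
write_2 = {"n1": 1, "n2": 1}   # n2 has seen write_1, then writes
write_3 = {"n1": 2}            # n1 writes again without seeing write_2

print(compare(write_1, write_2))
print(compare(write_2, write_3))
```

The first pair is causally ordered, so the order is known. The second pair is concurrent, which is exactly the case where a conflict must be resolved by other means.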

For details, please review the previous article, which will not be repeated here.

Although this satisfies most scenarios, some scenarios need stronger consistency than it can offer, and node liveness detection, which depends on the duration attribute of time, cannot be handled by plain counters.

So the second method is to face the problem head-on and build a truly consistent time.

The most representative example is Google's TrueTime API:

  • Each data center has a set of time masters that serve as its clock standard.
  • Each machine runs a time slave daemon that keeps its time synchronized with the masters.
  • Most time masters are equipped with GPS receivers and synchronize time from satellites, avoiding the influence of terrestrial network equipment.
  • The remaining time masters are equipped with atomic clocks, which derive time from atomic resonance frequencies, with an error of one second in 20 million years.
  • Besides synchronizing from GPS or an atomic clock, the masters also cross-check one another and compare against their own local clocks, and a master evicts itself if it detects an anomaly.
  • Every 30 seconds, each slave polls multiple masters (possibly in different data centers) and uses Marzullo's algorithm to reject liars.

Under typical settings, the local clock drifts by up to 200 us/sec. With a slave synchronization interval of 30 seconds, the maximum theoretical drift is 6 ms; adding an average transmission overhead of about 1 ms, the actual error ranges from 0 to 7 ms.
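The error bound is simple arithmetic, reproduced here with the figures quoted in the text. The variable names are illustrative.

```python
drift_rate_us_per_s = 200      # worst-case local clock drift, in microseconds per second
sync_interval_s = 30           # each slave polls the masters every 30 seconds
transmission_overhead_ms = 1   # approximate average transmission overhead

# Drift accumulated between two synchronizations, converted from us to ms.
max_drift_ms = drift_rate_us_per_s * sync_interval_s / 1000

# Upper end of the error range once the transmission overhead is added.
max_error_ms = max_drift_ms + transmission_overhead_ms

print(max_drift_ms, max_error_ms)
```

The uncertainty grows linearly between synchronizations and drops back after each poll, which is why the error is quoted as a range rather than a single value.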


The picture above shows Google's benchmark across thousands of machines in multiple data centers. The overall error is well controlled, with 99% of errors within 3 ms. In the left panel, after network improvements on March 31, the error decreased further and stabilized, with 99% of errors within 1 ms. The spike in the middle of the right panel was caused by planned maintenance on 2 masters.

Traditional clocks appear to provide an exact time, but this is an illusion. It looks as if there is no error, but in fact the time uncertainty is unbounded.

  • TT.now(): TTinterval: [earliest, latest]
  • TT.after(t): true if t has definitely passed, t < TT.now()
  • TT.before(t): true if t has definitely not arrived, t > TT.now()

As the main TrueTime API above shows, what TrueTime does is provide a guarantee of bounded time uncertainty.

Synchronization between nodes inevitably involves varying transmission and processing overhead, so absolute agreement is impossible, but TrueTime can guarantee that the real time lies within this extremely small interval.

Since TrueTime returns an interval, establishing the order of two timestamps requires that the two intervals do not overlap, that is, t1.latest < t2.earliest.

TrueTime is heavily used in Spanner, Google Cloud's distributed database. In Spanner, to ensure the serializability of two transactions, after the first transaction t1 commits, the second transaction may commit only once TT.now() > t1.latest, that is, once t1.latest has definitely passed. This is the so-called commit wait.
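The interval API and commit wait can be imitated locally. This is only a simulation under stated assumptions and not Google's implementation: `time.monotonic` stands in for the synchronized clock, `EPSILON` is an assumed fixed uncertainty bound set to the 7 ms worst case quoted earlier, and every function name is hypothetical.

```python
import time
from dataclasses import dataclass

EPSILON = 0.007  # assumed uncertainty bound in seconds (the 7 ms worst case)


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


def tt_now(clock=time.monotonic):
    """Return an interval assumed to contain the true time."""
    t = clock()
    return TTInterval(t - EPSILON, t + EPSILON)


def tt_after(t, clock=time.monotonic):
    """True if t has definitely passed."""
    return t < tt_now(clock).earliest


def tt_before(t, clock=time.monotonic):
    """True if t has definitely not arrived."""
    return t > tt_now(clock).latest


def commit_wait(commit_interval, clock=time.monotonic, sleep=time.sleep):
    """Block until the latest bound of the commit interval has definitely passed."""
    while not tt_after(commit_interval.latest, clock):
        sleep(0.001)


t1 = tt_now()        # interval assigned to the first transaction
commit_wait(t1)      # wait out the uncertainty (about 2 * EPSILON)
t2 = tt_now()        # interval assigned to the second transaction
print(t1.latest < t2.earliest)
```

The wait costs roughly twice the uncertainty bound per commit, which is why shrinking the bound with GPS and atomic clocks matters so much.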

Of the two methods above, the first (logical clocks) cannot support many scenarios, and the second (physical clocks) depends too heavily on specialized hardware. Hence a third approach, called Hybrid Time, which combines an NTP-based physical clock with an auto-incrementing logical clock to jointly determine the order of events.
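The article does not spell out the Hybrid Time algorithm, so the sketch below is an assumption: it follows the commonly described hybrid logical clock update rules, where a timestamp is a pair of the highest physical time seen and a logical counter that breaks ties. All names are illustrative, and the physical clock is injectable so the example is reproducible.

```python
import time


class HybridClock:
    def __init__(self, physical_clock=lambda: int(time.time() * 1000)):
        self.physical_clock = physical_clock  # e.g. an NTP-synchronized clock, in ms
        self.wall = 0      # highest physical time seen so far
        self.logical = 0   # counter that breaks ties within the same `wall`

    def now(self):
        """Timestamp a local event or an outgoing message."""
        pt = self.physical_clock()
        if pt > self.wall:
            self.wall, self.logical = pt, 0
        else:
            self.logical += 1
        return (self.wall, self.logical)

    def update(self, remote):
        """Merge the timestamp carried by an incoming message."""
        r_wall, r_logical = remote
        pt = self.physical_clock()
        new_wall = max(self.wall, r_wall, pt)
        if new_wall == self.wall == r_wall:
            self.logical = max(self.logical, r_logical) + 1
        elif new_wall == self.wall:
            self.logical += 1
        elif new_wall == r_wall:
            self.logical = r_logical + 1
        else:
            self.logical = 0
        self.wall = new_wall
        return (self.wall, self.logical)


# The receiver's physical clock lags 5 ms behind the sender's.
fast = HybridClock(physical_clock=lambda: 105)
slow = HybridClock(physical_clock=lambda: 100)

sent = fast.now()
received = slow.update(sent)
next_local = slow.now()
print(sent, received, next_local)
```

Even though the receiver's physical clock is behind, its timestamps still sort after the message it received, and they stay close to physical time instead of being arbitrary counters.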

In addition, I noted earlier that a centralized ID distribution service for logical clocks may drag down performance, but this is not absolute. If the distribution service does nothing else and the network between it and the service nodes is fast and stable, for example when everything sits in the same IDC, it can be considered a fourth clock scheme. In fact, an implementation already exists: the Timestamp Oracle in Google Percolator. I won't expand on it here; interested readers can explore it on their own.
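The idea of a central distributor can be sketched as a single counter behind a lock. This in-process stand-in is an assumption for illustration only: it is not how Percolator implements its Timestamp Oracle, and a real service would be reached over the network and would have to survive restarts.

```python
import threading


class TimestampOracle:
    """Hands out strictly increasing timestamps from one central place."""

    def __init__(self):
        self._lock = threading.Lock()
        self._last = 0

    def next_timestamp(self):
        # The lock serializes callers, so no two of them get the same value.
        with self._lock:
            self._last += 1
            return self._last


oracle = TimestampOracle()
issued = []


def worker():
    for _ in range(1000):
        issued.append(oracle.next_timestamp())


# Four concurrent "nodes" request timestamps from the same oracle.
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(issued), len(set(issued)))
```

Every caller pays one round trip to the oracle, which is acceptable only under the conditions described above: a dedicated service on a fast, stable network.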


The first half of this article summarized the data consistency issues discussed in recent articles, so it is not repeated here. Only the second half is summarized below.

Data consistency problems appear to be caused by server failures, but are actually caused by the uncertainty of distributed systems.

Specifically, unreliable networks and unreliable clocks are the source of consistency problems.

For unreliable networks, consensus algorithms such as Paxos offer a way out.

For unreliable clocks, the options are decentralized logical clocks typified by vector clocks, new physical clocks typified by the TrueTime API, hybrid clocks exemplified by Hybrid Time, and centralized logical clocks exemplified by the Timestamp Oracle.

At this point, the chapter on data consistency comes to an end. I have left the question of exactly-once unanswered several times in earlier articles, and I have not forgotten it. I will pick it up again at a suitable point later.

After solving the consistency problem, can we enjoy the power of the scalability of the distributed system with peace of mind?

Is a distributed system really completely distributed?

In the next article, let's take a look at the centralization problem in distributed systems.

This is a carefully planned series of 20-30 articles. I hope to give everyone a grasp of the basics and core ideas of distributed systems through storytelling. Stay tuned for the next one!

Alibaba Cloud_Academy
