Disclaimer: This is a translated work of Qinxia's 漫谈分布式系统, all rights reserved to the original author.
In the previous posts, we focused on one of the core issues of distributed systems: scalability. Solving it is what allows a distributed system to keep growing.
Scalability lets data be stored and computation run at scale, which is the core problem distributed systems exist to solve. Once it is solved, we need to consider how to keep the service running stably, so that we can continue to benefit from distributed capabilities.
In other words, only a system with high availability and a high SLA is a trustworthy system.
In the next few posts, we will discuss another core issue of distributed systems: availability.
The only option for high availability
For a system to achieve high availability, there is only one way: replication.
The reason is very simple: physical failure is unavoidable.
At the software level, no matter how advanced your design or how complete your implementation, it cannot survive a server suddenly shutting down, a sudden power outage in the data center, or a network cable suddenly cut by a construction crew.
The duration of a physical failure is also unpredictable: network jitter may recover in milliseconds, a downed server may restart within minutes, and a damaged hard disk may never be repaired. Even a downed server might come back after a reboot, or might have to be sent back to the factory for repair.
So the only option is to keep an extra copy (a replica), always ready. When a failure occurs, switch to the replica immediately to keep the service uninterrupted.
In addition to replicating data, it is often necessary to replicate services as well.
To put it bluntly, this means trading money for availability.
Of course, the replication strategy is not unique to distributed systems; it has long been practiced in traditional fields. For example, RAID backs up disk data, and multiple instances of a microservice act as backups of that service.
Also, the impact of a physical failure can be estimated: a server going down only affects that one machine, and a power outage in one data center does not affect data centers elsewhere. This helps us adopt different replica strategies.
Master-slave in Replication
When making a replica, the most obvious approach is to add a standby copy. Normally, the original copy serves all requests while the standby stays invisible to the outside world; only when the original fails does the system switch over (failover) to the standby.
So it is natural to distinguish between a master and a slave (leader & follower, master & slave, active & standby; different systems use different names). The failover action also brings a transition of the master-slave roles.
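The failover and role transition described above can be sketched as a tiny state machine. This is a minimal illustration, not any real system's API; the class and node names are invented for the example:

```python
class ReplicaSet:
    """Minimal sketch of single-leader failover: one leader serves
    requests while standbys wait; on failure a standby is promoted."""

    def __init__(self, leader: str, standbys: list[str]):
        self.leader = leader
        self.standbys = standbys

    def handle(self, request: str) -> str:
        # All requests go to the current leader.
        return f"{self.leader} served {request}"

    def failover(self) -> None:
        # The leader is gone: promote the first standby (role transition).
        self.leader = self.standbys.pop(0)

rs = ReplicaSet("node-a", standbys=["node-b"])
print(rs.handle("read"))   # node-a served read
rs.failover()
print(rs.handle("read"))   # node-b served read
```

Real systems differ mainly in who triggers `failover` (an external coordinator such as ZooKeeper, or the replicas themselves) and how they agree the old leader is really dead.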
In fact, master and slave roles in replication can be combined in several different ways.
Single-leader replication
This is the architecture described above: simple, straightforward, and with few hidden dangers.
Many distributed systems use this method, including HDFS (NameNode) and YARN (ResourceManager), which we covered in previous articles.
But many systems do not. For example, the DataNodes in HDFS hold multiple copies of data, yet there is no master or slave among them; every copy is equal.
Since single-leader replication is so good, why do some systems not use it?
On the one hand, failover takes time to complete. Even if it lasts only a few seconds, requests during that window are blocked or dropped at the client.
On the other hand, since the extra money is spent for availability anyway, why not let the replicas do some work too, such as serving read requests to relieve performance pressure on the system?
Hence the multi-leader replication strategy, in which every replica can serve requests.
This method has long been practiced in the database field, in the so-called master-master mode.
If the externally provided service is subdivided into reads and writes, the roles in multi-leader replication can be subdivided into read-only and read-write:
Fully peer-to-peer multi-leaders, e.g. the MySQL + Tungsten combination.
A read-write leader plus read-only leaders; for example, the Observer NameNode in HDFS is a read-only leader.
However, multiple masters, especially peer read-write masters, can easily lead to conflicts and confusion -- or, in the more technical parlance we'll get to later, consistency problems.
Imagine that, under concurrency, the same row in the database is changed to two different values by requests received by two different leaders at the same time. Which value should be accepted?
There is no universal answer, because both requests were successfully applied by their respective leaders. (You might quickly suggest using timestamps, letting the later write overwrite the earlier one. As we will see later, things are not so simple.)
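To see one reason why timestamps are not so simple, here is a minimal sketch (all names are illustrative) of naive last-write-wins resolution. With clock skew between leaders, the genuinely later write can carry an earlier timestamp and be silently discarded:

```python
import dataclasses

@dataclasses.dataclass
class Write:
    value: str
    timestamp: float  # wall-clock time at the originating leader

def last_write_wins(a: Write, b: Write) -> Write:
    """Naive conflict resolution: keep the write with the later timestamp."""
    return a if a.timestamp >= b.timestamp else b

# Two leaders accept concurrent updates to the same row.
# Leader B's clock runs a few seconds slow, so its genuinely later
# write carries an *earlier* timestamp.
w_a = Write(value="from-leader-A", timestamp=100.0)
w_b = Write(value="from-leader-B", timestamp=97.0)  # actually happened later

winner = last_write_wins(w_a, w_b)
print(winner.value)  # from-leader-A -- the newer write from B is lost
```

The loss is silent: both clients saw their write succeed, yet one value has vanished.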
Even with this potentially serious problem, the multi-master replication architecture is still valuable.
A typical example is a database deployed across multiple IDCs.
Whether for disaster recovery, capacity, or response latency, the same data is often stored in data centers that are physically far apart. Each data center has its own master-slave structure and is relatively independent: replication within a data center uses the single-leader method, while replication across data centers uses the multi-leader method.
Both replication methods mentioned above, single-leader and multi-leader, may involve a failover process, during which the service risks short timeouts or even interruption, and they may also suffer data consistency problems.
Essentially, as long as there is a leader, and a request is sent to only one replica, at least one of the two problems above will occur.
Hence the so-called leaderless replication, which, as the name suggests, has no leader. In this mode, requests are no longer sent to a single leader but to many nodes.
For example, with 3 replicas, a write request can be sent to all 3 nodes at the same time, or to 2 of them.
In effect, this implements active data replication on the client side, whereas in the first two modes replication happens in the background on the server side.
When reading, write failures must be taken into account: because there is no leader, it is uncertain which node holds the latest data, so reads, like writes, must query multiple nodes at the same time.
For example, if the number of replicas is n, the number of replicas written is w, and the number of replicas read is r, the latest data is guaranteed to be read only when w + r > n.
Amazon's Dynamo, and its open-source implementation Cassandra, both use this method of data replication.
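The quorum condition above can be checked with a one-line helper (a sketch, not any particular system's API). The key point is that w + r > n forces every read set of r replicas to overlap every write set of w replicas:

```python
def quorum_ok(n: int, w: int, r: int) -> bool:
    """True when any r replicas read must overlap any w replicas
    written, so at least one read replica holds the latest data."""
    return w + r > n

# Typical settings for n = 3 replicas (Dynamo/Cassandra style):
print(quorum_ok(3, 2, 2))  # True  -- quorum reads and writes overlap
print(quorum_ok(3, 1, 1))  # False -- a read may miss the latest write
print(quorum_ok(3, 3, 1))  # True  -- write-all / read-one
```

Tuning w and r trades write latency against read latency while keeping the overlap guarantee.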
Timeliness of Replication
Multiple replicas provide high availability, and the data on those replicas gets there by replication. Therefore the speed of replication, the so-called timeliness of replication, directly affects the level of availability.
If a failure occurs before the data has been replicated to another copy, and the failure turns out to be unrecoverable, the data is completely lost.
So the simplest approach is to ensure the data is copied to all replicas before returning OK to the client. This is called synchronous replication.
The drawback of this approach is also obvious: performance inevitably suffers.
If the network jitters, or a follower's processing cannot keep up, overall performance degrades.
If a switch fails, or a follower machine goes down, all requests block outright.
A system whose performance cannot be guaranteed, however high its availability, is impractical.
To solve the performance problem, the idea of asynchronous replication arises naturally.
After the leader receives the data, it immediately returns OK to the client and continues processing other client requests; the replication itself is handed to another thread to perform asynchronously.
In this way, performance is naturally maximized.
But the shortcoming is equally obvious: if the leader goes down before the data has been synchronized to a follower, data may be lost.
In the data replication scenario, performance and availability seem to be a choice where you cannot have both.
Since losing either is unacceptable, the only way forward is compromise.
Hence the so-called semi-synchronous approach.
For example, with 3 replicas, after the leader receives data from the client, it immediately writes it locally and, at the same time, replicates it synchronously to one follower before returning OK to the client; the remaining follower receives the data asynchronously.
This compromise, though imperfect, may be the more realistic choice for most scenarios.
Of course, for a distributed system it is even better to leave this choice to the user. Kafka is a good example: it lets users choose the data synchronization method according to their usage scenario.
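The three acknowledgment strategies can be sketched as follows. This is a simplified simulation with invented names, not a real replication protocol; Kafka exposes a roughly analogous choice through the producer's acks setting:

```python
import concurrent.futures

pool = concurrent.futures.ThreadPoolExecutor()

def replicate(follower: str, record: str) -> str:
    # Stand-in for shipping the record over the network to a follower.
    return f"{follower} has {record}"

def write(record: str, followers: list[str], mode: str) -> str:
    futures = [pool.submit(replicate, f, record) for f in followers]
    if mode == "sync":
        # Wait for every follower before acking: safest, slowest.
        concurrent.futures.wait(futures)
    elif mode == "semi-sync":
        # Wait for just one follower: the compromise.
        next(concurrent.futures.as_completed(futures))
    # "async": wait for no one; replication continues in the background.
    return "OK"  # acknowledgment returned to the client

print(write("row-1", ["f1", "f2"], mode="semi-sync"))  # OK
```

Only the moment of acknowledgment differs between the three modes; the replication work itself is identical.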
In this article, we briefly looked at another core problem facing distributed systems: availability.
The only way to high availability is replication: physical failures are unavoidable, so you can only spend money on backups.
Master-slave and timeliness are two important issues to be considered in replication.
There are three main modes of master-slave, single master, multi-master and no master.
There are three main ways of timeliness: synchronous, asynchronous and semi-synchronous.
In the content above, we inevitably touched on issues such as timeliness and network jitter. These problems come along with the replication mechanism introduced to solve the high-availability problem. And, yes, they are unavoidable and can lead to serious consequences.
In the next article, let's take a look at the price of replication.
This is a carefully conceived series of 20-30 articles. I hope to give everyone a basic and core grasp of distributed systems in a story-telling way. Stay tuned for the next one!