Design Model of Distributed Locks
What is a distributed lock? Many readers are probably both familiar and unfamiliar with the concept. With the rapid development and wide adoption of distributed systems, mutually exclusive access to shared resources has become a requirement many businesses must face, and people usually introduce distributed locks to solve it. Which distributed lock services are commonly used? There are open-source third-party components such as MySQL, Redis, ZooKeeper, and etcd, as well as in-house distributed lock services such as Tair and Nuwa developed within the group. In general, our needs for distributed locks fall roughly into the following two application scenarios:
• Operation atomicity: In a single-machine environment, to make multi-process or multi-thread operations on shared resources atomic, we can use the SpinLock or Mutex mechanisms provided by the kernel to ensure that only one process or thread operates on a shared resource at a time. Analogously, in a distributed environment we usually use distributed locks to control concurrent operations by nodes on multiple machines, so as to avoid corrupting data or state.
• High system availability: High service availability usually requires deploying multiple nodes for redundancy, so that a single point of failure does not make the service unavailable. With leader election built on a distributed lock plus service discovery, a node decides whether to become the master node and serve external traffic according to whether it acquires the lock. When a node goes down, the backup nodes can compete for ownership of the distributed lock and take over serving requests.
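The leader election pattern above can be sketched with a minimal in-process stand-in for a real lock service (the class names here are illustrative, not a real API; a production system would use ZooKeeper, etcd, or a similar service):

```python
import threading

class InMemoryLockService:
    """Stand-in for a distributed lock service: one lock record
    guards the 'leader' role for a named resource."""
    def __init__(self):
        self._guard = threading.Lock()
        self._holder = None

    def try_acquire(self, node_id):
        # Atomically grant the lock if it is currently free.
        with self._guard:
            if self._holder is None:
                self._holder = node_id
                return True
            return False

    def release(self, node_id):
        # Only the current holder may release the lock.
        with self._guard:
            if self._holder == node_id:
                self._holder = None

def elect_leader(lock_service, node_id):
    """A node becomes leader only if it wins the lock; otherwise it stands by."""
    return "leader" if lock_service.try_acquire(node_id) else "standby"

svc = InMemoryLockService()
assert elect_leader(svc, "node-1") == "leader"   # first node wins the lock
assert elect_leader(svc, "node-2") == "standby"  # second node stays standby
svc.release("node-1")                            # leader exits; lock is freed
assert elect_leader(svc, "node-2") == "leader"   # backup takes over
```

The real version differs mainly in that the lock carries a lease or session, so a crashed leader's lock expires automatically rather than requiring an explicit release.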
The business requirements and scenarios for distributed locks look fairly simple, but in practice we keep raising new requirements while using them, and no single solution seems to fit every scenario. So how are distributed locks implemented internally? Or how should they be implemented? That is what we hope to discuss in this article. We hope the discussion gives readers a solid understanding of how distributed locks work and provides more guidance when making technology choices.
2. Design Model
What design model should we establish for distributed locks? Put differently: what properties should a reasonable distributed lock model satisfy? It is worth referring to two definitions from the industry. The first comes from Apache Helix (a popular open-source cluster resource management framework that automatically manages partitioned, replicated distributed resources across cluster nodes), which states three properties of a distributed lock manager: a) even distribution: the first node should not acquire all the distributed locks; b) balanced rescheduling: when a node holding distributed locks exits unexpectedly, the lock resources it held must be properly reallocated; c) rebalancing: when a new node joins, lock resources should be redistributed among nodes to restore balance. Helix's definition of the distributed lock model clearly emphasizes balance, which is not surprising given that Helix is responsible for partition resource scheduling within a cluster.
Now consider another well-known definition, from Redis, which proposes three principles a distributed lock model must obey: a) absolute mutual exclusion: at any moment, only one client can hold a given distributed lock; b) eventual availability: if the client holding a distributed lock exits unexpectedly, the associated lock resource must be reallocatable; c) service fault tolerance: the service providing distributed locks must itself be fault tolerant, so that even if some of its nodes crash, the overall lock service is unaffected.
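Redis itself realizes the first two principles with an atomic `SET key token NX PX ttl` command plus a check-then-delete on release. A minimal in-memory sketch of that pattern (the dict stands in for the Redis server; a real client would issue the commands over the wire):

```python
import time
import uuid

class MiniRedisLock:
    """In-memory sketch of the Redis SET NX PX lock pattern.
    The dict stands in for Redis; keys expire via a stored deadline."""
    def __init__(self):
        self._store = {}  # key -> (owner_token, expires_at)

    def acquire(self, key, ttl_ms):
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and entry[1] > now:
            return None  # held by someone else and not yet expired
        token = str(uuid.uuid4())  # unique owner token
        self._store[key] = (token, now + ttl_ms / 1000.0)
        return token

    def release(self, key, token):
        # Delete only if we still own the lock
        # (real Redis does this check-and-delete atomically in a Lua script).
        entry = self._store.get(key)
        if entry is not None and entry[0] == token:
            del self._store[key]
            return True
        return False

lock = MiniRedisLock()
t1 = lock.acquire("resource", ttl_ms=1000)
assert t1 is not None                                     # first client wins
assert lock.acquire("resource", ttl_ms=1000) is None      # mutual exclusion
assert lock.release("resource", "wrong-token") is False   # only the owner releases
assert lock.release("resource", t1) is True
```

The TTL is what delivers eventual availability: if the holder crashes, the lock simply expires and becomes acquirable again.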
Based on our own experience, we strongly agree with Redis's basic constraints on the distributed lock model; these are exactly the properties any distributed lock service implementation must consider. The Redis articles also discuss further characteristics of distributed locks. As shown in Figure 3 below, we summarize the properties a distributed lock model must consider along three dimensions. The first dimension covers the most basic constraints, identical to those proposed by Redis: mutual exclusion, fault tolerance, and eventual availability. The second layer covers lock characteristics the distributed lock manager needs to pay attention to, such as lock-acquisition efficiency, lock balance, switching accuracy, and lock reentrancy. On top of these sits another constraint that any distributed lock implementation must consider, related to data consistency and correctness: fencing protection and the impact of clock drift.
On the topic of data consistency and correctness in an actual distributed lock manager implementation, one issue is the unreliability of wall-clock time, which can be addressed by using monotonic time instead of wall time; this article will not discuss it further. The other issue is that when a distributed lock service is actually used to guard shared resources, a fencing capability is required to achieve absolute mutual exclusion of resource access. Martin Kleppmann gave an excellent illustration of this. As shown in Figure 4 below, Client1 first acquires the distributed lock, then a GC pause occurs while it is operating on the data. During the long "Stop-The-World" GC, Client1 loses ownership of the lock; Client2 wins the lock and begins operating on the data. When Client1's GC finishes, Client1 and Client2 end up operating on the data simultaneously, leading to inconsistency.
The solution to this problem is to introduce an IO fencing capability for shared resource access. As shown in Figure 5 below, the global lock service issues a globally auto-incrementing token. Client1 obtains token 33 when it acquires the lock and carries it into the storage system, then hits a GC pause. When Client2 successfully grabs the lock, it obtains token 34 and carries it into the storage system; from then on, the storage system rejects any request with a token smaller than 34. When Client1 recovers from its long GC and tries to write data again, its request carries token 33, which is smaller than the token the underlying storage has recorded, so the request is rejected outright. This achieves the desired data protection.
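The Token-33 / Token-34 scenario above can be sketched as follows (an illustrative in-memory store, not a real storage API):

```python
class FencedStorage:
    """Storage that rejects writes carrying a stale fencing token,
    mirroring the Token-33 / Token-34 example in the text."""
    def __init__(self):
        self.max_token = 0  # highest fencing token seen so far
        self.data = {}

    def write(self, token, key, value):
        if token < self.max_token:
            return False  # stale holder (e.g. back from a long GC pause)
        self.max_token = token
        self.data[key] = value
        return True

store = FencedStorage()
assert store.write(33, "k", "from-client1") is True   # Client1 writes with token 33
assert store.write(34, "k", "from-client2") is True   # Client2 grabbed the lock, token 34
assert store.write(33, "k", "late-client1") is False  # Client1's delayed write is fenced off
assert store.data["k"] == "from-client2"              # Client2's data survives intact
```

The key design point is that the check happens in the storage system, not in the client: the lock service alone cannot stop a paused client from issuing a late write.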
Back to the main question: how do we implement an efficient distributed lock manager? First, a claim: a distributed lock manager can also be split into a control plane and a data plane, with the distributed lock properties in Figure 3 divided between the two. In fact, in the OSDI'20 Best Paper, "Virtual Consensus in Delos", Facebook's research team discussed consensus protocol design in depth, and the paper is excellent. It observes that distributed consensus protocols like Raft can likewise be split into a control plane and a data plane: the former handles fault tolerance, membership changes, and role adjustment, while the latter handles ordering and persistence. Decoupling the two planes makes the consensus protocol remarkably flexible.
Can we apply a similar idea to our distributed lock model, moving the logic for fault tolerance and membership changes into the control plane, while the data plane handles the other lock functions such as mutual exclusion, eventual availability, and lock-acquisition efficiency? The answer is yes, and in fact the idea is not new: in the database field there has long been a school of systems evolving along exactly these lines, collectively called DLMs (Distributed Lock Managers). Typical examples are Oracle RAC, GFS2, OCFS2, and GPFS. Let's look at the DLM next.
3. What is DLM?
The idea of the DLM comes from "The VAX/VMS Distributed Lock Manager", first deployed in VAX/VMS V4.0 in 1984. Below we take Oracle RAC as an example to illustrate the DLM design.
Oracle RAC runs on a cluster and, based on its Cache Fusion technology, gives the Oracle database both high availability and extreme performance. If one node in the cluster fails, Oracle continues to run on the remaining nodes. To keep concurrent writes to memory pages by multiple nodes consistent, a distributed lock manager (DLM) handles the allocation and release of distributed lock resources.
As shown in Figure 7, the DLM is a decentralized design: all nodes in the cluster are peers, and each node maintains part of the lock information. Who, then, decides how a lock is allocated when it is requested? In the DLM, each lock has a Master that coordinates and authorizes lock and unlock requests, and any node can become the master of a lock. On each node, lock resources are organized in a tree structure; the parent-child inheritance between tree nodes allows the lock granularity to be optimized and improves locking and unlocking efficiency.
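Because every node is a peer, a lock request must first be routed to the node that knows the lock's master. A common way to do this without a central router is to hash the resource name, so every node computes the same mapping independently (a simple modulo sketch; the actual DLM routing function is not specified here):

```python
import hashlib

def directory_node(resource_name, cluster_nodes):
    """Route a lock resource to its directory node by hashing its name.
    Every node computes the same mapping, so no central router is needed."""
    digest = hashlib.sha256(resource_name.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(cluster_nodes)
    return cluster_nodes[index]

nodes = ["A", "B", "C"]
owner = directory_node("R1", nodes)
assert owner in nodes
# The mapping is deterministic: every node agrees on R1's directory node.
assert directory_node("R1", nodes) == owner
```

Note that a plain modulo remaps most resources when the node count changes, which is one reason the membership-change procedure described later must rebuild the lock directory.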
The locking and unlocking flows involve the following kinds of nodes: a) Requester: the node initiating the lock or unlock; b) DirectoryNode: the directory node for the lock, which records which node is the lock's master; c) Master: the lock's master and actual manager, responsible for allocating and releasing the lock. Below we use a concrete example to describe the allocation and release of a distributed lock. The example has three nodes A, B, and C, where A is the requester, B the directory node, and C the master.
3.1 Locking process
Figure 8 shows the process of locking a resource mastered on another node, the most expensive locking case, requiring at most two rounds of interaction. Once the resource is established locally, subsequent locks on resources with an inheritance relationship can be taken locally, without interacting with other nodes:
1. Node A locks resource R1: it first constructs the lock object locally (also called the shadow of the lock), but at this point A has not yet acquired the lock;
2. Node A hashes resource R1 to compute its directory manager, which is node B;
3. Node A queries node B, whose records show that the master for R1's lock is on node C;
4. Node A sends a lock request for R1 to node C;
5. Node C maintains R1's lock request queue; if A is allowed to lock, it returns success;
6. Node A updates the local shadow of the R1 lock, and locking is complete.
3.2 Unlocking process
Figure 9 shows the unlocking process, which is also relatively intuitive, with the following three steps:
1. Node A unlocks resource R1 and deletes the local lock object;
2. Node A asks node C to release A's lock;
3. If A is the last requester in the queue, node C asks node B to remove R1 from the directory so that another node can become the lock's master; otherwise, node C only removes A from R1's lock queue.
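The lock and unlock flows above can be condensed into a minimal in-process simulation. The classes below are illustrative, not the real DLM data structures, and the master grants the lock only to the head of its queue (the real DLM also supports compatible lock modes held concurrently):

```python
class LockMaster:
    """Per-resource lock master (node C in the example): keeps the
    request queue and grants the lock to the head of the queue."""
    def __init__(self):
        self.queue = []

    def lock(self, requester):
        self.queue.append(requester)
        return self.queue[0] == requester  # granted only at the head

    def unlock(self, requester):
        self.queue.remove(requester)
        return len(self.queue) == 0  # True: directory entry can be removed

class Directory:
    """Directory node (node B): records which node masters each lock."""
    def __init__(self):
        self.master_of = {}

    def lookup_or_assign(self, resource, requester):
        # First requester for an unknown resource becomes its master.
        return self.master_of.setdefault(resource, requester)

directory = Directory()
masters = {"C": LockMaster()}
directory.master_of["R1"] = "C"          # per the example, R1's master is node C

# Node A locks R1: directory lookup on B, then a grant from master C.
master = directory.lookup_or_assign("R1", "A")
assert master == "C"
assert masters[master].lock("A") is True

# Node A unlocks R1; it was the last requester, so the directory entry is removed.
if masters[master].unlock("A"):
    del directory.master_of["R1"]
assert "R1" not in directory.master_of
```

Removing the directory entry when the queue drains is what lets a different node become the master of R1 the next time it is locked.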
3.3 Membership changes
The flows above describe a single, ordinary lock/unlock cycle. What happens when a node in the cluster fails, or when nodes are added or removed: how are distributed locks still routed and allocated correctly? The DLM has a Connection Manager role. Besides handling network communication between nodes, its other important function is coordinating membership changes: when cluster nodes are added or removed, the nodes first elect a leader node for coordination, and any node may become the leader. The following happens when a node is added or removed:
• Rebuild the node topology: the leader notifies the other nodes in the cluster via two-phase voting, announcing the current cluster topology, and the other nodes may accept or reject it. If no consensus on the topology is reached (other nodes reject the topology information), the leader sleeps for a while, the other nodes elect a new leader, and the new leader sends its notification. In this way, all nodes in the cluster reach consensus on the global topology and the routing algorithm for lock resources. During a membership change, lock requests can still be issued, but they remain queued and cannot succeed; once the membership change completes, these requests are reissued in the order they were initiated.
• Rebuild the lock information: the leader notifies the other nodes to rebuild lock information. The rebuild proceeds in multiple stages; when all nodes finish a stage, the leader tells every node in the cluster to enter the next one. If any node fails during the rebuild, the election and rebuild must be restarted. The rebuild stages are: 1) each node clears its directory information (the routing table of locks) and the locks it holds, because lock resources need to be rerouted; 2) each node re-requests the locks it previously held, following the original routing strategy and order; this re-establishes the lock directory information for the whole cluster, and lock masters are determined anew. Since each node only re-locks its own locks, the lock masters previously held by a node removed due to failure are taken over by other nodes; 3) once all nodes finish re-locking, normal locking and unlocking can resume.
The process above shows that recovery after a cluster membership change is very complicated. To reduce how often it happens, when a node fails to communicate, the cluster waits for a grace period; only if communication still cannot be restored after that interval does node removal proceed. If a node merely restarts and does not hit the threshold that triggers a membership change, only that node needs to be recovered: only its lock-related information is lost, and the other nodes in the cluster are unaffected. While the node restarts, requests sent to it remain pending until it recovers.
On a restarted node, most of the locks described above can still be recovered. The locks on a node fall into two parts: Local Locks, initiated by the node itself, and Remote Locks, initiated by other nodes. Local Locks cannot be recovered from other nodes, which hold no information about them, but since there is no contention they need no recovery; Remote Locks can be recovered from the shadow information kept on the other nodes.
3.4 Some Thoughts
The membership change process shows that the Connection Manager plays a critical role in the DLM and is the most complicated part of the whole design: when a node fails, the Connection Manager coordinates the redistribution of locks, in effect taking on the work of what we call the control plane of the distributed lock manager. What advantages does the DLM gain from this? The data plane, which allocates lock resources, does not need to worry about the fault tolerance of the whole system; more machines can participate in resource allocation in a balanced way; and lock resource information need not be persisted to disk or replicated through a consensus protocol for fault tolerance. The data plane only needs to focus on lock mutual exclusion and lock-acquisition efficiency, which gives it a big advantage in acquisition throughput and horizontal scalability.
The analysis of the DLM's locking, unlocking, and membership change processes shows a fairly clean decoupling between the control plane and the data plane, even though the implementation is very complex, especially the failover recovery logic. The idea is nonetheless excellent and worth borrowing in our own architecture designs. In particular, unlike the 1980s when the DLM originated, the industry now has consensus protocols such as Paxos, Raft, and EPaxos, along with coordination services built on them such as ZooKeeper and etcd; the control plane of our distributed lock manager can simply use these mature third-party components.
4. Best Practices
Alibaba Cloud's storage department has one of the world's most complete storage product lines, ranging from block storage to file storage, object storage, log storage, and table storage. Figure 10 shows the very general partition-scheduling system architecture used by the current storage products. The whole business system is divided into a control plane and a data plane. The data plane splits the user's storage space into several partitions according to certain rules; at runtime each partition is assigned to one server, and one server can load multiple partitions at the same time. A partition does not use the native file system for persistence: all of its data is stored under a specific directory in the Pangu distributed file system. Under this partition-scheduling model, when a server goes down, the partitions it was serving must be rescheduled, migrating quickly to other healthy servers to continue serving.
Figure 10 Cloud storage based on Pangu + Nuwa's general partition scheduling design framework
In the cloud storage partition-scheduling model, mutually exclusive access to partition resources (that is, any partition must be loaded by at most one server at any time to provide read and write services) is the cornerstone of the storage system's data consistency and must be guaranteed. In practice, cloud storage's best practice follows a design philosophy similar to the DLM's: it separates out the fault tolerance problem of the distributed lock manager, delegating it to the leader election function provided by Nuwa, the distributed coordination service of the Feitian platform, and can then focus on the scheduling strategy for distributed lock resources:
1) The control-plane scheduler is responsible for the mutually exclusive allocation of partition resources. Combined with the particular needs of different storage services, different scheduling strategies can evolve, with targeted optimizations along dimensions such as lock balance, lock-acquisition efficiency, and lock-switching accuracy;
2) Fault tolerance, the most complex capability in a distributed lock manager, is delegated to Nuwa's leader election function, and Nuwa's service discovery allows control nodes to go online and offline smoothly;
3) The storage system's data ultimately resides in Pangu, the distributed file system of the Feitian platform. Everything from partition data to management and scheduling metadata is placed in Pangu, which provides highly reliable, high-performance storage along with the Fencing protection needed to guarantee data consistency.
Figure 11 Fencing protection based on Pangu distributed file system
The core of the Fencing protection offered by distributed locks is to carry a token check when accessing shared resources. As the unified storage base, Pangu implements a comparable IO fencing capability by introducing a special InlineFile file type together with the SealFile operation: a) the SealFile operation closes open files, preventing the lock's old holder from continuing to write data; b) each partition gets an InlineFile, and a CAS check on the InlineFile is associated with Pangu's file metadata operations, preventing the lock's old holder from opening new files. As shown in Figure 11, the combination of these two functions effectively provides token-check support for writes in the storage system.
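The SealFile/InlineFile APIs are internal to Pangu, but the CAS-guarded metadata check in point b) can be illustrated with a sketch (all names here are hypothetical, not the real Pangu interface):

```python
class PartitionMeta:
    """Hypothetical sketch of the InlineFile CAS idea: each partition
    carries an epoch, and metadata operations such as opening a new
    file succeed only if the caller's epoch matches (compare-and-swap)."""
    def __init__(self):
        self.epoch = 1

    def bump_epoch(self):
        # The new partition owner seals old files and advances the epoch.
        self.epoch += 1
        return self.epoch

    def open_file_cas(self, expected_epoch):
        # CAS check: an old owner holding a stale epoch is rejected.
        return expected_epoch == self.epoch

meta = PartitionMeta()
old_owner_epoch = meta.epoch          # original server loads the partition
new_owner_epoch = meta.bump_epoch()   # failover: new server takes over
assert meta.open_file_cas(new_owner_epoch) is True
assert meta.open_file_cas(old_owner_epoch) is False  # stale owner cannot open new files
```

Sealing existing files blocks late writes to data already open, while the CAS on metadata blocks the old owner from creating new files; together they close both write paths for a fenced-off server.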
We have seen that in cloud storage's DLM implementation there is a general partition-based scheduler, Nuwa provides fault tolerance, and Pangu provides Fencing protection for resources. This is cloud storage's best practice.
Distributed locks provide mutually exclusive access to shared resources in a distributed environment and are widely used in distributed systems. Starting from the essential properties of distributed locks, this article discussed their model design, examined the architectural decoupling of the control plane and the data plane, and introduced the best practices of distributed locks in Alibaba Cloud storage scenarios. We hope this sharing is helpful to our readers.
Knowledge Base Team