Design principles of NFS file lock consistency

File lock
File locks are one of the most basic file system features. They let an application control concurrent access to a file by other applications. NFS, the standard network file system on UNIX-like systems, only gained native file lock support as it evolved (starting with NFSv4). Since its birth in the 1980s, NFS has been released in three versions: NFSv2, NFSv3, and NFSv4. The biggest change in NFSv4 is that it is stateful: certain operations, such as file locks, require the server to maintain related state. For example, when a client acquires a file lock, the server must record that lock; otherwise it cannot detect conflicting accesses from other clients. NFSv3 depends on the separate NLM protocol to provide file locks, and the two protocols do not always coordinate well, which makes errors likely. NFSv4 is designed as a stateful protocol and implements file locking itself, so it no longer needs NLM.

Application interface
Applications can manage NFS file locks through the fcntl() or flock() system calls. The following is the call path for acquiring a file lock when the NAS is mounted over NFSv4:

The call stack shows that the NFS file lock implementation largely reuses the design and data structures of the VFS layer. After the client successfully obtains the file lock from the server through RPC, it calls locks_lock_inode_wait() to hand the lock over to the VFS layer for management. The design of VFS-layer file locks is well documented elsewhere, so it is not described here.
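As an illustration of the application-side interface described above, here is a minimal sketch of taking and releasing a whole-file lock with fcntl(). The helper names set_whole_file_lock() and with_exclusive_lock() are hypothetical and chosen for this example; nothing in it is specific to NFS, since the same calls are used on any POSIX file system.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Apply one whole-file lock operation (F_RDLCK, F_WRLCK or F_UNLCK)
 * without blocking. Returns 0 on success, -1 with errno set otherwise. */
static int set_whole_file_lock(int fd, short type)
{
    struct flock fl = {0};

    assert(fd >= 0);
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* a length of 0 means "through end of file" */
    return fcntl(fd, F_SETLK, &fl);
}

/* Hypothetical helper: open the file, take an exclusive lock,
 * run the critical section, then unlock. Returns 0 on success. */
int with_exclusive_lock(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    int rc;

    if (fd < 0)
        return -1;
    if (set_whole_file_lock(fd, F_WRLCK) != 0) {
        close(fd);              /* another process holds a conflicting lock */
        return -1;
    }
    /* ... critical section: read or modify the file here ... */
    rc = set_whole_file_lock(fd, F_UNLCK);
    close(fd);
    return rc;
}
```

F_SETLK fails immediately when a conflicting lock exists; F_SETLKW would block instead, which is the case where the wait can be interrupted by a signal.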

Principle of EOS
A file lock operation is a typical non-idempotent operation: retries and failover can leave the client and the server with inconsistent views of the file lock state. NFSv4 uses a seqid mechanism to guarantee that each such operation is executed at most once. It works as follows:

For each open/lock state, the client and the server each maintain a seqid independently. When the client initiates an operation that changes state (open/close/lock/unlock/release_lockowner), it increments its seqid by 1 and sends it to the server as a parameter. Assuming the seqid sent by the client is R and the seqid maintained by the server is L, then:
1) If R == L + 1, the request is legal and is processed normally;
2) If R == L, the request is a retry, and the server simply returns the cached reply;
3) In all other cases, the request is illegal and the server rejects it.
With these rules, the server can tell whether an operation is a normal request, a retry, or an illegal request.
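The three rules above can be sketched as a small server-side check. This is an illustrative model, not code from any NFS server: the type and function names (lock_state, classify_seqid, handle_request, counting_op) are hypothetical, the cached reply is reduced to a single int, and letting the 32-bit seqid wrap naturally is an assumption of the sketch.

```c
#include <assert.h>
#include <stdint.h>

enum seqid_verdict { SEQID_NEW, SEQID_RETRY, SEQID_BAD };

/* Per open/lock state kept by the (hypothetical) server. */
struct lock_state {
    uint32_t seqid;         /* L: seqid of the last executed request */
    int      cached_reply;  /* reply of the last executed request */
};

/* Apply the R/L rules; unsigned arithmetic wraps (assumption). */
enum seqid_verdict classify_seqid(uint32_t server_seqid, uint32_t request_seqid)
{
    if (request_seqid == (uint32_t)(server_seqid + 1))
        return SEQID_NEW;
    if (request_seqid == server_seqid)
        return SEQID_RETRY;
    return SEQID_BAD;
}

/* Stand-in for a real state-changing operation; counts its executions. */
static int op_runs;
static int counting_op(void)
{
    return ++op_runs;
}

/* Execute op at most once per seqid. Returns 0 and fills *reply for a
 * new request or a retry, -1 for an illegal seqid. */
int handle_request(struct lock_state *st, uint32_t request_seqid,
                   int (*op)(void), int *reply)
{
    assert(st && op && reply);
    switch (classify_seqid(st->seqid, request_seqid)) {
    case SEQID_NEW:
        st->cached_reply = op();    /* executed exactly here, once */
        st->seqid = request_seqid;
        *reply = st->cached_reply;
        return 0;
    case SEQID_RETRY:
        *reply = st->cached_reply;  /* replay without re-executing */
        return 0;
    default:
        return -1;
    }
}
```

Calling handle_request() twice with the same seqid returns the same reply while the operation itself runs only once, which is the at-most-once property the text describes.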

This ensures that each file lock operation is executed at most once on the server, which solves the problem of repeated execution caused by RPC retries. On its own, however, it is not enough. For example, suppose the calling thread is interrupted by a signal after the LOCK operation has been sent, and the server then receives and executes the LOCK successfully. The server records that the client holds the lock, but the client, having been interrupted, does not track it, so the client and the server end up with inconsistent views of the lock state. The client therefore also has to cooperate in handling abnormal scenarios so that the file lock views eventually become consistent.

Exception handling
As the previous section shows, the client must cooperate in handling abnormal scenarios to keep the file lock view consistent. What does the client design do to achieve this? Currently, the client addresses the problem at two levels that work together: SunRPC and the NFS protocol. The following sections describe how the design at each level helps keep the file lock state view consistent.

SunRPC design
SunRPC is a network communication protocol designed by Sun for remote procedure calls. Here we look at its implementation-level design from the perspective of keeping the file lock view consistent:
1) The client uses a 32-bit xid to identify each remote procedure call initiated by the upper layer, and all RPC retries of that call reuse the same xid. A reply to any of the retries can therefore tell the upper layer that the call has completed, so the caller can obtain the result even if the call takes a long time to execute. By comparison, in frameworks such as netty/mina/brpc each RPC carries its own independent xid/packetid;
2) The server maintains a DRC (duplicate request cache) that caches the results of recently executed RPCs. When an RPC arrives, the server first looks it up in the DRC by xid. A hit means the RPC is a retry, and the cached result is returned directly, which to a certain extent avoids repeated execution caused by RPC retries. To keep xid reuse from making the DRC return unexpected results, the following measures further reduce the probability of such errors:
a) When the client establishes a new connection, it starts from a random initial xid;
b) The server-side DRC additionally records verification information about the request, which is checked as well when the cache is hit;
3) Until it receives a response from the server, the client is allowed to retry indefinitely, so that the caller obtains a definite execution result from the server. The downside of this strategy is that the caller hangs for as long as there is no response;
4) NFS lets users choose the SunRPC retry strategy through the soft/hard mount option. In soft mode the client stops retrying once the request times out, while in hard mode it keeps retrying. With a soft mount, the NFS implementation does not guarantee that the client and server state views stay consistent: when a remote procedure call returns a timeout, the application is expected to help clean up and recover the state, for example by closing the files that hit access errors. In practice few applications do so, which is why NAS users generally mount in hard mode.
In short, one of the core problems SunRPC has to solve is that the execution time of a remote procedure call is uncontrollable. The protocol designers tailored the design to this and tried to avoid the side effects of retrying non-idempotent RPCs.
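To make the duplicate request cache from point 2 concrete, here is a minimal sketch of a direct-mapped cache keyed by xid. It is a simplified model under stated assumptions, not the cache of any real NFS server: the names (drc, drc_entry, drc_lookup, drc_store) are hypothetical, the "verification information" is assumed to be a single request checksum, and the cached reply is reduced to one int.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DRC_SLOTS 64

struct drc_entry {
    int      in_use;
    uint32_t xid;       /* transaction id shared by all retries of a call */
    uint32_t csum;      /* verification info: assumed request checksum */
    int      reply;     /* cached reply (stand-in for the real result) */
};

struct drc {
    struct drc_entry slot[DRC_SLOTS];
};

void drc_init(struct drc *c)
{
    assert(c);
    memset(c, 0, sizeof(*c));
}

/* Returns 1 and fills *reply when the request is a known retry.
 * A matching xid with a different checksum is treated as a miss, so a
 * reused xid does not replay an unrelated reply. */
int drc_lookup(const struct drc *c, uint32_t xid, uint32_t csum, int *reply)
{
    const struct drc_entry *e = &c->slot[xid % DRC_SLOTS];

    if (e->in_use && e->xid == xid && e->csum == csum) {
        *reply = e->reply;
        return 1;
    }
    return 0;
}

/* Record the reply of a freshly executed request, evicting whatever
 * previously occupied the slot. */
void drc_store(struct drc *c, uint32_t xid, uint32_t csum, int reply)
{
    struct drc_entry *e = &c->slot[xid % DRC_SLOTS];

    e->in_use = 1;
    e->xid = xid;
    e->csum = csum;
    e->reply = reply;
}
```

A server loop would call drc_lookup() first and only execute the request and call drc_store() on a miss. Because entries can be evicted, a cache like this lowers the chance of re-execution but cannot rule it out, which matches the "to a certain extent" wording in the text.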

Signal interruption
An application can be interrupted by a signal while it waits for the result of a remote procedure call. When that happens, the result of the call is never obtained, so the client and server states may become inconsistent. For example, the lock operation may have succeeded on the server without the client knowing. The client then has to do extra work to bring its state back in line with the server. The following briefly analyzes what happens when acquiring a file lock is interrupted by a signal, to illustrate how the NFS protocol implementation is designed for consistency.
The NFSv4 lock acquisition path shows that NFSv4 eventually calls _nfs4_do_setlk() to issue the RPC, and then calls nfs4_wait_for_completion_rpc_task() to wait for it. The relevant code is:

static int _nfs4_do_setlk(struct nfs4_state *state, int cmd, struct file_lock *fl, int recovery_type)
{
	/* ... */
	task = rpc_run_task(&task_setup_data);
	if (IS_ERR(task))
		return PTR_ERR(task);
	/* Wait for the LOCK RPC; a nonzero return means the wait was interrupted. */
	ret = nfs4_wait_for_completion_rpc_task(task);
	if (ret == 0) {
		ret = data->rpc_status;
		if (ret)
			nfs4_handle_setlk_error(data->server, data->lsp,
					data->arg.new_lock_owner, ret);
	} else
		data->cancelled = 1;	/* remembered for nfs4_lock_release() */
	/* ... */

The implementation of nfs4_wait_for_completion_rpc_task() shows that ret < 0 means the lock acquisition was interrupted by a signal; this is recorded in the cancelled member of struct nfs4_lockdata. Next, look at nfs4_lock_release(), the callback invoked when the rpc_task is released after completion:

When nfs4_lock_release() detects that a signal interruption occurred, it calls nfs4_do_unlck() to try to release the file lock that may have been acquired successfully. Note that in this path nfs_free_seqid() is not called to release the held nfs_seqid. There are two reasons:
1) It ensures that the user cannot start concurrent lock or unlock operations while the state is being corrected, which simplifies the implementation;
2) It ensures that, in hard mode, the UNLOCK operation is sent only after the LOCK operation has returned, so that a lock that was acquired is guaranteed to be released.
In this way the client can effectively guarantee that the lock states of the client and the server eventually become consistent after a signal interruption, at the cost of some availability.
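The interaction between the interrupted waiter and the release callback can be modeled in a few lines of user-space C. This is a deliberately simplified sketch of the flow described above, not the kernel code: struct lock_call and both functions are hypothetical, the signal is passed in as a plain flag, and "sending UNLOCK" is reduced to setting a field.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-call lock data. */
struct lock_call {
    bool cancelled;    /* waiter was interrupted by a signal */
    int  rpc_status;   /* 0 when the server granted the lock */
    bool seqid_held;   /* seqid still reserved, blocking other lock ops */
    bool unlock_sent;  /* a compensating UNLOCK was issued */
};

/* Waiter side: on interruption, only record the fact and return. */
int wait_for_lock(struct lock_call *call, bool interrupted)
{
    assert(call);
    if (interrupted) {
        call->cancelled = true;
        return -1;
    }
    return call->rpc_status;
}

/* Release callback: runs after the LOCK reply arrived, even when the
 * waiter is already gone. */
void lock_release(struct lock_call *call)
{
    assert(call);
    if (call->cancelled) {
        /* The server may hold a lock the client does not track: undo
         * it. The seqid stays held and passes to the UNLOCK, so no
         * other lock operation can slip in between. */
        call->unlock_sent = true;
    } else {
        call->seqid_held = false;   /* normal path: free the seqid */
    }
}
```

In the interrupted case the model ends with unlock_sent set and seqid_held still true, which corresponds to the two reasons listed above: concurrent lock operations are kept out, and the UNLOCK follows the LOCK reply.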

File locking is a basic feature natively supported by file systems. As a shared file system, NAS has to deal with keeping the lock state views of the client and the server consistent, and NFSv4.0 solves this problem to a certain extent. Technology will keep advancing and NFS will keep iterating, so there is more to expect from NFS in the future.
