By Kan Junbao (Junbao), Senior Technical Expert at Alibaba
Container storage is a basic Kubernetes component that provides data persistence and is an important guarantee for stateful services. By default, Kubernetes provides the mainstream In-Tree volume access solution and the Out-of-Tree plug-in mechanism, which allows third-party storage services to be integrated with Kubernetes. This article describes the Kubernetes storage architecture and the principles and implementation of volume plug-ins.
The following example shows how to mount a volume.
As shown in the following figure, the YAML template on the left defines a StatefulSet application. In the template, the volume named disk-pvc is defined and mounted to the /data directory in the pod. disk-pvc is a persistent volume claim (PVC) volume where a storageClassName is defined.
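A minimal sketch of what such a template might look like is shown below; the application image, the csi-disk storage class name, and the requested size are illustrative, not values from the original figure.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web
  replicas: 1
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx            # illustrative application image
        volumeMounts:
        - name: disk-pvc        # mount the volume into /data in the pod
          mountPath: /data
  volumeClaimTemplates:         # a PVC template; storageClassName makes this a dynamic volume
  - metadata:
      name: disk-pvc
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: csi-disk   # illustrative storage class name
      resources:
        requests:
          storage: 20Gi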
Therefore, this template is a typical dynamic storage template. The right part of the preceding figure shows the process of mounting the volume in six steps.
The PV controller tries to find a suitable PV in the cluster. If no suitable PV is found, the volume plug-in is called for provisioning. Provisioning creates a volume from a specific remote storage medium, creates a PV in the cluster, and then binds the PV to the PVC.
When a pod is running, the scheduler must select a node. The scheduler performs scheduling based on multiple criteria, such as the nodeSelector and nodeAffinity defined in the pod and the labels defined on the volume.
You can add labels to the volume. In this way, pods using this PV are scheduled by the scheduler to the expected nodes based on the labels.
This section describes the storage architecture of Kubernetes.
The PV controller, AD controller, and volume manager call operations, while volume plug-ins implement the operations.
Next, we will describe the functions of each component.
The following describes several basic concepts:
A persistent volume (PV) defines the parameters of pre-attached storage space in detail. For example, when we attach a remote Network Attached Storage (NAS) file system, the parameters of the NAS must be defined in a PV. A PV is not restricted to a namespace and is generally created and maintained by the administrator.
A persistent volume claim (PVC) is the storage interface called by users in a namespace. It does not perceive storage details and only defines basic storage parameters such as Size and AccessMode.
A StorageClass defines a template based on which a PV is created for a dynamic volume. The template contains the parameters required for creating the PV and the provisioner that creates it.
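As a rough sketch, a statically created NAS PV and a StorageClass for dynamic provisioning might look like the following; the server address, provisioner name, and parameters are placeholders rather than values from this article.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nas-pv
spec:
  capacity:
    storage: 5Gi
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:                                        # parameters of the remote NAS file system
    server: 0f1234.cn-shenzhen.example.com    # placeholder NAS mount target
    path: /share
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-disk                              # referenced by a PVC through storageClassName
provisioner: diskplugin.csi.alibabacloud.com  # the plug-in that provisions the PV
parameters:
  type: cloud_ssd                             # illustrative provisioner-specific parameter
reclaimPolicy: Delete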
The PV controller completes lifecycle management of PVs and PVCs, such as creating and deleting PVs and changing the PV and PVC status. In addition, the PV controller binds PVCs and PVs. A PVC is available only after being bound to a PV. One PVC can be bound to only one PV, and one PV can be bound to only one PVC.
The following figure shows the status changes of a PV.
After a PV is created, it is in the Available state. When a PVC is bound to the PV, the PV enters the Bound state. After the PVC is deleted, the PV enters the Released state.
A PV in the Released state determines whether to enter the Available or Deleted state according to the ReclaimPolicy field. If the ReclaimPolicy field is set to Recycle, the PV enters the Available state. If the status change fails, it enters the Failed state.
The following figure shows the status changes of a PVC, which are simpler.
A created PVC is in the Pending state. After being bound to a PV, the PVC enters the Bound state. When the PV bound to the PVC is deleted, the PVC enters the Lost state. If a PV is created and bound to the PVC in the Lost state, the PVC enters the Bound state.
The following figure shows how to select a PV to be bound to a PVC.
The StorageClass specified by the storageClassName field creates a PV when no existing PV matches the PVC. When a PV meeting the conditions is available, it is bound to the PVC directly.
AccessMode, with values such as ReadWriteOnce and ReadWriteMany, is usually defined in the PVC. The PVC and the PV to be bound must have the same AccessMode.
The size requested by a PVC cannot exceed the capacity of the PV, because the PVC is a declared volume and can be bound only when the actual volume is at least as large as the declared size.
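For example, a PVC sketched as follows can be bound only to a PV that has the same storage class and access mode and a capacity of at least 20Gi; the storage class name is illustrative.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: disk-pvc
spec:
  storageClassName: csi-disk    # must match the PV (illustrative name)
  accessModes:
  - ReadWriteOnce               # the PV must offer the same access mode
  resources:
    requests:
      storage: 20Gi             # the PV capacity must be at least 20Gi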
Next, let's take a look at the PV controller implementation.
The PV controller can be implemented by the ClaimWorker and VolumeWorker logic.
ClaimWorker implements PVC status changes. The system annotation pv.kubernetes.io/bind-completed is used to identify whether a PVC has been bound.
To bind a PVC, we need to filter all PVs in the cluster: findBestMatch filters them based on the preceding five binding conditions. If a matching PV is found, it is bound to the PVC; otherwise, a PV is created.
VolumeWorker implements PV status changes.
The ClaimRef field in the PV is used to determine the PV status. If the field is empty, the PV is in the Available state and we only need to synchronize the status. If the field is not empty, we can find a PVC in the cluster based on it. If the PVC exists, the PV is in the Bound state and we synchronize the status. If the PVC does not exist, the PV is in the Released state because its bound PVC has been deleted. Then, we determine whether to delete the volume or synchronize the status based on ReclaimPolicy.
That is the simple implementation logic for the PV controller.
The name AD controller is short for the attach and detach controller.
It has two core objects: desiredStateOfWorld and actualStateOfWorld.
The AD controller also has two core logic components: desiredStateOfWorldPopulator and reconciler.
The following table provides an example of desiredStateOfWorld and actualStateOfWorld objects.
The following figure shows a logical diagram of AD controller implementation.
The AD controller contains many informers, which synchronize the pod statuses, PV statuses, node statuses, and PVC statuses in the cluster to the local machine.
During initialization, populateDesiredStateOfWorld and populateActualStateOfWorld are called to initialize desiredStateOfWorld and actualStateOfWorld.
During execution, desiredStateOfWorldPopulator synchronizes data statuses in the cluster to desiredStateOfWorld. reconciler synchronizes the data of actualStateOfWorld and desiredStateOfWorld in polling mode. During synchronization, it calls the volume plug-in for attach and detach operations and also calls nodeStatusUpdater to update the node status.
This is the simple implementation logic of the AD controller.
The volume manager is one of the managers in the kubelet. It is used to attach, detach, mount, and unmount volumes on the local node.
Like the AD controller, it contains desiredStateOfWorld and actualStateOfWorld. It also contains volumePluginManager, which manages plug-ins on nodes. Its core logic is similar to that of the AD controller. It synchronizes data through desiredStateOfWorldPopulator and calls APIs through reconciler.
The following describes the attach and detach operations:
As mentioned above, the AD controller also performs the attach and detach operations. We can use the --enable-controller-attach-detach
flag to specify whether the AD controller or the volume manager performs them. If the value is true, the AD controller performs the operations; if the value is false, the volume manager performs them.
This is a kubelet flag, so it only defines the behavior of a single node. Therefore, if a cluster contains 10 nodes and the flag is set to false on five of them, the kubelet on those five nodes performs the attach and detach operations, while the AD controller performs them for the other five nodes.
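For reference, a minimal sketch of the two ways to set this on a node; the command-line flag and the enableControllerAttachDetach field in the kubelet configuration file control the same behavior.

# On the kubelet command line: --enable-controller-attach-detach=false
# Or in the KubeletConfiguration file read by the kubelet:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
enableControllerAttachDetach: false   # the volume manager on this node performs attach and detach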
The following figure shows the implementation logic of the volume manager.
The outermost layer is a loop while the inner layer is a poll based on different objects, including the desiredStateOfWorld and actualStateOfWorld objects.
For example, the actualStateOfWorld.MountedVolumes object is polled. If a volume of the object also exists in desiredStateOfWorld, the actual and desired states agree that the volume is mounted, so we do not have to do anything. If the volume does not exist in desiredStateOfWorld, the volume is expected to be in the Unmounted state. In this case, UnmountVolume is run to change its status to that in desiredStateOfWorld.
In this process, underlying APIs are called to perform the corresponding operations based on the comparison between desiredStateOfWorld and actualStateOfWorld. The desiredStateOfWorld.UnmountVolumes and actualStateOfWorld.AttachedVolumes operations are similar.
The PV controller, AD controller, and volume manager mentioned above manage PVs and PVCs by calling volume plug-in APIs, such as Provision, Delete, Attach, and Detach. The implementation logic of these APIs is in volume plug-ins.
Volume plug-ins can be divided into the In-Tree and Out-of-Tree plug-ins based on the source code locations.
Volume plug-ins are the libraries called by the PV controller, AD controller, and volume manager. They are divided into In-Tree and Out-of-Tree plug-ins. The volume plug-in calls remote storage based on these implementations and mounts a remote storage to a local device. For example, the mount -t nfs ***
command for mounting a NAS file system is implemented in a volume plug-in.
Volume plug-ins can be divided into many types. In-Tree plug-ins contain dozens of common storage implementations. However, some companies define their own types and have their own APIs and parameters, which are not supported by common storage plug-ins. In this case, Out-of-Tree storage implementations need to be used, such as CSI and FlexVolume.
We will discuss the specific implementations of volume plug-ins later. Here, let's take a look at the management of volume plug-ins.
Kubernetes manages plug-ins in the PV controller, AD controller, and volume manager by using VolumePluginMgr, mainly with the plugins and prober data structures.
Plugins is an object used to save the plug-in list, while prober is a probe used to discover new plug-ins. For example, FlexVolume and CSI are extension plug-ins, which are dynamically created and generated and can be discovered only by a probe.
The following figure shows the entire process of plug-in management.
When the PV controller, AD controller, or volume manager starts, the InitPlugins method is executed to initialize VolumePluginMgr.
All In-Tree plug-ins are added to the plug-in list and the init method of the prober is called. This method first calls initWatcher to constantly check a directory (such as /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ in the figure). When a new plug-in is generated in the directory, a new FsNotify.Create event is generated and added to EventsMap. Likewise, if a plug-in is deleted from the directory, an FsNotify.Remove event is generated and added to EventsMap.
When the upper layer calls refreshProbedPlugins, the prober updates the plug-in list: for an FsNotify.Create event, the new plug-in is added to the list; for an FsNotify.Remove event, the plug-in is deleted from the list.
Now you understand the plug-in management mechanism of volume plug-ins.
Pods must be scheduled to a worker node before running. When scheduling pods, we use different predicates for filtering, including some volume-related predicates, such as VolumeZonePredicate, VolumeBindingPredicate, and CSIMaxVolumeLimitPredicate.
VolumeZonePredicate checks the labels on PVs, such as the failure-domain.beta.kubernetes.io/zone label. If the label defines zone information, VolumeZonePredicate determines that only the nodes in the corresponding zone can be scheduled.
In the left part of the following figure, a label defines the zone cn-shenzhen-a. In the right part of the following figure, nodeAffinity is defined for the PV and contains the label of the node the PV expects; this is evaluated by VolumeBindingPredicate.
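Putting the two together, a PV carrying both a zone label and nodeAffinity might be sketched as follows; the driver name, volume ID, and zone value are illustrative.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: disk-pv
  labels:
    failure-domain.beta.kubernetes.io/zone: cn-shenzhen-a   # checked by VolumeZonePredicate
spec:
  capacity:
    storage: 20Gi
  accessModes:
  - ReadWriteOnce
  nodeAffinity:                 # checked by VolumeBindingPredicate
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: failure-domain.beta.kubernetes.io/zone
          operator: In
          values:
          - cn-shenzhen-a
  csi:
    driver: diskplugin.csi.alibabacloud.com   # illustrative CSI driver
    volumeHandle: d-8vb4fflsonz21h31cmss      # illustrative disk ID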
For more information about how to schedule volumes, see Getting Started with Kubernetes - App Storage and Persistent Volumes: Snapshot Storage and Topology Scheduling.
FlexVolume is an extension of volume plug-ins. It implements the attach, detach, mount, and unmount operations. These operations were originally implemented by volume plug-ins. However, for some storage types, we need to extend them and implement the operations outside volume plug-ins.
As shown in the following figure, volume plug-ins contain some FlexVolume implementation code. However, the code has only the proxy function.
For example, when calling the attach operation of a plug-in, the AD controller first calls the attach operation of FlexVolume in the volume plug-in. However, this operation only transfers the call to the Out-Of-Tree implementation of the FlexVolume.
FlexVolume is an executable file invoked by the kubelet. Each call is equivalent to running a shell script, similar to executing ls; it is not a resident-memory daemon.
FlexVolume returns its call result to the kubelet through stdout in JSON format.
The default installation path of a FlexVolume driver is /usr/libexec/kubernetes/kubelet-plugins/volume/exec/alicloud~disk/disk.
The following figure shows a command format and call example.
FlexVolume provides the following APIs:
Not all of the preceding operations must be implemented. If an operation is not implemented, define the result as follows to notify the callers that the operation was not implemented:
{
"status": "Not supported",
"message": "error message"
}
In addition to serving as a proxy, the API provided by FlexVolume in a volume plug-in also provides some default implementations, such as the mount operation. Therefore, if this API is not defined in your FlexVolume, the default implementation will be called.
When defining a PV, we can use the secretRef field to define certain secret functions. For example, secretRef can transfer in the username and password required for mounting.
Let's take a look into the mounting and unmounting processes of FlexVolume.
First, the Attach operation calls a remote API to attach the storage to a device on the target node. Then, the MountDevice operation mounts the local device to the GlobalPath and performs certain operations, such as formatting. The Mount operation (SetUp) mounts the GlobalPath to the PodPath, which is a directory mapped upon pod startup.
For example, assume the volume ID of a disk is d-8vb4fflsonz21h31cmss. After the Attach and WaitForAttach operations are completed, the disk appears as the device /dev/vdc on the target node. After the MountDevice operation is performed, the device is formatted and mounted to a local GlobalPath. After the Mount operation is performed, the GlobalPath is mapped to a pod-related subdirectory. Finally, the Bind operation maps the local directory to the pod. This completes the mounting process.
The unmounting process is an inverse process. The preceding describes the process of mounting a block device. Only the Mount operation needs to be performed for file storage. Therefore, only the Mount and Unmount operations need to be performed to implement FlexVolume for the file system.
In a typical FlexVolume driver script, the init(), doMount(), and doUnmount() methods are implemented, and the input parameters passed when the script is executed determine which command to run.
GitHub provides many FlexVolume examples for your reference. Alibaba Cloud provides a FlexVolume Implementation for reference.
The following figure shows a PV template of the FlexVolume type. It is similar to other templates, except that the type is defined as flexVolume. In flexVolume, driver, fsType, and options are defined.
Similar to other types, we can define filtering conditions by using matchLabels in selector. We can also define some scheduling information, such as defining zone as cn-shenzhen-a.
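A sketch of such a flexVolume PV, with an illustrative driver name and disk ID:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: flexvolume-disk-pv
  labels:
    failure-domain.beta.kubernetes.io/zone: cn-shenzhen-a   # scheduling label, as described earlier
spec:
  capacity:
    storage: 20Gi
  accessModes:
  - ReadWriteOnce
  flexVolume:
    driver: alicloud/disk       # resolves to .../volume/exec/alicloud~disk/disk on the node
    fsType: ext4
    options:
      volumeId: d-8vb4fflsonz21h31cmss   # illustrative disk ID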
The following figure shows a detailed running result. A disk in /dev/vdb is attached to the pod. We can run the mount | grep disk command to view the mounting directory. It mounts /dev/vdb to the GlobalPath, runs the mount command to mount the GlobalPath to the local subdirectory defined in the pod, and then maps the local subdirectory to /data.
Similar to FlexVolume, CSI is an abstract interface that provides volumes for third-party storage.
Why is CSI required since we can use FlexVolume?
FlexVolume is only used by the Kubernetes orchestration system, while CSI can be used by different orchestration systems, such as Mesos and Swarm.
In addition, CSI adopts container-based deployment, which is less environment-dependent and more secure, and provides rich plug-in functionality. FlexVolume is a binary file in the host space: running it is equivalent to running a local shell command, and installing it may require installing dependencies on the host, which can affect customer applications. FlexVolume is therefore weaker in terms of security and environment dependencies.
When implementing operators in the Kubernetes ecosystem, we often use role-based access control (RBAC) to call Kubernetes APIs and implement certain functions in containers. These functions cannot be implemented in the FlexVolume environment because it is a binary program in the host space, but CSI, which runs in containers, can implement them through RBAC.
CSI includes the CSI controller server and the CSI node server.
The following figure shows the CSI communication process. The CSI controller server and external CSI sidecar communicate with each other through a UNIX Socket, while the CSI node server and the kubelet communicate with each other through a UNIX Socket.
The following table lists the CSI APIs. The APIs are divided into general management APIs, node management APIs, and central management APIs.
CSI is implemented in the form of CRDs. It introduces the following object types: VolumeAttachment, CSINode, and CSIDriver, and is itself implemented as a CSI controller server and a CSI node server.
The AD controller and the CSI volume plug-in work in the same way as in the Kubernetes storage architecture described earlier, and together they create the VolumeAttachment object.
In addition, the CSI controller server contains multiple external plug-ins, each implementing a certain function when combined with a CSI plug-in. For example:
On the node side, the kubelet components, including the volume manager and volume plug-in, call the CSI node server for mount and unmount operations. The driver registrar component implements the registration function for CSI plug-ins.
This is the overall topology of CSI. The following sections describe different objects and components.
Now, we will introduce the VolumeAttachment, CSIDriver, and CSINode objects.
VolumeAttachment describes information about mounting and unmounting a volume in a pod. For example, if a volume is mounted to a node, we use VolumeAttachment to track it. The AD controller creates a VolumeAttachment, while external-attacher mounts or unmounts volumes according to the status of the VolumeAttachment.
The following figure shows an example of a VolumeAttachment. Its kind is set to VolumeAttachment. In spec, attacher is set to ossplugin.csi.alibabacloud.com, specifying which plug-in performs the attach; nodeName is set to cn-zhangjiakou.192.168.1.53, specifying the node to attach to; and persistentVolumeName under source is set to oss-csi-pv, specifying the volume to attach or detach.
In status, attached specifies the attachment status. If the value is false, external-attacher performs an attach operation.
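Reconstructed from those values, the object is roughly the following; the object name is illustrative.

apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  name: csi-oss-attachment                   # illustrative name
spec:
  attacher: ossplugin.csi.alibabacloud.com   # the plug-in that performs the attach
  nodeName: cn-zhangjiakou.192.168.1.53      # the node to attach to
  source:
    persistentVolumeName: oss-csi-pv         # the volume to attach or detach
status:
  attached: false                            # external-attacher attaches the volume when this is false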
CSIDriver describes the list of CSI plug-ins deployed in the cluster. These plug-ins must be created by the administrator based on the plug-in type.
For example, some CSIDrivers are created, as shown in the following figure. We can run the kubectl get csidriver command to view the three types of CSIDrivers: disk, Apsara File Storage NAS (NAS), and Object Storage Service (OSS).
The CSIDriver name is defined, and the attachRequired and podInfoOnMount fields are defined in spec.
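A sketch of such a CSIDriver object, with illustrative values for the two fields:

apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: diskplugin.csi.alibabacloud.com   # the driver name
spec:
  attachRequired: true          # whether an attach step (VolumeAttachment) is required
  podInfoOnMount: true          # whether pod information is passed to the plug-in on mount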
CSINode is the node information in a cluster and is created upon node-driver-registrar startup. CSINode information is added to the CSINode list each time a new CSI plug-in is registered.
As shown in the following figure, the CSINode list is defined, and each CSINode has its specific information (YAML file on the left). Take cn-zhangjiakou.192.168.1.49 as an example. It contains the disk CSIDriver and the NAS CSIDriver. Each CSIDriver has its nodeID and topologyKeys. If no topology information is available, set topologyKeys to null. In this way, if a cluster contains 10 nodes, we can define CSINodes only for certain nodes.
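A sketch of one CSINode, with illustrative node IDs, driver names, and topology keys:

apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  name: cn-zhangjiakou.192.168.1.49
spec:
  drivers:
  - name: diskplugin.csi.alibabacloud.com
    nodeID: i-8vb1234567890                             # illustrative instance ID
    topologyKeys:
    - topology.diskplugin.csi.alibabacloud.com/zone     # illustrative topology key
  - name: nasplugin.csi.alibabacloud.com                # illustrative NAS driver name
    nodeID: cn-zhangjiakou.192.168.1.49
    topologyKeys: null                                  # no topology information for this driver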
Node-driver-registrar implements CSI plug-in registration, as shown in the following figure.
After being started, node-driver-registrar calls the GetPluginInfo API of the CSI plug-in, which returns the listening address of the CSI plug-in and its driver name.
Node-driver-registrar then creates a socket file, such as diskplugin.csi.alibabacloud.com-reg.sock, in the /var/lib/kubelet/plugins_registry
directory. The kubelet discovers this socket through a watcher and calls the GetInfo API of node-driver-registrar through the socket. GetInfo returns the obtained CSI plug-in information to the kubelet, including the listening address of the CSI plug-in and the driver name. These steps complete CSI plug-in registration.
External-attacher calls APIs of the CSI plug-in to mount and unmount volumes. It mounts or unmounts volumes according to the status of the VolumeAttachment. The AD controller calls the CSI attacher in the volume plug-in to create the VolumeAttachment. The CSI attacher is an In-Tree API implemented by Kubernetes.
When the VolumeAttachment indicates that a volume should be attached but has not been attached yet, external-attacher calls the attach function at the underlying layer through the ControllerPublishVolume API; when the volume should be detached, it calls ControllerUnpublishVolume. External-attacher also synchronizes certain PV information.
This section describes the deployment of block storage.
As mentioned earlier, the CSI controller is divided into the CSI controller server and the CSI node server.
Only one CSI controller server needs to be deployed; for high availability, two can be deployed. The CSI controller server works together with multiple external plug-ins: a pod can define several external sidecar containers and a container running the CSI controller server, so that different external components combine with the CSI plug-in to provide different functions.
The CSI node server is a DaemonSet, which is registered on each node. The kubelet directly communicates with the CSI node server through a socket and calls the attach, detach, mount, and unmount methods.
The driver registrar only provides the registration function and is deployed on each node.
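A highly simplified deployment skeleton is sketched below; container names and images are illustrative, and the socket volumes, service accounts, and RBAC rules that a real deployment needs are omitted.

apiVersion: apps/v1
kind: Deployment                      # CSI controller server: one replica, or two for high availability
metadata:
  name: csi-disk-controller
spec:
  replicas: 1
  selector:
    matchLabels:
      app: csi-disk-controller
  template:
    metadata:
      labels:
        app: csi-disk-controller
    spec:
      containers:
      - name: csi-provisioner         # external sidecar that creates volumes and PVs
        image: csi-provisioner:example
      - name: csi-attacher            # external sidecar that acts on VolumeAttachment objects
        image: csi-attacher:example
      - name: csi-plugin              # the CSI controller server itself
        image: csi-diskplugin:example
---
apiVersion: apps/v1
kind: DaemonSet                       # CSI node server: one pod registered on every node
metadata:
  name: csi-disk-node
spec:
  selector:
    matchLabels:
      app: csi-disk-node
  template:
    metadata:
      labels:
        app: csi-disk-node
    spec:
      containers:
      - name: driver-registrar        # registers the plug-in socket with the kubelet
        image: csi-node-driver-registrar:example
      - name: csi-plugin              # the CSI node server; the kubelet calls it over a UNIX socket
        image: csi-diskplugin:example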
The deployment for file storage is similar to that for block storage, except that no external-attacher or VolumeAttachment is available.
The templates for CSI are similar to those of FlexVolume.
The main difference is that the template type is CSI, which defines driver, volumeHandle, volumeAttribute, and nodeAffinity.
The middle part of the following figure shows an example of dynamic provisioning, which is the same as for other volume types, except that a CSI provisioner is defined.
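A sketch of a statically defined CSI PV is shown below; the driver name, volume handle, and attributes are illustrative.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: oss-csi-pv
spec:
  capacity:
    storage: 5Gi
  accessModes:
  - ReadWriteMany
  csi:
    driver: ossplugin.csi.alibabacloud.com   # the CSI plug-in that handles this volume
    volumeHandle: oss-csi-pv                 # the volume identifier passed to the plug-in
    volumeAttributes:                        # plug-in specific parameters (illustrative)
      bucket: example-bucket
      url: oss-cn-zhangjiakou.aliyuncs.com

For the dynamic case, the StorageClass simply names the CSI driver as its provisioner, as in the StorageClass sketch earlier.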
The following is a mounting example:
After being started, a pod mounts /dev/vdb to /data. It has a GlobalPath and a PodPath in the cluster. We can mount /dev/vdb to a GlobalPath, which is the unique directory of a CSI PV on the node. A PodPath is a local directory on a pod, which maps the directories on the pod to the container.
In addition to mounting and unmounting, CSI provides some other functions. For example, the username and password information required for the template can be defined by Secret. FlexVolume, mentioned earlier, also supports this function. However, CSI can define different Secret types at different stages, such as Secret at the Mount stage and Secret at the Provision stage.
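For example, a StorageClass can reference different Secrets for different stages through its parameters; the sketch below uses illustrative Secret and class names.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-with-secrets                     # illustrative
provisioner: diskplugin.csi.alibabacloud.com
parameters:
  csi.storage.k8s.io/provisioner-secret-name: disk-secret        # Secret used at the Provision stage
  csi.storage.k8s.io/provisioner-secret-namespace: default
  csi.storage.k8s.io/node-publish-secret-name: disk-secret       # Secret used at the Mount stage
  csi.storage.k8s.io/node-publish-secret-namespace: default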
Topology is a topology awareness function. Not all nodes in the cluster can meet the requirements of a defined volume. For example, we may need to mount different zones in the volume. We can use the topology-aware function in such cases. You can read the related article for reference.
Block Volume defines a volume mode, which can be a block type or a file system type. CSI supports block volumes. That is, when a volume is mounted in a pod, it is a block device rather than a directory.
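A sketch of a PVC that requests a raw block device; the storage class name is illustrative.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: block-pvc
spec:
  volumeMode: Block             # request a raw block device instead of a file system
  accessModes:
  - ReadWriteOnce
  storageClassName: csi-disk    # illustrative
  resources:
    requests:
      storage: 20Gi

In the pod, such a volume is consumed through volumeDevices and devicePath instead of volumeMounts and mountPath.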
Skip Attach and PodInfo On Mount are two features of the CSIDriver.
CSI is a new implementation method. Recently, many updates have been released. For example, ExpandCSIVolumes allows us to scale up the file system, VolumeSnapshotDataSource allows us to take snapshots of volumes, VolumePVCDataSource allows us to define the PVC data source, and CSIInlineVolume allows us to directly define some CSIDrivers in volumes.
Alibaba Cloud has provided Alibaba Cloud Kubernetes CSI Plugin on GitHub.
This article introduces the features of volumes in Kubernetes clusters.
I hope that this article helps you with volume design, development, and troubleshooting.