
Getting Started with Kubernetes | Storage Architecture and Plug-ins

Kan Junbao, a senior technical expert, introduces the architecture, principles, and features of volumes in Kubernetes clusters in three main parts.

By Kan Junbao (Junbao), Senior Technical Expert at Alibaba

Container storage is a fundamental Kubernetes capability: it provides data persistence and is an essential guarantee for stateful services. Kubernetes ships with the mainstream In-Tree volume implementations and also offers the Out-of-Tree plug-in mechanisms, which allow third-party storage services to be integrated with Kubernetes. This article describes the Kubernetes storage architecture and the principles and implementation of volume plug-ins.

1. Kubernetes Storage System Architecture

Example: Mount a Volume in Kubernetes

The following example shows how to mount a volume.

As shown in the following figure, the YAML template on the left defines a StatefulSet application. In the template, the volume named disk-pvc is defined and mounted to the /data directory in the pod. disk-pvc is a persistent volume claim (PVC) volume where a storageClassName is defined.
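
The template in the figure appears only as an image; a minimal sketch of such a StatefulSet might look like the following. The container image, storage class name, and size are illustrative assumptions, and the claim is declared here through volumeClaimTemplates:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: "web"
  replicas: 1
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: app
        image: nginx              # illustrative image
        volumeMounts:
        - name: disk-pvc
          mountPath: /data        # the volume is mounted to /data in the pod
  volumeClaimTemplates:           # each replica gets a PVC derived from this template
  - metadata:
      name: disk-pvc
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: csi-disk  # assumed storage class name; triggers dynamic provisioning
      resources:
        requests:
          storage: 20Gi           # assumed size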

This template therefore describes a typical dynamic provisioning scenario. The right part of the figure shows the process of mounting the volume in six steps.

(Figure 1)

  • Step 1: The user creates a pod containing a PVC.
  • Step 2: The PV controller constantly watches the API server. When it finds a PVC that has been created but not yet bound, it tries to bind a persistent volume (PV) to the PVC.

The PV controller tries to find a suitable PV in the cluster. If no suitable PV is found, the volume plug-in is called for provisioning. Provisioning creates a volume from a specific remote storage medium, creates a PV in the cluster, and then binds the PV to the PVC.

  • Step 3: The scheduler performs scheduling.

When a pod is run, the scheduler must select a node for it. The scheduler makes its decision based on multiple criteria, such as the nodeSelector and nodeAffinity defined in the pod and certain labels defined on the volume.

You can add labels to the volume so that the pod using this PV is scheduled by the scheduler to the expected node based on those labels.

  • Step 4: If the PV used by the pod has not yet been attached after the pod is scheduled to a node, the AD controller calls the volume plug-in to attach the remote volume to the target node as a local device (for example, /dev/vdb).
  • Step 5: When the volume manager finds that a pod has been scheduled to its node and the volume has been attached, it mounts the local device (/dev/vdb) to the pod's subdirectory on the node. Before doing so, it may perform additional operations, such as formatting the device and mounting it to the GlobalPath.
  • Step 6: The locally mounted volume is mapped into the containers of the pod.

Kubernetes Storage Architecture

This section describes the storage architecture of Kubernetes.

(Figure 2)

  • PV controller: binds PVs and PVCs, manages their lifecycles, and creates or deletes volumes as required.
  • AD controller: attaches storage devices to and detaches storage devices from target nodes.
  • Volume manager: manages the mount and unmount operations of volumes, formats volume devices, and mounts volumes to common directories.
  • Volume plug-in: implements all the mounting functions.

The PV controller, AD controller, and volume manager initiate these operations, while the volume plug-ins implement them.

  • Scheduler: schedules pods and performs storage-related scheduling according to certain storage-related definitions.

Next, we will describe the functions of each component.

PV Controller

The following describes several basic concepts:

  • PV: defines in detail the parameters of a piece of pre-provisioned storage.

For example, when we attach a remote Network Attached Storage (NAS) file system, the parameters of the NAS must be defined in a PV. A PV is not namespaced and is generally created and maintained by the administrator.

  • PVC: is the storage interface used by users within a namespace. It does not need to be aware of storage details; it only declares basic storage requirements such as Size and AccessMode.

  • Storage class: defines a template from which PVs are created for dynamic volumes. The template contains the parameters required to create a volume and the provisioner that creates the PV.
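
As a concrete illustration, here is a minimal sketch of a PV and a storage class. The NFS server address, provisioner name, and size are illustrative assumptions, not values from this article:

# A PV created by the administrator: it records the detailed parameters of a NAS/NFS share.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nas-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 192.168.0.2          # assumed NAS mount target
    path: /share
---
# A storage class: a template plus a provisioner used to create PVs dynamically.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nas-example
provisioner: example.com/nas     # assumed provisioner name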

The PV controller completes lifecycle management of PVs and PVCs, such as creating and deleting PVs and changing the PV and PVC status. In addition, the PV controller binds PVCs and PVs. A PVC is available only after being bound to a PV. One PVC can be bound to only one PV, and one PV can be bound to only one PVC.

The following figure shows the status changes of a PV.

(Figure 3)

After a PV is created, it is in the Available state. When a PVC is bound to the PV, the PV enters the Bound state. After the PVC is deleted, the PV enters the Released state.

A PV in the Released state determines whether to enter the Available or Deleted state according to the ReclaimPolicy field. If the ReclaimPolicy field is set to Recycle, the PV enters the Available state. If the status change fails, it enters the Failed state.

The following figure shows the status changes of a PVC, which are simpler.

(Figure 4)

A created PVC is in the Pending state. After being bound to a PV, the PVC enters the Bound state. When the PV bound to the PVC is deleted, the PVC enters the Lost state. If a PV is created and bound to the PVC in the Lost state, the PVC enters the Bound state.

The following figure shows how to select a PV to be bound to a PVC.

(Figure 5)

  • First, ensure that the VolumeMode of the PV matches that of the PVC. VolumeMode defines whether the volume is a file system or a block device.
  • Second, if a LabelSelector is defined in the PVC, only PVs whose labels match the LabelSelector are selected.
  • Third, check StorageClassName. If a StorageClassName is defined in the PVC, only PVs with the same class name are considered.

The storage class specified by StorageClassName is used to create a PV only when no existing PV satisfies the PVC. If a PV that meets the conditions already exists, it is bound to the PVC directly.

  • Fourth, check AccessMode.

AccessMode is typically declared in the PVC with values such as ReadWriteOnce and ReadWriteMany. The PVC and the PV to be bound must have the same AccessMode.

  • Finally, check Size.

The capacity requested by a PVC cannot exceed that of the PV: the PVC declares the required size, and a PV can be bound only if it provides at least the declared capacity.
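
A hypothetical PVC exercising all five conditions might look like this (the names and size are illustrative); a PV can be bound to it only if its volumeMode, labels, storage class, access mode, and capacity all satisfy the claim:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: disk-pvc
spec:
  volumeMode: Filesystem        # 1. must match the PV's volumeMode
  selector:                     # 2. the PV must carry these labels
    matchLabels:
      usage: database
  storageClassName: csi-disk    # 3. the PV must have the same storage class name
  accessModes:                  # 4. the PV must offer the same access mode
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi             # 5. the PV capacity must be at least this size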

Next, let's take a look at the PV controller implementation.

The PV controller is implemented by two workers: ClaimWorker and VolumeWorker.

ClaimWorker handles PVC state transitions.

(Figure 6)

The system annotation pv.kubernetes.io/bind-completed is used to identify whether a PVC has been bound.

  • If the annotation is set, the PVC has already been bound, and we only need to synchronize some internal statuses.
  • If the annotation is not set, the PVC is not bound, and we need to filter all PVs in the cluster. findBestMatch filters the PVs against the preceding five binding conditions. If a matching PV is found, it is bound to the PVC; otherwise, a new PV is created.

VolumeWorker handles PV state transitions.

(Figure 7)

The claimRef field in the PV is used to determine the PV status. If the field is empty, the PV is in the Available state and only its status needs to be synchronized. If the field is not empty, the referenced PVC is looked up in the cluster. If the PVC exists, the PV is in the Bound state and its status is synchronized. If the PVC does not exist, the PV is in the Released state because its bound PVC has been deleted; ReclaimPolicy then determines whether the volume is deleted or only the status is synchronized.

That is the simple implementation logic for the PV controller.

AD Controller

The name AD controller is short for the attach and detach controller.

It has two core objects: desiredStateOfWorld and actualStateOfWorld.

  • DesiredStateofWorld indicates the volume mounting status to be achieved in a cluster.
  • ActualStateOfWorld indicates the actual volume mounting status in a cluster.

The AD controller has two core logic components: desiredStateOfWorldPopulator and reconciler.

  • desiredStateOfWorldPopulator synchronizes cluster data into the DSW and ASW. For example, when a PVC or a pod is created in the cluster, it synchronizes that state into the DSW.
  • reconciler reconciles the DSW and ASW states: it drives the ASW toward the DSW, performing attach and detach operations along the way.

The following table provides an example of desiredStateOfWorld and actualStateOfWorld objects.

  • desiredStateOfWorld records each worker node, including the volumes that should be attached to it and the related mounting information.
  • actualStateOfWorld records all volumes, including the node each volume is attached to and its mounting status.

(Figure 8)

The following figure shows a logical diagram of AD controller implementation.

The AD controller contains many informers, which synchronize the pod statuses, PV statuses, node statuses, and PVC statuses in the cluster to the local machine.

During initialization, populateDesireStateOfWorld and populateActualStateOfWorld are called to initialize desiredStateOfWorld and actualStateOfWorld.

During execution, desiredStateOfWorldPopulator synchronizes data statuses in the cluster to desiredStateOfWorld. reconciler synchronizes the data of actualStateOfWorld and desiredStateOfWorld in polling mode. During synchronization, it calls the volume plug-in for attach and detach operations and also calls nodeStatusUpdater to update the node status.

(Figure 9)

This is the simple implementation logic of the AD controller.

Volume Manager

The volume manager is one of the managers in the kubelet. It is used to attach, detach, mount, and unmount volumes on the local node.

Like the AD controller, it contains desiredStateOfWorld and actualStateOfWorld. It also contains volumePluginManager, which manages plug-ins on nodes. Its core logic is similar to that of the AD controller. It synchronizes data through desiredStateOfWorldPopulator and calls APIs through reconciler.

The following describes the attach and detach operations:

As mentioned above, the AD controller also performs the attach and detach operations. The --enable-controller-attach-detach kubelet flag specifies whether the AD controller or the volume manager performs these operations. If the value is true, the AD controller performs them; if the value is false, the volume manager performs them.

This is a kubelet flag, so it only defines the behavior of the node it is set on. Therefore, if a cluster contains 10 nodes and the flag is set to false on five of them, the kubelet performs the attach and detach operations on those five nodes, while the AD controller performs these operations for the other five.
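
For reference, the same behavior can also be set in the kubelet configuration file; a minimal sketch, assuming the kubelet reads a KubeletConfiguration file, might look like this (the field corresponds to the command-line flag):

# Excerpt from a kubelet configuration file.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
enableControllerAttachDetach: false   # false: this node's volume manager attaches/detaches
                                      # true (default): the AD controller does it for this node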

The following figure shows the implementation logic of the volume manager.

(Figure 10)

The outermost layer is a loop while the inner layer is a poll based on different objects, including the desiredStateOfWorld and actualStateOfWorld objects.

For example, the actualStateOfWorld.MountedVolumes object is polled. If a volume in this object also exists in desiredStateOfWorld, the actual and desired states agree (the volume should stay mounted), so nothing needs to be done. If the volume does not exist in desiredStateOfWorld, the volume is expected to be in the Unmounted state. In this case, UnmountVolume is run to bring its status in line with desiredStateOfWorld.

In this process, underlying APIs are called to perform the corresponding operations based on the comparison between desiredStateOfWorld and actualStateOfWorld. The desiredStateOfWorld.UnmountVolumes and actualStateOfWorld.AttachedVolumes operations are similar.

Volume Plug-ins

The PV controller, AD controller, and volume manager mentioned above manage PVs and PVCs by calling volume plug-in APIs, such as Provision, Delete, Attach, and Detach. The implementation logic of these APIs is in volume plug-ins.

Volume plug-ins can be divided into the In-Tree and Out-of-Tree plug-ins based on the source code locations.

  • In-Tree indicates that the source code is placed inside Kubernetes and released, managed, and iterated with Kubernetes. However, the iteration speed is slow and the flexibility is poor.
  • Out-of-Tree indicates that the source code is independent of Kubernetes, which is provided and implemented by the storage provider. Currently, the main implementation mechanisms are FlexVolume and Container Storage Interface (CSI), which can implement different storage plug-ins based on storage types. Therefore, we prefer Out-of-Tree.

Volume plug-ins are the libraries called by the PV controller, AD controller, and volume manager. They are divided into In-Tree and Out-of-Tree plug-ins. The volume plug-in calls remote storage based on these implementations and mounts a remote storage to a local device. For example, the mount -t nfs *** command for mounting a NAS file system is implemented in a volume plug-in.

(Figure 11)

Volume plug-ins can be divided into many types. In-Tree plug-ins contain dozens of common storage implementations. However, some companies define their own types and have their own APIs and parameters, which are not supported by common storage plug-ins. In this case, Out-of-Tree storage implementations need to be used, such as CSI and FlexVolume.

(Figure 12)

We will discuss the specific implementations of volume plug-ins later. Here, let's take a look at the management of volume plug-ins.

Kubernetes manages plug-ins in the PV controller, AD controller, and volume manager by using VolumePluginMgr, mainly with the plugins and prober data structures.

Plugins is an object used to save the plug-in list, while prober is a probe used to discover new plug-ins. For example, FlexVolume and CSI are extension plug-ins, which are dynamically created and generated and can be discovered only by a probe.

The following figure shows the entire process of plug-in management.

(Figure 13)

When the PV controller, AD controller, or volume manager starts, the InitPlugins method is executed to initialize VolumePluginMgr.

All In-Tree plug-ins are added to the plug-in list and the init method of the prober is called. This method first calls initWatcher to constantly check a directory (such as /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ in the figure). When a new plug-in is generated in the directory, a new FsNotify.Create event is generated and added to EventsMap. Likewise, if a plug-in is deleted from the directory, an FsNotify.Remove event is generated and added to EventsMap.

When the upper layer calls refreshProbedPlugins, the prober updates the plug-ins. If FsNotify.Create is called, the new plug-in is added to the plug-in list. If FsNotify.Remove is called, a plug-in is deleted from the plug-in list.

Now you understand the plug-in management mechanism of volume plug-ins.

Kubernetes Volume Scheduling

Pods must be scheduled to a worker node before they can run. When scheduling pods, Kubernetes applies different predicates for filtering, including some volume-related predicates, such as VolumeZonePredicate, VolumeBindingPredicate, and CSIMaxVolumeLimitPredicate.

VolumeZonePredicate checks the labels on PVs, such as the failure-domain.beta.kubernetes.io/zone label. If the label defines zone information, VolumeZonePredicate ensures that only nodes in the corresponding zone can be selected.

In the left part of the following figure, a label defines the zone cn-shenzhen-a. In the right part, nodeAffinity is defined for the PV and carries the label of the node expected by the PV; this condition is evaluated by VolumeBindingPredicate.
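
The two approaches can be sketched as PV excerpts like the following (the zone value cn-shenzhen-a comes from the figure description; the PV names are illustrative and the remaining spec fields are omitted):

# Left: a label on the PV carrying zone information, checked by VolumeZonePredicate.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: disk-pv
  labels:
    failure-domain.beta.kubernetes.io/zone: cn-shenzhen-a
---
# Right: nodeAffinity on the PV, evaluated by VolumeBindingPredicate.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: disk-pv-affinity
spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: failure-domain.beta.kubernetes.io/zone
          operator: In
          values:
          - cn-shenzhen-a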

(Figure 14)

For more information about how to schedule volumes, see Getting Started with Kubernetes - App Storage and Persistent Volumes: Snapshot Storage and Topology Scheduling.

2. FlexVolume Introduction

FlexVolume is an extension of volume plug-ins. It implements the attach, detach, mount, and unmount operations. These operations were originally implemented by volume plug-ins. However, for some storage types, we need to extend them and implement the operations outside volume plug-ins.

As shown in the following figure, volume plug-ins contain some FlexVolume implementation code. However, the code has only the proxy function.

For example, when calling the attach operation of a plug-in, the AD controller first calls the attach operation of FlexVolume in the volume plug-in. However, this operation only transfers the call to the Out-Of-Tree implementation of the FlexVolume.

FlexVolume is an executable file invoked by the kubelet. Each call is like executing a shell command such as ls; FlexVolume is not a resident-memory daemon.

FlexVolume returns its result to the kubelet on stdout in JSON format.

By default, FlexVolume drivers are stored under /usr/libexec/kubernetes/kubelet-plugins/volume/exec/, for example, /usr/libexec/kubernetes/kubelet-plugins/volume/exec/alicloud~disk/disk.

(Figure 15)

The following figure shows a command format and call example.

(Figure 16)

Introduction to FlexVolume APIs

FlexVolume provides the following APIs:

  • init: performs initialization, for example, during plug-in deployment or upgrade. It returns a DriverCapabilities structure that describes the features supported by the FlexVolume plug-in.
  • GetVolumeName: returns the unique name of the volume.
  • Attach: implements the attach function. The --enable-controller-attach-detach flag determines whether the AD controller or the kubelet initiates the attach operation.
  • WaitforAttach: waits for the completion of the asynchronous attach operation before continuing other operations.
  • MountDevice: is part of the mount operation. In FlexVolume, the mount operation is divided into MountDevice and SetUp. MountDevice performs simple preprocessing, such as formatting the device and mounting it to the GlobalMount directory.
  • GetPath: obtains the local mounting directory for each pod.
  • Setup: binds devices in GlobalPath to the local directory of the pod.
  • TearDown, UnmountDevice, and Detach: implement the inverse processes of some of the preceding APIs.
  • ExpandVolumeDevice: expands volumes. It is called by the expand controller.
  • NodeExpand: expands the file system. It is called by the kubelet.

Not all of the preceding operations must be implemented. If an operation is not implemented, define the result as follows to notify the callers that the operation was not implemented:

{
    "status": "Not supported",
    "message": "error message"
}

In addition to acting as a proxy, the FlexVolume adapter inside the volume plug-in also provides default implementations for some operations, such as mount. Therefore, if an operation is not defined in your FlexVolume driver, the default implementation is called instead.

When defining a PV, we can use the secretRef field to reference a Secret. For example, secretRef can pass in the username and password required for mounting.

Analysis of FlexVolume Mounting

Let's take a look into the mounting and unmounting processes of FlexVolume.

(Figure 17)

First, the Attach operation calls a remote API to attach the storage to a device on the target node. Then, the MountDevice operation mounts the local device to the GlobalPath and performs certain operations, such as formatting. The Mount operation (SetUp) mounts the GlobalPath to the PodPath, which is a directory mapped upon pod startup.

For example, assume the volume ID of a disk is d-8vb4fflsonz21h31cmss. After the Attach and WaitForAttach operations are completed, the disk is attached to the target node as the device /dev/vdc. After the MountDevice operation is performed, the device is formatted and mounted to a local GlobalPath. After the Mount operation (SetUp) is performed, the GlobalPath is mapped to a pod-related subdirectory. Finally, a bind mount maps the local directory into the pod. This completes the mounting process.

(Figure 18)

The unmounting process is an inverse process. The preceding describes the process of mounting a block device. Only the Mount operation needs to be performed for file storage. Therefore, only the Mount and Unmount operations need to be performed to implement FlexVolume for the file system.

FlexVolume Code Example

(Figure 19)

The init(), doMount(), and doUnmount() methods are implemented. When the script is executed, the input parameters determine the command to be run.

GitHub provides many FlexVolume examples for your reference. Alibaba Cloud provides a FlexVolume Implementation for reference.

Use of FlexVolume

The following figure shows a PV template of the FlexVolume type. It is similar to other templates, except that the type is defined as flexVolume. In flexVolume, driver, fsType, and options are defined.

  • driver defines an implemented driver, such as the alicloud/disk driver in the figure or the alicloud/nas driver.
  • fsType defines the file system type, such as ext4.
  • options contains specific parameters, such as volumeId.

Similar to other types, we can define filtering conditions by using matchLabels in selector. We can also define some scheduling information, such as defining zone as cn-shenzhen-a.
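
The FlexVolume PV template shown in the next figure might look roughly like this. The driver name alicloud/disk and the volume ID reuse values mentioned in this article; the other values are illustrative:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: flexvolume-disk-pv
  labels:
    failure-domain.beta.kubernetes.io/zone: cn-shenzhen-a   # used for scheduling
spec:
  capacity:
    storage: 20Gi
  accessModes:
  - ReadWriteOnce
  flexVolume:
    driver: "alicloud/disk"        # the FlexVolume driver to call
    fsType: "ext4"                 # file system type
    options:
      volumeId: "d-8vb4fflsonz21h31cmss"   # cloud disk ID (example value from this article)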

(Figure 20)

The following figure shows a detailed running result. A disk is attached to the node as /dev/vdb and used by the pod. We can run the mount | grep disk command to view the mounting hierarchy: /dev/vdb is mounted to the GlobalPath, the GlobalPath is then mounted to the local subdirectory defined for the pod, and that subdirectory is finally mapped to /data in the container.

(Figure 21)

3. CSI Introduction

Similar to FlexVolume, CSI is an abstract interface that provides volumes for third-party storage.

Why is CSI required since we can use FlexVolume?

FlexVolume is only used by the Kubernetes orchestration system, while CSI can be used by different orchestration systems, such as Mesos and Swarm.

In addition, CSI adopts container-based deployment, which is less dependent on the environment, more secure, and provides richer plug-in functionality. FlexVolume is a binary file in the host space; running FlexVolume is equivalent to running a local shell command, and installing FlexVolume may require installing dependencies on the host. These dependencies may affect customer applications, so this approach is weaker in terms of security and environment independence.

When implementing operators in the Kubernetes ecosystem, we often use role-based access control (RBAC) inside containers to call Kubernetes APIs and implement certain functions. These functions cannot be implemented in the FlexVolume model because it is a binary program running in host space, whereas CSI, which runs in containers, can implement them through RBAC.

CSI includes the CSI controller server and the CSI node server.

  • The CSI controller server provides the create, delete, attach, and detach functions on the control side.
  • The CSI node server provides the mount and unmount functions on nodes.

The following figure shows the CSI communication process. The CSI controller server and external CSI sidecar communicate with each other through a UNIX Socket, while the CSI node server and the kubelet communicate with each other through a UNIX Socket.

(Figure 22)

The following table lists the CSI APIs. The APIs are divided into general management APIs, node management APIs, and central management APIs.

  • General management APIs return general information about CSI, such as the plug-in names, driver identity, and plug-in capabilities.
  • The NodeStageVolume and NodeUnstageVolume node management APIs are equivalent to MountDevice and UnmountDevice in FlexVolume. The NodePublishVolume and NodeUnpublishVolume APIs are equivalent to the SetUp and TearDown APIs.
  • The CreateVolume and DeleteVolume central management APIs create and delete volumes, while the ControllerPublishVolume and ControllerUnpublishVolume APIs are used for attach and detach operations.

(Figure 23)

CSI System Structure

CSI is implemented in the form of custom resource definitions (CRDs). Therefore, it introduces the following object types: VolumeAttachment, CSINode, and CSIDriver. It is implemented by the CSI controller server and the CSI node server.

(Figure 24)

On the controller side, the AD controller and the CSI volume plug-in work in the same way as described earlier for the Kubernetes storage architecture; together they create the VolumeAttachment object.

In addition, the CSI controller server contains multiple external plug-ins, each implementing a certain function when combined with a CSI plug-in. For example:

  • External-provisioner and the controller server are combined to create and delete volumes.
  • External-attacher and the controller server are combined to attach and detach volumes.
  • External-resizer and the controller server are combined to expand volumes.
  • External-snapshotter and the controller server are combined to create and delete snapshots.

(Figure 25)

The CSI node server works with kubelet components, including the volume manager and the volume plug-in, which call the CSI plug-in for mount and unmount operations. The driver registrar component implements registration for the CSI plug-in.

This is the overall topology of CSI. The following sections describe different objects and components.

CSI Objects

Now, we will introduce the VolumeAttachment, CSIDriver, and CSINode objects.

VolumeAttachment tracks the attach and detach information of a volume. For example, if a volume is attached to a node, we use a VolumeAttachment to track it. The AD controller creates the VolumeAttachment, while external-attacher attaches or detaches the volume according to the status of the VolumeAttachment.

The following figure shows an example of a VolumeAttachment. kind is set to VolumeAttachment. In spec, attacher is set to ossplugin.csi.alibabacloud.com, specifying which plug-in performs the attach; nodeName is set to cn-zhangjiakou.192.168.1.53, specifying the node to attach to; and persistentVolumeName under source is set to oss-csi-pv, specifying the volume to be attached or detached.

In status, attached specifies the attachment status. If the value is false, external-attacher performs an attach operation.

(Figure 26)
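
Based on the values described above, a sketch of such a VolumeAttachment looks roughly like this (the object name is illustrative):

apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  name: csi-oss-attachment-example           # illustrative name
spec:
  attacher: ossplugin.csi.alibabacloud.com   # which CSI plug-in performs the attach
  nodeName: cn-zhangjiakou.192.168.1.53      # the node to attach to
  source:
    persistentVolumeName: oss-csi-pv         # the volume being attached or detached
status:
  attached: false   # false: external-attacher still needs to perform the attach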

CSIDriver describes the CSI plug-ins deployed in the cluster. The administrator creates these objects, one per plug-in type.

For example, some CSIDrivers are created, as shown in the following figure. We can run the kubectl get csidriver command to view the three types of CSIDrivers: disk, Apsara File Storage NAS (NAS), and Object Storage Service (OSS).

The CSIDriver name is defined, and the attachRequired and podInfoOnMount fields are defined in spec.

  • attachRequired specifies whether the plug-in requires the attach step. This distinguishes block storage from file storage: for example, file storage does not need the Attach operation, so the field can be set to false.
  • podInfoOnMount defines whether Kubernetes passes pod information when calling the mount API.

(Figure 27)
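
A sketch of one of these CSIDriver objects, reusing the disk plug-in name that appears later in this article (older clusters use apiVersion storage.k8s.io/v1beta1):

apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: diskplugin.csi.alibabacloud.com   # driver name from the disk plug-in
spec:
  attachRequired: true    # block storage needs the attach step; file storage could use false
  podInfoOnMount: true    # pass pod information to the mount call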

CSINode records the CSI-related information of a node in the cluster and is created when node-driver-registrar starts. The information of each newly registered CSI plug-in is added to the corresponding CSINode object.

As shown in the following figure, the CSINode list is defined, and each CSINode has its specific information (YAML file on the left). Take cn-zhangjiakou.192.168.1.49 as an example. It contains the disk CSIDriver and the NAS CSIDriver. Each CSIDriver has its nodeID and topologyKeys. If no topology information is available, topologyKeys is set to null. In this way, if a cluster contains 10 nodes, we can define CSINode entries only for certain nodes.

(Figure 28)
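
A sketch of the CSINode object described above; the disk driver name reuses the one mentioned later in this article, while the NAS driver name, nodeID values, and topology key are illustrative assumptions:

apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  name: cn-zhangjiakou.192.168.1.49
spec:
  drivers:
  - name: diskplugin.csi.alibabacloud.com
    nodeID: i-8vbexample0001                                 # illustrative node ID reported by the plug-in
    topologyKeys:
    - topology.diskplugin.csi.alibabacloud.com/zone          # illustrative topology key
  - name: nasplugin.csi.alibabacloud.com                     # assumed NAS driver name
    nodeID: cn-zhangjiakou.192.168.1.49
    topologyKeys: null                                       # no topology information for this driver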

Node-driver-registrar

Node-driver-registrar implements CSI plug-in registration, as shown in the following figure.

(Figure 29)

  • Step 1: A convention is agreed upon at startup. For example, adding a file to the /var/lib/kubelet/plugins_registry directory is equivalent to adding a plug-in.

After being started, node-driver-registrar calls the GetPluginInfo API of the CSI plug-in. Then, the API returns the listening address of CSI and a driver name of the CSI plug-in.

  • Step 2: The node-driver-registrar listens to the GetInfo and NotifyRegistrationStatus APIs.
  • Step 3: A socket in the /var/lib/kubelet/plugins_registry directory is started to generate a socket file, such as diskplugin.csi.alibabacloud.com-reg.sock. The kubelet discovers this socket through the watcher and then calls the GetInfo API of the node-driver-registrar through this socket. GetInfo returns the obtained CSI plug-in information to the kubelet, including the listening address of the CSI plug-in and the driver name.
  • Step 4: The kubelet calls the NodeGetInfo API of the CSI plug-in based on the obtained listening address.
  • Step 5: After calling the API, the kubelet updates some status information, such as the annotations, tags, and status.allocatable of the node, and creates a CSINode object.
  • Step 6: The kubelet calls the NotifyRegistrationStatus API of the node-driver-registrar to report that the CSI plug-in has been registered.

The preceding steps complete CSI plug-in registration.

External-attacher

External-attacher calls the APIs of the CSI plug-in to attach and detach volumes. It does so according to the status of the VolumeAttachment. The AD controller calls the CSI attacher in the volume plug-in to create the VolumeAttachment; the CSI attacher is an in-tree component implemented by Kubernetes.

When the attached status of a VolumeAttachment is false and attachment is desired, external-attacher calls ControllerPublishVolume to attach the volume. When detachment is desired, it calls ControllerUnpublishVolume to detach the volume. External-attacher also synchronizes certain PV information.

(Figure 30)

CSI Deployment

This section describes the deployment of block storage.

As mentioned earlier, the CSI controller is divided into the CSI controller server and the CSI node server.

Only one CSI controller server needs to be deployed; for high availability, two can be deployed. The CSI controller server works together with multiple external plug-ins: for example, a pod can contain several external containers plus a container running the CSI controller server, and the different external components combine with the CSI controller server to provide different functions.

The CSI node server is a DaemonSet, which is registered on each node. The kubelet directly communicates with the CSI node server through a socket and calls the attach, detach, mount, and unmount methods.

The driver registrar only provides the registration function and is deployed on each node.

(Figure 31)

The deployment for file storage is similar to that for block storage, except that no external-attacher or VolumeAttachment is available.

(Figure 32)

CSI Cases

The templates for CSI are similar to those of FlexVolume.

The main difference is that the volume type is csi, which defines driver, volumeHandle, volumeAttributes, and nodeAffinity.

  • driver defines the plug-in to be used for mounting.
  • volumeHandle is the unique identifier of the volume.
  • volumeAttributes is used to pass additional parameters. For example, if the PV is backed by OSS, you can define information such as the bucket and access address in volumeAttributes.
  • nodeAffinity defines some scheduling information. Similar to FlexVolume, some binding conditions can be defined in selector and labels.
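
A minimal sketch of a statically defined CSI PV for OSS, reusing the attacher and PV names from the VolumeAttachment example (bucket and endpoint values are illustrative assumptions):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: oss-csi-pv
spec:
  capacity:
    storage: 5Gi
  accessModes:
  - ReadWriteMany
  csi:
    driver: ossplugin.csi.alibabacloud.com    # which CSI plug-in performs the mount
    volumeHandle: oss-csi-pv                  # unique identifier of the volume
    volumeAttributes:                         # plug-in specific parameters
      bucket: "example-bucket"                # assumed bucket name
      url: "oss-cn-zhangjiakou.aliyuncs.com"  # assumed OSS endpoint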

The middle part in the following figure shows an example of dynamic scheduling, which is the same as that of other types, except that a CSI provisioner is defined.

(Figure 33)

The following is a mounting example:

After the pod starts, /dev/vdb is mounted into it at /data. On the node there are a GlobalPath and a PodPath: /dev/vdb is first mounted to the GlobalPath, which is the unique directory for the CSI PV on the node, and the PodPath is the pod-specific local directory that is then mapped into the container.

(Figure 34)

Other CSI Features

In addition to mounting and unmounting, CSI provides some other functions. For example, the username and password information required during provisioning or mounting can be supplied through Secrets. FlexVolume, mentioned earlier, also supports this, but CSI can define different Secrets for different stages, such as a Secret for the Mount stage and a Secret for the Provision stage.
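
For example, the CSI external components recognize per-stage Secret parameters in the StorageClass; a minimal sketch (the class name and secret names are illustrative):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-with-secrets
provisioner: ossplugin.csi.alibabacloud.com
parameters:
  csi.storage.k8s.io/provisioner-secret-name: oss-secret         # Secret used at the Provision stage
  csi.storage.k8s.io/provisioner-secret-namespace: default
  csi.storage.k8s.io/node-publish-secret-name: oss-secret        # Secret used at the Mount stage
  csi.storage.k8s.io/node-publish-secret-namespace: default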

Topology is a topology awareness function. Not all nodes in the cluster can meet the requirements of a defined volume. For example, we may need to mount different zones in the volume. We can use the topology-aware function in such cases. You can read the related article for reference.

Block Volume refers to the volume mode, which can be Block or Filesystem. CSI supports block volumes: when such a volume is consumed by a pod, it appears as a raw block device rather than a mounted directory.
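
A sketch of a PVC requesting a raw block volume and a pod consuming it as a device (all names, the storage class, and the device path are illustrative):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: block-pvc
spec:
  volumeMode: Block               # request a raw block device instead of a file system
  accessModes: [ "ReadWriteOnce" ]
  storageClassName: csi-disk      # assumed CSI storage class
  resources:
    requests:
      storage: 20Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: block-consumer
spec:
  containers:
  - name: app
    image: nginx                  # illustrative image
    volumeDevices:                # the volume appears as a device, not a mounted directory
    - name: data
      devicePath: /dev/xvda
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: block-pvc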

Skip Attach and PodInfo On Mount are two features of the CSIDriver.

(Figure 35)

Latest CSI Features

(Figure 36)

CSI is a newer implementation method, and many updates have been released recently. For example, ExpandCSIVolumes allows us to expand volumes and their file systems, VolumeSnapshotDataSource allows us to take snapshots of volumes, VolumePVCDataSource allows us to use a PVC as a data source, and CSIInlineVolume allows us to define CSI volumes inline in a pod specification.

Alibaba Cloud has provided Alibaba Cloud Kubernetes CSI Plugin on GitHub.

4. Summary

This article introduces the features of volumes in Kubernetes clusters.

  • The first part describes Kubernetes storage architecture, which includes the volume concept, mounting process, and system components.
  • The second part describes the implementation principles, deployment architecture, and usage examples of the FlexVolume plug-in.
  • The third part describes the implementation principles, resource objects, components, and usage examples of the CSI plug-in.

I hope that this article helps you with volume design, development, and troubleshooting.
