When you need multiple nodes to concurrently read and write the same cloud disk for efficient data sharing and fast failover, you can use the multi-attach feature. This feature lets you attach a single ESSD, ESSD AutoPL, or other supported cloud disk to multiple nodes that support the NVMe protocol in the same zone, or attach a single zone-redundant storage ESSD to multiple nodes in the same region. This topic demonstrates how to use the NVMe cloud disk multi-attach and Reservation features in an ACK cluster.
Before you begin
To better use the NVMe cloud disk multi-attach and Reservation features, we recommend that you understand the following information before reading this document:
For information about the NVMe protocol, see Overview of the NVMe protocol.
For information about the cloud disk multi-attach feature and its limits, see Cloud disk multi-attach feature.
Scenarios
The multi-attach feature is suitable for the following scenarios:
Shared data access: multiple nodes need to concurrently read and write data on the same cloud disk.
Fast failover: if the node that currently writes to the disk fails, another node to which the disk is already attached can take over immediately, as demonstrated in this topic.
Limits
A single NVMe cloud disk can be attached to a maximum of 16 ECS instances in the same zone at the same time.
If you want to read and write to a cloud disk from multiple nodes at the same time, you must mount the cloud disk by using volumeDevices. This method exposes the cloud disk as a raw block device and does not support access through a file system (see the sketch after this list).
For more information about the limits, see Limits of the multi-attach feature.
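To make the volumeDevices requirement concrete, the following is a minimal pod spec fragment; the container name, image, and device path are hypothetical placeholders:

```yaml
# Hypothetical fragment: a shared NVMe cloud disk must be consumed as a
# raw block device through volumeDevices; file-system access through
# volumeMounts is not supported for concurrent multi-node writes.
containers:
  - name: app                            # placeholder container name
    image: registry.example.com/app:v1   # placeholder image
    volumeDevices:
      - name: data-disk                  # refers to a PVC-backed volume
        devicePath: /dev/data-disk       # device node inside the container
```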
Preparations
An ACK managed cluster is created, and the Kubernetes version of the cluster is 1.20 or later. For more information, see Create an ACK managed cluster.
The csi-plugin and csi-provisioner components are installed, and the version of the components is v1.24.10-7ae4421-aliyun or later. For information about how to upgrade the csi-plugin and csi-provisioner components, see Manage the csi-plugin and csi-provisioner components.
The cluster contains at least two nodes that are in the same zone and support the multi-attach feature. For information about the instance families that support the multi-attach feature, see Limits of the multi-attach feature.
A business application that meets the following requirements is prepared and packaged into a container image for deployment in the ACK cluster:
The application supports accessing data on the same cloud disk from multiple replicas at the same time.
The application can ensure data consistency by using standard features such as NVMe Reservation.
Billing description
The multi-attach feature does not incur additional fees. Resources that support the NVMe protocol are still billed based on their original billing methods. For more information about the billing of cloud disks, see Elastic Block Storage volumes.
Application example
This topic uses the source code and Dockerfile of the following application example. After the application is built, upload it to an image repository for deployment in the cluster. In this application example, multiple replicas jointly manage a lease, but only one replica holds the lease. If the replica cannot work properly, other replicas automatically take over the lease. Note the following when you write an application:
In the example, O_DIRECT is used to open the block device for read and write operations so that no cache affects the test.
In the example, the simplified Reservation interface provided by the Linux kernel is used (the persistent reservation ioctls defined in linux/pr.h, such as IOC_PR_REGISTER and IOC_PR_RESERVE). You can also run Reservation-related commands by using one of the following methods. Both methods require privileges.
C code: ioctl(fd, NVME_IOCTL_IO_CMD, &cmd);
Command-line interface: nvme-cli
For more information about the NVMe Reservation feature, see NVMe Specification.
Step 1: Deploy the application and configure the multi-attach feature
Create a StorageClass named alicloud-disk-shared and enable the multi-attach feature for cloud disks.
Create a PVC named data-disk and set accessModes to ReadWriteMany and volumeMode to Block.
Create a StatefulSet application named lease-test and use the image of the application example in this topic.
Create a lease.yaml file that defines these resources. The following table describes the key parameters, and a minimal sketch of the file follows the table.
Replace the container image address in the following YAML with the actual image address of your application.
Important: Because NVMe Reservation takes effect at the node level, multiple pods on the same node may interfere with each other. Therefore, podAntiAffinity is used in this example to prevent multiple pods from being scheduled to the same node. If your cluster also contains nodes that do not support the NVMe protocol, configure affinity to ensure that the pods are scheduled only to nodes that support the NVMe protocol.
| Parameter | Configuration for the multi-attach feature | Configuration for normal mounting |
| --- | --- | --- |
| StorageClass: parameters.multiAttach | Set to true to enable the multi-attach feature for cloud disks. | No configuration required |
| PVC: accessModes | ReadWriteMany | ReadWriteOnce |
| PVC: volumeMode | Block | Filesystem |
| Volume mounting method | volumeDevices: directly access data on the cloud disk through a block device. | volumeMounts: mainly used to mount volumes of the file system type. |
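The following is a minimal sketch of lease.yaml that combines the three resources according to the table above. The provisioner name, disk category, disk size, device path, and image address are assumptions; adapt them to your environment.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: alicloud-disk-shared
provisioner: diskplugin.csi.alibabacloud.com  # Alibaba Cloud CSI disk driver
parameters:
  type: cloud_essd       # assumption: an ESSD-family disk that supports multi-attach
  multiAttach: "true"    # enable the multi-attach feature
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-disk
spec:
  accessModes:
    - ReadWriteMany      # multiple nodes read and write concurrently
  volumeMode: Block      # expose the disk as a raw block device
  storageClassName: alicloud-disk-shared
  resources:
    requests:
      storage: 20Gi      # assumption: pick a size that fits your workload
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: lease-test
spec:
  serviceName: lease-test
  replicas: 2
  selector:
    matchLabels:
      app: lease-test
  template:
    metadata:
      labels:
        app: lease-test
    spec:
      affinity:
        podAntiAffinity:   # keep the two replicas on different nodes
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: lease-test
              topologyKey: kubernetes.io/hostname
        # add nodeAffinity here if the cluster also contains non-NVMe nodes
      containers:
        - name: lease
          image: registry.example.com/lease-test:v1  # replace with your image
          securityContext:
            privileged: true   # assumption: privileges for Reservation commands
          volumeDevices:       # block-device access instead of volumeMounts
            - name: data-disk
              devicePath: /dev/data-disk
      volumes:
        - name: data-disk
          persistentVolumeClaim:
            claimName: data-disk   # all replicas share the same PVC
```

Note that all replicas reference the same PVC through spec.template.spec.volumes rather than volumeClaimTemplates, so every pod attaches the same cloud disk.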
Run the following command to deploy the application:
kubectl apply -f lease.yaml
Step 2: Verify the multi-attach and Reservation effects
To ensure data consistency on the NVMe cloud disk, you can control read and write permissions through Reservation in your application. If one pod performs a write operation, other pods can only perform read operations.
Multiple nodes can read and write to the same cloud disk
Run the following command to view the pod logs:
kubectl logs -l app=lease-test --prefix -f
Expected results:
[pod/lease-test-0/lease] Register as key 4745d0c5cd9a2fa4
[pod/lease-test-0/lease] Refreshed lease
[pod/lease-test-0/lease] Refreshed lease
[pod/lease-test-1/lease] Remote lease-test-0 refreshed lease
[pod/lease-test-0/lease] Refreshed lease
[pod/lease-test-1/lease] Remote lease-test-0 refreshed lease
[pod/lease-test-0/lease] Refreshed lease
[pod/lease-test-1/lease] Remote lease-test-0 refreshed lease
[pod/lease-test-0/lease] Refreshed lease
[pod/lease-test-1/lease] Remote lease-test-0 refreshed lease
The expected results indicate that Pod lease-test-1 can immediately read the content written by Pod lease-test-0.
NVMe Reservation is created successfully
Run the following command to obtain the cloud disk ID:
kubectl get pvc data-disk -ojsonpath='{.spec.volumeName}'
Log on to either of the two nodes and run the following command to check whether NVMe Reservation is created successfully:
Replace 2zxxxxxxxxxxx in the following command with the part after d- in the cloud disk ID that you obtained in the previous step.
nvme resv-report -c 1 /dev/disk/by-id/nvme-Alibaba_Cloud_Elastic_Block_Storage_2zxxxxxxxxxxx
Expected results:
NVME Reservation status:
gen          : 3
rtype        : 1
regctl       : 1
ptpls        : 1
regctlext[0] :
  cntlid     : ffff
  rcsts      : 1
  rkey       : 4745d0c5cd9a2fa4
  hostid     : 4297c540000daf4a4*****
The expected results indicate that NVMe Reservation is created successfully. An rtype value of 1 indicates a Write Exclusive reservation: only the registrant that holds the reservation can write to the disk.
Reservation can block write I/O operations from abnormal nodes
Log on to the node where Pod lease-test-0 is located and run the following command to pause the process to simulate a failure scenario:
pkill -STOP -f /usr/local/bin/lease
Wait for 30 seconds and then run the following command to view the logs again:
kubectl logs -l app=lease-test --prefix -f
Expected results:
[pod/lease-test-1/lease] Remote lease-test-0 refreshed lease
[pod/lease-test-1/lease] Remote is dead, preempting
[pod/lease-test-1/lease] Register as key 4745d0c5cd9a2fa4
[pod/lease-test-1/lease] Refreshed lease
[pod/lease-test-1/lease] Refreshed lease
[pod/lease-test-1/lease] Refreshed lease
The expected results indicate that Pod lease-test-1 has taken over and now holds the lease as the primary replica of the service.
Log on to the node where Pod lease-test-0 is located again and run the following command to resume the paused process:
pkill -CONT -f /usr/local/bin/lease
Run the following command to view the logs again:
kubectl logs -l app=lease-test --prefix -f
Expected results:
[pod/lease-test-0/lease] failed to write lease: Invalid exchange
The expected results indicate that Pod lease-test-0 can no longer write to the cloud disk and that its lease container automatically restarts. This indicates that the write I/O operation has been successfully blocked by Reservation.
References
If your NVMe cloud disk is running out of space, see Expand a cloud disk volume.