Container Service for Kubernetes: FAQ about disk volumes

Last Updated: Feb 26, 2024

This topic provides answers to some frequently asked questions about disk volumes.

The questions in this topic fall into the following categories: disk creation, disk mounting, disk unmounting, disk expansion, and disk usage.

Why does the system prompt InvalidDataDiskCatagory.NotSupported when I create a dynamically provisioned PV?

Issue

The system prompts InvalidDataDiskCatagory.NotSupported when you create a dynamically provisioned PV.

Cause

The zone does not support the disk type that is specified in the StorageClass used to create the persistent volume (PV) or the disks of the specified type are out of stock in the zone.

Solution

Why does the system prompt The specified AZone inventory is insufficient when I create a dynamically provisioned PV?

Issue

You failed to create a dynamically provisioned PV and the system generates a persistent volume claim (PVC) event: The specified AZone inventory is insufficient.

Cause

The system failed to create the disk because the Elastic Compute Service (ECS) disk inventory in the specified zone is insufficient.

Solution

Why does the system prompt disk size is not supported when I create a dynamically provisioned PV?

Issue

Your attempt to dynamically provision a PV failed and the system generates a PVC event: disk size is not supported.

Cause

The disk size specified in the PVC is invalid. The minimum disk size limit varies based on the disk type. For example, the size of an ultra disk or SSD must be 20 GiB or larger. For more information, see Disks.

Solution

Modify the disk size specified in the PVC to meet the requirement.
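As an illustration of the size requirement, the following sketch prints a PVC manifest that requests a given size. The claim name disk-pvc and the StorageClass name alicloud-disk-essd used in the usage line are assumptions for illustration only. Substitute the names used in your cluster and check the minimum size of your disk type in Disks.

```shell
# Print a PVC manifest that requests a size at or above the disk type's minimum.
# The function name and all argument values are placeholders, not values from this topic.
render_disk_pvc() {
  # $1: PVC name, $2: StorageClass name, $3: requested size such as 20Gi
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $1
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: $2
  resources:
    requests:
      storage: $3
EOF
}
```

Example usage: render_disk_pvc disk-pvc alicloud-disk-essd 20Gi | kubectl apply -f -. A claim that has not been bound typically cannot be edited in place, so you may need to delete the failed PVC before you apply the corrected manifest.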

Why does the system prompt waiting for first consumer to be created before binding when I create a dynamically provisioned PV?

Issue

You failed to create a PV by using a StorageClass in WaitForFirstConsumer mode and the system generates a PVC event: persistentvolume-controller waiting for first consumer to be created before binding.

Cause

The relevant PVC is not aware of the node to which the application pod is scheduled.

  • The nodeName parameter is specified in the configurations of the application that uses the relevant PVC. In this case, the scheduling of the pod bypasses the scheduler. As a result, the PVC is not aware of the node to which the application pod is scheduled. Therefore, you cannot use StorageClasses in WaitForFirstConsumer mode to dynamically provision PVs for applications that have the nodeName parameter specified.

  • The relevant PVC is not used by pods. In this case, create a pod and specify the PVC in the pod configurations.

Solution

  • Delete the nodeName parameter from the application configurations.

  • Create a pod and specify the PVC in the pod configurations.
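To check whether a workload pins its pods to a node, you can read nodeName from the pod template. The following is a minimal sketch. The function names and the workload name in the usage line are hypothetical, and the jsonpath expression assumes a workload that has a pod template, such as a Deployment or StatefulSet.

```shell
# Print the nodeName pinned in a workload's pod template, if any.
# An empty result means that the scheduler chooses the node.
template_node_name() {
  # $1: workload such as deployment/my-app (hypothetical), $2: namespace
  kubectl get "$1" -n "$2" -o jsonpath='{.spec.template.spec.nodeName}'
}

# Report whether the pod template is compatible with WaitForFirstConsumer binding.
check_node_name() {
  pinned="$(template_node_name "$1" "$2")"
  if [ -n "$pinned" ]; then
    echo "nodeName is set to $pinned; remove it so that the scheduler can place the pod"
  else
    echo "nodeName is not set"
  fi
}
```

Example usage: check_node_name deployment/my-app default.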

Why does the system prompt no topology key found on CSINode node-XXXX and fail to create a dynamically provisioned PV?

Issue

You failed to create a dynamically provisioned PV and the system generates a PVC event: no topology key found on CSINode node-XXXX.

Cause

  • Cause 1: The CSI plug-in on node XXXX failed to start up.

  • Cause 2: The driver used by the volume is not supported. By default, only disks, NAS file systems, and Object Storage Service (OSS) buckets are supported.

Solution

  1. Run the following command to query the status of the pod:

    kubectl get pods -nkube-system -owide | grep node-XXXX
    • If the pod is in an abnormal state, run the kubectl logs csi-plugin-xxxx -nkube-system -c csi-plugin command to print the error log. In most cases, the error log indicates that a port on the node is occupied. In this scenario, perform the following steps:

      • Terminate the processes that occupy the ports.

      • Run the following command to set the SERVICE_PORT environment variable of the CSI plug-in to a port that is not occupied:

        kubectl set env -nkube-system daemonset/csi-plugin --containers="csi-plugin" SERVICE_PORT="XXX"
    • If the pod is in a normal state, proceed to the next step.

  2. Mount a volume that uses a supported driver, such as a disk volume, NAS volume, or OSS volume. To use other drivers, submit a ticket.
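In addition to the pod status, the CSINode object of the node shows which drivers have registered and which topology keys they reported. The following sketch lists them. The function names are hypothetical, and the driver name is the disk driver name that appears elsewhere in this topic.

```shell
# List the CSI drivers registered on a node together with their topology keys.
csinode_drivers() {
  # $1: node name
  kubectl get csinode "$1" \
    -o jsonpath='{range .spec.drivers[*]}{.name}{" "}{.topologyKeys}{"\n"}{end}'
}

# Succeed only if the disk driver is registered on the node.
disk_driver_registered() {
  csinode_drivers "$1" | grep -q '^diskplugin\.csi\.alibabacloud\.com '
}
```

Example usage: disk_driver_registered node-XXXX || echo "disk driver is not registered on the node". If the driver is listed without topology keys, checking the csi-plugin pod on that node as described in step 1 is a reasonable next step.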

Why does the system prompt "selfLink was empty, can't make reference" when I create a dynamically provisioned PV?

Issue

You failed to create a PV and the system generates a PVC event: selfLink was empty, can't make reference.

Cause

  1. The Kubernetes version of the cluster does not match the CSI version.

  2. The cluster uses FlexVolume.

Solution

  1. Update the CSI version. Make sure that the volume plug-in version matches the Kubernetes version of the cluster. For example, if the Kubernetes version of your cluster is 1.20, install CSI 1.20 or later.

  2. If your cluster uses FlexVolume, migrate from FlexVolume to CSI. For more information, see Use csi-compatible-controller to migrate from FlexVolume to CSI.

Why does the system prompt had volume node affinity conflict when I launch a pod that has a disk mounted?

Issue

You failed to launch a pod that has a disk mounted and the system prompts had volume node affinity conflict.

Cause

Each PV has a nodeAffinity attribute. The nodeAffinity attribute in the PV configurations is set to a value different from the nodeAffinity attribute in the pod configurations. Therefore, the pod cannot be scheduled to a node that satisfies both constraints.

Solution

Modify the nodeAffinity attribute of the PV or pod to ensure that the PV and pod use the same value.
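To compare the two settings, you can print the node affinity of the PV and of the pod side by side. The following is a minimal sketch with hypothetical function names. It relies only on standard Kubernetes fields (spec.nodeAffinity on the PV and spec.affinity.nodeAffinity on the pod).

```shell
# Print the node selector terms that the PV requires.
pv_node_affinity() {
  # $1: PV name
  kubectl get pv "$1" -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms}'
}

# Print the node affinity declared by the pod (empty if the pod declares none).
pod_node_affinity() {
  # $1: pod name, $2: namespace
  kubectl get pod "$1" -n "$2" -o jsonpath='{.spec.affinity.nodeAffinity}'
}
```

Example usage: pv_node_affinity <pv-name>; pod_node_affinity <pod-name> <namespace>. The pod can be scheduled only to nodes whose labels satisfy both outputs.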

Why does the system prompt can't find disk when I launch a pod that has a disk mounted?

Issue

You failed to launch a pod that has a disk mounted and the system prompts can't find disk.

Cause

  • You set the DiskID parameter to a value that corresponds to a disk in a region other than the region where the pod is deployed.

  • You entered an invalid value for the DiskID parameter when you configured the PV.

  • Your account does not have the permissions to operate on the disk specified by the DiskID parameter. For example, the disk may not belong to the current account.

Solution

Check whether the disk is mounted as a statically provisioned volume or a dynamically provisioned volume.

  • If the disk is mounted as a statically provisioned volume, make sure that the DiskID parameter meets the following requirements:

    • The value of the DiskID parameter corresponds to a disk in the region where the pod is deployed.

    • The value of the DiskID parameter is set to the ID of the disk that you want to use.

    • The value of the DiskID parameter corresponds to a disk that belongs to the same Alibaba Cloud account as the cluster.

  • If the disk is mounted as a dynamically provisioned volume, make sure that the permissions of the CSI plug-in used by the cluster meet the following requirements:

    Check whether Addon Token exists in the cluster.

    • If Addon Token exists in the cluster, update the CSI plug-in to the latest version and try again.

    • If Addon Token does not exist in the cluster, the CSI plug-in uses the AccessKey ID and AccessKey secret of the Resource Access Management (RAM) role assigned to the worker node. You must check the RAM policy that is attached to the RAM role.

Why does the system prompt Previous attach action is still in process when I launch a pod that has a disk mounted?

Issue

The system prompts Previous attach action is still in process when you launch a pod that has a disk mounted. The pod starts after a few seconds.

Cause

You cannot mount multiple disks to an ECS instance at the same time. When multiple pods that have disks mounted are scheduled to an ECS instance, only one disk can be mounted at a time. If the system prompts "Previous attach action is still in process", a disk is being mounted.

Solution

No operation is required. The system automatically retries until the pod is launched.

Why does the system prompt InvalidInstanceType.NotSupportDiskCategory when I launch a pod that has a disk mounted?

Issue

You failed to launch a pod that has a disk mounted and the system prompts InvalidInstanceType.NotSupportDiskCategory.

Cause

The disk type is not supported by ECS instances.

Solution

Refer to Overview of instance families and check the types of disks that are supported by ECS instances. Mount a disk that is supported by ECS instances to the pod.

Why does the system prompt diskplugin.csi.alibabacloud.com not found in the list of registered CSI drivers when I launch a pod that has a disk mounted?

Issue

The following warning appears when you start a pod:

Warning  FailedMount       98s (x9 over 3m45s)  kubelet, cn-zhangjiakou.172.20.XX.XX  MountVolume.MountDevice failed for volume "d-xxxxxxx" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name diskplugin.csi.alibabacloud.com not found in the list of registered CSI drivers

Cause

  1. The warning usually appears on newly created nodes. The system starts the CSI pods and the service pods at the same time, and it requires a period of time to register CSI. Therefore, the CSI registration may not have been completed when you mount a volume to a service pod. This triggers the warning.

  2. The CSI registration failed because the CSI plug-in did not run as normal.

Solution

  1. No operation is required. The system automatically retries until the pod is launched.

  2. Check the status and log of the CSI plug-in. If the CSI plug-in runs as expected, join DingTalk group 35532895 to request technical support.

Why does the system prompt Multi-Attach error for volume when I launch a pod that has a disk mounted?

Issue

You failed to start a pod that uses a disk volume and the system prompts warning failedAttachVolume xxx xxx Multi-Attach error for volume "xxx". After you run the kubectl describe pvc <pvc-name> command, the output indicates that multiple pods use the same PVC.

Cause

  • Cause 1: By default, each disk can be used by only one pod. You cannot mount a disk to multiple pods.

  • Cause 2: The original pod that uses the PVC is deleted but the PVC is still retained. Log on to the ECS console and view the node to which the PV corresponding to the PVC is mounted. Print the pod log of csi-plugin on the node. The log displays Path is mounted, no remove: /var/lib/kubelet/plugins/kubernetes.io/csi/diskplugin.csi.alibabacloud.com/xxx/globalmount.

Solution

Solution to Cause 1:

Make sure that the PVC is used by only one pod.

Solution to Cause 2:

Run the following command to check whether csi-plugin mounts the volume to /var/run HostPath.

kubectl get ds -n kube-system -o yaml csi-plugin

If yes, go to the Add-ons page of the ACK console. Reinstall csi-plugin. The latest deployment file can prevent csi-plugin from mounting PVs to /var/run.

Why does the system prompt Unable to attach or mount volumes: unmounted volumes=[xxx], unattached volumes=[xxx]: timed out waiting for the condition when I start a pod that uses a disk volume?

Issue

You failed to start a pod that uses a disk volume and the system prompts Unable to attach or mount volumes: unmounted volumes=[xxx], unattached volumes=[xxx]: timed out waiting for the condition.

Cause

The preceding pod event is generated by the kubelet. The kubelet periodically checks the status of the volumes used by pods on cluster nodes. If a volume is not ready, the kubelet generates the event.

The event only indicates that the volume has not been mounted to the pod at the specified point in time due to one of the following reasons:

  • Reason 1: A mounting error occurs and the error remains unsolved for a long period of time. As a result, the event that describes the original error is overwritten, and you can view only this event generated by the kubelet.

  • Reason 2: A timeout error occurs when the kubelet obtains the ConfigMap or the default token of the ServiceAccount. This error is caused by the node network. Select another node and try again.

  • Reason 3: The pod uses fsGroup settings. The disk volume has been mounted to the pod. However, it is time-consuming to modify the file attributes.

  • Reason 4: If the disk volume is statically provisioned, check whether the value of the driver field of the disk volume is valid. For example, check whether the value contains spelling errors. If the value is invalid, the kubelet may fail to find the driver. As a result, the disk volume is not ready.

Solution

  • Solution to Reason 1: Delete the pod and wait for the system to recreate the pod. Then, find the event that corresponds to the error and locate the cause based on the event content.

  • Solution to Reason 2: Schedule the pod to another node. For more information, see Schedule pods to specific nodes.

  • Solution to Reason 3: Check whether the pod contains fsGroup settings. Using fsGroup settings results in mounting timeouts when a large number of files are stored in the disk. To resolve this issue, modify the pod configuration to run the chgrp command in init containers.

  • Solution to Reason 4: Set the driver name to a valid value. Examples:

    • diskplugin.csi.alibabacloud.com

    • nasplugin.csi.alibabacloud.com

    • ossplugin.csi.alibabacloud.com
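For Reason 4, the driver name of a statically provisioned PV can be read from the PV and compared with the names listed above. The following is a minimal sketch. The function name is hypothetical, and the check assumes a CSI PV whose driver is stored in spec.csi.driver.

```shell
# Compare the driver field of a PV with the driver names listed in this topic.
check_pv_driver() {
  # $1: PV name
  driver="$(kubectl get pv "$1" -o jsonpath='{.spec.csi.driver}')"
  case "$driver" in
    diskplugin.csi.alibabacloud.com|nasplugin.csi.alibabacloud.com|ossplugin.csi.alibabacloud.com)
      echo "driver $driver is valid" ;;
    *)
      echo "driver '$driver' is not one of the listed driver names" ;;
  esac
}
```

Example usage: check_pv_driver <pv-name>.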

Why does the system prompt validate error Device /dev/nvme1n1 has error format more than one digit locations when I start a pod that uses a disk volume?

Issue

You failed to start a pod that uses a disk volume and the system prompts validate error Device /dev/nvme1n1 has error format more than one digit locations.

Cause

The g7se, r7se, or c7se instance type is used in your cluster and the version of the CSI plug-in used by your cluster does not support Non-Volatile Memory Express (NVMe) SSDs.

Solution

Update your cluster to 1.20 or later and update the version of the CSI plug-in used by your cluster to 1.22.9-30eb0ee5-aliyun or later. For more information about how to update a plug-in, see Manage components.

Note

The FlexVolume plug-in is not supported. To migrate from FlexVolume to CSI, join DingTalk group 35532895.

Why does the system prompt ecs task is conflicted when I launch a pod that has a disk mounted?

Issue

You failed to launch a pod that has a disk mounted and the system generates a pod event: ecs task is conflicted.

Cause

Specific ECS tasks must be executed one by one. If multiple requests are received by an ECS instance at the same time, conflicts may occur among ECS tasks.

Solution

  1. Wait for the CSI plug-in to automatically retry volume provisioning. After the conflicting ECS tasks are executed, the CSI plug-in automatically mounts a disk to the pod.

  2. If the issue persists, submit a ticket to the ECS team.

Why does the system prompt wrong fs type, bad option, bad superblock on /dev/xxxxx missing codepage or helper program, or other error when I launch a pod that has a disk mounted?

Issue

You failed to launch a pod that has a disk mounted and the system generates the following pod event:

wrong fs type, bad option, bad superblock on /dev/xxxxx  missing codepage or helper program, or other error

Cause

The file system of the disk is damaged.

Solution

In most cases, this issue occurs because the disk is incorrectly unmounted. Perform the following steps to resolve the issue.

  1. Check whether the disk meets the requirements.

    • Make sure that the disk is mounted to only one pod.

    • Make sure that no data is written into the disk when you unmount the disk.

  2. Log on to the host of the pod and run the fsck -y /dev/xxxxx command to repair the file system of the disk.

    /dev/xxxxx is the device reported in the pod event. The fsck command also restores the metadata of the file system when it repairs the file system of the disk. If the repair fails, the file system of the disk is completely damaged and can no longer be used.

Why does the system prompt "exceed max volume count" when I launch a pod that has a disk volume mounted?

Issue

After you launch a pod that has a disk volume mounted, the pod remains in the Pending state for a long period of time. Consequently, the pod cannot be scheduled. The ECS instance type of the node indicates that more disk volumes can be mounted to the node. The following pod event is generated:

0/1 nodes are available: 1 node(s) exceed max volume count.

Cause

Pod scheduling is limited by the MAX_VOLUMES_PERNODE environment variable.

Solution

  • csi-plugin v1.26.4-e3de357-aliyun and later can automatically determine the number of disk volumes that can be mounted to a node. Run the following command to delete the MAX_VOLUMES_PERNODE environment variable from the csi-plugin DaemonSet in the kube-system namespace. After the variable is deleted, csi-plugin determines the number of mountable disk volumes based on the ECS instance type.

    kubectl patch -n kube-system daemonset csi-plugin -p '
    spec:
      template:
        spec:
          containers:
          - name: csi-plugin
            env:
            - name: MAX_VOLUMES_PERNODE
              $patch: delete'
  • In csi-plugin versions earlier than v1.26.4-e3de357-aliyun, the number of mountable disk volumes can be specified only by using the environment variable. In this case, set the environment variable to the smallest number of data disks that can be mounted to any node in the cluster.

Important
  • MAX_VOLUMES_PERNODE is automatically configured only when the csi-plugin pod starts up. If you manually mount a data disk to a node or unmount a disk, recreate the csi-plugin pod on the node to trigger csi-plugin to automatically configure MAX_VOLUMES_PERNODE.

  • The MAX_VOLUMES_PERNODE configuration does not support statically provisioned disk volumes. For more information, see Use a statically provisioned disk volume. If your cluster uses statically provisioned disk volumes, the number of schedulable pods decreases.
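Before you change the variable, it can help to check its current value. The following sketch reads MAX_VOLUMES_PERNODE from the csi-plugin DaemonSet and, for csi-plugin versions without automatic configuration, sets it by using kubectl set env. The function names are hypothetical, and the value 15 in the usage line is an arbitrary example, not a recommended limit.

```shell
# Print the MAX_VOLUMES_PERNODE value set on the csi-plugin container, if any.
current_max_volumes() {
  kubectl get daemonset csi-plugin -n kube-system \
    -o jsonpath='{.spec.template.spec.containers[?(@.name=="csi-plugin")].env[?(@.name=="MAX_VOLUMES_PERNODE")].value}'
}

# Set MAX_VOLUMES_PERNODE explicitly on the csi-plugin container.
set_max_volumes() {
  # $1: smallest number of data disks that any node in the cluster can mount
  kubectl set env -n kube-system daemonset/csi-plugin --containers="csi-plugin" MAX_VOLUMES_PERNODE="$1"
}
```

Example usage: current_max_volumes; set_max_volumes 15. An empty result from current_max_volumes means that the variable is not set in the DaemonSet.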

Why does the system prompt "The amount of the disk on instance in question reach its limits" when I launch a pod that has a disk volume mounted?

Issue

After you launch a pod that has a disk volume mounted, the pod remains in the ContainerCreating state for a long period of time. The following pod event is generated:

MountVolume.MountDevice failed for volume "d-xxxx" : rpc error: code = Aborted desc = NodeStageVolume: Attach volume: d-xxxx with error: rpc error: code = Internal desc = SDK.ServerError
ErrorCode: InstanceDiskLimitExceeded
Message: The amount of the disk on instance in question reach its limits

Cause

The value of the MAX_VOLUMES_PERNODE environment variable is too large.

Solution
  • csi-plugin v1.26.4-e3de357-aliyun and later can automatically determine the number of disk volumes that can be mounted to a node. Run the following command to delete the MAX_VOLUMES_PERNODE environment variable from the csi-plugin DaemonSet in the kube-system namespace. After the variable is deleted, csi-plugin determines the number of mountable disk volumes based on the ECS instance type.

    kubectl patch -n kube-system daemonset csi-plugin -p '
    spec:
      template:
        spec:
          containers:
          - name: csi-plugin
            env:
            - name: MAX_VOLUMES_PERNODE
              $patch: delete'
  • In csi-plugin versions earlier than v1.26.4-e3de357-aliyun, the number of mountable disk volumes can be specified only by using the environment variable. In this case, set the environment variable to the smallest number of data disks that can be mounted to any node in the cluster.

Important
  • MAX_VOLUMES_PERNODE is automatically configured only when the csi-plugin pod starts up. If you manually mount a data disk to a node or unmount a disk, recreate the csi-plugin pod on the node to trigger csi-plugin to automatically configure MAX_VOLUMES_PERNODE.

  • The MAX_VOLUMES_PERNODE configuration does not support statically provisioned disk volumes. For more information, see Use a statically provisioned disk volume. If your cluster uses statically provisioned disk volumes, the number of schedulable pods decreases.

Why does the system prompt The specified disk is not a portable disk when I delete a pod that has a disk mounted?

Issue

The system prompts The specified disk is not a portable disk when you unmount a disk.

Cause

The disk is billed on a subscription basis, or you accidentally switched the billing method of the disk to subscription when you upgraded the Elastic Compute Service (ECS) instance that is associated with the disk.

Solution

Switch the billing method of the disk from subscription to pay-as-you-go.

Why does the system prompt that the disk cannot be unmounted when I delete a pod that has a disk mounted and an orphaned pod which is not managed by ACK is found in the kubelet log?

Issue

You failed to delete a pod, and the kubelet log reports orphaned pods that are not managed by ACK.

Cause

When a pod exits unexpectedly, the mount target is not removed when the system unmounts the PV. As a result, the system fails to delete the pod. In Kubernetes versions earlier than 1.22, the garbage collection feature of the kubelet for data volumes is not mature. You must remove invalid mount targets manually or by running a script.

Solution

Run the following script on the failed node to remove invalid mount targets:

wget https://raw.githubusercontent.com/AliyunContainerService/kubernetes-issues-solution/master/kubelet/kubelet.sh
sh kubelet.sh

What do I do when the system failed to recreate a deleted pod and prompts that the mounting fails?

Issue

The deleted pod cannot be recreated and the system prompts the following error. The error cannot be automatically fixed.

Warning FailedMount 9m53s (x23 over 40m) kubelet MountVolume.SetUp failed for volume "xxxxx" : rpc error: code = Internal desc = stat /var/lib/kubelet/plugins/kubernetes.io/csi/pv/xxxxx/globalmount: no such file or directory

This problem occurs if the following conditions are met:

  • The Kubernetes version of your cluster is 1.20.4-aliyun-1.

  • The application uses disk volumes.

  • The application is deployed by using a StatefulSet and the podManagementPolicy: "Parallel" setting is specified.

Cause

For more information, see Pod fails to start after restarting rapidly.

Solution

  • Add new nodes to the cluster and then remove all the original nodes. The pod will be automatically recreated. For more information, see Create a node pool and Remove a node.

  • Set podManagementPolicy of the StatefulSet to OrderedReady or delete the podManagementPolicy: "Parallel" setting.

  • If the cluster contains a small number of nodes, use the following solution:

    1. Cordon the node where the pod is deployed to mark the node as unschedulable.

    2. Delete the pod and wait until the status of the pod changes to Pending.

    3. Uncordon the node and wait for the pod to restart.

  • If the cluster contains a large number of nodes, schedule the pod to another node. Then, the pod can be recreated.
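The three steps for a cluster with a small number of nodes can be sketched as a shell function. This is an illustration under assumptions: the function name is hypothetical, and the cap of 60 polling attempts with a 2-second interval is arbitrary. The cap ensures that the node is uncordoned even if the pod never reports Pending, for example because it was scheduled to another node.

```shell
# Recreate a pod in a small cluster by cordoning its node first.
recreate_pod_with_cordon() {
  # $1: node name, $2: pod name, $3: namespace
  kubectl cordon "$1" || return 1
  kubectl delete pod "$2" -n "$3" || return 1
  # Poll until the recreated pod reports Pending; give up after 60 attempts
  # (an arbitrary cap) so that the node is always uncordoned afterwards.
  tries=0
  while [ "$tries" -lt 60 ]; do
    phase="$(kubectl get pod "$2" -n "$3" -o jsonpath='{.status.phase}' 2>/dev/null || true)"
    if [ "$phase" = "Pending" ]; then
      break
    fi
    tries=$((tries + 1))
    sleep 2
  done
  kubectl uncordon "$1"
}
```

Example usage: recreate_pod_with_cordon <node-name> <pod-name> <namespace>.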

Why does the system prompt target is busy when I delete a pod that has a disk mounted?

Issue

The following error is displayed in a pod event or the kubelet log (/var/log/messages) when you delete a pod that has a disk mounted:

unmount failed, output <mount-path> target is busy

Cause

The disk mounted to the pod is in use. Log on to the node on which the pod runs and query the process that uses the disk.

Solution

  1. Run the following command to query the disk in the mount path:

    mount | grep <mount-path>
    /dev/vdtest <mount-path>
  2. Run the following command to query the ID of the process that uses the disk:

    fuser -m /dev/vdtest
  3. Terminate the process.

    After the process is terminated, the disk is automatically unmounted.

Why is the disk retained after I delete the PVC of a dynamically provisioned PV?

Issue

After you delete the PVC of a dynamically provisioned PV, the disk is still retained in the ECS console.

Cause

  1. The PVC does not use a StorageClass that exists in the cluster. In this case, the PV is statically provisioned.

  2. reclaimPolicy in the StorageClass is set to Retain.

  3. The PVC and PV are deleted at the same time, or the PV is deleted before the PVC is deleted.

Solution

  1. CSI does not delete statically provisioned PVs and their PVCs. You must log on to the ACK console or call the API to manually delete them.

  2. If reclaimPolicy is set to Retain, CSI does not delete the disk when the PVC is deleted. You must log on to the console or call the API to manually delete the disk.

  3. If the PV already has a deletionTimestamp, which means that the PV is being deleted, CSI does not reclaim the disk. For more information, see controller. To delete the disk, delete only the PVC. Then, the PV bound to the PVC is automatically deleted.
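The first two causes can be checked from the command line. The following sketch prints the StorageClass of a PVC and the reclaim policies involved. The function names are hypothetical. The fields are standard Kubernetes fields.

```shell
# Print the StorageClass that a PVC references (empty for a PVC without one).
pvc_storage_class() {
  # $1: PVC name, $2: namespace
  kubectl get pvc "$1" -n "$2" -o jsonpath='{.spec.storageClassName}'
}

# Print the reclaim policy declared in a StorageClass.
sc_reclaim_policy() {
  # $1: StorageClass name
  kubectl get storageclass "$1" -o jsonpath='{.reclaimPolicy}'
}

# Print the reclaim policy recorded in a PV.
pv_reclaim_policy() {
  # $1: PV name
  kubectl get pv "$1" -o jsonpath='{.spec.persistentVolumeReclaimPolicy}'
}
```

Example usage: sc_reclaim_policy "$(pvc_storage_class <pvc-name> <namespace>)". A result of Retain means that the disk is kept after the PVC is deleted.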

Why does a PVC still exist after I delete it?

Issue

You failed to delete a PVC in the cluster, even with the --force option.

Cause

Pods in the cluster are still using the PVC. The finalizer of the PVC cannot be deleted.

Solution

  1. Run the following command to query the pods that use the PVC:

    kubectl describe pvc <pvc-name> -n <namespace>
  2. Confirm that the pods are no longer in use. Delete the pods and try to delete the PVC again.

Why does the system generate the Waiting for user to (re-)start a pod to finish file system resize of volume on node PVC event and fail to dynamically expand a disk?

Issue

After you expand a PVC, the value of the StorageCapacity parameter in the status information of the PVC is not changed. In addition, a PVC event is generated:

 Waiting for user to (re-)start a pod to finish file system resize of volume on node.

Cause

The expansion of a disk volume includes the expansion of the disk size and the expansion of the file system. The disk size is expanded by using the ECS API. This error indicates that the disk size is expanded but the file system failed to be expanded. The file system expansion failure is caused by issues related to the node.

Solution

Check the type of the node.

  • If the node is deployed on an elastic container instance, submit a ticket to the Elastic Container Instance team.

  • If the node is deployed on an ECS instance, run the kubectl get pods -nkube-system -owide | grep csi | grep <node-name> command to query the status of the csi-plugin component that runs on the node.

    • If the status of csi-plugin is normal, join DingTalk group 35532895.

    • If the csi-plugin component is in an abnormal state, restart the pod that runs the csi-plugin component and try again. If the problem persists, join DingTalk group 35532895.

Why does the system fail to dynamically expand a disk and prompt "only dynamically provisioned pvc can be resized and the storageclass that provisions the pvc must support resize" when modifying the PVC?

Issue

After you use the CLI or console to modify the storage size requested by a PVC, the following error is returned:

only dynamically provisioned pvc can be resized and the storageclass that provisions the pvc must support resize 

Cause

Cause 1: The PV corresponding to the PVC is statically provisioned. You cannot dynamically expand statically provisioned volumes.

Cause 2: The StorageClass of the PVC sets allowVolumeExpansion to false. The PV cannot be dynamically expanded.

Solution

Solution 1: Manually expand the statically provisioned volume. For more information, see Manually expand a disk volume.

Solution 2: Set allowVolumeExpansion to true in the StorageClass of the PVC. Then, expand the corresponding PV.
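Solution 2 can be sketched with kubectl patch. The function names are hypothetical, and the StorageClass name, PVC name, and size in the usage line are placeholders.

```shell
# Allow volumes provisioned by a StorageClass to be expanded.
enable_expansion() {
  # $1: StorageClass name
  kubectl patch storageclass "$1" -p '{"allowVolumeExpansion": true}'
}

# Request a larger size for a PVC, which triggers the expansion of its PV.
request_pvc_size() {
  # $1: PVC name, $2: namespace, $3: new size such as 30Gi
  kubectl patch pvc "$1" -n "$2" \
    -p "{\"spec\":{\"resources\":{\"requests\":{\"storage\":\"$3\"}}}}"
}
```

Example usage: enable_expansion alicloud-disk-essd; request_pvc_size disk-pvc default 30Gi.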

Why does the system prompt input/output error when an application performs read and write operations on the mount directory of a disk volume?

Issue

You successfully launched an application that has a disk mounted. However, the system prompts input/output error shortly after the application is launched.

Cause

The disk mounted to the application is missing.

Solution

Check the disk status and fix the issue.

  1. Find the PVC used to mount the disk based on the mount directory and the volumeMounts parameter in the configurations of the application pod.

  2. Run the kubectl get pvc <pvc-name> command to query the status of the PVC. Record the name of the PV that is bound to the PVC.

  3. Find the YAML file of the PV by name and record the disk ID in the volumeHandle parameter of the PV.

  4. Log on to the ECS console and view the disk status based on the disk ID.

    • If the disk is in the Available state, the disk is unmounted. You can restart the pod to mount the disk again.

      If the pod is in the Running state, the disk was mounted and then unmounted. In this case, the disk may be used by multiple pods. You can run the kubectl describe pvc <pvc-name> command and check the Used By field in the output to see whether the PVC is referenced by multiple pods.

    • If no disk is found, it indicates that the disk has been released and cannot be restored.

      Important

      When you mount an enhanced SSD (ESSD) to an application, we recommend that you enable the instant access (IA) snapshot feature to ensure data security for the disk volume. For more information, see Data loss occurs due to accidental ESSD deletions.
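Steps 2 and 3 of the preceding procedure can be sketched as follows. The function names are hypothetical, and the sketch assumes a CSI disk volume whose disk ID is stored in spec.csi.volumeHandle of the PV.

```shell
# Print the name of the PV bound to a PVC.
pv_of_pvc() {
  # $1: PVC name, $2: namespace
  kubectl get pvc "$1" -n "$2" -o jsonpath='{.spec.volumeName}'
}

# Print the disk ID recorded in a PV.
disk_id_of_pv() {
  # $1: PV name
  kubectl get pv "$1" -o jsonpath='{.spec.csi.volumeHandle}'
}

# Resolve a PVC to the ID of the disk behind it.
disk_id_of_pvc() {
  pv="$(pv_of_pvc "$1" "$2")"
  if [ -z "$pv" ]; then
    echo "PVC $1 is not bound to a PV" >&2
    return 1
  fi
  disk_id_of_pv "$pv"
}
```

Example usage: disk_id_of_pvc <pvc-name> <namespace>. Use the printed disk ID to look up the disk status in the ECS console.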