This topic describes how to troubleshoot common issues related to storage, including issues with disk volumes and NAS volumes.

Troubleshoot common issues

Perform the following operations to view the logs of the volume plug-in and identify the cause of the issue.

  1. Run the following command to check whether events occur in persistent volume claims (PVCs) or pods:
    kubectl get events

    Expected output:

    LAST SEEN   TYPE      REASON                 OBJECT                                                  MESSAGE
    2m56s       Normal    FailedBinding          persistentvolumeclaim/data-my-release-mariadb-0         no persistent volumes available for this claim and no storage class is set
    41s         Normal    ExternalProvisioning   persistentvolumeclaim/pvc-nas-dynamic-create-subpath8   waiting for a volume to be created, either by external provisioner "nasplugin.csi.alibabacloud.com" or manually created by system administrator
    3m31s       Normal    Provisioning           persistentvolumeclaim/pvc-nas-dynamic-create-subpath8   External provisioner is provisioning volume for claim "default/pvc-nas-dynamic-create-subpath8"
  2. Run the following command to check whether the FlexVolume or CSI plug-in is deployed in the cluster:
    • Run the following command to check whether the FlexVolume plug-in is deployed in the cluster:
      kubectl get pod -nkube-system | grep flexvolume

      Expected output:

      NAME                      READY   STATUS             RESTARTS   AGE
      flexvolume-***            4/4     Running            0          23d
    • Run the following command to check whether the CSI plug-in is deployed in the cluster:
      kubectl get pod -nkube-system | grep csi

      Expected output:

      NAME                       READY   STATUS             RESTARTS   AGE
      csi-plugin-***             4/4     Running            0          23d
      csi-provisioner-***        7/7     Running            0          14d
  3. Check whether the volume templates match the template of the volume plug-in used in the cluster. The supported volume plug-ins are FlexVolume and CSI.
    If this is the first time you mount volumes in the cluster, check the driver specified in the persistent volume (PV) and StorageClass. The driver that you specified must be the same as the volume plug-in that is deployed in the cluster.
  4. Check whether the volume plug-in is upgraded to the latest version.
    • Run the following command to query the image version of the FlexVolume plug-in:
      kubectl get ds flexvolume -nkube-system -oyaml | grep image

      Expected output:

      image: registry.cn-hangzhou.aliyuncs.com/acs/flexvolume:v1.14.8.109-649dc5a-aliyun

      For more information about the FlexVolume plug-in, see Flexvolume.

    • Run the following command to query the image version of the CSI plug-in:
      kubectl get ds csi-plugin -nkube-system -oyaml | grep image

      Expected output:

      image: registry.cn-hangzhou.aliyuncs.com/acs/csi-plugin:v1.18.8.45-1c5d2cd1-aliyun

      For more information about the CSI plug-in, see csi-plugin and csi-provisioner.

  5. View log data.
    • If a PVC of disk type is in the Pending state, the cluster fails to create the PV. You must check the log of the Provisioner plug-in.
      • If the FlexVolume plug-in is deployed in the cluster, run the following command to print the log of alicloud-disk-controller:
        podid=`kubectl get pod -nkube-system | grep alicloud-disk-controller | awk '{print $1}'`
        kubectl logs $podid -nkube-system
      • If the CSI plug-in is deployed in the cluster, run the following command to print the log of csi-provisioner:
        podid=`kubectl get pod -nkube-system | grep csi-provisioner | awk '{print $1}'`
        kubectl logs $podid -nkube-system -c csi-provisioner
        Note Two pods are created to run csi-provisioner, so the first command returns two pod IDs. Run the kubectl logs command once for each of the two pod IDs.
    • If a mounting error occurs when the system starts a pod, you must check the log of FlexVolume or csi-plugin.
      • If the FlexVolume plug-in is deployed in the cluster, run the following command to print the log of FlexVolume:
        kubectl get pod {pod-name} -owide

        Log on to the Elastic Compute Service (ECS) instance where the pod runs and check the FlexVolume log files at /var/log/alicloud/flexvolume_**.log.

      • If the CSI plug-in is deployed in the cluster, run the following command to print the log of csi-plugin:
        nodeID=`kubectl get pod {pod-name} -owide | awk 'NR>1 {print $7}'`
        podID=`kubectl get pods -nkube-system -owide -lapp=csi-plugin | grep $nodeID | awk '{print $1}'`
        kubectl logs $podID -nkube-system
    • View the log of Kubelet.

      Run the following command to query the node where the pod runs:

      kubectl get pod deployment-disk-5c795d7976-bjhkj -owide | awk 'NR>1 {print $7}'

      Log on to the node and check the log in the /var/log/messages file.
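Step 5 notes that csi-provisioner runs in two pods, so the kubectl logs command must be run once per pod ID. The loop below is a sketch of how to script this; it dry-runs the commands against sample output, and the pod names are hypothetical placeholders. In a real cluster, remove the echo and replace the sample with the live kubectl get pod output.

```shell
# Hypothetical sample of: kubectl get pod -nkube-system | grep csi-provisioner
# (pod names are made-up placeholders)
sample='csi-provisioner-6b58f46989-8wwl5   7/7   Running   0   14d
csi-provisioner-6b58f46989-qzh2p   7/7   Running   0   14d'

# awk '{print $1}' keeps only the first column (the pod ID).
for podid in $(printf '%s\n' "$sample" | awk '{print $1}'); do
  # In a real cluster, drop the echo to actually fetch the logs.
  echo kubectl logs "$podid" -nkube-system -c csi-provisioner
done
```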

Quick recovery

If you fail to mount volumes to most of the pods on a node, you can schedule the pods to another node. For more information, see the questions and answers in the following sections.
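One way to reschedule the pods, sketched below with placeholder node and pod names: cordon the node so that no new pods are scheduled to it, delete the affected pods so that their controllers recreate them on other nodes, and uncordon the node after the issue is fixed.

```shell
# Placeholder names; replace with values from your cluster.
kubectl cordon cn-hangzhou.192.168.0.10       # stop scheduling new pods to the node
kubectl delete pod my-app-5c795d7976-bjhkj    # the controller recreates the pod on another node
# After the node is repaired:
kubectl uncordon cn-hangzhou.192.168.0.10
```

Note that this works only for pods that are managed by a controller such as a Deployment or StatefulSet; bare pods are not recreated after deletion.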

FAQ about disk volumes

Issue:

The system prompts "The specified disk is not a portable disk" when you unmount a disk.

Cause:

The disk is billed on a subscription basis, or you accidentally switch the billing method of the disk to subscription when you upgrade the ECS instance that is associated with the disk.

Solution:

Switch the billing method of the disk from subscription to pay-as-you-go.

Issue:

You fail to launch a pod that has a disk mounted and the system prompts "had volume node affinity conflict".

Cause:

You have set the nodeAffinity attribute of the persistent volume (PV), and its value does not match the node on which the pod is scheduled to run. As a result, the pod cannot be scheduled to the expected node.

Solution:

Modify the nodeAffinity attribute of the PV or the scheduling constraints of the pod so that the PV and the pod match the same node.
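For reference, a disk PV typically pins the volume to the zone where the disk resides through nodeAffinity. The sketch below assumes the CSI plug-in; the PV name and zone are placeholders, and the zone must contain nodes to which the pod can be scheduled:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-disk-example                 # placeholder PV name
spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.diskplugin.csi.alibabacloud.com/zone
              operator: In
              values:
                - cn-hangzhou-b         # placeholder zone
```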

Issue:

The system prompts "The specified disk is not a portable disk" when you mount a disk.

Cause:

  • You entered an invalid value for diskid when you set the parameters of the PV.
  • Your account does not have permissions on the specified disk. The disk may not belong to the current account.

Solution:

Set the diskid parameter to a valid disk ID that belongs to the current account.

Issue:

You fail to dynamically provision a PV and the system prompts "The specified AZone inventory is insufficient".

Cause:

The system fails to create the disk because disks of the specified type are out of stock in the specified zone.

Solution:

Change the type of disk or select another zone.

Issue:

You fail to dynamically provision a PV and the system prompts "disk size is not supported".

Cause:

The size of the disk that you specified in the PVC is invalid. The disk size must be at least 20 GiB.

Solution:

Change the size of the disk that is specified in the PVC to a valid value.
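A minimal PVC sketch that satisfies the 20 GiB lower bound; the claim name and storage class below are examples, not required values:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-disk-example                # placeholder claim name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: alicloud-disk-essd  # example disk storage class
  resources:
    requests:
      storage: 20Gi                     # must be at least 20 GiB for disks
```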

Issue:

The mount operation on the disk is blocked, and the following errors are reported in the Kubelet log:

Operation for "{volumeName:kubernetes.io/csi/diskplugin.csi.alibabacloud.com^d-2zejaz33icbp2vvvc9le podName: nodeName:}" failed. No retries permitted until 2020-11-05 14:38:12.653566679 +0800 CST m=+9150650.781033052 (durationBeforeRetry 2m2s). Error: "MountVolume.MountDevice failed for volume \"d-2zejaz33icbp2vvvc9le\" (UniqueName: \"kubernetes.io/csi/diskplugin.csi.alibabacloud.com^d-2zejaz33icbp2vvvc9le\") pod \"pod-e5ee2d454cdb4d1d916d933495e56cbe-3584893\" (UID: \"f8d71e90-d934-4d5a-b54f-62555da5df22\") : rpc error: code = Aborted desc = NodeStageVolume: Previous attach action is still in process

Cause:

You used an earlier version of CSI, which runs blkid to obtain the UUID of the disk. Disks that are restored from snapshots of the same disk share the same UUID, which causes the blkid command to hang.

Solution:

Restart the current node, or upgrade the CSI to the latest version.

Issue:

The following warning appears when you start the pod:

Warning  FailedMount       104s                 kubelet, cn-zhangjiakou.172.20.11.162  Unable to attach or mount volumes: unmounted volumes=[sysdata-nas], unattached volumes=[kun-log kun-script kun-app sysdata-nas kun-patch default-token-rbx8p kun-etc kun-bin]: timed out waiting for the condition
Warning  FailedMount       98s (x9 over 3m45s)  kubelet, cn-zhangjiakou.172.20.11.162  MountVolume.MountDevice failed for volume "nas-9d9ead08-8a1d-4463-a7e0-7bd0e3d3****" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name nasplugin.csi.alibabacloud.com not found in the list of registered CSI drivers

Cause:

The warning usually appears on newly created nodes. The system starts the CSI pods and the service pods at the same time, and CSI registration takes some time to complete. Therefore, when the system mounts a volume for a service pod, CSI registration may not be complete, which triggers the warning.

Solution:

No operation is needed. The warning does not affect the startup of the pod.

Issue:

You fail to delete a pod, and the Kubelet log contains entries about orphaned pods that are not managed by ACK.

Cause:

The pod exits abnormally, and some of its mount targets are not removed when the system deletes the pod. As a result, the deletion fails and Kubelet cannot garbage-collect all the volumes. You must remove the invalid mount targets manually or by running an automated script.

Solution:

Run the following script on the failed node to remove invalid mount targets:

wget https://raw.githubusercontent.com/AliyunContainerService/kubernetes-issues-solution/master/kubelet/kubelet.sh
sh kubelet.sh

FAQ about NAS volumes

Issue:

The system prompts "chown: option not permitted" when you mount a NAS file system.

Cause:

Your container does not have permissions to use the specified NAS file system.

Solution:

Launch the container with root privileges.
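One way to do this is to set the security context of the container so that it runs as root (UID 0), as sketched below with placeholder names:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nas-example               # placeholder pod name
spec:
  containers:
    - name: app                   # placeholder container name
      image: nginx                # placeholder image
      securityContext:
        runAsUser: 0              # run as root so that chown on the NAS mount is permitted
```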