Storage troubleshooting - Container Service for Kubernetes

This topic describes the diagnostic procedure for storage and how to troubleshoot storage exceptions.

Diagnostic procedure

Run the following command to view the pod events. Check whether the pod fails to launch because of a storage issue.
```
kubectl describe pods <pod-name>
```
If the events show that the volume is mounted to the pod, the pod fails to launch because of another issue, such as CrashLoopBackOff. To resolve the issue, submit a ticket.
Run the following command to check whether the Container Storage Interface (CSI) plug-in works as expected:
```
kubectl get pod -n kube-system |grep csi
```
Expected output:
```
NAME                       READY   STATUS             RESTARTS   AGE
csi-plugin-***             4/4     Running            0          23d
csi-provisioner-***        7/7     Running            0          14d
```
Note: If the status of a pod is not Running, run the kubectl describe pods <pod-name> -n kube-system command to view the pod events and check why the containers exited.
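The check above can be scripted. The following is a minimal sketch, not an official tool: the function name csi_pods_not_ready is hypothetical, and it assumes the default column layout of kubectl get pod (NAME, READY, STATUS).

```shell
# Print CSI pods that are not fully ready.
# Reads the output of `kubectl get pod -n kube-system` on stdin.
csi_pods_not_ready() {
  awk '$1 ~ /^csi-/ {
    split($2, ready, "/")   # READY column, for example 4/4
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
  }'
}
```

Example usage: kubectl get pod -n kube-system | csi_pods_not_ready. No output means that every CSI pod is Running with all containers ready.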
Run the following command to check whether the version of the CSI plug-in is up-to-date.
```
kubectl get ds csi-plugin -n kube-system -oyaml |grep image
```
Expected output:
```
image: registry.cn-****.aliyuncs.com/acs/csi-plugin:v*****-aliyun
```
For more information about the latest CSI version, see csi-plugin and csi-provisioner. If your cluster uses an earlier CSI version, update the plug-in to the latest version. For more information, see Manage system components. For more information about how to troubleshoot volume plug-in update failures, see Troubleshoot component update failures.
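To compare the deployed version with the latest release, it can help to extract only the image tag. The following sketch assumes that the tag is the text after the last colon of each image line. The function name csi_image_tags is hypothetical.

```shell
# Print the tag of each image line read from stdin.
# Lines such as "imagePullPolicy: Always" are ignored.
csi_image_tags() {
  awk '$1 == "image:" || ($1 == "-" && $2 == "image:") {
    n = split($NF, parts, ":")
    print parts[n]
  }'
}
```

Example usage: kubectl get ds csi-plugin -n kube-system -oyaml | grep image | csi_image_tags | sort -u.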
Troubleshoot the pod pending issue.
- If the pod uses a disk, refer to The status of the pod that uses a disk is not Running.
- If the pod uses an Apsara File Storage NAS (NAS) file system, refer to The status of the pod that uses a NAS file system is not Running.
- If the pod uses an Object Storage Service (OSS) bucket, refer to The status of the pod that uses an OSS bucket is not Running.
Troubleshoot the issue that the status of the persistent volume claim (PVC) is not Bound.
- If the PVC corresponds to a disk, refer to The status of the disk PVC is not Bound.
- If the PVC corresponds to a NAS file system, refer to The status of the NAS PVC is not Bound.
- If the PVC corresponds to an OSS bucket, refer to The status of the OSS PVC is not Bound.
If the issue persists, submit a ticket.
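To find the PVCs that need attention, you can filter the PVC list by status. This is a minimal sketch that assumes the column order NAMESPACE, NAME, STATUS printed by kubectl get pvc --all-namespaces. The function name pvcs_not_bound is hypothetical.

```shell
# Print namespace/name and status of every PVC that is not Bound.
# Reads `kubectl get pvc --all-namespaces --no-headers` on stdin.
pvcs_not_bound() {
  awk '$3 != "Bound" { print $1 "/" $2, $3 }'
}
```

Example usage: kubectl get pvc --all-namespaces --no-headers | pvcs_not_bound.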

Troubleshoot component update failures

If you fail to update the csi-provisioner and csi-plugin components, perform the following steps to troubleshoot the issue.

csi-provisioner

By default, the csi-provisioner component is deployed by using a Deployment that creates two pods. The two pods are mutually exclusive and cannot be scheduled to the same node. If the update fails, check whether only one node is available in the cluster, because the second pod cannot be scheduled in that case.
For version 1.14 or earlier, the csi-provisioner component is deployed by using a StatefulSet. If the csi-provisioner component in your cluster is deployed by using a StatefulSet, you can run the kubectl delete sts csi-provisioner command to delete the current csi-provisioner component. Then, log on to the Container Service for Kubernetes (ACK) console and reinstall the csi-provisioner component. For more information, see Manage system components.
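The single-node check can be sketched as follows. The function name schedulable_node_count is hypothetical. It counts only nodes whose STATUS is exactly Ready and does not take taints into account, so treat the result as a first hint.

```shell
# Count nodes whose STATUS column is exactly "Ready".
# Reads `kubectl get nodes --no-headers` on stdin.
# Nodes reported as "Ready,SchedulingDisabled" or "NotReady" are not counted.
schedulable_node_count() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}
```

Example usage: kubectl get nodes --no-headers | schedulable_node_count. A result lower than 2 means that the two csi-provisioner pods cannot be placed on separate nodes.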

csi-plugin

Check whether the cluster contains nodes that are in the NotReady state. If NotReady nodes exist, ACK fails to update the DaemonSet that is used to deploy the csi-plugin component.
If you fail to update the csi-plugin component but all plug-ins work as expected, the failure is caused by an update timeout. When a timeout occurs while the component center updates the csi-plugin component, the component center automatically rolls back the update. To resolve this issue, submit a ticket.
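A short sketch for listing NotReady nodes before you retry the update. The function name not_ready_nodes is hypothetical, and the STATUS column is assumed to be the second column of kubectl get nodes.

```shell
# Print the names of nodes whose STATUS contains NotReady.
# Reads `kubectl get nodes --no-headers` on stdin.
not_ready_nodes() {
  awk '$2 ~ /(^|,)NotReady(,|$)/ { print $1 }'
}
```

Example usage: kubectl get nodes --no-headers | not_ready_nodes.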

Disk troubleshooting

Note

To mount a disk to a node, make sure that the node and the disk are in the same region and zone. A disk cannot be mounted to a node in a different region or zone.
The types of disks supported by different types of Elastic Compute Service (ECS) instances vary. For more information, see Overview of instance families.
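To see which nodes can host a disk in a given zone, you can list nodes together with their zone label. The sketch below assumes that nodes carry the standard topology.kubernetes.io/zone label and that you already know the zone ID of the disk. The function name nodes_in_zone is hypothetical.

```shell
# Print the names of nodes located in the zone passed as $1.
# Reads `kubectl get nodes -L topology.kubernetes.io/zone --no-headers` on stdin,
# where the zone label is the last column.
nodes_in_zone() {
  awk -v zone="$1" '$NF == zone { print $1 }'
}
```

Example usage: kubectl get nodes -L topology.kubernetes.io/zone --no-headers | nodes_in_zone <zone-id>.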

The status of the pod is not Running

Problem:

The status of the PVC is Bound but the status of the pod is not Running.

Cause:

No node is available for scheduling.
An error occurs when the system mounts the disk.
The ECS instance does not support the specified disk type.

Solution:

Schedule the pod to another node. For more information, see Schedule pods to specific nodes.
Run the kubectl describe pods <pod-name> command to view the pod event.
- Troubleshoot the issue based on the event.
  - If an error occurs when the system mounts a disk, refer to FAQ about disk volumes.
  - If an error occurs when the system unmounts a disk, refer to FAQ about disk volumes.
- If no event is displayed, submit a ticket.
If the ECS instance does not support the specified disk type, select a disk type that is supported by the ECS instance. For more information, see Overview of instance families.
To troubleshoot ECS API issues, refer to ErrorCode.
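The pod events can also be summarized by reason. The following sketch maps a few standard Kubernetes event reasons to the causes listed above. The category names and the function name summarize_pod_warnings are assumptions made for illustration.

```shell
# Print the reason of each Warning event together with a rough category.
# Reads `kubectl get events --no-headers` lines on stdin
# (columns: LAST SEEN, TYPE, REASON, OBJECT, MESSAGE).
summarize_pod_warnings() {
  awk '$2 == "Warning" {
    hint = "other"
    if ($3 == "FailedScheduling") hint = "scheduling"
    else if ($3 == "FailedAttachVolume" || $3 == "FailedMount") hint = "disk-attach-or-mount"
    print $3, hint
  }'
}
```

Example usage: kubectl get events --field-selector involvedObject.name=<pod-name> --no-headers | summarize_pod_warnings.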

The status of the PVC is not Bound

Problem:

The status of the PVC is not Bound and the status of the pod is not Running.

Cause:

Static: The PVC and the persistent volume (PV) do not match, so they cannot be bound. For example, the selector configuration of the PVC is different from that of the PV, the PVC and PV use different StorageClass names, or the status of the PV is Released.
Dynamic: The csi-provisioner component fails to create the disk.

Solution:

Static: Check the relevant YAML content. For more information, see Mount a statically provisioned disk volume by using kubectl.
Note: If the status of the PV is Released, the PV cannot be reused. Create a new PV to use the disk.
Dynamic: Run the kubectl describe pvc <pvc-name> -n <namespace> command to view the PVC event.
- Troubleshoot the issue based on the event.
  - If an error occurs when the system creates a disk, refer to FAQ about disk volumes. The FAQ covers the following questions by type:
    - Disk creation
      - Why does the system prompt InvalidDataDiskCatagory.NotSupported when I create a dynamically provisioned PV?
      - Why does the system prompt The specified AZone inventory is insufficient when I create a dynamically provisioned PV?
      - Why does the system prompt disk size is not supported when I create a dynamically provisioned PV?
      - Why does the system prompt waiting for first consumer to be created before binding when I create a dynamically provisioned PV?
      - Why does the system prompt no topology key found on CSINode node-XXXX and fail to create a dynamically provisioned PV?
    - Disk mounting
      - Why does the system prompt had volume node affinity conflict when I launch a pod that has a disk mounted?
      - Why does the system prompt can't find disk when I launch a pod that has a disk mounted?
      - Why does the system prompt Previous attach action is still in process when I launch a pod that has a disk mounted?
      - Why does the system prompt InvalidInstanceType.NotSupportDiskCategory when I launch a pod that has a disk mounted?
      - Why does the system prompt diskplugin.csi.alibabacloud.com not found in the list of registered CSI drivers when I launch a pod that has a disk mounted?
      - Why does the system prompt Unable to attach or mount volumes: unmounted volumes=[xxx], unattached volumes=[xxx]: timed out waiting for the condition when I start a pod that uses a disk volume?
      - Why does the system prompt validate error Device /dev/nvme1n1 has error format more than one digit locations when I start a pod that uses a disk volume?
      - Why does the system prompt ecs task is conflicted when I launch a pod that has a disk mounted?
      - Why does the system prompt wrong fs type, bad option, bad superblock on /dev/xxxxx missing codepage or helper program, or other error when I launch a pod that has a disk mounted?
    - Disk unmounting
      - Why does the system prompt The specified disk is not a portable disk when I delete a pod that has a disk mounted?
      - What do I do when I failed to delete a pod that has a disk mounted and the kubelet generates pod logs that are not managed by ACK?
      - What do I do when the system failed to recreate a deleted pod and prompts that the mounting fails?
      - Why does the system prompt target is busy when I delete a pod that has a disk mounted?
    - Disk resizing
      - Why does the system generate the Waiting for user to (re-)start a pod to finish file system resize of volume on node PVC event and fail to dynamically expand a disk?
    - Disk usage
      - Why does the system prompt input/output error when an application performs read and write operations on the mount directory of a disk volume?
  - If an error occurs when the system expands a disk, refer to FAQ about disk volumes.
- If no event is displayed, submit a ticket.
If an error occurs when you call the ECS API to create a disk, refer to ErrorCode and troubleshoot the issue. If the issue persists, submit a ticket.
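For the static case, a quick way to spot PVs that can no longer be bound is to filter the PV list by phase. A minimal sketch, assuming the default column order of kubectl get pv, in which STATUS is the fifth column and CLAIM is the sixth. The function name released_pvs is hypothetical.

```shell
# Print the name and former claim of every PV in the Released phase.
# Reads `kubectl get pv --no-headers` on stdin.
released_pvs() {
  awk '$5 == "Released" { print $1, $6 }'
}
```

Example usage: kubectl get pv --no-headers | released_pvs.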

NAS troubleshooting

Note

To mount a NAS file system to a node, make sure that the node and NAS file system are deployed in the same virtual private cloud (VPC). If the node and NAS file system are deployed in different VPCs, use Cloud Enterprise Network (CEN) to connect them.
You can mount a NAS file system to a node that is deployed in a zone different from the NAS file system.
The path to which an Extreme NAS file system or CPFS 2.0 file system is mounted must start with /share.
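The /share rule can be checked before you create the PV. This sketch interprets the rule strictly: it accepts /share itself and any subdirectory of /share. That interpretation and the function name valid_share_path are assumptions.

```shell
# Return 0 if the path in $1 is /share or a subdirectory of /share.
valid_share_path() {
  case "$1" in
    /share|/share/*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Example usage: valid_share_path /share/data && echo "path is accepted".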

The status of the pod is not Running

Problem:

The status of the PVC is Bound but the status of the pod is not Running.

Cause:

fsGroup is set in the securityContext of the pod that mounts the NAS file system. The resulting chmod operation is slow because a large number of files need to be processed.
Port 2049 is blocked in the security group rules.
The NAS file system and node are deployed in different VPCs.

Solution:

Check whether fsGroup is configured in the securityContext of the pod. If it is, delete the setting, restart the pod, and try to mount the NAS file system again.
Check whether port 2049 of the node that hosts the pod is blocked. If yes, unblock the port and try again. For more information, see Add a security group rule.
If the NAS file system and node are deployed in different VPCs, use CEN to connect them.
For other causes, run the kubectl describe pods <pod-name> command to view the pod event.
- Troubleshoot the issue based on the event. For more information, see FAQ about NAS volumes.
- If no event is displayed, submit a ticket.
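Port reachability can be tested from the node that hosts the pod. The following sketch relies on the /dev/tcp redirection of bash and on the timeout command from coreutils, and assumes that both are available on the node. The function name port_open is hypothetical.

```shell
# Return 0 if a TCP connection to host $1 on port $2 (default 2049)
# can be opened within 3 seconds.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "${2:-2049}" 2>/dev/null
}
```

Example usage: port_open <mount-target-domain> || echo "port 2049 is not reachable". A failure only shows that the connection could not be opened. It does not tell you which security group rule blocks the port.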

The status of the PVC is not Bound

Problem:

The status of the PVC is not Bound and the status of the pod is not Running.

Cause:

Static: The PVC and the PV do not match, so they cannot be bound. For example, the selector configuration of the PVC is different from that of the PV, the PVC and PV use different StorageClass names, or the status of the PV is Released.
Dynamic: The csi-provisioner component fails to mount the NAS file system.

Solution:

Static: Check the relevant YAML content. For more information, see Mount a statically provisioned NAS volume by using kubectl.
Note: If the status of the PV is Released, the PV cannot be reused. Create a new PV that uses the NAS file system.
Dynamic: Run the kubectl describe pvc <pvc-name> -n <namespace> command to view the PVC event.
- Troubleshoot the issue based on the event. For more information, see FAQ about NAS volumes.
- If no event is displayed, submit a ticket.
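For the static case, two of the conditions can be compared directly. The sketch below takes the StorageClass name of the PVC, the StorageClass name of the PV, and the phase of the PV as arguments. It assumes that a PV must be in the Available phase to be bound by a new PVC. The function name static_binding_problems is hypothetical.

```shell
# Print one line for each condition that prevents static binding.
# $1: storageClassName of the PVC
# $2: storageClassName of the PV
# $3: phase of the PV
static_binding_problems() {
  [ "$1" = "$2" ] || echo "StorageClass mismatch: PVC='$1' PV='$2'"
  [ "$3" = "Available" ] || echo "PV phase is '$3', expected Available"
  return 0
}
```

The three values can be read with kubectl get pvc <pvc-name> -n <namespace> -o jsonpath='{.spec.storageClassName}', kubectl get pv <pv-name> -o jsonpath='{.spec.storageClassName}', and kubectl get pv <pv-name> -o jsonpath='{.status.phase}'. The sketch does not compare label selectors.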

OSS troubleshooting

Note

When you mount an OSS bucket to a node, you need to specify the AccessKey pair in the PV. You can store the AccessKey pair in a Secret.
If the OSS bucket and node are created in different regions, set Bucket URL to the public endpoint of the OSS bucket. If the OSS bucket and node are created in the same region, we recommend that you use the private endpoint of the OSS bucket.
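The endpoint rule can be expressed as a small helper. This sketch assumes the common OSS naming pattern oss-<region>.aliyuncs.com for public endpoints and oss-<region>-internal.aliyuncs.com for private endpoints. Verify the endpoint of your region before you use the result. The function name oss_endpoint is hypothetical.

```shell
# Print the endpoint to use for a bucket.
# $1: region of the OSS bucket, $2: region of the node
oss_endpoint() {
  if [ "$1" = "$2" ]; then
    echo "oss-$1-internal.aliyuncs.com"   # same region: private endpoint
  else
    echo "oss-$1.aliyuncs.com"            # different regions: public endpoint
  fi
}
```

Example usage: oss_endpoint cn-hangzhou cn-beijing prints the public endpoint because the regions differ.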

The status of the pod is not Running

Problem:

The status of the PVC is Bound but the status of the pod is not Running.

Cause:

fsGroup is set in the securityContext of the pod that mounts the OSS bucket. The resulting chmod operation is slow because a large number of files need to be processed.
The OSS bucket and node are created in different regions and the private endpoint of the OSS bucket is used. As a result, the node fails to connect to the bucket endpoint.

Solution:

Check whether fsGroup is configured in the securityContext of the pod. If it is, delete the setting, restart the pod, and try to mount the OSS bucket again.
Check whether the OSS bucket and node are created in the same region. If they are created in different regions, check whether the private endpoint of the OSS bucket is used. If yes, change to the public endpoint of the OSS bucket.
For other causes, run the kubectl describe pods <pod-name> command to view the pod event.
- Troubleshoot the issue based on the event. For more information, see FAQ about OSS volumes.
- If no event is displayed, submit a ticket.
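Whether fsGroup is set can be checked from the pod manifest. A minimal sketch, assuming that the setting appears as an fsGroup key with a numeric value, as it does in the securityContext of a pod. The function name has_fsgroup is hypothetical.

```shell
# Return 0 if the pod manifest (YAML) read from stdin sets fsGroup.
has_fsgroup() {
  grep -Eq '^[[:space:]]*fsGroup:[[:space:]]*[0-9]+'
}
```

Example usage: kubectl get pod <pod-name> -o yaml | has_fsgroup && echo "fsGroup is set".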

The status of the PVC is not Bound

Problem:

The status of the PVC is not Bound and the status of the pod is not Running.

Cause:

Static: The PVC and the PV do not match, so they cannot be bound. For example, the selector configuration of the PVC is different from that of the PV, the PVC and PV use different StorageClass names, or the status of the PV is Released.
Dynamic: The csi-provisioner component fails to mount the OSS bucket.

Solution:

Static: Check the relevant YAML content. For more information, see Mount an OSS bucket as a statically provisioned volume by using kubectl.
Note: If the status of the PV is Released, the PV cannot be reused. Create a new PV that uses the OSS bucket.
Dynamic: Run the kubectl describe pvc <pvc-name> -n <namespace> command to view the PVC event.
- Troubleshoot the issue based on the event. For more information, see FAQ about OSS volumes.
- If no event is displayed, submit a ticket.