Rapid recovery from data volume mount problems

Pods can fail to mount or unmount data volumes for many reasons: design defects in the storage volume, bugs in the related components, or improper use. Because applications and storage systems interact in complex ways, data volume problems need to be approached from two directions:

Avoid problems where possible: improve the stability of the storage components and follow the usage specifications.

Handle problems when they occur: first restore the business quickly, then analyze the problem.

This article describes a quick service recovery scheme. When a Pod fails to restart because of a data volume mount problem, leave the mount problem on the node unresolved for the time being and start the Pod on another node to recover the service quickly. Analyze the problem node after the service is restored.

A Pod is stuck in ContainerCreating status after an update:

For example, if you mount a NAS data volume in a Deployment, the Pod may report a mount failure when it starts:

Warning FailedMount 18s kubelet, cn-shenzhen.192.168.1.24 Unable to mount volumes for pod "nas-static-796b49b5f8-svbvh_default(2d483078-1400-11ea-a9b7-00163e084110)":
timeout expired waiting for volumes to attach or mount for pod "default"/"nas-static-796b49b5f8-svbvh".
list of unmounted volumes=[pvc-nas]. list of unattached volumes=[pvc-nas default-token-9v9hl]

The data volume worked normally before the update, but the Pod cannot start after the update. The events above show that the data volume cannot be mounted. One possibility is that the state of this PV/PVC is abnormal on the node where the Pod is currently scheduled. The specific cause of the anomaly is not investigated here.

To start the Pod quickly by scheduling it to another node, follow these steps:

1. Determine the node where the pod is located:

The error message above names the node: cn-shenzhen.192.168.1.24

The node can also be obtained with the following commands:

# podname="nas-static-796b49b5f8-svbvh"
# namespace="default"
# kubectl describe pod $podname -n $namespace | grep Node: | awk '{print $2}'

cn-shenzhen.192.168.1.24/192.168.1.24
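The output above has the form `<node name>/<IP>`, while the `kubectl taint` commands in the later steps take only the node name. The following is a minimal sketch of how the name could be extracted; the helper name `node_name_from_describe` is hypothetical and not part of kubectl.

```shell
# Hypothetical helper: reads `kubectl describe pod` output on stdin and prints
# only the node name, dropping the "/<IP>" suffix shown in the output above.
node_name_from_describe() {
  awk '$1 == "Node:" { split($2, parts, "/"); print parts[1]; exit }'
}

# Usage (requires access to the cluster):
#   kubectl describe pod "$podname" -n "$namespace" | node_name_from_describe
```

If parsing text output is not desirable, `kubectl get pod "$podname" -n "$namespace" -o jsonpath='{.spec.nodeName}'` should print the node name directly.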

2. Mark the node as unschedulable:

You can use the console to configure the node scheduling status. Refer to

You can also use the following command to add a taint to the node that has the mount problem, so that the Pod is not scheduled to this node again:

# kubectl taint nodes cn-shenzhen.192.168.1.24 key=value:NoSchedule

node/cn-shenzhen.192.168.1.24 tainted
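The taint in the command above uses the placeholder pair `key=value`. As a sketch only, the add and remove operations could be wrapped in two helpers that share a more descriptive key, so the later removal is sure to use the same key. The key `storage-mount-problem` and the function names are assumptions made for illustration.

```shell
# Assumed taint key; any key works as long as the same key is used for removal.
TAINT_KEY="storage-mount-problem"

# Keep new Pods that lack a matching toleration off the given node.
quarantine_node() {
  kubectl taint nodes "$1" "${TAINT_KEY}=true:NoSchedule"
}

# Remove the taint once the node has been repaired (note the trailing "-").
release_node() {
  kubectl taint nodes "$1" "${TAINT_KEY}:NoSchedule-"
}

# Usage (requires access to the cluster):
#   quarantine_node cn-shenzhen.192.168.1.24
#   release_node cn-shenzhen.192.168.1.24
```

A NoSchedule taint only affects Pods scheduled after it is added; Pods already running on the node stay where they are. `kubectl cordon <node>` is another way to mark a node unschedulable.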

3. Restart the problem Pod:

Now restart the problem Pod. The new Pod will not be scheduled to the problem node.

Delete the problem Pod:

# kubectl delete pod nas-static-796b49b5f8-svbvh

pod "nas-static-796b49b5f8-svbvh" deleted

The new Pod starts successfully and is scheduled to a different node:

# kubectl get pod

NAME READY STATUS RESTARTS AGE

nas-static-857b99fcc9-vvzkx 1/1 Running 0 14s

# kubectl describe pod nas-static-857b99fcc9-vvzkx | grep Node

Node: cn-shenzhen.192.168.1.25/192.168.1.25
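When a namespace contains many Pods, checking the recovery by eye is error-prone. The sketch below filters `kubectl get pod` output for Pods that are not yet Running; the helper name `pods_not_running` is hypothetical, and it assumes the column layout shown above (NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: reads `kubectl get pod --no-headers` output on stdin and
# prints "<name> <status>" for every Pod whose STATUS column is not Running.
pods_not_running() {
  awk '$3 != "Running" { print $1, $3 }'
}

# Usage (requires access to the cluster):
#   kubectl get pod --no-headers | pods_not_running
```

Empty output means every listed Pod reports Running. Pods of finished Jobs (status Completed) are also printed, so the result still needs a quick review.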

4. Subsequent processing:

The steps above restore your business quickly, but the underlying problem on the node still exists. You can check and analyze it with the help of [Common Storage Problems].

If you cannot solve the node problem, you can contact Alibaba Cloud container service technical support. After the node problem is solved, you can set the node back to schedulable status through the console or the command line:

# kubectl taint nodes cn-shenzhen.192.168.1.24 key:NoSchedule-

node/cn-shenzhen.192.168.1.24 untainted

A Pod is stuck in Terminating status after an update:

For example, you use a StatefulSet to create an application and mount cloud disk data volumes. When you update the application, the old Pod stays in Terminating status, which prevents the new Pod from starting normally.

# kubectl delete pod web-0

# kubectl get pod

NAME READY STATUS RESTARTS AGE

web-0 0/1 Terminating 0 47m

Log in to the node where the Pod is located and view the following log files:

# tail -f /var/log/alicloud/flexvolume_disk.log

# tail -f /var/log/messages | grep kubelet

If the logs show that the error is caused by a data volume unmount or detach failure, for example:

unmount command failed, status: Failure, reason:

Device is busy
or
Target is busy
or
Orphan Pod
and so on.
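The log files can be long, so it may help to filter them for the failure signatures listed above. This is a rough sketch: the helper name `unmount_failures` is hypothetical, and the patterns are simply the three messages from this article matched case-insensitively, not an exhaustive list of kubelet or FlexVolume errors.

```shell
# Hypothetical helper: reads log lines on stdin and prints only those that
# contain one of the unmount failure signatures mentioned in this article.
unmount_failures() {
  grep -iE 'device is busy|target is busy|orphan'
}

# Usage (on the problem node):
#   grep kubelet /var/log/messages | unmount_failures
#   unmount_failures < /var/log/alicloud/flexvolume_disk.log
```

Because the helper is a thin wrapper around grep, it exits with a non-zero status when no line matches.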

If you need to resume business urgently and have not yet found a solution to the problem, you can force-delete the problem Pod to restore service first.

1. Use the force delete command to end the current pod:

# kubectl delete pod web-0 --force=true --grace-period=0

pod "web-0" force deleted

This command forcibly deletes the Pod information from the etcd database, which makes it possible to create a new Pod (in a StatefulSet, the new Pod is not created until the old Pod has been deleted).
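Force deletion skips the normal shutdown handshake with the node, so it is worth confirming that the Pod really is terminating before running it. The sketch below is one possible guard, not an official procedure: the function name is hypothetical, and it assumes that a terminating Pod has `metadata.deletionTimestamp` set.

```shell
# Hypothetical guard: force-delete a Pod only if it is already terminating.
# Arguments: <pod name> [namespace, defaults to "default"]
force_delete_if_terminating() {
  fd_pod="$1"
  fd_ns="${2:-default}"
  # deletionTimestamp is empty for Pods that are not being deleted.
  fd_ts="$(kubectl get pod "$fd_pod" -n "$fd_ns" \
    -o jsonpath='{.metadata.deletionTimestamp}')" || return 1
  if [ -z "$fd_ts" ]; then
    echo "pod $fd_pod is not terminating; skipping force delete" >&2
    return 2
  fi
  kubectl delete pod "$fd_pod" -n "$fd_ns" --force --grace-period=0
}

# Usage (requires access to the cluster):
#   force_delete_if_terminating web-0 default
```

The old container and its volume mount may still exist on the node after a force deletion, which is why the node still needs to be analyzed afterwards.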


2. If the new Pod fails to start and is stuck in ContainerCreating:

Follow the procedure in the ContainerCreating section above: mark the node as unschedulable and quickly restore the Pod on another node.

3. Log in to the problem node and analyze the cause:

Log in to the node where the problem occurs and troubleshoot it with the help of [Frequently Asked Storage Questions]. If the problem cannot be solved, you can contact Alibaba Cloud container service technical support.
