This topic describes the diagnostic procedure for pods and how to troubleshoot pod errors. This topic also provides answers to some frequently asked questions about pods.
Table of contents
- Diagnostic procedure
- Common troubleshooting methods
- FAQ and solutions
Diagnostic procedure
- Check whether a pod runs as normal. For more information, see Check the status of a pod.
- If the pod does not run as normal, you can identify the cause by checking the events, log, and configurations of the pod. For more information, see Common troubleshooting methods. For more information about the abnormal states of pods and how to troubleshoot pod errors, see Abnormal states of pods and troubleshooting.
- If the pod is in the Running state but does not run as normal, see Pods remain in the Running state but do not run as normal.
- If an out of memory (OOM) error occurs in the pod, see Troubleshoot OOM errors in pods.
- If the issue persists, see Submit a ticket.
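If you have kubectl access to the cluster, you can run a quick triage before working through the sections below. A minimal sketch, using the same `[$Pod]` and `[$namespace]` placeholder convention as the commands later in this topic:

```shell
# List pods that are not in the Running or Completed state.
kubectl get pods -n [$namespace] --no-headers | grep -vE 'Running|Completed'

# Show the status, container states, and recent events of a specific pod.
kubectl describe pod [$Pod] -n [$namespace]
```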
Abnormal states of pods and troubleshooting
Pod status | Description | Solution |
---|---|---|
Pending | The pod is not scheduled to a node. | Pods remain in the Pending state |
Init:N/M | The pod contains M init containers and N init containers are started. | Pods remain in the Init:N/M state, Init:Error state, or Init:CrashLoopBackOff state |
Init:Error | Init containers fail to start up. | Pods remain in the Init:N/M state, Init:Error state, or Init:CrashLoopBackOff state |
Init:CrashLoopBackOff | Init containers are stuck in a startup loop. | Pods remain in the Init:N/M state, Init:Error state, or Init:CrashLoopBackOff state |
Completed | The pod has completed the startup command. | Pods remain in the Completed state |
CrashLoopBackOff | The pod is stuck in a startup loop. | Pods remain in the CrashLoopBackOff state |
ImagePullBackOff | The pod fails to pull the container image. | Pods remain in the ImagePullBackOff state |
Running | The pod is in the Running state but does not run as normal. | Pods remain in the Running state but do not run as normal |
Terminating | The pod is being terminated. | Pods remain in the Terminating state |
Evicted | The pod is evicted. | Pods remain in the Evicted state |
Common troubleshooting methods
Check the status of a pod
Check the details of a pod
- Log on to the ACK console.
- In the left-side navigation pane of the ACK console, click Clusters.
- On the Clusters page, find the cluster that you want to manage. Then, click the name of the cluster or click Details in the Actions column of the cluster.
- In the left-side navigation pane of the details page, choose Workloads > Pods.
- In the upper-left corner of the Pods page, select the namespace to which the pod belongs. In the list of pods, find the pod and click the name of the pod or click View Details in the Actions column to view information about the pod. You can view the name, image, and IP address of the pod and the node that hosts the pod.
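If you prefer the CLI, the same details are available from kubectl. A sketch:

```shell
# Show the pod IP, hosting node, readiness, and restarts at a glance.
kubectl get pod [$Pod] -n [$namespace] -o wide

# Show full details, including images, conditions, and events.
kubectl describe pod [$Pod] -n [$namespace]
```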
Check the configurations of a pod
- Log on to the ACK console.
- In the left-side navigation pane of the ACK console, click Clusters.
- On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.
- In the left-side navigation pane of the details page, choose Workloads > Pods.
- In the upper-left corner of the Pods page, select the namespace to which the pod belongs. In the list of pods, find the pod and click the name of the pod or click View Details in the Actions column.
- In the upper-right corner of the pod details page, click Edit to view the YAML file and configurations of the pod.
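Alternatively, a sketch of how to print the pod configuration with kubectl:

```shell
# Print the full YAML configuration of the pod, including defaults
# that were applied by the API server at admission time.
kubectl get pod [$Pod] -n [$namespace] -o yaml
```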
Check the events of a pod
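A sketch of how to list the events of a pod with kubectl:

```shell
# List events that reference the pod, sorted by time.
kubectl get events -n [$namespace] \
  --field-selector involvedObject.name=[$Pod] \
  --sort-by=.lastTimestamp
```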
Check the log of a pod
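A sketch of how to print pod logs with kubectl (the `[$container]` placeholder follows the same convention as `[$Pod]`):

```shell
# Print the log of the pod. Use -c to select a container in a
# multi-container pod.
kubectl logs [$Pod] -n [$namespace] -c [$container]

# Print the log of the previous container instance, which is useful
# after the container has restarted.
kubectl logs [$Pod] -n [$namespace] --previous
```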
Check the monitoring information about a pod
- Log on to the ACK console.
- In the left-side navigation pane of the ACK console, click Clusters.
- On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.
- In the left-side navigation pane of the cluster details page, choose Operations > Prometheus Monitoring.
- On the Prometheus Monitoring page, click the Cluster Overview tab to view the following monitoring information about pods: CPU usage, memory usage, and network I/O.
Log on to a container by using the terminal
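A sketch of how to open a terminal in a container with kubectl:

```shell
# Open an interactive shell in the container. Use /bin/sh because some
# images do not include bash.
kubectl exec -it [$Pod] -n [$namespace] -c [$container] -- /bin/sh
```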
Pod diagnostics
Pods remain in the Pending state
Cause
If a pod remains in the Pending state, the pod cannot be scheduled to a node. This issue occurs if the pod depends on resources that do not exist, the cluster does not have sufficient resources, the pod uses a hostPort that is unavailable, or nodes are configured with taints for which the pod lacks toleration rules.
Symptom
The pod remains in the Pending state.
Solution
Check the events of the pod and identify the reason why the pod cannot be scheduled to a node based on the events. Possible causes:
- Resource dependency
Some pods cannot be created without specific cluster resources, such as ConfigMaps and persistent volume claims (PVCs). For example, before you specify a PVC for a pod, you must associate the PVC with a persistent volume (PV).
- Insufficient resources
- On the cluster details page, choose Nodes. On the Nodes page, check the usage of the following resources in the cluster: pod, CPU, and memory.
Note Even if the CPU and memory usage on a node is low, the scheduler does not schedule a pod to the node if this would cause the resource usage on the node to exceed the upper limit. This prevents resources on the node from being exhausted during peak hours.
- If the CPU or memory resources in the cluster are exhausted, you can use the following methods to resolve the issue:
- Delete the pods that are no longer needed. For more information, see Manage pods.
- Modify the resource configurations for pods based on your business requirements. For more information, see Modify the upper limit and lower limit of CPU and memory resources for pods.
- Add nodes to the cluster. For more information, see Modify the expected number of nodes in a node pool.
- Upgrade the nodes in the cluster. For more information, see Upgrade worker node configurations.
- Use of hostPort
If you configure a hostPort for a pod, the value of Replicas that you specify for the Deployment or ReplicationController cannot be greater than the number of nodes in the cluster, because each node provides only one instance of each host port. If the host port on a node is already used by another application, the pod cannot be scheduled to that node. We recommend that you do not use hostPort. Instead, create a Service and use the Service to access the pod. For more information, see Service.
- Taints and toleration rules
If the events of the pod contain Taints or Tolerations, the pod fails to be scheduled because of taints. You can delete the taints or configure toleration rules for the pod. For more information, see Manage taints, Create a stateless application by using a Deployment, and Taints and toleration rules.
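The checks above can be sketched with kubectl. The `[$Node]`, `[$key]`, and `[$value]` placeholders below follow the same convention as `[$Pod]` and are illustrative:

```shell
# Resource dependency: a Pending (unbound) PVC blocks the scheduling of
# pods that mount it.
kubectl get pvc -n [$namespace]

# Insufficient resources: show the allocatable capacity of a node and
# the amounts already requested by scheduled pods.
kubectl describe node [$Node] | grep -A 8 'Allocated resources'

# Taints: list the taints configured on each node.
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints'

# Remove a taint from a node (the trailing "-" deletes the taint).
kubectl taint nodes [$Node] [$key]=[$value]:NoSchedule-
```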
Pods remain in the Init:N/M state, Init:Error state, or Init:CrashLoopBackOff state
Cause
- If a pod remains in the Init:N/M state, the pod contains M init containers, N of which have started, and M-N of which have not yet started successfully.
- If a pod remains in the Init:Error state, the init containers in the pod fail to start up.
- If a pod remains in the Init:CrashLoopBackOff state, the init containers in the pod are stuck in a startup loop.
Symptom
- Pods remain in the Init:N/M state.
- Pods remain in the Init:Error state.
- Pods remain in the Init:CrashLoopBackOff state.
Solution
- View the events of the pod and check whether errors occur in the init containers that fail to start up in the pod. For more information, see Check the events of a pod.
- Check the logs of the init containers that fail to start up in the pod and troubleshoot the issue based on the log data. For more information, see Check the log of a pod.
- Check the configuration of the pod and make sure that the configuration of the init containers that fail to start up is valid. For more information, see Check the configurations of a pod. For more information about init containers, see Debug init containers.
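The steps above can be sketched with kubectl (the `[$init-container]` placeholder is illustrative):

```shell
# List the init containers of the pod and their current states.
kubectl get pod [$Pod] -n [$namespace] \
  -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'

# Print the log of a specific init container.
kubectl logs [$Pod] -n [$namespace] -c [$init-container]
```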
Pods remain in the ImagePullBackOff state
Cause
If a pod remains in the ImagePullBackOff state, the pod is scheduled to a node but the pod fails to pull the container image.
Symptom
Pods remain in the ImagePullBackOff state.
Solution
Check the events of the pod to identify the name of the container image that fails to be pulled. Then, verify that the image address is valid and that the node can access the image repository.
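A sketch of how you might confirm the failing image from the CLI and test the pull manually on the node (`crictl` availability depends on the container runtime installed on the node, and `[$image]` is an illustrative placeholder):

```shell
# Find the image pull error in the pod events.
kubectl describe pod [$Pod] -n [$namespace] | grep -A 3 'Failed'

# On the node, try pulling the image manually to see the exact error,
# such as an authentication failure or an unreachable registry.
crictl pull [$image]
```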
Pods remain in the CrashLoopBackOff state
Cause
If a pod remains in the CrashLoopBackOff state, the application in the pod encounters an error.
Symptom
Pods remain in the CrashLoopBackOff state.
Solution
- View the events of the pod and check whether errors occur in the pod. For more information, see Check the events of a pod.
- Check the log of the pod and troubleshoot the issue based on the log data. For more information, see Check the log of a pod.
- Inspect the configurations of the pod and check whether the health check configurations are valid. For more information, see Check the configurations of a pod. For more information about health checks for pods, see Configure liveness, readiness, and startup probes.
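The steps above can be sketched with kubectl:

```shell
# Print the log of the last crashed container instance.
kubectl logs [$Pod] -n [$namespace] --previous

# Show the exit code and reason of the most recent container termination.
kubectl get pod [$Pod] -n [$namespace] \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'
```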
Pods remain in the Completed state
Cause
If a pod is in the Completed state, the containers in the pod have completed the startup command and all the processes in the containers have exited.
Symptom
Pods remain in the Completed state.
Solution
- Inspect the configurations of the pod and check the startup command that is executed by the containers in the pod. For more information, see Check the configurations of a pod.
- Check the log of the pod and troubleshoot the issue based on the log data. For more information, see Check the log of a pod.
Pods remain in the Running state but do not run as normal
Cause
The YAML file that is used to deploy the pod contains errors.
Symptom
Pods remain in the Running state but do not run as normal.
Solution
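This topic does not prescribe specific commands for this case. Because the cause is an error in the YAML file, one common approach, sketched below, is to let the API server validate the file and compare the live object against it (the `[$file]` placeholder is illustrative):

```shell
# Validate the YAML file against the API schema before applying it.
kubectl apply --validate -f [$file].yaml

# Show the differences between the live object and the local YAML file.
kubectl diff -f [$file].yaml
```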
Pods remain in the Terminating state
Cause
If a pod is in the Terminating state, the pod is being terminated.
Symptom
Pods remain in the Terminating state.
Solution
Pods that remain in the Terminating state are deleted after a period of time. If a pod remains in the Terminating state for a long period of time, you can run the following command to forcefully delete the pod:
kubectl delete pod [$Pod] -n [$namespace] --grace-period=0 --force
Pods remain in the Evicted state
Cause
The kubelet automatically evicts one or more pods from a node to reclaim resources when the usage of certain resources on the node reaches a threshold. These resources include memory, storage, file system index nodes (inodes), and operating system process identifiers (PIDs).
Symptom
Pods remain in the Evicted state.
Solution
- Memory pressure:
- Modify the resource configurations for pods based on your business requirements. For more information, see Modify the upper limit and lower limit of CPU and memory resources for pods.
- Upgrade the nodes in the cluster. For more information, see Upgrade worker node configurations.
- Disk pressure:
- Periodically clear the pod logs on the node so that they do not exhaust the storage space.
- Expand the storage of the node. For more information, see Resize disks online for Linux and Windows instances.
- PID pressure: Modify the resource settings of the pod based on your business requirements. For more information, see Process ID Limits And Reservations.
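A sketch of how to inspect and clean up evicted pods with kubectl. Note that evicted pods have the phase Failed, so the bulk deletion below removes all Failed pods in the namespace; review the list first:

```shell
# Show why the pod was evicted (for example, memory or disk pressure).
kubectl describe pod [$Pod] -n [$namespace] | grep -A 2 'Status:'

# Evicted pods are not cleaned up automatically; delete Failed pods in bulk.
kubectl get pods -n [$namespace] --field-selector=status.phase=Failed -o name \
  | xargs kubectl delete -n [$namespace]
```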
Troubleshoot OOM errors in pods
Cause
If the memory usage of a container in the cluster exceeds the specified memory limit, the container may be terminated and trigger an OOM event, which causes the container to exit. For more information about OOM events, see Allocate memory resources to containers and pods.
Symptom
- If the process that is terminated is essential to the container, the container may restart.
- If an OOM error occurs, log on to the Container Service for Kubernetes (ACK) console and navigate to the pod details page. On the Events tab, you can view the following OOM event: pod was OOM killed. For more information, see Check the events of a pod.
- If you configure alert rules for pod exceptions in the cluster, you can receive alert notifications when an OOM event occurs. For more information about how to configure the alert rules, see Alert management.
- Check the node that hosts the pod in which an OOM error occurs.
- Use commands: Run the following command to query information about the pod:
kubectl get pod [$Pod] -o wide -n [$namespace]
Expected output:
NAME       READY   STATUS    RESTARTS   AGE   IP            NODE
pod_name   1/1     Running   0          25h   172.20.6.53   cn-hangzhou.192.168.0.198
- In the ACK console: For more information about how to view node information on the pod details page, see Check the details of a pod.
- Use commands: Run the following command to query information about the pod:
- Log on to the node and check the kernel log in the /var/log/message file. Search for the out of memory keyword in the log file to identify the process that was terminated due to the OOM error. If the process is essential to the container, the container restarts after the process is terminated.
- Check the time when the error occurred based on the memory usage graph of the pod. For more information, see Check the monitoring information about a pod.
- Check whether memory leaks occur in the processes of the pod based on the following monitoring information: the points in time when spikes occur in memory usage, log data, and process names.
- If the OOM error is caused by memory leaks, we recommend that you troubleshoot the issue based on your business scenario.
- If the processes run as normal, increase the memory limit of the pod. Make sure that the actual memory usage of the pod does not exceed 80% of the memory limit of the pod. For more information, see Modify the upper limit and lower limit of CPU and memory resources for pods.
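A sketch of how to confirm an OOM termination from the CLI:

```shell
# Show the reason for the most recent termination of the first container.
# A container killed by the OOM killer reports the reason "OOMKilled".
kubectl get pod [$Pod] -n [$namespace] \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```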