Container Service for Kubernetes (ACK) provides the pod diagnostics feature to help you identify the root cause of pod issues and offers fix suggestions. It uses two diagnostic modes built on different knowledge sources: expert mode, based on curated operational expertise, and AI mode, trained on large amounts of data.
When you run pod diagnostics, ACK executes a data collection program in your cluster. The program collects system version information, workload status, Docker and kubelet status, and key error entries from system logs. It does not collect business data or sensitive information.
How it works
Pod diagnostics runs through four sequential phases to generate results:
Anomaly identification — collects baseline data (pod status, cluster event streams) and flags anomalies.
Data collection — gathers context-specific data based on the identified anomalies.
Diagnostic item check — evaluates key metrics against collected data to detect abnormal conditions.
Root cause analysis — combines collected data with check results to determine the root cause.
Diagnostic results include two parts:
Root cause analysis results: detected anomalies, root cause, and suggestions for fixes.
Diagnostic item check results: per-item check outcomes, which surface issues that root cause analysis may not catch.
Supported scenarios
Pod diagnostics
| Scenario |
|---|
| Pods are not processed by the scheduler |
| Pods cannot be scheduled because they do not meet scheduling constraint requirements |
| Pods are scheduled but not processed by the kubelet |
| Pods are waiting for volumes to reach the Ready state |
| Pods are evicted |
| Sandboxed containers in pods fail to be created |
| Pods remain in the Terminating state |
| Out-of-memory (OOM) errors occur in containers in pods |
| Containers in pods exit unexpectedly |
| Containers in pods remain in the CrashLoopBackOff state |
| Containers in pods are not ready |
| Pods fail to pull container images |
| Timeout errors occur when pods pull container images |
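Many of these scenarios can be narrowed down by hand before or alongside a diagnostic run. A minimal triage sketch, assuming a pod named my-app-pod in the default namespace (both placeholders):

```shell
# Placeholder names; substitute your own pod and namespace.
POD=my-app-pod
NS=default

if command -v kubectl >/dev/null 2>&1; then
  # Phase and per-container state: Pending, CrashLoopBackOff, ImagePullBackOff, ...
  kubectl -n "$NS" get pod "$POD" -o wide

  # Scheduling failures, evictions, and image pull errors surface as events.
  kubectl -n "$NS" describe pod "$POD" | sed -n '/^Events:/,$p'

  # Last termination per container: OOMKilled reasons and exit codes.
  kubectl -n "$NS" get pod "$POD" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{" exit="}{.lastState.terminated.exitCode}{"\n"}{end}'
else
  echo "kubectl not found; run these commands from a machine with cluster access"
fi
```

The event stream from `kubectl describe` is usually the fastest way to tell a scheduling problem apart from a runtime one.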
AI-assisted diagnostics
Diagnostic items
Pod diagnostics checks the following categories of items:
| Category | Coverage |
|---|---|
| Pod | Pod status, image pulling, and network connectivity |
| Node | Common node issues, including node status, network status, kernel logs, kernel processes, and service availability |
| NodeComponent | Status of key node components, including the network and volume components |
| ECSControllerManager | Common ECS instance issues, including the status of ECS instances, network connections, the operating system, and disk I/O |
| ClusterComponent | API server availability, DNS service, and NAT gateway status |
Pod
| Diagnostic item | What it checks | Solution |
|---|---|---|
| Number of container restarts | How many times containers in a pod have restarted | Check pod status and logs. See Pod troubleshooting. |
| Container image download failures | Whether other pods on the same node are also failing to pull the container image | Check pod status and logs. See Pod troubleshooting. |
| Validity of Secrets used by pods to pull container images | Whether the Secrets used for image pulling are valid | Check pod status and logs. See Pod troubleshooting. |
| Validity of environment variables of GPU-accelerated pods | Whether NVIDIA_VISIBLE_DEVICES is set in pod environment variables, which may conflict with the kubelet | Check pod status and logs. See Pod troubleshooting. |
| Connectivity between pods and CoreDNS pods | Network connectivity from the pod to CoreDNS pods | Check connectivity between the pod and CoreDNS pods. |
| Connectivity between pods and CoreDNS Service | Network connectivity from the pod to the CoreDNS Service | Check connectivity between the pod and the CoreDNS Service. |
| Connectivity between pods and DNS server in the host network | Network connectivity from the pod to the host network DNS server | Check connectivity between the pod and the DNS server in the host network. |
| D status of container processes in pods | Whether any container process is in the D state (uninterruptible sleep, usually waiting on disk I/O) | Restart the ECS instance hosting the pod. If the issue persists, submit a ticket. |
| Pod initialization | Whether the pod has completed initialization | Check pod status and logs. See Pod troubleshooting. |
| GPU resources requested by pods | Whether the pod has requested GPU resources — rules out missing resource requests as the reason a pod cannot use GPUs | If no GPU resources are requested, check the pod configuration. |
| Pod scheduling | Whether the pod has been scheduled to a node | If the pod has not been scheduled, check the pod configuration. |
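Several of the items above can be reproduced manually. A sketch of the equivalent checks, assuming a pod named my-app-pod in the default namespace (both placeholders) and an image that ships nslookup:

```shell
# Placeholder names; substitute your own pod and namespace.
POD=my-app-pod
NS=default

if command -v kubectl >/dev/null 2>&1; then
  # Restart count per container.
  kubectl -n "$NS" get pod "$POD" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\n"}{end}'

  # Every imagePullSecret referenced by the pod must exist in its namespace.
  for s in $(kubectl -n "$NS" get pod "$POD" -o jsonpath='{.spec.imagePullSecrets[*].name}'); do
    kubectl -n "$NS" get secret "$s" >/dev/null 2>&1 || echo "missing pull secret: $s"
  done

  # NVIDIA_VISIBLE_DEVICES set directly in the pod spec may conflict with the kubelet.
  kubectl -n "$NS" get pod "$POD" \
    -o jsonpath='{.spec.containers[*].env[?(@.name=="NVIDIA_VISIBLE_DEVICES")].value}'

  # DNS resolution from inside the pod exercises the CoreDNS Service path.
  kubectl -n "$NS" exec "$POD" -- nslookup kubernetes.default.svc.cluster.local
else
  echo "kubectl not found; run these checks from a machine with cluster access"
fi
```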
Node
If the issues persist after you apply the solutions described in the following table, collect the node logs and then submit a ticket.
| Diagnostic item | Description | Solution |
|---|---|---|
| Connectivity to the Kubernetes API server | Checks whether the node can connect to the Kubernetes API server of the cluster. | Check the cluster configurations. For more information, see Troubleshoot ACK clusters. |
| AUFS mount hangs | Checks whether AUFS mount hangs occur. | |
| BufferIOError errors | Checks whether BufferIOError errors occur in the node kernel. | |
| Cgroup leaks | Checks whether cgroup leaks occur on the node. Cgroup leaks may interrupt monitoring data collection and cause container startup failures. | Log on to the node and delete the leaked cgroup directory. |
| Abnormal chronyd process status | Checks whether the chronyd process on the node is in an abnormal state. An abnormal chronyd process may affect system clock synchronization. | Run the |
| Image pulling by containerd | Checks whether the containerd runtime can pull images as expected. | Check the node network and image configurations. |
| Containerd status | Checks the status of the containerd runtime. | |
| CoreDNS pod availability | Checks whether the node can access the IP address of the CoreDNS pod. | Check the network path from the node to the CoreDNS pod IP address. For more information, see What to do if DNS query load is unbalanced. |
| Image status | Checks whether images are damaged. | |
| Overlay2 status of images | Checks whether the overlay2 file system in images is damaged. | |
| System time | Checks whether the system time is correct. | None. |
| Docker container startup | Checks whether Docker containers fail to start. | |
| Docker image pulling | Checks whether the node can pull Docker images as expected. | Check the node network and image configurations. |
| Docker status | Checks the status of Docker. | |
| Docker startup time | Checks the startup time of dockerd. | None. |
| Docker hang errors | Checks whether Docker hang errors occur on the node. | Run the |
| ECS instance existence | Checks whether the ECS instance exists. | Check the status of the ECS instance. For more information, see Node and node pool FAQ. |
| ECS instance status | Checks the status of the ECS instance. | Check the ECS instance status. For more information, see Node and node pool FAQ. |
| Ext4FsError errors | Checks whether Ext4FsError errors occur in the node kernel. | |
| Read-only node file system | Checks whether the node file system is read-only. In most cases, the node file system becomes read-only due to disk failures. You cannot write data to a read-only file system, and your business may be affected. | Use the fsck command to repair the node file system and then restart the node. |
| Hardware time | Checks the consistency between the hardware time and the system time. If the difference is longer than 2 minutes, component errors may occur. | Run the |
| DNS | Checks whether domain names can be resolved on the node. | Check whether domain names can be resolved on the node. For more information, see Troubleshoot DNS resolution errors. |
| Kernel oops errors | Checks whether oops errors exist in the node kernel. | |
| Kernel versions | Checks whether the kernel version is outdated. An outdated kernel version may lead to system failures. | Update the node kernel. For more information, see Node and node pool FAQ. |
| DNS availability | Checks whether the node can access the cluster IP address of the kube-dns Service to use the DNS service provided by the cluster. | Check the status and logs of CoreDNS pods. For more information, see Troubleshoot DNS resolution errors. |
| Kubelet status | Checks the kubelet status. | Check the kubelet logs. For more information, see Troubleshoot ACK clusters. |
| Kubelet startup time | Checks the startup time of the kubelet. | None. |
| CPU utilization | Checks whether the CPU utilization of the node is excessively high. | None. |
| Memory utilization | Checks whether the memory utilization of the node is excessively high. | None. |
| Memory fragmentation | Checks whether memory fragments exist on the node. | If memory fragments exist on the node, log on to the node and run the |
| Swap memory | Checks whether swap memory is enabled on the node. | Swap memory must not be enabled. Log on to the node and disable swap memory. |
| Loading of network device drivers | Checks whether errors occur when VirtIO drivers are loaded on network devices. | |
| Excessively high CPU utilization of the node | Checks the CPU utilization of the node within the last week. If the CPU utilization is high and a large number of pods are scheduled to the node, the pods compete for resources, which further increases CPU utilization and may cause service interruptions. | Set resource requests and limits to proper values to avoid running an excessively large number of pods on the node. |
| Private node IP existence | Checks whether the private node IP address exists. | If the private node IP address does not exist, remove the node and then add it back to the cluster. Do not release the ECS instance when you remove the node. For more information, see Remove a node and Add existing nodes. |
| Excessively high memory utilization of the node | Checks the memory utilization of the node within the last week. If the memory utilization is high and a large number of pods are scheduled to the node, the pods compete for resources, which further increases memory utilization, leads to out of memory (OOM) errors, and may cause service interruptions. | Set resource requests and limits to proper values to avoid running an excessively large number of pods on the node. |
| Node status | Checks whether the node is in the Ready state. | Restart the node. For more information, see Node and node pool FAQ. |
| Node schedulability | Checks whether the node is unschedulable. | If the node is unschedulable, check its scheduling configuration. For more information, see Node draining and scheduling status. |
| OOM errors | Checks whether OOM errors occur on the node. | |
| Runtime check | Checks whether the runtime of the node is the same as the runtime of the cluster. | For more information, see If I choose the containerd container runtime when creating a cluster, can I change it to Docker later?. |
| Outdated OS versions | Checks whether the OS version used by the node has known bugs or is outdated and has stability issues. These issues may cause the Docker and containerd components to malfunction. | Update the OS version. |
| Internet access | Checks whether the node can access the Internet. | Check whether SNAT is enabled for the cluster. For more information, see Enable Internet access for an existing ACK cluster. |
| RCUStallError errors | Checks whether RCUStallError errors occur in the node kernel. | |
| OS versions | Checks the OS version used by the node. If an outdated OS version is used, the cluster may not run as normal. | None. |
| Runc process leaks | Checks whether runc process leaks occur on the node. If runc process leaks occur, the node may periodically enter the NotReady state. | Identify the leaked runc processes and manually terminate them. |
| SoftLockupError errors | Checks whether SoftLockupError errors occur in the node kernel. | |
| Systemd hangs | Checks whether systemd hangs occur on the node. | If systemd hangs occur on the node, log on to the node and run the |
| Outdated systemd versions | Checks whether the systemd version used by the node has known bugs. Outdated systemd versions have stability issues that can cause the Docker and containerd components to malfunction. | Update the systemd version. For more information, see systemd. |
| Hung processes | Checks whether hung processes exist on the node. | |
| unregister_netdevice errors | Checks whether unregister_netdevice errors occur in the node kernel. | |
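A few of these node items can be spot-checked directly on the node, for example over SSH. A minimal sketch of the Linux-side checks, assuming a systemd-based host with /proc mounted:

```shell
# Run on the node itself. Assumes a Linux host with /proc mounted.

# Processes in uninterruptible sleep (D state), usually stuck on disk I/O.
ps -eo state,pid,comm | awk '$1 ~ /^D/ {print "D-state:", $2, $3}'

# Read-only mounts of local block devices; a read-only root usually means disk trouble.
awk '$1 ~ /^\/dev\// && $4 ~ /(^|,)ro(,|$)/ {print "read-only:", $2}' /proc/mounts

# Swap must stay disabled on Kubernetes nodes (/proc/swaps has a header line only).
if [ "$(wc -l < /proc/swaps)" -gt 1 ]; then
  echo "swap is enabled"
fi

# chronyd and kubelet service health on systemd hosts.
if command -v systemctl >/dev/null 2>&1; then
  systemctl is-active chronyd kubelet 2>/dev/null || true
fi
```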
NodeComponent
| Diagnostic item | Description | Solution |
|---|---|---|
| CNI component status | Checks whether the Container Network Interface (CNI) plug-in runs as expected. | Check the status of the network component used by the cluster. For more information, see Network management FAQ. |
| CSI component status | Checks whether the Container Storage Interface (CSI) plug-in runs as expected. | Check the status of the volume component used by the cluster. For more information, see CSI storage FAQ. |
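Both component checks can be approximated by inspecting the kube-system workloads. A sketch; terway, flannel, and csi-plugin are typical ACK component names and are assumptions here, so adjust the pattern to your cluster:

```shell
NS=kube-system

if command -v kubectl >/dev/null 2>&1; then
  # Network (CNI) and volume (CSI) components run as kube-system workloads.
  # terway, flannel, and csi-plugin are typical names in ACK clusters; adjust to yours.
  kubectl -n "$NS" get daemonset,deployment 2>/dev/null | grep -Ei 'terway|flannel|csi'

  # Any component pod that is not Running and Ready deserves a closer look.
  kubectl -n "$NS" get pods -o wide | awk 'NR == 1 || /terway|flannel|csi/'
else
  echo "kubectl not found; run from a machine with cluster access"
fi
```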
ECSControllerManager
| Diagnostic item | Description | Solution |
|---|---|---|
| Overdue payments related to ECS instance components | Checks whether the disk or network bandwidth of the ECS instance is unavailable due to overdue payments within your account. | If the disk or network bandwidth of the ECS instance is unavailable due to overdue payments, top up your account. |
| Overdue payments related to the ECS instance | Checks whether the pay-as-you-go ECS instance is suspended due to overdue payments. | If the instance is suspended due to overdue payments, top up your account and then restart the instance. |
| ECS instance NIC status | Checks whether the NIC of the ECS instance works as expected. | If the NIC does not work as expected, restart the instance. |
| ECS instance startup status | Checks whether the boot operation can be performed on the instance as normal. | If the boot operation cannot be performed as normal, create another instance. |
| Status of the ECS instance backend management system | Checks whether the backend management system of the ECS instance works as expected. | If the backend management system does not work as expected, restart the instance. |
| Status of ECS instance CPUs | Checks whether CPU contention or CPU binding failures occur at the underlying layer of the ECS instance. | If CPU contention exists, the instance may fail to obtain CPUs or may encounter other issues. Restart the instance. |
| Split locks in the CPUs of the ECS instance | Checks whether split locks occur in the CPUs of the ECS instance. | For more information, see Detecting and handling split locks. |
| Status of DDoS mitigation for the ECS instance | Checks whether the public IP address of the instance suffers from DDoS attacks. | If the IP address of the ECS instance suffers from DDoS attacks, purchase other anti-DDoS services. For more information, see Comparison of Alibaba Cloud Anti-DDoS solutions. |
| Limited read/write capabilities of the cloud disk | Checks whether the read/write capabilities of the cloud disk are limited. | If the maximum read and write IOPS of the disk has been reached, read and write operations on the disk are limited. For more information about how to view disk metrics, see Block storage performance. |
| Loading of the ECS instance disk | Checks whether the cloud disk can be attached to the ECS instance when the instance is started. | If the instance fails to start because the cloud disk cannot be attached, stop the instance and then start it again. |
| ECS instance expiration | Checks whether the subscription of the instance has expired. | If the ECS instance has expired, renew the instance. For more information, see How to renew a subscription ECS instance. |
| ECS instance OS crashes | Checks whether OS crashes occur on the ECS instance. | If OS crashes occurred on the ECS instance within the last 48 hours, troubleshoot the system logs to identify the cause. For more information, see View instance system logs and screenshots. |
| Status of the ECS instance host | Checks whether failures occur on the physical server on which the ECS instance is deployed. | If failures occur on the physical server, the instance may be in an abnormal state and its performance may be degraded. Restart the instance. |
| Loading of the ECS instance image | Checks whether the ECS instance can load the image when the system initializes the instance. | If the ECS instance fails to load the image due to system or image issues, restart the instance. |
| I/O hangs on the ECS instance disk | Checks whether I/O hangs occur on the system disk of the ECS instance. | If I/O hangs occur on the system disk, check the disk metrics. For more information, see View the monitoring data of a cloud disk. For information about how to troubleshoot I/O hangs on Alibaba Cloud Linux 2, see Detect I/O hangs of file systems and block layers. |
| ECS instance bandwidth upper limit | Checks whether the total bandwidth of the ECS instance has reached the maximum bandwidth allowed for the instance type. | If the total bandwidth has reached the maximum allowed for the instance type, upgrade the instance to an instance type that provides higher bandwidth. For more information, see Configuration change overview. |
| Upper limit of the burst bandwidth of the ECS instance | Checks whether the burst bandwidth of the instance exceeds the upper limit allowed for the instance type. | If the burst bandwidth exceeds the upper limit allowed for the instance type, upgrade the instance to an instance type that provides higher bandwidth. For more information, see Configuration change overview. |
| Loading of the ECS instance NIC | Checks whether the NIC of the ECS instance can be loaded. | If the NIC cannot be loaded, the network connectivity of the instance is affected. Restart the instance. |
| NIC session establishment on the ECS instance | Checks whether sessions can be established to the NIC of the ECS instance. | If sessions cannot be established to the NIC or the maximum number of sessions supported by the NIC is reached, the network connectivity or throughput of the instance is affected. Restart the instance. |
| Key operations on the ECS instance | Checks whether the operations that you recently performed on the instance succeeded. These operations include starting and stopping the instance and upgrading its configurations. | If the operations failed, perform them again. |
| Packet loss on the ECS instance NIC | Checks whether inbound or outbound packet loss occurs on the NIC of the ECS instance. | If inbound or outbound packet loss occurs on the NIC, restart the instance. |
| ECS instance performance degradation | Checks whether the performance of the instance is temporarily degraded due to software or hardware issues. | If the performance of the instance is degraded, the time when the degradation occurred is displayed. View the historical events or system logs of the instance to identify the cause. For more information, see View historical system events. |
| Compromised ECS instance performance | Checks whether the performance of the ECS instance is compromised. | If available CPU credits are insufficient, the instance can provide only the baseline performance. |
| ECS instance disk resizing | Checks whether the disk of the ECS instance is resized. | After the disk is resized, the operating system cannot resize the file system. If the disk cannot be used after it is resized, resize the disk again. |
| ECS instance resource application | Checks whether the physical resources, including CPU and memory resources, required by the ECS instance are sufficient. | If the physical resources are insufficient, the instance cannot be started. Wait a few minutes and start the instance again. You can also create an ECS instance in another region. |
| ECS instance OS status | Checks whether kernel panics, OOM errors, or internal failures occur in the OS of the ECS instance. | These faults may be caused by improper configurations of the instance or by user programs in the OS. Restart the instance. |
| ECS instance virtualization status | Checks whether exceptions exist in the core services at the underlying virtualization layer of the instance. | If exceptions exist, the instance may not respond or may be unexpectedly suspended. Restart the instance. |
ClusterComponent
| Diagnostic item | What it checks | Solution |
|---|---|---|
| aliyun-acr-credential-helper version | Whether the aliyun-acr-credential-helper version in the cluster is outdated | If outdated, update the component. See Use aliyun-acr-credential-helper to pull images without a password. |
| API Service availability | Whether the cluster's API Service is available | Run kubectl get apiservice to check availability. If unavailable, run kubectl describe apiservice to identify the cause. |
| Insufficient available pod CIDR blocks | Whether the number of available pod CIDR blocks (Flannel clusters) has dropped below five. Each node consumes one pod CIDR block; if all blocks are used, newly added nodes cannot function. | Submit a ticket. |
| CoreDNS endpoints | The number of active CoreDNS endpoints | Check CoreDNS pod status and logs. See DNS troubleshooting. |
| CoreDNS cluster IP addresses | Whether cluster IP addresses are allocated to CoreDNS pods. Missing cluster IPs cause service interruptions. | Check CoreDNS pod status and logs. See DNS troubleshooting. |
| NAT gateway status | Whether the cluster's NAT gateway is active | Log on to the NAT Gateway console and check whether the gateway is locked due to overdue payments. |
| Excessively high rate of concurrent connection drops on the NAT gateway | Whether the NAT gateway is dropping concurrent connections at a high rate | If the rate is high, upgrade the NAT gateway. See FAQ about upgrading standard Internet NAT gateways to enhanced Internet NAT gateways. |
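The API Service and CoreDNS checks in the table map directly onto kubectl commands. A sketch; the metrics-server APIService name in the comment is only an example:

```shell
DNS_SVC=kube-dns   # the Service that fronts CoreDNS in the cluster

if command -v kubectl >/dev/null 2>&1; then
  # APIServices whose AVAILABLE column is not True break aggregated API calls.
  kubectl get apiservice

  # Describe an unavailable APIService to see the reason, for example:
  #   kubectl describe apiservice v1beta1.metrics.k8s.io

  # CoreDNS should expose at least one ready endpoint and a cluster IP.
  kubectl -n kube-system get endpoints "$DNS_SVC"
  kubectl -n kube-system get service "$DNS_SVC" -o jsonpath='{.spec.clusterIP}{"\n"}'
else
  echo "kubectl not found; run from a machine with cluster access"
fi
```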