Container Service for Kubernetes (ACK) provides the pod diagnostics feature to help you identify the root cause of pod issues and offers fix suggestions. It uses two diagnostic modes built on different knowledge sources: expert mode, based on curated operational expertise, and AI mode, trained on large amounts of data.
When you run pod diagnostics, ACK executes a data collection program in your cluster. The program collects system version information, workload status, Docker and kubelet status, and key error entries from system logs. It does not collect business data or sensitive information.
How it works
Pod diagnostics runs through four sequential phases to generate results:
Anomaly identification — collects baseline data (pod status, cluster event streams) and flags anomalies.
Data collection — gathers context-specific data based on the identified anomalies.
Diagnostic item check — evaluates key metrics against collected data to detect abnormal conditions.
Root cause analysis — combines collected data with check results to determine the root cause.
Diagnostic results include two parts:
Root cause analysis results: detected anomalies, root cause, and suggestions for fixes.
Diagnostic item check results: per-item check outcomes, which surface issues that root cause analysis may not catch.
Supported scenarios
Pod diagnostics
| Scenario |
|---|
| Pods are not processed by the scheduler |
| Pods cannot be scheduled because they do not meet scheduling constraint requirements |
| Pods are scheduled but not processed by the kubelet |
| Pods are waiting for volumes to reach the Ready state |
| Pods are evicted |
| Sandboxed containers in pods fail to be created |
| Pods remain in the Terminating state |
| Out-of-memory (OOM) errors occur in containers in pods |
| Containers in pods exit unexpectedly |
| Containers in pods remain in the CrashLoopBackOff state |
| Containers in pods are not ready |
| Pods fail to pull container images |
| Timeout errors occur when pods pull container images |
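Many of these scenarios can be narrowed down by hand before or alongside a diagnostic run. A minimal triage sketch, assuming a pod named my-app-pod in the default namespace (both placeholders):

```shell
# Placeholder names; substitute your own pod and namespace.
POD=my-app-pod
NS=default

if command -v kubectl >/dev/null 2>&1; then
  # Phase and per-container state: Pending, CrashLoopBackOff, ImagePullBackOff, ...
  kubectl -n "$NS" get pod "$POD" -o wide

  # Scheduling failures, evictions, and image pull errors surface as events.
  kubectl -n "$NS" describe pod "$POD" | sed -n '/^Events:/,$p'

  # Last termination per container: OOMKilled reasons and exit codes.
  kubectl -n "$NS" get pod "$POD" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{" exit="}{.lastState.terminated.exitCode}{"\n"}{end}'
else
  echo "kubectl not found; run these commands from a machine with cluster access"
fi
```

The event stream from `kubectl describe` is usually the fastest way to tell a scheduling problem apart from a runtime one.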
AI-assisted diagnostics
Diagnostic items
Pod diagnostics checks the following categories of items:
| Category | Coverage |
|---|---|
| Pod | Pod status, image pulling, and network connectivity |
| Node | Common node issues, including node status, network status, kernel logs, kernel processes, and service availability |
| NodeComponent | Status of key node components, including the network and volume components |
| ECSControllerManager | Common ECS instance issues, including the status of ECS instances, network connections, the operating system, and disk I/O |
| ClusterComponent | API server availability, DNS service, and NAT gateway status |
Pod
| Diagnostic item | What it checks | Solution |
|---|---|---|
| Number of container restarts | How many times containers in a pod have restarted | Check pod status and logs. See Pod troubleshooting. |
| Container image download failures | Whether other pods on the same node are also failing to pull the container image | Check pod status and logs. See Pod troubleshooting. |
| Validity of Secrets used by pods to pull container images | Whether the Secrets used for image pulling are valid | Check pod status and logs. See Pod troubleshooting. |
| Validity of environment variables of GPU-accelerated pods | Whether NVIDIA_VISIBLE_DEVICES is set in pod environment variables, which may conflict with the kubelet | Check pod status and logs. See Pod troubleshooting. |
| Connectivity between pods and CoreDNS pods | Network connectivity from the pod to CoreDNS pods | Check connectivity between the pod and CoreDNS pods. |
| Connectivity between pods and CoreDNS Service | Network connectivity from the pod to the CoreDNS Service | Check connectivity between the pod and the CoreDNS Service. |
| Connectivity between pods and DNS server in the host network | Network connectivity from the pod to the host network DNS server | Check connectivity between the pod and the DNS server in the host network. |
| D status of container processes in pods | Whether any container process is in the D state (uninterruptible sleep, usually waiting on disk I/O) | Restart the ECS instance hosting the pod. If the issue persists, submit a ticket. |
| Pod initialization | Whether the pod has completed initialization | Check pod status and logs. See Pod troubleshooting. |
| GPU resources requested by pods | Whether the pod has requested GPU resources — rules out missing resource requests as the reason a pod cannot use GPUs | If no GPU resources are requested, check the pod configuration. |
| Pod scheduling | Whether the pod has been scheduled to a node | If the pod has not been scheduled, check the pod configuration. |
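Several of the items above can be reproduced manually. A sketch of the equivalent checks, assuming a pod named my-app-pod in the default namespace (both placeholders) and an image that ships nslookup:

```shell
# Placeholder names; substitute your own pod and namespace.
POD=my-app-pod
NS=default

if command -v kubectl >/dev/null 2>&1; then
  # Restart count per container.
  kubectl -n "$NS" get pod "$POD" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\n"}{end}'

  # Every imagePullSecret referenced by the pod must exist in its namespace.
  for s in $(kubectl -n "$NS" get pod "$POD" -o jsonpath='{.spec.imagePullSecrets[*].name}'); do
    kubectl -n "$NS" get secret "$s" >/dev/null 2>&1 || echo "missing pull secret: $s"
  done

  # NVIDIA_VISIBLE_DEVICES set directly in the pod spec may conflict with the kubelet.
  kubectl -n "$NS" get pod "$POD" \
    -o jsonpath='{.spec.containers[*].env[?(@.name=="NVIDIA_VISIBLE_DEVICES")].value}'

  # DNS resolution from inside the pod exercises the CoreDNS Service path.
  kubectl -n "$NS" exec "$POD" -- nslookup kubernetes.default.svc.cluster.local
else
  echo "kubectl not found; run these checks from a machine with cluster access"
fi
```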
Node
If the issues persist after you apply the solutions described in the following table, collect the node logs and then submit a ticket.
| Diagnostic item | Description | Solution |
|---|---|---|
| Connectivity to the Kubernetes API server | Checks whether the node can connect to the Kubernetes API server of the cluster. | Check the cluster configurations. For more information, see Troubleshoot ACK clusters. |
| AUFS mount hangs | Checks whether AUFS mount hangs occur. | |
| BufferIOError errors | Checks whether BufferIOError errors occur in the node kernel. | |
| Cgroup leaks | Checks whether cgroup leaks occur on the node. Cgroup leaks may interrupt monitoring data collection and cause container startup failures. | Log on to the node and delete the leaked cgroup directory. |
| Abnormal chronyd process status | Checks whether the chronyd process on the node is in an abnormal state. An abnormal chronyd process may affect system clock synchronization. | Run the |
| Image pulling by containerd | Checks whether the containerd runtime can pull images as expected. | Check the node network and image configurations. |
| Containerd status | Checks the status of the containerd runtime. | |
| CoreDNS pod availability | Checks whether the node can access the IP address of the CoreDNS pod. | Check the network path from the node to the CoreDNS pod IP address. For more information, see What to do if DNS query load is unbalanced. |
| Image status | Checks whether images are damaged. | |
| Overlay2 status of images | Checks whether the overlay2 file system in images is damaged. | |
| System time | Checks whether the system time is correct. | None. |
| Docker container startup | Checks whether Docker containers fail to start. | |
| Docker image pulling | Checks whether the node can pull Docker images as expected. | Check the node network and image configurations. |
| Docker status | Checks the status of Docker. | |
| Docker startup time | Checks the startup time of dockerd. | None. |
| Docker hang errors | Checks whether Docker hang errors occur on the node. | Run the |
| ECS instance existence | Checks whether the ECS instance exists. | Check the status of the ECS instance. For more information, see Node and node pool FAQ. |
| ECS instance status | Checks the status of the ECS instance. | Check the ECS instance status. For more information, see Node and node pool FAQ. |
| Ext4FsError errors | Checks whether Ext4FsError errors occur in the node kernel. | |
| Read-only node file system | Checks whether the node file system is read-only. In most cases, the node file system becomes read-only due to disk failures. You cannot write data to a read-only file system, and your business may be affected. | Use the fsck command to repair the node file system and then restart the node. |
| Hardware time | Checks the consistency between the hardware time and the system time. If the difference is longer than 2 minutes, component errors may occur. | Run the |
| DNS | Checks whether domain names can be resolved on the node. | Check whether domain names can be resolved on the node. For more information, see Troubleshoot DNS resolution errors. |
| Kernel oops errors | Checks whether oops errors exist in the node kernel. | |
| Kernel versions | Checks whether the kernel version is outdated. An outdated kernel version may lead to system failures. | Update the node kernel. For more information, see Node and node pool FAQ. |
| DNS availability | Checks whether the node can access the cluster IP address of the kube-dns Service to use the DNS service provided by the cluster. | Check the status and logs of CoreDNS pods. For more information, see Troubleshoot DNS resolution errors. |
| Kubelet status | Checks the kubelet status. | Check the kubelet logs. For more information, see Troubleshoot ACK clusters. |
| Kubelet startup time | Checks the startup time of the kubelet. | None. |
| CPU utilization | Checks whether the CPU utilization of the node is excessively high. | None. |
| Memory utilization | Checks whether the memory utilization of the node is excessively high. | None. |
| Memory fragmentation | Checks whether memory fragments exist on the node. | If memory fragments exist on the node, log on to the node and run the |
| Swap memory | Checks whether swap memory is enabled on the node. | Swap memory must not be enabled. Log on to the node and disable swap memory. |
| Loading of network device drivers | Checks whether errors occur when VirtIO drivers are loaded on network devices. | |
| Excessively high CPU utilization of the node | Checks the CPU utilization of the node within the last week. If the CPU utilization is high and a large number of pods are scheduled to the node, the pods compete for resources, which further increases CPU utilization and may cause service interruptions. | Set resource requests and limits to proper values to avoid running an excessively large number of pods on the node. |
| Private node IP existence | Checks whether the private node IP address exists. | If the private node IP address does not exist, remove the node and then add it back to the cluster. Do not release the ECS instance when you remove the node. For more information, see Remove a node and Add existing nodes. |
| Excessively high memory utilization of the node | Checks the memory utilization of the node within the last week. If the memory utilization is high and a large number of pods are scheduled to the node, the pods compete for resources, which further increases memory utilization, leads to out of memory (OOM) errors, and may cause service interruptions. | Set resource requests and limits to proper values to avoid running an excessively large number of pods on the node. |
| Node status | Checks whether the node is in the Ready state. | Restart the node. For more information, see Node and node pool FAQ. |
| Node schedulability | Checks whether the node is unschedulable. | If the node is unschedulable, check its scheduling configuration. For more information, see Node draining and scheduling status. |
| OOM errors | Checks whether OOM errors occur on the node. | |
| Runtime check | Checks whether the runtime of the node is the same as the runtime of the cluster. | For more information, see If I choose the containerd container runtime when creating a cluster, can I change it to Docker later?. |
| Outdated OS versions | Checks whether the OS version used by the node has known bugs or is outdated and has stability issues. These issues may cause the Docker and containerd components to malfunction. | Update the OS version. |
| Internet access | Checks whether the node can access the Internet. | Check whether SNAT is enabled for the cluster. For more information, see Enable Internet access for an existing ACK cluster. |
| RCUStallError errors | Checks whether RCUStallError errors occur in the node kernel. | |
| OS versions | Checks the OS version used by the node. If an outdated OS version is used, the cluster may not run as normal. | None. |
| Runc process leaks | Checks whether runc process leaks occur on the node. If runc process leaks occur, the node may periodically enter the NotReady state. | Identify the leaked runc processes and manually terminate them. |
| SoftLockupError errors | Checks whether SoftLockupError errors occur in the node kernel. | |
| Systemd hangs | Checks whether systemd hangs occur on the node. | If systemd hangs occur on the node, log on to the node and run the |
| Outdated systemd versions | Checks whether the systemd version used by the node has known bugs. Outdated systemd versions have stability issues that can cause the Docker and containerd components to malfunction. | Update the systemd version. For more information, see systemd. |
| Hung processes | Checks whether hung processes exist on the node. | |
| unregister_netdevice errors | Checks whether unregister_netdevice errors occur in the node kernel. | |
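A few of these node items can be spot-checked directly on the node, for example over SSH. A minimal sketch of the Linux-side checks, assuming a systemd-based host with /proc mounted:

```shell
# Run on the node itself. Assumes a Linux host with /proc mounted.

# Processes in uninterruptible sleep (D state), usually stuck on disk I/O.
ps -eo state,pid,comm | awk '$1 ~ /^D/ {print "D-state:", $2, $3}'

# Read-only mounts of local block devices; a read-only root usually means disk trouble.
awk '$1 ~ /^\/dev\// && $4 ~ /(^|,)ro(,|$)/ {print "read-only:", $2}' /proc/mounts

# Swap must stay disabled on Kubernetes nodes (/proc/swaps has a header line only).
if [ "$(wc -l < /proc/swaps)" -gt 1 ]; then
  echo "swap is enabled"
fi

# chronyd and kubelet service health on systemd hosts.
if command -v systemctl >/dev/null 2>&1; then
  systemctl is-active chronyd kubelet 2>/dev/null || true
fi
```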
NodeComponent
| Diagnostic item | Description | Solution |
|---|---|---|
| CNI component status | Checks whether the Container Network Interface (CNI) plug-in runs as expected. | Check the status of the network component used by the cluster. For more information, see Network management FAQ. |
| CSI component status | Checks whether the Container Storage Interface (CSI) plug-in runs as expected. | Check the status of the volume component used by the cluster. For more information, see CSI storage FAQ. |
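Both component checks can be approximated by inspecting the kube-system workloads. A sketch; terway, flannel, and csi-plugin are typical ACK component names and are assumptions here, so adjust the pattern to your cluster:

```shell
NS=kube-system

if command -v kubectl >/dev/null 2>&1; then
  # Network (CNI) and volume (CSI) components run as kube-system workloads.
  # terway, flannel, and csi-plugin are typical names in ACK clusters; adjust to yours.
  kubectl -n "$NS" get daemonset,deployment 2>/dev/null | grep -Ei 'terway|flannel|csi'

  # Any component pod that is not Running and Ready deserves a closer look.
  kubectl -n "$NS" get pods -o wide | awk 'NR == 1 || /terway|flannel|csi/'
else
  echo "kubectl not found; run from a machine with cluster access"
fi
```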
ECSControllerManager
| Diagnostic item | Description | Solution |
|---|---|---|
| Overdue payments related to ECS instance components | Checks whether the disk or network bandwidth of the ECS instance is unavailable due to overdue payments within your account. | If the disk or network bandwidth of the ECS instance is unavailable due to overdue payments, top up your account. |
| Overdue payments related to the ECS instance | Checks whether the pay-as-you-go ECS instance is suspended due to overdue payments. | If the instance is suspended due to overdue payments, top up your account and then restart the instance. |
| ECS instance NIC status | Checks whether the NIC of the ECS instance works as expected. | If the NIC does not work as expected, restart the instance. |
| ECS instance startup status | Checks whether the boot operation can be performed on the instance as normal. | If the boot operation cannot be performed as normal, create another instance. |
| Status of the ECS instance backend management system | Checks whether the backend management system of the ECS instance works as expected. | If the backend management system does not work as expected, restart the instance. |
| Status of ECS instance CPUs | Checks whether CPU contention or CPU binding failures occur at the underlying layer of the ECS instance. | If CPU contention exists, the instance may fail to obtain CPUs or may encounter other issues. Restart the instance. |
| Split locks in the CPUs of the ECS instance | Checks whether split locks occur in the CPUs of the ECS instance. | For more information, see Detecting and handling split locks. |
| Status of DDoS mitigation for the ECS instance | Checks whether the public IP address of the instance suffers from DDoS attacks. | If the IP address of the ECS instance suffers from DDoS attacks, purchase other anti-DDoS services. For more information, see Comparison of Alibaba Cloud Anti-DDoS solutions. |
| Limited read/write capabilities of the cloud disk | Checks whether the read/write capabilities of the cloud disk are limited. | If the maximum read and write IOPS of the disk has been reached, read and write operations on the disk are limited. For more information about how to view disk metrics, see Block storage performance. |
| Loading of the ECS instance disk | Checks whether the cloud disk can be attached to the ECS instance when the instance is started. | If the instance fails to start because the cloud disk cannot be attached, stop the instance and then start it again. |
| ECS instance expiration | Checks whether the subscription of the instance has expired. | If the ECS instance has expired, renew the instance. For more information, see How to renew a subscription ECS instance. |
| ECS instance OS crashes | Checks whether OS crashes occur on the ECS instance. | If OS crashes occurred on the ECS instance within the last 48 hours, troubleshoot the system logs to identify the cause. For more information, see View instance system logs and screenshots. |
| Status of the ECS instance host | Checks whether failures occur on the physical server on which the ECS instance is deployed. | If failures occur on the physical server, the instance may be in an abnormal state and its performance may be degraded. Restart the instance. |
| Loading of the ECS instance image | Checks whether the ECS instance can load the image when the system initializes the instance. | If the ECS instance fails to load the image due to system or image issues, restart the instance. |
| I/O hangs on the ECS instance disk | Checks whether I/O hangs occur on the system disk of the ECS instance. | If I/O hangs occur on the system disk, check the disk metrics. For more information, see View the monitoring data of a cloud disk. For information about how to troubleshoot I/O hangs on Alibaba Cloud Linux 2, see Detect I/O hangs of file systems and block layers. |
| ECS instance bandwidth upper limit | Checks whether the total bandwidth of the ECS instance has reached the maximum bandwidth allowed for the instance type. | If the total bandwidth has reached the maximum allowed for the instance type, upgrade the instance to an instance type that provides higher bandwidth. For more information, see Configuration change overview. |
| Upper limit of the burst bandwidth of the ECS instance | Checks whether the burst bandwidth of the instance exceeds the upper limit allowed for the instance type. | If the burst bandwidth exceeds the upper limit allowed for the instance type, upgrade the instance to an instance type that provides higher bandwidth. For more information, see Configuration change overview. |
| Loading of the ECS instance NIC | Checks whether the NIC of the ECS instance can be loaded. | If the NIC cannot be loaded, the network connectivity of the instance is affected. Restart the instance. |
| NIC session establishment on the ECS instance | Checks whether sessions can be established to the NIC of the ECS instance. | If sessions cannot be established to the NIC or the maximum number of sessions supported by the NIC is reached, the network connectivity or throughput of the instance is affected. Restart the instance. |
| Key operations on the ECS instance | Checks whether the operations that you recently performed on the instance succeeded. These operations include starting and stopping the instance and upgrading its configurations. | If the operations failed, perform them again. |
| Packet loss on the ECS instance NIC | Checks whether inbound or outbound packet loss occurs on the NIC of the ECS instance. | If inbound or outbound packet loss occurs on the NIC, restart the instance. |
| ECS instance performance degradation | Checks whether the performance of the instance is temporarily degraded due to software or hardware issues. | If the performance of the instance is degraded, the time when the degradation occurred is displayed. View the historical events or system logs of the instance to identify the cause. For more information, see View historical system events. |
| Compromised ECS instance performance | Checks whether the performance of the ECS instance is compromised. | If available CPU credits are insufficient, the instance can provide only the baseline performance. |
| ECS instance disk resizing | Checks whether the disk of the ECS instance is resized. | After the disk is resized, the operating system cannot resize the file system. If the disk cannot be used after it is resized, resize the disk again. |
| ECS instance resource application | Checks whether the physical resources, including CPU and memory resources, required by the ECS instance are sufficient. | If the physical resources are insufficient, the instance cannot be started. Wait a few minutes and start the instance again. You can also create an ECS instance in another region. |
| ECS instance OS status | Checks whether kernel panics, OOM errors, or internal failures occur in the OS of the ECS instance. | These faults may be caused by improper configurations of the instance or by user programs in the OS. Restart the instance. |
| ECS instance virtualization status | Checks whether exceptions exist in the core services at the underlying virtualization layer of the instance. | If exceptions exist, the instance may not respond or may be unexpectedly suspended. Restart the instance. |
ClusterComponent
| Diagnostic item | What it checks | Solution |
|---|---|---|
| aliyun-acr-credential-helper version | Whether the aliyun-acr-credential-helper version in the cluster is outdated | If outdated, update the component. See Use aliyun-acr-credential-helper to pull images without a password. |
| API Service availability | Whether the cluster's API Service is available | Run kubectl get apiservice to check availability. If unavailable, run kubectl describe apiservice to identify the cause. |
| Insufficient available pod CIDR blocks | Whether the number of available pod CIDR blocks (Flannel clusters) has dropped below five. Each node consumes one pod CIDR block; if all blocks are used, newly added nodes cannot function. | Submit a ticket. |
| CoreDNS endpoints | The number of active CoreDNS endpoints | Check CoreDNS pod status and logs. See DNS troubleshooting. |
| CoreDNS cluster IP addresses | Whether cluster IP addresses are allocated to CoreDNS pods. Missing cluster IPs cause service interruptions. | Check CoreDNS pod status and logs. See DNS troubleshooting. |
| NAT gateway status | Whether the cluster's NAT gateway is active | Log on to the NAT Gateway console and check whether the gateway is locked due to overdue payments. |
| Excessively high rate of concurrent connection drops on the NAT gateway | Whether the NAT gateway is dropping concurrent connections at a high rate | If the rate is high, upgrade the NAT gateway. See FAQ about upgrading standard Internet NAT gateways to enhanced Internet NAT gateways. |
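The API Service and CoreDNS checks in the table map directly onto kubectl commands. A sketch; the metrics-server APIService name in the comment is only an example:

```shell
DNS_SVC=kube-dns   # the Service that fronts CoreDNS in the cluster

if command -v kubectl >/dev/null 2>&1; then
  # APIServices whose AVAILABLE column is not True break aggregated API calls.
  kubectl get apiservice

  # Describe an unavailable APIService to see the reason, for example:
  #   kubectl describe apiservice v1beta1.metrics.k8s.io

  # CoreDNS should expose at least one ready endpoint and a cluster IP.
  kubectl -n kube-system get endpoints "$DNS_SVC"
  kubectl -n kube-system get service "$DNS_SVC" -o jsonpath='{.spec.clusterIP}{"\n"}'
else
  echo "kubectl not found; run from a machine with cluster access"
fi
```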