Container Service for Kubernetes: Pod diagnostics

Last Updated: Mar 25, 2026

Container Intelligence Service diagnoses pods using expert-rule checks and an AI-assisted diagnostics model. When a pod is abnormal, the diagnostics engine identifies anomalies, collects data from the affected node, runs predefined diagnostic checks, traces the root cause, and then surfaces suggested fixes.

Important

When you run pod diagnostics, ACK deploys a data collection program on each node in the cluster. The program collects the system version, workload status, Docker and kubelet status, and key error messages from system logs. It does not collect business data or sensitive information.

How it works

Diagnostic results are produced in four stages:

  1. Anomaly identification: Collects node status, pod status, and cluster event streams, then identifies anomalies.

  2. Data collection: Gathers context-specific data based on the detected anomalies: node information in Kubernetes, ECS instance information, Docker process status, and kubelet process status.

  3. Diagnostic item check: Checks key metrics against the collected data. Pod diagnostic items are grouped into categories; each category lists its diagnostic items with descriptions.

  4. Root cause analysis: Analyzes the root cause based on the collected data and check results, using both expert mode and AI mode.
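
The four stages above can be pictured as a small pipeline. The following Python sketch is illustrative only and does not reflect how ACK implements the diagnostics engine: `DiagnosticReport`, `identify_anomalies`, `run_diagnostics`, and the `collectors`, `checks`, and `rules` arguments are hypothetical names introduced for this example. As a simplification, it runs every collector once any anomaly is found, rather than selecting collectors per anomaly. The only Kubernetes details it relies on are the pod phase and the `type` and `reason` fields of events.

```python
from dataclasses import dataclass, field


@dataclass
class DiagnosticReport:
    """Hypothetical container for the output of one diagnostic run."""
    anomalies: list = field(default_factory=list)
    collected: dict = field(default_factory=dict)
    check_results: dict = field(default_factory=dict)
    root_causes: list = field(default_factory=list)


def identify_anomalies(pod_phase, events):
    """Stage 1: flag an unhealthy pod phase and any Warning events."""
    anomalies = []
    if pod_phase not in ("Running", "Succeeded"):
        anomalies.append(f"pod phase is {pod_phase}")
    anomalies += [e["reason"] for e in events if e.get("type") == "Warning"]
    return anomalies


def run_diagnostics(pod_phase, events, collectors, checks, rules):
    """Run the four stages in order and return a DiagnosticReport."""
    report = DiagnosticReport()
    report.anomalies = identify_anomalies(pod_phase, events)
    # Stage 2: collect data only when something looks wrong (simplified).
    if report.anomalies:
        report.collected = {name: fn() for name, fn in collectors.items()}
    # Stage 3: each check returns True (pass) or False (fail).
    report.check_results = {
        name: check(report.collected) for name, check in checks.items()
    }
    # Stage 4: each rule returns a root-cause string or None.
    report.root_causes = [
        cause for rule in rules if (cause := rule(report)) is not None
    ]
    return report
```

In this sketch, expert rules are plain functions over the report; an AI-assisted mode would be one more entry in `rules`.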

Diagnostic results

Each diagnostic run produces two types of output:

  • Root cause analysis results — Detected anomalies, root cause, and suggested fixes.

  • Diagnostic item check results — Pass/fail status for each diagnostic item. These complement root cause analysis by surfacing issues that pattern-matching alone may miss.

The diagnostic items available depend on your cluster configuration. If this document and the diagnostic page differ, the items shown on the diagnostic page take precedence.

Supported scenarios

The following table lists the scenarios covered by pod diagnostics and AI-assisted diagnostics.

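| Category | Scenario |
| --- | --- |
| Pod diagnostics | Pods are not processed by the scheduler |
| Pod diagnostics | Pods cannot be scheduled due to scheduling constraint violations |
| Pod diagnostics | Pods are scheduled but not processed by kubelet |
| Pod diagnostics | Pods are waiting for volumes to reach the Ready state |
| Pod diagnostics | Pods are evicted |
| Pod diagnostics | Pods are evicted due to insufficient disk space |
| Pod diagnostics | Pods are evicted due to insufficient memory on the node |
| Pod diagnostics | Pods are evicted due to insufficient disk inodes |
| Pod diagnostics | Sandboxed containers in pods fail to start |
| Pod diagnostics | Pods remain in the Terminating state |
| Pod diagnostics | Out-of-memory (OOM) errors occur on containers in pods |
| Pod diagnostics | Containers in pods exit unexpectedly |
| Pod diagnostics | Containers in pods remain in the CrashLoopBackOff state |
| Pod diagnostics | Containers in pods are not ready |
| Pod diagnostics | Pods fail to pull container images |
| Pod diagnostics | Pods time out when pulling container images |
| AI-assisted diagnostics | Pod status is abnormal |
| AI-assisted diagnostics | OOM errors occur on pods |
| AI-assisted diagnostics | Containers in pods exit unexpectedly |
| AI-assisted diagnostics | ConfigMap or Secret configuration is invalid |
| AI-assisted diagnostics | Pods fail health checks |
| AI-assisted diagnostics | Persistent volume claim (PVC) configuration is invalid |
| AI-assisted diagnostics | Errors occur when pulling container images |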
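
Several of these scenarios leave recognizable traces in the pod object itself. As a rough illustration (not the logic ACK uses), the sketch below maps fields from `kubectl get pod <name> -o json` output to a few scenario labels. The function name `classify_pod` and the label strings are made up for this example; the field names (`status.reason`, `metadata.deletionTimestamp`, `status.conditions`, `status.containerStatuses`) and the reason values come from the Kubernetes API.

```python
def classify_pod(pod):
    """Return a set of scenario labels suggested by a parsed pod object."""
    status = pod.get("status") or {}
    labels = set()
    if status.get("reason") == "Evicted":
        labels.add("evicted")
    # A pod with a deletion timestamp is shown as Terminating by kubectl.
    if (pod.get("metadata") or {}).get("deletionTimestamp"):
        labels.add("terminating")
    for cond in status.get("conditions") or []:
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
            labels.add("not scheduled")
    for cs in status.get("containerStatuses") or []:
        state = cs.get("state") or {}
        last = cs.get("lastState") or {}
        waiting = (state.get("waiting") or {}).get("reason")
        # Look at both the current and the previous termination reason.
        ended = {
            (state.get("terminated") or {}).get("reason"),
            (last.get("terminated") or {}).get("reason"),
        }
        if waiting == "CrashLoopBackOff":
            labels.add("crash loop")
        if waiting in ("ErrImagePull", "ImagePullBackOff"):
            labels.add("image pull failure")
        if "OOMKilled" in ended:
            labels.add("OOM")
        if not cs.get("ready", False):
            labels.add("container not ready")
    return labels
```

A pod can match several labels at once, for example a container that was OOM-killed and is now in CrashLoopBackOff, which is why the function returns a set rather than a single value.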

Diagnostic item categories

Pod diagnostics checks five categories of components:

| Category | What it checks |
| --- | --- |
| Pod | Pod status, image pulling, and network connectivity |
| Node | Node status, network status, kernel logs, kernel processes, and service availability |
| NodeComponent | Status of key node components, including network (CNI) and storage (CSI) components |
| ClusterComponent | API server availability, DNS service, and NAT gateway status |
| ECSControllerManager | ECS instance status, network connections, operating system, and disk I/O |

Pod

| Diagnostic item | What it checks | Solution |
| --- | --- | --- |
| Number of container restarts | How many times containers in a pod have restarted | Check pod status and logs. See Pod troubleshooting. |
| Container image download failures | Whether other pods on the same node are also failing to pull images | Check pod status and logs. See Pod troubleshooting. |
| Validity of Secrets used to pull images | Whether the image pull Secrets are valid | Check pod status and logs. See Pod troubleshooting. |
| Connectivity between pods and CoreDNS pods | Whether pods can reach CoreDNS pods | Check network connectivity between pods and CoreDNS pods. |
| Connectivity between pods and CoreDNS Service | Whether pods can reach the CoreDNS Service | Check network connectivity between pods and the CoreDNS Service. |
| Connectivity between pods and host network DNS server | Whether pods can reach the DNS server in the host network | Check network connectivity between pods and the host network DNS server. |
| D state of container processes | Whether container processes are stuck in D state (uninterruptible sleep) | Processes in D state are typically waiting on disk I/O. Restart the ECS instance. If the issue persists, submit a ticket. |
| Pod initialization | Whether pods have initialized | Check pod status and logs. See Pod troubleshooting. |
| Pod scheduling | Whether pods are scheduled | Check pod status and logs. See Pod troubleshooting. |
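
To see what the D-state item looks for, you can inspect process states on the node yourself. The sketch below is a manual approximation, not the check ACK runs: it parses the text produced by `ps -eo pid,stat,comm` and returns the processes whose state code starts with `D`. The function name `find_d_state` is hypothetical. It scans whatever output it is given, so it reports every D-state process on the node rather than container processes only.

```python
def find_d_state(ps_output):
    """Return (pid, command) pairs for processes in uninterruptible sleep.

    Expects the text printed by `ps -eo pid,stat,comm`, header included.
    """
    stuck = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # ignore malformed or truncated rows
        pid, stat, comm = parts
        # The first character of STAT is the state code; D means
        # uninterruptible sleep, usually waiting on I/O.
        if stat.startswith("D"):
            stuck.append((int(pid), comm))
    return stuck
```

A process that shows up in D state once is not necessarily a problem, because short I/O waits are normal. Sampling the output several times and looking for the same PID is a more reliable signal.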

Node

If an issue persists after applying the solution, collect node logs and submit a ticket.

| Diagnostic item | What it checks | Solution |
| --- | --- | --- |
| Connectivity to the Kubernetes API server | Whether the node can connect to the cluster's API server | Check cluster configurations. See Troubleshoot ACK clusters. |
| AUFS mount hangs | Whether AUFS mount hangs are occurring | Submit a ticket. |
| BufferIOError errors | Whether BufferIOError errors appear in the node kernel | Submit a ticket. |
| Cgroup leaks | Whether cgroup leaks exist on the node | Cgroup leaks can interrupt monitoring data collection and cause container startup failures. Log on to the node and delete the cgroup directory. |
| Abnormal chronyd process | Whether the chronyd process is running normally | An abnormal chronyd process affects system clock synchronization. Run `systemctl restart chronyd` to restart the process. |
| Image pulling by containerd | Whether the containerd runtime can pull images | Check node network settings and image configurations. |
| containerd status | Whether containerd is running | Submit a ticket. |
| CoreDNS pod availability | Whether the node can access CoreDNS pod IP addresses | See "What do I do if the DNS query load is not balanced among CoreDNS pods?" |
| Image status | Whether images are corrupted | Submit a ticket. |
| overlay2 status of images | Whether the overlay2 file system in images is corrupted | Submit a ticket. |
| System time | Whether the system time is correct | No action required. |
| Docker container startup | Whether Docker containers fail to start | Submit a ticket. |
| Docker image pulling | Whether the node can pull Docker images | Check node network settings and image configurations. |
| Docker status | Whether Docker is running | Submit a ticket. |
| dockerd startup time | The startup time of dockerd | No action required. |
| Docker hang errors | Whether Docker hang errors are occurring | Run `systemctl restart docker` to restart Docker. |
| ECS instance existence | Whether the ECS instance exists | Check ECS instance status. See FAQ about nodes and node pools. |
| ECS instance status | The current status of the ECS instance | Check ECS instance status. See FAQ about nodes and node pools. |
| Ext4FsError errors | Whether Ext4FsError errors appear in the node kernel | Submit a ticket. |
| Read-only node file system | Whether the node file system is in read-only mode | A read-only file system typically indicates a disk failure and blocks writes. Run `fsck` to repair the file system, then restart the node. |
| Hardware time | Whether hardware time and system time are in sync | A difference greater than 2 minutes can cause component errors. Run `hwclock --systohc` to sync system time to the hardware clock. |
| DNS resolution | Whether domain names can be resolved on the node | See DNS troubleshooting. |
| Kernel oops errors | Whether kernel oops errors exist in the node kernel | Submit a ticket. |
| Kernel version | Whether the kernel version is outdated | An outdated kernel version can cause system failures. Update the node kernel. See FAQ about nodes and node pools. |
| DNS availability | Whether the node can access the kube-dns Service cluster IP for DNS | Check CoreDNS pod status and logs. See DNS troubleshooting. |
| kubelet status | Whether kubelet is running | Check kubelet logs. See Troubleshoot ACK clusters. |
| kubelet startup time | The startup time of kubelet | No action required. |
| CPU utilization | Whether CPU utilization is excessively high | No action required. |
| Memory utilization | Whether memory utilization is excessively high | No action required. |
| Memory fragmentation | Whether memory fragments exist on the node | Log on to the node and run `echo 3 > /proc/sys/vm/drop_caches` to clear the cache. |
| Swap memory | Whether swap memory is enabled | Swap memory must be disabled. Log on to the node and disable swap. |
| VirtIO driver loading | Whether VirtIO drivers are loaded on network devices | Check VirtIO driver errors on the network device. |
| High CPU utilization (weekly) | Whether CPU utilization has been consistently high over the past week | High CPU from too many scheduled pods causes resource contention. Set appropriate resource requests and limits to avoid overloading the node. |
| Private node IP address | Whether the node has a private IP address | If the private IP is missing, remove the node from the cluster without releasing the ECS instance, then re-add it. See Remove a node and Add existing ECS instances. |
| High memory utilization (weekly) | Whether memory utilization has been consistently high over the past week | High memory from too many scheduled pods can cause OOM errors and service interruptions. Set appropriate resource requests and limits. |
| Node status | Whether the node is in the Ready state | Restart the node. See FAQ about nodes and node pools. |
| Node schedulability | Whether the node is unschedulable | If the node is cordoned, check its scheduling configuration. See Node draining and scheduling status. |
| OOM errors | Whether OOM errors have occurred on the node | Submit a ticket. |
| Container runtime consistency | Whether the node runtime matches the cluster runtime | See "Can I change the container runtime of a cluster from containerd to Docker?" |
| OS version (known bugs) | Whether the OS version has known bugs or stability issues | Known OS bugs can cause Docker and containerd to malfunction. Update the OS version. |
| Internet access | Whether the node can reach the internet | Check whether SNAT is enabled for the cluster. See Enable an existing ACK cluster to access the internet. |
| RCUStallError errors | Whether RCUStallError errors appear in the node kernel | Submit a ticket. |
| OS version | The OS version currently used by the node | No action required. |
| Runc process leaks | Whether runc process leaks are occurring | Runc leaks can cause the node to periodically enter the NotReady state. Identify the leaked runc processes and terminate them manually. |
| SoftLockupError errors | Whether SoftLockupError errors appear in the node kernel | Submit a ticket. |
| systemd hangs | Whether systemd hangs are occurring | Run `systemctl daemon-reexec` to restart systemd. |
| systemd version (known bugs) | Whether the systemd version has known bugs | Outdated systemd versions can cause Docker and containerd to malfunction. Update systemd. See systemd. |
| Hung processes | Whether hung processes exist on the node | Submit a ticket. |
| unregister_netdevice errors | Whether unregister_netdevice errors appear in the node kernel | Submit a ticket. |
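
Two of the node items above reduce to simple comparisons that are easy to reproduce by hand. The following sketch is an approximation under stated assumptions, not ACK's implementation. `clock_drift_exceeded` applies the 2-minute threshold from the Hardware time row to two timestamps that you obtain yourself (for example from `date` and `hwclock`). `swap_devices` lists active swap areas from the contents of `/proc/swaps`, whose first line is a header. Both function names are hypothetical.

```python
from datetime import datetime, timedelta

# Threshold taken from the "Hardware time" row of the table above.
MAX_DRIFT = timedelta(minutes=2)


def clock_drift_exceeded(system_time, hardware_time, limit=MAX_DRIFT):
    """Return True if the two clocks differ by more than the limit."""
    return abs(system_time - hardware_time) > limit


def swap_devices(proc_swaps_text):
    """Return the swap areas listed in the text of /proc/swaps.

    The first line of /proc/swaps is a column header; every following
    non-empty line describes one active swap area.
    """
    lines = proc_swaps_text.strip().splitlines()
    return [line.split()[0] for line in lines[1:] if line.strip()]
```

An empty list from `swap_devices` means swap is off, which is the state the Swap memory item expects.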

NodeComponent

| Diagnostic item | What it checks | Solution |
| --- | --- | --- |
| CNI component status | Whether the Container Network Interface (CNI) plugin is running | Check the network component status. See FAQ about network management. |
| CSI component status | Whether the Container Storage Interface (CSI) plugin is running | Check the storage component status. See FAQ about CSI. |
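
If the CNI and CSI node plugins in your cluster run as DaemonSets (an assumption; this document does not say how they are deployed), one way to check their status by hand is to compare the desired and ready pod counts in the DaemonSet status. The sketch below reads a DaemonSet object as returned by `kubectl get daemonset <name> -n kube-system -o json`. The function name `daemonset_unready` is hypothetical; `desiredNumberScheduled` and `numberReady` are standard Kubernetes DaemonSet status fields.

```python
def daemonset_unready(daemonset):
    """Return how many nodes are missing a ready pod for this DaemonSet."""
    status = daemonset.get("status") or {}
    desired = status.get("desiredNumberScheduled", 0)
    ready = status.get("numberReady", 0)
    # Guard against transient states where ready briefly exceeds desired.
    return max(desired - ready, 0)
```

A non-zero result only tells you that some node lacks a ready plugin pod. The pod events and logs on that node are still needed to find out why.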

ClusterComponent

| Diagnostic item | What it checks | Solution |
| --- | --- | --- |
| aliyun-acr-credential-helper version | Whether the aliyun-acr-credential-helper version is outdated | Update aliyun-acr-credential-helper. See Use aliyun-acr-credential-helper to pull images without a secret. |
| API Service availability | Whether the cluster's API Service is available | Run `kubectl get apiservice` to check availability. If unavailable, run `kubectl describe apiservice` to view details and identify the cause. |
| Available pod CIDR blocks | Whether the number of available pod CIDR blocks is fewer than five (Flannel only) | Each node requires one pod CIDR block. If all blocks are used, new nodes cannot join the cluster. Submit a ticket. |
| CoreDNS endpoints | The number of active CoreDNS endpoints | Check CoreDNS pod status and logs. See DNS troubleshooting. |
| CoreDNS cluster IP addresses | Whether cluster IP addresses are allocated to CoreDNS pods | Unallocated cluster IPs can cause DNS service interruptions. Check CoreDNS pod status and logs. See DNS troubleshooting. |
| NAT gateway status | The status of the cluster's NAT gateway | Log on to the NAT Gateway console and check whether the gateway is locked due to overdue payments. |
| NAT gateway concurrent connection drop rate | Whether the rate of concurrent connection drops on the NAT gateway is high | Upgrade the NAT gateway. See FAQ about upgrading standard Internet NAT gateways to enhanced Internet NAT gateways. |
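
The API Service row suggests running `kubectl get apiservice` and then describing any entry that is unavailable. When a cluster has many API Services, filtering the JSON output can be quicker. The sketch below is one possible way to do that, assuming the input is the parsed output of `kubectl get apiservice -o json`. The function name `unavailable_apiservices` is hypothetical; the `Available` condition type belongs to the Kubernetes APIService status.

```python
def unavailable_apiservices(apiservice_list):
    """Return (name, message) pairs for API Services that are not Available."""
    problems = []
    for item in apiservice_list.get("items", []):
        conditions = (item.get("status") or {}).get("conditions") or []
        available = [c for c in conditions if c.get("type") == "Available"]
        if any(c.get("status") == "True" for c in available):
            continue
        # Prefer the message reported by the API server, if there is one.
        message = (
            available[0].get("message", "")
            if available
            else "no Available condition reported"
        )
        problems.append((item["metadata"]["name"], message))
    return problems
```

The returned message is usually the same text that `kubectl describe apiservice` shows under Conditions, so it is a reasonable starting point for identifying the cause.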

ECSControllerManager

| Diagnostic item | What it checks | Solution |
| --- | --- | --- |
| ECS instance component overdue payments | Whether disk or network bandwidth is unavailable due to overdue payments | Top up your account. |
| ECS instance overdue payments | Whether a pay-as-you-go ECS instance is suspended due to overdue payments | Top up your account, then restart the instance. |
| ECS instance NIC status | Whether the NIC is functioning | Restart the instance. |
| ECS instance startup status | Whether the instance can boot normally | If the instance cannot boot, create a new instance. |
| ECS instance backend management system | Whether the backend management system is functioning | Restart the instance. |
| ECS instance CPU status | Whether CPU contention or binding failures exist at the underlying layer | CPU contention can prevent the instance from getting CPU time. Restart the instance. |
| Split locks in ECS instance CPUs | Whether split locks are occurring in the instance CPUs | See Detecting and handling split locks. |
| DDoS mitigation status | Whether the instance's public IP address is under DDoS attack | Purchase additional anti-DDoS protection. See Comparison of Alibaba Cloud Anti-DDoS solutions. |
| Cloud disk read/write capabilities | Whether cloud disk read/write operations are throttled | Throttling occurs when disk IOPS reaches its maximum. Check disk metrics. See Block storage performance. |
| Cloud disk loading | Whether the cloud disk can be attached during instance startup | If the disk fails to attach, stop the instance and start it again. |
| ECS instance expiration | Whether the subscription has expired | Renew the instance. See Renew a subscription instance. |
| ECS instance OS crashes | Whether OS crashes have occurred within the last 48 hours | Check system logs to identify the cause. See View system logs and screenshots. |
| ECS instance host status | Whether the physical server hosting the instance has failures | Physical server failures can degrade instance performance. Restart the instance. |
| ECS instance image loading | Whether the instance can load its image during initialization | If the image fails to load, restart the instance. |
| I/O hangs on the system disk | Whether I/O hangs are occurring on the instance's system disk | Check disk metrics. See View the monitoring data of a cloud disk. For Alibaba Cloud Linux 2, see Detect I/O hangs of file systems and block layers. |
| ECS instance bandwidth limit | Whether total bandwidth has reached the instance type's maximum | Upgrade to an instance type with higher bandwidth. See Overview of instance configuration changes. |
| ECS instance burst bandwidth limit | Whether burst bandwidth has exceeded the instance type's limit | Upgrade to an instance type with higher bandwidth. See Overview of instance configuration changes. |
| ECS instance NIC loading | Whether the NIC can be loaded | If the NIC fails to load, network connectivity is affected. Restart the instance. |
| NIC session establishment | Whether sessions can be established to the NIC | If sessions cannot be established or the session limit is reached, network connectivity or throughput is affected. Restart the instance. |
| Recent key operations | Whether recent instance operations succeeded (start, stop, resize) | If an operation failed, perform it again. |
| NIC packet loss | Whether inbound or outbound packet loss is occurring on the NIC | Restart the instance. |
| ECS instance performance degradation | Whether instance performance is temporarily degraded due to hardware or software issues | Check historical events and system logs to identify the cause. See View historical system events. |
| Compromised ECS instance performance | Whether the instance is running at baseline performance only due to insufficient CPU credits | Top up CPU credits or upgrade to an instance type without credit limits. |
| ECS instance disk resizing | Whether the disk has been resized but the file system has not been updated | If the file system was not resized after disk expansion, resize the disk again. |
| ECS instance resource availability | Whether sufficient physical CPU and memory resources are available for the instance | If resources are insufficient, the instance cannot start. Wait a few minutes and try again, or create the instance in a different region. |
| ECS instance OS status | Whether kernel panics, OOM errors, or internal failures exist in the OS | These issues may be caused by misconfiguration or user programs in the OS. Restart the instance. |
| ECS instance virtualization status | Whether exceptions exist in the underlying virtualization layer | Virtualization exceptions can cause the instance to freeze or restart unexpectedly. Restart the instance. |