Container Intelligence Service diagnoses pods using expert-rule checks and an AI-assisted diagnostics model. When a pod is abnormal, the diagnostics engine collects data from the affected node, identifies anomalies, runs predefined diagnostic checks, and traces the root cause — then surfaces suggested fixes.
When you run pod diagnostics, ACK deploys a data collection program on each node in the cluster. The program collects the system version, workload status, Docker and kubelet status, and key error messages from system logs. It does not collect business data or sensitive information.
## How it works
Diagnostic results are produced in four stages:

1. Anomaly identification — Collects node status, pod status, and cluster event streams, then identifies anomalies.
2. Data collection — Gathers context-specific data based on the detected anomalies: node information in Kubernetes, ECS instance information, Docker process status, and kubelet process status.
3. Diagnostic item check — Checks key metrics against the collected data. Node diagnostics are grouped into categories; each category lists its diagnostic items with descriptions.
4. Root cause analysis — Analyzes the root cause based on the collected data and check results, using both expert mode and AI mode.
## Diagnostic results
Each diagnostic run produces two types of output:
- Root cause analysis results — Detected anomalies, the root cause, and suggested fixes.
- Diagnostic item check results — Pass/fail status for each diagnostic item. These complement root cause analysis by surfacing issues that pattern matching alone may miss.
The diagnostic items available depend on your cluster configuration. The items shown on the diagnostic page are authoritative.
## Supported scenarios
The following table lists the scenarios covered by pod diagnostics and AI-assisted diagnostics.
| Category | Scenario |
|---|---|
| Pod diagnostics | Pods are not processed by the scheduler |
| | Pods cannot be scheduled due to scheduling constraint violations |
| | Pods are scheduled but not processed by kubelet |
| | Pods are waiting for volumes to reach the Ready state |
| | Pods are evicted |
| | Pods are evicted due to insufficient disk space |
| | Pods are evicted due to insufficient memory on the node |
| | Pods are evicted due to insufficient inodes on the node disk |
| | Sandboxed containers in pods fail to start |
| | Pods remain in the Terminating state |
| | Out-of-memory (OOM) errors occur on containers in pods |
| | Containers in pods exit unexpectedly |
| | Containers in pods remain in the CrashLoopBackOff state |
| | Containers in pods are not ready |
| | Pods fail to pull container images |
| | Pods time out when pulling container images |
| AI-assisted diagnostics | Pod status is abnormal |
| | OOM errors occur on pods |
| | Containers in pods exit unexpectedly |
| | ConfigMap or Secret configuration is invalid |
| | Pods fail health checks |
| | Persistent volume claim (PVC) configuration is invalid |
| | Errors occur when pulling container images |
## Diagnostic item categories
Pod diagnostics checks five categories of components:
| Category | What it checks |
|---|---|
| Pod | Pod status, image pulling, and network connectivity |
| Node | Node status, network status, kernel logs, kernel processes, and service availability |
| NodeComponent | Status of key node components, including network (CNI) and storage (CSI) components |
| ClusterComponent | API server availability, DNS service, and NAT gateway status |
| ECSControllerManager | ECS instance status, network connections, operating system, and disk I/O |
### Pod
| Diagnostic item | What it checks | Solution |
|---|---|---|
| Number of container restarts | How many times containers in a pod have restarted | Check pod status and logs. See Pod troubleshooting. |
| Container image download failures | Whether other pods on the same node are also failing to pull images | Check pod status and logs. See Pod troubleshooting. |
| Validity of Secrets used to pull images | Whether the image pull Secrets are valid | Check pod status and logs. See Pod troubleshooting. |
| Connectivity between pods and CoreDNS pods | Whether pods can reach CoreDNS pods | Check network connectivity between pods and CoreDNS pods. |
| Connectivity between pods and CoreDNS Service | Whether pods can reach the CoreDNS Service | Check network connectivity between pods and the CoreDNS Service. |
| Connectivity between pods and host network DNS server | Whether pods can reach the DNS server in the host network | Check network connectivity between pods and the host network DNS server. |
| D state of container processes | Whether container processes are stuck in D state (uninterruptible sleep) | Processes in D state are typically waiting on disk I/O. Restart the ECS instance. If the issue persists, submit a ticket. |
| Pod initialization | Whether pods have initialized | Check pod status and logs. See Pod troubleshooting. |
| Pod scheduling | Whether pods are scheduled | Check pod status and logs. See Pod troubleshooting. |
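Most of the pod-level solutions above start from pod status and logs. A typical first sequence, assuming kubectl is configured for the cluster; `my-pod` and the `default` namespace are placeholders:

```shell
# Inspect pod status and recent events (scheduling, image pulls, probes).
kubectl -n default describe pod my-pod

# View logs from the current container and from the last crashed container.
kubectl -n default logs my-pod
kubectl -n default logs my-pod --previous

# Check the restart count reported for each container in the pod.
kubectl -n default get pod my-pod \
  -o jsonpath='{.status.containerStatuses[*].restartCount}'
```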
### Node
If an issue persists after applying the solution, collect node logs and submit a ticket.
| Diagnostic item | What it checks | Solution |
|---|---|---|
| Connectivity to the Kubernetes API server | Whether the node can connect to the cluster's API server | Check cluster configurations. See Troubleshoot ACK clusters. |
| AUFS mount hangs | Whether AUFS mount hangs are occurring | Submit a ticket. |
| BufferIOError errors | Whether BufferIOError errors appear in the node kernel | Submit a ticket. |
| Cgroup leaks | Whether cgroup leaks exist on the node | Cgroup leaks can interrupt monitoring data collection and cause container startup failures. Log on to the node and delete the cgroup directory. |
| Abnormal chronyd process | Whether the chronyd process is running normally | An abnormal chronyd process affects system clock synchronization. Run systemctl restart chronyd to restart the process. |
| Image pulling by containerd | Whether the containerd runtime can pull images | Check node network settings and image configurations. |
| containerd status | Whether containerd is running | Submit a ticket. |
| CoreDNS pod availability | Whether the node can access CoreDNS pod IP addresses | See What do I do if the DNS query load is not balanced among CoreDNS pods?. |
| Image status | Whether images are corrupted | Submit a ticket. |
| overlay2 status of images | Whether the overlay2 file system in images is corrupted | Submit a ticket. |
| System time | Whether the system time is correct | No action required. |
| Docker container startup | Whether Docker containers fail to start | Submit a ticket. |
| Docker image pulling | Whether the node can pull Docker images | Check node network settings and image configurations. |
| Docker status | Whether Docker is running | Submit a ticket. |
| dockerd startup time | The startup time of dockerd | No action required. |
| Docker hang errors | Whether Docker hang errors are occurring | Run systemctl restart docker to restart Docker. |
| ECS instance existence | Whether the ECS instance exists | Check ECS instance status. See FAQ about nodes and node pools. |
| ECS instance status | The current status of the ECS instance | Check ECS instance status. See FAQ about nodes and node pools. |
| Ext4FsError errors | Whether Ext4FsError errors appear in the node kernel | Submit a ticket. |
| Read-only node file system | Whether the node file system is in read-only mode | A read-only file system typically indicates a disk failure and blocks writes. Run fsck to repair the file system, then restart the node. |
| Hardware time | Whether hardware time and system time are in sync | A difference greater than 2 minutes can cause component errors. Run hwclock --systohc to sync system time to the hardware clock. |
| DNS resolution | Whether domain names can be resolved on the node | See DNS troubleshooting. |
| Kernel oops errors | Whether kernel oops errors exist in the node kernel | Submit a ticket. |
| Kernel version | Whether the kernel version is outdated | An outdated kernel version can cause system failures. Update the node kernel. See FAQ about nodes and node pools. |
| DNS availability | Whether the node can access the kube-dns Service cluster IP for DNS | Check CoreDNS pod status and logs. See DNS troubleshooting. |
| kubelet status | Whether kubelet is running | Check kubelet logs. See Troubleshoot ACK clusters. |
| kubelet startup time | The startup time of kubelet | No action required. |
| CPU utilization | Whether CPU utilization is excessively high | No action required. |
| Memory utilization | Whether memory utilization is excessively high | No action required. |
| Memory fragmentation | Whether memory fragments exist on the node | Log on to the node and run echo 3 > /proc/sys/vm/drop_caches to clear the cache. |
| Swap memory | Whether swap memory is enabled | Swap memory must be disabled. Log on to the node and disable swap. |
| VirtIO driver loading | Whether VirtIO drivers are loaded on network devices | Check VirtIO driver errors on the network device. |
| High CPU utilization (weekly) | Whether CPU utilization has been consistently high over the past week | High CPU from too many scheduled pods causes resource contention. Set appropriate resource requests and limits to avoid overloading the node. |
| Private node IP address | Whether the node has a private IP address | If the private IP is missing, remove the node from the cluster without releasing the ECS instance, then re-add it. See Remove a node and Add existing ECS instances. |
| High memory utilization (weekly) | Whether memory utilization has been consistently high over the past week | High memory from too many scheduled pods can cause OOM errors and service interruptions. Set appropriate resource requests and limits. |
| Node status | Whether the node is in the Ready state | Restart the node. See FAQ about nodes and node pools. |
| Node schedulability | Whether the node is unschedulable | If the node is cordoned, check its scheduling configuration. See Node draining and scheduling status. |
| OOM errors | Whether OOM errors have occurred on the node | Submit a ticket. |
| Container runtime consistency | Whether the node runtime matches the cluster runtime | See Can I change the container runtime of a cluster from containerd to Docker?. |
| OS version (known bugs) | Whether the OS version has known bugs or stability issues | Known OS bugs can cause Docker and containerd to malfunction. Update the OS version. |
| Internet access | Whether the node can reach the internet | Check whether SNAT is enabled for the cluster. See Enable an existing ACK cluster to access the internet. |
| RCUStallError errors | Whether RCUStallError errors appear in the node kernel | Submit a ticket. |
| OS version | The OS version currently used by the node | No action required. |
| Runc process leaks | Whether runc process leaks are occurring | Runc leaks can cause the node to periodically enter the NotReady state. Identify the leaked runc processes and terminate them manually. |
| SoftLockupError errors | Whether SoftLockupError errors appear in the node kernel | Submit a ticket. |
| systemd hangs | Whether systemd hangs are occurring | Run systemctl daemon-reexec to restart systemd. |
| systemd version (known bugs) | Whether the systemd version has known bugs | Outdated systemd versions can cause Docker and containerd to malfunction. Update systemd. See systemd. |
| Hung processes | Whether hung processes exist on the node | Submit a ticket. |
| unregister_netdevice errors | Whether unregister_netdevice errors appear in the node kernel | Submit a ticket. |
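Several of the node-level remedies above (chronyd, hardware clock, memory fragmentation, swap) map to single commands run on the node. A sketch, assuming root access; the last two commands are disruptive, so run them only when the corresponding diagnostic item fails:

```shell
# Restart chronyd if clock synchronization is abnormal.
systemctl restart chronyd

# Sync the hardware clock to the system clock (drift over 2 minutes is flagged).
hwclock --systohc

# Clear page cache, dentries, and inodes to relieve memory fragmentation.
echo 3 > /proc/sys/vm/drop_caches

# Disable swap, which must be off for kubelet to run reliably.
swapoff -a
# Also comment out any swap entries in /etc/fstab so the change persists.
```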
### NodeComponent
| Diagnostic item | What it checks | Solution |
|---|---|---|
| CNI component status | Whether the Container Network Interface (CNI) plugin is running | Check the network component status. See FAQ about network management. |
| CSI component status | Whether the Container Storage Interface (CSI) plugin is running | Check the storage component status. See FAQ about CSI. |
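You can also check the CNI and CSI components manually by listing their pods in the kube-system namespace. A sketch, assuming kubectl access; `my-node` and the pod name are placeholders, and the actual pod names depend on which network and storage plugins the cluster uses:

```shell
# List all kube-system pods running on a given node.
kubectl -n kube-system get pods --field-selector spec.nodeName=my-node

# Inspect a CNI or CSI pod that is not in the Running state.
kubectl -n kube-system describe pod csi-plugin-xxxxx
kubectl -n kube-system logs csi-plugin-xxxxx
```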
### ClusterComponent
| Diagnostic item | What it checks | Solution |
|---|---|---|
| aliyun-acr-credential-helper version | Whether the aliyun-acr-credential-helper version is outdated | Update aliyun-acr-credential-helper. See Use aliyun-acr-credential-helper to pull images without a secret. |
| API Service availability | Whether the cluster's API Service is available | Run kubectl get apiservice to check availability. If unavailable, run kubectl describe apiservice to view details and identify the cause. |
| Available pod CIDR blocks | Whether the number of available pod CIDR blocks is fewer than five (Flannel only) | Each node requires one pod CIDR block. If all blocks are used, new nodes cannot join the cluster. Submit a ticket. |
| CoreDNS endpoints | The number of active CoreDNS endpoints | Check CoreDNS pod status and logs. See DNS troubleshooting. |
| CoreDNS cluster IP addresses | Whether cluster IP addresses are allocated to CoreDNS pods | Unallocated cluster IPs can cause DNS service interruptions. Check CoreDNS pod status and logs. See DNS troubleshooting. |
| NAT gateway status | The status of the cluster's NAT gateway | Log on to the NAT Gateway console and check whether the gateway is locked due to overdue payments. |
| NAT gateway concurrent connection drop rate | Whether the rate of concurrent connection drops on the NAT gateway is high | Upgrade the NAT gateway. See FAQ about upgrading standard Internet NAT gateways to enhanced Internet NAT gateways. |
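The API Service availability check in the table can be reproduced by hand. A sketch, assuming kubectl access; `v1beta1.metrics.k8s.io` is only an example of an aggregated API service name:

```shell
# List aggregated API services and their availability.
kubectl get apiservice

# For a service that does not report Available=True, inspect the
# failure condition to identify the cause.
kubectl describe apiservice v1beta1.metrics.k8s.io
```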
### ECSControllerManager
| Diagnostic item | What it checks | Solution |
|---|---|---|
| ECS instance component overdue payments | Whether disk or network bandwidth is unavailable due to overdue payments | Top up your account. |
| ECS instance overdue payments | Whether a pay-as-you-go ECS instance is suspended due to overdue payments | Top up your account, then restart the instance. |
| ECS instance NIC status | Whether the NIC is functioning | Restart the instance. |
| ECS instance startup status | Whether the instance can boot normally | If the instance cannot boot, create a new instance. |
| ECS instance backend management system | Whether the backend management system is functioning | Restart the instance. |
| ECS instance CPU status | Whether CPU contention or binding failures exist at the underlying layer | CPU contention can prevent the instance from getting CPU time. Restart the instance. |
| Split locks in ECS instance CPUs | Whether split locks are occurring in the instance CPUs | See Detecting and handling split locks. |
| DDoS mitigation status | Whether the instance's public IP address is under DDoS attack | Purchase additional anti-DDoS protection. See Comparison of Alibaba Cloud Anti-DDoS solutions. |
| Cloud disk read/write capabilities | Whether cloud disk read/write operations are throttled | Throttling occurs when disk IOPS reaches its maximum. Check disk metrics. See Block storage performance. |
| Cloud disk loading | Whether the cloud disk can be attached during instance startup | If the disk fails to attach, stop the instance and start it again. |
| ECS instance expiration | Whether the subscription has expired | Renew the instance. See Renew a subscription instance. |
| ECS instance OS crashes | Whether OS crashes have occurred within the last 48 hours | Check system logs to identify the cause. See View system logs and screenshots. |
| ECS instance host status | Whether the physical server hosting the instance has failures | Physical server failures can degrade instance performance. Restart the instance. |
| ECS instance image loading | Whether the instance can load its image during initialization | If the image fails to load, restart the instance. |
| I/O hangs on the system disk | Whether I/O hangs are occurring on the instance's system disk | Check disk metrics. See View the monitoring data of a cloud disk. For Alibaba Cloud Linux 2, see Detect I/O hangs of file systems and block layers. |
| ECS instance bandwidth limit | Whether total bandwidth has reached the instance type's maximum | Upgrade to an instance type with higher bandwidth. See Overview of instance configuration changes. |
| ECS instance burst bandwidth limit | Whether burst bandwidth has exceeded the instance type's limit | Upgrade to an instance type with higher bandwidth. See Overview of instance configuration changes. |
| ECS instance NIC loading | Whether the NIC can be loaded | If the NIC fails to load, network connectivity is affected. Restart the instance. |
| NIC session establishment | Whether sessions can be established to the NIC | If sessions cannot be established or the session limit is reached, network connectivity or throughput is affected. Restart the instance. |
| Recent key operations | Whether recent instance operations succeeded (start, stop, resize) | If an operation failed, perform it again. |
| NIC packet loss | Whether inbound or outbound packet loss is occurring on the NIC | Restart the instance. |
| ECS instance performance degradation | Whether instance performance is temporarily degraded due to hardware or software issues | Check historical events and system logs to identify the cause. See View historical system events. |
| Compromised ECS instance performance | Whether the instance is running at baseline performance only due to insufficient CPU credits | Top up CPU credits or upgrade to an instance type without credit limits. |
| ECS instance disk resizing | Whether the disk has been resized but the file system has not been updated | If the file system was not resized after disk expansion, extend the file system to use the new capacity. |
| ECS instance resource availability | Whether sufficient physical CPU and memory resources are available for the instance | If resources are insufficient, the instance cannot start. Wait a few minutes and try again, or create the instance in a different region. |
| ECS instance OS status | Whether kernel panics, OOM errors, or internal failures exist in the OS | These issues may be caused by misconfiguration or user programs in the OS. Restart the instance. |
| ECS instance virtualization status | Whether exceptions exist in the underlying virtualization layer | Virtualization exceptions can cause the instance to freeze or restart unexpectedly. Restart the instance. |
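Many of the ECS-level solutions start from the instance's current status and recent system events. If the Alibaba Cloud CLI (aliyun) is installed and configured, these can be queried from the command line; the region and instance ID below are placeholders:

```shell
# Check the instance's current status (Running, Stopped, and so on).
aliyun ecs DescribeInstances --RegionId cn-hangzhou \
  --InstanceIds '["i-xxxxxxxxxxxxxxxx"]'

# List recent system events, such as host failures or planned maintenance.
aliyun ecs DescribeInstanceHistoryEvents --RegionId cn-hangzhou \
  --InstanceId i-xxxxxxxxxxxxxxxx
```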