This topic covers the diagnostics procedure for ACK nodes and solutions for common node exceptions.
Node conditions quick reference
The table below maps each node condition to its trigger, typical node state, severity, and the relevant troubleshooting section.
To collect node conditions such as InodesPressure, DockerOffline, RuntimeOffline, and NTPProblem, install node-problem-detector in your cluster and create an Event Center. The Event Center is automatically enabled when you create the cluster. For more information, see Create and use an event center.
| Condition | Trigger | Node state | Severity | See |
|---|---|---|---|---|
| MemoryPressure | Available memory drops below the memory.available threshold | NotReady | High: triggers container eviction | Insufficient memory resources - MemoryPressure |
| DiskPressure | Available disk drops below the imagefs.available threshold | NotReady | High: triggers container eviction if nodefs.available is also exceeded | Insufficient disk space - DiskPressure |
| InodesPressure | Available inodes drop below the inodesFree threshold | NotReady | High: triggers container eviction | Insufficient inodes - InodesPressure |
| NodePIDPressure | Available PIDs drop below the pid.available threshold | NotReady | High: triggers container eviction | Insufficient PIDs - NodePIDPressure |
| RuntimeOffline | dockerd or containerd exception | NotReady | Critical: container runtime is unavailable | dockerd exceptions - RuntimeOffline or containerd exceptions - RuntimeOffline |
| NTPProblem | NTP/chronyd process exception | NotReady | Medium: may cause time drift issues | NTP exceptions - NTPProblem |
Diagnostics procedure
Node is NotReady
-
Check whether any of the following node conditions is True: PIDPressure, DiskPressure, or MemoryPressure (see Node-pressure Eviction). If one is True, go to the corresponding section in this topic.
-
Check the key components of the node. For the check commands, see Check key components.
-
kubelet: Check the status, logs, and configuration. If kubelet is abnormal, see kubelet exceptions.
-
dockerd: Check the status, logs, and configuration. If dockerd is abnormal, see dockerd exceptions - RuntimeOffline.
-
containerd: Check the status, logs, and configuration. If containerd is abnormal, see containerd exceptions - RuntimeOffline.
-
NTP: Check the status, logs, and configuration. If the NTP service is abnormal, see NTP exceptions - NTPProblem.
-
Collect and review the node diagnostics log. See Collect diagnostics logs.
-
Check the node monitoring data (CPU, memory, network). See Check monitoring data. If resource usage is abnormal, see Insufficient CPU resources or Insufficient memory resources - MemoryPressure.
Node is Unknown
-
Verify that the Elastic Compute Service (ECS) instance hosting the node is in the Running state.
-
Check the key components of the node (kubelet, dockerd, containerd, NTP). See Check key components and the corresponding exception sections.
-
Check network connectivity. See Check security groups. If a network exception occurs, see Network exceptions.
-
Collect and review the node diagnostics log. See Collect diagnostics logs.
-
Check the node monitoring data (CPU, memory, network). See Check monitoring data. If resource usage is abnormal, see Insufficient CPU resources or Insufficient memory resources - MemoryPressure.
If the issue persists
Run the built-in Exception Diagnosis feature. See Use the node diagnosis feature.
Common troubleshooting methods
Use the node diagnosis feature
-
Log on to the ACK console. In the left navigation pane, click Clusters.
-
On the Clusters page, click the cluster name. In the left navigation pane, choose Nodes > Nodes.
-
Find the target node, and in the Actions column, choose More > Exception Diagnosis.
-
In the panel that appears, click Create diagnosis, then review the diagnostics results and remediation suggestions.
Check node details
-
Log on to the ACK console. In the left navigation pane, click Clusters.
-
On the Clusters page, click the cluster name. In the left navigation pane, choose Nodes > Nodes.
-
Find the node and click its name, or click Details in the Actions column.
Check node status
-
Log on to the ACK console. In the left navigation pane, click Clusters.
-
On the Clusters page, click the cluster name. In the left navigation pane, choose Nodes > Nodes.
-
On the Nodes page, review the status of each node:
-
Ready: The node is operating normally.
-
NotReady: Click the node name or Details in the Actions column to view node details.
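The same status check can be approximated from the command line. The following is a minimal sketch, assuming you have kubectl access to the cluster; the function name list_unhealthy_nodes is hypothetical and not part of ACK or kubectl. It filters the output of kubectl get nodes and prints every node whose STATUS column does not start with Ready.

```shell
# Hypothetical helper: print nodes whose STATUS is not Ready.
# Reads `kubectl get nodes --no-headers` output on stdin
# (columns: NAME STATUS ROLES AGE VERSION).
list_unhealthy_nodes() {
  awk '$2 !~ /^Ready/ { print $1, $2 }'
}

# Usage (requires cluster access):
#   kubectl get nodes --no-headers | list_unhealthy_nodes
```

Nodes that are cordoned but healthy (Ready,SchedulingDisabled) are deliberately not reported, because their Ready condition is still True.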
Check node events
-
Log on to the ACK console. In the left navigation pane, click Clusters.
-
On the Clusters page, click the cluster name. In the left navigation pane, choose Nodes > Nodes.
-
Click the node name or Details in the Actions column. Events related to the node appear in the lower section of the node details page.
Collect diagnostics logs
Use either of the following methods:
-
Use the node diagnosis feature in the console. See Node diagnostics.
-
Run a script. See How do I collect the diagnostic data of an ACK cluster?
Check key components
kubelet
Run the following commands on the node where kubelet is running.
Check the status:
systemctl status kubelet
Expected output: for a healthy service, the Active line shows active (running).
Check the logs:
journalctl -u kubelet
Check the configuration:
cat /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
dockerd
Run the following commands on the node where dockerd is running.
Check the status:
systemctl status docker
Expected output: for a healthy service, the Active line shows active (running).
Check the logs:
journalctl -u docker
Check the configuration:
cat /etc/docker/daemon.json
containerd
Run the following commands on the node where containerd is running.
Check the status:
systemctl status containerd
Expected output: for a healthy service, the Active line shows active (running).
Check the logs:
journalctl -u containerd
NTP
Run the following commands on the node where the NTP service is running.
Check the status:
systemctl status chronyd
Expected output: for a healthy service, the Active line shows active (running).
Check the logs:
journalctl -u chronyd
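To check all key components in one pass, the individual status commands above can be wrapped in a small loop. This is a sketch only: the function name check_key_components is hypothetical, and it assumes the node uses systemd with the unit names shown in this topic (kubelet, docker or containerd, chronyd).

```shell
# Hypothetical helper: report the systemd state of each key component.
# Pass the unit names that apply to the node; a node normally runs either
# docker or containerd as its container runtime, not both.
check_key_components() {
  for unit in "$@"; do
    # `systemctl is-active` prints the state (active, inactive, failed, ...)
    # and exits non-zero for anything other than active.
    state=$(systemctl is-active "$unit" 2>/dev/null) || true
    printf '%s: %s\n' "$unit" "${state:-unknown}"
  done
}

# Usage on a containerd node:
#   check_key_components kubelet containerd chronyd
```

Any unit that is not reported as active is a candidate for the corresponding exception section in this topic.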
Check monitoring data
CloudMonitor
ACK is integrated with CloudMonitor. Log on to the CloudMonitor console to view monitoring data for the ECS instances in your cluster. See Monitor nodes.
Managed Service for Prometheus
-
Log on to the ACK console. In the left navigation pane, click Clusters.
-
Click the cluster name. In the left navigation pane, choose Operations > Prometheus Monitoring.
-
On the Prometheus Monitoring page, click the Node Monitoring tab, then the Nodes tab.
-
Select a node from the drop-down list to view its CPU, memory, and disk metrics.
Check security groups
See Overview for security group concepts, and Configure security groups for clusters for configuration steps.
Read node conditions from kubectl
Run the following command to view all conditions on a node and compare them against the healthy baseline:
kubectl describe node <node-name>
The Conditions section of a healthy node looks similar to the following:
Conditions:
Type Status Reason Message
---- ------ ------ -------
MemoryPressure False KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False KubeletHasSufficientPID kubelet has sufficient PID available
Ready True KubeletReady kubelet is posting ready status
If any condition shows True for a pressure type, or False for Ready, the node has an active problem. Cross-reference the condition name with the node conditions quick reference table to find the relevant section.
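The comparison against the healthy baseline can also be scripted. The sketch below is an illustration under stated assumptions: unhealthy_conditions is a hypothetical helper name, and it treats any status other than True for Ready, or other than False for every other condition, as a problem (so Unknown is reported as well).

```shell
# Hypothetical helper: print conditions that deviate from the healthy baseline.
# Reads "<type> <status>" pairs on stdin, one per line.
unhealthy_conditions() {
  awk '
    NF < 2 { next }                                           # skip blank lines
    $1 == "Ready" { if ($2 != "True") print $1 "=" $2; next } # Ready must be True
    $2 != "False" { print $1 "=" $2 }                         # all others must be False
  '
}

# Usage (requires cluster access; <node-name> is a placeholder):
#   kubectl get node <node-name> \
#     -o jsonpath='{range .status.conditions[*]}{.type}{" "}{.status}{"\n"}{end}' \
#     | unhealthy_conditions
```

Empty output means every condition matches the baseline shown above.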
kubelet exceptions
Cause: kubelet exceptions are typically caused by issues with the kubelet process itself, the container runtime, or invalid kubelet configurations.
Symptom: The kubelet service status is inactive.
Solution:
-
Restart kubelet. The restart does not affect running containers.
systemctl restart kubelet
-
Check whether kubelet is now healthy:
systemctl status kubelet
-
If the status is still abnormal, check the logs:
journalctl -u kubelet
-
If the logs show an exception, troubleshoot based on the error message.
-
If the configuration is invalid, edit and reload it:
vi /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
systemctl daemon-reload; systemctl restart kubelet
dockerd exceptions - RuntimeOffline
Cause: In most cases, a dockerd exception occurs because the dockerd configuration is invalid, the dockerd process is overloaded, or the node is overloaded.
Symptoms:
-
The dockerd service status is inactive.
-
The dockerd service status is active (running) but dockerd is not functioning correctly. For example, docker ps or docker exec commands fail.
-
The node condition RuntimeOffline is True.
-
If you enabled alerting for cluster nodes, you receive an alert when a dockerd exception occurs. See Alert management to configure alert rules.
Solution:
-
Restart dockerd:
systemctl restart docker
-
Check whether dockerd is now healthy:
systemctl status docker
-
If the status is still abnormal, check the logs:
journalctl -u docker
containerd exceptions - RuntimeOffline
Cause: In most cases, a containerd exception occurs because the containerd configuration is invalid, the containerd process is overloaded, or the node is overloaded.
Symptoms:
-
The containerd service status is inactive.
-
The node condition RuntimeOffline is True.
-
If you enabled alerting for cluster nodes, you receive an alert when a containerd exception occurs. See Alert management to configure alert rules.
Solution:
-
Restart containerd:
systemctl restart containerd
-
Check whether containerd is now healthy:
systemctl status containerd
-
If the status is still abnormal, check the logs:
journalctl -u containerd
NTP exceptions - NTPProblem
Cause: In most cases, an NTP exception occurs because the status of the NTP process is abnormal.
Symptoms:
-
The chronyd service status is inactive.
-
The node condition NTPProblem is True.
-
If you enabled alerting for cluster nodes, you receive an alert when an NTP exception occurs. See Alert management to configure alert rules.
Solution:
-
Restart chronyd:
systemctl restart chronyd
-
Check whether chronyd is now healthy:
systemctl status chronyd
-
If the status is still abnormal, check the logs:
journalctl -u chronyd
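After chronyd is running again, it can help to confirm that the clock is actually close to NTP time. The following sketch assumes the chronyc client is installed alongside chronyd and that its tracking report contains a "System time" line; the helper name clock_offset is hypothetical.

```shell
# Hypothetical helper: extract the current clock offset from `chronyc tracking`.
# Reads the command output on stdin and prints "<seconds> <fast|slow>".
clock_offset() {
  awk -F' *: ' '/^System time/ { split($2, f, " "); print f[1], f[3] }'
}

# Usage on a node:
#   chronyc tracking | clock_offset
```

What counts as an acceptable offset depends on your workloads; this topic does not define a threshold.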
PLEG exceptions - PLEG is not healthy
Cause: The Pod Lifecycle Event Generator (PLEG) records events throughout the lifecycle of pods, such as container startups and terminations. The PLEG is not healthy exception typically occurs when the container runtime on the node is abnormal or the node uses an older systemd version.
Symptoms:
-
The node state is NotReady.
-
The kubelet logs contain the following:
I0729 11:20:59.245243 9575 kubelet.go:1823] skipping pod synchronization - PLEG is not healthy: pleg was last seen active 3m57.138893648s ago; threshold is 3m0s.
-
If you enabled alerting for cluster nodes, you receive an alert when a PLEG exception occurs. See Alert management to configure alert rules.
Solution:
Try the following steps in order, from least to most disruptive:
-
Restart the key components in sequence: first dockerd or containerd, then kubelet. Check whether the node recovers.
-
If the node is still not healthy, restart the entire node instance. See Restart an instance.
Warning: Restarting the node also restarts all pods running on it. Proceed with caution.
-
If the node runs CentOS 7.6, refer to What can I do if the kubelet logs contain the "Reason:KubeletNotReady Message:PLEG is not healthy:" error when CentOS 7.6 is used.
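To judge whether a restart helped, you can look at how long the PLEG has been inactive in successive kubelet log entries. This is a minimal sketch that assumes the log message has the same wording as the sample in the Symptoms list; pleg_inactive_durations is a hypothetical helper name.

```shell
# Hypothetical helper: print the "last seen active" duration from each
# kubelet log line that reports an unhealthy PLEG. Reads log text on stdin.
pleg_inactive_durations() {
  sed -n 's/.*PLEG is not healthy: pleg was last seen active \([^ ]*\) ago.*/\1/p'
}

# Usage on a node:
#   journalctl -u kubelet --since "1 hour ago" | pleg_inactive_durations
```

Durations that keep growing suggest the container runtime is still not responding; no output means the message did not appear in the selected time range.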
Insufficient node resources for scheduling
Cause: The nodes in the cluster do not have enough resources to schedule new pods.
Symptoms:
Pod scheduling fails with one of the following errors:
-
0/2 nodes are available: 2 Insufficient cpu
-
0/2 nodes are available: 2 Insufficient memory
-
0/2 nodes are available: 2 Insufficient ephemeral-storage
The scheduler marks a resource as insufficient when:
-
CPU: Pod's requested CPU > (Node's allocatable CPU - Node's already allocated CPU)
-
Memory: Pod's requested memory > (Node's allocatable memory - Node's already allocated memory)
-
Ephemeral storage: Pod's requested ephemeral storage > (Node's allocatable ephemeral storage - Node's already allocated ephemeral storage)
Run the following command to view resource allocation on the node:
kubectl describe node <node-name>
Expected output:
Allocatable:
cpu: 3900m
ephemeral-storage: 114022843818
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 12601Mi
pods: 60
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 725m (18%) 6600m (169%)
memory 977Mi (7%) 16640Mi (132%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
-
Allocatable: the total amount of resources available for scheduling on the node.
-
Allocated resources: the amount of resources already claimed by scheduled pods.
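As a worked example of the scheduler rule for CPU, the sketch below applies the comparison to the sample output above (3900m allocatable, 725m already requested). The function name cpu_request_fits is hypothetical, and the pod request values are made up for illustration.

```shell
# Hypothetical helper: does a pod CPU request fit on a node?
# All arguments are integers in millicores (for example, 725 for 725m).
cpu_request_fits() {
  request=$1
  allocatable=$2
  allocated=$3
  free=$((allocatable - allocated))   # CPU still available for scheduling
  if [ "$request" -le "$free" ]; then
    echo "fits: requested ${request}m, free ${free}m"
  else
    echo "Insufficient cpu: requested ${request}m, free ${free}m"
    return 1
  fi
}

# 3900m allocatable - 725m allocated = 3175m free
cpu_request_fits 3000 3900 725          # fits
cpu_request_fits 3500 3900 725 || true  # reported as Insufficient cpu
```

The same subtraction applies to memory and ephemeral storage. Note that only requests count toward the allocated amount; limits can exceed 100% without blocking scheduling.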
Solution:
To reduce node load, use one or more of the following approaches:
-
Delete pods you no longer need. See Manage pods.
-
Adjust resource requests and limits for pods. See Modify the upper and lower limits of CPU and memory resources for a pod. To get right-sizing recommendations based on historical usage, see Resource profiling.
-
Add nodes to the cluster. See Create and manage a node pool.
-
Upgrade node specifications. See Upgrade or downgrade the configurations of a worker node.
Insufficient CPU resources
Cause: In most cases, CPU resources on a node become insufficient because containers have consumed an excessive amount of CPU.
Symptoms:
-
The node status becomes abnormal.
-
If you enabled alerting for cluster nodes, you receive an alert when CPU usage reaches or exceeds 85%. See Alert management to configure alert rules.
Solution:
-
On the Node Monitoring tab of the console, review the CPU usage curve and identify when usage spiked. Check whether any processes on the node are consuming excessive CPU.
-
Reduce the node load. See Insufficient node resources for scheduling.
-
If the issue persists, restart the node. See Restart an instance.
Warning: Restarting the node also restarts all pods running on it. Proceed with caution.
Insufficient memory resources - MemoryPressure
Cause: In most cases, memory resources on a node become insufficient because containers have consumed an excessive amount of memory.
Symptoms:
-
The node condition MemoryPressure changes to True when available memory drops below the memory.available threshold, triggering container eviction. See Node-pressure Eviction.
-
Evicted containers show the event: The node was low on resource: memory.
-
The node shows the event: attempting to reclaim memory.
-
An out-of-memory (OOM) error may occur. When OOM happens, the node shows the event: System OOM.
-
If you enabled alerting for cluster nodes, you receive an alert when memory usage reaches or exceeds 85%. See Alert management to configure alert rules.
Solution:
-
On the Node Monitoring tab of the console, review the memory usage curve and identify when usage spiked. Check whether any processes on the node have memory leaks.
-
Reduce the node load. See Insufficient node resources for scheduling.
-
If the issue persists, restart the node. See Restart an instance.
Warning: Restarting the node also restarts all pods running on it. Proceed with caution.
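To see which pods were evicted under memory pressure, you can filter the pod list by status. The following is a sketch, assuming kubectl access and the default column layout of kubectl get pods -A; the helper name list_evicted_pods is hypothetical.

```shell
# Hypothetical helper: list pods whose STATUS column is Evicted.
# Reads `kubectl get pods -A --no-headers` output on stdin
# (columns: NAMESPACE NAME READY STATUS RESTARTS AGE).
list_evicted_pods() {
  awk '$4 == "Evicted" { print $1 "/" $2 }'
}

# Usage (requires cluster access):
#   kubectl get pods -A --no-headers | list_evicted_pods
```

Run kubectl describe pod on a listed pod to read the eviction message and confirm which resource was low.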
Insufficient inodes - InodesPressure
Cause: In most cases, inodes on a node become insufficient because containers have consumed an excessive number of inodes.
Symptoms:
-
The node condition InodesPressure changes to True when available inodes drop below the inodesFree threshold, triggering container eviction. See Node-pressure Eviction.
-
Evicted containers show the event: The node was low on resource: inodes.
-
The node shows the event: attempting to reclaim inodes.
-
If you enabled alerting for cluster nodes, you receive an alert when inode usage is insufficient. See Alert management to configure alert rules.
Solution:
-
On the Node Monitoring tab of the console, review the inode usage curve and identify when usage spiked. Check whether any processes on the node are consuming excessive inodes.
-
For additional remediation steps, see Resolve the issue of insufficient disk space on a Linux instance.
Insufficient PIDs - NodePIDPressure
Cause: In most cases, process IDs (PIDs) on a node become insufficient because containers have consumed an excessive number of PIDs.
Symptoms:
-
The node condition NodePIDPressure changes to True when available PIDs drop below the pid.available threshold, triggering container eviction. See Node-pressure Eviction.
-
If you enabled alerting for cluster nodes, you receive an alert when PID resources are insufficient. See Alert management to configure alert rules.
Solution:
-
Query the maximum PID limit and the current highest PID value:
sysctl kernel.pid_max
ps -eLf | awk '{print $2}' | sort -rn | head -n 1
-
Identify the top five processes by PID count:
ps -elT | awk '{print $4}' | sort | uniq -c | sort -k1 -g | tail -5
Expected output (column 1: PID count, column 2: process ID):
73 9743
75 9316
76 2812
77 5726
93 5691
-
Use the process IDs to identify the corresponding processes and pods, diagnose the root cause, and optimize the code.
-
Reduce the node load. See Insufficient node resources for scheduling.
-
If the issue persists, restart the node. See Restart an instance.
Warning: Restarting the node also restarts all pods running on it. Proceed with caution.
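To put the PID numbers in perspective, you can express the number of running tasks as a share of kernel.pid_max. This is a rough sketch, not the exact calculation kubelet uses for pid.available; the helper name pid_usage_percent is hypothetical.

```shell
# Hypothetical helper: integer percentage of the PID space in use.
#   $1 = number of tasks (threads) currently running
#   $2 = value of kernel.pid_max
pid_usage_percent() {
  echo $(( $1 * 100 / $2 ))
}

# Usage on a node (every thread occupies one PID):
#   pid_usage_percent "$(ps -eL --no-headers | wc -l)" "$(sysctl -n kernel.pid_max)"
```

For example, 8192 running tasks with a pid_max of 32768 gives 25.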
Insufficient disk space - DiskPressure
Cause: In most cases, disk space on a node becomes insufficient because containers have consumed an excessive amount of disk space or container images are too large.
Symptoms:
-
The node condition DiskPressure changes to True when available disk space drops below the imagefs.available threshold.
-
If available disk drops below the nodefs.available threshold, all containers on the node are evicted. See Node-pressure Eviction.
-
If disk space remains below the health threshold (default: 80%) after image reclaim, the node shows the event: failed to garbage collect required amount of images.
-
Evicted containers show the event: The node was low on resource: [DiskPressure].
-
The node shows the event: attempting to reclaim ephemeral-storage or attempting to reclaim nodefs.
-
If you enabled alerting for cluster nodes, you receive an alert when disk usage reaches or exceeds 85%. See Alert management to configure alert rules.
Solution:
-
On the Node Monitoring tab of the console, review the disk usage curve and identify when usage spiked. Check whether any processes on the node are consuming excessive disk space.
-
Delete files you no longer need. See Resolve the issue of insufficient disk space on a Linux instance.
-
Set ephemeral-storage limits on pods to cap their disk usage. See Modify the upper and lower limits of CPU and memory resources for a pod.
-
Use Alibaba Cloud storage services instead of hostPath volumes. See Storage.
-
Resize the node's disk.
-
Reduce the node load. See Insufficient node resources for scheduling.
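When you log on to the node to free up space, a quick way to find the affected file systems is to filter df output by usage. The sketch below is illustrative: over_threshold is a hypothetical helper name, the 85% value mirrors the alert threshold mentioned above, and mount points that contain spaces are not handled.

```shell
# Hypothetical helper: print mount points whose usage is at or above a threshold.
# Reads `df -P` output on stdin; $1 = threshold in percent.
over_threshold() {
  awk -v limit="$1" '
    NR > 1 {                               # skip the header line
      use = $5; sub(/%/, "", use)          # "91%" -> "91"
      if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Usage on a node:
#   df -P | over_threshold 85
```

Because df -Pi uses the same column layout for inode usage, the same filter can be applied to it when you troubleshoot InodesPressure.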
Insufficient IP addresses - InvalidVSwitchId.IpNotEnough
Cause: In most cases, IP addresses become insufficient because containers have consumed too many IP addresses from the vSwitch.
Symptoms:
-
Pods fail to start and remain in the ContainerCreating state. The pod logs contain:
time="2020-03-17T07:03:40Z" level=warning msg="Assign private ip address failed: Aliyun API Error: RequestId: 2095E971-E473-4BA0-853F-0C41CF52651D Status Code: 403 Code: InvalidVSwitchId.IpNotEnough Message: The specified VSwitch \"vsw-AAA\" has not enough IpAddress., retrying"
-
If you enabled alerting for cluster nodes, you receive an alert when IP addresses are insufficient. See Alert management to configure alert rules.
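When several vSwitches are in use, it helps to know which one ran out of addresses. The following sketch extracts the vSwitch IDs from log lines that carry the error code shown above. It assumes the message format matches the sample; exhausted_vswitches is a hypothetical helper name, and where you obtain the log text depends on your network plug-in.

```shell
# Hypothetical helper: list vSwitch IDs reported as out of IP addresses.
# Reads log text on stdin and prints each affected vSwitch ID once.
exhausted_vswitches() {
  grep 'InvalidVSwitchId.IpNotEnough' | grep -o 'vsw-[A-Za-z0-9]*' | sort -u
}

# Usage: pipe the log that contains the error into the helper, for example
#   <command that prints the log> | exhausted_vswitches
```

Applied to the sample log line above, the helper prints vsw-AAA.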
Solution:
Reduce the number of containers on the node. See Insufficient node resources for scheduling.
For Terway-specific IP address issues, see the following articles:
Network exceptions
Cause: In most cases, a network exception occurs because the node state is abnormal, security group configurations are invalid, or the network is overloaded.
Symptoms:
-
You cannot log on to the node.
-
The node state is Unknown.
-
If you enabled alerting for cluster nodes, you receive an alert when outbound internet bandwidth usage reaches or exceeds 85%. See Alert management to configure alert rules.
Solution:
If you cannot log on to the node:
-
Check whether the node is in the Running state.
-
Check the security group configurations. See Check security groups.
If the network is overloaded:
-
On the Node Monitoring tab of the console, review the network performance curve and look for bandwidth usage spikes.
-
Apply network policies to throttle pod traffic. See Use network policies in ACK clusters.
Unexpected node restarts
Cause: In most cases, a node unexpectedly restarts because it is overloaded.
Symptoms:
-
The node state is NotReady during the restart.
-
If you enabled alerting for cluster nodes, you receive an alert when the node unexpectedly restarts. See Alert management to configure alert rules.
Solution:
-
Find out when the node restarted:
last reboot
The output lists each recorded reboot with its timestamp.
-
Review the node monitoring data around that time to identify abnormal resource usage. See Check monitoring data.
-
Review the node kernel logs around that time. See Collect diagnostics logs.
High disk I/O from auditd or "audit: backlog limit exceeded" in system log
Cause: Some existing nodes in the cluster have auditd rules for Docker configured by default. When containers restart frequently, an application writes a large amount of data inside a container, or a kernel bug is triggered, the volume of audit log writes can spike. This causes high disk I/O from the auditd process and the error audit: backlog limit exceeded in the system log.
This issue affects only nodes running Docker (not containerd).
Symptoms:
-
iotop -o -d 1 shows that DISK WRITE is consistently at 1 MB/s or higher.
-
dmesg -d output contains audit_printk_skb: 100 callbacks suppressed.
-
dmesg -d output contains audit: backlog limit exceeded.
Confirm the cause:
-
Log on to the node.
-
Check whether the issue is caused by auditd Docker rules:
sudo auditctl -l | grep -- ' -k docker'
If the output includes -w /var/lib/docker -k docker, the issue is caused by the auditd rules.
Solution:
Use one of the following approaches (in order of preference):
Upgrade the cluster
Upgrade the Kubernetes version of the cluster. See Manually upgrade a cluster.
Switch the container runtime to containerd
If upgrading is not possible, migrate node pools from Docker to containerd:
-
Create new node pools by cloning the existing Docker node pools. Configure the new node pools to use containerd, keeping all other settings identical.
-
During off-peak hours, drain nodes from the Docker node pools one by one until all application pods are running on the containerd node pools.
Update auditd configurations manually
If neither of the above applies, remove the auditd rules for Docker:
-
Log on to the node.
-
Delete the auditd rules:
sudo test -f /etc/audit/rules.d/audit.rules && sudo sed -i.bak '/ -k docker/d' /etc/audit/rules.d/audit.rules
sudo test -f /etc/audit/audit.rules && sudo sed -i.bak '/ -k docker/d' /etc/audit/audit.rules
-
Apply the updated rules:
if service auditd status | grep running || systemctl status auditd | grep running; then
  sudo service auditd restart || sudo systemctl restart auditd
  sudo service auditd status || sudo systemctl status auditd
fi