Container Service for Kubernetes: Troubleshoot abnormal nodes

Last Updated: Mar 26, 2026

This topic covers the diagnostics procedure for ACK nodes and solutions for common node exceptions.

Node conditions quick reference

The table below maps each node condition to its trigger, typical node state, severity, and the relevant troubleshooting section.

To collect node conditions such as InodesPressure, DockerOffline, RuntimeOffline, and NTPProblem, install node-problem-detector in your cluster and create an Event Center. The Event Center is automatically enabled when you create the cluster. For more information, see Create and use an event center.

| Condition | Trigger | Node state | Severity | See |
| --- | --- | --- | --- | --- |
| MemoryPressure | Available memory drops below the memory.available threshold | NotReady | High (triggers container eviction) | Insufficient memory resources - MemoryPressure |
| DiskPressure | Available disk drops below the imagefs.available threshold | NotReady | High (triggers container eviction if nodefs.available is also exceeded) | Insufficient disk space - DiskPressure |
| InodesPressure | Available inodes drop below the inodesFree threshold | NotReady | High (triggers container eviction) | Insufficient inodes - InodesPressure |
| NodePIDPressure | Available PIDs drop below the pid.available threshold | NotReady | High (triggers container eviction) | Insufficient PIDs - NodePIDPressure |
| RuntimeOffline | dockerd or containerd exception | NotReady | Critical (container runtime is unavailable) | dockerd exceptions - RuntimeOffline or containerd exceptions - RuntimeOffline |
| NTPProblem | NTP/chronyd process exception | NotReady | Medium (may cause time drift issues) | NTP exceptions - NTPProblem |

Diagnostics procedure

Node is NotReady

  1. Check whether any of the following node conditions is True: PIDPressure, DiskPressure, or MemoryPressure. If one is True, go to the corresponding section in this topic. For background on these conditions, see Node-pressure Eviction.

  2. Check the key components of the node (kubelet, dockerd, containerd, NTP). For the check commands, see Check key components.

  3. Collect and review the node diagnostics log. See Collect diagnostics logs.

  4. Check the node monitoring data (CPU, memory, network). See Check monitoring data. If resource usage is abnormal, see Insufficient CPU resources or Insufficient memory resources - MemoryPressure.

Node is Unknown

  1. Verify that the Elastic Compute Service (ECS) instance hosting the node is in the Running state.

  2. Check the key components of the node (kubelet, dockerd, containerd, NTP). See Check key components and the corresponding exception sections.

  3. Check network connectivity. See Check security groups. If a network exception occurs, see Network exceptions.

  4. Collect and review the node diagnostics log. See Collect diagnostics logs.

  5. Check the node monitoring data (CPU, memory, network). See Check monitoring data. If resource usage is abnormal, see Insufficient CPU resources or Insufficient memory resources - MemoryPressure.

If the issue persists

Run the built-in Exception Diagnosis feature. See Use the node diagnosis feature.

Common troubleshooting methods

Use the node diagnosis feature

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the cluster name. In the left navigation pane, choose Nodes > Nodes.

  3. Find the target node, and in the Actions column, choose More > Exception Diagnosis.

  4. In the panel that appears, click Create diagnosis, then review the diagnostics results and remediation suggestions.

Check node details

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the cluster name. In the left navigation pane, choose Nodes > Nodes.

  3. Find the node and click its name, or click Details in the Actions column.

Check node status

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the cluster name. In the left navigation pane, choose Nodes > Nodes.

  3. On the Nodes page, review the status of each node:

    • Ready: The node is operating normally.

    • NotReady: Click the node name or Details in the Actions column to view node details.
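The same status check can be made from the command line. The following is a minimal sketch, assuming kubectl is configured for the cluster. The helper name list_unhealthy_nodes is hypothetical. It reads the output of kubectl get nodes --no-headers and prints every node whose STATUS column does not start with Ready.

```shell
# Hypothetical helper: print nodes whose STATUS is not Ready.
# Reads `kubectl get nodes --no-headers` output on stdin.
list_unhealthy_nodes() {
  # Column 1 is the node name, column 2 is the status
  # (for example Ready, NotReady, or Ready,SchedulingDisabled).
  awk '$2 !~ /^Ready/ { print $1, $2 }'
}

# Usage (requires cluster access):
#   kubectl get nodes --no-headers | list_unhealthy_nodes
```

An empty result means that every node reports Ready.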

Check node events

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the cluster name. In the left navigation pane, choose Nodes > Nodes.

  3. Click the node name or Details in the Actions column. Events related to the node appear in the lower section of the node details page.

Collect diagnostics logs

Use either of the following methods:

Check key components

kubelet

Run the following commands on the node where kubelet is running.

Check the status:

systemctl status kubelet

Expected output: for a healthy service, the Active field shows active (running).

Check the logs:

journalctl -u kubelet

Check the configuration:

cat /etc/systemd/system/kubelet.service.d/10-kubeadm.conf

dockerd

Run the following commands on the node where dockerd is running.

Check the status:

systemctl status docker

Expected output: for a healthy service, the Active field shows active (running).

Check the logs:

journalctl -u docker

Check the configuration:

cat /etc/docker/daemon.json

containerd

Run the following commands on the node where containerd is running.

Check the status:

systemctl status containerd

Expected output: for a healthy service, the Active field shows active (running).

Check the logs:

journalctl -u containerd

NTP

Run the following commands on the node where the NTP service is running.

Check the status:

systemctl status chronyd

Expected output: for a healthy service, the Active field shows active (running).

Check the logs:

journalctl -u chronyd
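The per-component status checks above can be combined into one pass. This is a rough sketch, not an official tool: the helper name check_components is hypothetical, and it relies only on systemctl is-active, which prints the state of a unit.

```shell
# Hypothetical helper: report the systemd state of each unit passed as an argument.
check_components() {
  for unit in "$@"; do
    # `systemctl is-active` prints the state (active, inactive, failed, ...)
    # and returns non-zero when the unit is not active.
    state=$(systemctl is-active "$unit" 2>/dev/null) || true
    printf '%-12s %s\n' "$unit" "${state:-unknown}"
  done
}

# Usage on a containerd node:
#   check_components kubelet containerd chronyd
# Usage on a Docker node:
#   check_components kubelet docker chronyd
```

Any unit that does not report active is a candidate for the corresponding exception section in this topic.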

Check monitoring data

CloudMonitor

ACK is integrated with CloudMonitor. Log on to the CloudMonitor console to view monitoring data for the ECS instances in your cluster. See Monitor nodes.

Managed Service for Prometheus

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. Click the cluster name. In the left navigation pane, choose Operations > Prometheus Monitoring.

  3. On the Prometheus Monitoring page, click the Node Monitoring tab, then the Nodes tab.

  4. Select a node from the drop-down list to view its CPU, memory, and disk metrics.

Check security groups

See Overview for security group concepts, and Configure security groups for clusters for configuration steps.

Read node conditions from kubectl

Run the following command to view all conditions on a node and compare them against the healthy baseline:

kubectl describe node <node-name>

The Conditions section of a healthy node looks similar to the following:

Conditions:
  Type                 Status  Reason                       Message
  ----                 ------  ------                       -------
  MemoryPressure       False   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    KubeletReady                 kubelet is posting ready status

If any condition shows True for a pressure type, or False for Ready, the node has an active problem. Cross-reference the condition name with the node conditions quick reference table to find the relevant section.
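The comparison against the healthy baseline can also be scripted. The sketch below assumes that conditions are supplied as "type status" pairs, one per line. The helper name flag_conditions is hypothetical. It treats Ready as healthy only when True and every other condition as healthy only when False.

```shell
# Hypothetical helper: flag node conditions that deviate from the healthy baseline.
# Reads "<type> <status>" pairs on stdin, one per line.
flag_conditions() {
  awk '
    $1 == "Ready" { if ($2 != "True")  print "PROBLEM: Ready is " $2; next }
    NF >= 2       { if ($2 != "False") print "PROBLEM: " $1 " is " $2 }
  '
}

# Usage (requires cluster access; <node-name> is a placeholder):
#   kubectl get node <node-name> \
#     -o jsonpath='{range .status.conditions[*]}{.type}{" "}{.status}{"\n"}{end}' \
#     | flag_conditions
```

No output means that all reported conditions match the baseline.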

kubelet exceptions

Cause: kubelet exceptions are typically caused by issues with the kubelet process itself, the container runtime, or invalid kubelet configurations.

Symptom: The kubelet service status is inactive.

Solution:

  1. Restart kubelet. The restart does not affect running containers.

    systemctl restart kubelet
  2. Check whether kubelet is now healthy:

    systemctl status kubelet
  3. If the status is still abnormal, check the logs:

    journalctl -u kubelet

    • If the logs show an exception, troubleshoot based on the error message.

    • If the configuration is invalid, edit and reload it:

      vi /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
      systemctl daemon-reload; systemctl restart kubelet
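The restart, verify, and log steps can be wrapped in one function. This is a minimal sketch under the assumption that it runs as root on the node. The helper name restart_and_verify is hypothetical, and the same function can be used for docker, containerd, or chronyd by passing a different unit name.

```shell
# Hypothetical helper: restart a systemd unit, verify it, and show recent logs on failure.
restart_and_verify() {
  unit=$1
  systemctl restart "$unit" || true
  if systemctl is-active --quiet "$unit"; then
    echo "$unit is active"
  else
    echo "$unit is still not active; recent log lines:"
    # -n 20 limits the output to the 20 most recent entries.
    journalctl -u "$unit" -n 20 --no-pager
    return 1
  fi
}

# Usage:
#   restart_and_verify kubelet
```

A unit can report active briefly and then fail again, so a second status check after a short wait is advisable.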

dockerd exceptions - RuntimeOffline

Cause: In most cases, a dockerd exception occurs because the dockerd configuration is invalid, the dockerd process is overloaded, or the node is overloaded.

Symptoms:

  • The dockerd service status is inactive.

  • The dockerd service status is active (running) but dockerd is not functioning correctly — for example, docker ps or docker exec commands fail.

  • The node condition RuntimeOffline is True.

  • If you enabled alerting for cluster nodes, you receive an alert when a dockerd exception occurs. See Alert management to configure alert rules.

Solution:

  1. Restart dockerd:

    systemctl restart docker
  2. Check whether dockerd is now healthy:

    systemctl status docker
  3. If the status is still abnormal, check the logs:

    journalctl -u docker

containerd exceptions - RuntimeOffline

Cause: In most cases, a containerd exception occurs because the containerd configuration is invalid, the containerd process is overloaded, or the node is overloaded.

Symptoms:

  • The containerd service status is inactive.

  • The node condition RuntimeOffline is True.

  • If you enabled alerting for cluster nodes, you receive an alert when a containerd exception occurs. See Alert management to configure alert rules.

Solution:

  1. Restart containerd:

    systemctl restart containerd
  2. Check whether containerd is now healthy:

    systemctl status containerd
  3. If the status is still abnormal, check the logs:

    journalctl -u containerd

NTP exceptions - NTPProblem

Cause: In most cases, an NTP exception occurs because the status of the NTP process is abnormal.

Symptoms:

  • The chronyd service status is inactive.

  • The node condition NTPProblem is True.

  • If you enabled alerting for cluster nodes, you receive an alert when an NTP exception occurs. See Alert management to configure alert rules.

Solution:

  1. Restart chronyd:

    systemctl restart chronyd
  2. Check whether chronyd is now healthy:

    systemctl status chronyd
  3. If the status is still abnormal, check the logs:

    journalctl -u chronyd
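Once chronyd is running again, it can help to confirm that the clock offset is small. The sketch below assumes that the chronyc client is installed alongside chronyd and that chronyc tracking prints a line beginning with "System time". The helper name chrony_offset_seconds is hypothetical.

```shell
# Hypothetical helper: extract the clock offset in seconds from `chronyc tracking` output (stdin).
chrony_offset_seconds() {
  # The relevant line is assumed to look like:
  #   System time     : 0.000012345 seconds fast of NTP time
  awk '/^System time/ { print $4; exit }'
}

# Usage:
#   chronyc tracking | chrony_offset_seconds
```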

PLEG exceptions - PLEG is not healthy

Cause: The Pod Lifecycle Event Generator (PLEG) records events throughout the lifecycle of pods, such as container startups and terminations. The PLEG is not healthy exception typically occurs when the container runtime on the node is abnormal or the node uses an older systemd version.

Symptoms:

  • The node state is NotReady.

  • The kubelet logs contain the following:

    I0729 11:20:59.245243    9575 kubelet.go:1823] skipping pod synchronization - PLEG is not healthy: pleg was last seen active 3m57.138893648s ago; threshold is 3m0s.
  • If you enabled alerting for cluster nodes, you receive an alert when a PLEG exception occurs. See Alert management to configure alert rules.
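To see how long PLEG has been inactive, the kubelet log can be filtered for the message shown in the symptoms. This is a small sketch. The helper name pleg_last_active is hypothetical, and it assumes that the log line format matches the sample in this section.

```shell
# Hypothetical helper: for each "PLEG is not healthy" log line on stdin,
# print how long ago PLEG was last seen active.
pleg_last_active() {
  grep 'PLEG is not healthy' \
    | sed -n 's/.*pleg was last seen active \([^ ]*\) ago.*/\1/p'
}

# Usage:
#   journalctl -u kubelet --no-pager | pleg_last_active
```

Durations that keep growing between log lines suggest that the container runtime is not responding at all, rather than responding slowly.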

Solution:

Try the following steps in order, from least to most disruptive:

  1. Restart the key components in sequence: first dockerd or containerd, then kubelet. Check whether the node recovers.

  2. If the node is still not healthy, restart the entire node instance. See Restart an instance.

    Warning

    Restarting the node also restarts all pods running on it. Proceed with caution.

  3. If the node runs CentOS 7.6, refer to What can I do if the kubelet logs contain the "Reason:KubeletNotReady Message:PLEG is not healthy:" error when CentOS 7.6 is used.

Insufficient node resources for scheduling

Cause: The nodes in the cluster do not have enough resources to schedule new pods.

Symptoms:

Pod scheduling fails with one of the following errors:

  • 0/2 nodes are available: 2 Insufficient cpu

  • 0/2 nodes are available: 2 Insufficient memory

  • 0/2 nodes are available: 2 Insufficient ephemeral-storage

The scheduler marks a resource as insufficient when:

  • CPU: Pod's requested CPU > (Node's allocatable CPU - Node's already allocated CPU)

  • Memory: Pod's requested memory > (Node's allocatable memory - Node's already allocated memory)

  • Ephemeral storage: Pod's requested ephemeral storage > (Node's allocatable ephemeral storage - Node's already allocated ephemeral storage)

Run the following command to view resource allocation on the node:

kubectl describe node <node-name>

Expected output:

Allocatable:
  cpu:                3900m
  ephemeral-storage:  114022843818
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             12601Mi
  pods:               60
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                725m (18%)  6600m (169%)
  memory             977Mi (7%)  16640Mi (132%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
  • Allocatable: the total amount of resources available for scheduling on the node.

  • Allocated resources: the amount of resources already claimed by scheduled pods.
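The CPU rule above can be checked with simple arithmetic. The sketch below is illustrative only and covers the request comparison, not the other factors the scheduler evaluates. The helper name cpu_request_fits is hypothetical, and all values are in millicores.

```shell
# Hypothetical helper: decide whether a CPU request fits on a node.
# Arguments: request, allocatable, allocated (all in millicores, e.g. 3900 for "3900m").
cpu_request_fits() {
  request=$1; allocatable=$2; allocated=$3
  free=$((allocatable - allocated))
  if [ "$request" -le "$free" ]; then
    echo "fits: ${request}m requested, ${free}m free"
  else
    echo "insufficient cpu: ${request}m requested, ${free}m free"
    return 1
  fi
}

# With the sample output above (3900m allocatable, 725m already requested):
#   cpu_request_fits 3000 3900 725   # prints "fits: 3000m requested, 3175m free"
#   cpu_request_fits 3200 3900 725   # prints "insufficient cpu: 3200m requested, 3175m free"
```

The comparison uses the Requests column, not Limits, because scheduling decisions are based on requests.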

Solution:

To reduce node load, use one or more of the following approaches:

Insufficient CPU resources

Cause: In most cases, CPU resources on a node become insufficient because containers have consumed an excessive amount of CPU.

Symptoms:

  • The node status becomes abnormal.

  • If you enabled alerting for cluster nodes, you receive an alert when CPU usage reaches or exceeds 85%. See Alert management to configure alert rules.

Solution:

  1. On the Node Monitoring tab of the console, review the CPU usage curve and identify when usage spiked. Check whether any processes on the node are consuming excessive CPU.

  2. Reduce the node load. See Insufficient node resources for scheduling.

  3. If the issue persists, restart the node. See Restart an instance.

    Warning

    Restarting the node also restarts all pods running on it. Proceed with caution.

Insufficient memory resources - MemoryPressure

Cause: In most cases, memory resources on a node become insufficient because containers have consumed an excessive amount of memory.

Symptoms:

  • The node condition MemoryPressure changes to True when available memory drops below the memory.available threshold, triggering container eviction. See Node-pressure Eviction.

  • Evicted containers show the event: The node was low on resource: memory.

  • The node shows the event: attempting to reclaim memory.

  • An out-of-memory (OOM) error may occur. When OOM happens, the node shows the event: System OOM.

  • If you enabled alerting for cluster nodes, you receive an alert when memory usage reaches or exceeds 85%. See Alert management to configure alert rules.

Solution:

  1. On the Node Monitoring tab of the console, review the memory usage curve and identify when usage spiked. Check whether any processes on the node have memory leaks.

  2. Reduce the node load. See Insufficient node resources for scheduling.

  3. If the issue persists, restart the node. See Restart an instance.

    Warning

    Restarting the node also restarts all pods running on it. Proceed with caution.

Insufficient inodes - InodesPressure

Cause: In most cases, inodes on a node become insufficient because containers have consumed an excessive number of inodes.

Symptoms:

  • The node condition InodesPressure changes to True when available inodes drop below the inodesFree threshold, triggering container eviction. See Node-pressure Eviction.

  • Evicted containers show the event: The node was low on resource: inodes.

  • The node shows the event: attempting to reclaim inodes.

  • If you enabled alerting for cluster nodes, you receive an alert when inode usage is insufficient. See Alert management to configure alert rules.

Solution:

  1. On the Node Monitoring tab of the console, review the inode usage curve and identify when usage spiked. Check whether any processes on the node are consuming excessive inodes.

  2. For additional remediation steps, see Resolve the issue of insufficient disk space on a Linux instance.
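To locate where inodes are being consumed, it can help to count filesystem entries per directory. The following is a rough sketch. The helper name inode_count_by_dir is hypothetical, hidden top-level directories are skipped, and the entry count only approximates inode usage because hard links are counted once per name.

```shell
# Hypothetical helper: count filesystem entries under each immediate
# subdirectory of a root directory, largest first.
inode_count_by_dir() {
  root=${1:-/}
  root=${root%/}   # avoid a double slash when the root is "/"
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue
    # -xdev keeps find on a single filesystem.
    count=$(find "$dir" -xdev 2>/dev/null | wc -l)
    printf '%d %s\n' "$((count))" "${dir%/}"
  done | sort -rn
}

# Usage:
#   inode_count_by_dir /var/lib | head -n 5
```

Run it again on the largest directory to drill down until the source of the small files is found.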

Insufficient PIDs - NodePIDPressure

Cause: In most cases, process IDs (PIDs) on a node become insufficient because containers have consumed an excessive number of PIDs.

Symptoms:

  • The node condition NodePIDPressure changes to True when available PIDs drop below the pid.available threshold, triggering container eviction. See Node-pressure Eviction.

  • If you enabled alerting for cluster nodes, you receive an alert when PID resources are insufficient. See Alert management to configure alert rules.

Solution:

  1. Query the maximum PID limit and the current highest PID value:

    sysctl kernel.pid_max
    ps -eLf | awk '{print $2}' | sort -rn | head -n 1
  2. Identify the five processes that consume the most PIDs. The command lists one line per thread, and each thread consumes one PID:

    ps -elT | awk '{print $4}' | sort | uniq -c | sort -k1 -g | tail -5

    Expected output (column 1: number of PIDs consumed, column 2: process ID):

    73 9743
    75 9316
    76 2812
    77 5726
    93 5691
  3. Use the process IDs to identify the corresponding processes and pods, diagnose the root cause, and optimize the code.

  4. Reduce the node load. See Insufficient node resources for scheduling.

  5. If the issue persists, restart the node. See Restart an instance.

    Warning

    Restarting the node also restarts all pods running on it. Proceed with caution.
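Mapping a process ID back to its pod usually goes through the cgroup path of the process. The sketch below assumes that the path in /proc/<pid>/cgroup embeds "pod<uid>", which is a common kubelet layout but can differ between runtimes and cgroup drivers. The helper name pod_uid_from_cgroup is hypothetical.

```shell
# Hypothetical helper: extract the pod UID from a /proc/<pid>/cgroup listing (stdin).
# With the systemd cgroup driver the UID is written with underscores,
# so they are converted back to hyphens.
pod_uid_from_cgroup() {
  grep -oE 'pod[0-9a-f]{8}([-_][0-9a-f]{4}){3}[-_][0-9a-f]{12}' \
    | head -n 1 \
    | sed 's/^pod//' \
    | tr '_' '-'
}

# Usage (9743 is one of the process IDs from the sample output above):
#   pod_uid_from_cgroup < /proc/9743/cgroup
# Then look up the pod that has this UID:
#   kubectl get pods -A -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,UID:.metadata.uid
```

An empty result means that the process does not belong to a pod, for example a system daemon on the node.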

Insufficient disk space - DiskPressure

Cause: In most cases, disk space on a node becomes insufficient because containers have consumed an excessive amount of disk space or container images are too large.

Symptoms:

  • The node condition DiskPressure changes to True when available disk space drops below the imagefs.available threshold.

  • If available disk drops below the nodefs.available threshold, all containers on the node are evicted. See Node-pressure Eviction.

  • If disk space remains below the health threshold (default: 80%) after image reclaim, the node shows the event: failed to garbage collect required amount of images.

  • Evicted containers show the event: The node was low on resource: [DiskPressure].

  • The node shows the event: attempting to reclaim ephemeral-storage or attempting to reclaim nodefs.

  • If you enabled alerting for cluster nodes, you receive an alert when disk usage reaches or exceeds 85%. See Alert management to configure alert rules.

Solution:

  1. On the Node Monitoring tab of the console, review the disk usage curve and identify when usage spiked. Check whether any processes on the node are consuming excessive disk space.

  2. Delete files you no longer need. See Resolve the issue of insufficient disk space on a Linux instance.

  3. Set ephemeral-storage limits on pods to cap their disk usage. See Modify the upper and lower limits of CPU and memory resources for a pod.

  4. Use Alibaba Cloud storage services instead of hostPath volumes. See Storage.

  5. Resize the node's disk.

  6. Reduce the node load. See Insufficient node resources for scheduling.
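Disk usage can also be checked directly on the node. This is a minimal sketch. The helper name mounts_over_threshold is hypothetical, the default of 85 mirrors the alert threshold mentioned above, and mount points that contain spaces are not handled.

```shell
# Hypothetical helper: list mount points whose usage is at or above a threshold.
# Reads `df -P` output on stdin; the threshold defaults to 85 (percent).
mounts_over_threshold() {
  awk -v limit="${1:-85}" '
    NR > 1 {
      use = $5; sub(/%/, "", use)              # column 5 is the usage, e.g. "91%"
      if (use + 0 >= limit + 0) print $6, $5   # column 6 is the mount point
    }
  '
}

# Usage:
#   df -P  | mounts_over_threshold 85
#   df -Pi | mounts_over_threshold 85   # same check for inode usage
```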

Insufficient IP addresses - InvalidVSwitchId.IpNotEnough

Cause: In most cases, IP addresses become insufficient because containers have consumed too many IP addresses from the vSwitch.

Symptoms:

  • Pods fail to start and remain in the ContainerCreating state. The pod logs contain:

    time="2020-03-17T07:03:40Z" level=warning msg="Assign private ip address failed: Aliyun API Error: RequestId: 2095E971-E473-4BA0-853F-0C41CF52651D Status Code: 403 Code: InvalidVSwitchId.IpNotEnough Message: The specified VSwitch \"vsw-AAA\" has not enough IpAddress., retrying"
  • If you enabled alerting for cluster nodes, you receive an alert when IP addresses are insufficient. See Alert management to configure alert rules.

Solution:

Reduce the number of containers on the node. See Insufficient node resources for scheduling.

For Terway-specific IP address issues, see the following articles:

Network exceptions

Cause: In most cases, a network exception occurs because the node state is abnormal, security group configurations are invalid, or the network is overloaded.

Symptoms:

  • You cannot log on to the node.

  • The node state is Unknown.

  • If you enabled alerting for cluster nodes, you receive an alert when outbound internet bandwidth usage reaches or exceeds 85%. See Alert management to configure alert rules.

Solution:

If you cannot log on to the node:

  1. Check whether the node is in the Running state.

  2. Check the security group configurations. See Check security groups.

If the network is overloaded:

  1. On the Node Monitoring tab of the console, review the network performance curve and look for bandwidth usage spikes.

  2. Apply network policies to throttle pod traffic. See Use network policies in ACK clusters.

Unexpected node restarts

Cause: In most cases, a node unexpectedly restarts because it is overloaded.

Symptoms:

  • The node state is NotReady during the restart.

  • If you enabled alerting for cluster nodes, you receive an alert when the node unexpectedly restarts. See Alert management to configure alert rules.

Solution:

  1. Find out when the node restarted:

    last reboot

    The output lists each recorded system boot with its date and time.

  2. Review the node monitoring data around that time to identify abnormal resource usage. See Check monitoring data.

  3. Review the node kernel logs around that time. See Collect diagnostics logs.

High disk I/O from auditd or "audit: backlog limit exceeded" in system log

Cause: Some existing nodes in the cluster have auditd rules for Docker configured by default. When containers restart frequently, a large amount of data is written into a container, or a kernel bug is triggered, the volume of audit log writes can spike. The spike causes high disk I/O in the auditd process and the error audit: backlog limit exceeded in the system log.

This issue affects only nodes running Docker (not containerd).

Symptoms:

  • iotop -o -d 1 shows that DISK WRITE is consistently at 1 MB/s or higher.

  • dmesg -d output contains audit_printk_skb: 100 callbacks suppressed.

  • dmesg -d output contains audit: backlog limit exceeded.
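The two kernel log symptoms can be counted in one pass. The sketch below is an illustration only. The helper name audit_symptom_counts is hypothetical, and it matches the message texts quoted in the list above.

```shell
# Hypothetical helper: count the auditd-related symptoms in kernel log text (stdin).
audit_symptom_counts() {
  awk '
    /audit: backlog limit exceeded/             { backlog++ }
    /audit_printk_skb: .* callbacks suppressed/ { suppressed++ }
    END { printf "backlog_exceeded=%d callbacks_suppressed=%d\n", backlog, suppressed }
  '
}

# Usage:
#   dmesg | audit_symptom_counts
```

Non-zero counts are only a hint. Confirm the cause with the auditctl check that follows.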

Confirm the cause:

  1. Log on to the node.

  2. Check whether the issue is caused by auditd Docker rules:

    sudo auditctl -l | grep -- ' -k docker'

    If the output includes -w /var/lib/docker -k docker, the issue is caused by the auditd rules.

Solution:

Use one of the following approaches (in order of preference):

Upgrade the cluster

Upgrade the Kubernetes version of the cluster. See Manually upgrade a cluster.

Switch the container runtime to containerd

If upgrading is not possible, migrate node pools from Docker to containerd:

  1. Create new node pools by cloning the existing Docker node pools. Configure the new node pools to use containerd, keeping all other settings identical.

  2. During off-peak hours, drain nodes from the Docker node pools one by one until all application pods are running on the containerd node pools.

Update auditd configurations manually

If neither of the above applies, remove the auditd rules for Docker:

  1. Log on to the node.

  2. Delete the auditd rules:

    sudo test -f /etc/audit/rules.d/audit.rules && sudo sed -i.bak '/ -k docker/d' /etc/audit/rules.d/audit.rules
    sudo test -f /etc/audit/audit.rules && sudo sed -i.bak '/ -k docker/d' /etc/audit/audit.rules
  3. Apply the updated rules:

    if service auditd status | grep running || systemctl status auditd | grep running; then
        sudo service auditd restart || sudo systemctl restart auditd
        sudo service auditd status || sudo systemctl status auditd
    fi