SOP: Troubleshooting node exceptions in an ACK cluster - Container Service for Kubernetes

Use this guide to diagnose node exceptions, apply common troubleshooting approaches, and resolve issues to restore nodes to a healthy state.

Contents
- Diagnostic procedure
- Common troubleshooting methods
  - Diagnose a node fault
  - Check node details
  - Check node status
  - Check node events
  - Collect node diagnostic logs
  - Check node key components
  - Check node monitoring
  - Check node security groups
- Common issues and solutions
  - Troubleshoot key component exceptions
    - kubelet exceptions
    - dockerd exceptions - RuntimeOffline
    - containerd exceptions - RuntimeOffline
    - NTP exceptions - NTPProblem
    - PLEG exceptions - PLEG is not healthy
  - Troubleshoot node resource exceptions
    - Insufficient node resources for scheduling
    - Insufficient node CPU
    - Insufficient node memory - MemoryPressure
    - Insufficient node inodes - InodesPressure
    - Insufficient node PIDs - NodePIDPressure
    - Insufficient node disk space - DiskPressure
    - Insufficient node IP addresses - InvalidVSwitchId.IpNotEnough
  - Troubleshoot node network exceptions
    - Node network exceptions
  - Troubleshoot other node exceptions
    - Unexpected node restarts
    - Resolving high disk I/O from the auditd process or "audit: backlog limit exceeded" errors in system logs

Diagnostic procedure

Check if a node is abnormal. For more information, see Check the status of a node.
- If a node is in the Not Ready state, perform the following steps to troubleshoot the issue:
  1. Check the node status information to see if node conditions, such as PIDPressure, DiskPressure, and MemoryPressure, are True. If a node condition is True, troubleshoot the issue based on that condition. For solutions, see dockerd exceptions - RuntimeOffline, Insufficient node memory - MemoryPressure, and Insufficient node inodes - InodesPressure.
  2. Check the key components and logs of the node.
    - Kubelet
      1. Check the kubelet's status, logs, and configuration for abnormalities. For more information, see Check the key components of a node.
      2. If the kubelet is abnormal, see Troubleshooting kubelet exceptions.
    - Dockerd
      1. Check dockerd's status, logs, and configuration for abnormalities. For more information, see Check the key components of a node.
      2. If dockerd is abnormal, see Dockerd exceptions - RuntimeOffline.
    - Containerd
      1. Check containerd's status, logs, and configuration for abnormalities. For more information, see Check the key components of a node.
      2. If containerd is abnormal, see Containerd exceptions - RuntimeOffline.
    - NTP
      1. Check the NTP service's status, logs, and configuration for abnormalities. For more information, see Check the key components of a node.
      2. If the NTP service is abnormal, see NTP exceptions - NTPProblem.
  3. Collect and check the diagnostic logs of the node. For more information, see Collect node diagnostic logs.
  4. Check the node's monitoring data for abnormal resource load (such as CPU, memory, and network). For more information, see Check node monitoring. If the node load is abnormal, see Insufficient node CPU and Insufficient node memory - MemoryPressure for solutions.
- If a node is in the Unknown state, perform the following steps to troubleshoot the issue:
  1. Check that the underlying ECS instance is in the Running state.
  2. Check the key components of the node.
    - Kubelet
      1. Check the kubelet's status, logs, and configuration for abnormalities. For more information, see Check the key components of a node.
      2. If the kubelet is abnormal, see Troubleshooting kubelet exceptions.
    - Dockerd
      1. Check dockerd's status, logs, and configuration for abnormalities. For more information, see Check the key components of a node.
      2. If dockerd is abnormal, see Dockerd exceptions - RuntimeOffline.
    - Containerd
      1. Check containerd's status, logs, and configuration for abnormalities. For more information, see Check the key components of a node.
      2. If containerd is abnormal, see Containerd exceptions - RuntimeOffline.
    - NTP
      1. Check the NTP service's status, logs, and configuration for abnormalities. For more information, see Check the key components of a node.
      2. If the NTP service is abnormal, see NTP exceptions - NTPProblem.
  3. Check the network connectivity of the node. For more information, see Check node security groups. If a network exception occurs on the node, see Node network exceptions for a solution.
  4. Collect and check the diagnostic logs of the node. For more information, see Collect node diagnostic logs.
  5. Check the node's monitoring data for abnormal resource load (such as CPU, memory, and network). For more information, see Check node monitoring. If the node load is abnormal, see Insufficient node CPU and Insufficient node memory - MemoryPressure for solutions.
If this diagnostic procedure does not resolve the issue, use the node diagnosis feature in Container Service for Kubernetes (ACK). For more information, see Node diagnosis.
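
The first check in the procedure, finding out which node conditions are True, can also be done from the command line. The following is a minimal sketch, assuming `kubectl` is configured for the cluster. The function name `true_conditions` and the `NODE` variable are illustrative and not part of ACK.

```shell
# Hypothetical helper: reads "Type=Status" lines on stdin and prints every
# condition other than Ready whose status is True (for example MemoryPressure).
true_conditions() {
  awk -F= '$1 != "Ready" && $2 == "True" { print $1 }'
}

# Usage (assumes kubectl access and that NODE holds a node name):
#   kubectl get node "$NODE" \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' \
#     | true_conditions
```

An empty result means that no pressure or problem condition is currently reported as True, so the next step is to check the key components.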

Common troubleshooting methods

Diagnose node failures

If a node fails, use the ACK Exception Diagnosis feature to diagnose the node.

1. Log on to the ACK console. In the left navigation pane, click Clusters.
2. On the Clusters page, click the name of your cluster. In the left navigation pane, click Nodes > Nodes.
3. On the Nodes page, find the target node and choose More > Exception Diagnosis in the Actions column.
4. In the panel that appears, click Create Diagnosis. Then, view the diagnosis results and remediation suggestions in the console.

Check node details

1. Log on to the ACK console. In the left navigation pane, click Clusters.
2. On the Clusters page, click the name of your cluster. In the left navigation pane, click Nodes > Nodes.
3. On the Nodes page, click the name of the target node or click Details in the Actions column to view its details.

Check node status

1. Log on to the ACK console. In the left navigation pane, click Clusters.
2. On the Clusters page, click the name of your cluster. In the left navigation pane, click Nodes > Nodes.
3. On the Nodes page, view the status of the node.
  - If the node status is Ready, the node is running as expected.
  - If the node status is not Ready, click the name of the target node or click Details in its Actions column to view its details.
    Note
    To collect information about conditions such as InodesPressure, DockerOffline, and RuntimeOffline, you must install node-problem-detector in your cluster and create a K8s Event Center. The option to create a K8s Event Center is selected by default during cluster creation. For more information, see Create and use K8s Event Center.

Check node events

1. Log on to the ACK console. In the left navigation pane, click Clusters.
2. On the Clusters page, click the name of your cluster. In the left navigation pane, click Nodes > Nodes.
3. On the Nodes page, click the name of the target node or click Details in the Actions column to view its details.
Node events are displayed at the bottom of the details page.

Collect node diagnostic logs

Use either of the following methods:
- Use the Container Intelligence Service in the console to collect diagnostic logs. For more information, see Node diagnostics.
- Use a script to collect diagnostic logs. For more information, see How do I collect the diagnostic information of a Kubernetes cluster?

Check key node components

kubelet

Check the kubelet status

Log on to the node and run the following command to check the kubelet status:

systemctl status kubelet

Expected output:

[root@iZm5eali3mx0qwqo68eobrZ ~]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Thu 2025-06-12 13:18:45 CST; 28min ago
     Docs: http://kubernetes.io/docs/
 Main PID: 6171 (kubelet)
    Tasks: 12 (limit: 98626)
   Memory: 88.2M
   CGroup: /system.slice/kubelet.service
           └─6171 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --pod-manifest-path=/etc/kubernetes/manifests --v=3 --authorization
Jun 12 13:47:12                kubelet[6171]: I0612 13:47:12.250117    6171 prober.go:116] "Probe succeeded" probeType="Liveness" pod="kube-system/coredns-5b97654649-jkbtq" podUID="0530c82c-a012-4f
Jun 12 13:47:12                kubelet[6171]: I0612 13:47:12.250144    6171 prober.go:116] "Probe succeeded" probeType="Readiness" pod="kube-system/coredns-5b97654649-jkbtq" podUID="0530c82c-a012-4
Jun 12 13:47:12                kubelet[6171]: I0612 13:47:12.716032    6171 prober.go:116] "Probe succeeded" probeType="Liveness" pod="kube-system/csi-plugin-bqdjr" podUID="33fd9325-c75d-458d-adc3-
Jun 12 13:47:13                kubelet[6171]: I0612 13:47:13.574947    6171 kubelet_pods.go:1151] "Clean up pod workers for terminated pods"
Jun 12 13:47:13                kubelet[6171]: I0612 13:47:13.575713    6171 kubelet_pods.go:1201] "Clean up probes for terminated pods"
Jun 12 13:47:13                kubelet[6171]: I0612 13:47:13.575724    6171 kubelet_pods.go:1205] "Clean up orphaned pod statuses"
Jun 12 13:47:13                kubelet[6171]: I0612 13:47:13.575731    6171 kubelet_pods.go:1209] "Clean up orphaned pod user namespace allocations"
Jun 12 13:47:13                kubelet[6171]: I0612 13:47:13.575736    6171 kubelet_pods.go:1221] "Clean up orphaned pod directories"
Jun 12 13:47:13                kubelet[6171]: I0612 13:47:13.575775    6171 kubelet_pods.go:1232] "Clean up mirror pods"
Jun 12 13:47:13                kubelet[6171]: I0612 13:47:13.575807    6171 kubelet_pods.go:1374] "Clean up orphaned pod cgroups"

Check the kubelet logs
Log on to the node and run the following command to view the kubelet logs. For more information about viewing kubelet logs, see Collect node diagnostic logs.
```
journalctl -u kubelet
```
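
The full kubelet journal is usually long. The sketch below shows one way to keep only warning, error, and fatal entries. It assumes the kubelet writes klog-style lines, where the severity letter precedes the date as in the `I0612 13:47:12.250117` entries shown in the status output. The function name `klog_problems` is illustrative.

```shell
# Hypothetical filter: keeps klog lines whose severity is W, E or F,
# for example "E0612 13:47:13.575713 ...". Info lines (I....) are dropped.
klog_problems() {
  grep -E '(^|[[:space:]])[WEF][0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]+'
}

# Usage on a node (the time window is an arbitrary example):
#   journalctl -u kubelet --since "1 hour ago" --no-pager | klog_problems
```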
Check the kubelet configuration
Log on to the node and run the following command to view the kubelet configuration:
```
cat /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
```

Runtime

Check dockerd

Check the dockerd status

Log on to the node and run the following command to check the dockerd status:

systemctl status docker

Expected output:

#systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/etc/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2022-03-02 15:11:46 CST; 5 days ago
     Docs: https://docs.docker.com
 Main PID: 4330 (dockerd)
    Tasks: 75
   Memory: 8.3G
   CGroup: /system.slice/docker.service
           └─4330 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
Mar 02 15:19:29 xxx                    dockerd[4330]: time="2022-03-02T15:19:29.418689546+08:00" level=info msg="Attempting next endpoint for pull a...eaders
Mar 02 15:22:20 xxx                    dockerd[4330]: time="2022-03-02T15:22:20.888769038+08:00" level=warning msg="Error getting v2 registry: Get h...eaders
Mar 02 15:22:20 xxx                    dockerd[4330]: time="2022-03-02T15:22:20.888912991+08:00" level=info msg="Attempting next endpoint for pull a...eaders
Mar 02 15:23:25 xxx                    dockerd[4330]: time="2022-03-02T15:23:25.487947465+08:00" level=warning msg="Seccomp is not enabled in your k...rofile

Check the dockerd logs
Log on to the node and run the following command to view the dockerd logs. For more information about viewing dockerd logs, see Collect node diagnostic logs.
```
journalctl -u docker
```
Check the dockerd configuration
Log on to the node and run the following command to view the dockerd configuration:
```
cat /etc/docker/daemon.json
```

Check containerd

Check the containerd status

Log on to the node and run the following command to check the containerd status:

systemctl status containerd

Expected output:

[root@iZm5eali3mx0qwqo68eobrZ ~]# systemctl status containerd
● containerd.service - containerd container runtime
   Loaded: loaded (/etc/systemd/system/containerd.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2025-06-12 13:18:44 CST; 29min ago
     Docs: https://containerd.io
 Main PID: 5994 (containerd)
    Tasks: 153
   Memory: 3.5G
   CGroup: /system.slice/containerd.service
           ├─ 5994 /usr/bin/containerd
           ├─ 6589 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id b4ad8b28c61fa573e04bc641901bd64b15adb1d4a... xxx -address /run/containerd/containerd.sock
           ├─ 6592 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id eb828cfb114bd404d7781f72b55ed4ee1e9743b7d... xxx -address /run/containerd/containerd.sock
           ├─ 6666 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 92bf24db851fcaf65ed9c3ddc9210b61169623158... xxx -address /run/containerd/containerd.sock
           ├─ 6726 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 96d44ff7503b876227e7d287a730cac5b50b5183f1... xxx -address /run/containerd/containerd.sock
           ├─ 6727 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 3707c088fec2d14f9ec2c78beccf5f3e1f0a18a6b... xxx -address /run/containerd/containerd.sock
           ├─ 6766 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id b390b71290b772958626a0cc943fb678c7fec4ef5... xxx -address /run/containerd/containerd.sock
           ├─ 9501 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 40c0f850c4c003144bbc5a17bcb5b7f15ad121862... xxx -address /run/containerd/containerd.sock
           ├─ 9567 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 3b07c32bf4fe84fde20177efdcffabd2cd93ac02a... xxx -address /run/containerd/containerd.sock
           ├─ 9633 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id d181fcabeebca990b1298072f91da4fbb251b666c... xxx -address /run/containerd/containerd.sock
           └─11643 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id f3744bcb95b522bc64a5a88dd4c49a231c082f5d79... xxx -address /run/containerd/containerd.sock
Jun 12 13:19:53  containerd[5994]: time="2025-06-12T13:19:53.013508035+08:00" level=error msg="ContainerStatus for \"f3744bcb95b522bc64a5a88dd4c49a231c082f5d790e64e6f835760b83e7b280\""
Jun 12 13:19:54  containerd[5994]: time="2025-06-12T13:19:54.708151565+08:00" level=info msg="ImageCreate event &ImageCreate{Name:registry-cn-qingdao-vpc.ack.aliyuncs.com/acs/loongcol..."
Jun 12 13:19:54  containerd[5994]: time="2025-06-12T13:19:54.710172874+08:00" level=info msg="ImageCreate event &ImageCreate{Name:sha256:e705a85d376f9fdecd0f1b2ebce63d85f8392de70a9636..."
Jun 12 13:19:54  containerd[5994]: time="2025-06-12T13:19:54.713591809+08:00" level=info msg="PullImage \"registry-cn-qingdao-vpc.ack.aliyuncs.com/acs/loongcollector:v3.0.11.0-915c655..."
Jun 12 13:19:54  containerd[5994]: time="2025-06-12T13:19:54.713969883+08:00" level=info msg="ImageCreate event &ImageCreate{Name:registry-cn-qingdao-vpc.ack.aliyuncs.com/acs/loongcol..."
Jun 12 13:19:54  containerd[5994]: time="2025-06-12T13:19:54.717628440+08:00" level=info msg="CreateContainer within sandbox \"f3744bcb95b522bc64a5a88dd4c49a231c082f5d790e64e6f835760b..."
Jun 12 13:19:54  containerd[5994]: time="2025-06-12T13:19:54.732237135+08:00" level=info msg="CreateContainer within sandbox \"f3744bcb95b522bc64a5a88dd4c49a231c082f5d790e64e6f835760b..."
Jun 12 13:19:54  containerd[5994]: time="2025-06-12T13:19:54.735512597+08:00" level=info msg="StartContainer for \"e7c6a20368f9c20900848fbcec81d5145d73caa8c4decaae660e0ea89dd64c28\""
Jun 12 13:19:54  containerd[5994]: time="2025-06-12T13:19:54.771464238+08:00" level=info msg="StartContainer for \"e7c6a20368f9c20900848fbcec81d5145d73caa8c4decaae660e0ea89dd64c28\"..."
Jun 12 13:20:47  containerd[5994]: time="2025-06-12T13:20:47.246563467+08:00" level=error msg="ContainerStatus for \"f3744bcb95b522bc64a5a88dd4c49a231c082f5d790e64e6f835760b83e7b280\""

Check the containerd logs
Log on to the node and run the following command to view the containerd logs. For more information about viewing containerd logs, see Collect node diagnostic logs.
```
journalctl -u containerd
```
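
Both dockerd and containerd tag each log entry with a `level=` field, as the sample output above shows. A quick tally per level can indicate whether errors dominate before you read individual lines. This is a rough sketch. The function name `count_log_levels` is illustrative.

```shell
# Hypothetical helper: counts log entries per "level=..." value on stdin
# and prints one "level=<name> <count>" line per level.
count_log_levels() {
  grep -o 'level=[a-z]*' | sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage on a node (the time window is an arbitrary example):
#   journalctl -u containerd --since "1 hour ago" --no-pager | count_log_levels
```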

Check NTP

Check the NTP service status

Log on to the node and run the following command to check the status of the chronyd process:

systemctl status chronyd

Expected output:

[root@iZm5eali3mx0qwqo68eobrZ ~]# systemctl status chronyd
● chronyd.service - NTP client/server
   Loaded: loaded (/usr/lib/systemd/system/chronyd.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2025-06-12 13:18:22 CST; 20min ago
     Docs: man:chronyd(8)
           man:chrony.conf(5)
 Main PID: 1603 (chronyd)
    Tasks: 1 (limit: 98626)
   Memory: 732.0K
   CGroup: /system.slice/chronyd.service
           └─1603 /usr/sbin/chronyd
Jun 12 13:18:22 iZm5eali3mx0qwqo68eobrZ systemd[1]: chronyd.service: Succeeded.
Jun 12 13:18:22 iZm5eali3mx0qwqo68eobrZ systemd[1]: Stopped NTP client/server.
Jun 12 13:18:22 iZm5eali3mx0qwqo68eobrZ systemd[1]: Starting NTP client/server...
Jun 12 13:18:22 iZm5eali3mx0qwqo68eobrZ chronyd[1603]: chronyd version 4.5 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SIGND +ASYNCDNS +NTS +SECHASH +IPV6 +DEBUG)
Jun 12 13:18:22 iZm5eali3mx0qwqo68eobrZ chronyd[1603]: Frequency -35.639 +/- 2.058 ppm read from /var/lib/chrony/drift
Jun 12 13:18:22 iZm5eali3mx0qwqo68eobrZ systemd[1]: Started NTP client/server.
Jun 12 13:18:23 iZm5eali3mx0qwqo68eobrZ chronyd[1603]: System clock was stepped by 0.000000 seconds
Jun 12 13:18:27 iZm5eali3mx0qwqo68eobrZ chronyd[1603]: Selected source 100.100.61.88 (ntp.cloud.aliyuncs.com)
Jun 12 13:18:27 iZm5eali3mx0qwqo68eobrZ chronyd[1603]: System clock wrong by 1.476188 seconds
xxx

Check the NTP service logs
Log on to the node and run the following command to view NTP logs:
```
journalctl -u chronyd
```
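
In addition to the service logs, the current clock offset can be read from chrony itself. The sketch below assumes the `chronyc` client is installed next to chronyd and that `chronyc tracking` prints a `System time` line with the offset in seconds as the fourth field. The function name `system_time_offset` is illustrative.

```shell
# Hypothetical helper: extracts the offset (in seconds) from the
# "System time" line of `chronyc tracking` output read on stdin.
system_time_offset() {
  awk '/^System time/ { print $4 }'
}

# Usage on a node:
#   chronyc tracking | system_time_offset
```

A value that stays large over several runs suggests that the node is not synchronizing with its NTP source.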

Check node monitoring data

CloudMonitor
ACK integrates with CloudMonitor. View basic monitoring data for the corresponding ECS instances in the CloudMonitor console. For more information about monitoring nodes with CloudMonitor, see Monitor nodes.
Managed Service for Prometheus
1. Log on to the ACK console. In the left navigation pane, click Clusters.
2. On the Clusters page, click the name of your cluster. In the left navigation pane, click Operations > Prometheus Monitoring.
3. On the Prometheus Monitoring page, click the Node Monitoring tab, and then click the Nodes tab.
4. On the Nodes page, select a node to view its monitoring metrics, such as CPU, memory, and disk usage.

Check node security groups

For more information about checking node security groups, see Security group overview and Configure cluster security groups.

Troubleshooting kubelet exceptions

Cause

Typical causes include an abnormal kubelet process, a container runtime exception, or an incorrect kubelet configuration.

Symptoms

The kubelet status is inactive.

Solution

Log on to the node and run the following command to restart the kubelet. This action does not affect running containers.
```
systemctl restart kubelet
```
After the kubelet restarts, run the following command to check its status.
```
systemctl status kubelet
```
If the kubelet status remains abnormal, run the following command to view its logs.
```
journalctl -u kubelet
```
- If the logs contain specific error messages, use keywords from these messages to troubleshoot the issue.
- If the kubelet configuration is incorrect, run the following commands to correct it.
```
vi /etc/systemd/system/kubelet.service.d/10-kubeadm.conf   # Modify the kubelet configuration.
systemctl daemon-reload && systemctl restart kubelet       # Reload the configuration, then restart the kubelet.
```

Dockerd exception: RuntimeOffline

Cause

Common causes include an invalid dockerd configuration, high process load, or high node load.

Symptoms

The dockerd status is inactive.
The dockerd status is active (running), but the daemon malfunctions, causing a node exception. Commands such as docker ps and docker exec may fail.
The node's RuntimeOffline condition is True.
A dockerd exception triggers an alert if you have configured alerting for cluster node exceptions. For more information about how to configure alert rules, see ACK alert management.
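
Because dockerd can report active (running) while it no longer answers requests, a probe with a time limit helps tell a hung daemon from a healthy one. The following is a minimal sketch that relies on the coreutils `timeout` command. The function name `responds_within` and the 10-second limit are illustrative assumptions.

```shell
# Hypothetical helper: runs a command and fails if it errors out or does
# not finish within the given number of seconds.
responds_within() {
  secs=$1
  shift
  timeout "$secs" "$@" >/dev/null 2>&1
}

# Usage on a node:
#   responds_within 10 docker ps || echo "dockerd did not answer within 10 seconds"
```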

Solution

Run the following command to restart dockerd:
```
systemctl restart docker
```
Log on to the node and run the following command to check the dockerd status:
```
systemctl status docker
```
If the status remains abnormal, check the dockerd logs:
```
journalctl -u docker
```

Containerd exception: RuntimeOffline

Cause

This issue is often caused by an invalid containerd configuration, high process load, or high node load.

Symptoms

The containerd status is inactive.
The node's RuntimeOffline condition is True.
If alerting is configured for cluster node exceptions, you will be notified when a containerd exception occurs. For more information, see ACK alert management.

Solution

Run the following command to restart containerd:
```
systemctl restart containerd
```
After containerd restarts, log on to the node and run the following command to check its status:
```
systemctl status containerd
```
If the status remains abnormal after the restart, log on to the node and run the following command to view the containerd logs:
```
journalctl -u containerd
```

NTP exception: NTPProblem

Cause

An abnormal NTP process typically causes this issue.

Symptoms

The chronyd status is inactive.
The NTPProblem node condition is True.
If you have configured alerts for cluster node exceptions, you will receive an alert when a node's time service is abnormal. For more information on configuring alerts, see ACK alert management.

Solution

Run the following command to restart chronyd:
```
systemctl restart chronyd
```
After chronyd restarts, log on to the node and run the following command to verify that its status is normal:
```
systemctl status chronyd
```
If the status is still abnormal after the restart, log on to the node and run the following command to view the chronyd logs:
```
journalctl -u chronyd
```

PLEG is not healthy

Cause

The Pod Lifecycle Event Generator (PLEG) records events throughout the pod lifecycle, such as container startups and terminations. The PLEG is not healthy exception is typically caused by issues with the container runtime on the node or a defect in the node's systemd version.

Symptoms

The node status is NotReady.

The kubelet logs contain the following entry.

I0729 11:20:59.245243    9575 kubelet.go:1823] skipping pod synchronization - PLEG is not healthy: pleg was last seen active 3m57.138893648s ago; threshold is 3m0s.

If you configured alerts for cluster node exceptions, you receive an alert when a PLEG exception occurs. To configure alerts, see ACK alert management.
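
To see how long PLEG has been inactive, you can pull the duration out of the log entry shown above. This sketch assumes the message wording matches that entry. The function name `pleg_last_seen` is illustrative.

```shell
# Hypothetical helper: prints the "last seen active" duration from
# "PLEG is not healthy" kubelet log lines read on stdin.
pleg_last_seen() {
  sed -n 's/.*pleg was last seen active \([^ ]*\) ago.*/\1/p'
}

# Usage on a node:
#   journalctl -u kubelet --no-pager | pleg_last_seen | tail -n 1
# For the sample entry above this prints 3m57.138893648s
```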

Solution

Restart the key node components dockerd/containerd and kubelet in sequence, and then check if the node has recovered.
If restarting the key components does not resolve the issue, restart the node. For more information, see Restart an instance.
Warning
Restarting the node may interrupt your workloads. Proceed with caution.
If the node runs CentOS 7.6, see "Reason:KubeletNotReady Message:PLEG is not healthy:" error in kubelet logs on CentOS 7.6.
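
The restart sequence in the first step can be sketched as a small function that restarts each unit in order and stops at the first one that does not come back. The function name `restart_in_order` and the `SYSTEMCTL` override variable are illustrative assumptions. On a real node the default `systemctl` is used.

```shell
# Hypothetical helper: restarts the given systemd units one after another
# and stops as soon as a unit fails to restart or is not active afterwards.
restart_in_order() {
  for unit in "$@"; do
    ${SYSTEMCTL:-systemctl} restart "$unit" || return 1
    if ! ${SYSTEMCTL:-systemctl} is-active --quiet "$unit"; then
      echo "$unit is not active after restart" >&2
      return 1
    fi
  done
}

# Usage on a node that runs containerd (use docker instead on dockerd nodes):
#   restart_in_order containerd kubelet
```

The runtime is listed first because the kubelet depends on it.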

Insufficient node scheduling resources

Cause

This issue typically occurs when nodes in the cluster have insufficient resources.

Symptoms

If cluster nodes have insufficient resources, pods fail to schedule, and you may see one of the following error messages:

Insufficient CPU resources: 0/2 nodes are available: 2 Insufficient cpu
Insufficient memory resources: 0/2 nodes are available: 2 Insufficient memory
Insufficient ephemeral storage: 0/2 nodes are available: 2 Insufficient ephemeral-storage

The scheduler uses the following formulas to determine if a node has insufficient resources:

A node has insufficient CPU if: The pod's CPU request > (The node's allocatable CPU - The node's allocated CPU)
A node has insufficient memory if: The pod's memory request > (The node's allocatable memory - The node's allocated memory)
A node has insufficient ephemeral storage if: The pod's ephemeral storage request > (The node's allocatable ephemeral storage - The node's allocated ephemeral storage)

A pod will not be scheduled to a node if its total resource requests exceed the node's available resources (allocatable minus allocated).

Run the following command to view the resource allocation details of a node:

kubectl describe node [$nodeName]

Note the following sections in the output:

Allocatable:
  cpu:                3900m
  ephemeral-storage:  114022843818
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             12601Mi
  pods:               60
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                725m (18%)  6600m (169%)
  memory             977Mi (7%)  16640Mi (132%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)

In the output:

Allocatable: The total amount of allocatable resources (such as CPU, memory, and ephemeral storage) on the node.
Allocated resources: The total resources allocated to pods on the node.
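
As a worked example using the sample output above, the node has 3900m - 725m = 3175m of CPU left for new requests. The sketch below applies the same comparison. The function name `fits_on_node` is illustrative, and all values are integers in the same unit (millicores here).

```shell
# Hypothetical helper: succeeds when request <= allocatable - allocated.
# Arguments: request, allocatable, allocated (integers in the same unit).
fits_on_node() {
  request=$1
  allocatable=$2
  allocated=$3
  [ "$request" -le $((allocatable - allocated)) ]
}

# With the sample node: 3900m allocatable, 725m already requested.
fits_on_node 3000 3900 725 && echo "a 3000m CPU request fits"
fits_on_node 3500 3900 725 || echo "a 3500m CPU request does not fit"
```

Note that only requests count toward the allocated amount. The Limits column may exceed 100% without blocking scheduling.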

Solution

To resolve this issue, reduce the node load using one of the following methods:

Delete unnecessary pods to reduce the node load. For more information, see Manage pods.
Adjust the resource configuration for your pods based on your business requirements. For more information, see Set CPU and memory requests and limits for a container.
You can use the resource profiling feature provided by ACK to get resource configuration suggestions for containers based on the historical data of resource usage. This simplifies the configuration of resource requests and limits for containers. For more information, see Resource profiling.
Add new nodes to the cluster. For more information, see Create and manage node pools.
Upgrade the node configurations. For more information, see Upgrade or downgrade the configurations of a worker node.

For more information, see Insufficient node CPU, Insufficient node memory - MemoryPressure, and Insufficient node disk space - DiskPressure.

Insufficient node CPU resources

Cause

Containers that consume excessive CPU resources on a node typically cause this issue.

Symptoms

Insufficient CPU resources can cause the node's status to become abnormal.
If alerts are configured for cluster node exceptions, an alert is triggered when node CPU utilization reaches or exceeds 85%. For more information about how to configure alerts, see ACK alert management.

Solution

Review the node's CPU utilization curve to identify when the exception occurred and which processes on the node are consuming excessive CPU resources. For more information, see Check the monitoring data of a node.
Reduce the load on the node. For more information, see Insufficient node resources for scheduling.
If necessary, restart the affected node. For more information, see Restart an instance.
Warning
Restarting the node may interrupt your services. Proceed with caution.

Insufficient node memory - MemoryPressure

Cause

This issue typically occurs when containers on a node consume excessive memory, leaving the node with insufficient available memory.

Symptoms

When a node's available memory drops below the threshold specified by memory.available, the node's MemoryPressure condition is set to True, and containers on the node are evicted. For more information about node eviction, see Node-pressure Eviction.
When a node is low on memory, you may observe the following common symptoms:
- The node's MemoryPressure condition is True.
- When containers on the node are evicted:
  - Events for evicted containers include the message The node was low on resource: memory.
  - Node events include the message attempting to reclaim memory.
- A System OOM may also occur, which is indicated by the System OOM message in the node's events.
You can configure alerts to be notified when a node's memory usage reaches or exceeds 85%. See ACK alert management to learn how.
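
For a quick look at memory usage on the node itself, the percentage can be derived from `/proc/meminfo`. This is only a rough node-level view. The kubelet derives memory.available from cgroup statistics, so its value may differ from what this sketch reports. The function name `mem_used_pct` is illustrative.

```shell
# Hypothetical helper: reads /proc/meminfo-style text on stdin and prints
# the used memory as an integer percentage of MemTotal.
mem_used_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", (total - avail) * 100 / total }'
}

# Usage on a node:
#   mem_used_pct < /proc/meminfo
```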

Solution

Analyze the memory usage curve in the node's monitoring data to pinpoint when the issue occurred, and inspect running processes for memory leaks. For more information, see Check the monitoring data of a node.
Reduce the load on the node. For more information, see Insufficient node resources for scheduling.
If necessary, restart the affected node. For more information, see Restart an instance.
Warning
Restarting the node may interrupt your services. Proceed with caution.

Insufficient inodes - InodesPressure

Cause

This issue typically occurs when containers on a node consume excessive inodes.

Symptoms

When the number of available inodes on a node drops below the inodesFree threshold, its InodesPressure condition is set to True, and containers on the node are evicted. For more information, see node-pressure eviction.
When a node is low on inodes, you may observe the following common symptoms:
- The node condition InodesPressure is True.
- When containers on the node are evicted:
  - The events for the evicted containers contain the message: The node was low on resource: inodes.
  - The node's events contain the message: attempting to reclaim inodes.
If you have configured alerts for node exceptions in your cluster, you are alerted when a node is low on inodes. For more information about how to configure alerts, see ACK alert management.

Solution

Check the inode usage curve in the node's monitoring data to identify when the exception occurred and which processes are consuming excessive inodes. For more information, see Check the monitoring data of nodes.
To troubleshoot other related issues, see Resolve "no space left" issues on Linux.
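
To find out which file system on the node is short of inodes, the output of `df -Pi` can be filtered by usage. The following is a minimal sketch. The function name `inode_usage_over` and the 90% threshold in the usage line are illustrative assumptions.

```shell
# Hypothetical helper: reads `df -Pi` output on stdin and prints the mount
# point and inode usage of every file system at or above the threshold.
# Argument: threshold as an integer percentage.
inode_usage_over() {
  awk -v max="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    if (use + 0 >= max + 0) print $6, $5
  }'
}

# Usage on a node:
#   df -Pi | inode_usage_over 90
```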

Insufficient PIDs - NodePIDPressure

Cause

This issue occurs when containers on a node consume too many PIDs, exhausting the node's available PIDs.

Symptoms

When a node's available PIDs drop below the pid.available threshold, its NodePIDPressure condition is set to True, and pods on the node are evicted. For more information about node eviction, see Node-pressure Eviction.
If you have configured alerts for node exceptions in your cluster, you receive an alert when a node runs low on PIDs. For more information about how to configure alerts, see ACK alert management.

Solution

Run the following commands to check the maximum number of PIDs on the node and the highest PID currently in use.

sysctl kernel.pid_max  # Check the maximum number of PIDs.
ps -eLf|awk '{print $2}' | sort -rn| head -n 1   # Check the current highest PID in use.

Run the following command to identify the five processes consuming the most PIDs.

ps -elT | awk '{print $4}' | sort | uniq -c | sort -k1 -g | tail -5

Expected output:

# The first column shows the number of PIDs used by a process. The second column shows the process ID.
73 9743
75 9316
76 2812
77 5726
93 5691

Use the process ID to find the corresponding process and pod. Analyze why the process is consuming too many PIDs and optimize the application code.
Reduce the load on the node. For more information, see Insufficient node resources for scheduling.
If necessary, restart the affected node. For more information, see Restart an instance.
Warning
Restarting the node may interrupt your services. Use caution.
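
To judge how close the node is to exhausting its PIDs, the number of threads currently running can be compared with kernel.pid_max. This is a rough sketch. The function name `pid_usage_pct` is illustrative, and counting threads with `ps -eL` is an assumption about how to measure usage, not an ACK-defined metric.

```shell
# Hypothetical helper: prints used/max as an integer percentage.
# Arguments: number of PIDs in use, maximum number of PIDs.
pid_usage_pct() {
  echo $(( $1 * 100 / $2 ))
}

# Usage on a node (every thread occupies one PID):
#   pid_usage_pct "$(ps -eL --no-headers | wc -l)" "$(sysctl -n kernel.pid_max)"
```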

Insufficient disk space - DiskPressure

Cause

Usually, a node runs out of disk space because containers consume excessive disk space or container images are too large.

Symptoms

When the available disk space for container images on a node drops below the imagefs.available threshold, the node's DiskPressure condition becomes True.
When the available disk space on the node's root filesystem drops below the nodefs.available threshold, all pods on that node are evicted. For more information, see Node-pressure Eviction.
When a node runs out of disk space, you may observe the following common symptoms:
- The node's DiskPressure condition is True.
- After image garbage collection runs, if the disk space remains below the health threshold (default: 80%), you can find the keyword failed to garbage collect required amount of images in the node events.
- When pods are evicted from the node:
  - In the events for the evicted pods, you can find the keyword The node was low on resource: [DiskPressure].
  - In node events, you can find keywords such as attempting to reclaim ephemeral-storage or attempting to reclaim nodefs.
If you have configured alerts for node exceptions, you will receive an alert if a node's disk usage reaches 85% or higher. For more information about how to configure alerts, see ACK alert management.

Solution

Check the node's monitoring data to view the disk usage curve. This helps you identify when the usage spiked and determine if any processes are consuming excessive disk space. For more information, see Check the monitoring data of nodes.
Delete unneeded files from the disk to free up space. For instructions, see Resolve "no space left" issues on Linux.
Based on your workload requirements, set resource requests and limits for ephemeral-storage in your Pod specifications. For more information, see Set CPU and memory resource limits for a container.
Use Alibaba Cloud storage services instead of hostPath volumes. For more information, see Storage.
Increase the node's disk size.
Reduce the node load. For more information, see Insufficient node resources for scheduling.
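
As a quick complement to the monitoring curve, the sketch below lists the file systems on a node that are at or above a usage threshold. It is only an illustration: the function name mounts_over_threshold is hypothetical, the default of 85 mirrors the alert threshold mentioned above, and mount points that contain spaces are not handled.

```shell
# Read `df -P` output on stdin and print "<mount point> <usage>" for every
# file system whose usage is at or above the threshold (percent, default 85).
mounts_over_threshold() {
    awk -v t="${1:-85}" 'NR > 1 {
        use = $5
        sub(/%/, "", use)          # strip the percent sign from the Capacity column
        if (use + 0 >= t) print $6, $5
    }'
}
```

On a node, this could be run as `df -P | mounts_over_threshold 85`. For a mount point that is reported, `sudo du -xh --max-depth=1 <directory> | sort -rh | head -n 10` (GNU coreutils) helps narrow down which directories take up the space.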

Insufficient node IP addresses - InvalidVSwitchId.IpNotEnough

Cause

This issue occurs when an excessive number of containers on a node exhausts the available IP addresses.

Symptoms

Pods fail to start and are stuck in the ContainerCreating status. The Pod logs show an error message that contains the keyword InvalidVSwitchId.IpNotEnough. For more information about how to view Pod logs, see Troubleshoot Pod exceptions.

time="2020-03-17T07:03:40Z" level=warning msg="Assign private ip address failed: Aliyun API Error: RequestId: 2095E971-E473-4BA0-853F-0C41CF52651D Status Code: 403 Code: InvalidVSwitchId.IpNotEnough Message: The specified VSwitch \"vsw-AAA\" has not enough IpAddress., retrying"

If you configured alerts for your cluster, you will receive an alert when a node runs out of IP addresses. For more information about how to configure alerts, see ACK alert management.

Solution

Reduce the number of containers on the node. For more information, see Insufficient node resources for scheduling. For other related solutions, see "What to do if a VSwitch provides insufficient IP addresses in a Terway cluster" and "I expanded the VSwitch CIDR blocks in Terway mode but still cannot allocate Pod IPs. What do I do?"
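
To see which vSwitch is affected, the ID can be pulled out of log lines like the one shown in the symptoms. The sketch below assumes the log text is piped in on stdin and that vSwitch IDs follow the vsw- prefix seen in that message. The function name vswitches_out_of_ips is illustrative.

```shell
# Read log lines on stdin and print the distinct vSwitch IDs that appear in
# InvalidVSwitchId.IpNotEnough errors.
vswitches_out_of_ips() {
    grep 'InvalidVSwitchId.IpNotEnough' \
        | grep -oE 'vsw-[A-Za-z0-9]+' \
        | sort -u
}
```

A possible invocation is `kubectl logs <pod-name> -n <namespace> | vswitches_out_of_ips`, where the pod name and namespace are placeholders for the pod whose logs contain the error.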

Node network exceptions

Cause

This issue typically results from an abnormal node status, misconfigured security groups, or high network load.

Symptoms

You cannot log on to the node.
The node status is Unknown.
If you configured alerts for cluster node exceptions, you will receive an alert when the node's outbound Internet bandwidth utilization reaches 85% or higher. For more information, see ACK alert management.

Solution

If you cannot log on to the node, follow these steps:
- Check if the node instance is in the Running status.
- Check the node's security group configuration. For more information, see Check the security groups of nodes.
If the node's network load is high, follow these steps:
- Check the node's network usage curve in the monitoring data to identify pods consuming excessive network bandwidth. For more information, see Check the monitoring data of nodes.
- Use network policies to control pod traffic. For more information, see Network policies in ACK clusters.

Unexpected node restarts

Cause

This issue is typically caused by an abnormal node load.

Symptoms

While the node is restarting, its status is NotReady.
If you configured alerts for cluster node exceptions, you receive an alert when a node restarts unexpectedly. For more information, see ACK alert management.

Solution

Run the following command to check the node's restart time.

last reboot

Expected output:

[root@iZoj203tgmxxx /root]
#last reboot
reboot   system boot  xxx  Tue Mar 22 06:26 - 11:49 (1+05:22)
reboot   system boot  xxx  Fri Oct 15 05:27 - 11:49 (159+06:22)
wtmp begins Fri Oct 15 05:27:03 2021

Check the node's monitoring data and investigate any resource anomalies based on the restart time. For more information, see Check the monitoring data of nodes.
Check the node's kernel logs and look for any error logs that correspond to the restart time. For more information, see Collect the diagnostics logs of nodes.
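
When scanning kernel logs around the restart time, a simple keyword filter can shorten the search. The sketch below is one way to do this. The keyword list is an assumption based on common Linux kernel messages (OOM kills, panics, lockups, hung tasks) and is not an exhaustive or ACK-specific list. The function name kernel_restart_hints is illustrative.

```shell
# Read kernel log lines on stdin and keep only those that commonly precede an
# unexpected restart. The keyword list is an assumption and may need tuning.
kernel_restart_hints() {
    grep -iE 'out of memory|oom-killer|kernel panic|soft lockup|blocked for more than'
}
```

If the node keeps a persistent systemd journal, `journalctl -k -b -1 | kernel_restart_hints` filters the kernel messages of the previous boot. Otherwise, the same filter can be applied to the collected diagnostic logs.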

High disk I/O from auditd or "audit: backlog limit exceeded" error

Cause

By default, some existing nodes in a cluster are configured with auditd rules that monitor Docker operations. When these nodes use the Docker container runtime, the rules trigger the system to record audit logs for Docker-related activities. Under certain conditions, such as frequent container restarts, applications writing a large volume of files in a short period, or kernel bugs, the system may generate an excessive number of audit logs. This can occasionally cause high disk I/O from the auditd process or an audit: backlog limit exceeded error in the system log.

Symptoms

This issue affects only nodes that use the Docker container runtime. An affected node may exhibit the following symptoms:

When you run the iotop -o -d 1 command, the output shows that the DISK WRITE value for the auditd process consistently remains at 1 MB/s or higher.
When you run the dmesg -d command, the output contains logs with the audit_printk_skb keyword, such as audit_printk_skb: 100 callbacks suppressed.
When you run the dmesg -d command, the output contains the keyword audit: backlog limit exceeded.

Solution

Follow these steps to verify that the auditd configuration is the cause:

Log on to a cluster node.
Run the following command to check the audit rules.
```
sudo auditctl -l | grep -- ' -k docker'
```
If the output contains the following line, the auditd configuration is the cause.
```
-w /var/lib/docker -k docker
```

If the check confirms that this issue affects your cluster nodes, choose one of the following solutions.

Upgrade the cluster
Upgrade your cluster to fix this issue. For more information, see Manually upgrade a cluster.
Use the containerd container runtime
For clusters that you cannot upgrade, you can work around this issue by changing the node container runtime from Docker to containerd. Perform the following steps for each node pool that uses the Docker container runtime:
1. Create a new node pool that uses containerd by cloning the existing node pool. The new node pool's configuration must be identical to the original, except for the container runtime.
2. During off-peak hours, drain the nodes in the original node pool one by one.
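
The drain step can be prepared as a reviewable list of commands before anything is run. The sketch below only prints the commands. It assumes kubectl access to the cluster and a kubectl version that supports the --delete-emptydir-data flag (older versions use --delete-local-data). The function name print_drain_plan and the node names in the example are hypothetical.

```shell
# Print a drain plan for the given node names without executing anything.
# All nodes are cordoned first so that evicted pods are not rescheduled onto
# another node of the old node pool, then the nodes are drained one by one.
print_drain_plan() {
    for node in "$@"; do
        printf 'kubectl cordon %s\n' "$node"
    done
    for node in "$@"; do
        printf 'kubectl drain %s --ignore-daemonsets --delete-emptydir-data\n' "$node"
    done
}
```

For example, `print_drain_plan node-a node-b` prints two cordon commands followed by two drain commands. Run the printed commands one at a time and confirm that workloads are healthy on the new node pool before you drain the next node.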

Update the auditd configuration on nodes

If you cannot upgrade the cluster or switch to containerd, work around the issue by manually updating the auditd configuration on each affected node.

Log on to the affected node.

Run the following commands to delete the Docker-related audit rules.

sudo test -f /etc/audit/rules.d/audit.rules && sudo sed -i.bak '/ -k docker/d' /etc/audit/rules.d/audit.rules
sudo test -f /etc/audit/audit.rules && sudo sed -i.bak '/ -k docker/d' /etc/audit/audit.rules

Run the following command to apply the new audit rules.

if service auditd status | grep -q running || systemctl status auditd | grep -q running; then
    sudo service auditd restart || sudo systemctl restart auditd
    sudo service auditd status || sudo systemctl status auditd
fi
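
After auditd restarts, it is worth confirming that the Docker-related rule is gone. The sketch below reuses the same pattern as the earlier check. The function name has_docker_audit_rule is illustrative.

```shell
# Read `auditctl -l` output on stdin.
# Succeeds (exit status 0) if a rule tagged with the key "docker" is present.
has_docker_audit_rule() {
    grep -q -- ' -k docker'
}
```

For example, `sudo auditctl -l | has_docker_audit_rule && echo "docker audit rule still loaded"` prints a message only if the rule is still active.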
