This topic describes the diagnostic procedure for nodes and how to troubleshoot node exceptions. This topic also provides answers to some frequently asked questions about nodes.
Table of contents
- Diagnostic procedure
- Common troubleshooting methods
- FAQ and solutions
Diagnostic procedure

Check whether the status of a node is abnormal. For more information, see Check node status.
If the status of a node is NotReady, perform the following steps to troubleshoot the issue:
Check whether the values of the following node conditions are True: PIDPressure, DiskPressure, and MemoryPressure. If one of the node conditions is True, troubleshoot the issue based on the keyword of the node condition. For more information, see Insufficient PIDs - NodePIDPressure, Insufficient disk space - DiskPressure, and Insufficient memory resources - MemoryPressure.
Check the key components and logs of nodes.
Kubelet
Check the status, log, and configuration of the kubelet. For more information, see Check the key components of nodes.
If the kubelet experiences an exception, refer to Kubelet exceptions.
Dockerd
Check the status, log, and configuration of dockerd. For more information, see Check the key components of nodes.
If dockerd experiences an exception, refer to Dockerd exceptions - RuntimeOffline.
Containerd
Check the status, log, and configuration of containerd. For more information, see Check the key components of nodes.
If containerd experiences an exception, refer to Containerd exceptions - RuntimeOffline.
NTP
Check the status, log, and configuration of the NTP service. For more information, see Check the key components of nodes.
If the NTP service experiences an exception, refer to NTP exceptions - NTPProblem.
Check the diagnostic log of the node. For more information, see Check the diagnostic logs of nodes.
Check the monitoring data of the node, including the usage of CPU, memory, and network resources. For more information, see Check the monitoring data of nodes. If the resource usage is abnormal, refer to Insufficient CPU resources and Insufficient memory resources - MemoryPressure.
If the status of the node is Unknown, perform the following steps to troubleshoot the issue:
Check whether the status of the Elastic Compute Service (ECS) instance that hosts the node is Running.
Check the key components of the node.
Kubelet
Check the status, log, and configuration of the kubelet. For more information, see Check the key components of nodes.
If the kubelet experiences an exception, refer to Kubelet exceptions.
Dockerd
Check the status, log, and configuration of dockerd. For more information, see Check the key components of nodes.
If dockerd experiences an exception, refer to Dockerd exceptions - RuntimeOffline.
Containerd
Check the status, log, and configuration of containerd. For more information, see Check the key components of nodes.
If containerd experiences an exception, refer to Containerd exceptions - RuntimeOffline.
NTP
Check the status, log, and configuration of the NTP service. For more information, see Check the key components of nodes.
If the NTP service experiences an exception, refer to NTP exceptions - NTPProblem.
Check the network connectivity of the node. For more information, see Check the security groups of nodes. If a network exception occurs, refer to Network exceptions.
Check the diagnostic log of the node. For more information, see Check the diagnostic logs of nodes.
Check the monitoring data of the node, including the usage of CPU, memory, and network resources. For more information, see Check the monitoring data of nodes. If the resource usage is abnormal, refer to Insufficient CPU resources and Insufficient memory resources - MemoryPressure.
If the issue persists after you perform the preceding operations, use the diagnostics feature provided by Container Service for Kubernetes (ACK) to troubleshoot the issue. For more information, see Troubleshoot node exceptions.
If the issue persists, Submit a ticket.
Common troubleshooting methods
Troubleshoot node exceptions
If a node experiences an exception, you can use the diagnostics feature provided by ACK to troubleshoot the exception.
- Log on to the ACK console.
- In the left-side navigation pane of the ACK console, click Clusters.
- On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.
- In the left-side navigation pane of the details page, choose .
On the Nodes page, find the node that you want to diagnose and choose in the Actions column of the node.
On the Diagnosis details page, troubleshoot the issue based on the diagnostic report.
Check node details
- Log on to the ACK console.
- In the left-side navigation pane of the ACK console, click Clusters.
- On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.
- In the left-side navigation pane of the details page, choose .
On the Nodes page, click the name of the node that you want to manage or choose in the Actions column of the node.
Check node status
- Log on to the ACK console.
- In the left-side navigation pane of the ACK console, click Clusters.
- On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.
- In the left-side navigation pane of the details page, choose .
On the Nodes page, you can view the status of different nodes.
The Running state indicates that a node is running as normal.
If the status of a node is not Running, you can click the name of the node or choose
in the Actions column of the node to view the details of the node.
Note: If you want to collect information about different node conditions such as InodesPressure, DockerOffline, and RuntimeOffline, you must install node-problem-detector in your cluster and create an event center. The event center feature is automatically enabled when you create the cluster. For more information, see Create and use an event center.
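If you prefer the command line and have kubectl access to the cluster, you can also check node status and conditions with the following minimal sketch. The node name is a placeholder:
kubectl get nodes -o wide
kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'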
Check node events
- Log on to the ACK console.
- In the left-side navigation pane of the ACK console, click Clusters.
- On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.
- In the left-side navigation pane of the details page, choose .
On the Nodes page, click the name of the node that you want to manage or choose in the Actions column of the node.
In the lower part of the node details page, you can view the events that are related to the node.
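You can also list node events with kubectl. This is a sketch; the node name is a placeholder:
kubectl describe node <node-name>    # the Events section appears at the end of the output
kubectl get events --all-namespaces --field-selector involvedObject.kind=Node,involvedObject.name=<node-name>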
Check the diagnostic logs of nodes
Use a script to collect diagnostic logs: For more information, see How do I collect the diagnostic data of an ACK cluster?.
Use the console to collect diagnostic logs: For more information, see Collect the diagnostic logs of nodes.
Check the key components of nodes
Kubelet:
Check the status of the kubelet
Log on to the node where the kubelet runs and run the following command to query the status of the kubelet process:
systemctl status kubelet
Expected output:
Check the log of the kubelet
Log on to the node where the kubelet runs and run the following command to print the log of the kubelet. For more information about how to check the log of the kubelet, see Check the diagnostic logs of nodes.
journalctl -u kubelet
Check the configuration of the kubelet
Log on to the node where kubelet runs and run the following command to check the configuration of the kubelet:
cat /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
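You can also probe the kubelet health endpoint directly from the node. This sketch assumes the default healthz port 10248; adjust the port if your kubelet configuration differs:
curl -s http://127.0.0.1:10248/healthz; echo    # prints "ok" when the kubelet is healthy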
Runtime:
Check dockerd
Check the status of dockerd
Log on to the node where dockerd runs and run the following command to query the status of the dockerd process:
systemctl status docker
Expected output:
Check the log of dockerd
Log on to the node where dockerd runs and run the following command to print the log of dockerd. For more information about how to check the log of dockerd, see Check the diagnostic logs of nodes.
journalctl -u docker
Check the configuration of dockerd
Log on to the node where dockerd runs and run the following command to query the configuration of dockerd:
cat /etc/docker/daemon.json
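As an additional check, you can verify that dockerd responds to client requests. A minimal sketch, run on the node:
docker info    # prints runtime details if dockerd is responsive
docker ps      # fails or hangs if dockerd is unhealthy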
Check containerd
Check the status of containerd
Log on to the node where containerd runs and run the following command to query the status of the containerd process:
systemctl status containerd
Expected output:
Check the log of containerd
Log on to the node where containerd runs and run the following command to print the log of containerd. For more information about how to check the log of containerd, see Check the diagnostic logs of nodes.
journalctl -u containerd
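If crictl is installed on the node, you can also verify that containerd responds over the Container Runtime Interface (CRI). A minimal sketch:
crictl info    # prints runtime status if containerd is responsive
crictl ps      # lists the running containers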
NTP:
Check the status of the NTP service
Log on to the node where the NTP service runs and run the following command to query the status of the chronyd process:
systemctl status chronyd
Expected output:
Check the log of the NTP service
Log on to the node where the NTP service runs and run the following command to print the log of the NTP service:
journalctl -u chronyd
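You can also check whether the node clock is synchronized. The following sketch assumes that the node uses chrony as the NTP service:
chronyc tracking      # shows the current time offset and synchronization status
chronyc sources -v    # lists the configured time sources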
Check the monitoring data of nodes
CloudMonitor
ACK is integrated with CloudMonitor. You can log on to the CloudMonitor console to view the monitoring data of the ECS instances that are deployed in your ACK cluster. For more information about how to monitor nodes, see Monitor nodes.
Prometheus Service
- Log on to the ACK console.
- In the left-side navigation pane of the ACK console, click Clusters.
- On the Clusters page, find the cluster that you want to manage. Then, click the name of the cluster or click Details in the Actions column.
In the left-side navigation pane of the cluster details page, choose .
On the Prometheus Monitoring page, click the Node Monitoring tab. On the Nodes page, select a node from the drop-down list to view the monitoring data of the node, such as information about the CPU, memory, and disk resources.
Check the security groups of nodes
For more information, see Security group overview and Configure security group rules to enforce access control on ACK clusters.
Kubelet exceptions
Cause
In most cases, a kubelet exception occurs because the kubelet process experiences an exception, the runtime experiences an exception, or the configuration of the kubelet is invalid.
Issue
The status of the kubelet is inactive.
Solution
Run the following command to restart the kubelet. The restart operation does not affect the containers that are running.
systemctl restart kubelet
After the kubelet restarts, log on to the node where the kubelet runs and run the following command to check whether the status of the kubelet is normal:
systemctl status kubelet
If the status of the kubelet is abnormal, run the following command on the node to print the log of the kubelet:
journalctl -u kubelet
If you find an exception in the kubelet log, troubleshoot the exception based on the keyword.
If the configuration of the kubelet is invalid, run the following command to modify the configuration:
vi /etc/systemd/system/kubelet.service.d/10-kubeadm.conf    #Modify the configuration of the kubelet.
systemctl daemon-reload; systemctl restart kubelet          #Reload the configuration and restart the kubelet.
Dockerd exceptions - RuntimeOffline
Cause
In most cases, a dockerd exception occurs because the configuration of dockerd is invalid, the dockerd process is overloaded, or the node is overloaded.
Issue
The status of dockerd is inactive.
The status of dockerd is active (running) but dockerd does not run as normal. As a result, the node experiences an exception. In this case, you may fail to run the docker ps or docker exec command.
The value of the node condition RuntimeOffline is True.
If you enabled alerting for cluster nodes, you can receive alerts when the node where dockerd runs experiences an exception. For more information about how to configure alert rules, see Alert management.
Solution
Run the following command to restart dockerd:
systemctl restart docker
After dockerd restarts, log on to the node and run the following command to check whether the status of dockerd is normal:
systemctl status docker
If the status of dockerd is abnormal, run the following command on the node to print the log of dockerd:
journalctl -u docker
Containerd exceptions - RuntimeOffline
Cause
In most cases, a containerd exception occurs because the configuration of containerd is invalid, the containerd process is overloaded, or the node is overloaded.
Issue
The status of containerd is inactive.
The value of the node condition RuntimeOffline is True.
If you enabled alerting for cluster nodes, you can receive alerts when the node where containerd runs experiences an exception. For more information about how to configure alert rules, see Alert management.
Solution
Run the following command to restart containerd:
systemctl restart containerd
After containerd restarts, log on to the node and run the following command to check whether the status of containerd is normal:
systemctl status containerd
If the status of containerd is abnormal, run the following command on the node to print the log of containerd:
journalctl -u containerd
NTP exceptions - NTPProblem
Cause
In most cases, an NTP exception occurs because the status of the NTP process is abnormal.
Issue
The status of chronyd is inactive.
The value of the node condition NTPProblem is True.
If you enabled alerting for cluster nodes, you can receive alerts when the node where the NTP service runs experiences an exception. For more information about how to configure alert rules, see Alert management.
Solution
Run the following command to restart chronyd:
systemctl restart chronyd
After chronyd restarts, log on to the node and run the following command to check whether the status of chronyd is normal:
systemctl status chronyd
If the status of chronyd is abnormal, run the following command on the node to print the log of chronyd:
journalctl -u chronyd
PLEG exceptions - PLEG is not healthy
Cause
The Pod Lifecycle Event Generator (PLEG) records all events that occur throughout the lifecycle of pods, such as events that are related to container startups or terminations. In most cases, the PLEG is not healthy exception occurs because the container runtime on the node is abnormal or the node uses an earlier systemd version that can cause this issue.
Issue
The status of the node is NotReady.
The following content exists in the log of the kubelet:
I0729 11:20:59.245243 9575 kubelet.go:1823] skipping pod synchronization - PLEG is not healthy: pleg was last seen active 3m57.138893648s ago; threshold is 3m0s.
If you enabled alerting for cluster nodes, you can receive alerts when a PLEG exception occurs. For more information about how to configure alert rules, see Alert management.
Solution
Restart the following key components on the node in sequence: dockerd, containerd, and kubelet. Then, check whether the status of the node is normal.
If the status of the node is abnormal after you restart the key components, restart the node. For more information, see Restart instances.
Warning: The restart operation also restarts the pods on the node. Proceed with caution.
If the node runs CentOS 7.6, see The kubelet log of an ACK cluster that runs CentOS 7.6 contains the "Reason:KubeletNotReady Message:PLEG is not healthy:" information.
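For reference, the component restart sequence described in the first step of this solution might look like the following on the node. Skip the runtime that the node does not use:
systemctl restart docker        # only if the node uses the Docker runtime
systemctl restart containerd
systemctl restart kubelet
systemctl status kubelet        # confirm that the kubelet is active (running)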
Insufficient node resources for scheduling
Cause
In most cases, this exception occurs because the resources provided by the nodes in the cluster are insufficient.
Issue
When resources provided by the nodes in your cluster are insufficient, pod scheduling fails and one of the following errors is returned:
Insufficient CPU resources: 0/2 nodes are available: 2 Insufficient cpu
Insufficient memory resources: 0/2 nodes are available: 2 Insufficient memory
Insufficient ephemeral storage resources: 0/2 nodes are available: 2 Insufficient ephemeral-storage
The scheduler determines whether the resources provided by a node are insufficient based on the following rules:
If the amount of CPU resources requested by a pod is greater than the difference between the total amount of allocatable CPU resources provided by a node and the amount of CPU resources allocated from the node, the CPU resources provided by the node are insufficient.
If the amount of memory resources requested by a pod is greater than the difference between the total amount of allocatable memory resources provided by a node and the amount of memory resources allocated from the node, the memory resources provided by the node are insufficient.
If the amount of ephemeral storage resources requested by a pod is greater than the difference between the total amount of allocatable ephemeral storage resources provided by a node and the amount of ephemeral storage resources allocated from the node, the ephemeral storage resources provided by the node are insufficient.
If the amount of resources requested by a pod is greater than the difference between the total amount of allocatable resources provided by a node and the amount of resources allocated from the node, the pod is not scheduled to the node.
Run the following command to query information about the resource allocation on the node:
kubectl describe node [$nodeName]
Expected output:
Allocatable:
cpu: 3900m
ephemeral-storage: 114022843818
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 12601Mi
pods: 60
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 725m (18%) 6600m (169%)
memory 977Mi (7%) 16640Mi (132%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
The following list describes the fields in the output:
Allocatable: the amount of allocatable CPU, memory, or ephemeral storage resources provided by the node.
Allocated resources: the amount of CPU, memory, or ephemeral storage resources allocated from the node.
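To see how the allocated amount is composed, you can list the resource requests of the pods that run on the node. This is a sketch; the node name is a placeholder:
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name> \
  -o custom-columns='NAMESPACE:.metadata.namespace,POD:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory'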
Solution
When the resources provided by the nodes are insufficient, you can use the following methods to reduce the loads of the nodes:
Delete the pods that you no longer require. For more information, see Manage pods.
Modify the resource configurations for the pods based on your business requirements. For more information, see Modify the upper limit and lower limit of CPU and memory resources for pods.
Add nodes to the cluster. For more information, see Create a node pool.
Upgrade the nodes in the cluster. For more information, see Upgrade the configurations of a worker node.
For more information, see Insufficient CPU resources, Insufficient memory resources - MemoryPressure, and Insufficient disk space - DiskPressure.
Insufficient CPU resources
Cause
In most cases, the CPU resources provided by a node become insufficient because the containers on the node have occupied an excessive amount of CPU resources.
Issue
When a node does not have sufficient CPU resources, the status of the node may be abnormal.
If you enabled alerting for cluster nodes, you can receive alerts when the CPU usage of the node reaches or exceeds 85%. For more information about how to configure alert rules, see Alert management.
Solution
Check the CPU usage curve on the Node Monitoring page of the console and locate the time point at which the CPU usage spiked. Then, check whether the processes that run on the node have occupied an excessive amount of CPU resources. For more information, see Check the monitoring data of nodes.
For more information about how to reduce the loads of the node, see Insufficient node resources for scheduling.
Restart the node. For more information, see Restart instances.
Warning: The restart operation also restarts the pods on the node. Proceed with caution.
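To identify the processes that consume the most CPU on the node, as described in the first step of this solution, you can run a minimal sketch like the following on the node:
top -b -n 1 -o %CPU | head -n 20    # one snapshot of the processes, sorted by CPU usage
ps aux --sort=-%cpu | head -n 10    # top 10 processes by CPU usage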
Insufficient memory resources - MemoryPressure
Cause
In most cases, the memory resources provided by a node become insufficient because the containers on the node have occupied an excessive amount of memory resources.
Issue
When the amount of available memory resources on the node drops below the memory.available threshold, the value of the node condition MemoryPressure is set to True and containers are evicted from the node. For more information about container eviction, see Node-pressure Eviction.
When the node does not have sufficient memory resources, the following issues occur:
The value of the node condition MemoryPressure is set to True.
Containers are evicted from the node:
You can find the The node was low on resource: memory information in the events of the containers that are evicted.
You can find the attempting to reclaim memory information in the event of the node.
An out of memory (OOM) error may occur. When an OOM error occurs, you can find the System OOM information in the event of the node.
If you enabled alerting for cluster nodes, you can receive alerts when the memory usage of the node reaches or exceeds 85%. For more information about how to configure alert rules, see Alert management.
Solution
Check the memory usage curve on the Node Monitoring page of the console and locate the time point at which the memory usage spiked. Then, check whether memory leaks occur in the processes that run on the node. For more information, see Check the monitoring data of nodes.
For more information about how to reduce the loads of the node, see Insufficient node resources for scheduling.
Restart the node. For more information, see Restart instances.
Warning: The restart operation also restarts the pods on the node. Proceed with caution.
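To check which processes consume the most memory on the node, as described in the first step of this solution, you can run a minimal sketch like the following on the node:
free -h                             # overall memory usage on the node
ps aux --sort=-%mem | head -n 10    # top 10 processes by memory usage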
Insufficient inodes - InodesPressure
Cause
In most cases, the inodes provided by a node become insufficient because the containers on the node have occupied an excessive number of inodes.
Issue
When the number of available inodes on the node drops below the inodesFree threshold, the value of the node condition InodesPressure is set to True and containers are evicted from the node. For more information about container eviction, see Node-pressure Eviction.
When the inodes provided by a node become insufficient, the following issues occur:
The value of the node condition InodesPressure is set to True.
Containers are evicted from the node:
You can find the The node was low on resource: inodes information in the events of the containers that are evicted.
You can find the attempting to reclaim inodes information in the event of the node.
If you enabled alerting for cluster nodes, you can receive alerts when the node does not have sufficient inodes. For more information about how to configure alert rules, see Alert management.
Solution
Check the inode usage curve on the Node Monitoring page of the console and locate the time point at which the inode usage spiked. Then, check whether the processes that run on the node have occupied an excessive number of inodes. For more information, see Check the monitoring data of nodes.
For more information about other issues that are related to inodes, see What do I do if no disk space is available or the number of inodes exceeds the inode quota in a Linux instance?
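To find the file systems and directories that consume the most inodes, you can run a minimal sketch like the following on the node. The scanned path is only an example:
df -i    # inode usage per file system
# Count the files under the top-level directories of a suspicious path. This can take a while on large file systems.
for d in /var/lib/*/; do echo "$(find "$d" -xdev 2>/dev/null | wc -l) $d"; done | sort -rn | head -n 5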
Insufficient PIDs - NodePIDPressure
Cause
In most cases, the process IDs (PIDs) provided by a node become insufficient because the containers on the node have occupied an excessive number of PIDs.
Issue
When the number of available PIDs on the node drops below the pid.available threshold, the value of the node condition NodePIDPressure is set to True and containers are evicted from the node. For more information about container eviction, see Node-pressure Eviction.
If you enabled alerting for cluster nodes, you can receive alerts when the node does not have sufficient PIDs. For more information about how to configure alert rules, see Alert management.
Solution
Run the following commands to query the maximum number of PIDs and the greatest PID value on the node.
sysctl kernel.pid_max                              #Query the maximum number of PIDs.
ps -eLf | awk '{print $2}' | sort -rn | head -n 1  #Query the greatest PID value.
Run the following command to query the top five processes that occupied the most number of PIDs:
ps -elT | awk '{print $4}' | sort | uniq -c | sort -k1 -g | tail -5
Expected output:
#The first column displays the numbers of PIDs that are occupied by the processes. The second column displays the IDs of the processes.
   73 9743
   75 9316
   76 2812
   77 5726
   93 5691
You can use the process IDs to locate the corresponding processes and pods, diagnose the issue, and optimize the code.
Reduce the loads of the node. For more information, see Insufficient node resources for scheduling.
Restart the node. For more information, see Restart instances.
Warning: The restart operation also restarts the pods on the node. Proceed with caution.
Insufficient disk space - DiskPressure
Cause
In most cases, the disk space provided by a node becomes insufficient because the containers on the node have occupied an excessive amount of disk space or the size of the container image is too large.
Issue
When the amount of available disk space on the node drops below the imagefs.available threshold, the value of the node condition DiskPressure is set to True. When the amount of available disk space drops below the nodefs.available threshold, all containers on the node are evicted. For more information about container eviction, see Node-pressure Eviction.
When the disk space of a node becomes insufficient, the following issues occur:
The value of the node condition DiskPressure is set to True.
If the amount of available disk space remains lower than the health threshold after the image reclaim policy is triggered, you can find the failed to garbage collect required amount of images information in the event of the node. The default value of the health threshold is 80%.
Containers are evicted from the node:
You can find the The node was low on resource: [DiskPressure] information in the events of the containers that are evicted.
You can find the attempting to reclaim ephemeral-storage or attempting to reclaim nodefs information in the event of the node.
If you enabled alerting for cluster nodes, you can receive alerts when the disk usage of the node reaches or exceeds 85%. For more information about how to configure alert rules, see Alert management.
Solution
Check the disk usage curve on the Node Monitoring page of the console and locate the time point at which the disk usage spiked. Then, check whether the processes that run on the node have occupied an excessive amount of disk space. For more information, see Check the monitoring data of nodes.
If a large number of files have occupied the disk space, delete the files that you no longer need. For more information, see What do I do if no disk space is available or the number of inodes exceeds the inode quota in a Linux instance?
Limit the ephemeral storage that is allocated to the pods on the node. For more information, see Modify the upper limit and lower limit of CPU and memory resources for pods.
We recommend that you use the storage services provided by Alibaba Cloud and avoid using hostPath volumes. For more information, see CSI overview.
Resize the disk of the node.
Reduce the loads of the node. For more information, see Insufficient node resources for scheduling.
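To locate what consumes the disk space on the node, you can run a minimal sketch like the following on the node. The paths are examples and depend on the container runtime that the node uses:
df -h    # disk usage per file system
du -x --max-depth=2 /var/lib/containerd 2>/dev/null | sort -rn | head -n 10
du -x --max-depth=2 /var/lib/docker 2>/dev/null | sort -rn | head -n 10    # only if the node uses the Docker runtime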
Insufficient IP addresses - InvalidVSwitchId.IpNotEnough
Cause
In most cases, the IP addresses provided by a node become insufficient because the containers on the node have occupied an excessive number of IP addresses.
Issue
Pods fail to be launched. The status of these pods is ContainerCreating. You can find the InvalidVSwitchId.IpNotEnough information in the logs of the pods. For more information about how to check the log of a pod, see Check the logs of pods.
time="2020-03-17T07:03:40Z" level=warning msg="Assign private ip address failed: Aliyun API Error: RequestId: 2095E971-E473-4BA0-853F-0C41CF52651D Status Code: 403 Code: InvalidVSwitchId.IpNotEnough Message: The specified VSwitch \"vsw-AAA\" has not enough IpAddress., retrying"
If you enabled alerting for cluster nodes, you can receive alerts when the node cannot provide sufficient IP addresses. For more information about how to configure alert rules, see Alert management.
Solution
Reduce the number of containers on the node. For more information, see Insufficient node resources for scheduling. For more information about other relevant operations, see How do I resolve the issue that the IP addresses provided by vSwitches are insufficient when the Terway network plug-in is used? and What do I do if the IP address of a newly created pod does not fall within the vSwitch CIDR block after I add a vSwitch in Terway mode?
Network exceptions
Cause
In most cases, a network exception occurs because the status of the node is abnormal, the configurations of the security groups of the node are invalid, or the network is overloaded.
Issue
You failed to log on to the node.
The status of the node is Unknown.
If you enabled alerting for cluster nodes, you can receive alerts when the outbound Internet bandwidth usage of the node reaches or exceeds 85%. For more information about how to configure alert rules, see Alert management.
Solution
If you failed to log on to the node, perform the following steps to troubleshoot the issue:
Check whether the status of the node is Running.
Check the configurations of the security groups of the node. For more information, see Check the security groups of nodes.
If the network of the node is overloaded, perform the following steps to troubleshoot the issue:
Check the network performance curve of the node on the Node Monitoring page of the console and look for bandwidth usage spikes. For more information, see Check the monitoring data of nodes.
Use network policies to throttle pod traffic. For more information, see Use network policies.
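To get a quick view of the network load and error counters on the node, a minimal sketch, run on the node:
ip -s link        # per-interface traffic, errors, and drops
ss -s             # summary of open sockets
sar -n DEV 1 5    # per-second interface throughput; requires the sysstat package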
Unexpected node restarts
Cause
In most cases, a node unexpectedly restarts because the node is overloaded.
Issue
During the restart process of the node, the status of the node is NotReady.
If you enabled alerting for cluster nodes, you can receive alerts when the node unexpectedly restarts. For more information about how to configure alert rules, see Alert management.
Solution
Run the following command to query the point in time at which the node restarted:
last reboot
Expected output:
Check the monitoring data of the node and look for abnormal resource usage based on the point in time at which the node restarted. For more information, see Check the monitoring data of nodes.
Check the kernel log of the node and look for exceptions based on the point in time at which the node restarted. For more information, see Check the diagnostic logs of nodes.
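To look for kernel-level errors around the restart time, a minimal sketch, run on the node:
grep -Ei 'oom|panic|hardware error' /var/log/messages | tail -n 20    # recent kernel errors, if this log file exists on the node
journalctl -k -b -1 | tail -n 50                                      # kernel log of the previous boot, if persistent journaling is enabled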
How do I resolve the issue that the disk I/O utilization of the auditd process is high or the following error message appears in the system log: audit: backlog limit exceeded?
Cause
Some existing nodes in the cluster are configured with audit rules of auditd for Docker by default. If the nodes run Docker, the system generates audit logs for Docker based on the audit rules of auditd. A large number of audit logs may be generated when a large number of containers repetitively restart at the same time, a large amount of data is written into a container, or kernel bugs occur. In these cases, the disk I/O utilization of the auditd process may be high and the following error message may appear in the system log: audit: backlog limit exceeded.
Issue
This issue affects only nodes that run Docker. The following situations occur after you run specific commands on the node:
After you run the iotop -o -d 1 command, the output shows that the value of DISK WRITE remains at 1 MB/s or higher.
After you run the dmesg -d command, the output includes a log entry that contains the audit_printk_skb keyword. Example: audit_printk_skb: 100 callbacks suppressed.
After you run the dmesg -d command, the output contains the following keyword: audit: backlog limit exceeded.
Solution
Perform the following operations to check whether the preceding issue is caused by the audit rules of auditd:
Log on to the node.
Run the following command to query auditd rules:
sudo auditctl -l | grep -- ' -k docker'
If the following output is returned, the preceding issue is caused by the audit rules of auditd.
-w /var/lib/docker -k docker
To resolve this issue, select one of the following solutions:
Update the Kubernetes version of the cluster
Update the Kubernetes version of the cluster to a later major version. For more information, see Update the Kubernetes version of an ACK cluster.
Change the container runtime to containerd
If the Kubernetes version of the cluster cannot be updated, you can change the container runtime to containerd to prevent this issue. Perform the following operations on the node pools that use Docker as the container runtime:
Create new node pools by cloning the node pools that run Docker. The new node pools use containerd as the container runtime. Make sure that the configurations of the new node pools are the same as the configurations of the cloned node pools except for the container runtime.
Drain nodes one by one from the node pools that run Docker during off-peak hours until all the application pods are evicted from the node pools.
Update the configurations of auditd
If the preceding solutions are not applicable, you can manually update the configurations of auditd on the node to resolve this issue. Perform the following operations on the nodes that use Docker as the container runtime:
Note: You can manage nodes in batches. For more information, see Manage nodes in batches.
Log on to the node.
Run the following command to delete the auditd rules for Docker:
sudo test -f /etc/audit/rules.d/audit.rules && sudo sed -i.bak '/ -k docker/d' /etc/audit/rules.d/audit.rules
sudo test -f /etc/audit/audit.rules && sudo sed -i.bak '/ -k docker/d' /etc/audit/audit.rules
Run the following command to apply new auditd rules:
if service auditd status | grep running || systemctl status auditd | grep running; then
    sudo service auditd restart || sudo systemctl restart auditd
    sudo service auditd status || sudo systemctl status auditd
fi