ack-node-problem-detector is an event monitoring component for Alibaba Cloud Container Service for Kubernetes (ACK) clusters. It is adapted from the open source node-problem-detector project and adds several enhancements. The component detects node anomalies, serves as the Event Center for ACK clusters, and can be integrated with third-party monitoring platforms. You can also add custom node monitoring plugins to extend the scope of node monitoring. This topic describes the ack-node-problem-detector component, its usage, and its change history.
Component overview
The ack-node-problem-detector component is a node diagnostic tool for ACK clusters that monitors and reports node anomalies. The component consists of the following parts:
kube-event-init: Initializes the cloud resources for the Simple Log Service (SLS) Event Center instance when you install the ack-node-problem-detector component. This allows ack-node-problem-detector-daemonset and kube-eventer to use these resources to store and analyze event data.
ack-node-problem-detector-daemonset: Runs a pod replica on each eligible node to monitor node health and report cluster condition statuses and events. The ack-node-problem-detector image address mentioned later in this topic refers to the image address for ack-node-problem-detector-daemonset.
Note For more information about the open source community project node-problem-detector, see node-problem-detector.
kube-eventer: Reports all events in the cluster and sends them to the SLS Event Center by default. This provides event storage and analysis for 90 days by default. It also provides features such as monitoring dashboards, alerts, and event search and analysis. You can also configure kube-eventer to report cluster events to other systems, such as DingTalk or EventBridge, for further data integration. For more information, see kube-eventer.
accel-health-monitor: Runs a pod replica on each eligible GPU node to monitor the status of the node's GPU devices and report Node Conditions and Kubernetes events. The image address for accel-health-monitor is provided later in this topic. For more information about the permissions and notes for this component, see GPU anomaly detection.
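The DaemonSet components above surface problems as Node Conditions, so a quick way to see what they report is to read the conditions on a Node object. The following is a minimal sketch, not part of the component: it assumes the usual Kubernetes convention that Ready is healthy when its status is "True" and that every other condition signals a problem when its status is "True". The sample node and the condition name ExampleGPUProblem are hypothetical.

```python
import json
from typing import List

# Conditions for which "True" means healthy. For every other condition type,
# "True" is treated as a problem (assumed convention, see the lead-in).
HEALTHY_WHEN_TRUE = {"Ready"}


def abnormal_conditions(node: dict) -> List[str]:
    """Return the condition types on a Node object that indicate a problem."""
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        ctype, status = cond.get("type"), cond.get("status")
        if ctype in HEALTHY_WHEN_TRUE:
            if status != "True":
                problems.append(ctype)
        elif status == "True":
            problems.append(ctype)
    return problems


# Hypothetical, trimmed Node object for illustration only.
sample_node = json.loads("""
{
  "metadata": {"name": "node-a"},
  "status": {"conditions": [
    {"type": "Ready", "status": "True"},
    {"type": "MemoryPressure", "status": "False"},
    {"type": "ExampleGPUProblem", "status": "True"}
  ]}
}
""")

print(abnormal_conditions(sample_node))  # ['ExampleGPUProblem']
```

In practice the input could come from `kubectl get node <name> -o json`; the function only inspects `status.conditions`, so it works on any Node object regardless of which plugin set the condition.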
Usage
For more information about how to install ack-node-problem-detector, its use cases, and the features of new plugins, see Event monitoring.
Change history
November 2025
Version number | Image address | Change time | Change description |
1.2.29 | accel-health-monitor: registry-__ACK_REGION_ID__-vpc.ack.aliyuncs.com/acs/accel-health-monitor:v0.5.3-bafb2ba5-aliyun; kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.14-315a7cb-aliyun | November 30, 2025 | The GPU detection plugin in ack-node-problem-detector-daemonset is deployed separately as a DaemonSet named ack-accel-health-monitor. For information about the permissions for ack-accel-health-monitor, see GPU anomaly detection. The GPU detection plugin adds detection capabilities for software and devices such as nvidia-persistenced, nvidia-fabricmanager, and nvlink. The feature that allows the GPU plugin of the ack-node-problem-detector component to fence abnormal GPUs is disabled by default. The fencing policies for some GPU check items are changed. For more information, see GPU anomaly detection. Fixed an issue where the GPU plugin would restart due to occasional failures in JSON object serialization. kube-eventer supports reporting data to SLS over HTTPS. |
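The image addresses in the change history contain the placeholder `__ACK_REGION_ID__`. As a minimal sketch, assuming the placeholder stands for the region ID of your cluster (for example `cn-hangzhou`), the helper below expands an address template for a given region. The function name `resolve_image` is hypothetical, and addresses that use other forms (such as `{ .Values.controller.regionId }`) are returned unchanged.

```python
PLACEHOLDER = "__ACK_REGION_ID__"


def resolve_image(template: str, region_id: str) -> str:
    """Replace the region placeholder in an image address template."""
    if PLACEHOLDER not in template:
        return template  # address is not region-specific in this form
    if not region_id:
        raise ValueError("region_id must not be empty")
    return template.replace(PLACEHOLDER, region_id)


# Template copied from the 1.2.29 entry; the region ID is an example value.
template = (
    "registry-vpc.__ACK_REGION_ID__.aliyuncs.com"
    "/acs/kube-eventer:v1.2.14-315a7cb-aliyun"
)
print(resolve_image(template, "cn-hangzhou"))
# registry-vpc.cn-hangzhou.aliyuncs.com/acs/kube-eventer:v1.2.14-315a7cb-aliyun
```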
July 2025
Version number | Image address | Change time | Change description |
1.2.27 | | July 24, 2025 | Security hardening for kube-eventer and kube-event-init. ACK dedicated clusters now support the enhanced mode for accessing ECS instance metadata: during authentication, the system accesses ECS instance metadata in enhanced mode, which improves cluster security. For more information, see Enforce the enhanced mode to access ECS instance metadata. |
June 2025
Version number | Image address | Change time | Change description |
1.2.26 | ack-node-problem-detector: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/node-problem-detector:v0.8.16-8d2193b-aliyun; npd-gpu: registry-__ACK_REGION_ID__-vpc.ack.aliyuncs.com/acs/npd-gpu-plugin:v0.4.1-7359b830-aliyun; kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.12-c7c1896-aliyun; kube-event-init: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-init:v1.8-e43647f-aliyun | June 11, 2025 | |
Version number | Image address | Change time | Change description |
1.2.25 | ack-node-problem-detector: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/node-problem-detector:v0.8.16-8ed7053-aliyun; npd-gpu: registry-__ACK_REGION_ID__-vpc.ack.aliyuncs.com/acs/npd-gpu-plugin:v0.4.0-e434dc36-aliyun; kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.12-c7c1896-aliyun; kube-event-init: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-init:v1.8-e43647f-aliyun | June 06, 2025 | Added the npd-gpu container for GPU fault detection. Supports fencing specified GPU cards when a GPU fault is detected. Added support for multiple check items, such as NvidiaXID44Error, NvidiaXID61Error, NvidiaXID62Error, and NvidiaXID69Error. For more information, see GPU anomaly detection and automatic fencing. Supports configuring which GPU check items to enable through ack-node-problem-detector-config. Optimized the image size of the ack-node-problem-detector image. |
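The check item names listed for version 1.2.25 appear to follow the pattern NvidiaXID<code>Error, where the code is an NVIDIA Xid error number. The sketch below illustrates that naming only. It is not how npd-gpu detects faults, which this topic does not describe. It assumes the common NVIDIA driver kernel log format `NVRM: Xid (<PCI address>): <code>, ...`, and both the function name and the sample log lines are hypothetical.

```python
import re
from typing import Optional

# Xid codes behind the check items named in the 1.2.25 change description.
KNOWN_XID_CODES = {44, 61, 62, 69}

# Assumed kernel log format: "NVRM: Xid (PCI:0000:3b:00): 62, pid=..., ..."
XID_PATTERN = re.compile(r"NVRM: Xid \([^)]*\): (\d+)")


def check_item_for(log_line: str) -> Optional[str]:
    """Map a kernel log line to a check item name, or None if it does not apply."""
    match = XID_PATTERN.search(log_line)
    if match is None:
        return None  # not an Xid line
    code = int(match.group(1))
    if code not in KNOWN_XID_CODES:
        return None  # Xid code is not one of the items listed above
    return f"NvidiaXID{code}Error"


# Hypothetical log lines for illustration only.
print(check_item_for("NVRM: Xid (PCI:0000:3b:00): 62, pid=1234, example message"))
print(check_item_for("kernel: usb 1-1: new high-speed USB device"))
```

The first call yields `NvidiaXID62Error` and the second yields `None`. The set of codes is limited to the four items named in the change description; the component may support others.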
August 2024
Version number | Image address | Change time | Change description |
1.2.20 | ack-node-problem-detector: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/node-problem-detector:v0.8.14-3c6002c-aliyun; kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.11-0620284-aliyun; kube-event-init: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-init:v1.8-e43647f-aliyun | August 20, 2024 | Supports GPU fault inspection for ECS nodes. Upgraded the kube-eventer component to relieve performance bottlenecks when a cluster reports a large number of events. Upgraded the kube-eventer component to support the V4 signature algorithm for Simple Log Service data transmission. Added component parameter settings: you can now manually set the local port of the ack-node-problem-detector DaemonSet pod to 20256 or 20257. The port is disabled by default. |
December 2023
Version number | Image address | Change time | Change description |
v1.2.18 | ack-node-problem-detector: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/node-problem-detector:v0.8.13-003ac31-aliyun; kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.8-27a468a-aliyun; kube-event-init: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-init:v1.7-48a2acc-aliyun | December 18, 2023 | Fixed a bug where cached historical kernel logs caused false-positive PodOOMKilling events. User-defined component parameters are now inherited when you upgrade from an older version of the ack-node-problem-detector component. |
August 2023
Version number | Image address | Change time | Change description |
v1.2.17 | ack-node-problem-detector: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/node-problem-detector:v0.8.12-bf8aff8-aliyun; kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.8-27a468a-aliyun; kube-event-init: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-init:v1.7-48a2acc-aliyun | August 24, 2023 | You can modify the component parameter settings on the component management page in the ACK console to update the SLS Project and Logstore configurations. Supports adding extra tag information, such as the cluster name, when sending log data to SLS. This information is displayed by default in the SLS data of the ACK Event Center. |
June 2023
Version number | Image address | Change time | Change description |
v1.2.16 | ack-node-problem-detector: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/node-problem-detector:v0.8.12-bf8aff8-aliyun; kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.8-019546c-aliyun; kube-event-init: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-init:v1.7-48a2acc-aliyun | June 27, 2023 | Supports configuring the resource specification parameters for the component on the component management page in the ACK console. |
v1.2.15 | ack-node-problem-detector: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/node-problem-detector:v0.8.12-bf8aff8-aliyun; kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.8-019546c-aliyun; kube-event-init: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-init:v1.7-48a2acc-aliyun | June 06, 2023 | Reduced the load that ack-node-problem-detector places on the API server and etcd when PodOOMKilling events occur frequently in large-scale clusters. |
February 2023
Version number | Image address | Change time | Change description |
v1.2.14 | ack-node-problem-detector: registry.aliyuncs.com/acs/node-problem-detector:v0.8.11-edc7907-aliyun; kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.6-bbf76f7-aliyun; kube-event-init: registry.{ .Values.controller.regionId }.aliyuncs.com/acs/kube-eventer-init:v1.7-48a2acc-aliyun | February 03, 2023 | |
September 2022
Version number | Image address | Change time | Change description |
v1.2.11 | ack-node-problem-detector: registry.aliyuncs.com/acs/node-problem-detector:v0.8.11-edc7907-aliyun; kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.6-bbf76f7-aliyun; kube-event-init: registry.{ .Values.controller.regionId }.aliyuncs.com/acs/kube-eventer-init:v1.7-48a2acc-aliyun | September 30, 2022 | |
February 2022
Version number | Image address | Change time | Change description |
v1.2.9 | ack-node-problem-detector: registry.aliyuncs.com/acs/node-problem-detector:v0.8.10-e0ff7d2; kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-amd64:v1.2.6-f0efecf-aliyun; kube-event-init: registry.{ .Values.controller.regionId }.aliyuncs.com/acs/kube-eventer-init:v1.6-a92aba6-aliyun | February 22, 2022 | |
January 2022
Version number | Image address | Change time | Change description |
v1.2.8 | ack-node-problem-detector: registry.aliyuncs.com/acs/node-problem-detector:v0.8.10-e0ff7d2; kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-amd64:v1.2.5-cc7ec54-aliyun; kube-event-init: registry.{ .Values.controller.regionId }.aliyuncs.com/acs/kube-eventer-init:v1.6-a92aba6-aliyun | January 20, 2022 | |
November 2021
Version number | Image address | Change time | Change description |
v1.2.7 | ack-node-problem-detector: registry.aliyuncs.com/acs/node-problem-detector:v0.8.10-e0ff7d2; kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-amd64:v1.2.5-cc7ec54-aliyun; kube-event-init: registry.{ .Values.controller.regionId }.aliyuncs.com/acs/kube-eventer-init:v1.6-a92aba6-aliyun | November 25, 2021 | |
April 2021
Version number | Image address | Change time | Change description |
v1.2.5 | ack-node-problem-detector: registry.aliyuncs.com/acs/node-problem-detector:v0.6.3-28-160499f; kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-amd64:v1.2.4-0f5aaee-aliyun; kube-event-init: registry.{ .Values.controller.regionId }.aliyuncs.com/acs/kube-eventer-init:1.5-5e0e7c1-aliyun | April 25, 2021 | Fixed an issue where kube-event-init in the kube-system namespace would cause a "414 Request-URI Too Large" error when the Event Center was enabled. Optimized the eventer list-watch mechanism to prevent excessive request traffic to etcd. For more information, see eventer list-watch. Fixed an issue where kube-eventer incorrectly parsed the timestamps of some system events. For more information, see fix FailedScheduling event write to sls with wrong timestamp. |
July 2020
Version number | Image address | Change time | Change description |
v0.6.3-28-160499f | registry.aliyuncs.com/acs/node-problem-detector:v0.6.3-28-160499f | July 27, 2020 | OOM Killing event messages now include information such as the pod name, namespace, and UID. Improved the execution efficiency of the check_fd plugin. Improved event notifications for node PID watermarks. Upgraded the network issue detection plugin. Added a plugin to monitor and alert on the inode watermark of the node's system disk. |