ack-node-problem-detector is an event monitoring component for Alibaba Cloud Container Service for Kubernetes (ACK), enhanced from the open-source Node Problem Detector project. It detects node anomalies, powers the event center, and integrates with third-party monitoring platforms. You can add custom monitoring plug-ins to expand its node problem detection capabilities. This topic describes the ack-node-problem-detector component, its usage, and its release notes.
Introduction
The ack-node-problem-detector component is a diagnostic tool for ACK clusters that monitors and reports node anomalies. The component consists of the following parts:
kube-event-init: When you install the ack-node-problem-detector component, kube-event-init initializes the Simple Log Service (SLS) resources required for the event center. This allows ack-node-problem-detector-daemonset and kube-eventer to use these resources to store and analyze event data.
ack-node-problem-detector-daemonset: Runs a pod replica on each node that meets the selector criteria to monitor node health and report node conditions and events. In the following sections, the image address for ack-node-problem-detector refers to the image address for this DaemonSet.
Note: For more information about the open-source node-problem-detector project, see node-problem-detector.
kube-eventer: Reports all cluster events. By default, this component sends events to the SLS event center, which provides 90-day data retention and features such as dashboards, alerts, and event search and analysis. You can also manually configure kube-eventer to send cluster events to other systems, such as DingTalk and EventBridge, for further data integration. For more information, see kube-eventer.
accel-health-monitor: Runs a pod on each eligible GPU node to monitor the status of GPU devices and report node conditions and Kubernetes events. The image address for accel-health-monitor is provided in the release notes. For information about its permissions and usage notes, see GPU anomaly detection.
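The node conditions that these components report appear in the `status.conditions` list of each Node object. As a minimal sketch, the following Python function scans a node object (for example, the output of `kubectl get node <name> -o json` parsed into a dict) and returns the conditions that signal a problem. The function name and the sample node, including its `KernelDeadlock` condition, are illustrative and not taken from this topic. The condition types in your cluster depend on which plug-ins are enabled.

```python
def problem_conditions(node):
    """Return (type, reason) pairs for node conditions that signal a problem.

    Assumption: `Ready` is healthy when its status is "True"; every other
    condition type is treated as a problem when its status is "True".
    """
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        ctype, status = cond.get("type"), cond.get("status")
        if ctype == "Ready":
            unhealthy = status != "True"
        else:
            unhealthy = status == "True"
        if unhealthy:
            problems.append((ctype, cond.get("reason", "")))
    return problems


# Illustrative node object; the values are made up for this example.
sample_node = {
    "status": {
        "conditions": [
            {"type": "Ready", "status": "True", "reason": "KubeletReady"},
            {"type": "MemoryPressure", "status": "False",
             "reason": "KubeletHasSufficientMemory"},
            {"type": "KernelDeadlock", "status": "True", "reason": "DockerHung"},
        ]
    }
}

print(problem_conditions(sample_node))
```

For the sample node, only the `KernelDeadlock` condition is returned, because `Ready` is "True" and `MemoryPressure` is "False".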
Usage
For information about how to install ack-node-problem-detector, its use cases, and new plug-in features, see event monitoring.
Release notes
February 2026
Version | Image address | Release date | Description |
1.2.30 | kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.14-4b806cb-aliyun; node-problem-detector: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/ack-node-problem-detector:v0.8.17-952071f-aliyun; accel-health-monitor: registry-__ACK_REGION_ID__-vpc.ack.aliyuncs.com/acs/accel-health-monitor:v0.5.4-4c80dfa0-aliyun | 2026-02-02 | Note: This version is in canary release. To use this version, submit a ticket. Improved the security of ack-node-problem-detector-daemonset. Improved the security of kube-eventer. Added an option on the component configuration page in the ACK console to enable or disable the generation of isolation files for abnormal GPUs. Modified the isolation policies for some GPU detection items. For more information, see GPU anomaly detection. Added support for eRDMA detection. |
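The image addresses in these release notes contain the placeholder `__ACK_REGION_ID__`. Assuming that the placeholder stands for the region ID of your cluster, the following Python sketch substitutes a concrete value. The helper name `resolve_image` and the region `cn-hangzhou` are only examples.

```python
PLACEHOLDER = "__ACK_REGION_ID__"


def resolve_image(address, region_id):
    """Replace the region placeholder in an image address with a region ID."""
    if not region_id:
        raise ValueError("region_id must not be empty")
    return address.replace(PLACEHOLDER, region_id)


# Address copied from the 1.2.30 row; "cn-hangzhou" is only an example region.
address = (
    "registry-vpc.__ACK_REGION_ID__.aliyuncs.com"
    "/acs/kube-eventer:v1.2.14-4b806cb-aliyun"
)
print(resolve_image(address, "cn-hangzhou"))
```

The same helper handles both placeholder positions used in the tables (`registry-vpc.__ACK_REGION_ID__...` and `registry-__ACK_REGION_ID__-vpc...`), because it is a plain string replacement.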
November 2025
Version | Image address | Release date | Description |
1.2.29 | accel-health-monitor: registry-__ACK_REGION_ID__-vpc.ack.aliyuncs.com/acs/accel-health-monitor:v0.5.3-bafb2ba5-aliyun; kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.14-315a7cb-aliyun | 2025-11-30 | Note: This version is in canary release. To use this version, submit a ticket. Deployed the GPU detection plug-in as a separate DaemonSet named ack-accel-health-monitor instead of including it in ack-node-problem-detector-daemonset. For information about the permissions for ack-accel-health-monitor, see GPU anomaly detection. The GPU detection plug-in can now detect issues related to nvidia-persistenced, nvidia-fabricmanager, and nvlink. Fixed an issue where the GPU plug-in restarted due to intermittent JSON serialization failures. kube-eventer now supports sending data to SLS over HTTPS. |
July 2025
Version | Image address | Release date | Description |
1.2.27 | | 2025-07-24 | Note: This version is in canary release. To use this version, submit a ticket. Improved the security of kube-eventer and kube-event-init. ACK Dedicated Cluster now supports the enhanced mode for accessing ECS instance metadata, which uses a more secure authentication method. For more information, see Enforce the enhanced mode to access ECS instance metadata. |
June 2025
Version | Image address | Release date | Description |
1.2.26 | ack-node-problem-detector: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/node-problem-detector:v0.8.16-8d2193b-aliyun; npd-gpu: registry-__ACK_REGION_ID__-vpc.ack.aliyuncs.com/acs/npd-gpu-plugin:v0.4.1-7359b830-aliyun; kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.12-c7c1896-aliyun; kube-event-init: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-init:v1.8-e43647f-aliyun | 2025-06-11 | Note: This version is in canary release. To use this version, submit a ticket. |
Version | Image address | Release date | Description |
1.2.25 | ack-node-problem-detector: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/node-problem-detector:v0.8.16-8ed7053-aliyun; npd-gpu: registry-__ACK_REGION_ID__-vpc.ack.aliyuncs.com/acs/npd-gpu-plugin:v0.4.0-e434dc36-aliyun; kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.12-c7c1896-aliyun; kube-event-init: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-init:v1.8-e43647f-aliyun | 2025-06-06 | Note: This version is in canary release. To use this version, submit a ticket. Added the npd-gpu container for GPU fault detection. Added support for isolating specific GPUs when a fault is detected. Added support for multiple detection items, including NvidiaXID44Error, NvidiaXID61Error, NvidiaXID62Error, and NvidiaXID69Error. For more information, see GPU anomaly detection and automatic isolation. You can now configure which GPU detection items to enable in ack-node-problem-detector-config. Reduced the image size of ack-node-problem-detector. |
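The detection item names listed for version 1.2.25 follow the pattern `NvidiaXID<number>Error`. The following Python sketch shows one way such names could be derived from NVIDIA Xid messages in the kernel log. It is not a description of how the npd-gpu plug-in works, which this topic does not cover. The log line format, the sample line, and the function name are assumptions made for illustration, and the set of numbers contains only the four items named in the release note.

```python
import re

# Detection items named in the 1.2.25 release note (the list there is not exhaustive).
DETECTION_ITEMS = {44, 61, 62, 69}

# Assumed kernel log format: "NVRM: Xid (PCI:0000:3b:00): 62, pid=..., ..."
XID_PATTERN = re.compile(r"NVRM: Xid \([^)]*\): (\d+)")


def detection_item(log_line):
    """Return the detection item name for a log line, or None if nothing matches."""
    match = XID_PATTERN.search(log_line)
    if match is None:
        return None
    xid = int(match.group(1))
    if xid not in DETECTION_ITEMS:
        return None
    return f"NvidiaXID{xid}Error"


# Illustrative log line; not captured from a real node.
line = "NVRM: Xid (PCI:0000:3b:00): 62, pid=1234, Internal micro-controller halt"
print(detection_item(line))
```

Lines that carry no Xid message, or an Xid number outside the configured set, return `None`.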
August 2024
Version | Image address | Release date | Description |
1.2.20 | ack-node-problem-detector: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/node-problem-detector:v0.8.14-3c6002c-aliyun; kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.11-0620284-aliyun; kube-event-init: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-init:v1.8-e43647f-aliyun | 2024-08-20 | Added support for GPU fault inspection on ECS nodes. Upgraded the kube-eventer component to improve performance during large-scale event reporting. Upgraded the kube-eventer component to support the V4 signature algorithm for Simple Log Service data transmission. Added a parameter to configure the local port of the ack-node-problem-detector DaemonSet pod to 20256 or 20257. This port is disabled by default. |
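Version 1.2.20 lets you set the local port of the DaemonSet pod to 20256 or 20257, and the port is disabled by default. To confirm whether the port is actually listening after you change the parameter, a generic TCP probe such as the following Python sketch can help. It assumes that "local" means the loopback address `127.0.0.1` as seen from where the script runs. It only tests whether a TCP connection succeeds and makes no assumption about what the port serves.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Probe both ports mentioned in the release note on the loopback address.
for port in (20256, 20257):
    state = "open" if port_is_open("127.0.0.1", port) else "closed"
    print(f"port {port}: {state}")
```

With the default configuration, both ports are expected to be reported as closed.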
December 2023
Version | Image address | Release date | Description |
v1.2.18 | ack-node-problem-detector: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/node-problem-detector:v0.8.13-003ac31-aliyun; kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.8-27a468a-aliyun; kube-event-init: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-init:v1.7-48a2acc-aliyun | 2023-12-18 | Fixed an issue where cached historical kernel logs caused false positive PodOOMKilling events. ack-node-problem-detector now retains custom component parameters when you upgrade it from an earlier version. |
August 2023
Version | Image address | Release date | Description |
v1.2.17 | ack-node-problem-detector: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/node-problem-detector:v0.8.12-bf8aff8-aliyun; kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.8-27a468a-aliyun; kube-event-init: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-init:v1.7-48a2acc-aliyun | 2023-08-24 | You can now update the Simple Log Service Project and Logstore configurations by modifying the component parameters on the Add-ons page in the ACK console. You can now attach additional tags, such as cluster names, when you send log data to Simple Log Service. These tags are then displayed by default in the ACK event center. |
June 2023
Version | Image address | Release date | Description |
v1.2.16 | ack-node-problem-detector: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/node-problem-detector:v0.8.12-bf8aff8-aliyun; kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.8-019546c-aliyun; kube-event-init: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-init:v1.7-48a2acc-aliyun | 2023-06-27 | You can now configure the resource specification parameters for the component on the Add-ons page in the ACK console. |
v1.2.15 | ack-node-problem-detector: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/node-problem-detector:v0.8.12-bf8aff8-aliyun; kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.8-019546c-aliyun; kube-event-init: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-init:v1.7-48a2acc-aliyun | 2023-06-06 | Improved the performance of ack-node-problem-detector. This reduces the load on the API server and etcd when PodOOMKilling events occur frequently in large-scale clusters. |
February 2023
Version | Image address | Release date | Description |
v1.2.14 | ack-node-problem-detector: registry.aliyuncs.com/acs/node-problem-detector:v0.8.11-edc7907-aliyun; kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.6-bbf76f7-aliyun; kube-event-init: registry.{ .Values.controller.regionId }.aliyuncs.com/acs/kube-eventer-init:v1.7-48a2acc-aliyun | 2023-02-03 | |
September 2022
Version | Image address | Release date | Description |
v1.2.11 | ack-node-problem-detector: registry.aliyuncs.com/acs/node-problem-detector:v0.8.11-edc7907-aliyun; kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.6-bbf76f7-aliyun; kube-event-init: registry.{ .Values.controller.regionId }.aliyuncs.com/acs/kube-eventer-init:v1.7-48a2acc-aliyun | 2022-09-30 | |
February 2022
Version | Image address | Release date | Description |
v1.2.9 | ack-node-problem-detector: registry.aliyuncs.com/acs/node-problem-detector:v0.8.10-e0ff7d2; kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-amd64:v1.2.6-f0efecf-aliyun; kube-event-init: registry.{ .Values.controller.regionId }.aliyuncs.com/acs/kube-eventer-init:v1.6-a92aba6-aliyun | 2022-02-22 | |
January 2022
Version | Image address | Release date | Description |
v1.2.8 | ack-node-problem-detector: registry.aliyuncs.com/acs/node-problem-detector:v0.8.10-e0ff7d2; kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-amd64:v1.2.5-cc7ec54-aliyun; kube-event-init: registry.{ .Values.controller.regionId }.aliyuncs.com/acs/kube-eventer-init:v1.6-a92aba6-aliyun | 2022-01-20 | |
November 2021
Version | Image address | Release date | Description |
v1.2.7 | ack-node-problem-detector: registry.aliyuncs.com/acs/node-problem-detector:v0.8.10-e0ff7d2; kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-amd64:v1.2.5-cc7ec54-aliyun; kube-event-init: registry.{ .Values.controller.regionId }.aliyuncs.com/acs/kube-eventer-init:v1.6-a92aba6-aliyun | 2021-11-25 | |
April 2021
Version | Image address | Release date | Description |
v1.2.5 | ack-node-problem-detector: registry.aliyuncs.com/acs/node-problem-detector:v0.6.3-28-160499f; kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-amd64:v1.2.4-0f5aaee-aliyun; kube-event-init: registry.{ .Values.controller.regionId }.aliyuncs.com/acs/kube-eventer-init:1.5-5e0e7c1-aliyun | 2021-04-25 | Fixed an issue where kube-event-init in the kube-system namespace returned a "414 Request Too Large" error when the event center was enabled. Improved the list-watch mechanism of the eventer to prevent excessive request traffic to etcd. For more information, see eventer list-watch. Fixed an issue where kube-eventer incorrectly parsed the timestamps of some system events. For more information, see fix FailedScheduling event write to sls with wrong timestamp. |
July 2020
Version | Image address | Release date | Description |
v0.6.3-28-160499f | registry.aliyuncs.com/acs/node-problem-detector:v0.6.3-28-160499f | 2020-07-27 | Enhanced OOMKilling event messages to include pod names, namespaces, and UIDs. Improved the execution efficiency of the check_fd plug-in. Improved event notifications for node PID usage. Upgraded the network diagnostics plug-in. Added a plug-in to monitor and send alerts for inode usage on node system disks. |