Container Service for Kubernetes: ack-node-problem-detector

Last Updated: Mar 26, 2026

ack-node-problem-detector is an event monitoring component for Alibaba Cloud Container Service for Kubernetes (ACK), enhanced from the open-source Node Problem Detector project. It detects node anomalies, powers the event center, and integrates with third-party monitoring platforms. You can add custom monitoring plug-ins to expand its node problem detection capabilities. This topic describes the ack-node-problem-detector component, its usage, and its release notes.

Introduction

The ack-node-problem-detector component is a diagnostic tool for ACK clusters that monitors and reports node anomalies. The component consists of the following parts:

  • kube-event-init: When you install the ack-node-problem-detector component, kube-event-init initializes the Simple Log Service (SLS) resources required for the event center. This allows ack-node-problem-detector-daemonset and kube-eventer to use these resources to store and analyze event data.

  • ack-node-problem-detector-daemonset: Runs a pod replica on each node that meets the selector criteria to monitor node health and report node conditions and events. In the following sections, the image address for ack-node-problem-detector refers to the image address for this DaemonSet.

    Note

    For more information about the open source node-problem-detector project, see node-problem-detector.

  • kube-eventer: Reports all cluster events. By default, this component sends events to the SLS event center, which provides 90-day data retention and features such as dashboards, alerts, and event search and analysis. You can also manually configure kube-eventer to send cluster events to other systems, such as DingTalk and EventBridge, for further data integration. For more information, see kube-eventer.

  • accel-health-monitor: Runs a pod on each eligible GPU node to monitor the status of GPU devices and report node conditions and Kubernetes events. The image address for accel-health-monitor is provided in the release notes. For information about its permissions and usage notes, see GPU anomaly detection.

Usage

For information about how to install ack-node-problem-detector, its use cases, and new plug-in features, see event monitoring.

Release notes
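
The image addresses in these release notes contain the placeholder __ACK_REGION_ID__. The following minimal sketch shows one way to turn such a template into a concrete address and split it into its parts. It assumes that the placeholder stands for the region ID of your cluster (cn-hangzhou is used here only as an example), and the helper names resolve_image_address and split_image_address are hypothetical, not part of any ACK tool.

```python
# Placeholder string as it appears in the image addresses of the release notes.
REGION_PLACEHOLDER = "__ACK_REGION_ID__"


def resolve_image_address(template: str, region_id: str) -> str:
    """Return the image address with the region placeholder filled in.

    Assumption: the placeholder is replaced by the region ID of the cluster.
    """
    if not region_id:
        raise ValueError("region_id must not be empty")
    return template.replace(REGION_PLACEHOLDER, region_id)


def split_image_address(address: str) -> tuple[str, str, str]:
    """Split an image address into (registry, repository, tag)."""
    registry, _, rest = address.partition("/")
    repository, _, tag = rest.rpartition(":")
    return registry, repository, tag


# Template copied from the 1.2.30 entry; the region ID is an example value.
template = (
    "registry-vpc.__ACK_REGION_ID__.aliyuncs.com"
    "/acs/kube-eventer:v1.2.14-4b806cb-aliyun"
)
address = resolve_image_address(template, "cn-hangzhou")
parts = split_image_address(address)
print(address)
print(parts)
```

Addresses that do not contain the placeholder, such as those that start with registry.aliyuncs.com, pass through resolve_image_address unchanged.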

February 2026

Version

Image address

Release date

Description

1.2.30

  • kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.14-4b806cb-aliyun

  • node-problem-detector: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/ack-node-problem-detector:v0.8.17-952071f-aliyun

  • accel-health-monitor: registry-__ACK_REGION_ID__-vpc.ack.aliyuncs.com/acs/accel-health-monitor:v0.5.4-4c80dfa0-aliyun

2026-02-02

Note

This version is in canary release. To use this version, submit a ticket.

  • Improved the security of ack-node-problem-detector-daemonset.

  • Improved the security of kube-eventer.

  • Added an option on the component configuration page in the ACK console to enable or disable the generation of isolation files for abnormal GPUs.

  • Modified the isolation policies for some GPU detection items. For more information, see GPU anomaly detection.

  • Added support for eRDMA detection.

November 2025

Version

Image address

Release date

Description

1.2.29

  • accel-health-monitor: registry-__ACK_REGION_ID__-vpc.ack.aliyuncs.com/acs/accel-health-monitor:v0.5.3-bafb2ba5-aliyun

  • kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.14-315a7cb-aliyun

2025-11-30

Note

This version is in canary release. To use this version, submit a ticket.

  • Deployed the GPU detection plug-in as a separate DaemonSet named ack-accel-health-monitor instead of including it in ack-node-problem-detector-daemonset. For information about the permissions for ack-accel-health-monitor, see GPU anomaly detection.

  • The GPU detection plug-in can now detect issues related to nvidia-persistenced, nvidia-fabricmanager, and NVLink.

  • Fixed an issue where the GPU plug-in restarted due to intermittent JSON serialization failures.

  • kube-eventer now supports sending data to SLS over HTTPS.

July 2025

Version

Image address

Release date

Description

1.2.27

  • kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.13-b4a3960-aliyun

  • kube-event-init: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-init:v1.9-2b115d6-aliyun

2025-07-24

Note

This version is in canary release. To use this version, submit a ticket.

  • Improved the security of kube-eventer and kube-event-init.

  • ACK dedicated clusters now support the enhanced mode for accessing ECS instance metadata, which uses a more secure authentication method. For more information, see Enforce the enhanced mode to access ECS instance metadata.

June 2025

Version

Image address

Release date

Description

1.2.26

  • ack-node-problem-detector: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/node-problem-detector:v0.8.16-8d2193b-aliyun

  • npd-gpu: registry-__ACK_REGION_ID__-vpc.ack.aliyuncs.com/acs/npd-gpu-plugin:v0.4.1-7359b830-aliyun

  • kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.12-c7c1896-aliyun

  • kube-event-init: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-init:v1.8-e43647f-aliyun

2025-06-11

Note

This version is in canary release. To use this version, submit a ticket.

  • Fixed an issue where the NvidiaDeviceRecovered event was not emitted in some GPU self-healing scenarios.

  • Reduced the image size of ack-node-problem-detector.

Version

Image address

Release date

Description

1.2.25

  • ack-node-problem-detector: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/node-problem-detector:v0.8.16-8ed7053-aliyun

  • npd-gpu: registry-__ACK_REGION_ID__-vpc.ack.aliyuncs.com/acs/npd-gpu-plugin:v0.4.0-e434dc36-aliyun

  • kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.12-c7c1896-aliyun

  • kube-event-init: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-init:v1.8-e43647f-aliyun

2025-06-06

Note

This version is in canary release. To use this version, submit a ticket.

  • Added the npd-gpu container for GPU fault detection.

  • Added support for isolating specific GPUs when a fault is detected.

  • Added support for multiple detection items, including NvidiaXID44Error, NvidiaXID61Error, NvidiaXID62Error, and NvidiaXID69Error. For more information, see GPU anomaly detection and automatic isolation.

  • You can now configure which GPU detection items to enable in ack-node-problem-detector-config.

  • Reduced the image size of ack-node-problem-detector.

August 2024

Version

Image address

Release date

Description

1.2.20

  • ack-node-problem-detector: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/node-problem-detector:v0.8.14-3c6002c-aliyun

  • kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.11-0620284-aliyun

  • kube-event-init: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-init:v1.8-e43647f-aliyun

2024-08-20

  • Added support for GPU fault inspection on ECS nodes.

  • Upgraded the kube-eventer component to improve performance during large-scale event reporting.

  • Upgraded the kube-eventer component to support the V4 signature algorithm for Simple Log Service data transmission.

  • Added a parameter to configure the local port of the ack-node-problem-detector DaemonSet pod to 20256 or 20257. This port is disabled by default.
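
To check whether the local port is enabled on a node, the following minimal sketch attempts a TCP connection to each of the two port numbers mentioned above. It assumes that you run it on the node itself (or in a pod that uses the host network) and that an enabled port accepts TCP connections on 127.0.0.1. The release note does not state which address the port binds to or what it serves, so both points are assumptions, and the helper name is_port_open is hypothetical.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Port numbers taken from the release note; the host is an assumption.
for port in (20256, 20257):
    state = "open" if is_port_open("127.0.0.1", port) else "closed"
    print(f"port {port}: {state}")
```

Because the port is disabled by default, both ports are expected to be reported as closed unless the parameter has been set.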

December 2023

Version

Image address

Release date

Description

v1.2.18

  • ack-node-problem-detector: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/node-problem-detector:v0.8.13-003ac31-aliyun

  • kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.8-27a468a-aliyun

  • kube-event-init: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-init:v1.7-48a2acc-aliyun

2023-12-18

  • Fixed an issue where cached historical kernel logs caused false positive PodOOMKilling events.

  • ack-node-problem-detector now retains custom component parameters when you upgrade it from an earlier version.

August 2023

Version

Image address

Release date

Description

v1.2.17

  • ack-node-problem-detector: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/node-problem-detector:v0.8.12-bf8aff8-aliyun

  • kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.8-27a468a-aliyun

  • kube-event-init: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-init:v1.7-48a2acc-aliyun

2023-08-24

  • You can now update the Simple Log Service Project and Logstore configurations by modifying the component parameters on the Add-ons page in the ACK console.

  • You can now attach additional tags, such as cluster names, when you send log data to Simple Log Service. These tags are then displayed by default in the ACK event center.

June 2023

Version

Image address

Release date

Description

v1.2.16

  • ack-node-problem-detector: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/node-problem-detector:v0.8.12-bf8aff8-aliyun

  • kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.8-019546c-aliyun

  • kube-event-init: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-init:v1.7-48a2acc-aliyun

2023-06-27

You can now configure the resource specification parameters for the component on the Add-ons page in the ACK console.

v1.2.15

  • ack-node-problem-detector: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/node-problem-detector:v0.8.12-bf8aff8-aliyun

  • kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.8-019546c-aliyun

  • kube-event-init: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-init:v1.7-48a2acc-aliyun

2023-06-06

Improved the performance of ack-node-problem-detector. This reduces the load on the API server and etcd when PodOOMKilling events occur frequently in large-scale clusters.

February 2023

Version

Image address

Release date

Description

v1.2.14

  • ack-node-problem-detector: registry.aliyuncs.com/acs/node-problem-detector:v0.8.11-edc7907-aliyun

  • kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.6-bbf76f7-aliyun

  • kube-event-init: registry.{ .Values.controller.regionId }.aliyuncs.com/acs/kube-eventer-init:v1.7-48a2acc-aliyun

2023-02-03

  • Reduced image pull times.

  • Added support for ACK Edge clusters.

September 2022

Version

Image address

Release date

Description

v1.2.11

  • ack-node-problem-detector: registry.aliyuncs.com/acs/node-problem-detector:v0.8.11-edc7907-aliyun

  • kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer:v1.2.6-bbf76f7-aliyun

  • kube-event-init: registry.{ .Values.controller.regionId }.aliyuncs.com/acs/kube-eventer-init:v1.7-48a2acc-aliyun

2022-09-30

  • Improved the inspection logic of ack-node-problem-detector to reduce the load on core cluster components.

  • Improved image security.

February 2022

Version

Image address

Release date

Description

v1.2.9

  • ack-node-problem-detector: registry.aliyuncs.com/acs/node-problem-detector:v0.8.10-e0ff7d2

  • kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-amd64:v1.2.6-f0efecf-aliyun

  • kube-event-init: registry.{ .Values.controller.regionId }.aliyuncs.com/acs/kube-eventer-init:v1.6-a92aba6-aliyun

2022-02-22

  • Added support for kernel inspection.

  • Enhanced security.

January 2022

Version

Image address

Release date

Description

v1.2.8

  • ack-node-problem-detector: registry.aliyuncs.com/acs/node-problem-detector:v0.8.10-e0ff7d2

  • kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-amd64:v1.2.5-cc7ec54-aliyun

  • kube-event-init: registry.{ .Values.controller.regionId }.aliyuncs.com/acs/kube-eventer-init:v1.6-a92aba6-aliyun

2022-01-20

  • Added support for different containerd modes.

  • Optimized the resource Quality of Service (QoS) limits for the component to improve stability.

November 2021

Version

Image address

Release date

Description

v1.2.7

  • ack-node-problem-detector: registry.aliyuncs.com/acs/node-problem-detector:v0.8.10-e0ff7d2

  • kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-amd64:v1.2.5-cc7ec54-aliyun

  • kube-event-init: registry.{ .Values.controller.regionId }.aliyuncs.com/acs/kube-eventer-init:v1.6-a92aba6-aliyun

2021-11-25

  • Added compatibility with system services on operating systems such as Alibaba Cloud Linux 3 and CentOS 8.

  • Added support for ARM architecture environments.

April 2021

Version

Image address

Release date

Description

v1.2.5

  • ack-node-problem-detector: registry.aliyuncs.com/acs/node-problem-detector:v0.6.3-28-160499f

  • kube-eventer: registry-vpc.__ACK_REGION_ID__.aliyuncs.com/acs/kube-eventer-amd64:v1.2.4-0f5aaee-aliyun

  • kube-event-init: registry.{ .Values.controller.regionId }.aliyuncs.com/acs/kube-eventer-init:1.5-5e0e7c1-aliyun

2021-04-25

  • Fixed an issue where kube-event-init in the kube-system namespace returned a "414 Request Too Large" error when the event center was enabled.

  • Improved the list-watch mechanism of kube-eventer to prevent excessive request traffic to etcd. For more information, see eventer list-watch.

  • Fixed an issue where kube-eventer incorrectly parsed the timestamps of some system events. For more information, see fix FailedScheduling event write to sls with wrong timestamp.

July 2020

Version

Image address

Release date

Description

v0.6.3-28-160499f

registry.aliyuncs.com/acs/node-problem-detector:v0.6.3-28-160499f

2020-07-27

  • Enhanced OOMKilling event messages to include pod names, namespaces, and UIDs.

  • Improved the execution efficiency of the check_fd plug-in.

  • Improved event notifications for node PID usage.

  • Upgraded the network diagnostics plug-in.

  • Added a plug-in to monitor and send alerts for inode usage on node system disks.