Detect & Alert on ACK Node Anomalies with NPD and Event Center

Kubernetes generates events when cluster state changes — for example, when a pod is evicted, an image pull fails, or a node reports a hardware exception. By default, Kubernetes retains these events for only one hour. To detect anomalies in time and avoid losing event data, ACK provides an end-to-end event monitoring solution that collects, stores, alerts on, and archives cluster events.

ACK sinks cluster events to Simple Log Service by default, storing them free of charge for 90 days. For more information, see Create and use an event center.

How it works

ACK uses two open-source tools for event monitoring:

node-problem-detector (NPD) — Diagnoses Kubernetes nodes. NPD detects node exceptions (Docker engine hangs, Linux kernel hangs, outbound traffic exceptions, and file descriptor exceptions), generates node events, and works with kube-eventer to close the alert management loop. For more information, see NPD.
kube-eventer — An open-source event emitter maintained by ACK. kube-eventer forwards Kubernetes events to sinks such as DingTalk, Simple Log Service, and EventBridge. It filters events by severity level, enabling real-time collection, alerting on specific events, and asynchronous archiving. For more information, see kube-eventer.

Kubernetes generates two event types:

Type	When generated	Examples
Normal	State machine transitions to an expected state	Pod scheduled, container started
Warning	State machine transitions to an unexpected state	Pod evicted, image pull failed, `CrashLoopBackOff`
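To make the distinction concrete, here is a minimal Python sketch that groups events by their `type` field. It assumes the events are plain dictionaries shaped like the items returned by `kubectl get events -o json`. The `split_by_type` helper and the sample events are illustrative only and are not part of ACK or kube-eventer.

```python
from collections import defaultdict


def split_by_type(events):
    """Group event dicts by their `type` field ("Normal" or "Warning")."""
    groups = defaultdict(list)
    for event in events:
        # Events without a type are grouped under "Unknown" rather than dropped.
        groups[event.get("type", "Unknown")].append(event)
    return dict(groups)


# Hypothetical sample data, not output captured from a real cluster.
sample_events = [
    {"type": "Normal", "reason": "Scheduled", "message": "Assigned default/web-0 to node-1"},
    {"type": "Warning", "reason": "Failed", "message": "Failed to pull image"},
    {"type": "Warning", "reason": "BackOff", "message": "Back-off restarting failed container"},
]

grouped = split_by_type(sample_events)
warning_reasons = [e["reason"] for e in grouped.get("Warning", [])]
print(warning_reasons)
```

A filter like this is the kind of step that sits in front of an alerting sink, so that only Warning events produce notifications.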

Choose a scenario

Select the scenario that fits your monitoring requirements:

Scenario	When to use
Scenario 1: Use NPD with the Simple Log Service event center	Get started quickly. Sink all cluster events to Simple Log Service for visualization, querying, and alerting from one place.
Scenario 2: Run NPD node diagnosis	Detect node-level exceptions (disk pressure, network issues, Docker daemon failures) and trigger closed-loop alert management.
Scenario 3: Send alerts to DingTalk	Route Warning events to a DingTalk group for real-time ChatOps notifications.
Scenario 4: Archive events to Simple Log Service	Persist Kubernetes events in a dedicated Logstore for long-term storage, indexing, and offline analysis.
Scenario 5: Route events to EventBridge	Build an event-driven architecture — trigger automated remediation or downstream workflows from cluster events.

Scenario 1: Use NPD with the Simple Log Service event center

NPD works with third-party plug-ins to detect node exceptions and generate cluster events. Kubernetes also generates events when cluster state changes — for example, when a pod is evicted or an image pull fails. The Kubernetes event center of Simple Log Service collects, stores, and visualizes these events, with built-in querying and alerting.

Step 1: Install the ack-node-problem-detector component

If you selected Install node-problem-detector and Create Event Center when creating the cluster, skip to Step 2: View the event center. For more information about installing during cluster creation, see Create an ACK managed cluster.

If you did not select this option, install the component manually:

Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster and click its name. In the left-side navigation pane, click Add-ons.
On the Logs and Monitoring tab, find and install ack-node-problem-detector.

Step 2: View the event center

Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster and click its name. In the left-side navigation pane, choose Operations > Event Center.
On the Event Center page, click the Event Overview (Event Center) tab. The tab shows all Kubernetes events, their detailed information, and the pod lifecycle. For more information, see Collect Kubernetes events.

Scenario 2: Run NPD node diagnosis

NPD integrated with kube-eventer provides closed-loop alert management for node events. When NPD detects an exception, kube-eventer forwards the event to the configured sinks and raises an alert.

Prerequisites

Before you begin, ensure that you have:

The ack-node-problem-detector component installed. If not installed, see Step 1: Install the ack-node-problem-detector component. If already installed, reinstall it first — see Reinstall the ack-node-problem-detector component.

Verify the DaemonSet is running

On the Clusters page, find the cluster and click its name. In the left-side navigation pane, choose Workloads > DaemonSets.
On the DaemonSets tab, select the kube-system namespace and verify that ack-node-problem-detector-daemonset is running. When both NPD and kube-eventer are running, the system sinks events and raises alerts based on the kube-eventer configuration.
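The same check can be scripted. The sketch below assumes you have saved the output of `kubectl get daemonset ack-node-problem-detector-daemonset -n kube-system -o json` and loaded it into a dictionary. The `daemonset_is_ready` helper is hypothetical, and the example status values are made up for illustration.

```python
def daemonset_is_ready(daemonset):
    """Return True when every scheduled pod of the DaemonSet reports ready."""
    status = daemonset.get("status", {})
    desired = status.get("desiredNumberScheduled", 0)
    ready = status.get("numberReady", 0)
    # A DaemonSet with zero desired pods is treated as not ready here,
    # because no node would be diagnosed in that case.
    return desired > 0 and ready == desired


# Hypothetical status for a three-node cluster.
example = {
    "metadata": {"name": "ack-node-problem-detector-daemonset", "namespace": "kube-system"},
    "status": {"desiredNumberScheduled": 3, "numberReady": 3},
}
print(daemonset_is_ready(example))
```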

Node diagnosis plug-ins

NPD supports the following node diagnosis plug-ins:

Plug-in	What it checks	Default threshold	Enabled by default
`ntp_check`	Whether system clocks are synchronized via Network Time Protocol (NTP)	—	Yes
`network_problem_check`	Whether connection tracking (conntrack) table usage exceeds the threshold	90%	Yes
`inodes_usage_check`	Whether inode usage on the system disk exceeds the threshold (adjustable)	80%	Yes
`pid_pressure_check`	Whether the ratio of process IDs (PIDs) in use to the kernel maximum exceeds the threshold	85%	Yes
`docker_offline_check`	Whether the Docker daemon is running	—	Yes
`fd_check`	Whether file descriptor usage exceeds the threshold (adjustable). Note: This plug-in consumes significant resources. Enable it only if needed.	80%	No
`ram_role_check`	Whether nodes have the required RAM role and AccessKey ID/secret configured	—	No
`nvidia_gpu_check`	Whether NVIDIA GPUs can generate `Xid` messages	—	No
`csi_hang_check`	Whether the Container Storage Interface (CSI) plug-in is running	—	No
`ps_hang_check`	Whether processes in the uninterruptible sleep (D) state exist	—	No
`public_network_check`	Whether nodes can access the Internet	—	No
`irqbalance_check`	Whether the `irqbalance` daemon is running	—	No
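Several of these plug-ins compare a usage ratio against a percentage threshold. The following Python sketch shows that kind of comparison under stated assumptions. It is not NPD's implementation, the helper names are hypothetical, and whether the real plug-ins treat the threshold as inclusive or exclusive is not stated in this document (the sketch uses a strict "greater than"). The `inode_usage_percent` function relies on `os.statvfs`, which is available on Unix-like systems only.

```python
import os


def usage_percent(used, total):
    """Return used/total as a percentage, or None when total is not positive."""
    if total <= 0:
        return None
    return 100.0 * used / total


def exceeds_threshold(used, total, threshold_percent):
    """Return True when usage is strictly above the threshold (an assumption)."""
    percent = usage_percent(used, total)
    return percent is not None and percent > threshold_percent


def inode_usage_percent(path="/"):
    """Inode usage of the filesystem that contains `path` (Unix only)."""
    stats = os.statvfs(path)
    # f_files is the total inode count, f_ffree the number of free inodes.
    return usage_percent(stats.f_files - stats.f_ffree, stats.f_files)


# Made-up counts checked against the 80% and 90% defaults from the table.
over_80 = exceeds_threshold(used=850_000, total=1_000_000, threshold_percent=80)
over_90 = exceeds_threshold(used=850_000, total=1_000_000, threshold_percent=90)
print(over_80, over_90)
```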

Scenario 3: Send alerts to DingTalk

Using a DingTalk chatbot to receive Kubernetes event alerts is a typical ChatOps implementation. When a Warning-level event occurs, kube-eventer pushes a message to the DingTalk group via webhook.

Prerequisites

Before you begin, ensure that you have:

The ack-node-problem-detector component installed. If not installed, see Step 1: Install the ack-node-problem-detector component. If already installed, reinstall it first — see Reinstall the ack-node-problem-detector component.

Step 1: Add a DingTalk chatbot

Click the settings icon in the upper-right corner of the chatbox to open Group Settings.
Click Bot, then click Add Robot. Select Custom as the chatbot type.

On the Robot details page, click Add to open the Add Robot page. Set the following parameters, accept the DingTalk Custom Robot Service Terms of Service, and click Finished:

Parameter	Description
Edit profile picture	The avatar of the chatbot. Optional.
Chatbot name	The display name of the chatbot.
Add to Group	The DingTalk group to add the chatbot to.
Security settings	Three options are supported: Custom Keywords, additional signatures, and IP addresses (or CIDR blocks). Only Custom Keywords are supported for filtering cluster event alerts. Select Custom Keywords and enter `Warning`. Add more keywords to narrow the filter — up to 10 keywords.


Click Copy to copy the webhook URL.
On the ChatBot page, click the icon next to the chatbot to modify its avatar and name, enable or disable message push, reset the webhook URL, or remove the chatbot.

Step 2: Configure kube-eventer to send events to DingTalk

On the Clusters page, find the cluster and click its name. In the left-side navigation pane, choose Applications > Helm.
On the Helm page, find ack-node-problem-detector and click Update in the Actions column.
Set the following parameters, then click OK:
- In the npd section, set enabled to false.
- Set eventer.sinks.dingtalk.enabled to true.
- Enter the token from the webhook URL you copied.
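If you prefer not to pick the token out of the URL by hand, a small script can do it. This sketch assumes the webhook URL carries the token in an `access_token` query parameter, which is the usual form of a DingTalk custom robot webhook. The `extract_access_token` helper is hypothetical, and the token shown is a placeholder.

```python
from urllib.parse import urlparse, parse_qs


def extract_access_token(webhook_url):
    """Return the access_token query parameter of a DingTalk webhook URL."""
    query = parse_qs(urlparse(webhook_url).query)
    tokens = query.get("access_token", [])
    if len(tokens) != 1 or not tokens[0]:
        raise ValueError("webhook URL must contain exactly one access_token parameter")
    return tokens[0]


# Placeholder token, not a real credential.
url = "https://oapi.dingtalk.com/robot/send?access_token=0123456789abcdef"
token = extract_access_token(url)
print(token)
```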

Result

kube-eventer takes effect 30 seconds after deployment. When a Warning-level event occurs, an alert is sent to the DingTalk group.
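To confirm that the chatbot itself accepts messages before you wait for a real cluster event, you can post a manual test message to the webhook. The sketch below only builds the request; it does not send it. It assumes the DingTalk custom robot accepts a JSON body of the form `{"msgtype": "text", "text": {"content": ...}}` and that the Custom Keywords setting from Step 1 is `Warning`. The `build_test_request` helper is hypothetical, and the message text is not the format that kube-eventer uses.

```python
import json
import urllib.request


def build_test_request(webhook_url, text, keyword="Warning"):
    """Build (but do not send) a POST request carrying a DingTalk text message."""
    if keyword not in text:
        # With Custom Keywords security, a message without the keyword is rejected.
        raise ValueError(f"message must contain the keyword {keyword!r}")
    payload = {"msgtype": "text", "text": {"content": text}}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# PLACEHOLDER stands in for the token from your own webhook URL.
request = build_test_request(
    "https://oapi.dingtalk.com/robot/send?access_token=PLACEHOLDER",
    "Warning: manual test message for the ACK event alert group",
)
print(request.get_method(), json.loads(request.data))
# To actually send it, call urllib.request.urlopen(request) with a real token.
```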

Scenario 4: Archive events to Simple Log Service

Sink Kubernetes events to Simple Log Service for persistent storage, indexing, and audit. For long-term retention or offline archiving, configure the Logstore settings and optional data shipping to MaxCompute or Object Storage Service (OSS). For more information, see Create and use an event center.

Prerequisites

Before you begin, ensure that you have:

The ack-node-problem-detector component installed. Specify the project and Logstore during installation. If already installed, reinstall it first — see Reinstall the ack-node-problem-detector component.

Step 1: Create a Simple Log Service project and Logstore

Log on to the Simple Log Service console.
In the Projects section, click Create Project. In the Create Project panel, set the parameters and click Create.
Create the Simple Log Service project in the same region as your cluster. Same-region transmission uses the internal network, which reduces latency and eliminates cross-region bandwidth costs. This example creates a project named k8s-log4j in the China (Hangzhou) region.
In the Projects section, click k8s-log4j to open the project details page.
In the Logstores pane, click + to open the Create Logstore panel.
Set the parameters and click OK. This example creates a Logstore named k8s-logstore.
After the Logstore is created, click Data Import Wizard. The Import Data dialog box appears.
Select Log4j 1/2 and complete the configuration steps. This example uses the default settings.

Step 2: Configure kube-eventer to sink events to Simple Log Service

On the Clusters page, find the cluster and click its name. In the left-side navigation pane, choose Applications > Helm.
On the Helm page, find ack-node-problem-detector and click Update in the Actions column.
Set the following parameters, then click OK:
- In the npd section, set enabled to false.
- Set eventer.sinks.sls.enabled to true.
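The dotted parameter names refer to nested keys in the chart's values. As a rough illustration, the Python sketch below expands the two settings above into a nested dictionary. It assumes the values follow the nesting implied by the dotted names. The real chart contains many more keys, and the `set_dotted` helper is hypothetical.

```python
def set_dotted(values, dotted_key, value):
    """Set `value` in a nested dict at the path given by a dotted key."""
    node = values
    *parents, leaf = dotted_key.split(".")
    for part in parents:
        # Create intermediate mappings on demand.
        node = node.setdefault(part, {})
    node[leaf] = value
    return values


values = {}
set_dotted(values, "npd.enabled", False)
set_dotted(values, "eventer.sinks.sls.enabled", True)
print(values)
```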

Step 3: Verify event collection and set up indexing

Trigger a cluster event — for example, delete a pod or create an application.
Log on to the Simple Log Service console to view the collected events. For more information, see Consume log data by using Simple Log Service SDK.
Set up indexing for query and analysis. For more information, see Create indexes.
1. In the Projects section, click the project name.
2. Click the icon next to the Logstore name and select Search & Analysis.
3. In the upper-right corner, click Enable Index.
4. In the Search & Analysis panel, set the parameters and click OK. The log query and analysis page appears.
   Note: Index configuration takes effect within 1 minute and applies only to data ingested after the index is enabled or modified.
5. (Optional) To archive events for offline analysis, ship data from the Logstore to MaxCompute or Object Storage Service (OSS). For more information, see Create a data shipping job of the new version to ship data to MaxCompute and Create an OSS data shipping job (new version).

Scenario 5: Route events to EventBridge

EventBridge is a serverless event service that lets Alibaba Cloud services, custom applications, and software as a service (SaaS) applications connect in a standardized, centralized way. Routing ACK events to EventBridge lets you build a loosely coupled, event-driven architecture — for example, triggering automated remediation workflows when specific events occur. For more information, see What is EventBridge?.

Prerequisites

Before you begin, ensure that you have:

EventBridge activated. For more information, see Activate EventBridge and grant permissions to a RAM user.
The ack-node-problem-detector component installed. If not installed, see Step 1: Install the ack-node-problem-detector component. If already installed, reinstall it first — see Reinstall the ack-node-problem-detector component.

Enable EventBridge as an event sink

On the Clusters page, find the cluster and click its name. In the left-side navigation pane, choose Applications > Helm.
On the Helm page, find ack-node-problem-detector and click Update in the Actions column.
Set eventer.sinks.eventbridge.enable to true, then click OK.

View Kubernetes events in EventBridge

Log on to the EventBridge console.
In the left-side navigation pane, click Event Buses.
On the Event Buses page, click the target event bus.
In the left-side navigation pane, click Event Tracking.
Select a query method, set query conditions, and click Query.
Find the event and click Details in the Actions column. For more information, see Query events by event ID.
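Once events reach EventBridge, a downstream consumer can map them to remediation actions. The following Python sketch shows one way to dispatch on an event's `reason`. It assumes the consumer has already unpacked the EventBridge envelope into a dictionary with the Kubernetes event fields `reason`, `message`, and `involvedObject`. The envelope format is not described in this document, and the handlers, action strings, and sample event are all hypothetical.

```python
def route_event(event, handlers, default=None):
    """Pick a handler by the event's `reason`, falling back to `default`."""
    handler = handlers.get(event.get("reason"), default)
    return handler(event) if handler else None


# Hypothetical remediation actions; real ones would call cluster or cloud APIs.
def cordon_node(event):
    return f"cordon {event['involvedObject']['name']}"


def notify_on_call(event):
    return f"notify on-call: {event['message']}"


handlers = {"NodeHasDiskPressure": cordon_node}

# Made-up event for illustration only.
disk_event = {
    "type": "Warning",
    "reason": "NodeHasDiskPressure",
    "message": "Node node-1 status is now: NodeHasDiskPressure",
    "involvedObject": {"kind": "Node", "name": "node-1"},
}
action = route_event(disk_event, handlers, default=notify_on_call)
print(action)
```

Keeping the mapping in a dictionary means a new remediation can be added without touching the routing logic, and unknown reasons fall through to a notification instead of being dropped.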

Reinstall the ack-node-problem-detector component

Reinstall the component when you change the sink target (for example, when you switch from Simple Log Service to DingTalk) or update the component configuration.

Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster and click its name. In the left-side navigation pane, choose Workloads > Jobs.
On the Jobs page, click More next to kube-eventer-init-v1.7-xxxx and click Delete.
On the Clusters page, find the cluster and click its name. In the left-side navigation pane, choose Applications > Helm.
On the Helm page, delete the ack-node-problem-detector component.
On the Clusters page, find the cluster and click its name. In the left-side navigation pane, choose Operations > Add-ons.
On the Logs and Monitoring tab, find and reinstall ack-node-problem-detector.