Event monitoring is a monitoring method provided by Kubernetes. Compared with resource monitoring, it offers better timeliness and accuracy and covers more scenarios. You can use node-problem-detector with the Kubernetes event center of Log Service to collect and store cluster events, and configure node-problem-detector to diagnose clusters and send error events to sinks such as DingTalk, Log Service, and EventBridge. This allows you to monitor exceptions and issues in clusters in real time.
Background information
Kubernetes is designed based on the state machine. Events are generated due to transitions between different states. Typically, Normal events are generated when the state machine changes to expected states and Warning events are generated when the state machine changes to unexpected states.
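For example, you can list the Warning events that a cluster currently reports with kubectl; this is the same event stream that the tools described below collect and forward:

```bash
# List recent Warning events in all namespaces, newest last.
kubectl get events --all-namespaces --field-selector type=Warning --sort-by=.lastTimestamp
```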

- node-problem-detector is a tool to diagnose Kubernetes nodes. node-problem-detector detects node exceptions, generates node events, and works with kube-eventer to raise alerts upon these events and enable closed-loop management of alerts. node-problem-detector generates node events when the following exceptions are detected: Docker engine hangs, Linux kernel hangs, outbound traffic exceptions, and file descriptor exceptions. For more information, see NPD.
- kube-eventer is an open source event emitter that is maintained by ACK. kube-eventer sends Kubernetes events to sinks such as DingTalk, Log Service, and EventBridge. kube-eventer also provides filter conditions to filter different levels of events. You can use kube-eventer to collect events in real time, trigger alerts upon specific events, and asynchronously archive events. For more information, see kube-eventer.
This topic describes how to configure event monitoring in the following scenarios:
Scenario 1: Use node-problem-detector with the Kubernetes event center of Log Service to sink cluster events
node-problem-detector works with third-party plug-ins to detect node exceptions and generate cluster events. A Kubernetes cluster also generates events when the status of the cluster changes. For example, when a pod is evicted or an image pull operation fails, a related event is generated. The Kubernetes event center of Log Service collects, stores, and visualizes cluster events. It allows you to query and analyze these events, and configure alerts. You can sink cluster events to the Kubernetes event center of Log Service by using the following methods.
Method 1: If Install node-problem-detector and Create Event Center was selected when you created the cluster, perform the following steps to go to the Kubernetes event center. For more information about how to install node-problem-detector and deploy the Kubernetes event center when you create a cluster, see Create an ACK managed cluster.
- Log on to the ACK console.
- In the left-side navigation pane of the ACK console, click Clusters.
- On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.
- Choose .
Click Cluster Events Management in the upper-right corner of the page to go to the K8s Event Center page. In the left-side navigation pane of the K8s Event Center page, find the cluster that you want to manage and click the icon to the left of the cluster name. You can view event details that are provided by the Kubernetes event center.
The Kubernetes event center provides event overview, event details, and information about pod lifecycles. You can also customize queries and configure alerts.
Method 2: If the Kubernetes event center was not deployed when you created the cluster, perform the following steps to deploy and use the Kubernetes event center:
- Install node-problem-detector in the monitored cluster and enable Log Service. For more information, see Scenario 3: Use DingTalk to raise alerts upon Kubernetes events. Note If node-problem-detector is deployed but Log Service is not enabled, reinstall node-problem-detector.
- In the left-side navigation pane of the ACK console, click Clusters.
- On the Clusters page, find the cluster that you want to manage and click its name or click Details in the Actions column.
- Choose .
- On the Helm page, delete the ack-node-problem-detector release to uninstall node-problem-detector.
When you configure the parameters for node-problem-detector, create a Log Service project for the cluster by setting eventer.sinks.sls.enabled to true. After node-problem-detector is redeployed, a Log Service project is automatically created in the Log Service console for the cluster.
- Log on to the Log Service console to configure the Kubernetes event center for the cluster.
- In the Import Data section, click Kubernetes - Standard Output.
- Select the Log Service project that is automatically created in the preceding step from the Project drop-down list, and select k8s-event from the Logstore drop-down list.
- Click Next and click Complete Installation.
- In the Projects section of the Log Service console, find and click the Log Service project.
- In the left-side navigation pane, click the Dashboard icon and click Kubernetes Event Center V1.5. On the dashboard of the Kubernetes event center, you can view all cluster events.
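To confirm that events flow into the event center end to end, you can deliberately generate a Warning event with kubectl and then search for it on the dashboard. The following is a minimal sketch; the pod name bad-image and the image URL are purely illustrative:

```bash
# Create a pod whose image cannot be pulled; this produces Warning events
# (for example, failed image pulls) that the event center collects.
kubectl run bad-image --image=registry.example.invalid/does-not-exist:latest
# Inspect the events locally before looking them up in the event center.
kubectl get events --field-selector involvedObject.name=bad-image
# Remove the test pod when you are done.
kubectl delete pod bad-image
```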
Scenario 2: Configure node-problem-detector to diagnose a cluster and send events of exceptions to sinks
node-problem-detector is a tool to diagnose Kubernetes nodes. node-problem-detector detects node exceptions, generates node events, and works with kube-eventer to raise alerts upon these events and enable closed-loop management of alerts. node-problem-detector generates node events when the following exceptions are detected: Docker engine hangs, Linux kernel hangs, outbound traffic exceptions, and file descriptor exceptions. Perform the following steps to install and configure node-problem-detector.
- Log on to the ACK console.
- In the left-side navigation pane, choose Marketplace. On the App Catalog tab, find and click ack-node-problem-detector. Note If the Kubernetes event center is deployed, you must first uninstall the ack-node-problem-detector component.
- In the left-side navigation pane of the ACK console, click Clusters.
- On the Clusters page, find the cluster that you want to manage and click its name or click Details in the Actions column.
- Choose .
- On the Helm page, delete the ack-node-problem-detector release.
- On the ack-node-problem-detector page, click Deploy, select a cluster, and then configure the required parameters on the Parameters wizard page. The namespace is automatically set to kube-system and the release name is automatically set to ack-node-problem-detector. You can set the sink parameters for kube-eventer as described in the following table.
Table 1. Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| npd.image.repository | The image address of node-problem-detector | registry.aliyuncs.com/acs/node-problem-detector |
| npd.image.tag | The image version of node-problem-detector | v0.6.3-28-160499f |
| alibaba_cloud_plugins | The plug-ins that are used to diagnose nodes. For more information, see the Node diagnosis plug-ins supported by node-problem-detector table. | fd_check, ntp_check, network_problem_check, and inode_usage_check |
| plugin_settings.check_fd_warning_percentage | The alerting threshold for the percentage of opened file descriptors monitored by fd_check | 80 |
| plugin_settings.inode_warning_percentage | The alerting threshold for the inode usage | 80 |
| controller.regionId | The region where the cluster that has ack-node-problem-detector installed is deployed. Only cn-hangzhou, cn-beijing, cn-shenzhen, and cn-shanghai are supported. | The region where the cluster that has the plug-in installed is deployed |
| controller.clusterType | The type of the cluster where ack-node-problem-detector is installed | ManagedKubernetes |
| controller.clusterId | The ID of the cluster where ack-node-problem-detector is installed | The ID of the cluster where ack-node-problem-detector is installed |
| controller.clusterName | The name of the cluster where ack-node-problem-detector is installed | The name of the cluster where ack-node-problem-detector is installed |
| controller.ramRoleType | The type of the assigned Resource Access Management (RAM) role. A value of restricted indicates that token-based authentication is enabled for the RAM role. | The default RAM role type assigned to the cluster |
| eventer.image.repository | The image address of kube-eventer | registry.cn-hangzhou.aliyuncs.com/acs/eventer |
| eventer.image.tag | The image version of kube-eventer | v1.6.0-4c4c66c-aliyun |
| eventer.image.pullPolicy | The pull policy of the kube-eventer image | IfNotPresent |
| eventer.sinks.sls.enabled | Specifies whether to enable Log Service as a sink of kube-eventer | false |
| eventer.sinks.sls.project | The name of the Log Service project | N/A |
| eventer.sinks.sls.logstore | The name of the Logstore in the Log Service project | N/A |
| eventer.sinks.dingtalk.enabled | Specifies whether to enable DingTalk as a sink of kube-eventer | false |
| eventer.sinks.dingtalk.level | The level of events at which alerts are raised | warning |
| eventer.sinks.dingtalk.label | The labels of the events | N/A |
| eventer.sinks.dingtalk.token | The token of the DingTalk chatbot | N/A |
| eventer.sinks.dingtalk.monitorkinds | The type of resource for which event monitoring is enabled | N/A |
| eventer.sinks.dingtalk.monitornamespaces | The namespace of the resources for which event monitoring is enabled | N/A |
| eventer.sinks.eventbridge.enable | Specifies whether to enable EventBridge as a sink of kube-eventer | false |
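The console deploys the chart with these parameters for you. If you prefer the command line, the following is a minimal sketch only, assuming you have pulled the ack-node-problem-detector chart locally into ./ack-node-problem-detector and want Log Service as the sink; <your-sls-project> is a placeholder for your project name:

```bash
# Sketch: install the chart into kube-system with Log Service enabled as the kube-eventer sink.
helm install ack-node-problem-detector ./ack-node-problem-detector \
  --namespace kube-system \
  --set eventer.sinks.sls.enabled=true \
  --set eventer.sinks.sls.project=<your-sls-project> \
  --set eventer.sinks.sls.logstore=k8s-event
```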
Node diagnosis plug-ins supported by node-problem-detector are listed in the following table.
| Plug-in | Feature | Description |
| --- | --- | --- |
| fd_check | Checks whether the percentage of opened file descriptors on each cluster node exceeds 80% | The default threshold is 80% and is adjustable. This plug-in consumes a considerable amount of resources to perform the check. We recommend that you do not enable this plug-in. |
| ram_role_check | Checks whether cluster nodes are assigned the required RAM role and whether the AccessKey ID and AccessKey secret are configured for the RAM role | N/A |
| ntp_check | Checks whether the system clocks of cluster nodes are properly synchronized through Network Time Protocol (NTP) | This plug-in is enabled by default. |
| nvidia_gpu_check | Checks whether the NVIDIA GPUs of cluster nodes generate Xid messages | N/A |
| network_problem_check | Checks whether the connection tracking (conntrack) table usage on each cluster node exceeds 90% | This plug-in is enabled by default. |
| inodes_usage_check | Checks whether the inode usage on the system disk of each cluster node exceeds 80% | The default threshold is 80% and is adjustable. This plug-in is enabled by default. |
| csi_hang_check | Checks whether the Container Storage Interface (CSI) plug-in works as expected on cluster nodes | N/A |
| ps_hang_check | Checks whether processes in the uninterruptible sleep (D) state exist in the systems of cluster nodes | N/A |
| public_network_check | Checks whether cluster nodes can access the Internet | N/A |
| irqbalance_check | Checks whether the irqbalance daemon works as expected in the systems of cluster nodes | N/A |
| pid_pressure_check | Checks whether the ratio of the number of processes on a node to the maximum number of PIDs allowed by the kernel exceeds 85% | This plug-in is enabled by default. |
| docker_offline_check | Checks whether the Docker daemon works as expected on cluster nodes | This plug-in is enabled by default. |

Note Some plug-ins are enabled by default, as shown in the preceding table. These plug-ins are enabled if you select Install node-problem-detector and Create Event Center when you create the cluster, or if you install the ack-node-problem-detector component on the Add-ons page. You must manually enable these plug-ins when you deploy the ack-node-problem-detector component from the App Catalog page.
- On the Parameters wizard page, click OK.
Go to the Clusters page, find the monitored cluster, and click its name or click Applications in the Actions column. On the page that appears, click the DaemonSets tab and verify that ack-node-problem-detector-daemonset is running as expected.
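You can also verify this from the command line, assuming kubectl access to the cluster:

```bash
# Confirm that the DaemonSet has the expected number of ready pods.
kubectl -n kube-system get daemonset ack-node-problem-detector-daemonset
# Node conditions and events reported by node-problem-detector appear on the node objects.
kubectl describe node <node-name>
```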
When both node-problem-detector and kube-eventer work as expected, the system sinks events and raises alerts based on the kube-eventer configurations.
Scenario 3: Use DingTalk to raise alerts upon Kubernetes events
Using a DingTalk chatbot to monitor Kubernetes events and raise alerts is a typical scenario of ChatOps. Perform the following steps to install and configure node-problem-detector.
- Click the settings icon in the upper-right corner of the chatbox of a DingTalk group to open the Group Settings page.
- Click Group Assistant, and then click Add Robot. In the ChatBot dialog box, click the + icon and select the chatbot that you want to use. In this example, Custom is selected.
- On the Robot details page, click Add to open the Add Robot page.
- Set the following parameters, read and accept the DingTalk Custom Robot Service Terms of Service, and then click Finished.
| Parameter | Description |
| --- | --- |
| Edit profile picture | The avatar of the chatbot. This parameter is optional. |
| Chatbot name | The name of the chatbot. |
| Add to Group | The DingTalk group to which the chatbot is added. |
| Security settings | Three types of security settings are supported: custom keywords, additional signatures, and IP addresses (or CIDR blocks). Only Custom Keywords are supported for filtering alerts that are raised upon cluster events. |
Select Custom Keywords and enter Warning to receive alerts. If the chatbot frequently sends messages, you can add more keywords to filter the messages. You can add up to 10 keywords. Messages from ACK are also filtered through these keywords before the chatbot sends them to the DingTalk group.
- Click Copy to copy the webhook URL. Note On the ChatBot page, find the chatbot and click the settings icon to perform the following operations:
- Modify the avatar and name of the chatbot.
- Enable or disable message push.
- Reset the webhook URL.
- Remove the chatbot.
- Log on to the ACK console.
- In the left-side navigation pane, choose Marketplace. On the Marketplace page, click the App Catalog tab, and then find and click ack-node-problem-detector. Note If the Kubernetes event center is deployed, you must first uninstall the ack-node-problem-detector component.
- In the left-side navigation pane of the ACK console, click Clusters.
- On the Clusters page, find the cluster that you want to manage and click its name or click Details in the Actions column.
- Choose .
- On the Helm page, delete the ack-node-problem-detector release.
- On the ack-node-problem-detector page, click Deploy, select a cluster and namespace, and then click Next. On the Parameters wizard page, configure the required parameters and click OK.
- In the npd section, set the enabled parameter to false.
- In the eventer.sinks.dingtalk section, set the enabled parameter to true.
- Enter the token that is contained in the webhook URL generated in Step 5.
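If you would rather apply these settings from the command line, the following is a sketch only; it assumes the chart is available locally as ./ack-node-problem-detector, that the flattened keys npd.enabled and eventer.sinks.dingtalk.* correspond to the sections above, and that <DINGTALK_TOKEN> is the token copied from the webhook URL:

```bash
# Sketch: redeploy with only the DingTalk sink enabled.
helm upgrade --install ack-node-problem-detector ./ack-node-problem-detector \
  --namespace kube-system \
  --set npd.enabled=false \
  --set eventer.sinks.dingtalk.enabled=true \
  --set eventer.sinks.dingtalk.level=warning \
  --set eventer.sinks.dingtalk.token=<DINGTALK_TOKEN>
```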
Expected result: When matching events are generated, the DingTalk chatbot sends alert messages to the DingTalk group.

Scenario 4: Sink Kubernetes events to Log Service
You can sink Kubernetes events to Log Service for persistent storage, and archive and audit these events. For more information, see Create and use an event center.
- Create a Log Service project and a Logstore.
- Configure Log4jAppender for the cluster.
- An event is generated after an operation is performed on the cluster, such as a pod deletion or an application creation. You can log on to the Log Service console to view the collected log data. For more information, see Consume log data.
- Set indexes and archiving. For more information, see Create indexes.
Scenario 5: Sink Kubernetes events to EventBridge
EventBridge is a serverless event service provided by Alibaba Cloud. Alibaba Cloud services, custom applications, and software as a service (SaaS) applications can connect to EventBridge in a standardized and centralized manner. EventBridge can also route events among these applications based on the standardized CloudEvents 1.0 specification. You can sink ACK events to EventBridge to build a loosely coupled, distributed event-driven architecture. For more information about EventBridge, see What is EventBridge?.
- Activate EventBridge. For more information, see Activate EventBridge and grant permissions to a RAM user.
- Log on to the ACK console.
- In the left-side navigation pane, choose Marketplace. On the App Catalog tab, find and click ack-node-problem-detector. Note If the Kubernetes event center is deployed, you must first uninstall the ack-node-problem-detector component.
- In the left-side navigation pane of the ACK console, click Clusters.
- On the Clusters page, find the cluster that you want to manage and click its name or click Details in the Actions column.
- Choose .
- On the Helm page, delete the ack-node-problem-detector release.
- On the ack-node-problem-detector page, click Deploy, select a cluster and namespace, and then click Next. On the Parameters wizard page, configure the required parameters and click OK to deploy ack-node-problem-detector in the cluster. Configure the Kubernetes event center and enable EventBridge as a sink of Kubernetes events.
- In the npd section, set the enabled parameter to true.
- In the eventer.sinks.eventbridge section, set the enable parameter to true.
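A command-line sketch of the same settings, under the same assumptions about the local chart directory and flattened parameter keys as in the earlier sketches:

```bash
# Sketch: redeploy with node-problem-detector enabled and EventBridge as the kube-eventer sink.
helm upgrade --install ack-node-problem-detector ./ack-node-problem-detector \
  --namespace kube-system \
  --set npd.enabled=true \
  --set eventer.sinks.eventbridge.enable=true
```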
- After EventBridge is enabled as a sink of Kubernetes events, you can view Kubernetes events in the EventBridge console.
- Log on to the EventBridge console.
- In the left-side navigation pane, click Event Buses.
- On the Event Buses page, find the event bus that you want to view and click Event Tracking in the Actions column.
- Select a query method, set query conditions, and then click Query.
- In the list of events, find the event that you want to view and click Details in the Actions column.
In the Event Details dialog box, you can view the details of the event.