Alibaba Cloud provides Elastic Compute Service (ECS) system events to record and communicate resource information, such as the startup, stopping, and expiration of ECS instances and the task executions on ECS instances. In scenarios in which large-scale clusters are involved and resources are scheduled in real time, you can monitor and respond to ECS system events by using the ecs-tool-event plug-in of Cloud Assistant to automate O&M operations, such as troubleshooting and dynamic scheduling.
ECS system events are defined by Alibaba Cloud to record and communicate resource information, such as the execution status of O&M tasks, resource exceptions, and resource status changes. For information about the categories and details of ECS system events, see Overview of ECS system events.
Cloud Assistant provides plug-ins that allow you to perform complex configurations by running simple commands and improve O&M efficiency. For more information, see Overview of Cloud Assistant and Use Cloud Assistant plug-ins.
Mechanism
You can monitor and respond to ECS system events in the ECS console or by calling ECS API operations. However, both methods have limitations.
If you monitor or respond to ECS system events in the ECS console, you cannot automate the response to the events. You must manually perform the operations. As a result, when ECS system events are generated for multiple ECS instances, you may overlook events.
If you monitor or respond to ECS system events by calling ECS API operations, you must develop programs to integrate the API operations. This requires financial and technical support.
To resolve the preceding issues, Alibaba Cloud provides the ecs-tool-event plug-in of Cloud Assistant that requests ECS system events from MetaServer every minute and stores the events as logs in operating systems. This way, you can collect system event logs from operating systems and monitor and respond to ECS system events based on the logs without the need to develop additional programs. For example, if you have Kubernetes automated O&M capabilities, you can collect log streams from the host_event.log file to adapt your O&M system.
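As an illustration of collecting the log stream without a dedicated agent, the following Python sketch polls a log file and passes each newly appended line to a callback. The names read_new_lines and follow and the 5-second polling interval are assumptions made for this example and are not part of the ecs-tool-event plug-in. The sketch does not handle log rotation.

```python
import io
import time
from typing import Callable, Iterator, TextIO


def read_new_lines(stream: TextIO) -> Iterator[str]:
    """Yield the complete lines that are available from the current position."""
    while True:
        position = stream.tell()
        line = stream.readline()
        if not line.endswith("\n"):
            # Nothing new, or a line that is still being written:
            # rewind so that the next call reads it again in full.
            stream.seek(position)
            return
        yield line.rstrip("\n")


def follow(path: str, handle: Callable[[str], None], poll_seconds: float = 5.0) -> None:
    """Poll the file forever and pass each newly appended line to handle()."""
    with open(path, "r", encoding="utf-8") as stream:
        # Start at the end of the file to skip events logged before startup.
        stream.seek(0, io.SEEK_END)
        while True:
            for line in read_new_lines(stream):
                handle(line)
            time.sleep(poll_seconds)
```

For example, calling follow("/var/log/host_event.log", print) on an ECS instance would print each new event line shortly after the plug-in writes it. In practice, a log collector that you already operate can play the same role.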
Procedure
Make sure that Cloud Assistant Agent is installed on your ECS instances. For more information, see Install Cloud Assistant Agent.
Make sure that you have the root permissions required to start, stop, and check the status of Cloud Assistant plug-ins.
Log on to an ECS instance and start the ecs-tool-event plug-in.
After you start the ecs-tool-event plug-in, the plug-in requests ECS system events from MetaServer every minute and stores the events as logs in the operating system of the ECS instance.
sudo acs-plugin-manager --exec --plugin=ecs-tool-event --params --start
Note: After you start the ecs-tool-event plug-in, you can run the ls /var/log command to view the host_event.log file that is automatically generated in the /var/log directory.
Log path: /var/log/host_event.log
Log format:
%Y-%m-%d %H:%M:%S - WARNING - Ecs event type is: ${Event type},event status is: ${Event status},action ISO 8601 time is ${Execution time in ISO 8601}
Example:
2024-01-08 17:02:01 - WARNING - Ecs event type is: InstanceFailure.Reboot,event status is: Executed,action ISO 8601 time is 2023-12-27T11:49:28Z
Check the status of the ecs-tool-event plug-in.
sudo acs-plugin-manager --status
Collect log streams from the host_event.log file to adapt your O&M system based on your business scenarios.
(Optional) If you do not want to respond to ECS system events, stop the ecs-tool-event plug-in.
sudo acs-plugin-manager --remove --plugin ecs-tool-event
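To adapt an O&M system to these logs, each line typically needs to be split into its fields first. The following Python sketch parses one line in the log format described in the preceding procedure. The HostEvent class, the parse_line function, and the field names are assumptions made for this example. The pattern tolerates optional whitespace after the commas, and the sketch assumes that the execution time always ends with Z (UTC), as in the example line.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Pattern for one line of /var/log/host_event.log.
LINE_RE = re.compile(
    r"^(?P<logged>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - (?P<level>\w+) - "
    r"Ecs event type is: (?P<type>[^,]+),\s*"
    r"event status is: (?P<status>[^,]+),\s*"
    r"action ISO 8601 time is (?P<action>\S+)$"
)


@dataclass(frozen=True)
class HostEvent:
    logged_at: datetime    # local time at which the line was written
    level: str             # log level, for example WARNING
    event_type: str        # for example InstanceFailure.Reboot
    status: str            # for example Executed
    action_time: datetime  # execution time of the event, in UTC


def parse_line(line: str) -> Optional[HostEvent]:
    """Return the parsed event, or None if the line is not in the expected format."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    try:
        logged_at = datetime.strptime(match["logged"], "%Y-%m-%d %H:%M:%S")
        action_time = datetime.strptime(
            match["action"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
    except ValueError:
        return None
    return HostEvent(logged_at, match["level"], match["type"], match["status"], action_time)
```

A caller can then branch on event_type and status, for example to drain a node when a reboot is scheduled.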
Example: Automatically respond to ECS system events when ECS instances are used as nodes in a Kubernetes cluster
Scenario
If exceptions occur on one of the ECS instances that are used as nodes in a Kubernetes cluster, such as when the ECS instance is restarted, runs out of memory, or encounters an operating system error, online service stability may be affected. You must monitor and respond to the exception events to identify and troubleshoot the exceptions at the earliest opportunity. You can convert ECS system events into operating system logs by using the ecs-tool-event plug-in and conveniently and efficiently monitor and respond to the events by using the open source Node Problem Detector (NPD), Draino, and Autoscaler provided by the Kubernetes community without the need to develop additional programs. This way, you can increase the stability and reliability of the Kubernetes cluster.
Architecture
The following workflow explains how the ecs-tool-event plug-in is used to automatically respond to ECS system events when ECS instances are used as nodes in a Kubernetes cluster. The following figure shows the architecture.
The ecs-tool-event plug-in of Cloud Assistant requests ECS system events from MetaServer every minute and stores the events as logs in the following path in the operating system: /var/log/host_event.log.
NPD collects system event logs and reports issues to the API server.
Draino receives Kubernetes events (ECS system events) from the API server, evicts pods from unhealthy nodes, and then migrates the pods to healthy nodes.
After pods are evicted from an unhealthy node, take the node out of service based on your business scenario, or use Autoscaler to automatically release the node and add a new ECS instance as a node to the Kubernetes cluster.
Procedure
Step 1: Start the ecs-tool-event plug-in on a node
Log on to a node (ECS instance) in the Kubernetes cluster and start the ecs-tool-event plug-in.
In actual scenarios, you must start the ecs-tool-event plug-in on all nodes in the Kubernetes cluster. You can use Cloud Assistant to batch run the following command on multiple ECS instances to start the ecs-tool-event plug-in. For more information, see Create and run a command.
sudo acs-plugin-manager --exec --plugin=ecs-tool-event --params --start
After you start the ecs-tool-event plug-in, the plug-in automatically converts ECS system events into logs and stores the logs in the operating system of the node.
Step 2: Configure NPD and Draino for the Kubernetes cluster
Log on to any node in the Kubernetes cluster.
Configure NPD, which takes effect across the entire Kubernetes cluster.
Configure the following NPD files.
Note: For information about how to configure NPD files, see the official NPD documentation.
node-problem-detector-config.yaml: defines the metrics that NPD monitors, such as system logs.
node-problem-detector.yaml: defines how NPD operates in a Kubernetes cluster.
rbac.yaml: defines the permissions granted to NPD in a Kubernetes cluster.
Add the NPD files to an ECS instance on which NPD is not configured
Add the preceding NPD files to the ECS instance.
Modify the NPD files on an ECS instance on which NPD is configured
Add the log monitoring settings of the host_event.json file to the node-problem-detector-config.yaml file. In the settings, set plugin to filelog, which is the plug-in that is used to collect logs, and set logPath to /var/log/host_event.log, which is the path in which system event logs are stored. The type value in conditions (HostEventRebootAfter48 in this example) is the event name that is used in the Draino configuration. The following sample code provides an example of the log monitoring settings:
...
host_event.json: |
  {
    "plugin": "filelog",
    "pluginConfig": {
      "timestamp": "^.{19}",
      "message": "Ecs event type is: .*",
      "timestampFormat": "2006-01-02 15:04:05"
    },
    "logPath": "/var/log/host_event.log",
    "lookback": "5m",
    "bufferSize": 10,
    "source": "host-event",
    "conditions": [
      {
        "type": "HostEventRebootAfter48",
        "reason": "HostEventWillRebootAfter48",
        "message": "The Host Is Running In Good Condition"
      }
    ],
    "rules": [
      {
        "type": "temporary",
        "reason": "HostEventRebootAfter48temporary",
        "pattern": "Ecs event type is: SystemMaintenance.Reboot,event status is: Scheduled.*|Ecs event type is: SystemMaintenance.Reboot,event status is: Inquiring.*"
      },
      {
        "type": "permanent",
        "condition": "HostEventRebootAfter48",
        "reason": "HostEventRebootAfter48Permanent",
        "pattern": "Ecs event type is: SystemMaintenance.Reboot,event status is: Scheduled.*|Ecs event type is: SystemMaintenance.Reboot,event status is: Inquiring.*"
      }
    ]
  }
...
Configure the node-problem-detector.yaml file.
Add the /config/host_event.json configuration to the - --config.system-log-monitor line to enable NPD to monitor system event logs. The following sample code provides an example of the configuration:
containers:
- name: node-problem-detector
  command:
  ...
  - --config.system-log-monitor=/config/kernel-monitor.json,/config/docker-monitor.json,/config/host_event.json
Add lines below the items: line in the - name: config section, as shown in the following sample code:
...
- name: config
  configMap:
    name: node-problem-detector-config
    items:
    - key: kernel-monitor.json
      path: kernel-monitor.json
    - key: docker-monitor.json
      path: docker-monitor.json
    - key: host_event.json    # Add this line.
      path: host_event.json   # Add this line.
...
Run the following commands for the files to take effect:
sudo kubectl create -f rbac.yaml
sudo kubectl create -f node-problem-detector-config.yaml
sudo kubectl create -f node-problem-detector.yaml
Run the following command to check whether the NPD configuration takes effect:
sudo kubectl describe nodes -n kube-system
If the following command output in which a HostEventRebootAfter48 entry is added to the Conditions section is returned, the NPD configuration is complete and works as expected. If the HostEventRebootAfter48 entry does not appear in the Conditions section, wait 3 to 5 minutes.
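Before you deploy the settings, it can help to check offline which log lines the rules in the host_event.json sample would match. The following Python sketch is an approximation of that matching, not NPD's implementation: it uses Python's re module in place of Go regular expressions, and the matching_reasons function is a name assumed for this example. The message expression and the rule patterns are copied from the sample settings.

```python
import re
from typing import List

# "message" expression from pluginConfig in the host_event.json sample.
MESSAGE_RE = re.compile(r"Ecs event type is: .*")

# Pattern shared by both rules in the host_event.json sample.
REBOOT_PATTERN = (
    "Ecs event type is: SystemMaintenance.Reboot,event status is: Scheduled.*"
    "|Ecs event type is: SystemMaintenance.Reboot,event status is: Inquiring.*"
)

RULES = [
    {"type": "temporary", "reason": "HostEventRebootAfter48temporary", "pattern": REBOOT_PATTERN},
    {"type": "permanent", "reason": "HostEventRebootAfter48Permanent", "pattern": REBOOT_PATTERN},
]


def matching_reasons(line: str) -> List[str]:
    """Return the reasons of all sample rules whose pattern matches the log line."""
    found = MESSAGE_RE.search(line)
    if found is None:
        return []
    message = found.group(0)
    return [rule["reason"] for rule in RULES if re.search(rule["pattern"], message)]
```

Under this approximation, a line for a SystemMaintenance.Reboot event in the Scheduled or Inquiring state matches both rules, whereas a line for a different event type, such as InstanceFailure.Reboot, matches neither. To cover other event types, extend the patterns accordingly.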

Configure Draino, which takes effect across the entire Kubernetes cluster.
Install Draino or modify the Draino configuration.
Install Draino on an ECS instance on which Draino is not installed
Add the following YAML file to the ECS instance:
Modify the Draino configuration on an ECS instance on which Draino is installed
Open the Draino configuration file, find the containers: section, and then add the event name that you specified in the node-problem-detector-config.yaml file in Step 2 "Configure NPD, which takes effect across the entire Kubernetes cluster" as shown in the following code. In this example, the event name is HostEventRebootAfter48.
containers:
- name: draino
  image: planetlabs/draino:dbadb44
  # You'll want to change these labels and conditions to suit your deployment.
  command:
  - /draino
  - --debug
  ......
  - KernelDeadlock
  - OutOfDisk
  - HostEventRebootAfter48 # Add this line.
Run the following command for the Draino configuration to take effect:
sudo kubectl create -f draino.yaml
Step 3: Take unhealthy nodes out of service and add new nodes
After pods are evicted from an unhealthy node, take the node out of service based on your business scenario, or use Autoscaler to automatically release the node and add a new ECS instance as a node to the Kubernetes cluster. For information about how to use Autoscaler, see the official Autoscaler documentation.
Verify the result
Log on to a node and run the following command to simulate a log of an ECS system event.
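Important: Replace the time in the command with the current system time. NPD reports the simulated event only if the log line matches a pattern in the rules section of the host_event.json settings. The sample rules in Step 2 match SystemMaintenance.Reboot events in the Scheduled or Inquiring state, so adjust the event type and status in the command, or the patterns, accordingly.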
echo '2024-02-23 12:29:29 - WARNING - Ecs event type is: InstanceFailure.Reboot,event status is: Executed,action ISO 8601 time is 2023-12-27T11:49:28Z' | sudo tee -a /var/log/host_event.log > /dev/null
Run the following command to check the node. The following command output indicates that the Cloud Assistant plug-in detects the ECS system event, generates a Kubernetes event based on the event, and sets the status of the node to Unschedulable.
sudo kubectl describe nodes -n kube-system
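The same check can be scripted. The following Python sketch inspects a Node object in JSON form, such as the output of kubectl get node <node-name> -o json, and reports the status of the HostEventRebootAfter48 condition and whether the node is marked as unschedulable. The function names are assumptions made for this example. The fields that are read (status.conditions and spec.unschedulable) are standard fields of the Kubernetes Node object.

```python
import json
from typing import Any, Dict


def summarize_node(node: Dict[str, Any], condition_type: str = "HostEventRebootAfter48") -> Dict[str, Any]:
    """Report the status of one node condition and the unschedulable flag."""
    conditions = node.get("status", {}).get("conditions", [])
    condition_status = next(
        (c.get("status") for c in conditions if c.get("type") == condition_type),
        None,  # the condition is not reported on this node
    )
    return {
        "condition_status": condition_status,
        "unschedulable": bool(node.get("spec", {}).get("unschedulable", False)),
    }


def summarize_node_json(text: str) -> Dict[str, Any]:
    """Same as summarize_node, but takes the raw JSON text of a Node object."""
    return summarize_node(json.loads(text))
```

A condition_status of "True" together with unschedulable set to True corresponds to the result described in this section. A condition_status of None means that NPD has not reported the condition for the node.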