Understand system events for instance O&M and monitoring - Elastic Compute Service

System events are defined by Alibaba Cloud to record and notify you of resource information, such as the execution states of O&M tasks, resource exceptions, and changes in the status of resources.

Note

This topic describes only system events of Elastic Compute Service (ECS). For information about the system events of other Alibaba Cloud services, see the relevant documentation.

Usage scenarios of system events

Notification for risks and exceptions
After system events that are displayed in the ECS console are triggered, Alibaba Cloud pushes the events to the ECS console. These include events that may affect the availability or performance of ECS resources, such as instance restarts due to system maintenance and instance expiration. For specific critical system events, Alibaba Cloud also sends SMS messages, emails, or internal messages. You can handle the events in the ECS console or by calling API operations. We recommend that you handle system events at the earliest opportunity to prevent them from affecting your business. For more information, see Query and handle ECS system events.
For example, when a subscription instance is about to expire, the ECS console prompts you to renew the instance within a specific period of time to ensure service continuity.
Automated O&M
The states of system events that are displayed in the ECS console are defined to help you understand the execution states of system O&M tasks. Changes in the status of system events are synchronized to CloudMonitor to help you build an automated O&M mechanism. For more information about the states of system events, see the States and windows of system events section of this topic.
Note
- Each event state corresponds to a CloudMonitor event. For example, the Executing and Executed states that the InstanceFailure.Reboot ECS event code supports correspond to the Instance:InstanceFailure.Reboot:Executing and Instance:InstanceFailure.Reboot:Executed CloudMonitor events.
- Specific status change events are not displayed in the ECS console and cannot be handled in the ECS console or by calling API operations. Examples of the events include events that indicate changes in the status of instances and interruptions of spot instances. The states of the system events are not defined in ECS. However, the events are still reported to CloudMonitor when the events are triggered. This way, you can build an event-triggered automated O&M system based on your business requirements.
For example, status change events are triggered when you start or stop ECS instances. The events do not indicate risks or exceptions. If you want to log your operations to your system, you can configure event notifications for status change events and use the alert callback feature to write the startup and stop information of instances to operation logs.
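
The event name format described in the note above can be split into its parts inside an alert callback before a line is written to an operation log. The following Python sketch is illustrative only: the function names, the log line layout, and the handling of names without a state segment are assumptions and are not part of the CloudMonitor API.

```python
from typing import NamedTuple, Optional


class CloudMonitorEventName(NamedTuple):
    resource_type: str    # for example "Instance"
    event_code: str       # for example "InstanceFailure.Reboot"
    state: Optional[str]  # for example "Executing"; None if no state segment is present


def parse_event_name(name: str) -> CloudMonitorEventName:
    """Split a name such as 'Instance:InstanceFailure.Reboot:Executing'."""
    parts = name.split(":")
    if len(parts) == 3:
        return CloudMonitorEventName(parts[0], parts[1], parts[2])
    if len(parts) == 2:
        # Assumption: a name without a state segment has only two parts.
        return CloudMonitorEventName(parts[0], parts[1], None)
    raise ValueError(f"unexpected event name format: {name!r}")


def format_log_line(name: str, instance_id: str, timestamp: str) -> str:
    """Build one operation-log line (hypothetical layout)."""
    event = parse_event_name(name)
    state = event.state if event.state is not None else "-"
    return f"{timestamp} {instance_id} {event.resource_type} {event.event_code} state={state}"
```

A callback could pass the event name, instance ID, and timestamp that it receives to format_log_line and append the result to its own log file.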

System event categories

System events can be classified into the categories that are described in the following table based on the event causes.

Note

For information about the system event categories supported by ECS and how to handle ECS system events, see Summary.

Category	Description	Displayed in the ECS console
Scheduled O&M events	Alibaba Cloud may need to upgrade host software for security reasons, or to predict and handle failure risks in underlying host hardware and software. If the O&M tasks that Alibaba Cloud executes in these cases may affect the availability or performance of your ECS resources, Alibaba Cloud sends scheduled O&M events in advance to notify you of task details, such as execution times, objects, and impacts. After you receive a scheduled O&M event, you can handle the event during an off-peak period within the user operation window to minimize business impacts. Note Scheduled O&M events, also known as proactive O&M events, draw on the O&M experience of Alibaba Cloud with millions of servers, its experience in serving tens of thousands of large enterprise customers, and the machine learning algorithms of Alibaba DAMO Academy to predict and handle failure risks in underlying host hardware or software. When failure risks on the host cannot be prevented, Alibaba Cloud notifies you in advance by using scheduled O&M events so that you can switch your business before failures occur. If you do not respond to scheduled O&M events at the earliest opportunity, your ECS instances may break down or restart when failures occur.	Yes Note When scheduled O&M events are triggered for instances of big data instance families or instance families that are equipped with local SSDs (excluding the i4p instance family), the events are displayed on the Local Disk-based Instance Events page. For information about local disk-based instance events, see O&M scenarios and system events for instances equipped with local disks.
Unexpected O&M events	This category of system events is triggered when ECS instances restart or break down due to unexpected issues, such as kernel panic, out-of-memory (OOM) errors, or hardware or software failures in underlying hosts. Alibaba Cloud sends the events after the events are triggered and restores affected ECS resources at the earliest opportunity. Alibaba Cloud also notifies you of the execution states of system O&M tasks related to the events. Note In most cases, unexpected O&M events refer to sudden downtime or restarts of ECS instances due to unpredictable failures of the underlying hosts or kernel errors in the operating systems of the ECS instances. ECS instance downtime or restart events caused by host failures (SystemFailure.Reboot) are occasional and inevitable. If the Service Level Agreement (SLA) for a single instance is violated, Alibaba Cloud pays compensation based on the SLA of the related service. In most cases, ECS instance restart events caused by operating system kernel errors (InstanceFailure.Reboot) are caused by applications. You can capture dump files to analyze the causes. For more information, see How do I enable the kdump service on a Linux instance?.	Yes Note When unexpected O&M events are triggered for instances of big data instance families or instance families that are equipped with local SSDs (excluding the i4p instance family), the events are displayed on the Local Disk-based Instance Events page. For information about local disk-based instance events, see O&M scenarios and system events for instances equipped with local disks.
Local disk-based instance events	This category includes system events that are triggered for local disks and for instances equipped with local disks. Events for local disks are triggered when local disks are damaged. Events for instances equipped with local disks are triggered when the instances fail due to local disk damage or when the hardware or software of the underlying hosts fails. Note The Local Disk-based Instance Events page is not a separate system event category. It is used only to display scheduled or unexpected O&M events for instances of big data instance families or instance families that are equipped with local SSDs (excluding the i4p instance family) so that the events are easy to handle. For more information about local disk-based instance events, see O&M scenarios and system events for instances equipped with local disks.	Yes
Performance limited events of burstable instances	This category of system events is triggered when burstable instances exhaust their CPU credits and start to run at or near the baseline CPU utilization. The system events may affect instance management, instance O&M, and the operation of applications and result in issues such as slow access and latency.	Yes
Instance security events	This category of system events is triggered when instances face security threats. For example, instance security events are triggered when instances are under DDoS attacks or when blackhole filtering is triggered for instances.	Yes
Instance migration events due to upgrades at the underlying layer	This category of system events is triggered when instances need to be migrated from specific regions and zones due to an infrastructure upgrade plan of Alibaba Cloud. You can migrate instances based on the system events.	Yes
Status change events	This category of system events is triggered when operations on instances, such as Start and Stop, or changes to instance attributes change the lifecycle status or another status of the instances. Status change events are classified into the following categories: Lifecycle status change events: triggered, for example, when instances enter a different state, when spot instances are interrupted, or when snapshots are created. Other attribute change events: triggered, for example, when the performance mode of burstable instances is changed or when subscription disks are changed to pay-as-you-go disks.	Lifecycle status change events are not displayed in the ECS console. Specific other attribute change events are displayed in the ECS console.

System event severities

The following severities are assigned to system events based on the impacts of the system events on the normal operation of instances:

Critical: Critical system events may result in instance unavailability and must be handled at the earliest opportunity. For example, a critical system event is triggered when resources are released due to an overdue payment or when an instance is redeployed due to an instance error.
Warning: Warning system events affect your business. For example, a warning system event is triggered when a burstable instance cannot burst above its performance baseline. You must take note of the events or handle the events when appropriate.
Notification: Notification system events do not affect your business. For example, a notification system event is triggered when a snapshot is created for a disk. You can choose whether to pay attention to notification system events.
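
The three severities above can be used to order a batch of events so that the most urgent ones are reviewed first. The following Python sketch assumes that each event is available as a (severity label, description) pair. That data shape and the function names are hypothetical and chosen only for illustration.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Lower value means more urgent.
    CRITICAL = 0
    WARNING = 1
    NOTIFICATION = 2


def parse_severity(label: str) -> Severity:
    """Map a label such as 'Critical' to a Severity member."""
    return Severity[label.strip().upper()]


def most_urgent_first(events):
    """Sort (severity_label, description) pairs, keeping the input order within a severity."""
    return sorted(events, key=lambda event: parse_severity(event[0]))
```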

States and windows of system events

The following table describes the states defined for system events that are displayed in the ECS console.

Note

For information about the supported states of different system events, see the "CloudMonitor event" columns of tables in Summary.

Event state	Attribute	Description
Inquiring	Intermediate	The O&M task related to the system event is pending authorization. After you authorize the task to be executed, the event enters the Executing state.
Scheduled	Intermediate	The O&M task related to the system event is scheduled and pending execution. When the O&M task is executed, the event enters the Executing state.
Executing	Intermediate	The O&M task related to the system event is being executed.
Executed	Stable	The O&M task related to the system event is completed.
Avoided	Stable	The impacts of the system event are prevented because the affected instance is migrated within the user operation window.
Failed	Stable	The O&M task related to the system event failed.
Canceled	Stable	The O&M task related to the system event is automatically canceled.
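
The states in the table above can be modeled in code to decide whether an event still needs to be tracked. The following Python sketch encodes only what the table states: the Attribute column and the two transitions into the Executing state. Treating a stable state as one that needs no further follow-up is an interpretation, and all names in the sketch are hypothetical.

```python
from enum import Enum


class EventState(Enum):
    INQUIRING = "Inquiring"
    SCHEDULED = "Scheduled"
    EXECUTING = "Executing"
    EXECUTED = "Executed"
    AVOIDED = "Avoided"
    FAILED = "Failed"
    CANCELED = "Canceled"


# States marked "Stable" in the Attribute column of the table.
STABLE_STATES = frozenset({
    EventState.EXECUTED,
    EventState.AVOIDED,
    EventState.FAILED,
    EventState.CANCELED,
})

# Only the transitions that the table describes explicitly; others are not encoded.
STATED_TRANSITIONS = {
    EventState.INQUIRING: frozenset({EventState.EXECUTING}),
    EventState.SCHEDULED: frozenset({EventState.EXECUTING}),
}


def is_stable(state: EventState) -> bool:
    return state in STABLE_STATES


def needs_follow_up(label: str) -> bool:
    """Return True while the event is in an intermediate state (interpretation)."""
    return not is_stable(EventState(label))
```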

The following figure shows the typical transitions between event states.

System events have the following windows:

User operation window
The user operation window of a system event starts when the event is sent and ends at the time when the related O&M task is executed as scheduled. You can manually execute the O&M task within the user operation window or wait for the system to automatically execute the task. Take note of the following items about the lengths of user operation windows:
- In most cases, the user operation window of a scheduled O&M event ranges from 24 to 48 hours.
  Note
  The lengths of user operation windows are unlimited for system events in the Inquiring state. The O&M tasks related to the events can start only after you authorize the tasks to be executed.
- In most cases, unexpected O&M events caused by failures or unauthorized operations do not have a user operation window.
- For system events indicating that subscription instances are about to expire, the window is 3 days.
- For system events indicating that pay-as-you-go instances are about to be stopped due to overdue payments, the window is less than 1 hour.
Event execution window
The execution window of a system event starts when the related O&M task is executed and ends when the task is completed. Take note of the following items about the lengths of event execution windows:
- For system events such as failure recovery events, the window lasts no more than 10 minutes.
- Unexpected O&M events caused by failures or unauthorized operations have a short event execution window.
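
The two windows defined above are simple time differences. The following Python sketch computes them from timestamps and assumes that the time the event was sent, the scheduled execution time, and the task start and end times are already available as timezone-aware datetime values. The function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def user_operation_window(published_at: datetime, scheduled_at: datetime) -> timedelta:
    """Window from the time the event is sent to the scheduled task execution."""
    if scheduled_at < published_at:
        raise ValueError("scheduled time precedes the time the event was sent")
    return scheduled_at - published_at


def time_left(scheduled_at: datetime, now: datetime) -> timedelta:
    """Time remaining in the user operation window, never negative."""
    return max(scheduled_at - now, timedelta(0))


def execution_window(started_at: datetime, finished_at: datetime) -> timedelta:
    """Window from the start of the O&M task to its completion."""
    if finished_at < started_at:
        raise ValueError("finish time precedes start time")
    return finished_at - started_at
```

For example, an event sent at 00:00 UTC with a task scheduled for 12:00 UTC on the following day has a 36-hour user operation window, which falls in the 24 to 48 hour range mentioned above.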

Operations that can be performed on system events

Operation	Description and references
Understand system events	To learn about system events and understand the event names, severities, usage scenarios, limits, states, and name formats, see this topic.
View system events	You can view system events in the ECS or CloudMonitor console or by using Alibaba Cloud CLI. For information about how to view system events in the ECS console or by using Alibaba Cloud CLI, see Query and handle ECS system events. For information about how to view system events in the CloudMonitor console, see View system events.
Handle system events	For specific critical system events, such as system events that affect the availability and performance of ECS resources, we recommend that you handle the events as suggested in the ECS or CloudMonitor console or by calling API operations at the earliest opportunity to ensure service availability. For information about the suggestions on how to handle all system events, see Summary. For information about how to view and handle pending system events, see Query and handle ECS system events. For information about how to handle system events related to local disks, see O&M scenarios and system events for instances equipped with local disks.
Monitor system events	To ensure the stability of services that run on ECS instances and automate O&M, we recommend that you configure event notifications to be notified of underlying environment changes. After you configure event notifications, the system uses the notification methods that you specify to send you notifications. For information about how to configure alert rules in the CloudMonitor console to push event notifications, see Subscribe to ECS system event notifications. For information about how to use a DingTalk chatbot to send event notifications to a DingTalk group, see Send event notifications by using a DingTalk chatbot.
Modify system event-related settings	You can modify system event-related settings based on your business requirements. You can modify the maintenance attributes of an ECS instance to configure whether to restart or redeploy the instance after a system event is handled. For more information, see Modify instance maintenance attributes. For scheduled system events that require ECS instances to be restarted, you can configure O&M tasks to handle the system events and specify the restart time of the instances. For more information, see Modify the scheduled restart time.