Detect and Resolve Elasticsearch O&M Events with Event Center

You can use the Event Center to view system O&M events for Alibaba Cloud Elasticsearch (ES). This helps you promptly detect service anomalies and quickly analyze and locate issues. This topic describes the event categories for ES and how to view and handle events.

Event categories

ES events are categorized by cause and impact as follows.

Note

For more information, see Appendix: Event details.

Event category	Definition	Cause and impact	Examples
System change	System change events are initiated by Alibaba Cloud. You are notified of these events and must check if your cluster is affected.	System change events caused by infrastructure changes or faults may affect cluster access. When this type of event is triggered, the system sends a notification. Check the notification and your cluster status promptly.	Kibana feature upgrade causes a brief service suspension. AMD instance families are upgraded to the latest generation.
Cluster health	The system regularly inspects cluster health based on actual usage. It displays unexpected diagnostic results as events.	To ensure the sustainability of the Alibaba Cloud service, the system automatically triggers a cluster health event when it detects a cluster resource anomaly or risk. This minimizes the impact. Note: During the execution of an O&M event, the cluster may experience brief jitter, but normal access is not affected. If automatic execution fails, you can manually trigger a node restart on the Event Center page. The manual intervention window is 24 to 48 hours. For specific execution times, see View and handle events.	An inspection finds that an ES node is offline.
Cluster change	These are operation events that you initiate to change a cluster. Failures or blocks can occur during the change process.	Cluster change events caused by instance type changes or kernel upgrades trigger a restart of the corresponding nodes. During the execution of an O&M event, the cluster may experience brief jitter, but normal access is not affected.	Scale-in. Restart a node.

View and handle events

On the Event Center page, you can view information about events generated under the current account and handle them as needed.

Go to the Event Center.
1. Log on to the Alibaba Cloud Elasticsearch console.
2. In the navigation pane on the left, click Event Center.

View event information.

On the Event Center page, you can filter by conditions to view all events for a target instance of a selected type within a specified time period. Then, you can perform operations based on the event details.

Note

You can view all event information in the Event Center. You can also subscribe to events and set notifications for critical alerts that require prompt handling. When an alert is triggered, the system automatically sends an alert notification to the specified alert contacts by phone, text message, or email.

The event information and related handling operations are described in the following table.

Event information	Description
Cluster ID	The ID of the Alibaba Cloud ES instance that generated the event.
Node ID	The ID of the instance node that generated the event.
Event Level	The severity of the event. Levels include the following: Info: Records the status or operations of the system during normal operation. Often used for system status observation or debugging. Warning: A potential issue or anomaly exists in the system but does not affect the current operation. Continuous monitoring is required. Critical: A serious error or fault has occurred in the system. Immediate handling is required. Otherwise, service unavailability or data loss may occur.
Event Status	The execution status of the event. Statuses include To Be Handled, In Progress, Handled, Handling Failed, Handling Interrupted, Canceled, Execution to be confirmed, and Ready to continue. Of these: To Be Handled: The event is waiting to be executed at the system-set time or at the time you scheduled. Execution to be confirmed: Based on the event details, you can decide whether to execute the event immediately or to create a snapshot backup first. Note: Among system change events, only some events related to local disks support this status. Only deployment events, such as an ES cluster upgrade or the deployment of a new version to a specified node, support snapshot backups. Ready to continue: The current change task has completed the grayscale change. You need to confirm that the changed nodes and the cluster are stable, and then decide whether to execute the subsequent tasks. For example, a change is first tested on some nodes and, after it is verified in that small scope, executed on all nodes. For events in the Handling Failed or Handling Interrupted state, find the cause and handle them promptly to avoid affecting normal business operations.
Event Description	The cause and impact of the event.
Occurred At and Ended At	The start and end time of the event execution.
Scheduled Handling Time and Execution End Time	The scheduled start time and estimated end time of the event. Note Only system change events support this setting.
Source	The source of the event. Sources include the following: Proactive Notification: ES proactively pushes events to Event Center after they are generated. Event Subscription: You subscribe to listen for specified events. When an event occurs, the system receives a corresponding notification.
Suggestion	You can handle related events based on the recommended operations. The supported handling operations vary by event. The operations shown in the console take precedence. Contact Technical Support: If you have questions about an event, you can contact technical support for consultation. Restart: Immediately restart the specified node of the related instance. Schedule Restart: You must specify a restart time. The system restarts the specified node of the related instance at the scheduled time. The node restart time must be at least 5 minutes later than the scheduled time. The system restarts the node for you within 5 minutes of the scheduled time. Note: When you restart, forcibly restart, or perform a grayscale restart on the current instance or node, the system automatically triggers the execution of a restart event for that instance or node. However, for redeployment events, such as an ES version upgrade, you still need to submit a ticket to contact technical support.
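The status and level descriptions in the table above can be turned into a simple triage rule. The following Python sketch is illustrative only: the `Event` record, the `triage` function, and the returned labels are hypothetical names invented for this example and do not correspond to any Event Center API. It assumes you have already copied or exported the Event Level and Event Status values by some means of your own.

```python
from dataclasses import dataclass

# Statuses that the table above says require you to find the cause.
NEEDS_INVESTIGATION = {"Handling Failed", "Handling Interrupted"}
# Statuses that the table above says wait for your decision.
NEEDS_CONFIRMATION = {"Execution to be confirmed", "Ready to continue"}


@dataclass
class Event:
    """Hypothetical record holding fields shown on the Event Center page."""
    cluster_id: str
    level: str    # "Info", "Warning", or "Critical"
    status: str   # one of the Event Status values listed above


def triage(event: Event) -> str:
    """Return a hypothetical action label for an event."""
    if event.status in NEEDS_INVESTIGATION:
        return "investigate"
    if event.status in NEEDS_CONFIRMATION:
        return "confirm"
    # Critical events that have not finished yet are worth watching closely.
    if event.level == "Critical" and event.status in {"To Be Handled", "In Progress"}:
        return "watch"
    return "none"


if __name__ == "__main__":
    sample = Event(cluster_id="es-cn-example", level="Critical", status="Handling Failed")
    print(triage(sample))
```

The cluster ID in the usage lines is a placeholder. The order of the checks is a design choice: status takes priority over level, because a failed or interrupted event needs attention regardless of its severity.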

Appendix: Event details

Event type	Event code and name	Event level	CloudMonitor event name	Description and impact
System change event	SystemUpdate.InfraDiskError System change event due to infrastructure disk failure	Critical	`Instance:SystemUpdate.InfraDiskError:Executing`: System change event in progress due to infrastructure disk failure `Instance:SystemUpdate.InfraDiskError:Executed`: System change event completed due to infrastructure disk failure	An infrastructure failure makes the local disk unavailable.
	SystemUpdate.InfraDiskStalled System change event due to infrastructure disk performance issues	Critical	`Instance:SystemUpdate.InfraDiskstalled:Executing`: System change event in progress due to infrastructure disk performance issues `Instance:SystemUpdate.InfraDiskstalled:Executed`: System change event completed due to infrastructure disk performance issues	The performance of the cloud disk is degraded due to an infrastructure failure.
	SystemUpdate.InfraFailureStop System change event due to an infrastructure-related instance stop	Critical	`Instance:SystemUpdate.InfraFailureStop:Scheduled`: Scheduled system change event due to an infrastructure-related instance stop `Instance:SystemUpdate.InfraFailureStop:Executing`: System change event in progress due to an infrastructure-related instance stop `Instance:SystemUpdate.InfraFailureStop:Executed`: System change event completed due to an infrastructure-related instance stop `Instance:SystemUpdate.InfraFailureStop:Failed`: System change event failed due to an infrastructure-related instance stop	The instance may stop due to a potential infrastructure failure.
	SystemUpdate.InfraMigrate System change event due to infrastructure maintenance	Critical	`Instance:SystemUpdate.InfraMigrate:Scheduled`: Scheduled system change event due to infrastructure maintenance `Instance:SystemUpdate.InfraMigrate:Executing`: System change event in progress due to infrastructure maintenance `Instance:SystemUpdate.InfraMigrate:Executed`: System change event completed due to infrastructure maintenance `Instance:SystemUpdate.InfraMigrate:Failed`: System change event failed due to infrastructure maintenance	The instance node restarts due to infrastructure maintenance. The instance node is redeployed due to infrastructure maintenance.
	SystemUpdate.SoftwareRepair System change event due to a software update	Warning	`Instance:SystemUpdate.SoftwareRepair:Scheduled`: Scheduled system change event due to a software update `Instance:SystemUpdate.SoftwareRepair:Executing`: System change event in progress due to a software update `Instance:SystemUpdate.SoftwareRepair:Executed`: System change event completed due to a software update	Description: The cluster control system restarts due to an upgrade. This upgrade involves changes to the Alibaba Cloud instance architecture, where the control deployment mode is upgraded from Basic Control (v2) to Cloud-native Control (v3). Note: You can view the control deployment mode on the instance's Basic Information page. Impact: The upgrade is performed through a blue-green deployment within a scheduled time period. During this process, the number of cluster nodes doubles, but no extra fees are incurred. The upgrade process takes several hours, depending on the data volume. The old nodes are taken offline during the O&M window you set. This process involves a service interruption of about 1 to 2 seconds. Instance change operations are not supported during the upgrade, so make the necessary business preparations in advance. Clusters of version `6.8.6` are upgraded to version `6.8.23`. The engine is fully compatible, and your services are not affected. After the upgrade, the Kibana private network is disabled. You need to log on to the Kibana console to enable it.
Cluster health event	HealthCheck.ClusterAbnormal Cluster health event due to an abnormal cluster status	Critical	`Instance:HealthCheck.ClusterAbnormal:Executed`: Cluster health event completed due to an abnormal cluster status `Instance:HealthCheck.ClusterAbnormal:Failed`: Cluster health event failed due to an abnormal cluster status	The instance restarts due to an abnormal cluster status.
Cluster change event	UserOperator.InstanceSpecModify Cluster change event due to an instance type change	Info	`Instance:UserOperator.InstanceSpecModify:Executing`: Cluster change event in progress due to an instance type change `Instance:UserOperator.InstanceSpecModify:Executed`: Cluster change event completed due to an instance type change	The instance restarts due to an instance type change. The instance node restarts due to an instance node change.
	UserOperator.InstanceUpdate Cluster change event due to an instance change operation	Info	`Instance:UserOperator.InstanceUpdate:Executing`: Cluster change event in progress due to an instance change operation `Instance:UserOperator.InstanceUpdate:Executed`: Cluster change event completed due to an instance change operation	The instance restarts due to an instance configuration change. The instance plugin is updated. The IK dictionary plugin for the instance is hot-updated.
	UserOperator.InstanceCoreUpdate Cluster change event due to an instance kernel upgrade	Info	`Instance:UserOperator.InstanceCoreUpdate:Executing`: Cluster change event in progress due to an instance kernel upgrade `Instance:UserOperator.InstanceCoreUpdate:Executed`: Cluster change event completed due to an instance kernel upgrade	The instance restarts due to a kernel version update.
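
The CloudMonitor event names in the table above follow a visible three-part pattern: a resource prefix, the event code, and a status suffix, separated by colons. The following Python sketch splits such a name and looks up the event type from the code prefix. The function name `parse_event_name` and the `ParsedEvent` type are hypothetical and exist only for this example. The prefix-to-type mapping is inferred from the rows of this appendix and may not cover event codes that are not listed here.

```python
from typing import NamedTuple

# Inferred from the appendix rows: the first segment of the event code
# indicates the event type. Codes not listed in the appendix map to "Unknown".
EVENT_TYPE_BY_PREFIX = {
    "SystemUpdate": "System change event",
    "HealthCheck": "Cluster health event",
    "UserOperator": "Cluster change event",
}


class ParsedEvent(NamedTuple):
    resource: str    # for example "Instance"
    code: str        # for example "SystemUpdate.InfraMigrate"
    event_type: str  # looked up from the code prefix
    status: str      # for example "Scheduled", "Executing", "Executed", "Failed"


def parse_event_name(name: str) -> ParsedEvent:
    """Split a name of the form Resource:Code:Status into its parts."""
    parts = name.split(":")
    if len(parts) != 3:
        raise ValueError(f"unexpected event name format: {name!r}")
    resource, code, status = parts
    prefix = code.split(".", 1)[0]
    event_type = EVENT_TYPE_BY_PREFIX.get(prefix, "Unknown")
    return ParsedEvent(resource, code, event_type, status)


if __name__ == "__main__":
    print(parse_event_name("Instance:SystemUpdate.InfraMigrate:Scheduled"))
```

A parser like this can help route notifications, for example by treating names that end in `Failed` differently from names that end in `Executed`.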