System events and O&M for local disk instances - Elastic Compute Service

Local disks do not provide high data availability. To improve the experience of using local disks, Alibaba Cloud provides O&M capabilities that help you stay informed of and handle exceptions that occur on your local disks. This topic describes common O&M scenarios and system events for Elastic Compute Service (ECS) instances equipped with local disks.

View and monitor system events for instances equipped with local disks

View the system events for instances equipped with local disks.

View the system events for instances equipped with local disks in the ECS console or by using Alibaba Cloud CLI. For more information, see Query and handle events.
View the system events for instances equipped with local disks in the CloudMonitor console. For more information, see View system events.
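
As an illustration of the Alibaba Cloud CLI route, the following minimal sketch builds a command that queries the system events of one instance. It assumes that the `aliyun` CLI is installed and configured, and that the DescribeInstanceHistoryEvents operation with the RegionId and InstanceId parameters is the relevant query (this topic does not name the operation). The function name, region ID, and instance ID are hypothetical placeholders. The sketch only builds the argument list and does not run the CLI.

```python
import shlex
from typing import List


def build_query_events_command(region_id: str, instance_id: str) -> List[str]:
    """Build (but do not run) a CLI command that lists system events of one instance.

    Assumption: DescribeInstanceHistoryEvents accepts RegionId and InstanceId.
    """
    if not region_id or not instance_id:
        raise ValueError("region_id and instance_id are required")
    return [
        "aliyun", "ecs", "DescribeInstanceHistoryEvents",
        "--RegionId", region_id,
        "--InstanceId", instance_id,
    ]


if __name__ == "__main__":
    # Placeholder values; replace them with your own region and instance IDs.
    command = build_query_events_command("cn-hangzhou", "i-exampleinstanceid")
    print(shlex.join(command))
```

To run the query, pass the returned list to `subprocess.run` on a machine where the CLI is configured.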

Monitor the system events for instances equipped with local disks.

To ensure the stability of services that run on ECS instances and to automate O&M, we recommend that you configure event notifications so that you are notified of changes in the underlying environment. After you configure event notifications, the system notifies you by using the notification methods that you specify.

Configure alert rules in the CloudMonitor console to push event notifications. For more information, see Subscribe to system event notifications.
Use a DingTalk chatbot to send event notifications to a DingTalk group. For more information, see Send event notifications using a DingTalk chatbot.
Install the xdragon_hardware_detect_plugin plug-in on ECS bare metal instances to periodically check the health status of local disks on the instances. For more information, see Install the monitoring plugin.

Common O&M scenarios and related system events

The following figure shows the O&M scenarios that are common to ECS instances equipped with local disks and the system events for the instances.

Note

Modify the maintenance attributes of instances to change automatic recovery modes (also called maintenance actions) for the instances. For example, if the maintenance action for an instance is Automatically Re-deploy, the instance is automatically redeployed. For more information, see Modify instance maintenance attributes.

For information about system events triggered in the scenarios as shown in the preceding figure, see the following sections in this topic:

Scenario ①: SystemMaintenance.Reboot
Scenario ②: SystemMaintenance.Redeploy
Scenario ③: SystemFailure.Reboot
Scenario ④: SystemFailure.Redeploy
Scenario ⑤: Disk:ErrorDetected, SystemMaintenance.IsolateErrorDisk, SystemMaintenance.ReInitErrorDisk, SystemMaintenance.RebootAndIsolateErrorDisk, or SystemMaintenance.RebootAndReInitErrorDisk
Scenario ⑥: SystemMaintenance.StopAndRepair
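
If you automate event handling, the mapping above can be kept as a lookup table. The following minimal sketch encodes the event types and scenario numbers listed in this topic. The names `SCENARIO_BY_EVENT_TYPE` and `scenario_for` are hypothetical and chosen for illustration only.

```python
from typing import Optional

# Event type -> scenario number, as listed in this topic.
SCENARIO_BY_EVENT_TYPE = {
    "SystemMaintenance.Reboot": 1,
    "SystemMaintenance.Redeploy": 2,
    "SystemFailure.Reboot": 3,
    "SystemFailure.Redeploy": 4,
    "Disk:ErrorDetected": 5,
    "SystemMaintenance.IsolateErrorDisk": 5,
    "SystemMaintenance.ReInitErrorDisk": 5,
    "SystemMaintenance.RebootAndIsolateErrorDisk": 5,
    "SystemMaintenance.RebootAndReInitErrorDisk": 5,
    "SystemMaintenance.StopAndRepair": 6,
}


def scenario_for(event_type: str) -> Optional[int]:
    """Return the scenario number for an event type, or None if it is not listed."""
    return SCENARIO_BY_EVENT_TYPE.get(event_type)
```

Returning None for unlisted event types lets the caller decide how to handle events that this topic does not cover.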

Note

To ensure business continuity, back up data for affected ECS instances and switch to other instances before you execute O&M tasks on the instances. For example, you can divert traffic away from the affected ECS instances, disassociate the ECS instances from Server Load Balancer (SLB) instances, and back up disk data of the ECS instances.

Scenario ①

Procedure for handling a SystemMaintenance.Reboot system event:

You are notified when an instance is scheduled to be restarted.
Use one of the following methods to handle the event:
- If you do not want the instance to be restarted within the scheduled time period, specify a different point in time at which you want the instance to be restarted. For more information, see Modify the scheduled restart time.
- Restart the instance within the user operation window. For more information, see Restart an instance.
  Note
  Restart the instance in the ECS console or by calling the RebootInstance operation. You cannot restart the instance from within the instance.
- Wait for the instance to be automatically restarted.
Check whether the instance and applications continue to work as expected.

For information about the event states supported by SystemMaintenance.Reboot, see System event summary. For the figure that shows the typical transitions between event states, see the States and windows of system events section of the "Overview" topic.

Scenario ②

Procedure for handling a SystemMaintenance.Redeploy system event:

You are notified when an instance equipped with local disks is scheduled to be redeployed.
Make preparations such as modifying the /etc/fstab configuration file and backing up data.
For information about the required preparations, see the "Prerequisites" section of the Redeploy an instance to which local disks are attached topic.
Use one of the following methods to handle the event:
- Redeploy the instance within the user operation window. For more information, see Redeploy an instance to which local disks are attached.
- Wait for the instance to be automatically redeployed.
Note
When an instance equipped with local disks is redeployed, the instance is migrated to a different physical server, and the local disks of the instance are re-initialized and lose all their data.
Check whether the instance and applications continue to work as expected. If yes, synchronize data based on your business requirements.
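
The /etc/fstab preparation in the preceding procedure can be sketched as follows. This example assumes that the change called for is adding the `nofail` mount option to the entries of local disks, so that the instance can still start if a local disk is missing or re-initialized. Confirm the exact change against the "Prerequisites" section referenced above. The function name and the sample mount point are hypothetical. The function works on a string and does not modify any file.

```python
from typing import Set


def add_nofail(fstab_text: str, mount_points: Set[str]) -> str:
    """Return fstab content with 'nofail' added to the entries of the given mount points."""
    result = []
    for line in fstab_text.splitlines():
        stripped = line.strip()
        fields = stripped.split()
        # Skip blank lines, comments, and lines without an options field.
        is_entry = bool(stripped) and not stripped.startswith("#") and len(fields) >= 4
        if is_entry and fields[1] in mount_points:
            options = fields[3].split(",")
            if "nofail" not in options:
                options.append("nofail")
                fields[3] = ",".join(options)
                line = " ".join(fields)
        result.append(line)
    trailing_newline = "\n" if fstab_text.endswith("\n") else ""
    return "\n".join(result) + trailing_newline
```

For example, calling `add_nofail(text, {"/mnt/local"})` changes only the entry mounted at /mnt/local and leaves all other lines untouched. Back up the original file before you write the result to /etc/fstab.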

For information about the event states supported by SystemMaintenance.Redeploy, see System event summary. For the figure that shows the typical transitions between event states, see the States and windows of system events section of the "Overview" topic.

Scenario ③

Procedure for handling a SystemFailure.Reboot system event:

The system restarts an instance due to a system error.
You are notified when the instance is being restarted.
Wait until the instance is restarted without manual intervention.
Check whether the instance and applications continue to work as expected.

For information about the event states supported by SystemFailure.Reboot, see System event summary. For the figure that shows the typical transitions between event states, see the States and windows of system events section of the "Overview" topic.

Scenario ④

Procedure for handling a SystemFailure.Redeploy system event:

You are notified when an instance equipped with local disks is scheduled to be redeployed.
Make preparations such as modifying the /etc/fstab configuration file and backing up data.
For information about the required preparations, see the "Prerequisites" section of the Redeploy an instance to which local disks are attached topic.
Use one of the following methods to handle the event:
- Redeploy the instance within the user operation window. For more information, see Redeploy an instance to which local disks are attached.
- Wait for the instance to be automatically redeployed.
Note
When an instance equipped with local disks is redeployed, the instance is migrated to a different physical server, and the local disks of the instance are re-initialized and lose all their data.
Check whether the instance and applications continue to work as expected. If yes, synchronize data based on your business requirements.

For information about the event states supported by SystemFailure.Redeploy, see System event summary. For the figure that shows the typical transitions between event states, see the States and windows of system events section of the "Overview" topic.

Scenario ⑤

In scenario ⑤, you can redeploy the affected instance to a different host or authorize the replacement of the damaged local disk. Take note of the following items when you replace the damaged local disk:

Only specific disks of instances equipped with local disks can be isolated. Isolate damaged disks only if system events involve disk isolation events or operations.
Disk isolation and disk maintenance are independent of each other. Disk isolation is required for disk maintenance, but cannot guarantee the result of disk maintenance. Local disk maintenance is not supported for all instances. Initiate disk maintenance only when you receive the notification of disk restoration from Alibaba Cloud.

If you want to restore the local disks for an instance, redeploy the instance. However, when the instance is redeployed, data stored on the local disks is lost. For more information, see Redeploy an instance to which local disks are attached.
When the damaged local disk is replaced, only data of the replaced local disk is lost. The data stored in other local disks on the instance is retained. To replace a damaged local disk on an instance, perform the following operations:
1. You are notified when a local disk on an instance is damaged and scheduled to be isolated.
2. Make preparations such as modifying the /etc/fstab configuration file and backing up data.
3. If the system event involves a disk isolation event or operation, authorize the damaged local disk to be isolated.
4. If the system event involves a reboot event or operation, restart the instance.
5. Alibaba Cloud removes the damaged local disk from the host on which your instance resides, inserts a new disk, and then sends you a disk restoration notification.
6. If the system event involves a disk restoration event or operation, authorize the disk to be restored.
7. If the system event involves a reboot event or operation, restart the instance.
Note
To replace a damaged local disk, work together with Alibaba Cloud. For more information, see Isolate damaged local disks in the ECS console and Isolate damaged local disks by using Alibaba Cloud CLI.
The following figure shows the event states supported by system events related to damaged local disks and the transitions between the event states.

Scenario ⑥

In scenario ⑥, you can redeploy the affected instance to a different host or authorize an in-place repair to be performed. Take note of the following items when you authorize an in-place repair to be performed:

In-place repairs cannot ensure zero data loss or a 100% repair success rate. Before you authorize an in-place repair to be performed, make sure that you have backed up all your key business data.
Only specific disks on instances that are equipped with local disks can be repaired offline.
Within the repair window, the instance cannot be started and continues to be billed based on its billing method.
An in-place repair requires 14 business days to complete. Within the repair window, you can redeploy or release the instance that is being repaired to terminate the repair process.

If you want to restore the local disks for an instance, you can redeploy the instance. However, when the instance is redeployed, data stored on the local disks is lost. For more information, see Redeploy an instance to which local disks are attached.
Procedure for handling a SystemMaintenance.StopAndRepair system event:
1. You receive a system event indicating that an instance equipped with local disks needs to be repaired in-place.
2. Use one of the following methods to handle the event:
  - Within the user operation window, stop the instance and authorize an in-place repair to be performed.
  - Wait for the system to stop the instance and repair host hardware.
3. Alibaba Cloud repairs host hardware and sends a repair completion event when the hardware is repaired.
4. Check whether the instance and applications continue to work as expected. If yes, synchronize data based on your business requirements.

For information about the event states supported by SystemMaintenance.StopAndRepair, see System event summary. For the figure that shows the typical transitions between event states, see the States and windows of system events section of the "Overview" topic.

References

You can call the AcceptInquiredSystemEvent operation to accept the default operation for a system event and authorize the system to perform the operation.
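
The following minimal sketch shows how such a call might be issued through Alibaba Cloud CLI. It assumes that the `aliyun` CLI is installed and configured and that AcceptInquiredSystemEvent takes the RegionId and EventId parameters. The function name, region ID, and event ID are hypothetical placeholders. The sketch only builds the argument list and does not run the CLI.

```python
from typing import List


def build_accept_event_command(region_id: str, event_id: str) -> List[str]:
    """Build (but do not run) a CLI command that accepts the default operation of an event.

    Assumption: AcceptInquiredSystemEvent accepts RegionId and EventId.
    """
    if not region_id or not event_id:
        raise ValueError("region_id and event_id are required")
    return [
        "aliyun", "ecs", "AcceptInquiredSystemEvent",
        "--RegionId", region_id,
        "--EventId", event_id,
    ]
```

Because accepting an event authorizes the system to act on the instance, back up data and switch traffic away from the instance before you run the resulting command.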