Elastic Compute Service: O&M scenarios and system events for instances equipped with local disks

Last Updated: Mar 15, 2024

Local disks do not provide high availability of data. To improve the experience of using local disks, Alibaba Cloud provides various O&M capabilities that help you stay informed of and handle exceptions that occur on your local disks. This topic describes common O&M scenarios and system events for Elastic Compute Service (ECS) instances equipped with local disks.

Common O&M scenarios and related system events

The following figure shows the O&M scenarios that are common to instances equipped with local disks and system events related to the instances.

image
Note

You can change automatic recovery modes (also called maintenance actions) for instances by modifying the maintenance attributes of the instances. For example, if the maintenance action for an instance is Automatically Re-deploy, the instance is automatically redeployed. For more information, see Modify instance maintenance attributes.

For ECS bare metal instances, you can install the xdragon_hardware_detect_plugin plug-in to check the health status of local disks on the instances on a regular basis. For more information, see Install the monitoring plug-in.

For information about the system events triggered in the scenarios shown in the preceding figure, see the scenario sections later in this topic.

Note

To ensure business continuity, we recommend that you back up data for affected ECS instances and switch to other instances before you execute O&M tasks on the instances. For example, you can divert traffic away from the affected ECS instances, disassociate the ECS instances from Server Load Balancer (SLB) instances, and back up disk data of the ECS instances.

Scenario ①

Procedure for handling a SystemMaintenance.Reboot system event:

  1. You are notified when an instance is scheduled to be restarted.

  2. Use one of the following methods to handle the event:

    • If you do not want the instance to be restarted within the scheduled time period, specify a different time at which you want the instance to be restarted. For more information, see Modify the scheduled restart time.

    • Restart the instance within the user operation window. For more information, see Restart an instance.

      Note

      You must restart the instance in the ECS console or by calling the RebootInstance operation. You cannot restart the instance from within the instance.

    • Wait for the instance to be automatically restarted.

  3. Check whether the instance and applications continue to work as expected.

For information about the event states supported by SystemMaintenance.Reboot, see Summary. For the figure that shows the typical transitions between event states, see the States and windows of system events section in the "Overview" topic.
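The choice among the three methods in step 2 can be sketched as a small decision helper. This is a minimal illustration, not an Alibaba Cloud API: the function name `choose_reboot_handling`, its parameters, and the returned labels are hypothetical. It also assumes that the user operation window ends at the scheduled restart time.

```python
from datetime import datetime, timedelta, timezone


def choose_reboot_handling(now, scheduled_time, window_ok, can_restart_now):
    """Pick one of the three handling methods for a scheduled restart.

    now, scheduled_time: timezone-aware datetimes.
    window_ok: True if the scheduled time is acceptable for your business.
    can_restart_now: True if traffic is diverted and data is backed up.
    All names and return labels are hypothetical, for illustration only.
    """
    if now >= scheduled_time:
        # Assumption: after the scheduled time the system restarts the instance.
        return "wait_for_automatic_restart"
    if not window_ok:
        # Corresponds to "Modify the scheduled restart time".
        return "reschedule"
    if can_restart_now:
        # Restart in the ECS console or by calling RebootInstance,
        # not from within the instance.
        return "restart_now"
    return "wait_for_automatic_restart"


# Example: the restart is scheduled two hours from now and the time is acceptable.
now = datetime(2024, 3, 15, 12, 0, tzinfo=timezone.utc)
scheduled = now + timedelta(hours=2)
print(choose_reboot_handling(now, scheduled, window_ok=True, can_restart_now=True))
```

Whichever branch is taken, step 3 still applies: check the instance and applications after the restart.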

Scenario ②

Procedure for handling a SystemMaintenance.Redeploy system event:

  1. You are notified when an instance equipped with local disks is scheduled to be redeployed.

  2. Make preparations such as modifying the /etc/fstab configuration file and backing up data.

    For more information about the required preparations, see the "Prerequisites" section in the Redeploy an instance equipped with local disks topic.

  3. Use one of the following methods to handle the event:

    Note

    When an instance equipped with local disks is redeployed, the instance is migrated to a different physical server, and the local disks of the instance are re-initialized and lose all their data.

  4. Check whether the instance and applications continue to work as expected. If they do, synchronize data based on your business requirements.

For information about the event states supported by SystemMaintenance.Redeploy, see Summary. For the figure that shows the typical transitions between event states, see the States and windows of system events section in the "Overview" topic.
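As an illustration of the /etc/fstab preparation in step 2, the following sketch adds the standard `nofail` mount option to the entries of selected devices, so that a Linux instance can still boot when a re-initialized local disk no longer carries the expected file system. Treating `nofail` as the required change is an assumption; the "Prerequisites" section of the Redeploy an instance equipped with local disks topic is the authoritative source. The function name and the sample device paths are hypothetical.

```python
def add_nofail(fstab_text, devices):
    """Return fstab_text with 'nofail' added to the options of the given devices.

    Works on text only; writing the result back to /etc/fstab is left to
    the caller. Modified lines are rejoined with single spaces.
    """
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        # Leave blank lines, comments, malformed entries, and other devices as-is.
        if (not fields or fields[0].startswith("#") or len(fields) < 4
                or fields[0] not in devices):
            out.append(line)
            continue
        options = fields[3].split(",")
        if "nofail" not in options:
            options.append("nofail")
        fields[3] = ",".join(options)
        out.append(" ".join(fields))
    return "\n".join(out) + "\n"


# Hypothetical sample: /dev/vdb1 stands for a local disk partition.
sample = (
    "/dev/vda1 / ext4 defaults 1 1\n"
    "/dev/vdb1 /mnt/local ext4 defaults 0 0\n"
)
print(add_nofail(sample, {"/dev/vdb1"}), end="")
```

Back up the original file before you apply a change like this, and confirm the device names on your own instance.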

Scenario ③

Procedure for handling a SystemFailure.Reboot system event:

  1. The system restarts an instance due to a system error.

  2. You are notified when the instance is being restarted.

    Wait until the instance is restarted without manual intervention.

  3. Check whether the instance and applications continue to work as expected.

For information about the event states supported by SystemFailure.Reboot, see ECS system events. For the figure that shows the typical transitions between event states, see States and windows of system events.

Scenario ④

Procedure for handling a SystemFailure.Redeploy system event:

  1. You are notified when an instance equipped with local disks is scheduled to be redeployed.

  2. Make preparations such as modifying the /etc/fstab configuration file and backing up data.

    For information about the required preparations, see the "Prerequisites" section in the Redeploy an instance equipped with local disks topic.

  3. Use one of the following methods to handle the event:

    Note

    When an instance equipped with local disks is redeployed, the instance is migrated to a different physical server, and the local disks of the instance are re-initialized and lose all their data.

  4. Check whether the instance and applications continue to work as expected. If they do, synchronize data based on your business requirements.

For information about the event states supported by SystemFailure.Redeploy, see Summary. For the figure that shows the typical transitions between event states, see States and windows of system events.

Scenario ⑤

In scenario ⑤, a local disk on the host of an instance is damaged. You can redeploy the instance to another host or replace the disk. Take note of the following items when you replace a damaged disk:

  • Only specific disks of local disk instances can be isolated. You can isolate damaged disks only if disk isolation is included in the operations of system events.

  • Disk isolation and disk maintenance are independent of each other. Disk isolation is required for disk maintenance, but cannot guarantee the result of disk maintenance. Local disk maintenance is not supported for all instances. You can initiate disk maintenance only when you receive the notification of disk restoration from Alibaba Cloud.

  • You can restore the local disks for an instance by redeploying the instance. However, when the instance is redeployed, data stored on the local disks is lost. For more information, see Redeploy an instance equipped with local disks.

  • When the damaged local disk is replaced, only data of the replaced local disk is lost. The data stored in other local disks on the instance is retained. To replace a damaged local disk on an instance, perform the following operations:

    1. You are notified when a local disk on an instance is damaged and scheduled to be isolated.

    2. Make preparations such as modifying the /etc/fstab configuration file and backing up data.

    3. If the name of the system event contains IsolateErrorDisk, authorize the isolation of damaged disks.

    4. If the name of the system event contains Reboot, you must restart the instance.

    5. Alibaba Cloud removes the damaged local disk from the host on which your instance resides, inserts a new disk, and then sends you a disk restoration notification.

    6. If the system event contains disk restoration or related operations, authorize disk restoration.

    7. If the name of the system event contains Reboot, you must restart the instance.

    Note

    To replace a damaged local disk, you must work together with Alibaba Cloud. For more information, see Isolate damaged local disks in the ECS console and Isolate damaged local disks by using Alibaba Cloud CLI.

    The following figure shows the event states that are supported by damaged disk-related system events and the transitions between the event states.

    image
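The name-based checks in the disk replacement procedure (steps 3 and 4) can be sketched as a small helper that derives the required user actions from the event name. This is an illustration only: the function name and the action labels are hypothetical, and the event name used in the example is a sample value, not a list of the event types that Alibaba Cloud sends.

```python
def disk_event_actions(event_name):
    """List the user actions implied by a damaged-disk system event name.

    Hypothetical helper: checks only the two substrings named in the
    procedure, in the order in which the steps appear.
    """
    actions = []
    if "IsolateErrorDisk" in event_name:
        # Step 3: authorize the isolation of the damaged disk.
        actions.append("authorize_disk_isolation")
    if "Reboot" in event_name:
        # Step 4: restart the instance.
        actions.append("restart_instance")
    return actions


# Sample event name, assumed for illustration.
print(disk_event_actions("SystemMaintenance.RebootAndIsolateErrorDisk"))
```

The disk restoration step (step 6) is not covered here because the procedure does not name a substring for it.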

Scenario ⑥

In scenario ⑥, you can redeploy the instance to another host or authorize an in-place repair to be performed. Take note of the following items when you authorize an in-place repair to be performed:

  • In-place repairs cannot ensure zero data loss or a 100% repair success rate. Before you authorize an in-place repair to be performed, make sure that you have backed up all your key business data.

  • Only specific disks on instances that are equipped with local disks can be repaired offline.

  • Within the repair window, the instance cannot be started and continues to be billed based on its billing method.

  • An in-place repair spans 14 business days. Within the repair window, you can redeploy or release the instance that is being repaired to terminate the repair process.

  • You can restore the local disks for an instance by redeploying the instance. However, when the instance is redeployed, data stored on the local disks is lost. For more information, see Redeploy an instance equipped with local disks.

  • Procedure for handling a SystemMaintenance.StopAndRepair system event:

    1. You receive a system event indicating that an instance equipped with local disks needs to be repaired in place.

    2. Use one of the following methods to handle the event:

      • Within the user operation window, stop the instance and authorize an in-place repair to be performed.

      • Wait for the system to stop the instance and repair host hardware.

    3. Alibaba Cloud repairs host hardware and sends a repair completion event when the hardware is repaired.

    4. Check whether the instance and applications continue to work as expected. If they do, synchronize data based on your business requirements.

For information about the event states supported by SystemMaintenance.StopAndRepair, see Summary. For the figure that shows the typical transitions between event states, see the States and windows of system events section in the "Overview" topic.
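The mapping between event types and the scenarios in this topic can be summarized as a lookup table. The following sketch is one possible way to route an incoming event to the matching procedure. The names `EVENT_SCENARIOS` and `scenario_for` are hypothetical, and matching scenario ⑤ by the substring `IsolateErrorDisk` is an assumption based on the disk replacement procedure.

```python
# Event types named in this topic, mapped to their scenario numbers.
EVENT_SCENARIOS = {
    "SystemMaintenance.Reboot": 1,
    "SystemMaintenance.Redeploy": 2,
    "SystemFailure.Reboot": 3,
    "SystemFailure.Redeploy": 4,
    "SystemMaintenance.StopAndRepair": 6,
}


def scenario_for(event_type):
    """Return the scenario number for an event type, or None if unknown."""
    if event_type in EVENT_SCENARIOS:
        return EVENT_SCENARIOS[event_type]
    # Assumption: damaged-disk events (scenario 5) carry this substring.
    if "IsolateErrorDisk" in event_type:
        return 5
    return None


print(scenario_for("SystemFailure.Redeploy"))
```

Returning `None` for unrecognized event types keeps the helper from guessing at events that this topic does not cover.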