This topic describes the Alibaba Cloud O&M process and the best practices to follow when a system event occurs on an ECS instance that is equipped with local disks.

Common O&M scenarios

The following three underlying failure scenarios are common for instances that are equipped with local disks:

  • Scenario 1: An instance exception occurs due to a software problem on the physical machine that hosts the instance
    • Impact: Typically, the physical machine that hosts the instance recovers after it is restarted. This is an unexpected restart for the instance.
    • What to do next: No action is required.
  • Scenario 2: An instance exception occurs due to damage to a local disk
    • Impact: Typically, the physical machine that hosts the instance recovers after it is restarted, but the damaged local disk cannot recover.
    • What to do next: You must select a method to replace the damaged local disk.
  • Scenario 3: An instance exception occurs due to hardware damage on the physical machine that hosts the instance
    • Impact: Typically, the physical machine that hosts the instance must be taken offline for repair.
    • What to do next: You must redeploy the instance that is equipped with local disks and migrate the instance to a different physical machine. Synchronize data as needed to restore the instance and the local disks.
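
The three scenarios above can be summarized as a small lookup. The following Python sketch is illustrative only: the `Scenario` enum, the `FOLLOW_UP` mapping, and the `follow_up` function are hypothetical names chosen for this example and are not part of any Alibaba Cloud SDK or API.

```python
from enum import Enum


class Scenario(Enum):
    """Hypothetical labels for the three failure scenarios listed above."""
    HOST_SOFTWARE_PROBLEM = 1   # Scenario 1
    LOCAL_DISK_DAMAGE = 2       # Scenario 2
    HOST_HARDWARE_DAMAGE = 3    # Scenario 3


# Follow-up action per scenario, paraphrased from the "What to do next" items.
FOLLOW_UP = {
    Scenario.HOST_SOFTWARE_PROBLEM: "no action required",
    Scenario.LOCAL_DISK_DAMAGE: "select a method to replace the damaged local disk",
    Scenario.HOST_HARDWARE_DAMAGE: "redeploy the instance and synchronize data as needed",
}


def follow_up(scenario: Scenario) -> str:
    """Return the follow-up action documented for the given scenario."""
    return FOLLOW_UP[scenario]


if __name__ == "__main__":
    for scenario in Scenario:
        print(f"{scenario.name}: {follow_up(scenario)}")
```

A runbook or alert handler could use a lookup of this kind to attach the documented follow-up action to an incident ticket.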

System events on ECS instances that are equipped with local disks

For an ECS instance that is equipped with local disks, system events represent the O&M process that applies when a local disk is damaged. The code of the Block Storage event is ErrorDetected. During the event window, you can select one of the following solutions:

  • Solution: Migrate an instance
    • Description: If you need to urgently restore a local disk and can accept the loss of data on the local disk, you can migrate the instance to a different physical machine to restore the capacity of all data disks, and then remount and reformat the data disks.
    • Instance event code:
      • SystemMaintenance.Redeploy
      • SystemFailure.Redeploy
    • Reference: Redeploy an instance equipped with local disks
  • Solution: Isolate damaged disks
    • Description: Alibaba Cloud replaces the isolated damaged local disks as soon as possible. After the local disks are replaced, Alibaba Cloud sends you a system event that requires the instance to be restarted and the damaged local disks to be replaced. You can respond to the event within the event window.
    • Instance event code:
      • If one of the following event codes is returned, the instance does not need to be restarted:
        • SystemMaintenance.IsolateErrorDisk
        • SystemMaintenance.ReinitErrorDisk
      • If one of the following event codes is returned, the instance must be restarted:
        • SystemMaintenance.RebootAndIsolateErrorDisk
        • SystemMaintenance.RebootAndReinitErrorDisk
    • Reference: Isolate damaged local disks
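
As an illustration of how the instance event codes above might be handled in automation, the following Python sketch maps each code to its solution and restart requirement. The event codes are copied from this topic. The set names, the `classify_event` function, and the shape of the returned dictionary are assumptions made for this example and do not correspond to any Alibaba Cloud SDK call.

```python
# Event codes copied from this topic; the grouping names are hypothetical.
REDEPLOY_CODES = {
    "SystemMaintenance.Redeploy",
    "SystemFailure.Redeploy",
}
ISOLATE_WITHOUT_RESTART_CODES = {
    "SystemMaintenance.IsolateErrorDisk",
    "SystemMaintenance.ReinitErrorDisk",
}
ISOLATE_WITH_RESTART_CODES = {
    "SystemMaintenance.RebootAndIsolateErrorDisk",
    "SystemMaintenance.RebootAndReinitErrorDisk",
}


def classify_event(code: str) -> dict:
    """Return the solution and restart requirement for an instance event code.

    restart_required is None for redeploy events because this topic does not
    state a restart requirement for them.
    """
    if code in REDEPLOY_CODES:
        return {"solution": "migrate the instance", "restart_required": None}
    if code in ISOLATE_WITHOUT_RESTART_CODES:
        return {"solution": "isolate damaged disks", "restart_required": False}
    if code in ISOLATE_WITH_RESTART_CODES:
        return {"solution": "isolate damaged disks", "restart_required": True}
    raise ValueError(f"Event code not covered by this topic: {code}")


if __name__ == "__main__":
    result = classify_event("SystemMaintenance.RebootAndIsolateErrorDisk")
    print(result["solution"], result["restart_required"])
```

Raising an error for unknown codes, rather than guessing, keeps the sketch from silently misclassifying events that this topic does not cover.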
The following figure shows the workflow of isolating a damaged disk and the corresponding event states.

Figure: Workflow of isolating a damaged disk and corresponding event states

References

  • For ECS Bare Metal Instances, you can install the xdragon_hardware_detect_plugin plug-in to regularly check the health status of the local disks. For more information, see Install the monitoring plug-in.
  • For more information about the types of local disks used by ECS, see Local disks.
  • For more information about the ECS instance families that support local storage, see Instance families.