When a damaged local disk is isolated, the corresponding ECS instance remains on the same physical machine. This topic describes how to use Alibaba Cloud CLI to call ECS API operations to isolate damaged local disks. The procedure applies only to ECS instances on which system events about local disks occur. You can also use the SDKs or call the relevant API operations in Alibaba Cloud OpenAPI Explorer to complete the operations.

Prerequisites

An Elastic Compute Service (ECS) instance is created. Alibaba Cloud CLI is installed on the instance. For more information about how to install Alibaba Cloud CLI on different operating systems, see the installation topics in the Alibaba Cloud CLI documentation.

Background information

The system event codes that correspond to the isolation options of a damaged disk vary with the stage of events. For more information, see O&M scenarios and system events for instances equipped with local disks.
  • Before the damaged disk is isolated, the system event code is SystemMaintenance.IsolateErrorDisk. If the instance must be restarted, the code is SystemMaintenance.RebootAndIsolateErrorDisk.
  • After the damaged disk is isolated but before a new disk is re-initialized, the system event code is SystemMaintenance.ReInitErrorDisk. If the instance must be restarted, the code is SystemMaintenance.RebootAndReInitErrorDisk.
    Important: Re-initialization clears the data on the isolated local disk.
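
The four event codes above can be summarized in a small lookup table. The following Python sketch is illustrative only: the event codes come from the list above, while the stage labels ("isolate", "reinit"), the `EVENT_CODES` table, and the `describe_event` helper are hypothetical names chosen for this example and are not part of any Alibaba Cloud SDK.

```python
# Event codes are taken from the list above. The stage labels and the
# boolean flag are naming choices made for this sketch.
# Flag meaning: True if the code itself always requires an instance restart.
EVENT_CODES = {
    "SystemMaintenance.IsolateErrorDisk": ("isolate", False),
    "SystemMaintenance.RebootAndIsolateErrorDisk": ("isolate", True),
    "SystemMaintenance.ReInitErrorDisk": ("reinit", False),
    "SystemMaintenance.RebootAndReInitErrorDisk": ("reinit", True),
}


def describe_event(code):
    """Return (stage, restart_always_required) for a local-disk event code."""
    try:
        return EVENT_CODES[code]
    except KeyError:
        raise ValueError(f"not a local-disk isolation event code: {code}")


if __name__ == "__main__":
    for name in EVENT_CODES:
        print(name, describe_event(name))
```

A False flag means only that the code does not require a restart by itself. A restart can still become necessary, depending on the response returned when the event is accepted.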

Procedure

  1. Call the DescribeInstanceHistoryEvents operation to query system events within the specified region that are in the Inquiring state, and record the return values of EventId, DiskId, and Device.
    Run the following command in Alibaba Cloud CLI:
    aliyun ecs DescribeInstanceHistoryEvents \
    --RegionId <TheRegionId> \
    --InstanceEventCycleStatus.1 Inquiring
    Sample response in the JSON format:
    {
      "InstanceSystemEventSet": {
        "InstanceSystemEventType": [
          {
            "InstanceId": "i-2ze3tphuqvc93ci****3",
            "EventId": "e-2ze9y****wtqcvai68rl",
            "EventType": {
              "Code": 3,
              "Name": "SystemMaintenance.IsolateErrorDisk"
            },
            "EventCycleStatus": {
              "Code": 28,
              "Name": "Inquiring"
            },
            "EventPublishTime": "2017-11-30T06:32:31Z",
            "ExtendedAttribute": {
              "DiskId": "d-disk1",
              "Device": "/dev/xvda"
            }
          }
        ]
      },
      "PageSize": 10,
      "PageNumber": 1,
      "TotalCount": 1,
      "RequestId": "02EA76D3-5A2A-44EB-****-8901881D8707"
    }
  2. Log on to the ECS instance to make preparations before you isolate the damaged local disk.
    1. Connect to the ECS instance. For more information, see Connection methods.
    2. Optional: Isolate read and write operations on the local disk at the application layer.
    3. If the instance is a Linux instance, add the nofail option to the entry for the local disk in the /etc/fstab configuration file of the instance.
      /dev/vdd /mnt/vdd ext4 defaults,barrier=0,nofail 0 0
      Parameter   Description
      /dev/vdd    The device name of the local disk, which is the Device value returned by the DescribeInstanceHistoryEvents operation.
      /mnt/vdd    The mount point of the local disk, which can be queried by running the mount | grep "/dev/vdd" command.
      ext4        The file system type of the local disk, which can be queried by running the blkid /dev/vdd command.
      barrier=0   The mount option that disables barriers in the file system.
      nofail      The mount option that keeps the boot sequence of the ECS instance from being interrupted if the local disk listed in /etc/fstab does not exist.
    4. Unmount the local disk.
      umount /dev/vdd
      Important: If you do not unmount the local disk, its device name changes after the disk is isolated and repaired. In this case, applications may read from or write to a different disk.
  3. Call the AcceptInquiredSystemEvent operation to respond to the specified system event.
    Run the following command in Alibaba Cloud CLI:
    aliyun ecs AcceptInquiredSystemEvent --RegionId <TheRegionId> --EventId <TheEventId>
  4. Determine whether to restart the instance.
    • When the event code is SystemMaintenance.IsolateErrorDisk:
      • If only the RequestId value is returned, you do not need to restart the instance.
      • If the return value of code is SwitchToOffline.OnlineIsolateFail, you must restart the instance.
    • When the event code is SystemMaintenance.RebootAndIsolateErrorDisk, you must restart the instance after you call the AcceptInquiredSystemEvent operation.
    To restart the instance, run the following command in Alibaba Cloud CLI:
    aliyun ecs RebootInstance --InstanceId <TheInstanceId>
    Note: After the instance is restarted, the isolated damaged local disk is temporarily converted to a 1 MiB dummy disk to facilitate subsequent operations. At the application layer, you must continue to isolate read and write operations on the damaged local disk, and you must keep the nofail parameter configured in the /etc/fstab file.
  5. Wait until Alibaba Cloud replaces the damaged local disk on the physical machine and publishes the SystemMaintenance.ReInitErrorDisk or SystemMaintenance.RebootAndReInitErrorDisk event. This process takes one to five days.
  6. Call the AcceptInquiredSystemEvent operation again to respond to the system event. The local disk enters the re-initializing state.
    Run the following command in Alibaba Cloud CLI:
    aliyun ecs AcceptInquiredSystemEvent --RegionId <TheRegionId> --EventId <TheEventId>
  7. Determine whether to restart the instance.
    • When the event code is SystemMaintenance.ReInitErrorDisk:
      • If only the RequestId value is returned, you do not need to restart the instance.
      • If the return value of code is SwitchToOffline.OnlineReInitFail, you must restart the instance.
    • When the event code is SystemMaintenance.RebootAndReInitErrorDisk, you must restart the instance after you call the AcceptInquiredSystemEvent operation.
    To restart the instance, run the following command in Alibaba Cloud CLI:
    aliyun ecs RebootInstance --InstanceId <TheInstanceId>
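
The bookkeeping in the procedure above (recording EventId, DiskId, and Device, adding nofail to an /etc/fstab entry, and deciding whether a restart is needed) can be sketched in Python. This is a minimal illustration, not an official tool. The helper names `pending_disk_events`, `add_nofail`, and `needs_restart` are hypothetical, the sample response is a trimmed copy of the one shown in step 1, and the assumption that the failure code appears in a `Code` (or `code`) field of the AcceptInquiredSystemEvent response is based only on the wording of steps 4 and 7.

```python
import json

# Trimmed copy of the sample response shown in step 1.
SAMPLE_RESPONSE = """
{
  "InstanceSystemEventSet": {
    "InstanceSystemEventType": [
      {
        "InstanceId": "i-2ze3tphuqvc93ci****3",
        "EventId": "e-2ze9y****wtqcvai68rl",
        "EventType": {"Code": 3, "Name": "SystemMaintenance.IsolateErrorDisk"},
        "EventCycleStatus": {"Code": 28, "Name": "Inquiring"},
        "ExtendedAttribute": {"DiskId": "d-disk1", "Device": "/dev/xvda"}
      }
    ]
  },
  "TotalCount": 1
}
"""

# Return codes named in steps 4 and 7 that indicate a restart is required.
OFFLINE_FALLBACK_CODES = {
    "SwitchToOffline.OnlineIsolateFail",
    "SwitchToOffline.OnlineReInitFail",
}


def pending_disk_events(response_text):
    """Extract the values that step 1 asks you to record from the JSON text."""
    body = json.loads(response_text)
    events = body.get("InstanceSystemEventSet", {}).get("InstanceSystemEventType", [])
    records = []
    for event in events:
        extended = event.get("ExtendedAttribute", {})
        records.append({
            "InstanceId": event.get("InstanceId"),
            "EventId": event.get("EventId"),
            "EventCode": event.get("EventType", {}).get("Name"),
            "DiskId": extended.get("DiskId"),
            "Device": extended.get("Device"),
        })
    return records


def add_nofail(fstab_line):
    """Return an /etc/fstab entry with nofail appended to its mount options."""
    fields = fstab_line.split()
    if len(fields) < 4:
        raise ValueError("expected at least device, mount point, type, and options")
    options = fields[3].split(",")
    if "nofail" not in options:
        options.append("nofail")
    fields[3] = ",".join(options)
    return " ".join(fields)


def needs_restart(event_code, accept_response):
    """Apply the rules of steps 4 and 7 to a parsed AcceptInquiredSystemEvent response."""
    if event_code.startswith("SystemMaintenance.RebootAnd"):
        return True
    # Assumption: the failure code is returned in a "Code" or "code" field.
    returned = accept_response.get("Code") or accept_response.get("code")
    return returned in OFFLINE_FALLBACK_CODES


if __name__ == "__main__":
    for record in pending_disk_events(SAMPLE_RESPONSE):
        print(record)
    print(add_nofail("/dev/vdd /mnt/vdd ext4 defaults,barrier=0 0 0"))
```

In practice, the JSON text would come from the output of the aliyun ecs DescribeInstanceHistoryEvents command in step 1, and the dictionary passed to `needs_restart` from the parsed output of the AcceptInquiredSystemEvent command.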

What to do next

After the damaged disk is isolated, check the status of the instance and local disk. The replaced local disk is restored to its original capacity, and you can reformat data disks. For more information, see Initialize a data disk up to 2 TiB in size on a Windows instance or Initialize a data disk whose size does not exceed 2 TiB on a Linux instance.