Elastic Compute Service: Isolate damaged local disks

Last Updated: Mar 27, 2024

If a local disk on a physical machine that hosts an Elastic Compute Service (ECS) instance is damaged, the instance remains on the physical machine after the local disk is isolated. This topic describes how to isolate damaged local disks in the ECS console and by using Alibaba Cloud CLI. The procedure described in this topic can be performed only to handle the local disk-related system events of ECS instances.

Background information

Only damaged local disks used by instances of big data instance types can be isolated. You can isolate damaged local disks when you handle the following system events:

  • The Disk:ErrorDetected event, which is triggered when a damage alert is generated for a local disk.

  • The SystemMaintenance.IsolateErrorDisk event, which is triggered when a damaged local disk needs to be isolated due to system maintenance.

  • The SystemMaintenance.RebootAndIsolateErrorDisk event, which is triggered when an instance needs to be restarted and a damaged local disk used by the instance needs to be isolated due to system maintenance.

  • The SystemMaintenance.ReInitErrorDisk event, which is triggered when a damaged local disk needs to be re-initialized due to system maintenance.

  • The SystemMaintenance.RebootAndReInitErrorDisk event, which is triggered when an instance needs to be restarted and a damaged local disk used by the instance needs to be re-initialized due to system maintenance.

For more information, see O&M scenarios and system events for instances equipped with local disks.

Procedure

Isolate damaged local disks in the ECS console

  1. Log on to the ECS console.

  2. In the left-side navigation pane, click Events.

  3. In the left-side navigation pane of the Events page, click Local Disk-based Instance Events.

  4. On the Local Disk-based Instance Events page, click the Local Disk Damaged Events tab.

  5. Find the instance that you want to manage and click Repair in the Actions column.

  6. In the Configurations Modification step, modify the configuration file of the instance. Then, click Next.

    For some Linux instances, the Configurations Modification step is displayed. In this case, follow the on-screen instructions to perform the following operations. In this section, a damaged local disk named /dev/vdd is used as an example.

    1. Connect to the ECS instance.

      For more information, see Connect to a Linux instance by using a password or key.

    2. (Optional) Isolate the read and write operations of the local disk at the application layer.

    3. Add the nofail parameter to the /etc/fstab configuration file of the instance for the local disk.

      /dev/vdd /mnt/vdd ext4 defaults,barrier=0,nofail 0 0

      | Parameter | Description |
      | --- | --- |
      | /dev/vdd | The device name of the local disk, which is the Device value returned by the DescribeInstanceHistoryEvents operation. |
      | /mnt/vdd | The mount point of the local disk, which can be queried by using the mount | grep "/dev/vdd" command. |
      | ext4 | The file system type of the local disk, which can be queried by using the blkid /dev/vdd command. |
      | barrier=0 | The mount option that disables write barriers in the file system. |
      | nofail | The mount option that allows the ECS instance to continue booting even if the local disk specified in the entry does not exist. |

    4. Unmount the local disk.

      umount /dev/vdd

      Important: If you do not unmount the local disk, the device name of the local disk changes after the local disk is isolated and repaired. In this case, applications may read data from or write data to a different disk.

  7. In the Damaged Disk Isolation step, click OK.

    If the next step is not displayed, refresh the page.

  8. (Optional) If the Instance Restart step is displayed, click Restart to restart the instance.

    Note: After the instance is restarted, the isolated damaged local disk is temporarily converted to a 1 MiB dummy hard disk to facilitate subsequent operations. At the application layer, continue to isolate read and write operations on the damaged local disk, and keep the nofail parameter configured in the /etc/fstab file.

  9. After the instance is restarted, click OK in the New Disk Inserting step.

    Wait for Alibaba Cloud to replace the damaged local disk on the physical machine that hosts the instance. In most cases, the replacement process requires up to five business days to complete. After the local disk is replaced, you receive an event that requires you to restore the disk.

  10. After you receive the event, click Restore in the Disk Restoration step.

    If the next step is not displayed, refresh the page.

  11. (Optional) If the Instance Restart step is displayed, click Restart to restart the instance.

  12. After the instance is restarted, click Complete in the Complete step.
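
The /etc/fstab change described in the Configurations Modification step can also be scripted. The following is a minimal sketch, assuming a POSIX shell with awk available on the instance. add_nofail is a hypothetical helper name introduced here for illustration and is not part of any Alibaba Cloud tool. It prints an updated copy of the file instead of editing the file in place, so that the result can be reviewed first.

```shell
# Hypothetical helper: append "nofail" to the mount options of one fstab entry.
# Usage: add_nofail <device> <fstab-file>
# The updated content is written to stdout; the input file is not modified.
add_nofail() {
    awk -v dev="$1" '
        # Leave comment lines and blank lines untouched.
        /^[ \t]*#/ || NF == 0 { print; next }
        # Match the entry by its first field. Skip it if nofail is already set.
        $1 == dev && NF >= 4 && $4 !~ /(^|,)nofail(,|$)/ { $4 = $4 ",nofail" }
        { print }
    ' "$2"
}
```

For example, add_nofail /dev/vdd /etc/fstab > /tmp/fstab.new writes a candidate file to /tmp/fstab.new (an arbitrary path chosen for this example). Compare the candidate with the original before you replace /etc/fstab. Note that awk rewrites a modified entry with single spaces between fields.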

Isolate damaged local disks by using Alibaba Cloud CLI

Before you begin, make sure that an Elastic Compute Service (ECS) instance is created and that Alibaba Cloud CLI is installed on the instance. For information about how to install Alibaba Cloud CLI on different operating systems, see the Alibaba Cloud CLI installation topics.

  1. Call the DescribeInstanceHistoryEvents operation to query system events that are in the Inquiring state in the specified region, and record the return values of the EventId, DiskId, and Device parameters.

    Run the following command in Alibaba Cloud CLI:

    aliyun ecs DescribeInstanceHistoryEvents \
    --RegionId <TheRegionId> \
    --InstanceEventCycleStatus.1 Inquiring

    The following code shows a sample response in the JSON format:

    {
      "InstanceSystemEventSet": {
        "InstanceSystemEventType": [
          {
            "InstanceId": "i-2ze3tphuqvc93ci****3",
            "EventId": "e-2ze9y****wtqcvai68rl",
            "EventType": {
              "Code": 3,
              "Name": "SystemMaintenance.IsolateErrorDisk"
            },
            "EventCycleStatus": {
              "Code": 28,
              "Name": "Inquiring"
            },
            "EventPublishTime": "2017-11-30T06:32:31Z",
            "ExtendedAttribute" : {
              "DiskId": "d-disk1",
              "Device": "/dev/xvda"
            }
          }
        ]
      },
      "PageSize": 10,
      "PageNumber": 1,
      "TotalCount": 1,
      "RequestId": "02EA76D3-5A2A-44EB-****-8901881D8707"
    }

  2. Log on to the ECS instance and prepare the instance before you isolate the damaged local disk.

    1. Connect to the ECS instance.

      For more information, see Connect to a Linux instance by using a password or key.

    2. (Optional) Isolate the read and write operations of the local disk at the application layer.

    3. If the instance is a Linux instance, add the nofail parameter to the /etc/fstab configuration file of the instance for the local disk.

      /dev/vdd /mnt/vdd ext4 defaults,barrier=0,nofail 0 0

      | Parameter | Description |
      | --- | --- |
      | /dev/vdd | The device name of the local disk, which is the Device value returned by the DescribeInstanceHistoryEvents operation. |
      | /mnt/vdd | The mount point of the local disk, which can be queried by using the mount | grep "/dev/vdd" command. |
      | ext4 | The file system type of the local disk, which can be queried by using the blkid /dev/vdd command. |
      | barrier=0 | The mount option that disables write barriers in the file system. |
      | nofail | The mount option that allows the ECS instance to continue booting even if the local disk specified in the entry does not exist. |

    4. Unmount the local disk.

      umount /dev/vdd

      Important: If you do not unmount the local disk, the device name of the local disk changes after the local disk is isolated and repaired. In this case, applications may read data from or write data to a different disk.

  3. Call the AcceptInquiredSystemEvent operation to respond to the specified system event.

    Run the following command in Alibaba Cloud CLI:

    aliyun ecs AcceptInquiredSystemEvent --RegionId <TheRegionId> --EventId <TheEventId>

  4. Determine whether to restart the instance.

    • If the event type is SystemMaintenance.IsolateErrorDisk:

      • If only the RequestId value is returned, you do not need to restart the instance.

      • If the returned code is SwitchToOffline.OnlineIsolateFail, you must restart the instance.

    • If the event type is SystemMaintenance.RebootAndIsolateErrorDisk, you must restart the instance after you call the AcceptInquiredSystemEvent operation.

    To restart the instance, run the following command in Alibaba Cloud CLI:

    aliyun ecs RebootInstance --InstanceId <TheInstanceId>

    Note: After the instance is restarted, the isolated damaged local disk is temporarily converted to a 1 MiB dummy hard disk to facilitate subsequent operations. At the application layer, continue to isolate read and write operations on the damaged local disk, and keep the nofail parameter configured in the /etc/fstab file.

  5. Wait until Alibaba Cloud replaces the damaged local disk on the physical machine and publishes the SystemMaintenance.ReInitErrorDisk or SystemMaintenance.RebootAndReInitErrorDisk event. In most cases, the replacement process requires up to five business days to complete.

  6. Call the AcceptInquiredSystemEvent operation again to respond to the new system event. The local disk enters the Re-initializing state.

    Run the following command in Alibaba Cloud CLI:

    aliyun ecs AcceptInquiredSystemEvent --RegionId <TheRegionId> --EventId <TheEventId>

  7. Determine whether to restart the instance.

    • If the event type is SystemMaintenance.ReInitErrorDisk:

      • If only the RequestId value is returned, you do not need to restart the instance.

      • If the returned code is SwitchToOffline.OnlineReInitFail, you must restart the instance.

    • If the event type is SystemMaintenance.RebootAndReInitErrorDisk, you must restart the instance after you call the AcceptInquiredSystemEvent operation.

    To restart the instance, run the following command in Alibaba Cloud CLI:

    aliyun ecs RebootInstance --InstanceId <TheInstanceId>
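
The restart decisions in the two "Determine whether to restart the instance" steps follow the same pattern and can be expressed as one small function. The following is a minimal sketch, assuming a POSIX shell. needs_restart is a hypothetical helper name, not part of Alibaba Cloud CLI. The sketch also assumes that you captured the text returned by AcceptInquiredSystemEvent in a variable and that a failure code such as SwitchToOffline.OnlineIsolateFail appears somewhere in that text. This topic does not specify how the CLI surfaces the code, so verify this in your environment.

```shell
# Hypothetical helper: decide whether the instance must be restarted.
# $1: event name reported by DescribeInstanceHistoryEvents
# $2: raw text returned by AcceptInquiredSystemEvent (optional)
# Prints "yes" or "no". Returns 2 for an event name that is not handled here.
needs_restart() {
    event=$1
    response=${2:-}
    case "$event" in
        SystemMaintenance.RebootAndIsolateErrorDisk|SystemMaintenance.RebootAndReInitErrorDisk)
            # These events always require a restart after they are accepted.
            echo yes ;;
        SystemMaintenance.IsolateErrorDisk)
            # A restart is needed only if online isolation failed.
            case "$response" in
                *SwitchToOffline.OnlineIsolateFail*) echo yes ;;
                *) echo no ;;
            esac ;;
        SystemMaintenance.ReInitErrorDisk)
            # A restart is needed only if online re-initialization failed.
            case "$response" in
                *SwitchToOffline.OnlineReInitFail*) echo yes ;;
                *) echo no ;;
            esac ;;
        *)
            echo "unhandled event: $event" >&2
            return 2 ;;
    esac
}
```

If the function prints yes, run the RebootInstance command from the preceding step. Matching on a substring keeps the sketch independent of the exact response layout, at the cost of being less strict than parsing the response.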

Result

A few minutes after the damaged local disk is replaced, the local disk damaged event disappears.

What to do next

After the damaged disk is isolated, check the status of the instance and local disk. The replaced local disk is restored to its original capacity, and you can reformat data disks. For more information, see Initialize a data disk whose size does not exceed 2 TiB on a Windows instance or Initialize a data disk whose size does not exceed 2 TiB on a Linux instance.