Repair the damaged Linux system disk of an ECS instance by using the self-rescue solution

If the Linux system disk of an Elastic Compute Service (ECS) instance is damaged and the instance cannot start, you can use the one-click repair feature provided by CloudOps Orchestration Service (OOS) to repair the disk.

Prerequisites

A Resource Access Management (RAM) user is created, and the AliyunOOSFullAccess, AliyunROSFullAccess, AliyunECSFullAccess, and AliyunVPCFullAccess policies are attached to the RAM user. For more information, see Create a RAM user and Policy overview.

Usage notes

Scenarios

In certain cases, the Linux operating system of an ECS instance fails to start. When this happens, the instance may still appear in the Running state in the ECS console, but the applications deployed on the instance are inaccessible. The instance cannot be pinged, and you cannot connect to it by using Workbench or SSH. If you connect to the instance by using Virtual Network Computing (VNC) in the ECS console, you may see one of the following error messages:

unexpected inconsistency;RUN fsck MANUALLY

Give root password for maintenance (or type CTRL-D to continue)

Enter 'help' for a list of built-in commands.
(initramfs)

Possible causes include, but are not limited to, the following:

The instance is forcibly shut down or restarted.
The instance suddenly breaks down, resulting in inconsistent file system data.
You detached a data disk but did not remove its mount entry from the /etc/fstab file.
The /etc/fstab file is missing or damaged.
The initrd file is damaged.
The file system is damaged due to other reasons.

Solution overview

The self-rescue solution in this topic uses an OOS template that has been tested by Alibaba Cloud. This is an automated solution for one-click repair. The OOS template involves the following steps:

Create an image backup for the instance to be repaired.
Detach the system disk from the instance, and attach the system disk to a temporary instance.
Check and repair the system disk on the temporary instance.
After the system disk is repaired, reattach the system disk to the original instance and try to restart the instance.
Release the temporary instance.
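
The order of these steps, and why cleanup matters, can be sketched in Python. This is only an illustration and not the OOS template itself: the step names and the actions mapping are hypothetical, and the try/finally cleanup is a design choice of this sketch rather than documented template behavior.

```python
from typing import Callable, Dict, List


def run_rescue(instance_id: str, actions: Dict[str, Callable[[str], None]]) -> List[str]:
    """Run the five rescue steps in order and return the names of the steps that ran.

    `actions` maps a step name to a callable; every name here is hypothetical.
    """
    done: List[str] = []

    def step(name: str) -> None:
        actions[name](instance_id)
        done.append(name)

    # 1. Back up first, so that later changes can be rolled back.
    step("create_image_backup")
    # 2. Move the system disk from the original instance to the temporary instance.
    step("move_disk_to_helper")
    try:
        # 3. Check and repair the disk on the temporary instance.
        step("repair_disk")
    finally:
        # 4. Attempt to reattach the disk and restart, even if the repair raised.
        step("move_disk_back_and_start")
        # 5. Release the temporary instance.
        step("release_helper")
    return done
```

In this sketch, a failed repair still triggers the reattach and release steps before the error propagates, which is the situation that the manual procedure later in this topic handles when an execution is interrupted.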

Supported operating systems

CentOS: 7.2 64-bit, 7.3 64-bit, 7.4 64-bit, 7.5 64-bit, 7.6 64-bit, 7.7 64-bit, and 8.0 64-bit
Debian: 8.9 64-bit, 8.11 64-bit, 9.8 64-bit, 9.9 64-bit, and 9.11 64-bit
openSUSE: 42.3 64-bit and 15.1 64-bit
SUSE Linux Enterprise Server: 12 SP4 64-bit and 12 SP2 64-bit
Alibaba Cloud Linux: 2.1903 64-bit
Ubuntu: 18.04 64-bit

Important

Linux startup failures can have many causes. The self-rescue solution is not guaranteed to repair every instance that fails to start.
During the repair, a temporary instance is created and incurs fees. The fee is generally less than USD 0.2.
The self-rescue solution modifies the fstab and initrd files of the system. Before the modification, an image backup of the instance to be repaired is automatically created so that you can restore data later.
You are charged for the image backup. For more information, see Snapshots. After the instance is repaired, you can delete the image backup to reduce costs.

Procedure

Log on to the CloudOps Orchestration Service console.
In the left-side navigation pane, choose Automated Task > Public Template. On the Public Template page, search for the ACS-ECS-RescueUnreachableInstance-Linux template.
Click Create Execution.
On the Create Task page, click Next Step: Parameter Settings.
In the Parameter Settings step, set the following parameters:
- UnreachableInstanceId: required. The ID of the instance whose system disk is to be repaired.
- CredentialType: required. The type of the authentication credential that is used to reattach the repaired system disk to the original instance. You can set this parameter to KeyPairName or Password.
- Credential: required. The authentication credential. If you set the CredentialType parameter to KeyPairName, specify the name of a key pair. If you set the CredentialType parameter to Password, specify a password.
- ImagePrefix: optional. The prefix for the name of the image backup. The default prefix is OOSRescueBackup-.
- HelperInstanceTypes: optional. The type of the temporary instance to be created. By default, the instance type with the lowest unit price is selected from the provided instance types.
- OOSAssumeRole: Select Use Existing Permissions of Current Account.
Click Next Step: OK. In the OK step, click Create.
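
If you prepare the same parameters outside the console, for example to review them or to pass them to a script, they can be expressed as a JSON object. The following Python sketch only validates and serializes the parameters described above. The function name build_parameters and the values i-example and my-key are placeholders, and how the JSON is submitted to OOS is outside the scope of this sketch.

```python
import json
from typing import Optional

# Template name as shown on the Public Template page.
TEMPLATE_NAME = "ACS-ECS-RescueUnreachableInstance-Linux"


def build_parameters(instance_id: str, credential_type: str, credential: str,
                     image_prefix: Optional[str] = None) -> str:
    """Return the template parameters as a JSON string."""
    if credential_type not in ("KeyPairName", "Password"):
        raise ValueError("CredentialType must be KeyPairName or Password")
    if not instance_id or not credential:
        raise ValueError("UnreachableInstanceId and Credential are required")
    params = {
        "UnreachableInstanceId": instance_id,
        "CredentialType": credential_type,
        "Credential": credential,
    }
    # Optional parameters are left out so that the template defaults apply.
    if image_prefix is not None:
        params["ImagePrefix"] = image_prefix
    return json.dumps(params, sort_keys=True)


print(build_parameters("i-example", "KeyPairName", "my-key"))
```

Rejecting an unknown CredentialType before the execution is created avoids starting a repair that cannot complete.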

You can view the execution status of the OOS template on the execution details page and view the execution status of the repair script by using the rtCommandOutput parameter. The execution takes about 5 to 10 minutes. If the template is successfully executed, the instance has been repaired and is in the Running state.

(Optional) Attach the system disk and repair the instance

If the template execution is canceled or interrupted, the system disk of the instance may not be reattached. In this case, the instance has no system disk. To reattach the system disk, perform the following steps:

Log on to the ECS console and find the instance to be repaired.
Click the instance ID to go to the instance details page.
Click the Block Storage (Disks) tab. On this tab, you can view the ID of the system disk on the Cloud Disk tab.
Click Attach Cloud Disk.
The system disk ID, which is prefixed with d-bp, is displayed.
Copy the system disk ID and paste it into the search box to search for the system disk.
- If the disk ID is found:
  1. Select the disk ID.
  2. Specify the logon credential. You can use a key pair or a custom password as the logon credential. If you use a custom password as the logon credential, specify the logon password and confirm the password.
  3. Click OK. Then, click Attach.
  4. After the disk is attached, the instance status changes to Stopped.
- If the disk ID is not found:
  1. View the execution details of the OOS template, and find the HelperInstanceId parameter in the output of the untilStackReady task.
  2. Copy the instance ID that is prefixed with i-.
  3. Search for the instance ID in the ECS console, and release the temporary instance.
  4. After the temporary instance is released, repeat the preceding steps: copy the system disk ID that is prefixed with d-bp, paste it into the search box to find the system disk, and then attach the disk to the instance to be repaired.

Implementation logic

Repair the /etc/fstab file

Check whether the /etc/fstab file exists. If the /etc/fstab file exists, back up the file.
Repair the /etc/fstab file. If the /etc/fstab file does not exist or fails to be parsed, create the default /etc/fstab file.
If the nofail mount option is not set for an entry, add it so that a missing device does not block startup.
If the file system check (fsck) at startup is enabled for an entry, disable it.
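
The last two adjustments can be illustrated with a short Python sketch. It is not the repair script used by the template. It assumes that "disabling fsck" means setting the sixth fstab field (the fsck pass number) to 0, and the function name harden_fstab is hypothetical.

```python
from typing import List


def harden_fstab(text: str) -> str:
    """Add nofail to every entry and set the fsck pass field to 0."""
    out: List[str] = []
    for line in text.splitlines():
        stripped = line.strip()
        fields = stripped.split()
        # Keep blank lines, comments, and lines that are not valid entries.
        if not stripped or stripped.startswith("#") or len(fields) < 4:
            out.append(line)
            continue
        # Pad the optional dump and pass fields so that both are present.
        fields = (fields + ["0", "0"])[:6]
        options = fields[3].split(",")
        if "nofail" not in options:
            options.append("nofail")
        fields[3] = ",".join(options)
        fields[5] = "0"  # assumption: pass number 0 skips the boot-time check
        out.append(" ".join(fields))
    return "\n".join(out) + "\n"


sample = "UUID=abc / ext4 defaults 1 1\n/dev/vdb1 /data ext4 defaults 0 2\n"
print(harden_fstab(sample), end="")
```

The sketch rewrites each entry with single spaces between fields and leaves comments untouched. A real repair would back up the original file first, as described in the first step.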

Update the ramdisk file

Check whether the ramdisk file exists in /boot. If the ramdisk file exists, back up the file.
Rebuild the ramdisk file.
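
The backup step can be sketched in Python as follows. This is an illustration under stated assumptions, not the template's script: the file name prefixes initramfs and initrd, the .bak suffix, and the function name backup_ramdisks are assumptions. The rebuild commands listed are the ones commonly used on these distribution families, and the template may use different ones.

```python
import shutil
from pathlib import Path
from typing import List


def backup_ramdisks(boot_dir: str, suffix: str = ".bak") -> List[str]:
    """Copy each initramfs/initrd image in boot_dir to <name><suffix>."""
    backed_up: List[str] = []
    # sorted() lists the directory once, before any backup file is created.
    for path in sorted(Path(boot_dir).iterdir()):
        is_image = path.name.startswith(("initramfs", "initrd"))
        if path.is_file() and is_image and not path.name.endswith(suffix):
            shutil.copy2(path, path.with_name(path.name + suffix))
            backed_up.append(path.name)
    return backed_up


# Assumed mapping of distribution to a common rebuild command.
REBUILD_COMMANDS = {
    "centos": ["dracut", "--force"],
    "debian": ["update-initramfs", "-u"],
    "ubuntu": ["update-initramfs", "-u"],
}
```

For example, backup_ramdisks("/boot") would copy each matching image next to the original before a rebuild command is run. Existing backup files are skipped so that a second run does not create .bak.bak copies.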