
CloudOps Orchestration Service: Repair the damaged Linux system disk of an ECS instance by using the self-rescue solution

Last Updated: May 15, 2024

Symptoms and causes

The Linux operating system of an Elastic Compute Service (ECS) instance may fail to start for various reasons. In this case, the ECS instance may be in the Running state in the ECS console, but the applications deployed on the instance are inaccessible. The instance cannot be pinged, and you cannot connect to it by using Workbench or Secure Shell (SSH). If you connect to the instance by using Virtual Network Computing (VNC) in the ECS console, one of the following error messages may be displayed:

UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY

Or

Give root password for maintenance (or type CTRL-D to continue)

Or

Enter 'help' for a list of built-in commands.
(initramfs)

The possible causes include but are not limited to the following ones:

  • The instance was forcibly stopped or restarted, or it crashed unexpectedly, which left the file system data inconsistent.

  • A data disk was detached, but its mount entry was not removed from the /etc/fstab file.

  • The /etc/fstab file is missing or damaged.

  • The initrd file is damaged.

  • The file system is damaged for other reasons.

You can use the self-rescue solution described in this topic to repair the damaged Linux instance.

Solution overview

CloudOps Orchestration Service (OOS) is a free automated O&M platform provided by Alibaba Cloud. OOS allows you to use a simple template in the YAML format to manage and execute automated tasks.

The self-rescue solution described in this topic is an automated solution based on an OOS template and has been tested by Alibaba Cloud. This solution allows you to repair a damaged Linux instance with a few clicks. For more information, see the "Procedure" section of this topic.

The self-rescue solution is implemented based on the following process: OOS creates an image backup for the damaged instance, detaches the system disk from the instance, attaches the system disk to a temporary instance, and then checks and repairs the system disk. After the system disk is repaired, OOS reattaches it to the damaged instance, restarts the instance, and then releases the temporary instance.

The following operating systems are supported:

  • CentOS: 7.2 64-bit, 7.3 64-bit, 7.4 64-bit, 7.5 64-bit, 7.6 64-bit, 7.7 64-bit, and 8.0 64-bit

  • Debian: 8.9 64-bit, 8.11 64-bit, 9.8 64-bit, 9.9 64-bit, and 9.11 64-bit

  • openSUSE: 42.3 64-bit and 15.1 64-bit

  • SUSE Linux Enterprise Server: 12 SP4 64-bit and 12 SP2 64-bit

  • Alibaba Cloud Linux: 2.1903 LTS 64-bit

  • Ubuntu: 18.04 64-bit

Precautions

  • Linux startup failures can have many causes. The self-rescue solution is not guaranteed to repair every instance that fails to start.

  • During the repair, a temporary instance is created, and you are charged additional fees for it. The cost is typically less than USD 1.

  • The self-rescue solution modifies the fstab and initrd files in the system. Before the modification starts, an image backup is automatically created for the instance to be repaired. You can use the image backup to restore data of the instance later. You are charged for the storage of the image backup. For more information, see Snapshots. After you confirm that the instance is repaired, you can delete the image backup.

Prepare an account with required permissions

If you use an Alibaba Cloud account, skip this section and perform the steps in the "Procedure" section of this topic.

If you use a Resource Access Management (RAM) user or role, make sure that the RAM user or role has the required permissions on OOS, Resource Orchestration Service (ROS), ECS, and Virtual Private Cloud (VPC). You can grant the permissions by using one of the following methods:

  1. Grant the following system permissions to the RAM user or role: AliyunOOSFullAccess, AliyunROSFullAccess, AliyunECSFullAccess, and AliyunVPCFullAccess. For more information, see Policy overview.

  2. Create a custom policy and attach the policy to the RAM user or role. For more information, see the policy content.

Procedure

  1. Log on to the CloudOps Orchestration Service (OOS) console. Select the region in which the damaged instance resides.

  2. In the left-side navigation pane, choose Automated Task > Public Template. On the Public Template page, search for the ACS-ECS-RescueUnreachableInstance-Linux template.

  3. Click Create Execution.

  4. On the Create Task page, configure basic information. Then, click Next Step: Parameter Settings.

  5. In the Parameter Settings step, configure the following parameters:

    • UnreachableInstanceId: required. The ID of the instance whose system disk is to be repaired.

    • CredentialType: required. The type of the authentication credential that is used to reattach the repaired system disk to the damaged instance. You can set this parameter to KeyPairName or Password.

    • Credential: required. The authentication credential. If you set the CredentialType parameter to KeyPairName, specify the name of a key pair. If you set the CredentialType parameter to Password, specify a password.

    • ImagePrefix: optional. The prefix for the name of the image backup. The default prefix is OOSRescueBackup-.

    • HelperInstanceTypes: optional. The type of the temporary instance to be created. By default, the instance type with the lowest unit price is selected from the provided instance types.

    • OOSAssumeRole: Select Use Existing Permissions of Current Account.

  6. Click Next Step: OK.

You can view the execution status of the OOS template in the execution details and view the execution status of the repair script by checking the rtCommandOutput parameter.

The execution takes about 5 to 10 minutes. If the template is executed successfully, the instance has been repaired and is in the Running state.
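The parameters from the Parameter Settings step can also be assembled and checked in code before they are submitted. The following Python sketch is illustrative only: the function name build_rescue_parameters, the sample instance ID, and the sample key pair name are hypothetical, and treating HelperInstanceTypes as a list is an assumption. It only validates the required fields and applies the documented default prefix. It does not call OOS.

```python
import json

# Values the CredentialType parameter accepts, as listed in the procedure.
ALLOWED_CREDENTIAL_TYPES = {"KeyPairName", "Password"}


def build_rescue_parameters(instance_id, credential_type, credential,
                            image_prefix="OOSRescueBackup-",
                            helper_instance_types=None):
    """Validate and assemble the template parameters (hypothetical helper)."""
    if not instance_id:
        raise ValueError("UnreachableInstanceId is required")
    if credential_type not in ALLOWED_CREDENTIAL_TYPES:
        raise ValueError("CredentialType must be KeyPairName or Password")
    if not credential:
        raise ValueError("Credential is required")

    params = {
        "UnreachableInstanceId": instance_id,
        "CredentialType": credential_type,
        "Credential": credential,
        "ImagePrefix": image_prefix,
    }
    # Optional: leave the key out so the template chooses the instance type.
    # Passing a list here is an assumption about the parameter's shape.
    if helper_instance_types:
        params["HelperInstanceTypes"] = list(helper_instance_types)
    return params


if __name__ == "__main__":
    # Hypothetical instance ID and key pair name, for illustration only.
    params = build_rescue_parameters("i-example123", "KeyPairName", "my-key-pair")
    print(json.dumps(params, indent=2))
```

Checking the values up front catches a misspelled CredentialType or a missing credential before an execution is created.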

Tips

If the template execution is canceled or interrupted, the system disk of the damaged instance may not be reattached after it is detached. In this case, the instance has no system disk. To reattach the system disk, perform the following steps: Log on to the ECS console, find the instance to which the system disk is to be attached, and then click the instance ID to go to the instance details page. Click Disks and then Attach Disk. You can view the ID of the original system disk. Copy the system disk ID prefixed with d-bp and paste it in the disk search box to search for the disk.

  • Select the disk ID if it is available. You can use a key pair or custom password as the logon credential. This credential is used when you log on to the instance as the root user after the instance is started. For example, if you use a custom password as the logon credential, specify the logon password, confirm the password, and then attach the disk. If the disk is attached, you can see that the instance status is changed to Stopped.

  • If the disk ID is not found, view the execution details of the OOS template. Find the output parameter helperInstanceId in the untilStackReady task and copy the instance ID prefixed with i-. Then, search for the instance ID in the ECS console. If the temporary instance is found, release the instance. After the release, perform the same steps as above: Copy the system disk ID prefixed with d-bp to the disk search box, find the disk, and then attach it.

Internal implementation logic of the self-rescue solution

Repair the /etc/fstab file

  • Check whether the /etc/fstab file exists. If the /etc/fstab file exists, back up the file.

  • Repair the /etc/fstab file. If the /etc/fstab file does not exist or fails to be parsed, create the default /etc/fstab file.

  • If the nofail mount option is not configured for an entry, add it to prevent startup failures.

  • If the fsck check is enabled for an entry, disable it.
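The last two adjustments can be illustrated with a short Python sketch. This is not the repair script used by the template. The function name harden_fstab is hypothetical, and the sketch assumes that every parsable entry receives the nofail option and that the fsck check is disabled by setting the sixth field (the fsck pass number) to 0. The actual script may treat the root or swap entries differently.

```python
def harden_fstab(text):
    """Return fstab text with nofail added and boot-time fsck disabled."""
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        # Keep blank lines and comments untouched.
        if not stripped or stripped.startswith("#"):
            out.append(line)
            continue
        fields = stripped.split()
        if len(fields) < 4:
            # Not a parsable entry; this sketch leaves it as-is.
            out.append(line)
            continue
        # Pad the optional dump and pass fields so both are explicit.
        while len(fields) < 6:
            fields.append("0")
        options = fields[3].split(",")
        if "nofail" not in options:
            options.append("nofail")
        fields[3] = ",".join(options)
        fields[5] = "0"  # sixth field set to 0: skip fsck at boot
        out.append(" ".join(fields[:6]))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = "UUID=abcd / ext4 defaults 1 1\n/dev/vdb1 /data ext4 defaults 0 2\n"
    print(harden_fstab(sample), end="")
```

With nofail set, a missing device such as a detached data disk no longer blocks startup, which addresses one of the causes listed at the beginning of this topic.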

Update the ramdisk file

  • Check whether the ramdisk file exists in /boot. If the ramdisk file exists, back up the file.

  • Rebuild the ramdisk file.
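The backup step can be sketched in Python as follows. The function name backup_ramdisks, the .bak suffix, and the list of file name patterns are assumptions chosen for illustration, and the template's script may use different names. The rebuild itself is distribution-specific (for example, dracut on CentOS or update-initramfs on Debian and Ubuntu) and is not shown.

```python
import glob
import os
import shutil

# Common ramdisk naming patterns; the actual script may use a different list.
RAMDISK_PATTERNS = ("initramfs-*.img", "initrd.img-*", "initrd-*")


def backup_ramdisks(boot_dir="/boot", suffix=".bak"):
    """Copy each ramdisk image in boot_dir to <name><suffix>; return new paths."""
    backups = []
    for pattern in RAMDISK_PATTERNS:
        for path in sorted(glob.glob(os.path.join(boot_dir, pattern))):
            # Skip earlier backups and anything that is not a regular file.
            if path.endswith(suffix) or not os.path.isfile(path):
                continue
            target = path + suffix
            shutil.copy2(path, target)  # copy2 also preserves timestamps
            backups.append(target)
    return backups
```

Keeping a copy next to the original makes it possible to restore the previous ramdisk if the rebuilt one does not boot.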