Symptoms and causes

In certain cases, the Linux operating system of an Elastic Compute Service (ECS) instance may fail to start. When this occurs, the instance may still appear in the Running state in the ECS console, but the applications deployed on it are inaccessible. The instance cannot be pinged, and you cannot connect to it by using Workbench or Secure Shell (SSH). If you connect to the instance by using Virtual Network Computing (VNC) in the ECS console, you may see one of the following error messages:
unexpected inconsistency;RUN fsck MANUALLY
or
Give root password for maintenance (or type CTRL-D to continue)
or
Enter 'help' for a list of built-in commands.
(initramfs)

The causes include but are not limited to the following:

  • The instance is forcibly shut down or restarted, or crashes unexpectedly, which leaves the file system data inconsistent.
  • You detached a data disk but did not remove its mount entry from the /etc/fstab file.
  • The /etc/fstab file is missing or corrupted.
  • The initrd file is corrupted.
  • The file system is damaged due to other causes.

You can use the self-rescue solution in this topic to repair the damaged Linux instance.

Solution overview

Operation Orchestration Service (OOS) is a free automated O&M platform provided by Alibaba Cloud. It allows you to use a simple template in the YAML format to manage and execute automated tasks.

The self-rescue solution in this topic uses an OOS template that has been tested by Alibaba Cloud and automates the repair so that it can be started with one click. For more information about how to use the solution, see the "Procedure" section.

The self-rescue solution first creates an image backup of the damaged instance. It then detaches the system disk from the instance, attaches the disk to a new temporary instance, and checks and repairs the disk there. After the system disk is repaired, the solution reattaches it to the original instance, restarts the instance, and releases the temporary instance.

The following operating systems are supported:

  • CentOS: 7.2 64-bit, 7.3 64-bit, 7.4 64-bit, 7.5 64-bit, 7.6 64-bit, 7.7 64-bit, and 8.0 64-bit
  • Debian: 8.9 64-bit, 8.11 64-bit, 9.8 64-bit, 9.9 64-bit, and 9.11 64-bit
  • openSUSE: 42.3 64-bit and 15.1 64-bit
  • SUSE Linux Enterprise Server 12: SP4 64-bit and SP2 64-bit
  • Aliyun Linux: 2.1903 64-bit
  • Ubuntu: 18.04 64-bit

Precautions

  • Linux startup failures can have many causes. We do not guarantee that the self-rescue solution can repair every instance that fails to start.
  • During the repair, a temporary instance is created, which incurs fees. The cost is generally less than RMB 1.
  • The self-rescue solution requires modifications to the fstab and initrd files of the system. Before modification, an image backup is automatically created for the instance to be repaired so that you can restore data later by using the image backup. You are charged for the image backup. For more information, see Snapshot billing. After you confirm that the instance is repaired, you can delete the image backup.

Prepare an account with required permissions

If you are using an Alibaba Cloud account, skip this section and perform the steps in the "Procedure" section.

If you are using a Resource Access Management (RAM) user or role, make sure that the RAM user or role has the permissions on OOS, Resource Orchestration Service (ROS), ECS, and Virtual Private Cloud (VPC). You can grant permissions in the following ways:

  1. Grant the following system permissions to the RAM user or role: AliyunOOSFullAccess, AliyunROSFullAccess, AliyunECSFullAccess, and AliyunVPCFullAccess. For more information, see Policy overview in RAM documentation.
  2. Create a custom policy and attach the policy to the RAM user or role. For more information, see policy content.

    For more information about the procedure, see RAM documentation.

Procedure

  1. Log on to the OOS console. Select the region where the damaged instance resides.
  2. Click Public Templates in the left-side navigation pane and select ACS-ECS-RescueUnreachableInstance-Linux.
  3. Click Create Execution.
  4. Click Next: Parameter Settings.
  5. Set the following parameters:
    • unreachableInstanceId: required. The ID of the instance whose system disk is to be repaired.
    • credentialType: required. The type of the authentication credential used to reattach the repaired system disk to the original instance. You can select KeyPairName or Password.
    • credentialValue: required. The authentication credential value. If you set credentialType to KeyPairName, enter the name of a key pair. If you set credentialType to Password, enter a password.
    • imagePrefix: optional. The prefix for the name of the image backup. The default prefix is OOSRescueBackup-.
    • helperInstanceTypes: optional. The instance types that can be used for the temporary instance. By default, the instance type with the lowest unit price is selected from the list that you enter.
    • Permissions: Select Use Existing Permissions of Current Account.
  6. Click Next: OK and then Confirm and Create.
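
If you prefer to prepare the template parameters ahead of time, for example to paste them into the console or pass them to a script, the parameter set above can be assembled and checked as in the following minimal sketch. The function name build_rescue_parameters, the sample instance ID, and the sample key pair name are hypothetical. Treating helperInstanceTypes as a list of strings is an assumption based on the parameter description. The sketch only builds and validates a JSON document and does not call any Alibaba Cloud API.

```python
import json

# Credential types named in the parameter list above.
VALID_CREDENTIAL_TYPES = {"KeyPairName", "Password"}


def build_rescue_parameters(instance_id, credential_type, credential_value,
                            image_prefix="OOSRescueBackup-",
                            helper_instance_types=None):
    """Return the template parameters as a JSON string (hypothetical helper)."""
    if not instance_id:
        raise ValueError("unreachableInstanceId is required")
    if credential_type not in VALID_CREDENTIAL_TYPES:
        raise ValueError("credentialType must be KeyPairName or Password")
    if not credential_value:
        raise ValueError("credentialValue is required")

    params = {
        "unreachableInstanceId": instance_id,
        "credentialType": credential_type,
        "credentialValue": credential_value,
        "imagePrefix": image_prefix,
    }
    # Assumption: the optional instance type list is passed as a list of strings.
    if helper_instance_types:
        params["helperInstanceTypes"] = list(helper_instance_types)
    return json.dumps(params, sort_keys=True)


if __name__ == "__main__":
    # Sample values are placeholders, not real resources.
    print(build_rescue_parameters("i-bp1example", "KeyPairName", "my-key-pair"))
```

Validating the credential type before the execution is created catches the most likely input mistake early, because only KeyPairName and Password are accepted.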

You can view the execution status of the OOS template in the execution details and view the execution status of the repair script by checking the rtCommandOutput parameter.

The execution takes about 5 to 10 minutes. If the execution is successful, the instance is repaired and is in the Running state.

Tips

If the template execution is canceled or interrupted, the system disk may be left detached from the damaged instance. In this case, the instance has no system disk. To reattach the system disk, perform the following steps:

  1. Log on to the ECS console, find the target instance, and then click the instance ID to go to the instance details page.
  2. Click Disks and then Attach Disk. The ID of the original system disk is displayed.
  3. Copy the system disk ID prefixed with d-bp and paste it in the disk search box to search for the disk.

  • If the disk ID is found, select the disk ID. You can use a key pair or a custom password as the logon credential. This credential is used when you log on to the instance as the root user after the instance is started. For example, to use a custom password, specify the password in the Logon Password and Confirm Password fields, click OK, and then click Attach. After the disk is attached, the instance status changes to Stopped.
  • If the disk ID is not found, view the execution details of the OOS template. Find the output parameter helperInstanceId in the untilStackReady task and copy the instance ID prefixed with i-. Then search for the instance ID in the ECS console. If the temporary instance is found, release the instance. After the release, perform the same steps as above: Copy the system disk ID prefixed with d-bp to the disk search box, find the target disk, and then attach it.

Internal implementation logic of the self-rescue solution

Repair the /etc/fstab file

  • Check whether the /etc/fstab file exists. If the /etc/fstab file exists, back up the file.
  • Repair the /etc/fstab file. If the /etc/fstab file does not exist or fails to be parsed, create the default /etc/fstab file.
  • If the nofail mount option is not set for an entry, add it so that an unavailable device does not cause a startup failure.
  • If the fsck check at startup is enabled for an entry, disable it.
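
The nofail and fsck adjustments above can be illustrated with the following minimal sketch. It is not the repair script used by the OOS template. The function name repair_fstab_text is hypothetical, and the sketch makes two assumptions: that the changes apply to every entry in the file (the source does not say whether some entries, such as the root file system, are treated differently), and that "disabling fsck" means setting the sixth fstab field, the fsck pass number, to 0.

```python
def repair_fstab_text(text: str) -> str:
    """Return fstab content with nofail added and the fsck pass field set to 0.

    Hypothetical helper for illustration only.
    """
    repaired = []
    for line in text.splitlines():
        stripped = line.strip()
        # Keep blank lines and comments untouched.
        if not stripped or stripped.startswith("#"):
            repaired.append(line)
            continue
        fields = stripped.split()
        # A usable entry needs at least device, mount point, type, and options.
        if len(fields) < 4:
            repaired.append(line)
            continue
        # Pad the optional dump and pass fields so that both are present.
        fields = fields[:6] + ["0"] * (6 - len(fields))
        options = fields[3].split(",")
        if "nofail" not in options:
            options.append("nofail")
        fields[3] = ",".join(options)
        # Assumption: a pass number of 0 skips the boot-time fsck for the entry.
        fields[5] = "0"
        repaired.append("\t".join(fields))
    return "\n".join(repaired) + "\n"


if __name__ == "__main__":
    sample = "/dev/vda1 / ext4 defaults 1 1\n/dev/vdb1 /data ext4 defaults 0 2\n"
    print(repair_fstab_text(sample), end="")
```

The function works on text rather than on /etc/fstab directly, so it can be tried on a copy of the file before anything on the disk is changed.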

Update the ramdisk file

  • Check whether the ramdisk file exists in /boot. If the ramdisk file exists, back up the file.
  • Rebuild the ramdisk file.
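
The backup and rebuild steps above might look like the following minimal sketch. It is not the script used by the OOS template. The names backup_ramdisks and rebuild_command are hypothetical, the file name patterns are assumptions that vary by distribution, and the choice between dracut and update-initramfs is only a guess at how a rebuild could be dispatched. On a disk repaired from a temporary instance, the rebuild would also need to target the damaged system's root and kernel version, which this sketch does not handle.

```python
import shutil
from pathlib import Path

# Assumed ramdisk naming patterns; actual names differ between distributions.
RAMDISK_PATTERNS = ("initramfs-*", "initrd*")


def backup_ramdisks(boot_dir, suffix=".bak"):
    """Copy each ramdisk image in boot_dir to <name><suffix> and return the copies."""
    boot = Path(boot_dir)
    backups = []
    for pattern in RAMDISK_PATTERNS:
        for image in sorted(boot.glob(pattern)):
            # Skip directories and backups created by an earlier run.
            if not image.is_file() or image.name.endswith(suffix):
                continue
            target = image.with_name(image.name + suffix)
            shutil.copy2(image, target)
            backups.append(target)
    return backups


def rebuild_command(which=shutil.which):
    """Pick a rebuild command based on which tool is installed, or None."""
    if which("dracut"):
        return ["dracut", "--force"]
    if which("update-initramfs"):
        return ["update-initramfs", "-u"]
    return None


if __name__ == "__main__":
    print(rebuild_command())
```

Passing the lookup function as the which argument keeps the command selection testable on a machine where neither tool is installed.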