If the underlying hardware that hosts an ECS instance fails, Alibaba Cloud checks within one minute whether the failure is reversible and whether the instance can be repaired. If the instance cannot be repaired, it is restarted for failover within about five minutes. Metadata of the recovered instance, such as the instance ID, private IP address, and public IP address, remains unchanged.

Recovery modes

By default, Alibaba Cloud automatically restarts an ECS instance when the instance goes down unexpectedly or when an active O&M plan is carried out. You can change the recovery mode of an instance.

| Recovery mode | Description | System event value | Instance without local storage * | Instance with local storage or supporting SGX-based encryption ** |
|---|---|---|---|---|
| Automatically restart (default) | The instance restarts automatically and is restored to its previous lifecycle state. | SystemFailure.Reboot | Supported | Supported |
| Stop | The instance remains in the Stopped state. This mode is applicable to scenarios in which failover or node switchover is enabled at the application layer. This mode can avoid service conflicts that may occur after an instance restarts automatically. | SystemFailure.Stop | Supported | Supported |
| Automatically redeploy | The local disks of the instance are redeployed, data on the local disks is deleted, and SGX is reset. | SystemFailure.Redeploy | N/A | Supported |

* Instance families without local storage, such as the g-series, c-series, and r-series instance families. For more information, see Instance families.

** The following instance families are supported:
  • Instance families with local storage, such as the d-series, i-series, and gn5-series instance families.
  • Instance families that support Intel® SGX-based encryption, such as the ebmhfg5 ECS Bare Metal Instance family with high clock speed.
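
The support matrix above can be encoded as a small lookup, which is handy when a deployment script needs to validate a requested recovery mode before applying it. The following is a minimal sketch only: the class, the dictionary keys, and the `supported_modes` helper are hypothetical names chosen for illustration and are not part of any Alibaba Cloud SDK. Only the mode names, system event values, and support flags are taken from the table.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RecoveryMode:
    """One row of the recovery mode table (hypothetical helper type)."""
    name: str
    event_value: str
    without_local_storage: bool      # column marked with *
    with_local_storage_or_sgx: bool  # column marked with **


# Keys are illustrative identifiers, not official API values.
RECOVERY_MODES: Dict[str, RecoveryMode] = {
    "auto_restart": RecoveryMode(
        "Automatically restart", "SystemFailure.Reboot", True, True
    ),
    "stop": RecoveryMode("Stop", "SystemFailure.Stop", True, True),
    "auto_redeploy": RecoveryMode(
        "Automatically redeploy", "SystemFailure.Redeploy", False, True
    ),
}


def supported_modes(has_local_storage_or_sgx: bool) -> List[str]:
    """Return the keys of the recovery modes that the table marks as supported."""
    result = []
    for key, mode in RECOVERY_MODES.items():
        if has_local_storage_or_sgx:
            supported = mode.with_local_storage_or_sgx
        else:
            supported = mode.without_local_storage
        if supported:
            result.append(key)
    return result


if __name__ == "__main__":
    print(supported_modes(False))
    print(supported_modes(True))
```

For an instance family without local storage, the helper omits the redeploy mode, which matches the N/A cell in the table.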

Limits

  • You can select how an ECS instance is automatically recovered, but you cannot intervene in an ongoing recovery.
  • You cannot manually restart an instance while automatic recovery is in progress.

Improve fault tolerance

To make full use of the automatic recovery feature and the failover operation of an instance, make sure that you complete the following operations:

  • Add core applications such as SAP HANA to the automatic startup item list to avoid any interruptions to your business operations.
  • Enable the automatic reconnection feature for your applications. For example, allow applications to automatically reconnect to MySQL, SQL Server, or Apache Tomcat.
  • If you use Server Load Balancer (SLB), deploy multiple ECS instances in a cluster. When an ECS instance is in the automatic recovery process, other ECS instances can continue to provide access to your services.
  • We recommend that you back up data on the local disk on a regular basis to ensure that you have redundant copies available to redeploy your instance.
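
The automatic reconnection recommendation above can be sketched as a retry loop with exponential backoff. This is an illustration under stated assumptions, not a prescribed implementation: `connect_with_retry` and its parameters are hypothetical names, the `connect` callable stands in for whatever client your application uses, and the sketch assumes that connection failures surface as `OSError` (which includes `ConnectionError` and `TimeoutError`). Real database drivers usually raise their own exception types, so adjust the `except` clause accordingly.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds or the attempts are used up.

    The wait between attempts doubles each time and is capped at max_delay.
    The last failure is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except OSError:
            # Assumption: connection failures are reported as OSError subclasses.
            if attempt == attempts:
                raise
            sleep(min(delay, max_delay))
            delay *= 2
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

Choose the number of attempts and the delays so that the total retry window covers the recovery time described at the top of this topic (a check within one minute and a restart within about five minutes).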

What to do next