Cross-region and cross-zone disaster recovery-ECS - Cloud Backup

This topic describes the basic capabilities and benefits of the Elastic Block Storage (EBS) async replication feature for Elastic Compute Service (ECS) disaster recovery.

Overview

Cloud Backup provides cross-region and cross-zone disaster recovery capabilities based on the async replication feature to meet various business requirements.

Async replication is implemented on disks without the need to install an agent on the protected instance.

If a fault occurs on the primary system, you can switch the business system to the disaster recovery system. This limits the impact of regional disasters, maintains business availability, and helps you meet the recovery point objective (RPO) and recovery time objective (RTO) of your business.

Async replication is a feature that protects data across regions or across zones within the same region based on the data replication capability of EBS. For more information, see Overview.

The following table describes the differences between continuous data replication (CDR) and async replication.

Item	CDR	Async replication
Scenarios	Disaster recovery for a single virtual machine (VM). Use this replication technology if intrusion into the system is acceptable.	Disaster recovery that requires consistency across a group of VMs. Use this replication technology if you want to avoid intrusion into the system.
Intrusive to the system	Yes	No
Replication implementation	An agent is installed on the operating system of the protected instance, so that Cloud Backup replicates data written to the disks and sends the data to a gateway in real time. The gateway stores the data in an Object Storage Service (OSS) bucket and then writes the data to the disk at the disaster recovery site.	Data is replicated by using the async replication and snapshot features.
Recovery implementation	Supports multiple recovery points. A shadow ECS instance and a gateway server are created for the protected ECS instance at the disaster recovery site. Cloud Backup reads data from the OSS bucket to the shadow ECS instance, writes the data to the ECS instance at the disaster recovery site, and then creates a recovery point based on the snapshot mechanism.	Supports only a single recovery point. Cloud Backup creates a recovery point by replicating the snapshot to the disaster recovery site.
Consistency group	Not supported	Supported
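
The comparison above can be read as a simple selection rule. The following Python sketch is illustrative only: the `Requirements` class and `candidate_technologies` function are hypothetical names that are not part of Cloud Backup, and the rule encodes only the three table rows about intrusiveness, recovery points, and consistency groups.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    """Hypothetical description of what a workload needs (not a Cloud Backup API)."""
    agent_allowed: bool                   # may an agent be installed on the protected instance?
    needs_consistency_group: bool         # must several VMs be protected as a consistent group?
    needs_multiple_recovery_points: bool  # is more than one recovery point required?


def candidate_technologies(req: Requirements) -> list:
    """Return the replication technologies from the table that fit the requirements."""
    candidates = []
    # CDR: agent-based, multiple recovery points, no consistency group support.
    if req.agent_allowed and not req.needs_consistency_group:
        candidates.append("CDR")
    # Async replication: agentless, single recovery point, consistency group supported.
    if not req.needs_multiple_recovery_points:
        candidates.append("async replication")
    return candidates


if __name__ == "__main__":
    # A multi-VM application on which no agent may be installed.
    print(candidate_technologies(Requirements(False, True, False)))
```

An empty result means that neither technology, as described in the table, satisfies all of the stated requirements.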

Benefits of disaster recovery

Agentless replication

Async replication does not require agents, is not intrusive to the system, works regardless of the operating system of the protected instance, and does not consume computing resources at the disaster recovery site.

Multi-VM consistency

ECS disaster recovery provides multi-VM consistency to meet the strict consistency requirements of enterprise applications.

Ease of use

After you create a protection group for an application, you can add all the ECS instances of the application to the protection group and enable replication. You do not need to focus on the mappings between disks and ECS instances. ECS instances and disks are mapped by Cloud Backup.

Terms

Term	Description
site pair	Cross-region and cross-zone disaster recovery is implemented based on async replication. Async replication is used to replicate data from one site to another site across regions or across zones in a region. Therefore, you must pair two sites according to your business requirements. These two sites are referred to as a site pair. Protection groups must be created for the site pair. Disaster recovery is implemented only in the forward direction for the protection groups in a site pair. For example, disaster recovery is performed from Protection Group A to Protection Group B, and the forward protection is initiated from Region 1 to Region 2. Disaster recovery is performed from Protection Group C to Protection Group D, and the forward protection is initiated from Region 2 to Region 1. In this case, you must create two site pairs. A protection group can belong to only one site pair. Only one replication technology can be used for one site pair.
protection group	A protection group can contain multiple ECS instances. This way, you can use one plan to perform operations on multiple ECS instances at the same time. You can select the common type (no associations exist between multiple VMs) or the consistency group type. Only one underlying technology can be applied to the ECS instances in a protection group to implement disaster recovery: CDR or async replication. You must determine the underlying technology when you create a protection group. The normal states of a protection group include Starting Replication, Replicating Full Data, Replicating Incremental Data, Failover in Progress, Failover Completed, Reverse Replicating, Failback in Progress, and Failback Completed. The abnormal states include Replication Error, Failover Failed, and Failback Failed. A failover is performed for all the protected ECS instances in a protection group. Therefore, the role of all the protected ECS instances in a protection group must be the same.
protected instance	An ECS instance or database that is protected by Cloud Backup. Database protection will be supported in the future. Roles are classified into primary and secondary roles. Primary roles refer to the instances on which services are running, and secondary roles refer to the instances that are used for disaster recovery.
production site	The zone or region where your production business operates initially.
disaster recovery site	The zone or region for disaster recovery of your production business.
failover	The process of switching services to the disaster recovery site when a fault occurs at the production site. Failover is classified into planned failover and unplanned failover. The difference lies in whether the ECS instance at the production site fails during the switchover.
failback	The process of switching services from the disaster recovery site back to the production site after the fault at the production site is rectified.
forward protection	The replication direction of the protection group and ECS instances. In forward protection, data and services are replicated from the production site to the disaster recovery site.
reverse protection	The replication direction of the protection group and ECS instances. After a failover, the disaster recovery site (Site B) becomes the new production site, and the production site (Site A) becomes the new disaster recovery site. In this case, after the replication is started, data is replicated from Site B to Site A. The reverse protection takes effect on the site pair. After a failback, Site A becomes the production site and Site B becomes the disaster recovery site again. In this case, after the replication is started, data is replicated from Site A to Site B. The forward protection resumes on the site pair.
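
The relationship between failover, failback, and the replication direction described in the terms above can be sketched as a small state model. This is an illustration under stated assumptions, not Cloud Backup code: the `SitePair` class and its method names are hypothetical, and the model tracks only which site currently acts as the production site.

```python
from dataclasses import dataclass


@dataclass
class SitePair:
    """Hypothetical model of a site pair (Site A = original production site)."""
    site_a: str
    site_b: str
    production: str = ""  # site that currently runs the services

    def __post_init__(self):
        if not self.production:
            self.production = self.site_a

    @property
    def disaster_recovery(self) -> str:
        """The site that currently receives replicated data."""
        return self.site_b if self.production == self.site_a else self.site_a

    @property
    def direction(self) -> str:
        """'forward' while Site A is the production site, otherwise 'reverse'."""
        return "forward" if self.production == self.site_a else "reverse"

    def replication_path(self) -> tuple:
        """(source, target) of replication once replication is started."""
        return (self.production, self.disaster_recovery)

    def failover(self) -> None:
        """Switch services from Site A to Site B."""
        if self.production != self.site_a:
            raise ValueError("failover is only valid under forward protection")
        self.production = self.site_b

    def failback(self) -> None:
        """Switch services from Site B back to Site A."""
        if self.production != self.site_b:
            raise ValueError("failback is only valid under reverse protection")
        self.production = self.site_a


if __name__ == "__main__":
    pair = SitePair("Region 1", "Region 2")
    print(pair.direction, pair.replication_path())
    pair.failover()
    print(pair.direction, pair.replication_path())
    pair.failback()
    print(pair.direction, pair.replication_path())
```

In this sketch, the direction is derived from the current production site rather than stored separately, so the two can never disagree.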

Architecture

The following figure shows the technical architecture of disaster recovery based on CDR and async replication.

[Figure: technical architecture of disaster recovery based on CDR and async replication]

Supported disaster recovery scenarios

Disaster recovery scenario	Type
failover	Switch After Data Synchronization: During the failover, Cloud Backup stops the protected instances in the protection group, and performs the final data synchronization after all the protected instances are stopped. The failover starts after the data is synchronized. This ensures that the data at the disaster recovery site is the same as that at the production site. This type of failover applies to scenarios such as planned disaster recovery drills and business migration. Switch Now: During the failover, Cloud Backup attempts to stop the protected instances in the protection group. Cloud Backup does not wait until all the protected instances are stopped or perform the final data synchronization. Some data may be lost within the RPO range. This type of failover applies to scenarios in which a fault cannot be rectified within a short period of time at the production site and business must be immediately switched to the disaster recovery site.
failback	Switch After Data Synchronization: During the failback, Cloud Backup stops the protected instances in the protection group, and performs the final data synchronization after all the protected instances are stopped. The failback starts after the data is synchronized. The service unavailability time is longer than the time for the immediate failback. The production site works properly in such failback scenarios. Switch Now: During the failback, Cloud Backup attempts to stop the protected instances in the protection group. Cloud Backup does not wait until all the protected instances are stopped or perform the final data synchronization. This type of failback applies to scenarios where a fault cannot be rectified within a short period of time at the disaster recovery site and business must be immediately switched to the production site. During the failback, some data may be lost.
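
The difference between the two switchover types in the table above comes down to whether Cloud Backup waits for the instances to stop and runs a final data synchronization. The following Python sketch restates that difference; `SwitchType`, `plan_switch`, and `suggest_switch_type` are hypothetical names used for illustration, and the step strings paraphrase the table rather than reproduce internal operations.

```python
from enum import Enum


class SwitchType(Enum):
    AFTER_DATA_SYNC = "Switch After Data Synchronization"
    NOW = "Switch Now"


def plan_switch(switch_type: SwitchType):
    """Return (ordered steps, whether data may be lost) for a failover or failback."""
    if switch_type is SwitchType.AFTER_DATA_SYNC:
        steps = [
            "stop the protected instances",
            "wait until all protected instances are stopped",
            "perform the final data synchronization",
            "start the switchover",
        ]
        return steps, False
    # Switch Now: no waiting and no final synchronization.
    steps = [
        "attempt to stop the protected instances",
        "start the switchover",
    ]
    return steps, True


def suggest_switch_type(source_fault_unrecoverable_soon: bool) -> SwitchType:
    """Assumed rule of thumb: switch immediately only when the source site has a
    fault that cannot be rectified within a short period of time."""
    if source_fault_unrecoverable_soon:
        return SwitchType.NOW
    return SwitchType.AFTER_DATA_SYNC


if __name__ == "__main__":
    chosen = suggest_switch_type(source_fault_unrecoverable_soon=False)
    steps, may_lose_data = plan_switch(chosen)
    print(chosen.value, steps, may_lose_data)
```

The trade-off is the one stated in the table: synchronizing first keeps both sites identical at the cost of a longer service interruption, while switching immediately shortens the interruption but may lose data within the RPO range.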

Disaster recovery process

To implement disaster recovery protection for critical applications in the Cloud Backup console, perform the following steps:

Step 1: Plan resources.
Before you perform disaster recovery, you must plan the required compute, network, and storage resources. You must determine the number of servers, storage capacity, and virtual private clouds (VPCs).
Step 2: Create a disaster recovery site pair.
Create VPCs and vSwitches for the disaster recovery site, and configure CIDR blocks. For testing, you can create the VPCs and vSwitches with the default configurations, or use the same VPC CIDR block and vSwitch CIDR block for the production site and the disaster recovery site. For actual disaster recovery, configure the CIDR blocks as required.
Step 3: Configure network and security settings.
Create resource mappings, including the zone mapping, vSwitch mapping, and security group mapping.
Step 4: Create a protection group.
Step 5: Add protected instances.
Add instances to be protected.
Step 6: Start replication.
Start disaster recovery protection, which replicates data from the production site to the disaster recovery site.
Note
You can perform a fault drill if the protection group is in the Replicating Incremental Data state or has a recovery point. For more information, see Fault drill.
Step 7: Perform a failover.
- Switch After Data Synchronization
  During the failover, Cloud Backup stops the protected instances in the protection group, and performs the final data synchronization after all the protected instances are stopped. The failover starts after the data is synchronized. This ensures that the data at the disaster recovery site is the same as that at the production site. This type of failover applies to scenarios such as planned fault drills and business migration.
- Switch Now
  During the failover, Cloud Backup attempts to stop the protected instances in the protection group. Cloud Backup does not wait until all the protected instances are stopped or perform the final data synchronization. Some data may be lost within the recovery point objective (RPO) range. This type of failover applies to scenarios where a fault cannot be rectified within a short period of time at the production site and business must be immediately switched to the disaster recovery site.
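
The process above, together with the protection group states listed in the Terms section, can be summarized in a short Python sketch. All names here (`PROCESS_STEPS`, `is_abnormal`, `can_run_fault_drill`, `next_step`) are hypothetical helpers, not a Cloud Backup API. The sketch also assumes that the incremental replication status mentioned in the note of Step 6 is the Replicating Incremental Data state from the Terms section.

```python
# Protection group states as listed in the Terms section.
NORMAL_STATES = frozenset({
    "Starting Replication",
    "Replicating Full Data",
    "Replicating Incremental Data",
    "Failover in Progress",
    "Failover Completed",
    "Reverse Replicating",
    "Failback in Progress",
    "Failback Completed",
})
ABNORMAL_STATES = frozenset({
    "Replication Error",
    "Failover Failed",
    "Failback Failed",
})

# The seven steps of the disaster recovery process, in order.
PROCESS_STEPS = (
    "Plan resources",
    "Create a disaster recovery site pair",
    "Configure network and security settings",
    "Create a protection group",
    "Add protected instances",
    "Start replication",
    "Perform a failover",
)


def is_abnormal(state: str) -> bool:
    """Return True for an abnormal state; reject states that are not listed."""
    if state not in NORMAL_STATES and state not in ABNORMAL_STATES:
        raise ValueError(f"unknown protection group state: {state!r}")
    return state in ABNORMAL_STATES


def can_run_fault_drill(state: str, has_recovery_point: bool) -> bool:
    """A drill is possible during incremental replication or with a recovery point."""
    return state == "Replicating Incremental Data" or has_recovery_point


def next_step(completed_steps: int):
    """Return the next step to perform, or None when all steps are done."""
    if completed_steps < 0:
        raise ValueError("completed_steps must not be negative")
    if completed_steps >= len(PROCESS_STEPS):
        return None
    return PROCESS_STEPS[completed_steps]


if __name__ == "__main__":
    print(next_step(5))
    print(can_run_fault_drill("Replicating Full Data", has_recovery_point=False))
```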

Billing

If you use the async replication feature for disaster recovery, the following fees are incurred:

- The fees for using Cloud Backup clients for ECS disaster recovery
  You are charged for using Cloud Backup clients for ECS disaster recovery based on the number of instances that are protected. For more information, see Pricing.
- The fees for using the pay-as-you-go ECS instances and disks created at the disaster recovery site are included in your ECS bills. For more information, see Pay-as-you-go.
- The fees for the traffic generated during cross-region replication are included in your ECS bills. For more information, see Disk disaster recovery.