Deploy Cross-Region Disaster Recovery for Business Continuity - Cloud Backup

The disaster recovery system is deployed across two Alibaba Cloud regions. If the production site fails, for example because of a tsunami or an earthquake, the business system switches to the disaster recovery site. Because the production and disaster recovery sites reside in different regions, a disaster that affects one region does not take down both sites. The solution delivers Disaster Recovery as a Service with a recovery point objective (RPO) as low as 1 minute and a recovery time objective (RTO) as low as 15 minutes, which keeps the business running through regional disasters.

Preparations

Before implementing cross-region disaster recovery, select a region other than your production environment as the destination region for disaster recovery. In that region, create a virtual private cloud (VPC), and create both a replication vSwitch and a recovery vSwitch.
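Before creating these resources, it can help to sanity-check the planned layout. The following is a minimal sketch and not part of Cloud Backup: the function name, the region IDs, and the CIDR blocks are illustrative assumptions. It checks that the destination region differs from the production region and that the two vSwitch CIDR blocks sit inside the VPC CIDR block without overlapping each other.

```python
import ipaddress


def check_dr_network_plan(production_region, dr_region,
                          vpc_cidr, replication_cidr, recovery_cidr):
    """Return a list of problems found in a planned DR network layout.

    Hypothetical helper for illustration only; it does not call any cloud API.
    """
    problems = []

    # The disaster recovery region must not be the production region.
    if production_region == dr_region:
        problems.append("disaster recovery region must differ from the production region")

    vpc = ipaddress.ip_network(vpc_cidr)
    replication = ipaddress.ip_network(replication_cidr)
    recovery = ipaddress.ip_network(recovery_cidr)

    # Each vSwitch CIDR block has to fall inside the VPC CIDR block.
    for name, net in (("replication", replication), ("recovery", recovery)):
        if not net.subnet_of(vpc):
            problems.append(f"{name} vSwitch CIDR {net} is outside VPC CIDR {vpc}")

    # Two separate vSwitches in one VPC are assumed to need distinct ranges.
    if replication.overlaps(recovery):
        problems.append("replication and recovery vSwitch CIDRs overlap")

    return problems


# Example plan with made-up values; an empty list means no problems were found.
print(check_dr_network_plan("cn-hangzhou", "cn-shanghai",
                            "192.168.0.0/16", "192.168.1.0/24", "192.168.2.0/24"))
```

An empty result only means the plan is internally consistent. It does not confirm that the resources exist or that the region supports the service.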

Step 1: Create a disaster recovery site pair

After completing the preparations, create a disaster recovery site pair so that cross-region disaster recovery can protect your source ECS instances:

Log on to the Cloud Backup console.
Select Disaster Recovery > ECS Disaster Recovery, then click Switch to Continuous Replication-based Disaster Recovery in the upper-left corner of the page.
Click Add, select Cross-region disaster recovery as the type, and enter the Production site information and Disaster recovery site information.
Click Create.

Step 2: Add protected servers

After creating the disaster recovery site pair, add protected servers as follows:

Click the Protected Servers tab and confirm the disaster recovery site pair information in the upper-right corner.
Click Add next to Protected Servers, then select the ECS instances you want to protect.
Click Confirmation to complete the addition. The server status will first show "Installing client" and then change to "Initialized".
Note
If the server status does not show Initialized, click More > Server Operations > Restart Server to complete client initialization.

Step 3: Start replication

Start disaster recovery replication to copy your server to the cloud and maintain real-time replication. Follow these steps:

Click the Protected Servers tab. In the Actions column for the server you want to replicate, choose More > Failover > Start Replication.

In the Start Replication panel, configure the following parameters, then click Start.

Parameter	Description
Recovery Point Policy	Select a time interval from the drop-down list. Cloud Backup creates a recovery point at this interval each day. The unit is hours.
Disk Type	Supported types include ultra disk, ESSD, and SSD.
Replication Network	Select a replication network from the drop-down list. Cloud Backup uses this network to replicate disaster recovery data to the cloud. By default, Cloud Backup reads the available vSwitches from the VPC of the disaster recovery site. The replication and recovery networks can use the same vSwitch, which speeds up recovery. If the two networks are in different zones, RTO increases, so we recommend placing this network in the same zone as the Recovery Network.
Recovery Network	Select a recovery network from the drop-down list. During disaster recovery operations such as drills or failover, Cloud Backup uses this network to restore data, for example to create the recovered ECS instances. By default, Cloud Backup reads the available vSwitches from the VPC of the disaster recovery site. The replication and recovery networks can use the same vSwitch, which speeds up recovery. If the two networks are in different zones, RTO increases, so we recommend placing this network in the same zone as the Replication Network.
Automatically Restart After Replication Interruption	Specifies whether to automatically restart replication after an interruption. Select this option to restart the replication task if it stops.
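The recommendations in the table can be captured as a small pre-flight check. This is a sketch under stated assumptions: the `ReplicationSettings` class, its field names, and the zone IDs are hypothetical and do not correspond to any Cloud Backup API. Only the disk types and the same-zone recommendation come from the table above.

```python
from dataclasses import dataclass

# Disk types as listed in the parameter table.
SUPPORTED_DISK_TYPES = {"ultra disk", "ESSD", "SSD"}


@dataclass
class ReplicationSettings:
    """Hypothetical container for the values chosen in the Start Replication panel."""
    recovery_point_interval_hours: int
    disk_type: str
    replication_zone: str
    recovery_zone: str
    auto_restart: bool = True


def review_settings(settings):
    """Return a list of 'error: ...' or 'warning: ...' findings for the settings."""
    findings = []
    if settings.recovery_point_interval_hours <= 0:
        findings.append("error: recovery point interval must be a positive number of hours")
    if settings.disk_type not in SUPPORTED_DISK_TYPES:
        findings.append(f"error: unsupported disk type {settings.disk_type!r}")
    if settings.replication_zone != settings.recovery_zone:
        # The table notes that RTO increases when the two networks are in different zones.
        findings.append("warning: replication and recovery networks are in different zones; RTO may increase")
    if not settings.auto_restart:
        findings.append("warning: replication will not restart automatically after an interruption")
    return findings


# Made-up example: same zone for both networks, automatic restart left enabled.
print(review_settings(ReplicationSettings(1, "ESSD", "cn-shanghai-b", "cn-shanghai-b")))
```

Treating a zone mismatch as a warning rather than an error mirrors the table, which recommends matching zones but does not require it.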

The disaster recovery replication then proceeds through three stages: Starting Replication, Full Replication, and Real-time Replication.

Starting Replication: The ECS disaster recovery service scans system data and estimates the total data volume. This stage usually takes a few minutes.
Full Replication: The ECS disaster recovery service transfers all valid data from the entire server to Alibaba Cloud. The duration depends on data volume and network bandwidth. The console progress bar shows replication progress.
Real-time Replication: After full replication completes, Alibaba Cloud holds a complete copy of your data. Then, Alibaba Cloud Replication Service (AReS) monitors all disk write operations on the server and continuously replicates them to Alibaba Cloud in real time.
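Because the duration of full replication depends on data volume and network bandwidth, a rough estimate can be derived from those two figures. The sketch below is a back-of-the-envelope calculation only. The function name and the `efficiency` factor, which stands for the assumed share of nominal bandwidth that is actually usable, are assumptions. It ignores that only valid data is transferred, so real runs may finish sooner or later.

```python
def estimate_full_replication_hours(data_gib, bandwidth_mbps, efficiency=0.8):
    """Estimate hours needed to transfer data_gib GiB over a bandwidth_mbps link.

    efficiency is an assumed fraction (0 < efficiency <= 1) of the nominal
    bandwidth that the transfer actually achieves.
    """
    if data_gib < 0:
        raise ValueError("data_gib must not be negative")
    if bandwidth_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("bandwidth must be positive and efficiency in (0, 1]")

    bits_to_send = data_gib * 1024 ** 3 * 8                 # GiB -> bits
    effective_bps = bandwidth_mbps * 1_000_000 * efficiency  # Mbit/s -> bit/s
    return bits_to_send / effective_bps / 3600               # seconds -> hours


# 500 GiB over a fully used 100 Mbit/s link takes about 11.9 hours.
print(round(estimate_full_replication_hours(500, 100, efficiency=1.0), 1))
```

Use the console progress bar for the real status. The estimate is only useful for planning when to start the initial replication.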

(Optional) Disaster recovery drill

Once real-time replication starts, you can perform a disaster recovery drill on your server.

A disaster recovery drill launches the protected server in the cloud and validates application correctness. It is a critical part of the disaster recovery process because it:

Verifies that the protected application can start normally in the cloud.
Ensures that operators are familiar with the recovery process so they can smoothly perform a switchover if the primary site fails.

Perform a disaster recovery drill as follows:

On the Protected Servers tab, click Disaster Recovery Drill in the Actions column for the server you want to test.
In the Disaster Recovery Drill panel, configure Recovery Network, IP Address, Use ECS Instance Type, Disk Type, Recovery Point, Elastic IP Address, and Post-switch Script. Then click Start.
Note
- Cloud Backup automatically retains 24 recovery points from the last 24 hours for each server.
- If you do not use an ECS instance type, you must also specify CPU and memory.
Alibaba Cloud then starts the server in the background from the recovery point you selected. Real-time data replication continues unaffected during the drill.
After a few minutes, the drill completes. Click the link under Drill Information to verify data and applications.
Purge the drill environment.
After verification, click Purge Drill Environment in the Actions column for the server. This deletes the recovered ECS instance.
Note
After verifying the ECS instance created during the drill, purge the drill environment as soon as possible to reduce costs.
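A drill is also a chance to measure how long recovery takes in practice. The following sketch compares the elapsed time of a drill against an RTO target. The function name and the timestamps are hypothetical, and you would record the times yourself. The 15-minute default is taken from the best-case figure quoted in the introduction and should be replaced with your own business target.

```python
from datetime import datetime, timedelta


def drill_meets_rto(drill_started, application_verified,
                    rto_target=timedelta(minutes=15)):
    """Return (met, elapsed) for a drill measured between two timestamps.

    drill_started: when Start was clicked in the drill panel.
    application_verified: when the recovered application was confirmed working.
    """
    if application_verified < drill_started:
        raise ValueError("verification time precedes drill start")
    elapsed = application_verified - drill_started
    return elapsed <= rto_target, elapsed


# Made-up timestamps: the drill took 12 minutes from start to verification.
met, elapsed = drill_meets_rto(datetime(2024, 1, 1, 10, 0),
                               datetime(2024, 1, 1, 10, 12))
print(met, elapsed)
```

Recording this figure after every drill shows whether recovery time is drifting as the protected servers grow.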

Step 4: Failover

Regular disaster recovery drills ensure your business can start in the cloud at any time. If your primary site suffers a major failure and you need to restart core services immediately in the cloud, perform a failover.

Warning

Use failover only when the protected server has a critical failure. This operation stops real-time replication. To resume disaster recovery protection afterward, you must restart replication, which begins with a new full replication.

Perform a failover as follows:

On the Protected Servers tab, in the Actions column for the server, choose More > Failover > Failover.
In the Failover panel, configure Recovery Network, IP Address, Use ECS Instance Type, Disk Type, Recovery Point, Elastic IP Address, and Post-switch Script. Then click Start.
Important
You can use the Current Time recovery point only once.
After failover completes, click the link under Failover/Failback Information to check data and applications.
- If the application runs correctly at the current point in time, choose More > Failover > Confirm Failover.
  Note
  Run Confirm Failover only after the failover or recovery point change has completed and you have confirmed that the recovered application has taken over the business. Confirm Failover cleans up the disaster recovery resources in the cloud to save costs.
- If the application state is unsatisfactory—for example, due to database consistency issues or corrupted source data already synchronized to the other region—before confirming failover, choose More > Failover > Change Recovery Point.
Note
Changing the recovery point works like failover—you only need to select an earlier recovery point.
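The decision flow described above (fail over, verify, then either confirm or fall back to an earlier recovery point) can be sketched as a short loop. This only models the order of the manual console actions. The function name, the `is_healthy` callback, and the recovery point labels are hypothetical, and no Cloud Backup API is called.

```python
def plan_failover_actions(recovery_points, is_healthy):
    """Walk recovery points from newest to oldest until one passes verification.

    recovery_points: labels ordered newest first.
    is_healthy: callable that returns True when the application recovered
                from the given point is verified as working.
    Returns (chosen_point, actions); chosen_point is None if none passed.
    """
    actions = []
    for index, point in enumerate(recovery_points):
        # The first attempt is the failover itself; later attempts change the point.
        actions.append("Failover" if index == 0 else "Change Recovery Point")
        if is_healthy(point):
            # Confirm only after the recovered application has been verified.
            actions.append("Confirm Failover")
            return point, actions
    return None, actions


# Made-up scenario: the newest point holds corrupted data, the next one is clean.
points = ["12:00", "11:00", "10:00"]
print(plan_failover_actions(points, lambda p: p == "11:00"))
# → ('11:00', ['Failover', 'Change Recovery Point', 'Confirm Failover'])
```

Because Confirm Failover cleans up the disaster recovery resources, the sketch never emits it before a point has passed verification.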

Step 5: Reverse replication

After a protected server has been replicated from one region (for example, Region A) to another (for example, Region B), you can perform reverse replication from Region B back to Region A.

Perform reverse replication as follows:

On the Protected Servers tab, in the Actions column for the server, choose More > Failback > Reverse Registration, then confirm reverse registration of the protected server.
In the Actions column, choose More > Failback > Start Reverse Replication.
In the Start Reverse Replication panel, select whether to enable Original Machine Recovery, then select the Replication Network and Recovery Network. Then click Start.
Warning
Original machine recovery is supported for cross-region and cross-zone disaster recovery. When it is enabled, the existing data on the target ECS instance is purged. Use this option with caution.
When the server enters reverse real-time replication, in the Actions column, choose More > Failback > Failback.
In the Failback panel, enter CPU and Memory information, select the Recovery Network and IP Address, and edit the Post-recovery Script.
After failback completes, in the Actions column, choose More > Failover > Register to re-register the protected server.
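The failback sequence spans several menu paths, so it can be useful to keep it as an ordered runbook. The sketch below lists the menu paths from the steps above in order. The precondition strings are a paraphrase of those steps, and the names `FAILBACK_RUNBOOK` and `next_failback_action` are hypothetical.

```python
# (menu path, precondition) pairs in the order given by the steps above.
FAILBACK_RUNBOOK = [
    ("More > Failback > Reverse Registration",
     "the server has been replicated to the other region"),
    ("More > Failback > Start Reverse Replication",
     "reverse registration is confirmed"),
    ("More > Failback > Failback",
     "the server is in reverse real-time replication"),
    ("More > Failover > Register",
     "failback has completed"),
]


def next_failback_action(completed_steps):
    """Return the next (menu path, precondition) pair, or None when finished."""
    if not 0 <= completed_steps <= len(FAILBACK_RUNBOOK):
        raise ValueError("completed_steps is out of range")
    if completed_steps == len(FAILBACK_RUNBOOK):
        return None
    return FAILBACK_RUNBOOK[completed_steps]


# Print the whole runbook as a checklist.
for step, (menu_path, precondition) in enumerate(FAILBACK_RUNBOOK, start=1):
    print(f"{step}. {menu_path}  (when {precondition})")
```

Keeping the precondition next to each action makes it harder to trigger Failback before the server has reached reverse real-time replication.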