The disaster recovery system is deployed across two Alibaba Cloud regions. If the production site fails, for example because of a tsunami or earthquake, the business system switches to the disaster recovery site. Because the production and disaster recovery sites reside in different regions, the business can keep running when a disaster affects an entire region. The solution delivers Disaster Recovery as a Service with a recovery point objective (RPO) as low as 1 minute and a recovery time objective (RTO) as low as 15 minutes.
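To make the two objectives concrete, the following minimal sketch compares a single incident or drill against the RPO and RTO figures quoted above. The function name and the three timestamps are hypothetical; you would record them yourself, and the targets are the best-case "as low as" values, not guarantees.

```python
from datetime import datetime, timedelta

# Best-case targets quoted in the text above.
RPO_TARGET = timedelta(minutes=1)
RTO_TARGET = timedelta(minutes=15)


def measure_objectives(last_replicated: datetime,
                       failure_time: datetime,
                       service_restored: datetime) -> dict:
    """Compare one incident (or drill) against the RPO and RTO targets.

    All three timestamps are hypothetical inputs recorded by the operator.
    """
    data_loss_window = failure_time - last_replicated  # achieved RPO
    downtime = service_restored - failure_time         # achieved RTO
    return {
        "achieved_rpo": data_loss_window,
        "achieved_rto": downtime,
        "rpo_met": data_loss_window <= RPO_TARGET,
        "rto_met": downtime <= RTO_TARGET,
    }


if __name__ == "__main__":
    failure = datetime(2024, 1, 1, 12, 0, 0)
    result = measure_objectives(
        last_replicated=failure - timedelta(seconds=40),
        failure_time=failure,
        service_restored=failure + timedelta(minutes=12),
    )
    print(result["rpo_met"], result["rto_met"])
```

RPO measures how much recent data can be lost (the gap between the last replicated write and the failure), while RTO measures how long the service is unavailable.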
Preparations
Before implementing cross-region disaster recovery, select a region other than the region of your production environment as the destination region for disaster recovery. In that region, create a virtual private cloud (VPC), then create both a replication vSwitch and a recovery vSwitch in it.
Step 1: Create a disaster recovery site pair
After completing the preparations, protect your source ECS instance with cross-region disaster recovery as follows:
Log on to the Cloud Backup console.
Select , then click Switch to Continuous Replication-based Disaster Recovery in the upper-left corner of the page.
Click Add, select Cross-region disaster recovery as the type, and enter the Production site information and Disaster recovery site information.
Click Create.
Step 2: Add protected servers
After creating the disaster recovery site pair, add protected servers as follows:
Click the Protected Servers tab and confirm the disaster recovery site pair information in the upper-right corner.
Click Add next to Protected Servers, then select the ECS instances you want to protect.
Click Confirmation to complete the addition. The server status will first show "Installing client" and then change to "Initialized".
Note: If the server status does not show Initialized, click to complete client initialization.
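If you track the status change from a script rather than by watching the console, the wait can be sketched as a simple polling loop. This is only an illustration: `get_status` is a hypothetical callback that you would have to supply, and the timeout and interval values are arbitrary assumptions, not documented limits.

```python
import time
from typing import Callable


def wait_for_status(get_status: Callable[[], str],
                    wanted: str = "Initialized",
                    timeout_s: float = 600.0,
                    interval_s: float = 10.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the server reports the wanted status or the timeout expires.

    get_status is a hypothetical callback; how it obtains the status is up
    to the caller. Returns True on success and False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if get_status() == wanted:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated status sequence: "Installing client" twice, then "Initialized".
    statuses = iter(["Installing client", "Installing client", "Initialized"])
    print(wait_for_status(lambda: next(statuses), sleep=lambda s: None))
```

A `False` result corresponds to the case in the note above, where the client has to be initialized manually.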
Step 3: Start replication
Start disaster recovery replication to copy your server to the cloud and maintain real-time replication. Follow these steps:
Click the Protected Servers tab. In the Actions column for the server you want to replicate, choose .
In the Start Replication panel, configure the following parameters, then click Start.
Parameter: Description
Recovery Point Policy: Select a time interval from the drop-down list. Cloud Backup creates a recovery point at this interval each day. The unit is hours.
Disk Type: Supported types include ultra disk, ESSD, and SSD.
Copy Network: Select a replication network from the drop-down list. Cloud Backup uses this network to replicate disaster recovery data to the cloud. By default, Cloud Backup reads available vSwitches from the secondary site VPC. The replication and recovery networks can use the same vSwitch. Using the same network speeds up recovery. If the replication and recovery networks are in different zones, RTO increases. We recommend configuring the same zone as the Recovery Network.
Restore Network: Select a recovery network from the drop-down list. During disaster recovery (such as drills or failover), Cloud Backup uses this network to restore data, for example to create recovered ECS instances. By default, Cloud Backup reads available vSwitches from the secondary site VPC. The replication and recovery networks can use the same vSwitch. Using the same network speeds up recovery. If the replication and recovery networks are in different zones, RTO increases. We recommend configuring the same zone as the Replication Network.
Automatically Restart After Replication Interruption: Specifies whether to automatically restart replication after an interruption. Select this option to restart the replication task if it stops.
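The zone recommendation in the table can be checked before you open the panel. The sketch below is an assumption-laden illustration: the `VSwitch` and `ReplicationSettings` classes are hypothetical stand-ins for values you enter in the console, not part of any Alibaba Cloud SDK, and the rule that the interval must be at least 1 is inferred from the unit being hours.

```python
from dataclasses import dataclass
from typing import List

# Disk types listed in the parameter table (labels are illustrative).
SUPPORTED_DISK_TYPES = {"ultra disk", "ESSD", "SSD"}


@dataclass(frozen=True)
class VSwitch:
    """Hypothetical stand-in for a vSwitch in the disaster recovery VPC."""
    vswitch_id: str
    zone: str


@dataclass(frozen=True)
class ReplicationSettings:
    """Hypothetical container mirroring the Start Replication panel."""
    recovery_point_interval_hours: int
    disk_type: str
    replication_vswitch: VSwitch
    recovery_vswitch: VSwitch
    auto_restart: bool = True


def review_settings(s: ReplicationSettings) -> List[str]:
    """Return findings; an empty list means nothing to flag."""
    findings = []
    if s.recovery_point_interval_hours < 1:
        # Assumption: the interval is a positive number of hours.
        findings.append("error: recovery point interval must be at least 1 hour")
    if s.disk_type not in SUPPORTED_DISK_TYPES:
        findings.append(f"error: unsupported disk type {s.disk_type!r}")
    if s.replication_vswitch.zone != s.recovery_vswitch.zone:
        findings.append("warning: replication and recovery networks are in "
                        "different zones; expect a higher RTO")
    return findings
```

Using the same vSwitch for both networks trivially satisfies the zone check, which matches the recommendation that a shared network speeds up recovery.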
The disaster recovery replication then proceeds through three stages: Starting Replication, Full Replication, and Real-time Replication.
Starting Replication: The ECS disaster recovery service scans system data and estimates the total data volume. This stage usually takes a few minutes.
Full Replication: The ECS disaster recovery service transfers all valid data from the entire server to Alibaba Cloud. The duration depends on data volume and network bandwidth. The console progress bar shows replication progress.
Real-time Replication: After full replication completes, Alibaba Cloud holds a complete copy of your data. Then, Alibaba Cloud Replication Service (AReS) monitors all disk write operations on the server and continuously replicates them to Alibaba Cloud in real time.
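The three stages and the dependence of full replication on data volume and bandwidth can be sketched as follows. This is a back-of-the-envelope model, not how the service computes progress: the `efficiency` factor is an assumed allowance for protocol overhead, and the stage names are taken from the list above.

```python
from enum import Enum


class Stage(Enum):
    STARTING_REPLICATION = 1
    FULL_REPLICATION = 2
    REALTIME_REPLICATION = 3


def next_stage(stage: Stage) -> Stage:
    """Advance one stage; real-time replication is the steady state."""
    if stage is Stage.REALTIME_REPLICATION:
        return stage
    return Stage(stage.value + 1)


def estimate_full_replication_hours(data_gib: float,
                                    bandwidth_mbps: float,
                                    efficiency: float = 0.8) -> float:
    """Rough transfer time: data volume divided by usable bandwidth.

    efficiency is an assumed fudge factor, not a documented figure.
    """
    if data_gib < 0 or bandwidth_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("inputs must be positive and efficiency in (0, 1]")
    bits = data_gib * 1024 ** 3 * 8
    usable_bits_per_second = bandwidth_mbps * 1_000_000 * efficiency
    return bits / usable_bits_per_second / 3600


if __name__ == "__main__":
    # 500 GiB over a 100 Mbit/s link at the assumed 80% efficiency.
    print(round(estimate_full_replication_hours(500, 100), 1))
```

Under these assumptions, 500 GiB over a 100 Mbit/s link takes roughly 15 hours, which is why the full replication stage dominates the initial setup time.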
(Optional) Disaster recovery drill
Once real-time replication starts, you can perform a disaster recovery drill on your server.
A disaster recovery drill launches the protected server in the cloud and validates application correctness. It is a critical part of the disaster recovery process because it:
Verifies that the protected application can start normally in the cloud.
Ensures that operators are familiar with the recovery process so they can smoothly perform a switchover if the primary site fails.
Perform a disaster recovery drill as follows:
On the Protected Servers tab, click Disaster Recovery Drill in the Actions column for the server you want to test.
In the Disaster Recovery Drill panel, select the Recovery Network, IP Address, whether to Use ECS Instance Type, Disk Type, Recovery Point, Elastic IP Address, and Post-switch Script. Then click Start.
Note: Cloud Backup automatically retains 24 recovery points from the last 24 hours for each server.
If you do not use an ECS instance type, you must also specify CPU and memory.
Alibaba Cloud then starts the server in the background based on your selected point in time. Real-time data replication continues unaffected during the drill.
After a few minutes, the drill completes. Click the link under Drill Information to verify data and applications.
Purge the drill environment.
After verification, click Purge Drill Environment in the Actions column for the server. This deletes the recovered ECS instance.
Note: After verifying the ECS instance created during the drill, purge the drill environment as soon as possible to reduce costs.
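The retention rule mentioned in the drill steps (24 recovery points from the last 24 hours) and the choice of a Recovery Point can be modeled as below. This is a simplified sketch of the rule as stated, not the service's actual pruning logic; the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Optional

RETENTION_WINDOW = timedelta(hours=24)
MAX_POINTS = 24


def retained_points(points: List[datetime], now: datetime) -> List[datetime]:
    """Keep the newest points inside the last 24 hours (at most 24)."""
    recent = sorted(p for p in points if now - RETENTION_WINDOW <= p <= now)
    return recent[-MAX_POINTS:]


def latest_point_before(points: List[datetime],
                        target: datetime) -> Optional[datetime]:
    """Pick the newest recovery point at or before the target time, if any."""
    candidates = [p for p in points if p <= target]
    return max(candidates) if candidates else None


if __name__ == "__main__":
    now = datetime(2024, 1, 2, 0, 0)
    hourly = [now - timedelta(hours=i) for i in range(30)]
    kept = retained_points(hourly, now)
    print(len(kept), latest_point_before(kept, now - timedelta(minutes=90)))
```

For a drill you would typically pass the time you want to test against as `target` and select the returned point in the panel.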
Step 4: Failover
Regular disaster recovery drills ensure your business can start in the cloud at any time. If your primary site suffers a major failure and you need to restart core services immediately in the cloud, perform a failover.
Use failover only when the protected server has a critical failure. This operation stops real-time replication. You must restart replication and perform a full replication to resume disaster recovery protection.
Perform a failover as follows:
On the Protected Servers tab, in the Actions column for the server, choose .
In the Failover panel, select the Recovery Network, IP Address, whether to Use ECS Instance Type, Disk Type, Recovery Point, Elastic IP Address, and Post-switch Script. Then click Start.
Important: You can use the Current Time recovery point only once.
After failover completes, click the link under Failover/Failback Information to check data and applications.
If the application runs correctly at the current point in time, choose .
Note: After you complete failover or change the recovery point, and you have confirmed that the recovered application has taken over the business, perform Confirm Failover to clean up disaster recovery resources in the cloud and save costs.
If the application state is unsatisfactory—for example, due to database consistency issues or corrupted source data already synchronized to the other region—before confirming failover, choose .
Note: Changing the recovery point works like failover; you only need to select an earlier recovery point.
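The decision between confirming failover and changing the recovery point amounts to walking back through recovery points until one passes verification. A minimal sketch, assuming `is_healthy` is a hypothetical callback that stands in for your own manual or scripted check of the recovered instance:

```python
from datetime import datetime
from typing import Callable, List, Optional


def choose_recovery_point(points: List[datetime],
                          is_healthy: Callable[[datetime], bool]
                          ) -> Optional[datetime]:
    """Try recovery points newest first; return the first acceptable one.

    is_healthy is a hypothetical verification callback supplied by you.
    """
    for point in sorted(points, reverse=True):
        if is_healthy(point):
            return point  # the point at which you would confirm failover
    return None  # no retained point passed verification


if __name__ == "__main__":
    points = [datetime(2024, 1, 1, h) for h in (10, 11, 12)]
    corrupted_at = datetime(2024, 1, 1, 11, 30)
    # Points taken after the corruption already contain the bad data.
    print(choose_recovery_point(points, lambda p: p < corrupted_at))
```

In the example, the 12:00 point already contains corrupted source data, so the search falls back to the 11:00 point, which mirrors selecting an earlier recovery point before confirming failover.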
Step 5: Reverse replication
After replicating a protected server from one region (for example, Region A) to another (for example, Region B), you can perform reverse replication from Region B back to Region A.
Perform reverse replication as follows:
On the Protected Servers tab, in the Actions column for the server, choose , then confirm reverse registration of the protected server.
In the Actions column, choose .
In the Start Reverse Replication panel, select whether to enable Original Machine Recovery, then select the Replication Network and Recovery Network. Then click Start.
Warning: Cross-region and cross-zone disaster recovery support original machine recovery. When enabled, data on the target ECS host will be purged. Use this option with caution.
When the server enters reverse real-time replication, in the Actions column, choose .
In the Failback panel, enter CPU and Memory information, select the Recovery Network and IP Address, and edit the Post-recovery Script.
After failback completes, in the Actions column, choose to re-register the protected server.
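Because the reverse replication steps must happen in a fixed order, a runbook can guard against skipping one. The sketch below encodes the sequence described above as a small state machine; the phase names are descriptive labels chosen for this example, not console action names.

```python
from enum import Enum


class Phase(Enum):
    FAILED_OVER = "workload running in Region B after failover"
    REVERSE_REGISTERED = "protected server reverse registered"
    REVERSE_REPLICATION_STARTED = "reverse replication started"
    REVERSE_REALTIME = "reverse real-time replication"
    FAILED_BACK = "failback to Region A complete"
    REREGISTERED = "protected server re-registered"


# The only permitted order, following the steps in this section.
ORDER = list(Phase)


def advance(current: Phase, requested: Phase) -> Phase:
    """Allow only the next phase in ORDER; anything else raises ValueError."""
    idx = ORDER.index(current)
    if idx + 1 < len(ORDER) and ORDER[idx + 1] is requested:
        return requested
    raise ValueError(f"cannot go from {current.name} to {requested.name}")


if __name__ == "__main__":
    phase = Phase.FAILED_OVER
    for step in ORDER[1:]:
        phase = advance(phase, step)
        print(phase.name)
```

For instance, requesting `FAILED_BACK` before the server has reached `REVERSE_REALTIME` raises an error, reflecting that failback is available only once reverse real-time replication is running.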