Elastic Compute Service (ECS) disaster recovery deploys the primary system and the
disaster recovery system in different regions of Alibaba Cloud. If the primary system
fails, the business system switches to the disaster recovery system. The service
features a recovery point objective (RPO) of as low as 1 minute and a recovery time
objective (RTO) of as low as 15 minutes. Cross-region disaster recovery helps maintain
business continuity when a regional disaster makes the primary region unavailable.
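As a rough illustration of what the RPO and RTO figures measure, the following sketch computes both values from timestamps. The function names and the sample timestamps are hypothetical and are not part of HBR. The 1-minute and 15-minute targets are the best-case figures quoted above.

```python
from datetime import datetime, timedelta

# Best-case targets quoted in this topic; actual values depend on the workload.
RPO_TARGET = timedelta(minutes=1)
RTO_TARGET = timedelta(minutes=15)


def achieved_rpo(last_replicated_write: datetime, failure_time: datetime) -> timedelta:
    """Data-loss window: time between the last replicated write and the failure."""
    return max(failure_time - last_replicated_write, timedelta(0))


def achieved_rto(failure_time: datetime, service_restored: datetime) -> timedelta:
    """Downtime: time between the failure and the workload running at the secondary site."""
    return max(service_restored - failure_time, timedelta(0))


# Hypothetical incident timeline, for illustration only.
failure = datetime(2024, 1, 1, 12, 0, 0)
rpo = achieved_rpo(datetime(2024, 1, 1, 11, 59, 30), failure)
rto = achieved_rto(failure, datetime(2024, 1, 1, 12, 12, 0))
print("RPO met:", rpo <= RPO_TARGET, "RTO met:", rto <= RTO_TARGET)
```

Recording these two values after each disaster recovery drill is one way to track whether a deployment stays within its own objectives.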
Before you begin
Before you implement cross-region disaster recovery, you must select a region to deploy
the disaster recovery system. The region must be different from the region where the
production environment is deployed. You must create a virtual private cloud (VPC)
in the region. In addition, you must create a vSwitch for replication and a vSwitch
for restoration in the VPC.
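The prerequisites above can be written down as a small pre-flight check. This is a minimal sketch: the `RecoveryPlan` and `VSwitch` types, the field names, and the sample IDs are hypothetical, and the code does not call any Alibaba Cloud API. The zone warning reflects the recommendation in Step 3 that the replication network and the recovery network share a zone.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VSwitch:
    vswitch_id: str  # hypothetical identifier
    zone: str


@dataclass
class RecoveryPlan:
    primary_region: str
    recovery_region: str
    recovery_vpc_id: str
    replication_vswitch: VSwitch
    recovery_vswitch: VSwitch


def check_prerequisites(plan: RecoveryPlan) -> List[str]:
    """Return a list of problems; an empty list means the plan looks ready."""
    problems = []
    if plan.primary_region == plan.recovery_region:
        problems.append("error: the disaster recovery region must differ from the production region")
    if not plan.recovery_vpc_id:
        problems.append("error: a VPC is required in the disaster recovery region")
    if plan.replication_vswitch.zone != plan.recovery_vswitch.zone:
        problems.append("warning: the vSwitches are in different zones, which lengthens the RTO")
    return problems


# Illustrative values only.
plan = RecoveryPlan(
    primary_region="cn-hangzhou",
    recovery_region="cn-shanghai",
    recovery_vpc_id="vpc-example",
    replication_vswitch=VSwitch("vsw-replication", "cn-shanghai-b"),
    recovery_vswitch=VSwitch("vsw-recovery", "cn-shanghai-b"),
)
print(check_prerequisites(plan))
```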
Step 1: Create a disaster recovery site pair
To create a disaster recovery site pair that provides cross-region disaster recovery
protection for ECS instances in the primary site, perform the following steps:
- Log on to the HBR console.
- In the left-side navigation pane, choose .
- In the upper-right corner of the Disaster Recovery Center page, click + Add.
- On the Create Disaster Recovery Site Pair (Continuous Data Replication) panel, set the parameters and click Create.
- Set Type to Region to Region.
- Configure the primary site information.
The primary site specifies the cloud location of the servers that require disaster
recovery.
Parameter | Description
Name | Specify the name of the primary site. For example, you can specify Hangzhou Primary Site. The name can be up to 60 characters in length. The name cannot start with a special character or digit, and can contain only the following special characters: periods (.), underscores (_), and hyphens (-).
Region | Select the region where the primary site resides from the Region drop-down list. For example, you can select China (Hangzhou).
VPC | Select the VPC that is created for the primary site from the VPC drop-down list. For example, you can select Default VPC.
- Configure the secondary site information.
The compute and storage resources that are used by the secondary site are created
in the specified VPC.
Parameter | Description
Name | Specify the name of the secondary site. For example, you can specify Shanghai Secondary Site. The name can be up to 60 characters in length. The name cannot start with a special character or digit, and can contain only the following special characters: periods (.), underscores (_), and hyphens (-).
Region | Select the region where the secondary site resides from the Region drop-down list. For example, you can select China (Shanghai).
VPC | Select the VPC where the secondary site resides from the VPC drop-down list. For example, you can select Default VPC.
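The naming rules for the primary and secondary sites can be checked before you open the console. The following sketch is an approximation under stated assumptions: it treats letters and digits as ordinary characters, and it accepts spaces only because the example names in this topic (such as Hangzhou Primary Site) contain them. The console may apply additional rules that are not listed here.

```python
from typing import List

MAX_LENGTH = 60
ALLOWED_SPECIAL = set("._-")  # periods, underscores, and hyphens


def site_name_problems(name: str) -> List[str]:
    """Return the naming rules that the given site name violates."""
    if not name:
        return ["name is empty"]
    problems = []
    if len(name) > MAX_LENGTH:
        problems.append("name is longer than 60 characters")
    if not name[0].isalpha():
        problems.append("name must not start with a special character or digit")
    # Assumption: spaces are accepted because the documented examples use them.
    bad = sorted({ch for ch in name
                  if not (ch.isalnum() or ch == " " or ch in ALLOWED_SPECIAL)})
    if bad:
        problems.append("unsupported characters: " + "".join(bad))
    return problems


for candidate in ["Hangzhou Primary Site", "1st-site", "site@shanghai"]:
    print(candidate, "->", site_name_problems(candidate) or "ok")
```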
Step 2: Add the ECS instances to be protected
To add the ECS instances to be protected, perform the following steps:
- Click the Protected Server tab. In the upper-right corner of this tab, select the disaster recovery site pair
that you created in Step 1 from the drop-down list.
- On the Protected Server tab, click + Add. Select the ECS instances and click OK.
You can select 1 to 10 ECS instances.
In the Server Status column, the status of the added ECS instances is Agent Installing
and then changes to Initialized. If the status of an ECS instance is not Initialized,
choose in the Operation column to initialize the instance.
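Because each Add operation accepts 1 to 10 ECS instances, a larger fleet has to be added in several rounds. The following sketch splits a list of instance IDs into groups of at most 10. The helper name and the instance IDs are hypothetical, and the code only plans the rounds; it does not add anything to HBR.

```python
from typing import List

MAX_PER_ADD = 10  # limit stated in this step


def plan_add_rounds(instance_ids: List[str], size: int = MAX_PER_ADD) -> List[List[str]]:
    """Split instance IDs into consecutive groups of at most `size` entries."""
    if not 1 <= size <= MAX_PER_ADD:
        raise ValueError("size must be between 1 and 10")
    return [instance_ids[i:i + size] for i in range(0, len(instance_ids), size)]


# Hypothetical instance IDs, for illustration only.
fleet = [f"i-example-{n:03d}" for n in range(23)]
for round_number, group in enumerate(plan_add_rounds(fleet), start=1):
    print(f"round {round_number}: {len(group)} instances")
```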
Step 3: Start replication
To enable real-time replication of ECS instances to Alibaba Cloud, perform the following
steps:
- On the Protected Server tab, find the ECS instance that you want to replicate and choose in the Operation column.
- On the Enable Replication panel, set the parameters and click Start.
Parameter | Description
Recovery Point Policy | Select the interval at which recovery points are created from the drop-down list. Unit: hours. For example, if you select 1 hour, HBR creates a recovery point every hour.
Use SSD | Specify whether to use SSDs for replication. If you select this check box, the I/O performance of the ECS instance on the cloud after server migration or failover is significantly improved. However, the usage cost increases. Select this option based on your requirements.
Replication Network | Select a replication network from the drop-down list. HBR uses this network to replicate data for disaster recovery. By default, HBR reads the available vSwitches of the secondary VPC network. If the replication network and the recovery network are not in the same zone, the RTO becomes longer. We recommend that you configure the same zone for the replication network and the recovery network.
Recovery Network | Select a recovery network from the drop-down list. HBR uses this network to restore data for disaster recovery. By default, HBR reads the available vSwitches of the secondary VPC network. If the replication network and the recovery network are not in the same zone, the RTO becomes longer. We recommend that you configure the same zone for the replication network and the recovery network.
Automatic restart after replication interruption | Specify whether to automatically resume replication if an interruption occurs.
The ECS instance then enters the Enable Replicating, Initial Full Sync, and Replicating states in sequence.
- Enable Replicating: ECS disaster recovery is scanning data on the ECS instance and evaluating the overall
data volume. In most cases, this process takes a few minutes.
- Initial Full Sync: ECS disaster recovery is replicating valid data on the ECS instance to Alibaba Cloud.
The replication duration depends on factors such as the data volume and the network
bandwidth of the ECS instance. The progress bar in the Server Status column shows
the replication progress.
- Replicating: After all valid data on the ECS instance is replicated to Alibaba Cloud, Aliyun
Replication Service (AReS) monitors all write operations that are performed on the
disks of the ECS instance and replicates the incremental data to Alibaba Cloud in
real time.
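The state sequence described above, together with a back-of-the-envelope estimate for the Initial Full Sync phase, can be sketched as follows. The enum and function names are hypothetical. The estimate assumes that the full bandwidth is available for replication and ignores compression, protocol overhead, and throttling, so treat it as a lower bound rather than a prediction.

```python
from enum import Enum
from typing import Optional


class ReplicationState(Enum):
    ENABLE_REPLICATING = "Enable Replicating"
    INITIAL_FULL_SYNC = "Initial Full Sync"
    REPLICATING = "Replicating"


# Enum members iterate in definition order, which matches the documented sequence.
ORDER = list(ReplicationState)


def next_state(state: ReplicationState) -> Optional[ReplicationState]:
    """Return the state that follows `state`, or None for the final state."""
    index = ORDER.index(state)
    return ORDER[index + 1] if index + 1 < len(ORDER) else None


def full_sync_lower_bound_seconds(valid_data_gib: float, bandwidth_mbps: float) -> float:
    """Minimum transfer time for the valid data at the given bandwidth (Mbit/s)."""
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth must be positive")
    bits = valid_data_gib * 1024 ** 3 * 8
    return bits / (bandwidth_mbps * 1_000_000)


hours = full_sync_lower_bound_seconds(100, 100) / 3600
print(f"100 GiB at 100 Mbit/s needs at least {hours:.1f} hours")
```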
(Optional) Perform a disaster recovery drill
After an ECS instance enters the Replicating state, you can perform a disaster recovery
drill on the ECS instance.
A disaster recovery drill is an important part of disaster recovery. It allows you
to run a protected ECS instance on the cloud to verify whether your applications can
run as expected. A disaster recovery drill has the following features:
- Allows you to easily check whether an application can run on a restored ECS instance
as expected.
- Familiarizes you with the disaster recovery process and makes sure that a smooth failover
can be performed when the primary site encounters a failure.
To perform a disaster recovery drill, perform the following steps:
- On the Protected Server tab, find the ECS instance and click Test Failover in the Operation column.
- On the Test Failover panel, set the Recovery Network, IP Address, Use ECS Specification, Hard Disk Type, Recovery Point, Elastic Public Network IP, and Post Script parameters. Then, click Start.
Note
- HBR automatically retains 24 recovery points that are created in the most recent 24
hours for each ECS instance.
- If you do not select Use ECS Specification, you must set the CPU and Memory parameters.
Alibaba Cloud then runs the application on a restored ECS instance at the specified
time. The disaster recovery drill does not affect real-time data replication.
The disaster recovery drill is completed within a few minutes. After the drill is
completed, click the link in the Test Failover Information column to verify restored
data and applications.
- Clear the drill environment.
After the verification is completed, click Cleanup Test Environment in the Operation column. Then, the restored ECS instance is deleted.
Note After the restored ECS instance is verified, we recommend that you delete the restored
ECS instance at the earliest opportunity to reduce costs.
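The retention behavior mentioned in the first note of this section (24 recovery points from the most recent 24 hours) can be modeled to see which recovery points remain selectable for a drill. This is an illustrative model only: the function name is hypothetical, and HBR applies the actual retention on the service side.

```python
from datetime import datetime, timedelta
from typing import List


def retained_recovery_points(points: List[datetime], now: datetime,
                             window: timedelta = timedelta(hours=24),
                             limit: int = 24) -> List[datetime]:
    """Keep at most `limit` of the newest points created within `window` of `now`."""
    recent = sorted(p for p in points if now - window <= p <= now)
    return recent[-limit:]


# Hypothetical history: one recovery point per hour over the last two days.
now = datetime(2024, 1, 2, 12, 0)
history = [now - timedelta(hours=h) for h in range(48)]
kept = retained_recovery_points(history, now)
print(len(kept), "points kept, oldest:", kept[0], "newest:", kept[-1])
```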
Step 4: Perform a failover
Regular disaster recovery drills ensure that you can run your applications on restored
ECS instances at any time. When a critical error occurs in the primary site, you can
switch your workloads to the secondary site.
Warning Failover is applicable to protected ECS instances where a critical error occurs. During
the failover, ECS disaster recovery stops real-time data replication. To resume replication
for a protected ECS instance, you must choose More > Server Operation > Restart Replication
in the Operation column.
To perform a failover, perform the following steps:
- On the Protected Server tab, find the ECS instance and choose in the Operation column.
- On the Failover panel, set the Recovery Network, IP Address, Use ECS Specification, Hard Disk Type, Recovery Point, Elastic Public Network IP, and Post Script parameters. Then, click Start.
Notice You can restore the ECS instance to the current point in time only once.
- After the failover is completed, click the link in the Recovered Instance ID/Name column to verify restored data and applications.
- If the applications run as expected after being restored to the current point in time,
choose in the Operation column.
Note After you complete the failover or change the recovery point, verify that the
restored applications can handle your business workloads. You can then commit the
failover to release the cloud resources that are occupied during the failover.
- If the applications do not meet the requirements after being restored to the current
point in time, for example, data in the restored database is inconsistent with that
in the source database or dirty data on the source ECS instance is synchronized to
the restored ECS instance in the destination region, choose in the Operation column to change the recovery point before you commit the failover.
Note The procedure for changing the recovery point is similar to that for failover, except
that you must select a recovery point earlier than the current point in time.
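When you change the recovery point because dirty data was replicated, the usual goal is the newest recovery point that predates the bad write. The following sketch shows that selection. It assumes that you know roughly when the dirty data was written. The function name and timestamps are hypothetical, and the recovery point itself is still selected in the console.

```python
from datetime import datetime
from typing import List, Optional


def latest_point_before(points: List[datetime], corruption_time: datetime) -> Optional[datetime]:
    """Return the newest recovery point created strictly before `corruption_time`."""
    earlier = [p for p in points if p < corruption_time]
    return max(earlier) if earlier else None


# Hypothetical hourly recovery points and a bad write at 10:20.
points = [datetime(2024, 1, 1, hour) for hour in range(8, 13)]
choice = latest_point_before(points, datetime(2024, 1, 1, 10, 20))
print("recover to:", choice)
```

If the function returns `None`, no retained recovery point predates the corruption, and changing the recovery point cannot remove the dirty data.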
Step 5: Perform a reverse replication
After you replicate applications on a protected ECS instance in Region A to Region
B, you can also perform a reverse replication to replicate applications from Region
B to Region A.
To perform a reverse replication, perform the following steps:
- On the Protected Server tab, find the ECS instance and choose in the Operation column. In the message that appears, confirm that you want to perform a reverse registration
on the ECS instance.
- In the Actions column, choose .
- On the Initiate Reverse Replication panel, set the Original machine recovery, Replication Network, and Recovery Network parameters. Then, click Start.
Warning Cross-region disaster recovery and cross-zone disaster recovery allow you to replicate
applications back to the original ECS instance. However, when you replicate applications
back to the original ECS instance, data on the original ECS instance is overwritten.
Perform this operation with caution.
- After the ECS instance enters the Reversed Enable Replicating state, choose in the Operation column.
- On the Failback panel, set the CPU, Memory, Recovery Network, IP Address, and Post Script parameters. Then, click Start.
- After the failback is completed, choose in the Operation column to register the protected ECS instance again.
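Because replicating back to the original ECS instance overwrites its data, an automation wrapper around this procedure could require an explicit acknowledgement before it continues. The following guard is a minimal sketch. Both parameters are hypothetical and only loosely mirror the Original machine recovery setting; the code does not interact with HBR.

```python
def may_start_reverse_replication(restore_to_original: bool,
                                  overwrite_acknowledged: bool) -> bool:
    """Allow the operation unless it would overwrite the original instance unacknowledged."""
    if restore_to_original and not overwrite_acknowledged:
        return False
    return True


# Replicating back to the original instance without acknowledgement is blocked.
print(may_start_reverse_replication(restore_to_original=True, overwrite_acknowledged=False))
```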