Best Practices for Cross-Zone Disaster Recovery and Multi-Site High Availability on the Cloud

By Deng Qinglin, Alibaba Cloud Technical Expert
Contributed by ECS Team

System Disaster Recovery

Disaster recovery is always discussed in terms of faults. Common fault types include change-induced failures, hardware faults, power outages and network disconnections, and natural disasters, listed in decreasing order of frequency. However, low frequency does not mean low importance. Failures caused by power outages or natural disasters are often fatal.

In 2021, a data center belonging to a European cloud service company caught fire and burned down completely. Some customers' data was permanently lost and could not be recovered. Even when an application is deployed on the cloud, infrastructure-level faults (such as power outages, network disconnections, and faults caused by extreme natural disasters) cannot be ruled out, so a corresponding disaster recovery solution is necessary.

Currently, the main types of disaster recovery can be divided into the following three categories:

① Cross-zone disaster recovery mainly includes zone-disaster recovery, dual-active disaster recovery, and multi-active in the same region.

② Cross-region disaster recovery mainly includes cross-region double-reading disaster recovery, cross-region application dual-active disaster recovery, and cross-region dual-active disaster recovery.

③ Other types include three data centers across two cities, three-active architecture across two cities, and unitization.

No disaster recovery solution can be applied to all scenarios. We need to comprehensively evaluate the actual business development trend, the characteristics of the business system, and the cost of resources that can be invested, and finally, select the most suitable disaster recovery architecture solution.

Mainstream Disaster Recovery Architecture

Disaster recovery capability is mainly evaluated by two metrics: recovery point objective (RPO) and recovery time objective (RTO).

RPO is the maximum amount of data loss the service system can tolerate in the event of a failure. The more important the system, the smaller the required RPO. If data is protected by backups, a smaller RPO means more frequent backups. For example, a general system may back up data once a day, while a very important system may back up data once an hour. If data is protected by synchronization, a smaller RPO requires a more reliable or lower-latency synchronization link, which puts more pressure on the production environment and network and raises the cost.

RTO is the maximum amount of time the system can tolerate between application failure and recovery. The more important the system, the smaller the required RTO.
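As a rough illustration of how the two metrics can be checked against a target, the following Python sketch treats the worst-case RPO of a backup scheme as one full backup interval. The class and function names, the four-hour recovery time, and the target values are assumptions made for this example. Only the daily and hourly backup intervals come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryObjective:
    rpo_seconds: float  # maximum tolerable data loss, expressed as time
    rto_seconds: float  # maximum tolerable time from failure to recovery


def worst_case_rpo_for_backup(backup_interval_seconds: float) -> float:
    """A failure just before the next backup loses up to one full interval."""
    return backup_interval_seconds


def meets_objective(achieved: RecoveryObjective, target: RecoveryObjective) -> bool:
    """A solution qualifies only if both metrics are within the target."""
    return (achieved.rpo_seconds <= target.rpo_seconds
            and achieved.rto_seconds <= target.rto_seconds)


HOUR = 3600
# Hypothetical systems: both are assumed to recover in four hours.
daily_backup = RecoveryObjective(worst_case_rpo_for_backup(24 * HOUR), rto_seconds=4 * HOUR)
hourly_backup = RecoveryObjective(worst_case_rpo_for_backup(1 * HOUR), rto_seconds=4 * HOUR)

# Hypothetical target: lose at most two hours of data, recover within eight hours.
target = RecoveryObjective(rpo_seconds=2 * HOUR, rto_seconds=8 * HOUR)

print(meets_objective(daily_backup, target))   # daily backups miss the RPO target
print(meets_objective(hourly_backup, target))  # hourly backups satisfy both metrics
```

Under these assumed numbers, only the hourly scheme qualifies, which matches the point that a smaller RPO implies more frequent backups.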

The preceding figure shows the comparison of the current four mainstream disaster recovery architectures.

1. Disaster Recovery in the Same Region

At least two data centers are deployed in the same city. The secondary data center usually does not provide service and is mainly used as a backup of the primary data center. Data synchronization between the primary and secondary data centers is in the form of one-way synchronization.

The advantage is that deployment is simple: the same architecture can be copied in full to another data center. Data is synchronized in one direction, and the business requires very little modification.

The disadvantage is that resources in the secondary data center are wasted. In addition, because the secondary data center does not normally serve traffic, its versions, parameters, and operating systems tend to drift out of sync with the primary, so teams are reluctant to switch traffic to it at critical moments. The RTO is on the order of ten minutes.

To fail over in this architecture, you first switch the primary and secondary databases. If the secondary data center is a cold standby, the application services must also be started, and the top-level DNS must switch resolution to the secondary data center. The whole process takes more than ten minutes.
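The switchover order described above can be sketched as a small runbook in Python. This is a minimal sketch, not an actual tool: the step names and the stub actions are hypothetical, and a real switchover would call the relevant database, deployment, and DNS operations at each step.

```python
from typing import Callable, Dict, List


def build_switchover_plan(cold_standby: bool) -> List[str]:
    """Return the ordered steps for failing over to the secondary data center."""
    steps = ["switch_primary_and_secondary_database"]
    if cold_standby:
        # A cold standby has no running application services yet.
        steps.append("start_application_services")
    steps.append("switch_dns_resolution")
    return steps


def run_plan(steps: List[str], actions: Dict[str, Callable[[], None]]) -> List[str]:
    """Run each step in order; an exception stops the switchover immediately."""
    completed: List[str] = []
    for step in steps:
        actions[step]()
        completed.append(step)
    return completed


# Stub actions that only record what would be done (hypothetical).
log: List[str] = []
stub_actions = {
    name: (lambda n=name: log.append(n))
    for name in build_switchover_plan(cold_standby=True)
}
completed_steps = run_plan(build_switchover_plan(cold_standby=True), stub_actions)
print(completed_steps)
```

Keeping the plan as data makes it easy to rehearse the same sequence in drills and to see why a cold standby lengthens the RTO: it adds a step before DNS can be switched.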

2. Dual-Active Disaster Recovery in the Same Region

Two data centers provide services at the same time. All data-layer operations issued in the secondary data center are sent back to the primary data center to keep data consistent. Therefore, the two data centers must be less than 50 km apart, with an RT of less than 2 ms. A request that lands in the secondary data center involves cross-data center operations. If the RT between the data centers is large, request performance will differ significantly between the primary and secondary data centers, which degrades the user experience. Data synchronization in this architecture is one-way.
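To see why the RT limit matters, consider a simplified latency model in Python. It assumes that a request performs its data operations sequentially and that each operation issued from the secondary data center pays one cross-data center RT. The 30 ms local processing time and the count of 10 data operations are made-up values for illustration.

```python
def request_latency_ms(local_processing_ms: float,
                       data_operations: int,
                       cross_dc_rt_ms: float,
                       in_primary: bool) -> float:
    """Estimate request latency under a simple sequential model.

    Requests served in the primary data center pay no cross-data center cost.
    Requests served in the secondary pay one RT per data operation.
    """
    penalty = 0.0 if in_primary else data_operations * cross_dc_rt_ms
    return local_processing_ms + penalty


# Assumed workload: 30 ms of local work and 10 sequential data operations.
primary = request_latency_ms(30.0, 10, 2.0, in_primary=True)
secondary_2ms = request_latency_ms(30.0, 10, 2.0, in_primary=False)
secondary_10ms = request_latency_ms(30.0, 10, 10.0, in_primary=False)

print(primary, secondary_2ms, secondary_10ms)
```

With a 2 ms RT, the secondary data center adds 20 ms to this hypothetical request. With a 10 ms RT it adds 100 ms, which is the kind of gap between data centers that the distance limit is meant to prevent.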

The advantage is that it solves the problem of resource waste in the secondary data center. Because the secondary data center serves traffic in normal operation, traffic can be switched to it at any time when a fault occurs. Only the primary/secondary switchover of the database is required. The RTO is minute-level.

The disadvantage is that the distance is limited, so the architecture is confined to a single region.

3. Cross-Region Application Dual-Active Disaster Recovery (Pseudo-Cross-Region Dual-Active Disaster Recovery)

This architecture has many similarities with same-region dual-active disaster recovery. The only difference is that reads and writes in the secondary data center are split: read operations are served directly by the secondary data center, while write operations are still performed in the primary data center to ensure data consistency. The two data centers are less than 100 km apart, and the RT is less than 7 ms. If the two data centers are too far apart, request performance will differ greatly between them. This architecture is better suited to systems that read more than they write.

The advantage is that it provides a certain degree of regional-level disaster recovery capability. Although the architecture requires a distance of less than 100 km, that is usually enough to span two prefecture-level cities. The RTO is minute-level.

The disadvantage is that the business system must tolerate a certain amount of cross-data center network latency. In addition, the business needs some transformation, mainly to implement read/write splitting. The disaster recovery distance is still very limited, which is why it is called pseudo-cross-region dual-active disaster recovery.
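The read/write splitting that this architecture requires can be sketched as a small router in Python. The class name, the data center labels, and the operation sets are hypothetical. A real system would usually implement this in its data access layer or a database proxy.

```python
READ_OPERATIONS = {"select"}
WRITE_OPERATIONS = {"insert", "update", "delete"}


class ReadWriteRouter:
    """Send reads to the local data center and writes to the primary."""

    def __init__(self, local_data_center: str, primary_data_center: str) -> None:
        self.local = local_data_center
        self.primary = primary_data_center

    def target_for(self, operation: str) -> str:
        op = operation.lower()
        if op in READ_OPERATIONS:
            return self.local  # reads are served where the request landed
        if op in WRITE_OPERATIONS:
            return self.primary  # writes always go to the primary for consistency
        raise ValueError(f"unknown operation: {operation}")


# A router running inside the secondary data center (labels are hypothetical).
secondary_router = ReadWriteRouter("secondary-dc", "primary-dc")
print(secondary_router.target_for("SELECT"))
print(secondary_router.target_for("update"))
```

Only writes pay the cross-data center latency in this sketch, which is why the architecture favors read-heavy workloads. Reads served by the secondary may briefly lag behind the primary, because synchronization is one-way and not instantaneous.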

4. Cross-Region Dual-Active Disaster Recovery

True cross-region dual-active disaster recovery supports data centers more than 1,000 km apart, with an RT exceeding 10 ms. It uses unitization to eliminate the performance difference between data centers seen in the previous two architectures. Unitization means that after a request lands in a unit, all of its operations are processed in a closed loop within that unit, avoiding cross-data center operations. Therefore, no matter which data center a request lands in, processing efficiency is consistent and the user experience is good. To achieve unitization, data must be synchronized in both directions between data centers.
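One common way to keep a request inside a single unit is to assign each user a home unit and route by that assignment. The following Python sketch assumes sharding by a hash of the user ID, which the text does not specify. The unit names and function names are hypothetical.

```python
import hashlib
from typing import Sequence

UNITS = ("unit-a", "unit-b")  # hypothetical unit names


def unit_for(user_id: str, units: Sequence[str] = UNITS) -> str:
    """Map a user to a home unit with a stable hash (assumed sharding rule)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(units)
    return units[index]


def handle_request(user_id: str, landed_in: str) -> str:
    """Decide at the access layer whether the request stays in this unit."""
    home = unit_for(user_id)
    if landed_in != home:
        # Forward once at the access layer, before any data operation runs.
        return f"forward to {home}"
    # All later operations run in a closed loop inside the home unit.
    return f"process locally in {home}"


home_unit = unit_for("user-42")
print(handle_request("user-42", landed_in=home_unit))
```

Because the mapping is deterministic, the same user always resolves to the same unit, so that user's data operations do not need to cross data centers after the initial routing decision.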

The advantage is that the disaster recovery capability is very strong, with almost no limit on distance. The RTO is at the minute level.

The disadvantage is that deployment is complex because it involves two-way data synchronization for databases, Redis caches, RocketMQ, and other stateful middleware. The business transformation cost is very high and covers areas such as unitization and the access layer.

Elastic Computing Practice in Disaster Recovery

The preceding figure shows the original architecture of the dual-active disaster recovery in the same region.

When a user accesses the domain name, the request is sent to the Internet-facing SLB. SLB provides primary/secondary disaster recovery across two zones, so the request is routed to one of the zones. The SLB in that zone then forwards the request to a specific business server. The business server sends all data operations to the primary data center. Data is synchronized in one direction between the primary and secondary data centers. This single system operates cloud products in all regions.

In this case, the system provides cross-zone disaster recovery. The RPO is less than 100 milliseconds, and the RTO is less than ten minutes. We also designed a policy that prefers RPC calls within the same data center, which improves application performance by avoiding cross-data center RPC calls where possible.
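A same-data-center-first policy of this kind can be sketched in Python as a filter applied before load balancing. The `Endpoint` type, addresses, and zone labels are hypothetical, and the fallback to other zones when no local endpoint is healthy is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Endpoint:
    address: str
    zone: str
    healthy: bool = True


def pick_endpoints(endpoints: List[Endpoint], caller_zone: str) -> List[Endpoint]:
    """Prefer healthy endpoints in the caller's zone; otherwise use any healthy one."""
    healthy = [e for e in endpoints if e.healthy]
    local = [e for e in healthy if e.zone == caller_zone]
    return local if local else healthy


providers = [
    Endpoint("10.0.1.10:8080", "zone-a"),
    Endpoint("10.0.1.11:8080", "zone-a", healthy=False),
    Endpoint("10.0.2.10:8080", "zone-b"),
]

same_zone = pick_endpoints(providers, caller_zone="zone-a")
fallback = pick_endpoints(providers, caller_zone="zone-c")
print([e.address for e in same_zone])
print([e.address for e in fallback])
```

A caller in zone-a only sees the healthy zone-a provider, while a caller with no local provider falls back to every healthy endpoint, so availability is preserved when a zone fails.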

This architecture has three disadvantages. First, it has no regional-level disaster recovery capability. Second, because the system operates cloud products in all regions, a problem in the system affects operations in every region, so the impact is very large. Third, overseas access is slow: the system is deployed in China, so overseas requests must cross borders, and in serious cases overseas users cannot open pages at all.

To address these shortcomings, we introduced the second version of the architecture, whose core is unitization. Unitization here means that all systems that operate a region's resources are deployed in that region. The service in region A does not operate resources in region B. Each region has dual-zone disaster recovery capability.

The only difference between this architecture and the original version is that the system of this version only operates cloud products in this region. In case of a problem, the fault is relatively controllable, which will only affect this region (and no other regions), thus reducing the blast radius of the fault.

Each unit in this version still has cross-zone disaster recovery capability: the RPO is less than 100 milliseconds, the RTO is less than ten minutes, and the policy that prefers RPC calls within the same data center is retained.

This architecture still has no regional-level disaster recovery capability. The user experience is also very poor. For example, when a user operates cloud product resources in the system and switches regions, the entire page must be refreshed. In addition, the problem of slow cross-border access remains.

To solve these shortcomings, we evolved the architecture to the third version. Its core is globalization, which is essentially multi-site high availability. Under unitization, each unit provided services through its own domain name. After globalization, all domain names are unified into a single domain name for external services, and the top-level DNS performs intelligent proximity resolution.

Each region still has zone-level disaster recovery. Multiple regions are deployed worldwide, divided into a central (primary) region and unit regions. All write operations, including those that originate in a unit region, are sent back to the central region. After a write completes in the central region, the data is synchronized one way to each unit region. Two-way synchronization would turn the synchronization topology into a very complex mesh, so the one-way mode is adopted.
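The difference in topology complexity can be made concrete by counting synchronization links. The following Python sketch assumes one link per direction between a pair of regions and a full mesh for the two-way case. The function names are made up for illustration.

```python
def one_way_star_links(regions: int) -> int:
    """Central region pushes to every unit region: one link per unit region."""
    return max(regions - 1, 0)


def two_way_mesh_links(regions: int) -> int:
    """Every region synchronizes to every other region, in both directions."""
    return regions * (regions - 1)


for n in (2, 5, 10):
    print(n, one_way_star_links(n), two_way_mesh_links(n))
```

With 10 regions, the one-way star needs 9 links while a full two-way mesh needs 90, and each additional region widens the gap. The star grows linearly with the number of regions, whereas the mesh grows quadratically.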

This architecture provides region-level disaster recovery and intelligent proximity resolution. This improves user experience. There is no need to switch between domain names repeatedly.

However, because many regions are deployed, data synchronization takes longer: the RPO is less than ten seconds, and the RTO is less than ten minutes. The policy that prefers RPC calls within the same data center is still retained within each regional unit. When a fault occurs, requests are routed to the nearest region.
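Proximity resolution with failover to the nearest region can be approximated by choosing the healthy region with the lowest measured latency. This Python sketch is not how any particular DNS service works internally. The region names and latency figures are invented, and using latency as the measure of proximity is an assumption.

```python
from typing import Dict, Set


def resolve_region(latency_ms: Dict[str, float], healthy: Set[str]) -> str:
    """Return the healthy region closest to the user (lowest latency)."""
    candidates = {region: ms for region, ms in latency_ms.items() if region in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


# Hypothetical latencies measured from one user's location.
user_latency = {"region-a": 20.0, "region-b": 80.0, "region-c": 150.0}

normal = resolve_region(user_latency, healthy={"region-a", "region-b", "region-c"})
after_fault = resolve_region(user_latency, healthy={"region-b", "region-c"})
print(normal, after_fault)
```

In normal operation the user is served by region-a. When region-a is marked unhealthy, the same rule selects region-b, the next nearest region, without the user changing domain names.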

This architecture is complex to deploy and involves data synchronization, so the system needs a certain amount of transformation. For example, write operations must be sent back to the central region, and caches must be updated after data is modified. Because all write operations return to the central region, writes are still protected only by cross-zone disaster recovery, and true cross-region disaster recovery is not achieved for them.

Disaster Recovery Construction on the Cloud

The construction of cloud-based disaster recovery is divided into three stages: analysis, design, and implementation.

In the analysis phase, you need to decide whether the business needs disaster recovery and to what extent. For example, a system in its early stage pays more attention to the number of users. Once the number of users reaches a certain level, you need to care about system stability and consider disaster recovery capability. You also need to assess each business separately (for example, whether it is a core business and what RPO is acceptable).

The design phase will be based on the data derived from the analysis phase.

The implementation phase involves teamwork, resource investment at the organizational level, and, more importantly, a detailed plan for how to recover after a failure. In addition, regular disaster recovery drills and maintenance of the disaster recovery system, including personnel training, all add up to a very large systematic project.

Alibaba Cloud provides multiple cloud products and services to help users complete disaster recovery construction quickly and efficiently.

If the system is not deployed on the cloud, you can use the Server Migration Center to migrate the entire system to the cloud quickly. It supports migration from multiple platforms and environments, does not depend on the underlying environment of the source server, and supports migration without service interruption. All operations can be completed through graphical configuration in the console. Data transmission is secured during the process, and resumable upload and incremental migration are supported.

If the system is already on the cloud, Alibaba Cloud also provides Resource Orchestration Service (ROS). After you determine the filtering conditions, you can copy the system to another region or zone quickly.


After the service is deployed, you can use Data Transmission Service (DTS) for data synchronization or data backup. DTS is very powerful and supports migration between homogeneous or heterogeneous data sources and migration without service interruption. It also supports one-way synchronization and two-way synchronization between data sources.

Multi-Site High Availability (MSHA) supports business transformation and integrates data synchronization products (such as DTS). It helps you quickly build the overall disaster recovery capability of the business, including moving from a single region to multiple regions, from a single cloud to multiple clouds, and from primary/secondary to multi-active disaster recovery.

In addition, MSHA builds on a lot of practical experience across public cloud, private cloud, and hybrid cloud. It provides a console where you can manage disaster recovery and perform switchovers.


Alibaba Cloud DNS performs proximity access based on intelligent DNS resolution. Mainstream DNS resolution services can provide more intelligent resolution lines, such as state level, regional level, country level, etc.

Cloud databases provide zone-level primary/secondary capability out of the box in both the high-availability version of RDS and the dual-zone version of Redis, so users do not need to build it themselves.

In cross-region scenarios, Cloud Enterprise Network (CEN) can connect the VPC-based networks in different regions.

Q&A

Q1: What is the main difference between disaster recovery in available zones and traditional disaster recovery?

A: Traditional disaster recovery refers to same-region disaster recovery, in which the secondary data center does not provide services and mainly acts as a backup. The advantages are that very little business transformation is needed and deployment is simple. The disadvantage is that resources are wasted. Also, because the secondary data center does not normally serve traffic, teams are reluctant to switch traffic to it at critical moments. Combined forms also exist. For example, in the architecture with three data centers across two cities, two zone-level data centers in one city provide services at the same time, while a data center in another city mainly handles disaster recovery and does not normally provide services. This is similar to combining same-region dual-active disaster recovery with primary/secondary disaster recovery across cities.

Q2: How does multi-site high availability ensure data synchronization?

A: The database itself can synchronize data. The cloud also provides related products (such as DTS) that make data synchronization easier. DTS also provides synchronization capabilities for middleware (such as Redis and RocketMQ). Many open-source solutions exist as well, but you must operate and maintain them yourself.

Q3: What is the difference in data synchronization between multi-site high availability and multi-active in the same region?

A: For multi-active in the same region, you can use RDS or Redis on the cloud directly. The high-availability version and the dual-zone version provide zone-level disaster recovery out of the box, so users do not need to synchronize data themselves. However, multi-site high availability and cross-region dual-active disaster recovery span regions, where RDS and Redis do not provide the corresponding capability. Users therefore need to establish synchronization links with DTS or implement data synchronization components themselves. Cloud products provide very rich disaster recovery capabilities, but mainly within a single region.

Q4: In the case of internationalization, do all requests from remote users go back to the primary center, or do only database writes go back to the primary center?

A: "Write" here mainly refers to the data operations carried out by the application. Write operations in all regions are sent to the central region. After the central region writes the data to the database, DTS synchronizes the data to each region. DTS also provides binlog-based messaging: each unit subscribes to the DTS binlog message service of the primary data center so that it knows when a cache entry needs to be discarded or refreshed.
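The cache handling described in this answer can be sketched in Python as a unit-local cache that discards an entry whenever a change event for that row arrives. The event format (a dictionary with `table` and `primary_key` fields) and the class name are assumptions of this sketch and do not reflect the actual DTS message schema or client API.

```python
from typing import Dict, Optional, Tuple

CacheKey = Tuple[str, str]  # (table name, primary key)


class UnitCache:
    """A unit-local cache that is invalidated by change events from the primary."""

    def __init__(self) -> None:
        self._entries: Dict[CacheKey, str] = {}

    def put(self, table: str, primary_key: str, value: str) -> None:
        self._entries[(table, primary_key)] = value

    def get(self, table: str, primary_key: str) -> Optional[str]:
        return self._entries.get((table, primary_key))

    def on_change_event(self, event: Dict[str, str]) -> None:
        """Discard the cached row named by a change event (hypothetical format)."""
        self._entries.pop((event["table"], event["primary_key"]), None)


cache = UnitCache()
cache.put("orders", "1001", "status=paid")
cache.put("orders", "1002", "status=new")

# A write to order 1001 in the central region produces a change event.
cache.on_change_event({"table": "orders", "primary_key": "1001"})

print(cache.get("orders", "1001"))  # stale entry was discarded
print(cache.get("orders", "1002"))  # untouched entry is still cached
```

Discarding rather than updating the entry keeps the subscriber simple: the next read in the unit reloads the row from the unit's synchronized database copy.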

Q5: What kind of system needs to be transformed?

A: Some systems can be unitized, but others (such as an inventory service) cannot, because inventory deduction requires strong global consistency, which makes unitized deployment impossible. Therefore, in the analysis phase, we need to determine whether the system can be transformed for unitization and what RTO and RPO it requires. Different business systems and scenarios call for different disaster recovery architectures, so you need to select one based on the actual business scenario.
