Cross-availability zone disaster recovery and remote multi-active on the cloud

01 System disaster tolerance

Disaster tolerance is inseparable from failure. Common causes of failure include changes (such as releases or configuration changes), hardware failures, power outages, and natural disasters, listed in decreasing order of frequency. Low frequency does not mean low importance, however: failures caused by power outages or natural disasters are often fatal.

On March 10, 2021, a fire broke out at a data center in France operated by OVH, the largest cloud service company in Europe. The data center was completely destroyed, 3.5 million websites went offline, and some customers' data was permanently lost. In his description of the fire on Twitter, the CEO of OVH reminded customers to activate their own disaster recovery plans. Even an application deployed on the cloud cannot avoid utility-level failures such as power and network outages, or failures caused by extreme natural disasters. A corresponding disaster recovery plan is therefore still necessary.

At present, the main disaster tolerance types can be divided into the following three categories:

① Same city (across availability zones): same-city disaster recovery, same-city dual-active, and same-city multi-active.

② Remote (across regions): remote dual-read, remote dual-active, and remote multi-active.

③ Other: two sites with three data centers, two sites with three active data centers, and unitization.

No single disaster tolerance scheme fits every scenario. The most suitable architecture has to be chosen after a comprehensive assessment of the business growth trend, the characteristics of the business system, and the resources and budget that can be invested.

02 Mainstream disaster tolerance architecture

Disaster tolerance capability is measured mainly by two metrics: RPO (recovery point objective) and RTO (recovery time objective).

RPO is the maximum amount of data loss that can be tolerated when a failure occurs. The more important the system, the smaller the required RPO. For data backup, a smaller RPO means more frequent backups. For example, an ordinary system may be backed up once a day, while a very important system may be backed up once an hour. For data synchronization, a smaller RPO requires a more reliable synchronization link or lower synchronization delay, which puts more pressure on the production environment and network and raises the cost.

RTO is the maximum time the application can tolerate between the failure and its recovery. The more important the system, the smaller the required RTO.

The right side of the figure above shows the disaster recovery capability levels defined by the National Information Commission, ranging from 1 to 6. Level 6 is the most stringent: the RTO must be within a few minutes and the RPO must be 0, meaning that no data loss is allowed.
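The relationship between backup frequency and RPO can be illustrated with a minimal Python sketch. The names `RecoveryObjective`, `worst_case_data_loss`, and `meets_objective` are hypothetical, and the sketch assumes purely periodic backups, where a failure just before the next backup loses up to one full interval of data.

```python
from dataclasses import dataclass


@dataclass
class RecoveryObjective:
    """Hypothetical container for a system's recovery targets."""
    rpo_seconds: float  # maximum tolerable data loss, expressed as time
    rto_seconds: float  # maximum tolerable time from failure to recovery


def worst_case_data_loss(backup_interval_seconds: float) -> float:
    # Assumption: periodic backups only. A failure right before the next
    # backup loses everything written since the previous one.
    return backup_interval_seconds


def meets_objective(objective: RecoveryObjective,
                    backup_interval_seconds: float,
                    estimated_recovery_seconds: float) -> bool:
    """Return True if both the RPO and the RTO targets are satisfied."""
    rpo_ok = worst_case_data_loss(backup_interval_seconds) <= objective.rpo_seconds
    rto_ok = estimated_recovery_seconds <= objective.rto_seconds
    return rpo_ok and rto_ok


# Example: a system that tolerates at most 1 hour of data loss
# and 30 minutes of downtime.
important = RecoveryObjective(rpo_seconds=3600, rto_seconds=1800)
daily_backup_ok = meets_objective(important, 24 * 3600, 600)
hourly_backup_ok = meets_objective(important, 3600, 600)
```

Under these assumptions a daily backup misses the one-hour RPO target while an hourly backup meets it, which matches the example in the text.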

The above figure shows the comparison of the current four mainstream disaster recovery architectures.

1. Local disaster recovery

At least two data centers are deployed in the same city. The standby data center normally serves no traffic and acts mainly as a backup for the primary data center. Data is synchronized one way, from the primary to the standby.

The advantage is simple deployment: the same architecture can be copied in full to another data center, data is synchronized in one direction, and very little business transformation is needed.

The disadvantages are that the standby data center's resources are wasted; that teams often hesitate to switch traffic at the critical moment, because versions, parameters, and operating systems easily drift out of sync between the two data centers; and that the RTO is on the order of ten minutes.

To fail over under the local disaster recovery architecture, the database must first be switched from primary to standby. If the standby is a cold standby, the application services must also be started, and the top-level DNS resolution must be switched as well. The whole process takes more than ten minutes.

2. Same-city dual-active

Both data centers serve external traffic at the same time. To keep data consistent, every data-layer operation in the standby data center is sent back to the primary data center. The two data centers therefore need to be less than 50 km apart, with a network latency (RT) under 2 ms. A request that lands in the standby data center incurs cross-data-center operations; if the cross-data-center RT is large, request performance will differ noticeably between the two data centers and the user experience will suffer. Data in this architecture is synchronized one way.

The advantage is that the standby data center's resources are no longer wasted. Because both data centers serve traffic every day, traffic can be switched at any time when a failure occurs. Only the database primary/standby switchover is needed, so the RTO is at the minute level.

The disadvantage is that the architecture is confined to a single city, so the disaster tolerance distance is limited.

3. Remote application dual-active (pseudo remote dual-active)

This architecture is very similar to same-city dual-active. The only difference is that reads and writes in the standby data center are separated: reads go directly to the standby data center's database, while writes are sent to the database in the primary data center to keep data consistent. The two data centers need to be less than 100 km apart, with an RT under 7 ms. If they are too far apart, request performance will differ greatly between them. This architecture suits read-heavy, write-light systems.

The advantage is a degree of region-level disaster tolerance. Although the architecture requires a distance of less than 100 km, that is usually enough to span two prefecture-level cities, and the RTO is only a matter of minutes.

The disadvantages are that the business system must tolerate some cross-data-center network latency; that the business needs some transformation, mainly to separate reads from writes; and that the disaster tolerance distance is still quite limited, which is why it is called "pseudo remote dual-active".
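The read/write separation described above can be sketched as a small routing function. This is an illustration only: the data center names and the `choose_database` helper are hypothetical, not part of any product API.

```python
PRIMARY_DC = "primary-dc"  # hypothetical name of the primary data center


def choose_database(operation: str, local_dc: str) -> str:
    """Pick the data center whose database should handle the operation.

    Reads are served by the database in the data center that received
    the request; writes always go to the primary data center so that
    there is a single source of truth.
    """
    if operation == "read":
        return local_dc
    if operation == "write":
        return PRIMARY_DC
    raise ValueError(f"unknown operation: {operation!r}")


# A request that lands in the standby data center:
read_target = choose_database("read", "standby-dc")    # stays local
write_target = choose_database("write", "standby-dc")  # crosses to primary
```

Only the write path pays the cross-data-center latency, which is why the approach favors read-heavy workloads.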

4. Remote dual-active

In true remote dual-active, the two data centers can be more than 1000 km apart and the RT can exceed 10 ms. To remove the performance gap between data centers seen in the previous architectures, a unitized solution is used. Unitization means that once a request lands in a unit, all of its operations are processed in a closed loop inside that unit, avoiding cross-data-center operations. Whichever data center receives the request, processing efficiency is therefore basically the same and the user experience stays good. Unitization requires data to be synchronized between the data centers.

The advantage is very strong disaster tolerance, with almost no limit on distance, and an RTO at the minute level.

The disadvantages are that deployment is complex, because data must be synchronized in both directions, not only for databases but also for the Redis cache, RocketMQ, and other stateful middleware; and that the cost of business transformation is very high, covering both unit partitioning and the access layer.
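One common way to keep a request inside a single unit is to map each user to a unit with a stable hash at the access layer. The sketch below assumes routing by user ID; the unit names and the `unit_for_user` function are hypothetical, and the source does not state which routing key is actually used.

```python
import hashlib

UNITS = ["unit-a", "unit-b"]  # hypothetical unit (data center) names


def unit_for_user(user_id: str, units=UNITS) -> str:
    """Map a user to a unit deterministically.

    A cryptographic hash is used instead of Python's built-in hash(),
    which is salted per process and would route the same user to
    different units after a restart.
    """
    if not units:
        raise ValueError("at least one unit is required")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(units)
    return units[index]


# Every request from the same user resolves to the same unit, so its
# reads and writes can be handled in a closed loop inside that unit.
first = unit_for_user("user-42")
second = unit_for_user("user-42")
```

A simple modulo mapping like this reshuffles users when the number of units changes; a production design would need a migration or consistent-hashing strategy, which is outside the scope of this sketch.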

03 Practice of elastic computing in disaster recovery

The figure above shows the original same-city dual-active architecture.

When a user accesses the domain name, the request is sent to a public SLB. The SLB has active/standby disaster tolerance across two availability zones and routes the request to one of them. The SLB in that availability zone forwards the request to a specific business server. The business servers send all data operations to the primary data center, and data is synchronized one way from primary to standby. The system operates cloud products in all regions.

At this point the system has disaster tolerance across availability zones, with an RPO of less than 100 ms and an RTO of less than 10 minutes. To improve application performance and avoid cross-data-center RPC calls as far as possible, we also designed a policy that prefers RPC calls within the same data center.

This architecture has three disadvantages. It has no region-level disaster tolerance. The system operates cloud products in all regions, so a problem in the system affects cloud product operations in every region. And access is slow for overseas users: because the system is deployed in China, their requests cross the border, and in serious cases the page cannot be opened at all.
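The same-data-center-first RPC policy can be sketched as a simple endpoint selection step. The `Endpoint` type and `pick_endpoint` function are hypothetical and stand in for whatever the RPC framework's load balancer actually provides.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    address: str
    zone: str  # availability zone (data center) hosting the provider


def pick_endpoint(endpoints, caller_zone: str, rng=random) -> Endpoint:
    """Prefer providers in the caller's zone; fall back to any zone.

    The fallback keeps calls working when the local zone has no healthy
    provider, at the cost of a cross-data-center round trip.
    """
    if not endpoints:
        raise ValueError("no endpoints available")
    local = [e for e in endpoints if e.zone == caller_zone]
    candidates = local or list(endpoints)
    return rng.choice(candidates)


providers = [
    Endpoint("10.0.0.1:8080", "zone-a"),
    Endpoint("10.0.0.2:8080", "zone-a"),
    Endpoint("10.1.0.1:8080", "zone-b"),
]
chosen = pick_endpoint(providers, caller_zone="zone-a")  # always a zone-a provider
```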

To address these shortcomings, we introduced the second version of the architecture, whose core is unitization. Unitization here means that all of the systems that operate a region's resources are deployed inside that region, so services in region A never operate on resources in region B. Within each region, dual-availability-zone disaster tolerance is still in place.

Compared with the original version, the only difference is that this version operates only the cloud products of its own region. When a problem occurs, the fault scope is more controllable: it affects only that region and no others, which reduces the blast radius of the failure.

The unitized version still provides disaster tolerance across availability zones, with an RPO of less than 100 ms and an RTO of less than 10 minutes. It also keeps the policy of preferring RPC calls within the same data center.

Its disadvantages are as follows. It still has no region-level disaster tolerance. The user experience is poor: for example, when operating cloud product resources in the system, switching regions requires refreshing the whole page. Slow cross-border access also remains a problem.

To solve these problems, we evolved the architecture to a third version. Its core is globalization, which is essentially a remote multi-active architecture. In the unitized version, each unit exposed its own domain name; after globalization, all of them are unified under a single domain name, and the top-level DNS uses intelligent resolution to route users to the nearest region.

Each region still has disaster tolerance across availability zones. Although many regions are deployed worldwide, they are divided into a main (central) region and unit regions. All write operations, including those received by unit regions, are sent back to the main region. The main region's data is then synchronized one way to each unit region. Two-way synchronization would turn the synchronization topology into a very complex mesh, which is why one-way synchronization is used.

This architecture provides region-level disaster tolerance and intelligent nearby resolution, and it offers a better user experience because users no longer have to jump between domain names.

Because many regions are deployed, data synchronization takes longer, so the architecture can only guarantee an RPO of less than 10 seconds and an RTO of less than 10 minutes. The same-data-center RPC priority policy is still kept within each regional unit. If a region fails, requests are routed to another region.

This architecture is complex to deploy and involves data synchronization, so the system needs some modification. For example, write operations must be sent back to the main region, and caches must be updated after data is modified. In addition, because all writes flow back to the main region, write operations are still protected only across availability zones and do not truly achieve cross-region disaster tolerance.
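The topology argument can be made concrete by counting synchronization links. The sketch below assumes a full mesh in which every pair of regions synchronizes in both directions, compared with a star in which the main region pushes one way to each unit region; the function names are hypothetical.

```python
def mesh_links(regions: int) -> int:
    """Directed sync links if every region syncs to every other region."""
    if regions < 0:
        raise ValueError("regions must be non-negative")
    return regions * (regions - 1)


def star_links(regions: int) -> int:
    """Directed sync links if one main region syncs one way to the rest."""
    if regions < 0:
        raise ValueError("regions must be non-negative")
    return max(regions - 1, 0)


# With 5 regions: 20 links in a two-way full mesh, 4 in a one-way star.
five_region_mesh = mesh_links(5)
five_region_star = star_links(5)
```

The mesh grows quadratically with the number of regions while the star grows linearly, and the star has no conflicting writes to reconcile because only the main region accepts them.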

04 On-cloud disaster recovery construction

The construction of disaster recovery on cloud is mainly divided into three stages, namely, the analysis stage, the design stage and the implementation stage.

In the analysis stage, decide whether the business needs disaster recovery and to what extent. For example, early in a system's life, growing the user base deserves more attention. Once the user base reaches a certain scale, stability becomes the concern, and only then does disaster tolerance need to be considered. When planning disaster recovery, also review each business function separately: whether it is core business, and what RPO is acceptable.

In the design stage, the disaster recovery scheme is designed based on the information gathered in the analysis stage.

The implementation stage involves team cooperation and resource investment at the organizational level and, more importantly, a detailed recovery plan for what to do after a real failure. Regular disaster recovery drills are also necessary. Maintaining a disaster recovery system, including personnel training, is a large systematic undertaking.

Alibaba Cloud provides many cloud products and services that help users build disaster recovery efficiently and quickly.

If the system has not yet been deployed on the cloud, the Server Migration Center can quickly migrate the whole system to the cloud. It supports migration from multiple platforms and environments, does not depend on the underlying environment of the source server, and supports migration without downtime. All operations can be completed through graphical configuration in the console. Data transmission is secured during the process, and resumable transfer and incremental migration are supported.

If the system is already on the cloud, Alibaba Cloud also provides resource orchestration capabilities. After you define the filter conditions, you can quickly replicate the system to another region or availability zone.

After the services are deployed, you can use DTS for data synchronization or data backup. DTS supports migration between homogeneous or heterogeneous data sources, migration without downtime, and both one-way and two-way synchronization between data sources.

The multi-active disaster tolerance service MSHA supports business transformation and integrates DTS and other data synchronization products. It can quickly build end-to-end disaster tolerance for a business, including moving from a single region to multiple regions, from a single cloud to multiple clouds, and from active-standby to multi-active architectures.

MSHA also draws on extensive practical experience across public cloud, private cloud, and hybrid cloud, and it provides a console in which disaster recovery management and switchover can be completed.

Alibaba Cloud DNS provides intelligent DNS resolution for nearby access. Mainstream DNS resolution services today offer fine-grained resolution lines, for example at the state, regional, or national level.

For cloud databases, both the high-availability edition of RDS and the dual-zone edition of Redis provide active-standby capability across availability zones, so users do not need to build it themselves.

For remote (cross-region) scenarios, Cloud Enterprise Network can connect networks in different regions that are built on different VPCs.
