Best Practices for Cross-Availability-Zone Disaster Recovery and Remote Multi-Active Deployment on the Cloud

On July 4, 2022, the first installment of the CloudOps series salon on automated cloud operations and maintenance, "Observable, Reliable", officially launched. Four topics were presented over four consecutive days. The final speaker was Deng Qinglin, an Alibaba Cloud elastic computing technology expert, whose talk was titled "Cross-availability-zone disaster recovery and remote multi-active on the cloud". The following is a summary of his talk.

01 System disaster recovery

Disaster recovery is inseparable from failures. Common failure types include change-induced faults, hardware faults, power outages, and natural disasters, in decreasing order of frequency. Low frequency does not mean low importance, however: failures caused by power outages or natural disasters are often fatal.

On March 10, 2021, a fire broke out at a data center in France operated by OVH, the largest cloud service provider in Europe. The data center was destroyed, 3.5 million websites went offline, and some customers' data was permanently lost. In his description of the fire on Twitter, OVH's CEO urged customers to activate their own disaster recovery plans. Even applications deployed on the cloud cannot escape utility-level failures such as power outages, or failures caused by extreme natural disasters, so a corresponding disaster recovery plan must be prepared.

At present, the main types of disaster recovery can be divided into the following three categories:

① Same-city (cross-zone): same-city disaster recovery, same-city active-active, and same-city multi-active.

② Remote (cross-region): remote dual-read, remote application active-active, and remote active-active.

③ Others: two sites with three data centers, two sites with three active data centers, and unitization.

No single disaster recovery scheme fits every scenario. The most appropriate architecture has to be chosen by weighing the business's growth trend, the characteristics of the business system, and the resources that can be invested.

02 Mainstream disaster recovery architecture

Disaster recovery capability is measured mainly by two metrics: RPO (recovery point objective) and RTO (recovery time objective).

RPO is the maximum amount of data loss that can be tolerated when a failure occurs. The more important the system, the smaller the RPO must be. With backups, a smaller RPO means more frequent backups: an ordinary system might be backed up once a day, a very important one once an hour. With data synchronization, a smaller RPO requires a more reliable or lower-latency synchronization link, which puts more pressure on the production environment and the network and raises the cost.

RTO is the maximum acceptable time from the occurrence of a failure to the recovery of the application. The more important the system, the smaller the RTO must be.
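As a rough illustration of the backup case, the worst-case data loss under periodic backups is about one backup interval, so a backup schedule can meet an RPO target only if its interval does not exceed that target. The sketch below assumes this simplification (it ignores how long a backup takes to complete), and the function names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """Upper bound on lost data with periodic backups.

    Assumption: a failure just before the next backup loses everything
    written since the previous one, i.e. up to one full interval.
    """
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo_target: timedelta) -> bool:
    """True if the backup schedule can satisfy the RPO target."""
    return worst_case_data_loss(backup_interval) <= rpo_target


daily = timedelta(days=1)
hourly = timedelta(hours=1)

print(meets_rpo(daily, rpo_target=timedelta(hours=1)))   # False
print(meets_rpo(hourly, rpo_target=timedelta(hours=1)))  # True
```

An RPO close to zero cannot be reached by shortening backup intervals alone, which is why the stricter levels rely on continuous data synchronization instead.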

The right side of the above figure shows the disaster recovery capability levels defined by the National Information Commission. There are six levels, 1 through 6, with 6 the most stringent: its RTO requirement is a few minutes and its RPO requirement is 0, meaning that no system data may be lost.

The above figure shows the comparison of four mainstream disaster recovery architectures.

1. Same-city disaster recovery

At least two data centers are deployed in the same city. The standby data center does not serve traffic in normal operation and mainly acts as a backup for the primary data center. Data is synchronized one-way from the primary to the standby.

The advantage is that deployment is simple. The same architecture can be copied in full to another data center, data is synchronized one-way, and little business modification is required.

The disadvantages are that the standby data center's resources sit idle; that operators hesitate to switch traffic at critical moments, because the standby does not normally serve traffic and its versions, parameters, operating systems, and so on easily drift out of sync with the primary; and that the RTO is on the order of ten minutes.

When the same-city disaster recovery architecture fails over, the primary and standby databases must be switched first. If the standby is a cold standby, the application services also need to be started, and then the top-level DNS resolution must be switched. The whole process takes more than ten minutes.
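The failover sequence just described can be written down as an ordered runbook. This is only a minimal sketch: the function name and the step wording are hypothetical, and a real runbook would carry commands, owners, and verification checks for each step.

```python
from typing import List


def failover_steps(cold_standby: bool) -> List[str]:
    """Return the failover steps in the order they must be executed."""
    # The database switch always comes first.
    steps = ["switch the primary and standby databases"]
    # A cold standby has no running application services yet.
    if cold_standby:
        steps.append("start application services in the standby data center")
    # Traffic is redirected last, once the standby can actually serve it.
    steps.append("switch top-level DNS resolution to the standby data center")
    return steps


for number, step in enumerate(failover_steps(cold_standby=True), start=1):
    print(number, step)
```

Keeping the order explicit matters because redirecting DNS before the database and services are ready would send users to a data center that cannot serve them.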

2. Same-city active-active

Both data centers serve external traffic at the same time. To keep data consistent, every operation in the standby data center that touches the data layer is sent back to the primary data center. The two data centers therefore need to be less than 50 km apart, with a network latency (RT) below 2 ms. A request that lands in the standby data center involves cross-data-center operations; if the cross-data-center RT is large, data requests perform very differently in the two data centers and the user experience suffers. Data in this architecture is synchronized one-way.

The advantage is that it eliminates the resource waste in the standby data center, and because the standby serves traffic day to day, it can take over at any time when a failure occurs. Only the primary and standby databases need to be switched, so the RTO is at the minute level.

The disadvantage is that the deployment is confined to a single city, so the disaster recovery distance is limited.

3. Remote application active-active (pseudo remote active-active)

This architecture is very similar to same-city active-active. The only difference is that reads and writes in the standby data center are separated: reads are served directly from the standby data center's database, while writes are sent to the database in the primary data center to keep data consistent. The two data centers need to be less than 100 km apart, with an RT below 7 ms. If they are too far apart, request performance will differ greatly between the two data centers. This architecture suits systems with many reads and few writes.
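The read/write split described above can be sketched as a small routing rule. The data center names and the `Request` type are hypothetical, introduced only for illustration; in practice this decision usually lives in a data access layer or database proxy.

```python
from dataclasses import dataclass

PRIMARY = "dc-primary"  # hypothetical name of the primary data center
STANDBY = "dc-standby"  # hypothetical name of the standby data center


@dataclass(frozen=True)
class Request:
    operation: str   # "read" or "write"
    arrived_at: str  # data center that received the request


def target_database(request: Request) -> str:
    """Pick the data center whose database should handle the request."""
    if request.operation == "write":
        # All writes go to the primary to keep data consistent.
        return PRIMARY
    if request.operation == "read":
        # Reads are served locally, avoiding the cross-data-center hop.
        return request.arrived_at
    raise ValueError(f"unknown operation: {request.operation!r}")


print(target_database(Request("read", STANDBY)))   # dc-standby
print(target_database(Request("write", STANDBY)))  # dc-primary
```

Because data is synchronized one-way, a local read in the standby data center may briefly lag behind a write that has just been sent to the primary, which the business logic has to tolerate.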

The advantage is a degree of region-level disaster tolerance: although the architecture requires the distance to stay under 100 km, that is usually enough to span two prefecture-level cities. The RTO is at the minute level.

The disadvantages are that the business system must tolerate some cross-data-center network latency; that the business needs some modification, mainly for read/write separation; and that the disaster recovery distance is still quite limited, which is why it is called "pseudo remote active-active".

4. Remote active-active

True remote active-active can tolerate data centers that are more than 1000 km apart with an RT above 10 ms. To remove the performance difference between the two data centers seen in the previous architectures, a unitized design is used. Unitization means that once a request lands in a unit, all of its operations are handled in a closed loop inside that unit, with no cross-data-center operations. Whichever data center receives the request, processing efficiency is therefore basically the same and the user experience stays good. Unitization requires data to be synchronized bidirectionally between the data centers.
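One common way to keep requests inside a unit is to assign each user to a unit deterministically at the access layer. The talk summary does not say how requests are assigned, so sharding by user ID, the unit names, and the function below are all assumptions made for illustration.

```python
import hashlib
from typing import Sequence

UNITS = ["unit-a", "unit-b"]  # hypothetical unit names


def unit_for(user_id: str, units: Sequence[str] = UNITS) -> str:
    """Map a user ID to a unit with a stable hash.

    hashlib is used instead of the built-in hash(), whose string hashes
    are randomized per process and would route a user inconsistently.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(units)
    return units[index]


# The same user always lands in the same unit, so that user's data can be
# read and written inside one unit without cross-data-center calls.
print(unit_for("user-42") == unit_for("user-42"))  # True
```

Plain modulo hashing reassigns many users when the number of units changes, so a production routing table would typically be explicit or use a scheme that limits such movement.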

The advantage is very strong disaster recovery capability with almost no distance limit, and an RTO at the minute level.

The disadvantages are that deployment is complex, because bidirectional synchronization is needed not only for databases but also for stateful middleware such as Redis caches and RocketMQ; and that the cost of business transformation is very high, touching areas such as unit partitioning and the access layer.

03 Practice of elastic computing in disaster recovery

The above figure shows the original architecture, a same-city active-active deployment.

When a user accesses the domain name, the request goes to the public SLB. The SLB has primary/standby disaster recovery across two zones and routes the request to one of them. The SLB in that zone then forwards the request to a specific business server. Business servers send all data operations to the primary data center, and data is synchronized one-way from the primary to the standby. At the bottom layer, the system operates cloud products in all regions.

At this stage, the system has cross-zone disaster recovery capability, with an RPO of less than 100 ms and an RTO of less than 10 min. To improve application performance and avoid cross-data-center RPC calls as far as possible, we also designed a strategy that gives priority to RPC calls within the same data center.
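A same-data-center-first RPC strategy can be sketched as an endpoint selection rule: prefer healthy providers in the caller's zone and fall back to other zones only when none is available. The `Endpoint` type, the zone names, and the random choice among candidates are assumptions for illustration, not the team's actual implementation.

```python
import random
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Endpoint:
    address: str
    zone: str
    healthy: bool = True


def pick_endpoint(endpoints: Sequence[Endpoint], caller_zone: str) -> Endpoint:
    """Choose an RPC provider, preferring the caller's own zone."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        raise RuntimeError("no healthy endpoint available")
    same_zone = [e for e in healthy if e.zone == caller_zone]
    # Fall back to any healthy endpoint only if the local zone has none.
    return random.choice(same_zone or healthy)


providers = [
    Endpoint("10.0.0.1:8080", zone="zone-a"),
    Endpoint("10.0.1.1:8080", zone="zone-b"),
]
print(pick_endpoint(providers, caller_zone="zone-a").zone)  # zone-a
```

The fallback branch is what preserves cross-zone disaster recovery: when every provider in the local zone is down, calls still succeed at the cost of higher latency.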

This architecture has three disadvantages. First, it has no region-level disaster recovery capability. Second, the system operates cloud products in all regions, so a fault in the system affects cloud product operations in every region and the impact is very large. Third, access is slow for overseas users: the system is deployed in China, so their requests cross the border, and in serious cases the page cannot be opened at all.

To address these shortcomings, we introduced the second version of the architecture, whose core is unitization. Here unitization means that the system operating a region is deployed entirely within that region, and the services of region A never operate on the resources of region B. Within each region, dual-zone disaster recovery is retained.

The only difference from the original version is that each deployment operates only the cloud products of its own region. When a problem occurs, the fault is contained: it affects only that region and not the others, which reduces the blast radius.

The unitized version still has cross-zone disaster recovery capability, with an RPO of less than 100 ms and an RTO of less than 10 min. The policy of giving priority to RPC calls within the same data center is retained.

Its shortcomings are as follows. It still lacks region-level disaster recovery capability. The user experience is poor: when operating cloud product resources, for example, switching regions requires the entire page to be refreshed. And the problem of slow cross-border access remains.

To solve these problems, we evolved the architecture to a third version. Its core is globalization, which is essentially a remote multi-active architecture. Under unitization, each unit exposed its own domain name; after globalization, all of them are unified into a single domain name, and the top-level DNS performs intelligent nearest-location resolution.

Each region still has zone-level disaster recovery. The system is deployed in multiple regions worldwide, divided into a central (primary) region and unit regions. All data writes go to the central region: a write that arrives in a unit region is forwarded to the central region, and the central data is then synchronized one-way to each unit region. Bidirectional synchronization would turn the synchronization topology into a very complex mesh, so one-way synchronization is used.
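The topology argument can be made concrete by counting synchronization links. The sketch below compares a star, in which the central region pushes to every unit region, with a full mesh, in which every region synchronizes to every other region. Treating the bidirectional case as a full mesh is an assumption; the source only says the topology would become a very complex mesh.

```python
def one_way_star_links(regions: int) -> int:
    """Central region pushes to each unit region: n - 1 directed links."""
    return max(regions - 1, 0)


def full_mesh_links(regions: int) -> int:
    """Every region syncs to every other region: n * (n - 1) directed links."""
    return regions * (regions - 1)


for n in (2, 5, 10):
    print(n, one_way_star_links(n), full_mesh_links(n))
# 2 1 2
# 5 4 20
# 10 9 90
```

The star grows linearly with the number of regions while the mesh grows quadratically, and each mesh link would also need conflict handling for concurrent writes.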

This architecture has region-level disaster recovery capability and provides intelligent nearest-location resolution. The user experience is better because users no longer need to jump between different domain names.

Because many regions are deployed, data synchronization takes longer, so the architecture can only guarantee an RPO of less than 10 seconds and an RTO of less than 10 minutes. Each regional unit keeps the policy of giving priority to RPC calls within the same data center. When a region fails, requests are routed to another region.

This architecture is complex to deploy and involves data synchronization, so the system needs some modification: write operations must be sent back to the central region, and caches must be refreshed after data is modified. In addition, because all writes flow back to the central region, write operations still have only cross-zone disaster recovery, and true cross-region disaster recovery is not achieved for them.

04 Building disaster recovery on the cloud

Building disaster recovery on the cloud can be divided into three phases: analysis, design, and implementation.

In the analysis phase, you need to decide whether the business needs disaster recovery and to what extent. In the early stage of a system, for example, the focus is on the number of users; once the user base reaches a certain size, system stability becomes a concern; and only after that does disaster tolerance come into consideration. The analysis should also go through the business systems one by one, identifying which are core businesses and what RPO each can accept.

The design phase will be based on the data obtained in the analysis phase.

The implementation phase involves team cooperation and resource investment at the organizational level and, more importantly, a detailed plan describing how to recover after a real failure. Regular disaster recovery drills and maintenance of the disaster recovery system, including personnel training, are also required.

Alibaba Cloud provides a lot of cloud products and services to help users complete disaster recovery efficiently and quickly.

If the system is not yet on the cloud, Server Migration Center (SMC) can migrate the entire system quickly. It supports migration from multiple platforms and environments, does not depend on the underlying environment of the source server, and supports migration without downtime. All operations can be completed through graphical configuration in the console. Data in transit is adequately secured, and resumable transfer and incremental migration are supported.

If the system is already on the cloud, Alibaba Cloud also provides resource orchestration. After you define the filter conditions, you can quickly replicate the system to another region or zone.

After the services are deployed, Data Transmission Service (DTS) can be used for data synchronization or data backup. It supports migration between homogeneous or heterogeneous data sources, migration without downtime, and both one-way and two-way synchronization between data sources.

Multi-Site High Availability (MSHA) helps with the business transformation and integrates DTS and other data synchronization products internally. It can quickly build end-to-end disaster recovery capability for a business, covering the move from a single region to multiple regions, from a single cloud to multiple clouds, and from primary/standby to a multi-active disaster recovery architecture.

MSHA also draws on extensive practical experience across public cloud, private cloud, and hybrid cloud, and provides a console in which disaster recovery management and switchover can be carried out.

Alibaba Cloud DNS uses intelligent DNS resolution to direct users to the nearest access point. Mainstream DNS resolution services now offer fine-grained resolution lines, for example at the state, region, and country levels.

ApsaraDB provides zone-level primary/standby capability in both the high-availability edition of RDS and the dual-zone edition of Redis, so users do not need to build it themselves.

In remote scenarios, Cloud Enterprise Network (CEN) can connect VPC-based networks across different regions.

Q&A: audience questions

Q1 What is the main difference between zone-level disaster recovery and traditional disaster recovery?

A: Traditional disaster recovery refers to same-city disaster recovery, in which the standby data center normally does not serve external traffic and mainly acts as a backup. Its advantages are very little business modification and simple deployment; its disadvantages are wasted resources and reluctance to switch traffic at critical moments, because the standby does not normally serve traffic. It can also be combined with same-city active-active. In the "two sites, three data centers" pattern, for example, one city has two zone-level data centers that both serve external traffic, while a data center in another city is used mainly for disaster recovery and normally does not serve traffic. This amounts to combining same-city active-active with disaster recovery across cities.

Q2 How is data synchronization ensured for remote multi-active?

A: Databases have their own synchronization capability. The cloud also provides products such as DTS that make synchronization easier, and DTS offers synchronization for middleware such as Redis and RocketMQ as well. There are also many open source solutions, but they require you to handle operations and maintenance yourself.

Q3 What is the difference between data synchronization for remote multi-active and for same-city multi-active?

A: For same-city multi-active, you can use RDS or Redis on the cloud directly: the high-availability edition and the dual-zone edition provide same-city disaster tolerance out of the box, so users do not need to build data synchronization themselves. Remote multi-active and active-active span regions, which RDS and Redis do not cover on their own, so users need to build synchronization links with a data transmission service such as DTS or implement their own synchronization components. Cloud products offer rich disaster recovery capabilities, but mainly within a region.

Q4 In the internationalized scenario, do all requests from users in different locations go back to the central region, or are only DB writes sent back to the central region?

A: What goes back to the central region is the application's data writes, not all user access. Write operations in every region are forwarded to the central region. After the central region writes to the database, the data is synchronized to each region through DTS. DTS also provides binlog messages: each unit subscribes to the DTS binlog message service of the central data center to learn when its caches need to be discarded or refreshed.
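The cache handling in this answer can be sketched as a handler that evicts a cached row whenever a change event for it arrives. The event shape (`table`, `primary_key`) and the `UnitCache` class are hypothetical and do not reflect the actual DTS message format or SDK; the sketch shows only the invalidate-on-change pattern.

```python
from typing import Any, Dict, Optional


class UnitCache:
    """In-memory stand-in for a unit region's local cache."""

    def __init__(self) -> None:
        self._entries: Dict[str, Any] = {}

    @staticmethod
    def _key(table: str, primary_key: Any) -> str:
        return f"{table}:{primary_key}"

    def put(self, table: str, primary_key: Any, row: Any) -> None:
        self._entries[self._key(table, primary_key)] = row

    def get(self, table: str, primary_key: Any) -> Optional[Any]:
        return self._entries.get(self._key(table, primary_key))

    def on_change_event(self, event: Dict[str, Any]) -> None:
        """Evict the cached row named by a change event from the center.

        Evicting (rather than updating in place) means the next read
        reloads the row from the unit's synchronized database.
        """
        self._entries.pop(self._key(event["table"], event["primary_key"]), None)


cache = UnitCache()
cache.put("orders", 1001, {"status": "created"})
cache.on_change_event({"table": "orders", "primary_key": 1001})
print(cache.get("orders", 1001))  # None
```

Events for rows that are not cached are ignored, so the handler is safe to call for every change message the unit receives.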

Q5 What systems need to be modified?

A: Some systems can be unitized, but others cannot. An inventory service, for example, cannot be deployed in units because inventory deduction requires strong global consistency. In the analysis phase, therefore, we need to judge whether a system can accept unitization and what RTO and RPO it requires. Different business systems and scenarios call for different disaster recovery architectures, which must be chosen according to the actual business scenario.
