The Enterprise Multi-Active Disaster Recovery System: Construction Ideas and Best Practices in the Cloud-Native Era

Cloud-native is often explained in terms of microservices and containers. So, what is the relationship between these technologies and enterprise disaster recovery? Disaster recovery is required in every industry, yet how to build disaster recovery and multi-active business is something each enterprise has to think through and learn for itself. I hope this article provides some useful ideas.

Functional Evolution of a Disaster Recovery System

The multi-active business we are discussing today is a part of the disaster recovery system. Let's take a look at the evolution of the entire disaster recovery architecture:

Disaster Recovery 1.0: When the original application systems were built, the business system was deployed in a data center on a traditional architecture. What emergency measures or troubleshooting methods were available? At this stage, only data backup was taken into account, mainly in the form of cold backup. In addition to the data center that serves the business, an additional data center may be kept for disaster scenarios: data is synchronized to it for cold backup, and the business switches over when problems occur. In practice, however, a data center switchover is rarely chosen. Even in the financial industry, which runs annual disaster recovery drills, the personnel involved are often too afraid to switch when problems actually occur in production.
Disaster Recovery 2.0: More consideration is given to the application. For example, whether the application is cloud-native or sits on top of a traditional IOE system, switching is not just a matter of cutting over and loading the original cold-backup data. Instead, the goal is to bring up the application in another data center very quickly during the cutover. We usually require the active-active mode to keep replication at the data layer from lagging. In general, however, active-active deployments come with requirements. For example, active-active backup in the same zone only works within a certain distance. The active-active mode is more often applied in AQ mode, in which the production data center handles the full service and the other data centers handle other services.
Disaster Recovery 3.0: Alibaba Cloud aims for active geo-redundancy. The word active means the architecture is not limited to two data centers; it can use three or more. For example, Alibaba's business is distributed across multiple data centers, all of which must serve external traffic at the same time. Active geo-redundancy is also not limited by distance, such as 200 kilometers or the same zone, since today's data centers are deployed throughout China.

An Overview of Business Continuity Management and Disaster Recovery

For business continuity management (BCM), the standards and guidance accumulated over years of building disaster recovery systems form a systematic approach, which covers the following dimensions:

Business (B): Compared with the original disaster recovery solutions, multi-active business does not simply deploy a peer copy of every service in another data center. Instead, it selects the valuable services, because making all services multi-active is difficult in terms of both cost and technology.
Continuity (C): Real-time service operation should be guaranteed. The core business must not stop serving for reasons such as a data center power failure.
Management (M): This represents the security system. Industries differ in their means and ways of management, but Alibaba offers ways to turn this part into technologies, tools, and products, so that others can use them to build multi-active business quickly in the future.

The BCM system and IT disaster recovery are guiding frameworks based on practice. Viewed as a whole, business continuity is the goal at the top, and below it are the various implementation methods. At the bottom are the IT plan and the plans for handling specific faults when business continuity is disrupted. In the past, these requirements were considered as part of disaster recovery. Now, we consider them in the product system from the perspective of multi-active business.

Several common disaster recovery methods are worth mentioning. They range from cold backup, to active-active backup in the same zone, to active-active backup in the same zone plus geo-cold backup (three data centers across two zones). These are relatively standard methods in the industry. Active geo-redundancy is similar to three data centers across two zones, but all the data centers provide multi-active capability at the same time, which is what sets it apart from traditional disaster recovery. Construction costs also differ: active geo-redundancy requires more investment than traditional architectures, such as active-active backup in the same zone and three data centers across two zones.

When building multi-active capability, we also consider the business situation. For instance, some industries or multi-active businesses only require reads on both sides, and the construction cost and service switching time differ accordingly. Along the timeline, active geo-redundancy can achieve minute-level switching, whereas a switch based on cold backup may take days.

Why Does Alibaba Perform Multi-Active Business?

In Alibaba's business model, the reasons for multi-active business are similar to those mentioned above. As mentioned earlier, if the three data centers across two zones mode does not use multi-active business, another data center must be built, which makes the cost very high, because that data center is only used for routine data synchronization and does not serve traffic. Meanwhile, the version of the disaster recovery system must be continuously kept in step with the version of the production system. In real-world situations, however, when a failure hits a system based on cold backup or three data centers across two zones, the relevant personnel do not dare to switch because it is likely that the system cannot switch back.

There are three main demands for multi-active business:

Resources: With the rapid development of business, the resource capacity of a single zone is limited. Cloud-native and cloud computing provide high availability and disaster recovery capabilities, but cloud computing is itself deployed across different data centers, and active geo-redundancy needs support from the underlying infrastructure. Therefore, we want to expand our business without being limited by a physical data center, with multiple data centers serving the business at the same time.
The business has diversified requirements and needs to be deployed locally or remotely.
Demands for Disaster Recovery Events: Problems such as a fiber optic cable being dug up, or weather-related issues with power supply and heat dissipation, can cause a single data center to fail. Today's deployments are not limited to a single data center. Instead, multiple data centers are deployed in different forms throughout China and can be adjusted flexibly according to the business model.

These demands make multi-active capability urgent. Therefore, Alibaba has developed multi-active solutions and products based on its business requirements and technical capabilities.

Breakdown of Multi-Active Architecture

Multi-Site Mutual Backup Mode: We have discussed how good cloud-native and cloud computing are. However, without multi-active capability, the backup resources sit idle. They do no work in the cold backup state, and people decide under what conditions to switch to the cold backup. Because the business impact has to be reported through many layers, more mature customers prepare plans in advance that define what kind of impact and failure calls for a switchover. However, customers generally do not dare to perform a switchover based on the cold backup mode.
Active-Active Backup in the Same Zone: This mode has certain distance restrictions. The active-active mode is commonly applied to upper-layer applications. For example, data can be distributed to both data centers at the cloud-native PaaS layer. At the data layer, the second data center acts as a local standby, and the database can be cut over to the standby data center if there is a problem in the primary data center. The advantage is that the machines and resources in both data centers are active. Because both are active, you do not have to worry that the version in production differs from the version in the standby data center at switchover time.
Three Data Centers across Two Zones: In addition to backup in the same zone, this mode strengthens failure response by building an offsite cold backup data center, similar to the first cold backup solution. The cold backup data center is usually idle apart from some synchronization, and a switchover is performed only when a failure occurs.
Active Geo-Redundancy: Multiple data centers provide services simultaneously. Because of the distance, data replication at the data layer is constrained by the network, which results in latency. Many technical problems must be solved, such as how to switch from the Beijing data center to the Shanghai data center quickly and how to switch when the underlying data is not completely synchronized due to physical restrictions. Unlike the original disaster recovery mode, this mode requires a lot of preparation work and a subsequent data compensation process. Building this into the product system does not break the physical limits, so we optimize through the architecture instead.

Progressive Multi-Active Disaster Recovery Architecture

For the key core business, we sort out the business when working on a multi-active system or project. This section discusses sorting the business into units.

Under modes such as active-active and three data centers across two zones, the simplest approach is to split the business in half between data centers. This requires a rule for splitting the business. For example, the business can be split in half by user number, just as a bank may split by card number or by the place where the user belongs. In the multi-active mode, we want more flexibility: depending on the processing capacity of each data center and the nature of a fault, the traffic can be adjusted to 50%, 60%, or other ratios. The same is true for multiple data centers, across which traffic can be distributed evenly.
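The splitting rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the product's actual routing logic: the unit names `dc-a` and `dc-b`, the 100-bucket granularity, and the helper names `bucket_for`, `build_table`, and `route` are all hypothetical. Each user ID is hashed to a stable bucket, and each unit owns a contiguous range of buckets sized by its percentage.

```python
import hashlib
from bisect import bisect_right
from itertools import accumulate

BUCKETS = 100  # assumption: ratios are adjusted in 1% steps


def bucket_for(user_id):
    """Map a user ID to a stable bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def build_table(ratios):
    """Convert {unit: percent} into (units, cumulative upper bounds)."""
    if any(p < 0 for p in ratios.values()) or sum(ratios.values()) != BUCKETS:
        raise ValueError("percentages must be non-negative and sum to 100")
    units = list(ratios)
    bounds = list(accumulate(ratios[u] for u in units))
    return units, bounds


def route(user_id, table):
    """Return the unit (data center) that should serve this user."""
    units, bounds = table
    # The first unit whose upper bound exceeds the bucket owns the user.
    return units[bisect_right(bounds, bucket_for(user_id))]


# Hypothetical usage: an even split, then a shift of traffic toward dc-a.
even = build_table({"dc-a": 50, "dc-b": 50})
shifted = build_table({"dc-a": 60, "dc-b": 40})
print(route("user-1001", even), route("user-1001", shifted))
```

Because the buckets are contiguous ranges, moving from 50/50 to 60/40 only relocates the users in buckets 50 to 59; everyone already served by `dc-a` stays there, which keeps the amount of data that must be in sync before a ratio change small.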

In terms of technology, multi-site mutual backup is one-way data replication, while the multi-site active-active mode is two-way: either of the two data centers may have problems, and each can switch to the other. The technical implementation is an important part. At the data level, we must avoid circular replication, in which data synchronized from one data center is copied back to the data center it came from. In addition, with multiple data centers, the traditional method of using serial numbers in the database must ensure that the serial numbers generated by different data centers are never duplicated, which requires products with rules to solve this problem.
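Both problems can be illustrated with a small in-memory model. The sketch below is an assumption-laden toy, not how any particular replication product works: it embeds a data center ID in the low bits of every serial number so two sites can never generate the same value, and it tags each change with its origin so a replicated change is applied but never queued for replication again. The names `SerialNumbers`, `Change`, `DataCenter`, and `replicate`, and the 8-bit ID width, are hypothetical.

```python
from dataclasses import dataclass
from itertools import count

DC_BITS = 8  # assumption: the low 8 bits carry the data center ID


class SerialNumbers:
    """Per-data-center generator whose values cannot collide across sites."""

    def __init__(self, dc_id):
        if not 0 <= dc_id < (1 << DC_BITS):
            raise ValueError("dc_id out of range")
        self.dc_id = dc_id
        self._local = count(1)

    def next_id(self):
        # Local counter in the high bits, data center ID in the low bits.
        return (next(self._local) << DC_BITS) | self.dc_id


@dataclass(frozen=True)
class Change:
    key: int
    value: str
    origin_dc: int  # where the write first happened


class DataCenter:
    def __init__(self, dc_id):
        self.dc_id = dc_id
        self.ids = SerialNumbers(dc_id)
        self.rows = {}
        self.outbox = []  # only locally originated changes are queued

    def write_local(self, value):
        key = self.ids.next_id()
        self.rows[key] = value
        self.outbox.append(Change(key, value, self.dc_id))
        return key

    def apply_remote(self, change):
        if change.origin_dc == self.dc_id:
            return  # our own change came back: drop it
        # Apply without re-queuing, so the change is never sent onward.
        self.rows[change.key] = change.value


def replicate(src, dst):
    """Ship pending changes from src to dst; return how many were sent."""
    shipped = len(src.outbox)
    for change in src.outbox:
        dst.apply_remote(change)
    src.outbox.clear()
    return shipped
```

With two sites writing concurrently, one replication pass in each direction converges both stores, and a further pass ships nothing because replicated rows never enter the receiver's outbox.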

Multi-Active Disaster Recovery Solutions

Multi-Active Solution Architecture

1. Access Layer: The first thing to address in multi-active business is the traffic access layer, which is very important. The access layer supports fine-grained control over access rules. Based on the service sharding rules, traffic must be mapped precisely to each data center at the lower layer. When traffic comes in, the access layer must determine which data center should serve the user. How is this implemented?

The traditional way is to switch domain names. For example, a frontend domain name is backed by two data centers, and the address behind the domain name is changed during the switch: the business originally connected to data center A, and the domain name is switched to data center B. The problem with this method is that it affects business in progress. When a problem occurs in one data center and you need to move the business to the other quickly, switching through domain names disrupts ongoing business at the underlying layers. In addition, this kind of switch cannot be coordinated with the cloud-native PaaS layer: the upper layer is cut over, but the lower layer is unaware that the traffic has moved to the other data center. Calls in the middle of a request chain may still land in the original data center unit, which has a greater impact on business continuity. In extreme cases, this mode can still solve some problems. For instance, if a data center cannot perform any business and a standby data center exists, switching the domain name is one way out.

Another solution is to use cloud-native microservices. Traffic can be marked, and the mark is passed along through the cloud-native microservices system so that, as far as possible, a request is processed entirely within one unit or data center and does not jump to other data centers.
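A minimal sketch of this marking idea, assuming the mark travels as a request header: the inbound side stores the unit in a context variable, every outbound call copies it back into the headers, and instance selection prefers the same unit. The header name `x-unit` and the functions `handle_inbound`, `outbound_headers`, and `pick_instance` are hypothetical and do not correspond to any specific microservices framework.

```python
import contextvars

UNIT_HEADER = "x-unit"  # hypothetical header that carries the unit mark
_current_unit = contextvars.ContextVar("current_unit", default=None)


def handle_inbound(headers, handler, default_unit):
    """Record the unit mark for the duration of one request."""
    unit = headers.get(UNIT_HEADER, default_unit)
    token = _current_unit.set(unit)
    try:
        return handler()
    finally:
        _current_unit.reset(token)


def outbound_headers(extra=None):
    """Headers for a downstream call, with the unit mark propagated."""
    headers = dict(extra or {})
    unit = _current_unit.get()
    if unit is not None:
        headers[UNIT_HEADER] = unit
    return headers


def pick_instance(instances, headers):
    """Prefer an instance in the marked unit; fall back to any instance."""
    unit = headers.get(UNIT_HEADER)
    same_unit = [i for i in instances if i["unit"] == unit]
    return (same_unit or instances)[0]
```

The fallback in `pick_instance` is a design choice made for this sketch: when the marked unit has no healthy instance, the call still goes somewhere rather than failing, at the cost of crossing data centers.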

2. Application Layer: The access route specifications at the intermediate layer include service routing components, which can be provided separately in our product system. For example, some customers do not want the complete solution because they already use open-source components throughout this intermediate layer, but they still want multi-active capability. In that case, the upper layer of our multi-active product can control and cut the traffic, define the number of logical units precisely, and provide APIs for intermediate calls. That layer provides the globally unique serial number, routing rules, and sharding rules. Marking and traffic identification seem relatively simple at this layer. However, multi-active architectures also use distributed messages for decoupling. If a switch happens before messages have been consumed in a certain data center, how they are synchronized to another data center must be solved with the help of cloud-native.

3. Data Layer: This layer involves logic related to replication and writing. The write-prohibition control in our solution places logic on the database that generates code automatically once a switch occurs at the frontend. For example, time-stamped code is generated automatically, and writes are released again only when the data in the target data center has been restored. We prohibit writes to protect the database and to allow database latency to be judged. If the underlying data synchronization capability is not strong enough, the switchover and most businesses still work, but many write businesses may not, because the database is restricted by the write-prohibition rules. In addition, the overall requirements for data replication with multiple data centers are higher than those for plain data synchronization.
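One way to picture write prohibition is as a gate in front of the database. The sketch below is an interpretation, not the product's mechanism: it assumes a prohibition is recorded per shard with the time the switch began, that writes to that shard are rejected while it is in force, and that the gate reopens only once the reported replication lag has fallen to an acceptable level. The class names, the shard label, and the `max_lag_seconds` threshold are all hypothetical.

```python
import time


class WriteProhibited(Exception):
    """Raised when a write hits a shard that is currently write-prohibited."""


class WriteGate:
    def __init__(self, max_lag_seconds=0.0, clock=time.time):
        self.max_lag_seconds = max_lag_seconds
        self._clock = clock
        self._prohibited_at = {}  # shard -> time the prohibition started

    def prohibit(self, shard):
        """Called when a switchover starts for this shard."""
        self._prohibited_at.setdefault(shard, self._clock())

    def check_write(self, shard):
        """Reject writes while the shard is prohibited; reads are unaffected."""
        started = self._prohibited_at.get(shard)
        if started is not None:
            raise WriteProhibited(f"{shard} write-prohibited since {started}")

    def try_release(self, shard, replication_lag_seconds):
        """Reopen writes only once replication has caught up."""
        if replication_lag_seconds > self.max_lag_seconds:
            return False
        self._prohibited_at.pop(shard, None)
        return True
```

This also shows why weak data synchronization hurts write businesses in particular: while `try_release` keeps returning `False`, reads and the traffic switch itself proceed, but every write to the affected shard is refused.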

Based on the complete solution, we proposed the concept shown in the preceding figure. The four-letter abbreviation MSHA names the cloud-native multi-active product that provides these capabilities today. We hope it can play a small role in the four targets of 0, 1, 5, and 10 minutes.

The first is 0-minute prevention. With the traffic switchover mentioned earlier, you can deploy a blue-green release environment across two data centers. Even within a single data center, you can define two logical units in the console and carry out a blue-green release between them very quickly. A blue-green release within a data center depends on product support: this component identifies which resources belong to which unit and implements the blue-green release for a unit quickly.

The second is 5-minute positioning. With local cold backup and traditional disaster recovery technologies, it is often hard to decide on a switchover or to determine who bears its consequences. With this platform, we want the impact of a fault to be visible intuitively, including what actions or operations each stakeholder must perform to restore the application. The system is also used to locate a fault quickly when it occurs. For example, after the problem is located within five minutes, the system decides whether to cut the traffic.

The third is 10-minute restore. We hope to complete the whole process of restarting and recovering the business within 10 minutes through this mode.

Best Practices of Multi-Active Disaster Recovery

The following examples show how enterprises outside Alibaba use multi-active disaster recovery. Deploying on the cloud does not mean that the cloud provides all high availability for you. When using cloud resources, you will find that the cloud has different regions, and each region has different availability zones. When using a public cloud, you must consider the real-world situation. For example, if most customers are located in the South, a node may be opened in a data center in the South. When a problem occurs in that Alibaba data center, the customers' business may be affected accordingly. The business deployed on the cloud and the cloud products themselves may provide high availability, but once a fault occurs in a data center, the business will still be affected. Therefore, the solution is offered as commercial software that deploys multi-active capability both on the cloud and in the data center.

Case 1: Active-Active Backup in the Same Zone

A logistics service customer uses the multi-active architecture in the same zone. Traditional technology has some problems here, and the advantage of the multi-active architecture lies in its SDK, which identifies traffic automatically and passes the request tags along automatically, so the business does not need much transformation. After the disaster recovery work was completed, RTO was much faster than before.

Case 2: Multi-Site Dual-Read Mode

The difficulty in this case is that the sites are thousands of kilometers apart. At this distance, both reads and writes are challenging, and data replication itself has a delay. The product is used here to separate the control layer from the traffic layer: read business is directed to the read data center, and the replication state is handled separately. As a result, RTO improves significantly to the minute level, and services can be switched online dynamically and flexibly.

Case 3: Multi-Site Active-Active Mode

The enterprise customer in this case currently has two data centers and may expand in the future. A lot of product adaptation was developed for this project, and much work was done in the middle layer on top of the product's basic capabilities to support reads. The process started from the R&D of the multi-active product, moved on to application scenarios, and ended with full adaptation to the business. The core focus is business continuity. This does not mean that all businesses will use the multi-active capability across data centers in the future. Instead, the focus is on critical businesses. For example, during Double 11 every year, our core goal is to ensure that orders are not affected. With decoupling or other methods, logistics has a lower priority than order placement in terms of business continuity. The key issue is how the services and products on the core transaction link are handled in the multi-active dimension so that failures do not affect them.

We recommend trying the multi-active management platform yourself. After two or more units are defined in the console, when one data center goes down, its applications can be switched to another data center quickly through the multi-active mode. The premise is that the nodes are defined in the console: whether they are logical nodes in a single data center or multiple physical data centers, they must be mapped to the multi-active management platform. In the platform, we configure rules such as simplified service access, the dimension by which access traffic is divided, or tagging by ID. When a fault occurs, it is then relatively simple to view the traffic by dimension, direct it to another data center, and adjust the configuration quickly.

When we help customers deploy these capabilities, we often perform traffic cuts and drills through the console to check whether a data center is affected. The system also supports other capabilities for this purpose, such as fault drills in which applications are switched to another data center in response to the faults.

Summary

Multi-active disaster recovery has been practiced in Alibaba's internal business for many years, and it took a long time to turn it into a product. The purpose is to help enterprises build their multi-active capability within 30 days using these products and solutions. In particular, many public cloud products are available to enterprises off the shelf, which shortens the build time. We hope this set of products and solutions helps enterprises implement minute-level failover and build multi-active capability quickly.
