When people explain cloud-native, they usually talk about microservices and containers. So how do these technologies relate to enterprise disaster recovery? Every industry needs disaster recovery, but how to build disaster recovery and multi-active capability is something each enterprise has to think through and learn for itself. I hope this article offers some useful ideas.
The multi-active business discussed here is one part of the disaster recovery system. Let's look at how the overall disaster recovery architecture has evolved:
For business continuity management (BCM), the standards and guidance accumulated over years of building disaster recovery systems form a systematic approach, which includes the following dimensions:
Both the BCM system and IT disaster recovery are guiding frameworks distilled from practice. Viewed as a whole, business continuity is the goal at the top, and beneath it sit the various implementation methods. At the bottom are the IT plans and the contingency plans for handling specific faults when business continuity is threatened. In the past, these requirements were addressed as part of disaster recovery. Now we address them in the product system from the perspective of multi-active business.
There are several common disaster recovery methods. They range from cold backup, to active-active in the same zone, to active-active in the same zone plus a remote cold backup (three data centers across two zones). These are relatively standard in the industry. Active geo-redundancy looks similar, but all three data centers across the two zones are active at the same time, which sets it apart from traditional disaster recovery. The two kinds of architecture also differ in construction cost: active geo-redundancy requires more investment than traditional architectures such as same-zone active-active or three data centers across two zones.
When building multi-active capability, we also consider the nature of the business. For instance, some industries or businesses only need both sites to serve reads, and the construction cost and the service switching time differ accordingly. On the switching timeline, active geo-redundancy can achieve minute-level switching, whereas a switch based on cold backup may take days.
Alibaba needs multi-active business for reasons similar to those above. As mentioned earlier, a deployment with three data centers across two zones that does not use multi-active business requires building an extra data center, which is very costly because that data center only receives synchronized data and is not in operation. Meanwhile, the versions of the production system and the disaster recovery system both need to be updated continuously. And in practice, when a system based on cold backup or three data centers across two zones fails, the people responsible often do not dare to switch over, because the system may well be unable to switch back.
There are three main demands for multi-active business:
These demands make multi-active capability urgent. Therefore, Alibaba has developed multi-active solutions and products based on its business requirements and technical capabilities.
When working on a multi-active system or project, we first sort out the business, focusing on the critical core business. Here we discuss unit-based sorting, that is, splitting the business into units.
In modes such as active-active and three data centers across two zones, the simplest approach is to split the business in half between data centers. This makes the splitting rule easy to find. For example, the business can be halved by user number, just as a bank might split by card number or by the place where the user belongs. In multi-active mode, we want more flexibility: depending on the processing capacity of each data center and the nature of a fault, the traffic share can be adjusted to 50%, 60%, or another ratio. The same applies to more than two data centers, where traffic can also be distributed uniformly.
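The splitting rule described above can be sketched as follows. This is a minimal illustration, not the product's actual routing logic: the function name `pick_unit`, the unit names `dc-a` and `dc-b`, and the choice of a CRC32 hash with 100 buckets are all assumptions made for the example.

```python
import zlib
from typing import Mapping


def pick_unit(user_id: str, weights: Mapping[str, int]) -> str:
    """Map a user to a unit according to percentage weights.

    `weights` maps a hypothetical unit name to its traffic share in percent.
    The shares must be non-negative and sum to 100.
    """
    if any(w < 0 for w in weights.values()) or sum(weights.values()) != 100:
        raise ValueError("weights must be non-negative and sum to 100")

    # crc32 is stable across processes and restarts, unlike the built-in
    # hash(), so the same user always lands in the same bucket (0..99).
    bucket = zlib.crc32(user_id.encode("utf-8")) % 100

    # Walk the units in order and return the one whose range holds the bucket.
    upper = 0
    for unit, weight in weights.items():
        upper += weight
        if bucket < upper:
            return unit
    raise AssertionError("unreachable: weights sum to 100")


even = {"dc-a": 50, "dc-b": 50}
shifted = {"dc-a": 60, "dc-b": 40}
print(pick_unit("user-1", even), pick_unit("user-1", shifted))
```

Because each unit owns a contiguous range of buckets, moving from 50/50 to 60/40 only relocates the users whose buckets fall between 50 and 59. Everyone already served by `dc-a` stays there.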
Technically, multi-site mutual backup uses one-way data replication, while the multi-site active-active mode uses two-way replication: either of the two data centers may fail, and each can take over for the other. The technical implementation matters here. At the data level, we must avoid circular replication, where data synchronized from the other data center is copied back to where it came from. In addition, the traditional method is to use serial numbers generated in the database. With multiple data centers, the serial numbers generated by different data centers must never be duplicated, which requires product support that enforces generation rules.
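Both requirements can be illustrated with a small in-memory model. This is a sketch under stated assumptions rather than a description of how the product works: the classes `Change` and `DataCenter`, the `sync` helper, the interleaved serial scheme (each site starts at its own index and steps by the number of sites), and the origin tag used to stop loops are all hypothetical choices for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Change:
    serial: int    # globally unique serial number
    origin: str    # data center where the write first happened
    payload: str


class DataCenter:
    def __init__(self, name: str, index: int, total: int) -> None:
        # index is this site's 1-based slot, total is the number of sites.
        self.name = name
        self._next = index
        self._step = total
        self.rows: Dict[int, str] = {}
        self.outbox: List[Change] = []

    def next_serial(self) -> int:
        # Site 1 of 2 yields 1, 3, 5...; site 2 yields 2, 4, 6...
        serial = self._next
        self._next += self._step
        return serial

    def write(self, payload: str) -> Change:
        change = Change(self.next_serial(), self.name, payload)
        self.rows[change.serial] = payload
        self.outbox.append(change)  # local writes are queued for replication
        return change

    def receive(self, change: Change) -> None:
        if change.origin == self.name:
            return  # our own change came back: drop it
        # Replicated changes are applied but never re-queued, so they
        # cannot travel back to the site that produced them.
        self.rows[change.serial] = change.payload


def sync(a: DataCenter, b: DataCenter) -> None:
    """Ship each side's pending local changes to the other side."""
    for src, dst in ((a, b), (b, a)):
        pending, src.outbox = src.outbox, []
        for change in pending:
            dst.receive(change)


a = DataCenter("dc-a", index=1, total=2)
b = DataCenter("dc-b", index=2, total=2)
a.write("x")
b.write("y")
a.write("z")
sync(a, b)
print(sorted(a.rows.items()))
```

After one round of `sync`, both sites hold the same rows and both outboxes are empty, so a second round ships nothing.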
Multi-Active Solution Architecture
1. Access Layer: The first thing to address in a multi-active business is the traffic access layer, which is critical. The access layer supports fine-grained control over access rules. Based on the service sharding rules, traffic must be mapped precisely to each data center in the lower layer. When traffic arrives, the access layer must determine which data center should serve that user. How is this implemented?
The traditional way is to switch domain names. For example, a frontend domain name is backed by two data centers, and a switchover repoints the domain name: all business that was connected to data center A is switched to data center B. The problem with this method is that it affects business in progress. When one data center has a problem and you need to move business to the other quickly, switching through domain names disrupts the ongoing business at the underlying layer. In addition, this kind of switching cannot be coordinated with the cloud-native PaaS layer: only the upper layer is switched, and the lower layer is unaware that traffic has moved to the other data center. Calls in the middle of a request chain may still run in the original data center unit, which has a greater impact on business continuity. In extreme cases, this mode can still solve some problems. For instance, if a data center cannot serve any business at all and a standby data center exists, switching the domain name is one option.
Another solution is to use cloud-native microservices. You can tag the traffic at the microservices layer and then propagate the tag through the cloud-native microservices system, so that each request is processed within one unit or data center as far as possible and does not jump to other data centers.
2. Application Layer: The routing specifications at the intermediate layer include service routing components, which our product system can provide separately. For example, some customers do not want the complete solution because they already run open-source components at this intermediate layer, but they still want multi-active capability. In that case, our multi-active product can control and switch traffic at the upper layer, define the logical units precisely, and provide APIs for the intermediate layer to call. The upper layer supplies the globally unique serial numbers, routing rules, and sharding rules. Tagging and identifying traffic looks relatively simple at this layer. However, multi-active architectures also use distributed messaging for decoupling. If a switchover happens before the messages in one data center have been consumed, they must be synchronized to the other data center, and this is a problem that cloud-native components need to help solve.
3. Data Layer: This layer involves the logic for replication and writes. In our solution, once a switchover is triggered at the frontend, the write-prohibition control automatically generates the corresponding rules on the database. For example, a time-stamped rule is generated automatically, and writes are released again only after the data in the target data center has been restored. Prohibiting writes protects the database and leaves room to judge the database latency. If the underlying data synchronization capability is not strong enough, the switchover and most business can still proceed, but many write operations may be blocked because the database is constrained by the write-prohibition rules. In addition, multiple data centers place higher overall demands on data replication than plain data synchronization does.
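The write-prohibition idea in the data layer can be sketched as a small gate that blocks writes when a switchover begins and releases them only after replication has caught up. This is a hypothetical illustration, not the product's implementation: the names `SwitchoverGate`, `WriteProhibited`, `begin_switch`, and `try_release`, and the use of a plain timestamp as the replication position, are assumptions made for the example.

```python
from typing import Optional


class WriteProhibited(Exception):
    """Raised when a write arrives while the switchover gate is closed."""


class SwitchoverGate:
    def __init__(self) -> None:
        # Timestamp at which writes were prohibited, or None if writes are open.
        self._blocked_since: Optional[float] = None

    def begin_switch(self, now: float) -> None:
        # Record when the prohibition started (the time-stamped rule).
        self._blocked_since = now

    def try_release(self, replicated_up_to: float) -> bool:
        """Reopen writes once the target has replayed every change made
        before the prohibition began. Returns True if writes are open."""
        if self._blocked_since is not None and replicated_up_to >= self._blocked_since:
            self._blocked_since = None
        return self._blocked_since is None

    def check_write(self) -> None:
        if self._blocked_since is not None:
            raise WriteProhibited("writes are prohibited during switchover")


gate = SwitchoverGate()
gate.begin_switch(now=100.0)
print(gate.try_release(replicated_up_to=99.5))   # target still lagging
print(gate.try_release(replicated_up_to=100.0))  # target has caught up
```

Reads are deliberately left out of the gate, which matches the observation above that a switchover can proceed for most business while writes wait on replication.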
Based on the complete solution system, we proposed the concept shown in the preceding figure. The four-letter abbreviation MSHA (Multi-Site High Availability) names the cloud-native multi-active product capability we provide today. We hope it can play a role in meeting four targets, expressed as the numbers 0, 1, 5, and 10 minutes.
The first is 0-minute prevention. With the traffic switchover mentioned earlier, you can deploy a blue-green release environment across two data centers. Even within a single data center, you can define two logical units in the console and perform blue-green releases between them very quickly. Blue-green release within one data center depends on support from the product: this component identifies which resources belong to which unit, so a blue-green release can be carried out for a unit quickly.
Next is 5-minute positioning. With traditional local cold backup and disaster recovery technologies, it is often hard to decide whether to switch over and who bears the consequences of the switchover. With this platform, we want the impact of a fault to be visible intuitively, including what action each stakeholder needs to take to restore the application when a fault occurs. The system is also used to locate the fault quickly. For example, once the problem has been located within five minutes, the system decides whether to switch the traffic.
The last is 10-minute restore. With this mode, we hope to keep the entire process of restarting and recovering the business within 10 minutes.
The following examples show how enterprises outside Alibaba use multi-active disaster recovery. Deploying on the cloud does not mean that the cloud provides all high availability for you. When using cloud resources, you will find that the cloud has different regions, and each region has different availability zones. When using a public cloud, you must consider the real-world situation. For example, if most customers are located in the south, you might deploy in a data center in the south. If a problem occurs in that Alibaba data center, the customers' business may be affected. The business deployed on the cloud and the cloud products it uses may each provide high availability, but once a fault occurs in a data center, the business will still be affected. Therefore, the solution we provide delivers the multi-active capability as commercial software that can be deployed both on the cloud and in the data center.
A logistics service customer uses the multi-active architecture in the same zone. The traditional technology had a few problems, and the advantage of the multi-active architecture shows in its SDK: it identifies traffic automatically and passes the request tags along automatically, so the business needs little modification. After the disaster recovery build was completed, the RTO became much shorter than before.
The difficulty in this multi-site dual-read case is that it spans thousands of kilometers. At that distance, both reads and writes are challenging, and data replication itself incurs a delay. This set of products is used to separate the control layer from the traffic layer: read traffic is directed to the read data center, while the replication state is tracked separately. As a result, the RTO improves significantly to the minute level, and services can be switched online dynamically and flexibly.
The enterprise customer that uses the multi-site active-active mode currently has two data centers and may expand its capabilities in the future. A lot of product adaptation work was done for this project: on top of the product's basic capabilities, much work went into the middle layer to support reads. The process started from the R&D of the multi-active product, moved on to application scenarios, and then fully adapted to the business. The core focus is business continuity. This does not necessarily mean that every business will use multi-active capability across data centers in the future. Instead, the focus is on critical businesses. For example, during Double 11 every year, our core goal is to ensure that orders are not affected. Through decoupling or other methods, logistics is given a lower business continuity priority than order placement. The key question is how the services and products on the core transaction link are handled in the multi-active dimension to ensure that failures do not occur.
We recommend trying the multi-active management platform for yourself. After two or more units are defined in the console, when one of the data centers goes down, its applications can be switched quickly to another data center in multi-active mode. The prerequisite for switching is that the nodes are defined in the console. Whether they are logical nodes in a single data center or multiple physical data centers, they must be mapped to the multi-active management platform. In the platform, we configure rules such as the simplified service access, the dimension by which access traffic is divided, or tagging by ID. When a fault occurs, switching is then relatively simple: the platform displays the traffic by dimension dynamically, and the share routed to another data center can be configured quickly.
When we help customers deploy these capabilities, we often perform traffic switches and drills through the console to check whether a data center is affected in some way. This is possible because the system also supports other scenarios, such as running fault drills and switching applications to another data center in response to those faults.
Multi-active disaster recovery has been practiced in Alibaba's internal business for many years, and it took a long time to turn it into a product. The goal is to help enterprises build their multi-active capability within 30 days using these products and solutions. For enterprises on the public cloud in particular, many products are available off the shelf, so the build takes less time. We hope that this set of products and solutions can help enterprises build multi-active capability quickly and achieve minute-level failover.