Application Disaster Recovery - Well-Architected Framework

"Active-active deployment" is an advanced form of "application disaster recovery" technology, which refers to the establishment of a production system in the same city or different data centers that corresponds to a portion or all of the local production system, and all applications in the data centers provide services simultaneously. When a disaster occurs, the active-active system can switch business traffic within minutes, and users may not even feel the occurrence of the disaster. "Local active-active architecture" and "remote active-active architecture" (code-named "unitization") are typical implementation technologies for active-active applications.

Advantages of Active-active Deployment

Minute-level RTO (Recovery Time Objective): Recovery is fast. The average recovery time is within 30 seconds for Alibaba internal service systems and 1 minute for external customer service systems.
Optimal resource utilization: No resources sit idle. Resources in all data centers serve traffic, which avoids waste.
High switching success rate: Backed by a mature active-active technology architecture and a visual operation and maintenance platform, switching succeeds more often than with the existing disaster recovery architecture. Within Alibaba, the success rate across thousands of switches per year is over 99.9%.
Precise traffic control: Active-active deployment supports fine-grained traffic control from top to bottom, allowing specific business traffic to be directed to the corresponding data center. Enterprises can use this advantage to implement features such as full-domain gray release and key traffic protection.
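
The precise traffic control described above can be illustrated with a minimal sketch. It assumes two hypothetical data centers ("dc-a" and "dc-b"), a hash-based percentage rule for gray release, and a pinning table for key traffic. None of these names or rules come from the source, and a real traffic control platform would be considerably richer.

```python
import hashlib

# Hypothetical data center names, used only for illustration.
STABLE_DC = "dc-a"
GRAY_DC = "dc-b"


def bucket_of(user_id: str, buckets: int = 100) -> int:
    """Map a user ID to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def pick_data_center(user_id: str, gray_percent: int, pinned: dict) -> str:
    """Choose the data center that should serve this user's traffic."""
    # Key traffic protection: pinned users always go to their assigned data center.
    if user_id in pinned:
        return pinned[user_id]
    # Gray release: a fixed share of users, chosen by hash, goes to the gray data center.
    return GRAY_DC if bucket_of(user_id) < gray_percent else STABLE_DC
```

Because the bucket is derived from a hash of the user ID, the same user keeps landing in the same data center as long as the percentage does not change, which keeps a gray release stable from the user's point of view.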

Design Principles for Active-active Deployment

Active-active deployment is a universal active-active architecture solution that supports cross-region and cross-platform deployment. The standard architecture of active-active deployment needs to meet the following four design standards:

Business Flow Active (BFA): Active-active deployment is ultimately expressed through the business flow, so the active-active disaster recovery system must be able to allocate production traffic at a fine granularity based on business characteristics.
Local Region Active (LRA): Applications are the smallest service set of a distributed system. When the primary center enters the disaster recovery state, the system should be able to switch all or some of its applications globally.
Ultra Distance Active (UDA): Over extremely long distances (data centers more than 300 kilometers apart), the business system remains readily accessible, and both RTO and RPO (Recovery Point Objective) stay within minutes when it enters the disaster recovery state.
Hybrid Cloud Active (HCA): The details of disaster recovery are shielded from the business, which is given a unified active-active programming paradigm. Compatibility with cloud platform technologies is maintained to support active-active scenarios across deployment modes such as public cloud, private cloud, hosted private cloud, and edge computing nodes.
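
To see why the 300-kilometer distance in the UDA standard matters, a rough lower bound on network latency can be worked out. The sketch below assumes light travels through optical fiber at roughly 200 kilometers per millisecond (about two thirds of its speed in a vacuum). The function names and the figure of 20 sequential calls are illustrative assumptions, not values from the source.

```python
# Assumption: light travels through optical fiber at roughly 200 km per millisecond
# (about two thirds of its speed in a vacuum).
FIBER_KM_PER_MS = 200.0


def min_rtt_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over a fiber path of the given length."""
    return 2 * distance_km / FIBER_KM_PER_MS


def added_latency_ms(distance_km: float, sequential_calls: int) -> float:
    """Extra latency when a request makes several cross-data-center calls in sequence."""
    return sequential_calls * min_rtt_ms(distance_km)


print(min_rtt_ms(300))            # 3.0
print(added_latency_ms(300, 20))  # 60.0
```

Real round-trip times are higher than this bound because fiber routes are not straight lines and network equipment adds delay. Even the bound shows that a request chaining many cross-data-center calls accumulates noticeable latency at this distance.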

Typical Architectures of Active-active Deployment

Active-active Deployment in the Same City

Active-active deployment in the same city means that applications deployed in multiple data centers within the same city provide services externally at the same time. In this scenario, the physical distance between data centers is small (less than 100 kilometers).

With multiple data centers, network and service communication between them is essential. As a result, a failure in one data center can spread to the whole system, and its scope of impact is hard to control. The challenge of this architecture therefore lies in traffic routing and isolation between data centers. When a data center fails, fast switching at the data center level should be possible. At a finer granularity, when a specific application within a data center fails, application-level switching is also required.

To schedule traffic between data centers, the same-city architecture establishes multiple logical regions called "cells", in which services are deployed. Business traffic is kept within its own cell wherever possible. Because the cross-data center round-trip time (RTT) is small within the same city, the cloud services shared by the cells run in single-cluster mode, which avoids the complexity of maintaining data consistency across clusters.

Active-active deployment in the same city has minimal intrusion on the application system. Based on flexible traffic scheduling and traffic routing between cells, it can quickly recover from faults and achieve decoupling between business recovery and fault recovery.
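
The cell-preference and switching behavior described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: the `Instance` type, the `select_instances` function, and the `disabled_cells` switch are hypothetical names, and the source does not describe how its routing is actually implemented.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Instance:
    """A service instance registered in a given cell (hypothetical model)."""
    address: str
    cell: str
    healthy: bool = True


def select_instances(instances: List[Instance], local_cell: str,
                     disabled_cells: FrozenSet[str] = frozenset()) -> List[Instance]:
    """Prefer instances in the caller's own cell; fall back to other cells."""
    # Drop unhealthy instances and every instance in a cell that has been switched away.
    usable = [i for i in instances if i.healthy and i.cell not in disabled_cells]
    # Keep traffic inside the local cell whenever it still has usable instances.
    local = [i for i in usable if i.cell == local_cell]
    return local or usable
```

Adding a cell to `disabled_cells` models a data-center-level switch, while marking the instances of a single application unhealthy models an application-level switch. In both cases callers recover by routing to another cell without waiting for the fault itself to be repaired.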

Active-active Deployment in Different Regions

Disaster recovery in the same city cannot withstand region-level disasters. According to disaster recovery standards in the banking industry, the construction of disaster recovery centers must meet the "three no's" principle (the disaster recovery center should not be located in the same earthquake zone, the same river basin, or the same power grid as the service center), so the distance between the disaster recovery center and the production center is generally more than 300 kilometers.

Active-active deployment in different regions means that applications deployed in multiple data centers in different regions provide services externally at the same time. Its main challenge is the network latency caused by long distances. With high latency, it is difficult for cloud services to run in single-cluster mode across regions, while multi-cluster mode introduces the complexity of maintaining data consistency.

To overcome the distance limit of a single cluster, Alibaba proposed the "unitization" solution for active-active deployment. The core idea is to shard data and use top-down traffic routing so that specific data shards are read and written in specific centers. This solves the data consistency problem and addresses both business disaster recovery and horizontal scaling.

A logical center that can be scaled horizontally is called a "unit". There are two types of units: central units and normal units. Business deployed in a unit falls into three types: global business, core business, and shared business. The central unit deploys all three types, while a normal unit deploys core business and shared business. Normal units are horizontally scalable and can be replicated freely.

Global business: Strongly consistent business that is written and read in the central unit.
Core business: Business that has been unitized and is written and read in the normal unit.
Shared business: A read service for global business that the core business frequently relies on, with writes in the central unit and reads in the normal unit.
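
The routing rules for the three business types can be sketched as below. This is an illustrative sketch only: the unit names, the shard count of 100, the CRC32-based sharding on user ID, and the shard table are all assumptions, since the source does not specify the sharding key or the table layout.

```python
import zlib
from enum import Enum


class Business(Enum):
    GLOBAL = "global"
    CORE = "core"
    SHARED = "shared"


# Hypothetical unit names and shard layout, chosen only for illustration.
CENTRAL_UNIT = "center"
SHARD_COUNT = 100
SHARD_TABLE = [(range(0, 50), "unit-1"), (range(50, 100), "unit-2")]


def owning_unit(user_id: str) -> str:
    """Return the unit that owns the data shard of this user."""
    shard = zlib.crc32(user_id.encode("utf-8")) % SHARD_COUNT
    for shard_range, unit in SHARD_TABLE:
        if shard in shard_range:
            return unit
    raise LookupError(f"no unit owns shard {shard}")


def route(business: Business, operation: str, user_id: str, local_unit: str) -> str:
    """Pick the unit that should handle a request ("read" or "write")."""
    if business is Business.GLOBAL:
        # Global business is read and written in the central unit.
        return CENTRAL_UNIT
    if business is Business.SHARED:
        # Shared business is written centrally and read from the local unit.
        return CENTRAL_UNIT if operation == "write" else local_unit
    # Core business is read and written in the unit that owns the user's shard.
    return owning_unit(user_id)
```

In this sketch, replicating a normal unit amounts to adding an entry to the shard table and reassigning shard ranges, which is one way to picture the horizontal scaling that units provide.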

Active-active Deployment in Hybrid Cloud Environments

Hybrid cloud combines deployment modes such as public cloud, private cloud, hosted private cloud, and edge computing nodes, and supports various technical and business practices across cloud infrastructure, middleware, and development lifecycle platforms. IDC reports indicate that in 2021, 14% of customers choose public cloud alone, while 86% of enterprises adopt a multi-cloud and hybrid cloud architecture.

Active-active deployment in hybrid cloud means that applications deployed in a hybrid cloud environment provide services externally. Its architecture is a disaster recovery architecture derived from multi-cloud and heterogeneous scenarios, and it decouples business recovery from cloud infrastructure and cloud services. The management product for active-active deployment in hybrid cloud shields developers and upper-layer users from the complexity of hybrid cloud disaster recovery and provides a consistent development, operation, and maintenance experience.

The biggest challenge is achieving cloud integration and data exchange across clouds that are independent and heterogeneous. Cloud integration means integrating cloud accounts, permissions, resources, and data, and building a unified multi-cloud operation interface on that basis. Data exchange means discovering and synchronizing the infrastructure and middleware systems of different clouds, and achieving data disaster recovery on that foundation. To achieve both, hybrid cloud active-active management needs multi-cloud adaptation that shields the underlying differences. It mainly relies on three core functionalities:

Cloud service interface adaptation: Supports multi-cloud adaptation through plug-ins and provides a unified cloud service interface.
Data model adaptation: Abstracts accounts, permissions, resources, and data of different clouds, and shields the data differences of multi-cloud.
Unified disaster recovery interface: Provides standardized definitions and interface specifications for disaster recovery, which facilitates rapid integration of heterogeneous clouds and their heterogeneous technology stacks.

Technical Solutions for Active-active Deployment

The technical solutions for active-active deployment generally consist of three parts: the application layer, the data layer, and the cloud platform. These three parts follow the design standards of active-active deployment and together provide the capabilities needed to build active-active architectures. The application layer is the main path for business application traffic and can be divided into three parts:

Access gateway: As the first hop for business traffic to enter the data center, the access gateway is responsible for identifying and distributing traffic, and has two core capabilities: data center routing and application routing.
Microservices: Carry the synchronous calls of business traffic within and across data centers. They generally include roles such as consumer, provider, and registration center, and have three core capabilities: traffic routing, traffic protection, and fault isolation.
Messaging: Carries the asynchronous calls of business traffic within and across data centers and smooths load through message-based peak shaving and valley filling. It generally includes roles such as producer, consumer, and broker.
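
The two core capabilities of the access gateway, data center routing followed by application routing, can be sketched as a two-stage lookup. The header name `x-traffic-tag`, the rule tables, and the longest-prefix matching are assumptions made for this illustration. The source does not describe how the gateway identifies traffic.

```python
from typing import Dict, Tuple


def route_request(path: str, headers: Dict[str, str],
                  dc_rules: Dict[str, str], app_routes: Dict[str, str],
                  default_dc: str) -> Tuple[str, str]:
    """Return (data_center, application) for an incoming request."""
    # Stage 1, data center routing: identify the traffic by a tag header
    # (hypothetical) and map it to a data center, or use the default.
    tag = headers.get("x-traffic-tag", "")
    data_center = dc_rules.get(tag, default_dc)

    # Stage 2, application routing: pick the application whose path prefix
    # is the longest match for the request path.
    matches = [prefix for prefix in app_routes if path.startswith(prefix)]
    if not matches:
        raise LookupError(f"no application route for {path}")
    application = app_routes[max(matches, key=len)]
    return data_center, application
```

For example, with `dc_rules = {"vip": "dc-b"}` and `app_routes = {"/order": "order-service", "/": "web"}`, a request for `/order/create` tagged `vip` resolves to `("dc-b", "order-service")`, while an untagged request falls back to the default data center.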

The data layer covers data reads and writes by business applications, data storage, and data synchronization. It has three core capabilities: traffic routing, data consistency protection, and data synchronization.

The cloud platform is the core foundation on which business applications run. It includes single-cloud, single data center, multi-cloud, and hybrid cloud deployments.
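
One possible shape of the data layer's data consistency protection is a guard that rejects writes arriving at a unit that does not own the data shard, and that freezes a shard while its ownership is being switched. The source does not state how this protection works, so the `ShardWriteGuard` class and its freeze-then-switch procedure are assumptions offered purely as a sketch.

```python
from typing import Dict, Set


class ShardWriteGuard:
    """Tracks which unit may write each shard (hypothetical mechanism)."""

    def __init__(self, owner_by_shard: Dict[int, str]) -> None:
        self._owner = dict(owner_by_shard)
        self._frozen: Set[int] = set()

    def begin_switch(self, shard: int) -> None:
        # Freeze writes so that data synchronization can catch up before the move.
        self._frozen.add(shard)

    def finish_switch(self, shard: int, new_owner: str) -> None:
        # Hand the shard to its new owner and allow writes again.
        self._owner[shard] = new_owner
        self._frozen.discard(shard)

    def can_write(self, shard: int, unit: str) -> bool:
        # A write is allowed only in the owning unit and only when not frozen.
        return shard not in self._frozen and self._owner.get(shard) == unit
```

The point of the freeze is that no unit accepts writes for a shard during the switch, so the old and new owners cannot both modify the same data while synchronization is still in flight.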