Failure Drills - Well-Architected Framework - Alibaba Cloud Documentation Center

In many large enterprises, such as Alibaba, years of technological evolution have made system tools and architectures highly vertical, and server fleets have grown to a considerable scale. Once a service exceeds a certain scale, for example 10,000 servers, low-probability hardware faults occur every day. If each fault requires human intervention, the system cannot scale reliably.
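The claim that rare faults become daily events at this scale follows from simple arithmetic. The sketch below assumes a hypothetical 4% annualized hardware failure rate per server; that rate is an illustrative assumption, not a figure from this document.

```python
# Hypothetical input: the 4% annualized failure rate is an assumption
# chosen for illustration; the fleet size is the document's example.
FLEET_SIZE = 10_000
ASSUMED_ANNUAL_FAILURE_RATE = 0.04


def expected_daily_faults(servers: int, annual_failure_rate: float) -> float:
    """Expected number of server hardware faults per day across the fleet."""
    return servers * annual_failure_rate / 365


daily = expected_daily_faults(FLEET_SIZE, ASSUMED_ANNUAL_FAILURE_RATE)
print(f"expected faults per day: {daily:.2f}")  # expected faults per day: 1.10
```

Under this assumption, a fault that any single server sees roughly once every 25 years still shows up somewhere in the fleet about once a day.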

Therefore, fault tolerance is designed into every level of the system, and each level relies on its downstream components to detect and handle faults quickly when they occur. However, the effectiveness of these measures, the real disaster recovery capability of fault recovery tools, the proficiency of the people handling problems, the communication mechanisms, and the impact of disaster recovery measures on upper-level systems often go untested until a real fault occurs.

Failure drills were born in this context. They accumulate common fault scenarios and replay those faults online by injecting exceptions at a controlled cost. Through continuous practice and regression runs, the problems that are discovered are verified, which drives continuous improvement of systems, tools, processes, and personnel capabilities. Better fault discovery methods and fault recovery capabilities make it possible to find and fix avoidable major problems in advance, or to shorten fault recovery time.

Failure drills validate the business system by applying chaos engineering. Drills fall into two categories: lossy and non-lossy. Medium- to low-frequency lossy drills discover and verify business architecture problems and verify business disaster recovery capabilities. High-frequency non-lossy drills validate business monitoring and alarm response capabilities, as well as emergency organization capabilities.

Theoretical Basis of Drill Scenario Design

Technical faults can be summarized and categorized by layer: IaaS, PaaS, and SaaS.

The above classification offers a macroscopic perspective, rather than a system design perspective. Consequently, the fault model in system design can be further refined, leading to several conclusions:

Faults arise from hardware (such as the IaaS layer) or software (such as the PaaS or SaaS layer). As a rule, hardware faults manifest as software fault symptoms.
Faults occur in standalone or distributed systems, and distributed faults encompass standalone faults.
For standalone faults, from the system's perspective, a fault may occur inside the current process, such as FullGC or high CPU usage. It may also originate outside the process, for instance when another process suddenly consumes memory and causes abnormal behavior in the current system. For most non-critical scenarios, there is no need to consider the actual external fault itself; only its impact on the current system needs to be considered.
Additionally, there is a category of faults that may result from human errors or improper processes, which are not addressed here.

Common fault types can be mapped to this fault model. Fault simulation drill systems and schemes can be designed based on this model. When designing a drill plan, faults can be injected into each link in the model to verify fault emergency plans.
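One way to make the fault model concrete is to treat its three dimensions (hardware or software, standalone or distributed, inside or outside the current process) as a small data structure and check which combinations a drill catalog already covers. The following is a minimal sketch; the class names, the scenario names, and the way each scenario is classified are all illustrative assumptions rather than part of any Alibaba tooling.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Origin(Enum):
    HARDWARE = "hardware"  # e.g. IaaS layer
    SOFTWARE = "software"  # e.g. PaaS or SaaS layer


class Scope(Enum):
    STANDALONE = "standalone"
    DISTRIBUTED = "distributed"


class Locus(Enum):
    IN_PROCESS = "in-process"          # e.g. FullGC, high CPU usage
    OUT_OF_PROCESS = "out-of-process"  # e.g. another process consuming memory


@dataclass(frozen=True)
class FaultScenario:
    name: str
    origin: Origin
    scope: Scope
    locus: Locus


# Hypothetical catalog; the classification of each entry is an assumption.
CATALOG = [
    FaultScenario("full_gc", Origin.SOFTWARE, Scope.STANDALONE, Locus.IN_PROCESS),
    FaultScenario("high_cpu", Origin.SOFTWARE, Scope.STANDALONE, Locus.IN_PROCESS),
    FaultScenario("memory_hog_neighbor", Origin.SOFTWARE, Scope.STANDALONE, Locus.OUT_OF_PROCESS),
    FaultScenario("dc_network_outage", Origin.HARDWARE, Scope.DISTRIBUTED, Locus.OUT_OF_PROCESS),
]


def uncovered_links(catalog):
    """Return the (origin, scope, locus) combinations with no scenario yet."""
    covered = {(s.origin, s.scope, s.locus) for s in catalog}
    return [combo for combo in product(Origin, Scope, Locus) if combo not in covered]


gaps = uncovered_links(CATALOG)
for origin, scope, locus in gaps:
    print(f"no drill scenario for: {origin.value}, {scope.value}, {locus.value}")
```

Listing the uncovered combinations gives a drill plan a checklist of links in the model that have not yet had a fault injected.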

Different Drill Types and Objectives

Based on the impact on production workloads during drills, drills can be categorized into lossy drills and non-lossy drills. Due to the different impacts on workloads, the frequency and achievable business verification objectives of the two types of drills are different.

Lossy drills inject exceptions directly into the real production workload environment, so their simulation fidelity is high. To limit the business impact, core scenarios are generally selected and drills are run when the workload is at its lowest. The frequency is relatively low; for example, multi-active disaster recovery capabilities are verified with monthly drills.

Non-lossy drills run in an isolated environment without real production traffic; traffic is injected in the load testing environment to simulate exceptional conditions. Because the business is not affected, these drills can run more frequently. For example, drills that simulate faults similar to those seen in production, accept the improvement actions from fault reviews, or validate monitoring perception and alarm response capabilities can be run once a week with participation from different business teams.
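Scheduling a lossy drill for the quietest period can be sketched as a search for the contiguous window with the least traffic. The hourly request counts below are made-up sample data, and `quietest_window` is a hypothetical helper, not part of any drill platform.

```python
def quietest_window(hourly_requests: list[int], duration_hours: int) -> int:
    """Return the start hour of the contiguous window with the least traffic.

    The window may wrap around midnight. Ties resolve to the earliest hour.
    """
    hours = len(hourly_requests)

    def window_load(start: int) -> int:
        return sum(hourly_requests[(start + i) % hours] for i in range(duration_hours))

    return min(range(hours), key=window_load)


# Hypothetical requests per hour, index 0 = 00:00-01:00.
traffic = [
    900, 600, 400, 300, 350, 500, 1200, 2500, 4000, 5200, 5600, 5800,
    6000, 5700, 5500, 5300, 5400, 5900, 6500, 7000, 6200, 4500, 2800, 1500,
]

start = quietest_window(traffic, duration_hours=2)
print(f"run the lossy drill starting at {start:02d}:00")  # 03:00 for this sample
```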

Drill Type	Pros and Cons	Drill Environment	Drill Frequency	Main Drill Objectives
Lossy Drills	Pros: High realism and effectiveness. Cons: Impact on production workload.	Real production workload environment	Once every 1-2 months	Testing disaster recovery capabilities across multiple active data centers; simulation and verification of major architecture/business issues; production sudden-attack simulation drills
Non-lossy Drills	Pros: Production workload not affected. Cons: Limited realism.	Full-path grayscale environment or newly established workload environment	Once or twice a week	Verification of monitoring perception capability and alarm response speed; simulation of similar faults and acceptance of improvement actions from fault retrospectives; verification of emergency organization processes and disruption mitigation plans
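The frequencies in the table can be turned into a simple overdue check. The sketch below is one possible encoding: it reads "once every 1-2 months" as a maximum gap of 60 days and "once or twice a week" as a maximum gap of 7 days. Those thresholds, like the `DrillPolicy` type itself, are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class DrillPolicy:
    drill_type: str
    environment: str
    max_interval: timedelta  # longest gap allowed between two drills (assumed)


POLICIES = {
    "lossy": DrillPolicy("lossy", "production", timedelta(days=60)),
    "non_lossy": DrillPolicy("non_lossy", "grayscale", timedelta(days=7)),
}


def is_overdue(policy: DrillPolicy, last_run: date, today: date) -> bool:
    """True when more time has passed since the last drill than the policy allows."""
    return today - last_run > policy.max_interval


last_run = date(2024, 1, 1)
today = date(2024, 2, 15)  # 45 days later
for name, policy in POLICIES.items():
    print(name, "overdue" if is_overdue(policy, last_run, today) else "on schedule")
```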

Best Practices for Failure Drills

Alibaba Group runs lossy and non-lossy drills routinely through chaos engineering, which shortens the time needed to build large-scale drill systems and makes drill implementation more efficient. This lets businesses focus on identifying architecture and process risks and on improving system optimization and disaster recovery capabilities, maximizing the return on chaos engineering experiments.

Low-frequency drills are conducted for three major scenarios in the production environment:

Data center network outage drills, which test individual services one at a time and gradually expand the scope until all services are included. This keeps multi-active disaster recovery capabilities continuously effective, so that they are not broken by business iterations, infrastructure changes, or middleware changes.
Simulation and verification of major architecture/business issues discovered through full-path production scans and through problems collected by "Mine Sweeper".
Annual or semi-annual production sudden-attack drills in which the CTO personally injects faults. These drills verify the full-path fault handling process: fault detection, quick alarm response, efficient emergency organization, troubleshooting, and disruption mitigation measures.
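The gradual scope expansion described for data center network outage drills can be sketched as a sequence of cumulative stages that stops at the first failure. This is only an illustration: the doubling schedule, the function names, and the `run_stage` callback are assumptions, and a real drill platform would supply its own fault injection and health checks.

```python
from typing import Callable, Optional, Sequence


def expansion_stages(services: Sequence[str], first: int = 1, factor: int = 2) -> list[list[str]]:
    """Cumulative scopes: the first `first` services, then `factor` times as many,
    and so on, ending with a stage that contains every service."""
    if first < 1 or factor < 2:
        raise ValueError("first must be >= 1 and factor must be >= 2")
    stages, size = [], first
    while size < len(services):
        stages.append(list(services[:size]))
        size *= factor
    stages.append(list(services))
    return stages


def run_expanding_drill(
    services: Sequence[str],
    run_stage: Callable[[list[str]], bool],
) -> tuple[list[list[str]], Optional[list[str]]]:
    """Run each stage in order; return (passed stages, failing scope or None)."""
    passed = []
    for scope in expansion_stages(services):
        if not run_stage(scope):  # run_stage injects the outage and checks recovery
            return passed, scope
        passed.append(scope)
    return passed, None


services = ["order", "payment", "inventory", "search", "cart"]  # hypothetical names
# Stand-in for a real drill: pretend recovery fails once four services are in scope.
passed, failed_scope = run_expanding_drill(services, lambda scope: len(scope) < 4)
print("passed stages:", passed)
print("failed at scope:", failed_scope)
```

Stopping at the first failing scope keeps the blast radius of each drill no larger than the last scope that was shown to recover.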

High-frequency non-lossy drills are conducted in simulation environments to verify various capabilities of different businesses:

Verification of monitoring perception capability and alarm response speed of each business.
Reproduction and verification of variant or abstracted scenarios derived from historical faults in this business and in other businesses, as well as verification of important improvement measures from historical faults.
Verification of emergency coordination capabilities of each business and various related plans.
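The verification goals above can be checked mechanically by recording timestamps during a drill and comparing them with targets. The sketch below assumes hypothetical targets (60 seconds to detect, 300 seconds to respond, 600 seconds to mitigate) and hypothetical field names; real targets would come from each business's own service objectives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrillTimeline:
    injected_at: float      # seconds since drill start
    detected_at: float      # monitoring raised an alarm
    acknowledged_at: float  # on-call engineer acknowledged the alarm
    mitigated_at: float     # mitigation plan took effect


@dataclass(frozen=True)
class Targets:
    detect_s: float
    respond_s: float
    mitigate_s: float


def evaluate(timeline: DrillTimeline, targets: Targets) -> dict[str, bool]:
    """Compare each phase of the drill with its target duration."""
    return {
        "detection": timeline.detected_at - timeline.injected_at <= targets.detect_s,
        "response": timeline.acknowledged_at - timeline.detected_at <= targets.respond_s,
        "mitigation": timeline.mitigated_at - timeline.injected_at <= targets.mitigate_s,
    }


# Hypothetical targets and a hypothetical drill record.
targets = Targets(detect_s=60, respond_s=300, mitigate_s=600)
timeline = DrillTimeline(injected_at=0, detected_at=45, acknowledged_at=400, mitigated_at=580)

result = evaluate(timeline, targets)
print(result)  # {'detection': True, 'response': False, 'mitigation': True}
```

In this sample the alarm fired in time, but it took 355 seconds to acknowledge, so the drill flags alarm response as the capability to improve.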