Failure Review - Well-Architected Framework - Alibaba Cloud Documentation Center

Failure Review Specification

Fault review is an important part of the fault system. It covers the fault handling process, improvement analysis, and the definition of fault responsibility. Based on a standardized review SOP, corresponding preventive action recommendations, and an accountability management mechanism, it comprehensively retraces how a production fault occurred and produces a fault review report and improvement measures so that the same fault does not recur.

The review follows this standard process:

Process Review: Use the 5-why method to dig into the handling process by asking a series of successive questions. For example, why did this fault occur? Why wasn't it discovered in advance? How did each team handle it during the process? Is there room to optimize the process?
Problem Analysis: After completing the process review, a deep analysis is needed. Is there a process problem? Is there a quality inspection problem? Is there a product business problem? Is there a system design problem? Are there better defense mechanisms? How to avoid recurrence?
Experience Summary: After analyzing the root cause, practical actions need to be identified, including short-term mitigation actions, long-term resolution actions, and accumulation of experiences and lessons.
Responsibility Assignment: After the causes and improvement measures have been analyzed, determine the final fault level and assign responsibility for the fault. Responsible teams are classified as primary responsible teams, secondary responsible teams, and testing responsible teams.
Improvement Tracking: If improvements are not effectively implemented after the review, the results of the review are wasted. Improvements therefore need to be followed through after the review, with clearly defined improvement action plans and a specified completion deadline.

The formulated actions need to comply with the SMART principle, which means:
- Specific: What is the improvement? What are the specific metrics or indicators that need improvement and optimization?
- Measurable: What are the criteria for acceptance? What are the specific metrics or indicators used to determine whether the improvement has been achieved?
- Attainable: Is the improvement achievable? Avoid improvements that are unattainable or cannot be implemented.
- Relevant: Are the improvements related to other improvements? Avoid isolated improvements.
- Time-bound: What is the expected completion time for the improvement? The completion time should ideally not exceed three months to avoid improvements becoming formalities.
A complete action should record the following information: title, planned completion time, owner (and their team or assisting personnel), acceptance method and acceptor, tracker, improvement measure category, specific improvement description, and acceptance criteria. After the improvement action is completed, an acceptance method can be chosen, such as review acceptance or simulation acceptance. After acceptance, the acceptor closes out the overall work of this improvement action.
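The action record and the mechanically checkable parts of the SMART principle can be sketched in code. The following is a minimal illustration, not a prescribed format: the class name `ImprovementAction`, the function `smart_issues`, the field names, and the 92-day approximation of "three months" are all assumptions made for this example. Whether an action is attainable and relevant still needs human judgment and is not checked here.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumption: "three months" is approximated as 92 days.
MAX_DURATION = timedelta(days=92)


@dataclass
class ImprovementAction:
    """Hypothetical record holding the fields listed in the text above."""
    title: str
    description: str            # specific improvement description
    category: str               # improvement measure category
    owner: str
    tracker: str
    acceptor: str
    acceptance_method: str      # e.g. "review" or "simulation"
    acceptance_criteria: str
    planned_completion: date
    assistants: tuple = ()      # assisting personnel, if any


def smart_issues(action: ImprovementAction, created: date) -> list:
    """Return human-readable problems; an empty list means none were found."""
    issues = []
    if not action.description.strip():
        issues.append("Specific: description is empty")
    if not action.acceptance_criteria.strip():
        issues.append("Measurable: acceptance criteria are missing")
    if not action.owner.strip():
        issues.append("Owner: no owner assigned")
    if action.planned_completion <= created:
        issues.append("Time-bound: planned completion is not after the creation date")
    elif action.planned_completion - created > MAX_DURATION:
        issues.append("Time-bound: planned completion is more than three months out")
    return issues


# Example usage with made-up values.
example = ImprovementAction(
    title="Add backlog alert",
    description="Alert when the order queue backlog exceeds a threshold",
    category="monitoring",
    owner="alice",
    tracker="bob",
    acceptor="carol",
    acceptance_method="simulation",
    acceptance_criteria="Alert fires within 1 minute in a drill",
    planned_completion=date(2024, 9, 1),
)
print(smart_issues(example, created=date(2024, 3, 1)))
```

A check like this could run when an action is filed, so that actions without acceptance criteria or with distant deadlines are flagged before the review is closed.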

The review document generally includes the following content:

Fault Summary: Fault overview, impact, handlers, etc.
Fault Background: The business background when the fault occurred.
Fault Timeline: Highlights the key time points of the fault: introduction, occurrence, discovery, business response, recovery execution, and fault recovery.
Fault Cause Analysis: First summarize the cause in one sentence, then provide a detailed cause analysis.
Fault Process Analysis: Analyze from the aspects of requirements evaluation, code release, and emergency response.
Follow-up Improvement: Follow-up improvement measures, clearly defining improvement direction and responsible persons.
Fault Level/Responsibility: Refer to the fault level definition described above. Define the fault level for this fault and clearly define the responsible team and responsible person.
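The time points in the fault timeline lend themselves to a simple structured record from which intervals can be derived. The sketch below is only an illustration: the class `FaultTimeline`, its field names, and the interval names (`latent`, `time_to_discover`, `time_to_respond`, `time_to_recover`) are assumptions for this example and are not defined by the framework.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class FaultTimeline:
    """Hypothetical record of the six time points named in the text above."""
    introduced: datetime         # the change that introduced the fault went live
    occurred: datetime           # the fault first affected the business
    discovered: datetime         # the fault was detected (alert or report)
    responded: datetime          # business response began
    recovery_started: datetime   # the recovery action was executed
    recovered: datetime          # the fault was recovered

    def ordered(self) -> bool:
        """Check that the time points are in chronological order."""
        points = [self.introduced, self.occurred, self.discovered,
                  self.responded, self.recovery_started, self.recovered]
        return all(a <= b for a, b in zip(points, points[1:]))

    def intervals(self) -> dict:
        """Derive durations between selected time points."""
        return {
            "latent": self.occurred - self.introduced,
            "time_to_discover": self.discovered - self.occurred,
            "time_to_respond": self.responded - self.discovered,
            "time_to_recover": self.recovered - self.occurred,
        }


# Example usage with made-up timestamps.
timeline = FaultTimeline(
    introduced=datetime(2024, 3, 1, 9, 0),
    occurred=datetime(2024, 3, 1, 10, 0),
    discovered=datetime(2024, 3, 1, 10, 5),
    responded=datetime(2024, 3, 1, 10, 8),
    recovery_started=datetime(2024, 3, 1, 10, 20),
    recovered=datetime(2024, 3, 1, 10, 30),
)
for name, value in timeline.intervals().items():
    print(name, value)
```

Recording the timeline in this form makes it easier to compare discovery and recovery times across reviews, rather than reconstructing them from free text.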

Failure Data Operation

Based on the basic fault data, present and operate fault data along different dimensions and in different forms, both online and offline, in venues such as reporting platforms, production safety reports, and production safety meetings. The goal is to use historical fault data to measure stability status and capabilities, and to use quantitative fault assessment to drive an overall reduction in faults.
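As a rough illustration of viewing fault data along different dimensions, historical fault records can be aggregated by level, by responsible team, and by month. The record layout, the function name `summarize`, and the level labels `"P1"` and `"P2"` below are hypothetical; the framework does not prescribe a specific quantification formula here.

```python
from collections import Counter, defaultdict
from datetime import timedelta


def summarize(faults):
    """Aggregate fault records along three dimensions.

    Each record is a hypothetical tuple:
    (month, fault_level, responsible_team, duration).
    """
    count_by_level = Counter()
    count_by_team = Counter()
    duration_by_month = defaultdict(timedelta)
    for month, level, team, duration in faults:
        count_by_level[level] += 1
        count_by_team[team] += 1
        duration_by_month[month] += duration
    return count_by_level, count_by_team, dict(duration_by_month)


# Example usage with made-up records.
history = [
    ("2024-01", "P2", "payments", timedelta(minutes=30)),
    ("2024-01", "P1", "orders", timedelta(minutes=45)),
    ("2024-02", "P2", "payments", timedelta(minutes=10)),
]
levels, teams, durations = summarize(history)
print(levels)
print(teams)
print(durations)
```

Summaries like these could feed a report or a production safety meeting, where trends in count and duration are compared over time.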