Well-Architected Framework: Failure Basic Data Management

Last Updated: Jul 15, 2025

Failure Scenario Level Definition

In daily operations, any event that interrupts a service, degrades service quality, or worsens the user experience is called a failure or fault, regardless of its cause. A fault level classifies a fault by the extent of its impact.

Fault level definitions serve as the production safety rules for each business and drive stability improvements across businesses. For example, a business team's fault discovery capability is evaluated by the monitoring discovery rate for the faults covered by its fault level definitions. Define fault levels along four dimensions: functional level, business scale, business characteristics, and quantitative impact. A brief reference template for common fault level definitions is as follows:

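| Business scale | Functional category | Impact area | Fault level (P1 to P4) |
| --- | --- | --- | --- |
| Large Scale | Critical Function | Success rate drops by more than 30% | P1 |
| Large Scale | Critical Function | Success rate drops by 20% to 30% | P2 |
| Large Scale | Critical Function | Success rate drops by less than 20% | P3 |
| Large Scale | Non-Critical Function | Success rate drops by more than 30% | P2 |
| Large Scale | Non-Critical Function | Success rate drops by 20% to 30% | P3 |
| Large Scale | Non-Critical Function | Success rate drops by less than 20% | P4 |
| Small Scale | Critical Function | Total success rate drops by more than 45% within 10 minutes | P1 |
| Small Scale | Critical Function | Total success rate drops by 30% to 45% within 10 minutes | P2 |
| Small Scale | Critical Function | Total success rate drops by less than 30% within 10 minutes | P3 |
| Small Scale | Non-Critical Function | Total success rate drops by more than 45% within 10 minutes | P2 |
| Small Scale | Non-Critical Function | Total success rate drops by 30% to 45% within 10 minutes | P3 |
| Small Scale | Non-Critical Function | Total success rate drops by less than 30% within 10 minutes | P4 |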
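
The reference template can be read as a small lookup rule. The following Python sketch is one possible encoding of it, not an official implementation: the function name `fault_level`, the `THRESHOLDS` mapping, and the choice to reject a non-positive drop are all illustrative assumptions. For small-scale businesses the caller is assumed to have measured the drop over a 10-minute window, as the template requires.

```python
# Success-rate drop thresholds (in percent) from the reference template.
# Each business scale maps to (upper threshold, lower threshold).
THRESHOLDS = {
    "large": (30.0, 20.0),
    "small": (45.0, 30.0),  # assumed to be measured over a 10-minute window
}


def fault_level(scale: str, critical: bool, drop_pct: float) -> str:
    """Map a success-rate drop to a fault level P1..P4 per the template."""
    if scale not in THRESHOLDS:
        raise ValueError(f"unknown business scale: {scale!r}")
    if drop_pct <= 0:
        # Assumption: a non-positive drop is not a fault and has no level.
        raise ValueError("drop_pct must be positive")

    upper, lower = THRESHOLDS[scale]
    if drop_pct > upper:
        severity = 1          # "more than" the upper threshold
    elif drop_pct >= lower:
        severity = 2          # between the two thresholds, inclusive
    else:
        severity = 3          # "less than" the lower threshold

    # Non-critical functions sit one level lower than critical ones.
    if not critical:
        severity += 1
    return f"P{severity}"


if __name__ == "__main__":
    print(fault_level("large", True, 35.0))
    print(fault_level("small", False, 10.0))
```

Keeping the thresholds in a single mapping means each business can substitute its own values without touching the classification logic.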

Failure Scenario Monitoring Coverage

For each scenario covered by the fault level definitions, configure corresponding monitoring items and connect them to 24/7 on-duty monitoring. In addition, apply algorithm-based intelligent alerts to the connected monitoring data, or connect risk alerts that R&D can handle in a closed loop. Together, these measures sustain the monitoring discovery rate for business faults and reduce fault duration and impact.

To ensure the fault discovery rate, it is recommended to maintain a fault scenario monitoring coverage rate of over 95%.
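
As a rough illustration of how the coverage rate might be tracked, the sketch below assumes that a scenario counts as covered when at least one monitoring item is attached to it, and that the rate is the number of covered scenarios divided by the total. The source does not define the formula, so this definition, the function names, and the scenario names are all hypothetical.

```python
TARGET_COVERAGE = 0.95  # recommended minimum from the text above


def monitoring_coverage(scenarios: dict[str, list[str]]) -> float:
    """Return the share of fault scenarios with at least one monitoring item.

    `scenarios` maps a fault scenario name to its monitoring item names.
    """
    if not scenarios:
        raise ValueError("no fault scenarios defined")
    covered = sum(1 for items in scenarios.values() if items)
    return covered / len(scenarios)


def uncovered_scenarios(scenarios: dict[str, list[str]]) -> list[str]:
    """List the scenarios that still need monitoring items, sorted by name."""
    return sorted(name for name, items in scenarios.items() if not items)


if __name__ == "__main__":
    # Hypothetical scenario and monitoring item names.
    scenarios = {
        "checkout-success-rate": ["checkout_sr_1m"],
        "login-success-rate": ["login_sr_1m", "login_latency_p99"],
        "search-success-rate": [],
        "payment-success-rate": ["payment_sr_1m"],
    }
    rate = monitoring_coverage(scenarios)
    print(f"coverage: {rate:.0%}, target met: {rate > TARGET_COVERAGE}")
    print("missing:", uncovered_scenarios(scenarios))
```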

Service Group & Duty Roster Management

Associate the personnel groups involved in fault emergency response with the stakeholders of each fault scenario. Service groups and duty rosters make it possible to automatically and quickly notify the responsible personnel to come online and handle a fault once it is initiated. Explanation of terms:

  1. Service Group: A group of personnel providing services, including fault handling, work order handling, etc.

  2. Duty Roster: A shift schedule for service group members, which makes fault emergency work more planned and less likely to be missed.

  3. Escalation Group: A type of service group. Linking service groups to escalation groups expresses the escalation path between groups.

  4. Relationship between Service Group and Business Fault Group: One service group corresponds to one role in a fault, but can serve multiple business fault groups.

  5. Relationship between Service Group and Work Order Problem Classification: One service group can serve multiple work order problem classifications.

  6. Relationship between Service Group and Organizational Structure: One service group can serve multiple organizational structures, and one organizational structure can be divided into multiple service groups.
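
The terms above can be sketched as a small data model. This is a minimal illustration under stated assumptions, not the product's actual schema: the class names `Shift` and `ServiceGroup`, the `escalation_group` link, and the helper `notify_chain` are hypothetical, and shifts are assumed to be half-open time intervals.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class Shift:
    """One duty roster entry: a member on duty from start (inclusive) to end (exclusive)."""
    member: str
    start: datetime
    end: datetime


@dataclass
class ServiceGroup:
    """A group of personnel that fills one role in a fault."""
    name: str
    role: str
    roster: list[Shift] = field(default_factory=list)
    # An escalation group is itself a service group; the link forms the escalation path.
    escalation_group: Optional["ServiceGroup"] = None

    def on_duty(self, at: datetime) -> list[str]:
        """Return the members whose shift covers the given time."""
        return [s.member for s in self.roster if s.start <= at < s.end]


def notify_chain(group: ServiceGroup, at: datetime) -> list[tuple[str, list[str]]]:
    """Walk the escalation path and return (group name, on-duty members) per step."""
    chain = []
    seen = set()
    current: Optional[ServiceGroup] = group
    while current is not None and id(current) not in seen:
        seen.add(id(current))  # guard against an accidental escalation cycle
        chain.append((current.name, current.on_duty(at)))
        current = current.escalation_group
    return chain


if __name__ == "__main__":
    day = datetime(2025, 7, 15)
    l2 = ServiceGroup("payments-l2", "escalation",
                      [Shift("carol", day, datetime(2025, 7, 16))])
    l1 = ServiceGroup("payments-l1", "fault handler",
                      [Shift("alice", day, datetime(2025, 7, 15, 8)),
                       Shift("bob", datetime(2025, 7, 15, 8), datetime(2025, 7, 15, 16))],
                      escalation_group=l2)
    print(notify_chain(l1, datetime(2025, 7, 15, 3)))
```

Modeling the escalation group as an ordinary service group keeps the escalation path a simple linked chain, which matches the definition in item 3.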

Failure Subscription Management

Fault notification subscriptions maintain the list of fault notification recipients and send notifications through different channels depending on the configured conditions. Subscription objects fall into three types: individuals, stakeholder roles, and DingTalk groups or other notification channels. Configuring fault notifications and subscriptions appropriately ensures that relevant stakeholders receive timely alerts.
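
Condition-based subscription matching can be sketched as follows. Everything here is an illustrative assumption rather than the product's API: the `Subscription` class, the fault represented as a plain dictionary with `level` and `business` keys, and the channel names.

```python
from dataclasses import dataclass
from typing import Callable

# The three subscription object types named in the text.
KINDS = {"individual", "role", "group"}


@dataclass(frozen=True)
class Subscription:
    subscriber: str                     # person, stakeholder role, or group name
    kind: str                           # one of KINDS
    channel: str                        # e.g. "sms", "email", "dingtalk"
    condition: Callable[[dict], bool]   # decides whether a fault matches

    def __post_init__(self) -> None:
        if self.kind not in KINDS:
            raise ValueError(f"unknown subscription kind: {self.kind!r}")


def recipients(fault: dict, subscriptions: list[Subscription]) -> list[tuple[str, str]]:
    """Return (subscriber, channel) pairs whose condition matches the fault."""
    return [(s.subscriber, s.channel) for s in subscriptions if s.condition(fault)]


if __name__ == "__main__":
    subs = [
        # An individual who wants only the most severe faults, by SMS.
        Subscription("alice", "individual", "sms",
                     lambda f: f["level"] == "P1"),
        # A stakeholder role notified of P1 and P2 faults, by email.
        Subscription("payments-owner", "role", "email",
                     lambda f: f["level"] in {"P1", "P2"}),
        # A group channel that receives every fault for one business.
        Subscription("payments-oncall", "group", "dingtalk",
                     lambda f: f["business"] == "payments"),
    ]
    print(recipients({"level": "P2", "business": "payments"}, subs))
```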