Failure Basic Data Management - Well-Architected Framework

Failure Scenario Level Definition

In daily operations, any event that interrupts a service, degrades service quality, or worsens the user experience is called a failure or fault, regardless of its cause. A fault level definition classifies faults by the severity of their impact.

Fault level definitions serve as the production safety rules for each business and drive improvements in its stability. For example, a business team's fault discovery capability is evaluated by the monitoring discovery rate of faults that fall under the fault level definition. Fault levels should be designed along four dimensions: functional level, business scale, business characteristics, and quantitative impact. A brief reference template for a common fault level definition is as follows:

Business Scale	Functional Category	Quantitative Impact	Fault Level
Large Scale	Critical Function	Success rate drops by more than 30%	P1
Large Scale	Critical Function	Success rate drops by 20% to 30%	P2
Large Scale	Critical Function	Success rate drops by less than 20%	P3
Large Scale	Non-Critical Function	Success rate drops by more than 30%	P2
Large Scale	Non-Critical Function	Success rate drops by 20% to 30%	P3
Large Scale	Non-Critical Function	Success rate drops by less than 20%	P4
Small Scale	Critical Function	Total success rate drops by more than 45% within 10 minutes	P1
Small Scale	Critical Function	Total success rate drops by 30% to 45% within 10 minutes	P2
Small Scale	Critical Function	Total success rate drops by less than 30% within 10 minutes	P3
Small Scale	Non-Critical Function	Total success rate drops by more than 45% within 10 minutes	P2
Small Scale	Non-Critical Function	Total success rate drops by 30% to 45% within 10 minutes	P3
Small Scale	Non-Critical Function	Total success rate drops by less than 30% within 10 minutes	P4
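
The reference template can be read as a small lookup rule. The following is a minimal sketch, not a definitive implementation: the names `Scale`, `BANDS`, and `fault_level` are hypothetical, and it assumes the drop is measured in percent, that the "20% to 30%" and "30% to 45%" bands include both endpoints, and that a measurement with no drop has no level. A real definition would also account for business characteristics, which the template does not quantify.

```python
from enum import Enum
from typing import Optional


class Scale(Enum):
    LARGE = "large"
    SMALL = "small"


# (lower, upper) bounds of the middle band, in percent of success-rate drop.
# Values come from the reference template; for small-scale businesses they
# apply to the total success rate within a 10-minute window.
BANDS = {
    Scale.LARGE: (20.0, 30.0),
    Scale.SMALL: (30.0, 45.0),
}


def fault_level(scale: Scale, critical: bool, drop_pct: float) -> Optional[str]:
    """Return "P1".."P4" for a success-rate drop, or None if there is no drop."""
    if drop_pct <= 0:
        return None  # assumption: no drop means no fault level
    lower, upper = BANDS[scale]
    if drop_pct > upper:
        rank = 1
    elif drop_pct >= lower:
        rank = 2
    else:
        rank = 3
    # In the template, a non-critical function is rated one level lower
    # than a critical function with the same quantitative impact.
    if not critical:
        rank += 1
    return f"P{rank}"


if __name__ == "__main__":
    print(fault_level(Scale.LARGE, True, 35.0))
    print(fault_level(Scale.SMALL, False, 10.0))
```

Keeping the thresholds in one table makes it easier to adjust them per business without touching the classification logic.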

Failure Scenario Monitoring Coverage

For each scenario covered by the fault level definition, configure corresponding monitoring items and connect them to 24/7 monitoring duty. For the connected monitoring data, add algorithm-based intelligent alarms, or connect risk alarms that R&D can handle in a closed loop. This ensures the monitoring discovery rate of business faults, shortens fault duration, and reduces fault impact.

To ensure the fault discovery rate, it is recommended to maintain a fault scenario monitoring coverage rate of over 95%.
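
The text does not spell out how the coverage rate is computed. One plausible reading, sketched below under that assumption, is the share of defined fault scenarios that have at least one monitoring item configured. The names `coverage_rate`, `uncovered`, and `TARGET_COVERAGE`, and the scenario and monitor identifiers in the example, are hypothetical.

```python
from typing import List, Mapping, Sequence

TARGET_COVERAGE = 0.95  # recommended level from the text ("over 95%")


def coverage_rate(monitors_by_scenario: Mapping[str, Sequence[str]]) -> float:
    """Share of fault scenarios with at least one monitoring item (assumed formula)."""
    if not monitors_by_scenario:
        return 0.0
    covered = sum(1 for items in monitors_by_scenario.values() if items)
    return covered / len(monitors_by_scenario)


def uncovered(monitors_by_scenario: Mapping[str, Sequence[str]]) -> List[str]:
    """Names of scenarios that still lack a monitoring item, sorted for stable output."""
    return sorted(name for name, items in monitors_by_scenario.items() if not items)


if __name__ == "__main__":
    # Hypothetical scenarios mapped to the monitoring items configured for them.
    scenarios = {
        "login-success-rate": ["monitor-1"],
        "payment-success-rate": [],
        "search-success-rate": ["monitor-2", "monitor-3"],
        "upload-success-rate": ["monitor-4"],
    }
    rate = coverage_rate(scenarios)
    print(f"coverage: {rate:.0%}, target met: {rate > TARGET_COVERAGE}")
    print("missing monitors:", uncovered(scenarios))
```

Listing the uncovered scenarios alongside the rate gives teams a concrete backlog for reaching the recommended coverage.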

Service Group & Duty Roster Management

Associate the personnel groups involved in fault emergency response with the stakeholders in each fault scenario. Service groups and duty rosters make it possible to notify the responsible personnel automatically and quickly, so that they come online to handle a fault after it is initiated. Explanation of terms:

Service Group: A group of personnel providing services, including fault handling, work order handling, etc.
Duty Roster: A schedule of service group members that makes fault emergency work better planned and less likely to be missed.
Escalation Group: A type of service group. Together, service groups and escalation groups express the escalation path between groups.
Relationship between Service Group and Business Fault Group: One service group corresponds to one role in a fault, but can serve multiple business fault groups.
Relationship between Service Group and Work Order Problem Classification: One service group can serve multiple work order problem classifications.
Relationship between Service Group and Organizational Structure: One service group can serve multiple organizational structures, and one organizational structure can be divided into multiple service groups.
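
The terms above can be modeled with a few small data structures. The sketch below is illustrative only: the class names `Shift` and `ServiceGroup`, the function `escalation_path`, and the idea of linking each service group to a single escalation group are assumptions made for the example, not the data model of any particular product.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Shift:
    member: str
    start: datetime  # inclusive
    end: datetime    # exclusive


@dataclass
class ServiceGroup:
    name: str
    roster: List[Shift] = field(default_factory=list)
    # Assumption: an escalation group is itself a service group.
    escalation_group: Optional["ServiceGroup"] = None

    def on_call(self, at: datetime) -> Optional[str]:
        """Return the member on duty at the given time, if any."""
        for shift in self.roster:
            if shift.start <= at < shift.end:
                return shift.member
        return None


def escalation_path(group: ServiceGroup, at: datetime) -> List[str]:
    """On-duty members to notify, in escalation order."""
    path: List[str] = []
    seen = set()
    current: Optional[ServiceGroup] = group
    # Track visited groups so a misconfigured escalation cycle cannot loop forever.
    while current is not None and id(current) not in seen:
        seen.add(id(current))
        member = current.on_call(at)
        if member is not None:
            path.append(member)
        current = current.escalation_group
    return path


if __name__ == "__main__":
    managers = ServiceGroup("managers", [Shift("bob", datetime(2024, 1, 1), datetime(2024, 1, 2))])
    handlers = ServiceGroup(
        "handlers",
        [Shift("alice", datetime(2024, 1, 1, 8), datetime(2024, 1, 1, 20))],
        escalation_group=managers,
    )
    print(escalation_path(handlers, datetime(2024, 1, 1, 12)))
```

A gap in the roster shows up as a missing entry in the path, which is one way to detect shifts where nobody would be notified.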

Failure Subscription Management

Fault notification subscriptions maintain the list of fault notification recipients and send notifications through different channels depending on the configured conditions. Subscribers fall into three types: individuals, stakeholder roles, and DingTalk groups or other notification channels. Configuring fault notifications and subscriptions appropriately ensures that relevant stakeholders receive timely alerts.
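
Condition-based subscription matching might look like the sketch below. The text does not list the available conditions, so the two used here (fault level and business) are assumptions, as are the names `Subscription`, `LEVELS`, and `recipients` and the channel strings in the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

LEVELS = ("P1", "P2", "P3", "P4")  # most severe first


@dataclass(frozen=True)
class Subscription:
    subscriber_type: str       # "individual", "role", or "channel"
    target: str                # person, role name, or group identifier
    channel: str               # hypothetical channel name, e.g. "sms" or "dingtalk"
    least_severe: str = "P4"   # notify for this level and anything more severe
    business: str = "*"        # "*" matches every business


def recipients(subs: List[Subscription], level: str, business: str) -> List[Tuple[str, str]]:
    """Return de-duplicated (target, channel) pairs to notify for a fault."""
    rank = LEVELS.index(level)
    matched: List[Tuple[str, str]] = []
    for sub in subs:
        if sub.business not in ("*", business):
            continue
        if rank > LEVELS.index(sub.least_severe):
            continue  # fault is less severe than this subscriber asked for
        pair = (sub.target, sub.channel)
        if pair not in matched:
            matched.append(pair)
    return matched


if __name__ == "__main__":
    subs = [
        Subscription("individual", "alice", "phone", least_severe="P1"),
        Subscription("role", "duty-manager", "sms", least_severe="P2"),
        Subscription("channel", "ops-group", "dingtalk", business="payments"),
    ]
    print(recipients(subs, "P2", "payments"))
```

Filtering on severity keeps high-interruption channels such as phone calls for the most serious faults, while group channels can receive everything.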