Failure Scenario Level Definition
In daily operations, any event that interrupts a service, degrades service quality, or worsens the user experience is called a failure or fault, regardless of its cause. Classifying faults by the severity of their impact is called fault level definition.
Fault level definitions serve as the production safety rules for each business and drive each business to improve its stability. For example, a business team's fault discovery capability is evaluated by the monitoring discovery rate of the faults covered by its fault level definition. Fault levels should be designed along four dimensions: functional level, business scale, business characteristics, and quantitative impact. A brief reference template for a common fault level definition is as follows:
Business Scale | Functional Category | Impact | Fault Level
Large Scale | Critical Function | Success rate drops by more than 30% | P1
Large Scale | Critical Function | Success rate drops by 20% to 30% | P2
Large Scale | Critical Function | Success rate drops by less than 20% | P3
Large Scale | Non-Critical Function | Success rate drops by more than 30% | P2
Large Scale | Non-Critical Function | Success rate drops by 20% to 30% | P3
Large Scale | Non-Critical Function | Success rate drops by less than 20% | P4
Small Scale | Critical Function | Total success rate drops by more than 45% within 10 minutes | P1
Small Scale | Critical Function | Total success rate drops by 30% to 45% within 10 minutes | P2
Small Scale | Critical Function | Total success rate drops by less than 30% within 10 minutes | P3
Small Scale | Non-Critical Function | Total success rate drops by more than 45% within 10 minutes | P2
Small Scale | Non-Critical Function | Total success rate drops by 30% to 45% within 10 minutes | P3
Small Scale | Non-Critical Function | Total success rate drops by less than 30% within 10 minutes | P4
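The reference template above can be expressed as a small lookup, which may help when wiring fault levels into alert rules. The following is a minimal sketch, not a definitive implementation: the function name `fault_level`, the `THRESHOLDS` table, and the string keys `"large"` and `"small"` are hypothetical, and the handling of the exact boundary values (a drop of exactly 20%, 30%, or 45% falls into the middle band) is an assumption, since the template does not say which band the boundaries belong to.

```python
# Thresholds are the success-rate drops (in percent) from the reference template.
# For small-scale businesses the drop is measured on the total success rate
# within a 10-minute window. Names and keys here are illustrative assumptions.
THRESHOLDS = {
    "large": (30.0, 20.0),  # (upper bound, lower bound) of the middle band
    "small": (45.0, 30.0),
}


def fault_level(scale: str, critical: bool, drop_pct: float) -> str:
    """Map a detected success-rate drop to a fault level P1..P4."""
    if scale not in THRESHOLDS:
        raise ValueError(f"unknown business scale: {scale!r}")
    if drop_pct <= 0:
        raise ValueError("drop_pct must be a positive success-rate drop")

    upper, lower = THRESHOLDS[scale]
    if drop_pct > upper:
        band = 0          # most severe band
    elif drop_pct >= lower:
        band = 1          # middle band (boundary values assumed to land here)
    else:
        band = 2          # least severe band

    # In the template, a non-critical function is rated one level lower
    # than a critical function with the same impact.
    level = band + 1 + (0 if critical else 1)
    return f"P{level}"
```

For example, `fault_level("large", True, 35.0)` yields `P1`, while the same drop on a non-critical function yields `P2`.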
Failure Scenario Monitoring Coverage
For each scenario covered by the fault level definition, configure the corresponding monitoring items and connect them to 24/7 monitoring duty. On top of the connected monitoring data, add algorithm-based intelligent alarms, or connect risk alarms that R&D can handle in a closed loop. This ensures the monitoring discovery rate of business faults and reduces both fault duration and fault impact.
To ensure the fault discovery rate, it is recommended to maintain a fault scenario monitoring coverage rate of over 95%.
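A coverage check along these lines can be sketched as follows. The text does not define how the coverage rate is calculated, so this sketch assumes it is the share of defined fault scenarios that have at least one monitoring item configured. The function name `coverage_report`, the input shape, and the scenario names in the usage note are hypothetical.

```python
RECOMMENDED_COVERAGE = 0.95  # recommended minimum from the text ("over 95%")


def coverage_report(monitors_by_scenario: dict) -> tuple:
    """Return (coverage rate, sorted list of scenarios without monitoring).

    monitors_by_scenario maps a fault scenario name to the list of monitoring
    items configured for it. Assumption: a scenario counts as covered when
    that list is non-empty.
    """
    if not monitors_by_scenario:
        raise ValueError("no fault scenarios defined")
    uncovered = sorted(
        name for name, items in monitors_by_scenario.items() if not items
    )
    rate = 1 - len(uncovered) / len(monitors_by_scenario)
    return rate, uncovered


def meets_recommendation(rate: float) -> bool:
    """True when the coverage rate is above the recommended 95%."""
    return rate > RECOMMENDED_COVERAGE
```

Calling `coverage_report({"login": ["login_success_rate"], "payment": []})` returns a rate of 0.5 and lists `payment` as uncovered, which makes the gap actionable rather than just a number.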
Service Group & Duty Roster Management
Associate the personnel groups involved in fault emergency response with the stakeholders of each fault scenario. Service groups and duty rosters make it possible to notify the responsible personnel automatically and quickly once a fault is initiated, so that they can come online to handle it. Explanation of terms:
Service Group: A group of personnel providing services, including fault handling, work order handling, etc.
Duty Roster: The on-duty schedule for the members of a service group. It makes fault emergency work better planned and less likely to be missed.
Escalation Group: A type of service group. Together, service groups and escalation groups express the escalation path between groups.
Relationship between Service Group and Business Fault Group: One service group corresponds to one role in a fault, but can serve multiple business fault groups.
Relationship between Service Group and Work Order Problem Classification: One service group can serve multiple work order problem classifications.
Relationship between Service Group and Organizational Structure: One service group can serve multiple organizational structures, and one organizational structure can be divided into multiple service groups.
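The terms above can be modeled with a small data structure. This is a minimal sketch under stated assumptions, not the product's actual data model: the class names `Shift` and `ServiceGroup`, the field `escalates_to`, and the half-open shift interval are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Shift:
    """One duty roster entry: who is on duty, and when (end is exclusive)."""
    member: str
    start: datetime
    end: datetime


@dataclass
class ServiceGroup:
    """A group of personnel with a duty roster and an optional escalation group."""
    name: str
    roster: List[Shift] = field(default_factory=list)
    escalates_to: Optional["ServiceGroup"] = None  # the escalation group, if any

    def on_call(self, at: datetime) -> Optional[str]:
        """Return the member on duty at the given time, or None if no shift matches."""
        for shift in self.roster:
            if shift.start <= at < shift.end:
                return shift.member
        return None


def escalation_path(group: ServiceGroup) -> List[str]:
    """Follow escalates_to links and return the group names in escalation order."""
    path = []
    seen = set()
    current = group
    # Track visited groups by identity so a misconfigured cycle cannot loop forever.
    while current is not None and id(current) not in seen:
        seen.add(id(current))
        path.append(current.name)
        current = current.escalates_to
    return path
```

Keeping the escalation link on the group itself mirrors the idea that an escalation group is just another service group, so the same roster lookup applies at every step of the path.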
Failure Subscription Management
Fault notification subscriptions maintain the list of fault notification recipients and send notifications through different channels based on different conditions. Subscribers fall into three types: individuals, stakeholder roles, and DingTalk groups or other notification channels. With fault notifications and subscriptions configured appropriately, the relevant stakeholders receive timely alerts.
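Condition-based subscription matching can be sketched as follows. The text does not specify which conditions are supported, so this sketch assumes two: the fault level and, optionally, the affected business. The `Subscription` class, its fields, and the channel names in the usage note are hypothetical and do not reflect any real notification API.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class Subscription:
    subscriber: str               # a person, a stakeholder role, or a group name
    kind: str                     # "individual", "role", or "group"
    channel: str                  # delivery channel name (illustrative only)
    levels: FrozenSet[str]        # fault levels this subscription applies to
    business: Optional[str] = None  # None means "any business"


def notifications_for(
    level: str, business: str, subscriptions: List[Subscription]
) -> List[Tuple[str, str]]:
    """Return (channel, subscriber) pairs whose conditions match the fault."""
    return [
        (sub.channel, sub.subscriber)
        for sub in subscriptions
        if level in sub.levels and sub.business in (None, business)
    ]
```

For instance, a subscription with `levels=frozenset({"P1", "P2"})` and no `business` restriction would match a P1 fault in any business, while one restricted to `business="payments"` would be skipped for faults elsewhere.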