24/7 Monitoring Duty
Enterprises with the necessary resources can establish a Global Operations Center (GOC) to provide 24/7 monitoring duty and keep close watch on anomalies and faults in core business. For core business scenarios covered by monitoring, risks or faults can be identified promptly when exceptions are reported, either through automatic detection by tools or through manual judgment by on-duty personnel. The personnel responsible for emergency response can then be notified quickly through risk alarms or fault notifications so that they come online to handle the issue, which avoids or reduces business damage.
24/7 monitoring duty exists because alert accuracy cannot reach 100%. To keep the phone alarms sent to business development teams accurate and to reduce needless interruptions, a person must judge whether a reported exception is real. Every business takes faults seriously, and both false alarms and missed alarms have a significant impact, so faults are confirmed manually before notifications are sent. The fault-handling process also needs someone to organize it and follow it up. Fault emergency response is a race against time, and it is difficult to guarantee a response time when relying on on-call personnel alone. The main assessment indicators for 24/7 monitoring duty are notification timeliness rate, notification accuracy, and rapid recovery execution rate.
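As a rough illustration, the three assessment indicators could be computed from a log of duty notifications. The text does not define the indicators precisely, so the definitions below are assumptions: timeliness is the share of notifications sent within a target delay, accuracy is the share of notifications later confirmed as real faults, and rapid recovery execution rate is the share of real faults for which a recovery plan was executed. The `Notification` record, its fields, and the 300-second target are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Notification:
    """One notification sent by on-duty personnel (hypothetical record layout)."""
    detected_at: float       # seconds, when the exception was reported
    notified_at: float       # seconds, when the notification was sent
    was_real_fault: bool     # confirmed afterwards as a true fault
    recovery_executed: bool  # a rapid recovery plan was executed


def duty_indicators(records, sla_seconds=300.0):
    """Compute the three indicators under the assumed definitions above."""
    total = len(records)
    if total == 0:
        return {"timeliness_rate": 0.0, "accuracy": 0.0,
                "recovery_execution_rate": 0.0}

    # Notifications sent within the assumed target delay.
    timely = sum(1 for r in records
                 if r.notified_at - r.detected_at <= sla_seconds)
    # Notifications that turned out to be real faults.
    real = sum(1 for r in records if r.was_real_fault)
    # Real faults for which a recovery plan was executed.
    recovered = sum(1 for r in records
                    if r.was_real_fault and r.recovery_executed)

    return {
        "timeliness_rate": timely / total,
        "accuracy": real / total,
        "recovery_execution_rate": recovered / real if real else 0.0,
    }


sample = [
    Notification(0, 120, True, True),
    Notification(0, 600, True, False),   # notified too late
    Notification(0, 60, False, False),   # false alarm
    Notification(0, 200, True, True),
]
indicators = duty_indicators(sample)
```

An organization would substitute its own definitions and targets; the point is only that each indicator reduces to a simple ratio once notifications are recorded consistently.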
Intelligent Baseline Alarm
The intelligent baseline alarm combines statistical methods with machine learning algorithms. It automatically learns the historical patterns of metric data and detects abnormal mutations in the metric curve. Compared with custom alarm rules, intelligent baseline alarms detect anomalies more accurately for metrics with periodic patterns. Characteristics of the intelligent baseline alarm:
No manual configuration of alarm rules is required. The alarm is automatically generated based on the historical data of the metric.
Suppresses false alarms caused by short-lived spikes and dips, such as business metric surges during promotion events.
Suppresses periodic false alarms. When a drop-type anomaly recurs at the same time of day over multiple days, the alarm is suppressed. This applies to business scenarios where message tasks surge daily or drop on a schedule.
It is recommended to focus on three kinds of key business metrics: total success volume, success (or failure) rate, and failure volume.
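The characteristics listed above can be illustrated with a minimal sketch. It is not the algorithm behind the intelligent baseline alarm, which the text describes only as a mix of statistical methods and machine learning. As a simplified stand-in, the sketch learns a per-time-slot mean and standard deviation from past days, so a dip that recurs at the same time every day becomes part of the baseline, and it raises an alarm only when a deviation persists for several consecutive points, so short-lived spikes are ignored. The function names and the parameters `k`, `min_consecutive`, and `min_std` are hypothetical.

```python
from statistics import mean, pstdev


def build_baseline(history):
    """Learn (mean, std) per time slot.

    history: list of days, each a list of values for the same time slots.
    """
    return [(mean(slot), pstdev(slot)) for slot in zip(*history)]


def detect(today, baseline, k=3.0, min_consecutive=3, min_std=1.0):
    """Return (start, end) slot ranges where today's values leave the band.

    A point is abnormal when it differs from the slot mean by more than
    k * max(std, min_std). Runs shorter than min_consecutive are dropped.
    """
    alarms = []
    run = []
    for i, (value, (mu, sigma)) in enumerate(zip(today, baseline)):
        band = k * max(sigma, min_std)
        if abs(value - mu) > band:
            run.append(i)
        else:
            if len(run) >= min_consecutive:
                alarms.append((run[0], run[-1]))
            run = []
    # Close a run that lasts until the final slot.
    if len(run) >= min_consecutive:
        alarms.append((run[0], run[-1]))
    return alarms


# Seven past days with a scheduled dip in slot 2.
history = [[100, 100, 20, 100, 100, 100, 100, 100] for _ in range(7)]
baseline = build_baseline(history)

# Today: the usual dip (slot 2), a one-point spike (slot 3),
# and a sustained drop (slots 5 to 7).
today = [100, 100, 20, 150, 100, 60, 58, 61]
alarms = detect(today, baseline)  # [(5, 7)]
```

Only the sustained drop is reported: the scheduled dip matches the learned baseline, and the single-point spike is too short to qualify. No alarm rule was written by hand; the thresholds come from the metric's own history.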