Alibaba Cloud Elastic Computing SRE Practice
To fight Murphy's Law, we take precautions: we configure alerts on every system node so that we are not caught flat-footed when problems occur. At the same time, in pursuit of a higher problem-discovery rate (more alert items, looser thresholds, more things that do not actually matter), we fall into the paradox of alert coverage, which culminates in the now-prevalent phenomenon of "alert hell". The figure below shows a very typical positive feedback loop that produces more and more alert problems.
Would adding more alerts make things better? By the law of entropy increase, that path leads to irreversible damage: in the end no one is sure which alerts need handling, and alerts end up being ignored altogether. Is there a way out? Yes: create negative entropy! Subtract, step by step, from every link in the alerting chain. Today we will talk about the six elements of that chain.
Among these six elements, some are widely recognized and obvious, while others are often overlooked, which is exactly what leads to alert hell. This article distills day-to-day experience in alert definition, notification, and governance into a practical standard: handle alerts well, keep resolving hidden risks, and the system develops in a stable and healthy direction.
01 Accurate: the alert itself is accurate and correctly routed
Of the large number of ignored alerts, a significant portion can be called inaccurate: nothing actually goes wrong if they are not addressed. So the first half of accuracy is that an alert genuinely deserves to fire. Alerts that require no action create a "cry wolf" effect: alerts are increasingly ignored, until the ones that truly need handling are missed. We once found a team that never looked at the alerts fired every few hours; only short bursts of high-density notifications drew any attention. Such a team grows ever more immune to alerts, and ever more at risk. Invalid notifications also waste resources outright, such as SMS and telephone charges. So act now and get rid of invalid alerts.
The second half of accuracy is notifying the correct recipient. Do not assume that the more people who are notified, the more likely the alert is to be handled. In practice, uninvolved recipients tend to wait and see, and in the end no one acts at all, because those people have neither the motivation nor the ability to respond. I have seen a case where people unrelated to an alert ended up asking the team that should have been responding, "does your system have a problem?". The forwarded notification technically worked in that case, but it was embarrassing for the team that should have received and handled the alert in the first place. In addition, whoever receives an alert should acknowledge it: on one hand this tells concerned colleagues that the alert is being handled, and on the other hand it produces the data needed to measure alerting.
To sum up, the accuracy element means notifying the correct recipient of a genuine alert; neither half can be dropped. It is also a handshake: after receiving the notification, the recipient must take over and be ready to handle it.
02 Timely: timely notification, timely emergency response
If we chase alert response rate, wouldn't it be best to notify and respond instantly? In some cases, yes; but remember that we are creating negative entropy. In most cases "timely" is enough, and excessive tension and panic should be avoided.
First, use notification channels of different strengths in different time periods. For example, urgent alerts at night may not reach anyone in time over SMS or IM, so a stronger channel such as a phone call is needed; during normal working hours, when everyone is online, that is unnecessary. For very serious alerts it is still worth keeping urgent, strong notifications for the sake of timeliness, but we recommend using them as sparingly as possible.
Second, on the emergency side, an alert that gets no timely response can be escalated to a stronger notification channel, and the handlers can be escalated as well, for example to supervisors or SREs, so that a timely response is achieved. Not every alert needs urgent handling: if a service goes down on one of many machines online, business traffic is unaffected and it can be dealt with later.
Finally, make sure the same alert is not sent repeatedly within a short period, so that responders are not drowned in an alert bombardment. The usual approach is to aggregate alerts, compute the relevant statistics, and then apply suppression according to the alert's classification, or let a person choose to pause a given alert for a period of time. Once the first alert is claimed, the relevant emergency handling has started, and further pushes of the same alert are just interference; the effect of the handling can be observed through monitoring, so there is no need to keep reporting it.
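The suppression logic described above (collapse repeats within a window, stop pushing once an alert is claimed) could be sketched roughly like this. The class, the field names, and the 300-second window are assumptions for illustration, not ECS internals:

```python
import time
from collections import defaultdict

class AlertSuppressor:
    """Suppress duplicate alerts within a time window, and stop pushing
    an alert once a responder has claimed it (illustrative sketch)."""

    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.last_sent = {}                  # fingerprint -> last notify timestamp
        self.suppressed = defaultdict(int)   # fingerprint -> suppressed count (for statistics)
        self.claimed = set()                 # fingerprints already claimed by a responder

    def fingerprint(self, alert):
        # Group alerts by rule and target so repeats collapse into one notification.
        return (alert["rule"], alert["target"])

    def should_notify(self, alert, now=None):
        now = time.time() if now is None else now
        fp = self.fingerprint(alert)
        if fp in self.claimed:
            # Handling has already started; further pushes are just noise.
            self.suppressed[fp] += 1
            return False
        last = self.last_sent.get(fp)
        if last is not None and now - last < self.window:
            self.suppressed[fp] += 1
            return False
        self.last_sent[fp] = now
        return True

    def claim(self, alert):
        self.claimed.add(self.fingerprint(alert))
```

The suppressed counts feed the measurement element later: they show which alerts are the noisiest and are candidates for threshold tuning.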
03 Exhaustive: Give scope, context, and diagnostic information
After receiving an alert, the first task is to pin down the problem and then take the corresponding isolation or bleeding-stopping measures. If the alert content is too thin, this becomes a guessing game: you have to return to the scene and verify your speculation by various means before the problem can be identified. This is the most time-consuming part of alert handling, it leans heavily on veterans' experience, and newcomers can hardly take part. If the alert itself carries the impact scope, context, and diagnostic information, localizing the problem becomes far more efficient.
First, the impact scope of an alert is an important input for judging the priority of the response. It spans dimensions such as resources, users, and business:
• Resources: a single machine, a cluster, related dependencies;
• Users: individual, some or all customers;
• Business: core business, bypass business, non-core business.
If the impact is an isolated case, isolation can be performed quickly: isolate the single machine, remove the non-core link, or apply flow control and degradation for the single client. Conversely, if the impact is at the region level, a more urgent response mechanism is needed, such as pulling in more people to handle it (the supervisor, the team's SREs, and so on), and even escalating to a fault.
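As a rough illustration of mapping impact scope to response priority along the resource/users/business dimensions listed above (the priority names and the exact rules are assumptions, not the ECS policy):

```python
def classify_priority(scope):
    """Map an alert's impact scope to a response priority (illustrative sketch)."""
    resource = scope.get("resource", "single")    # single | cluster | region
    users = scope.get("users", "individual")      # individual | partial | all
    business = scope.get("business", "non-core")  # core | bypass | non-core

    if resource == "region" or users == "all":
        # Region-level or all-customer impact: pull in the supervisor and
        # SREs, and possibly escalate to a formal fault.
        return "P0"
    if resource == "cluster" or users == "partial" or business == "core":
        # Urgent handling by the on-call owner.
        return "P1"
    # Isolated case: isolate the machine, remove the non-core link,
    # or flow-control the single client, then follow up.
    return "P2"
```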
Second, contextual information makes localization more efficient: it helps rule out wrong directions and saves the trouble of reconstructing the scene. It includes:
◾ Trace: a link to the trace that triggered the alert, which is equivalent to restoring the scene;
◾ Log: a link to the detailed error log, which pinpoints the exact code and stack at the time;
◾ Association: related alerts or changes, which make it quick to determine whether this alert was caused by another problem or by a change.
With context information in hand, the path for further investigation is basically set; without it, you must gather information across various platforms, or even try to reconstruct the scene, and those operations are not only time-consuming but may fail to yield what you need because the information is incomplete or has already aged out.
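To make this concrete, here is a hypothetical enriched alert payload and a renderer that attaches the trace, log, and association context so the responder does not rebuild the scene by hand. All field names and URLs are invented for illustration:

```python
# A hypothetical enriched alert payload; field names and URLs are illustrative.
alert = {
    "title": "API error rate exceeded threshold",
    "scope": {"resource": "cluster", "users": "partial", "business": "core"},
    "context": {
        "trace": "https://trace.example.com/t/abc123",      # restores the scene
        "log": "https://log.example.com/q?traceId=abc123",  # code + stack at the time
        "associations": [
            {"type": "change", "id": "chg-42", "desc": "config push 5 min earlier"},
        ],
    },
}

def format_notification(alert):
    """Render an alert with its context links embedded in the notification body."""
    lines = [alert["title"]]
    ctx = alert.get("context", {})
    for key in ("trace", "log"):
        if key in ctx:
            lines.append(f"{key}: {ctx[key]}")
    for assoc in ctx.get("associations", []):
        lines.append(f"associated {assoc['type']}: {assoc['id']} ({assoc['desc']})")
    return "\n".join(lines)
```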
Finally, diagnostic information in the alert can even state the demarcated cause directly, eliminating investigation time altogether. Localizing a problem usually depends on the handler's understanding of the business, command of the tools, the platform's data, and prior experience. Alert diagnosis turns that mental process, via rules plus data, into a result that is output directly. Diagnosis requires a certain amount of system capability building; for example, ECS supports diagnosis with the following architecture:
1. Alert information from different channels must be integrated. If alerts are sent out directly, the diagnosis link is lost; alerts must first pass through the diagnosis layer before any subsequent diagnostic action is possible.
2. A certain information-collection capability is required, covering meta information (applications, databases, APIs, people, on-duty schedules, ...), change records, operation and maintenance tools, logs, and other alert-related data.
3. A certain information-integration capability is required: combine the alert with the collected information and produce a diagnosis result under the diagnostic framework.
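The three steps could be sketched as a minimal diagnosis layer, assuming a collectors-plus-rules design; all class and parameter names here are illustrative, not the actual ECS architecture:

```python
class DiagnosisLayer:
    """Minimal sketch of the three steps: ingest alerts from multiple
    channels, collect related information, then run rule-based diagnosis."""

    def __init__(self, collectors, rules):
        self.collectors = collectors  # name -> callable(alert) -> collected facts
        self.rules = rules            # ordered list of (predicate, conclusion)

    def ingest(self, alert):
        # Step 1: every channel converges here, so the diagnosis link is kept.
        facts = dict(alert)
        # Step 2: information collection (meta info, changes, logs, ...).
        for name, collect in self.collectors.items():
            facts[name] = collect(alert)
        # Step 3: integrate and apply the diagnostic framework (rules + data).
        for predicate, conclusion in self.rules:
            if predicate(facts):
                facts["diagnosis"] = conclusion
                break
        return facts
```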
04 Recovery: isolation, stopping the bleeding, self-healing
Recovery is the first priority in alert handling: eliminate the impact on the system, the business, and customers first, and only then investigate the root cause. Element 3 (Exhaustive: give scope, context, and diagnostic information) is about locating where the problem is; recovery is about taking the corresponding remedial action.
Under normal circumstances, recovery requires performing some action. If the alert can go one step further and offer the recovery operation itself, response speed improves greatly.
In general, a recovery action takes one of the following paths:
◾ Fault self-healing: judge the impact from the alert and trigger a prefabricated action to heal the fault automatically. This first requires binding callback actions to the alert, so that the correct self-healing operation can be selected from the alert's content; second, it requires bounding the blast radius of the action's execution, so that automatic execution does not cause a secondary fault. For example, on detecting a single-machine problem we can remove that machine from service. Self-healing demands a sound judgment of the scope and impact of execution; only actions we are confident in should run automatically.
◾ Bleeding-stopping actions can be embedded in the alert content via links or chatops, so that clicking the relevant action quickly eliminates the impact. For example, we attach actions such as flow control, restart, and switchover to the alert notification, and the operation completes with one click.
◾ Best practices, runbooks, or the relevant contacts: when no bleeding-stopping action is available, give instructions for proceeding, so the responder can follow the runbook or contact someone who can handle it.
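The three recovery paths might be wired together roughly as follows. The function, the `auto_safe` flag, and the single-machine check are assumptions for illustration:

```python
def choose_recovery(alert, actions, runbooks):
    """Pick a recovery path following the three routes above (sketch).
    'actions' maps alert rules to (callable, auto_safe) pairs;
    'runbooks' maps alert rules to manual instructions or contacts."""
    rule = alert["rule"]
    if rule in actions:
        action, auto_safe = actions[rule]
        # Self-heal only when the blast radius is well understood,
        # e.g. an isolated single-machine problem.
        if auto_safe and alert.get("scope") == "single":
            action(alert)
            return "self-healed"
        # Otherwise surface the action as a one-click link / chatops command.
        return f"one-click action available: {action.__name__}"
    # Fall back to a runbook or a contact when no action is bound.
    return runbooks.get(rule, "contact the on-call owner")
```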
At this point, optimizing the first four elements has carried us through the entire emergency-response process. The remaining two elements focus on alert operations: through efficient, effective operations we take alert governance further.
05 Coverage: automatic coverage through templates
In fault reviews, many problems turn out to have been discovered late because the corresponding monitoring was missing. It is impossible to cover every conceivable monitoring item, but experience can be accumulated and inherited: general, standardized alerts apply to most businesses, and templates extend their coverage much further.
Alerts of the same type share similar monitoring items and often even similar threshold definitions. Such standard monitoring must be able to cover multiple applications and even businesses quickly, with an inspection mechanism ensuring nothing is missed. In general, monitored items divide into basic monitoring and business monitoring, and the definitions of monitoring and alerting differ with business importance. For example, we assign a response level to monitoring according to the application's level and cover monitoring through templates by level, which avoids forgotten configuration and also covers newly added resources or services automatically.
General monitoring should be accumulated and templated so that experience is passed on and similar problems do not recur. For example, a sibling team once had a failure caused by a release script: machines removed from the vip during the release were not remounted after it completed, and when the last machine was released the vip had no servers behind it at all, making the service unavailable. From the review we distilled a general rule: fire an alert once more than xx% of the machines mounted on a vip have been removed. Applied across multiple organizations, it prevents similar incidents in the future.
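A minimal sketch of template-based coverage, assuming alert templates keyed by application level; the template contents and numeric thresholds below are invented placeholders (the real rule's xx% is deliberately not guessed):

```python
# Hypothetical alert templates keyed by application level; thresholds are placeholders.
TEMPLATES = {
    "core":     [{"metric": "vip_removed_ratio", "op": ">", "threshold": 0.5},
                 {"metric": "error_rate",        "op": ">", "threshold": 0.01}],
    "non-core": [{"metric": "vip_removed_ratio", "op": ">", "threshold": 0.8}],
}

def apply_templates(applications):
    """Expand templates into concrete alert rules so that applications,
    including newly added ones, are covered automatically rather than
    each being configured (or forgotten) by hand."""
    rules = []
    for app in applications:
        for template in TEMPLATES.get(app["level"], []):
            rules.append({"app": app["name"], **template})
    return rules
```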
06 Metric: Negative Entropy Through Data Statistics
Finally, let's talk about alert statistics. As the management guru Peter Drucker put it:
"If you can't measure it, you can't manage it." — Peter Drucker
Measurement is what closes the loop of alert governance. In essence we apply "lean" thinking: find problems through data feedback, tackle those problems, then look at the data again.
Measured data can include the following:
◾ Raw alert data: used for follow-up analysis and improvement; count notification frequency and alert volume, for example, an average of no more than three alerts per day, plus red-and-black lists of the top people, teams, and applications.
◾ Whether marked invalid: used to purge invalid alerts.
◾ Whether claimed: publish claim rates through red-and-black lists.
◾ Time to claim: drives emergency-response improvement; for faults, refer to "1-5-10" (discover in 1 minute, locate in 5 minutes, recover in 10 minutes).
◾ Time to resolve: helps complete "1-5-10".
◾ Alert-tool usage: used to improve the accuracy of tool recommendations.
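The measures above could be computed from alert records along these lines; the record fields (`owner`, `invalid`, `fired_at`, `claimed_at`) are assumed for illustration:

```python
def alert_metrics(records):
    """Compute governance metrics from alert records (sketch). Each record is
    assumed to carry 'owner', 'invalid', 'fired_at', and optional 'claimed_at'."""
    total = len(records)
    invalid = sum(1 for r in records if r.get("invalid"))
    claimed = [r for r in records if r.get("claimed_at") is not None]
    per_owner = {}
    for r in records:
        per_owner[r["owner"]] = per_owner.get(r["owner"], 0) + 1
    return {
        "total": total,
        "invalid_ratio": invalid / total if total else 0.0,
        "claim_rate": len(claimed) / total if total else 0.0,
        # Mean time to claim, tracked against the "1" in 1-5-10.
        "avg_claim_seconds": (
            sum(r["claimed_at"] - r["fired_at"] for r in claimed) / len(claimed)
            if claimed else None
        ),
        # Basis for the red-and-black lists: noisiest owners first.
        "top_owners": sorted(per_owner, key=per_owner.get, reverse=True),
    }
```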
With these operational data in hand, governance has a starting point, and continuous operation steers alerting in a healthier direction.
Finally, an internal practice from Alibaba Cloud elastic computing illustrates how we combine the six elements for alert governance.
On the alerting side, we built a unified alerting platform and used engineering methods to embed the six elements into alert life-cycle management.
Knowledge Base Team