Alibaba Cloud Elastic Computing SRE Practice

The stability of a system depends on an effective alerting mechanism. According to Murphy's Law, anything that can go wrong will go wrong. We cannot predict exactly when and where a system will fail, but we do know what failure looks like: slow interface responses, unavailable services, drops in business traffic, customer operations that cannot complete, or even customer complaints.

To fight Murphy's Law, we proactively configure alerts on every system node so that we are not caught flat-footed when problems occur. At the same time, in pursuit of problem-discovery coverage (more alert rules, looser thresholds, more trivial content), we fall into the coverage paradox, which ultimately produces the familiar "alert hell." As the figure below shows, this is a typical positive feedback loop: alert noise keeps compounding.

Figure: the positive feedback loop of alerting

Do more alert rules make things better? By analogy with the law of entropy, the process tends toward irreversible disorder: eventually no one is sure which alerts need handling, and alerts get ignored altogether. Is there a way out? Yes: negative entropy. Below we walk through the six elements of the alerting process.

Among the six elements, some are widely recognized and obvious, while others are often overlooked, and that neglect is what produces alert hell. This article distills day-to-day experience with alert definition, notification, and governance into a practical standard for alert governance. The focus is on keeping alert handling healthy, continuously eliminating hidden risks, and promoting the stable, healthy evolution of the system.

01 Accuracy: the alert itself is accurate and is delivered correctly

Among the large number of ignored alerts, many can fairly be called inaccurate: nothing bad happens even if they are never handled. The first definition of accuracy is that an alert has genuinely reached the level that warrants a warning. Alerts that need no handling produce a "crying wolf" effect: people grow indifferent and eventually miss the alerts that really matter. We have seen teams where nobody looks at alerts for hours at a stretch, and only a short burst of high-density notifications attracts attention; such a team grows ever more immune to alerts, and ever more at risk. Invalid notifications also waste resources, such as SMS and phone charges. So, starting now, take action to eliminate invalid alerts.

The second definition of accuracy is that alerts are delivered to the correct recipients. Do not assume that notifying more people makes handling more likely. In practice, uninvolved people tend to wait and see, so nobody acts at all, because bystanders have neither the motivation nor the ability to respond. I once saw a case where an engineer unrelated to the alert had to ask the owning team to check whether their system had a problem. The roundabout notification worked in that instance, but it is embarrassing, and dangerous, for the team that should have been paged directly. In addition, the engineers who receive an alert need to respond to it explicitly: acknowledging tells colleagues that the alert is being handled, and it produces the data later needed to measure alert handling.

In summary, the accuracy element means delivering real alert information to the correct receiver; neither a false alert nor a wrong receiver is acceptable. It is also a handshake: after receiving the notification, the receiver should take ownership and prepare to handle it.

02 Timeliness: notify in time, respond in time

Measured by response rate alone, it may seem that the more immediate the notification and response, the better, and in some cases that is true; but do not forget the negative-entropy work. In most cases it is enough to be timely, without creating excessive tension and panic.

First, use notification channels of different strength in different time periods. For urgent alerts at night, SMS or IM may not reach anyone in time, so stronger channels such as phone calls are needed; during normal working hours, when everyone is online anyway, they are unnecessary. For very serious alerts we still keep urgent, strong notification to guarantee timeliness, but we recommend using it as sparingly as possible.
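As a sketch, channel selection by severity and time of day might look like the following; the severity names, working-hours window, and channel choices are assumptions for illustration:

```python
from datetime import time

def pick_channel(severity: str, now: time) -> str:
    """Use the strongest channel (phone) only when really needed:
    for critical alerts at any time, or high-severity alerts
    off-hours. Otherwise prefer less intrusive channels."""
    working_hours = time(9) <= now <= time(18)
    if severity == "critical":
        return "phone"
    if not working_hours:
        return "phone" if severity == "high" else "sms"
    return "im"

pick_channel("critical", time(14))  # -> "phone"
pick_channel("high", time(3))       # -> "phone" (off-hours)
pick_channel("low", time(14))       # -> "im"
```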

Second, on the emergency-response side, an alert that receives no timely response can be escalated to a stronger notification channel; the handlers can be escalated as well, for example to the team lead or an SRE, so that a timely response is still achieved. Not every alert needs urgent handling: if it affects only one of many online machines and business traffic is unaffected, it can be handled later.

Finally, make sure the same alert is not sent repeatedly within a short interval, so that responders are not buried under an alert storm. In general, alerts are aggregated and counted, then suppressed according to their classification, or the sending of a given alert class is manually paused for a while. Once the first alert is claimed, emergency handling has started; pushing further duplicates is pure interference, since the handling effect can be observed through monitoring, so there is no need to keep reporting the same alert.

03 Detail: include the impact scope, context, and diagnostic information

After receiving an alert, the first task is to identify the problem and then take the corresponding isolation or stop-the-bleeding measures. If the alert content is too thin, this turns into a guessing game: you must return to the scene and verify hypotheses through various means before you can pin down the problem, and this is where most alert-handling time goes. Worse, much of it relies on personal experience, so it is almost impossible for new engineers to participate. Including the impact scope, context, and diagnostic information in the alert therefore makes problem localization far more efficient.

First, the impact scope of an alert is an important indicator for judging emergency priority. The scope covers resource, user, and business dimensions:

• Resources: a single machine, a cluster, or related dependencies;

• Users: an individual customer, a subset of customers, or all customers;

• Business: core business, bypass links, and non-core business.

If the impact is an isolated case, isolation can be done quickly, such as taking a single machine out of service, removing a non-core link, or applying flow control or degradation for a single customer. If the impact is at the region level, a heavier emergency-response mechanism is needed, such as pulling in more people (team leads, SREs, other teammates), or even escalating to fault level.
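The triage logic above can be sketched as a simple mapping from blast radius to action; all the scope values and action names here are hypothetical:

```python
def triage(scope: dict) -> str:
    """Map blast radius (resources / users / business dimensions)
    to an emergency action, from heaviest to lightest."""
    if scope["business"] == "core" and scope["users"] == "all":
        return "declare-incident"    # escalate to fault level, pull in leads/SREs
    if scope["resources"] == "single-host":
        return "isolate-host"        # quick isolation of one machine
    if scope["users"] == "single-customer":
        return "throttle-customer"   # per-customer flow control / degradation
    return "page-oncall"             # default: let the on-call judge

triage({"business": "core", "users": "all", "resources": "cluster"})
# -> "declare-incident"
```

Checking the widest blast radius first keeps a core, all-customer outage from being mistaken for a routine single-host event.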

Second, context information makes localization far more efficient: it helps confirm the diagnosis without the trouble of reconstructing the scene. Context information includes:

◾ Trace: the trace link of the alerting request, which is equivalent to restoring the scene;

◾ Log: a link to the detailed error log, which locates the specific code and the current stack;

◾ Association: related alerts or changes, which makes it easy to judge whether the alert was caused by another problem or by a change.

With context information, the path for further troubleshooting is largely determined. Without it, you must visit various platforms to collect information, or even try to reconstruct the scene; these operations are time-consuming and, because data may be incomplete or already expired, may not yield the information you need at all.

Finally, the diagnostic information in an alert can even state the probable cause directly, eliminating troubleshooting time altogether. Problem localization usually depends on the handler's business understanding, tool proficiency, platform data, and past experience. Alert diagnosis turns this mental process into results through rules plus data, and outputs them directly. Diagnosis requires a certain amount of system building; for example, ECS uses the following internal architecture to support it:

1. Alerts from different channels must be ingested in a unified way. If alerts are sent out directly, the diagnosis step is lost; routing them through a diagnosis layer first enables the subsequent diagnostic actions.

2. A certain information-collection capability is required, covering various metadata (applications, databases, APIs, personnel, on-duty schedules, and so on), change records, operations tools, logs, and other alert information.

3. A certain information-integration capability is required: combine the alert with the collected information and, together with the diagnosis framework, produce the diagnosis result.
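Steps 1-3 can be illustrated with a toy enrichment function. The field names and the log-link URL are invented for the example and are not the real ECS diagnosis layer:

```python
def enrich(alert: dict, meta: dict, changes: list[dict]) -> dict:
    """Sketch of a diagnosis layer: before delivery, attach the
    owner from metadata, recent changes for the same application,
    and a log link that restores the scene."""
    app = alert["app"]
    # Step 2: collected metadata -> owner / on-duty info
    alert["owner"] = meta.get(app, {}).get("owner", "unknown")
    # Step 3: integrate associated changes (a common root cause)
    alert["recent_changes"] = [c for c in changes if c["app"] == app]
    # Context: a deep link into logs at the alert timestamp (hypothetical URL)
    alert["log_link"] = f"https://logs.example.com/{app}?t={alert['ts']}"
    return alert

alert = enrich(
    {"app": "ecs-api", "ts": 1700000000},
    meta={"ecs-api": {"owner": "alice"}},
    changes=[{"app": "ecs-api", "id": "chg-1"}, {"app": "other", "id": "chg-2"}],
)
```

The point of routing through such a layer first (step 1) is precisely that a raw, directly-sent alert could never carry these attached fields.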

04 Recovery: isolation, stopping the bleeding, and self-healing

Recovery is the single most important thing in alert handling: first eliminate the impact on the system, the business, and customers, and only then dig into the root cause. Element 3 (detail: impact scope, context, and diagnostic information) serves problem localization; recovery is about having the corresponding runbook ready to act on.

In general, recovery requires concrete actions. If the alert goes one step further and includes the recovery operation itself, response speed improves greatly.

Typically, recovery actions follow one of these paths:

◾ Fault self-healing: judge the impact from the alert and trigger prebound actions to complete self-healing. First, alerts support binding callback actions, and the right self-healing operation can be chosen from the alert content; second, the action's execution scope must be controlled to avoid secondary failures caused by automation. For example, we can detect a single-machine problem and remove that machine. Self-healing requires a sound judgment of the execution scope and impact, and only actions we are confident in should run automatically.

◾ Stop-the-bleeding actions can be embedded in the alert content via links or ChatOps, so that clicking the action quickly eliminates the impact. For example, we include flow-control, restart, and switchover actions in the alert notification, and the operation completes with one click.

◾ When there is no stop-the-bleeding action, the alert can point to best practices, the runbook, or the relevant contact person, so the handler can proceed according to the manual or reach someone who can help.
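These paths can be combined into one planning step: attempt a prebound self-healing action only when the blast radius is within that action's confidence limit, and otherwise fall back to a human with the runbook. The action names, limits, and runbook URL below are illustrative:

```python
SELF_HEAL = {
    # alert name -> (prebound action, max hosts it may touch automatically)
    "HostUnhealthy": ("remove_from_lb", 1),
}

def plan_recovery(alert_name: str, affected_hosts: int) -> tuple[str, str]:
    """Execute a prebound action automatically only when the blast
    radius is within the action's confidence limit; otherwise hand
    the alert to a human along with the runbook link (hypothetical URL)."""
    action, limit = SELF_HEAL.get(alert_name, (None, 0))
    if action and affected_hosts <= limit:
        return ("auto", action)
    return ("manual", "https://runbooks.example.com/" + alert_name)

plan_recovery("HostUnhealthy", 1)  # -> ("auto", "remove_from_lb")
plan_recovery("HostUnhealthy", 5)  # -> manual: too wide for automation
```

The per-action limit is the code-level expression of "only actions with confidence can run automatically": a fix safe for one host may cause a secondary failure if applied to five.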

At this point, by optimizing the first four elements, we have completed the full emergency-response loop for an alert. The remaining two elements focus on alert operations: managing alerts further through efficient, effective operation.

05 Coverage: automatic coverage through templates

In fault reviews, many problems turn out to have gone undiscovered simply because the corresponding monitoring was missing. It is impossible to cover every conceivable monitoring item, but experience can be accumulated and inherited: common, standard alerts apply to most businesses and can be rolled out broadly through templates.

Alerts of the same type share similar monitoring items and often similar threshold definitions. Such standard monitoring should be quickly applicable to multiple applications or even businesses, with a patrol mechanism to guarantee nothing is missed. In general, monitored items divide into basic monitoring and business monitoring, and the definitions of both monitoring and alerts vary with business importance. For example, we assign monitoring requirements by application level and stamp the monitoring out through templates according to that level, which avoids forgotten configuration; newly added resources or services are covered automatically as well.
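Template-driven coverage can be sketched as stamping a standard rule onto every application at its level; the metric name, level names, and thresholds below are assumptions, not real ECS values:

```python
TEMPLATE = {
    "metric": "http_5xx_ratio",
    # Tighter thresholds for more important (P1) applications.
    "threshold_by_level": {"P1": 0.01, "P2": 0.05},
}

def render_rules(apps: list[dict]) -> list[dict]:
    """Stamp the standard template onto every app at its level.
    Newly registered apps get covered automatically, and a patrol
    job can diff expected vs. actual rules to catch gaps."""
    return [
        {
            "app": a["name"],
            "metric": TEMPLATE["metric"],
            "threshold": TEMPLATE["threshold_by_level"][a["level"]],
        }
        for a in apps
    ]

rules = render_rules([
    {"name": "ecs-api", "level": "P1"},
    {"name": "report-job", "level": "P2"},
])
```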

General monitoring needs to be accumulated and templated so that experience is passed on and similar problems are not repeated. For example, a sister team once had a failure caused by a deployment script: machines removed from the VIP during deployment were not re-attached afterwards, so when the last machine was deployed, the VIP had no servers behind it at all and the business became unavailable. From the review we distilled a general rule: alert when more than xx% of the machines attached to a VIP have been removed, and we applied it across multiple organizations to prevent similar incidents.
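That distilled rule might be expressed as follows; the source leaves the threshold as "xx%", so the default here is a purely hypothetical 50%:

```python
def vip_detach_alert(total: int, detached: int,
                     threshold_pct: float = 50.0) -> bool:
    """Fire when more than threshold_pct of a VIP's backends are
    detached, so a bad deploy script cannot silently drain the
    whole VIP before anyone notices."""
    if total == 0:
        return True  # a VIP with no backends at all is already an outage
    return detached / total * 100 > threshold_pct

vip_detach_alert(total=10, detached=3)  # quiet: 30% detached
vip_detach_alert(total=10, detached=6)  # fires: 60% detached
```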

06 Measurement: negative entropy through data

Finally, the statistics of alert data. As the management scholar Peter Drucker put it:

"If you can't measure it, you can't manage it." -- Peter Drucker

Measurement is what closes the loop of alert governance. We essentially apply the "lean" idea here: find problems through data feedback, attempt to solve them, and then look at the data again.

The measured data can cover the following aspects:

◾ Per-alert details: used for subsequent analysis and improvement, counting notification frequency and alert volume, for example no more than 3 alerts per day on average, plus red/black lists of the top people, teams, and applications.

◾ Marked invalid: used to clean up invalid alerts.

◾ Whether claimed: the claim rate, published via the red/black lists.

◾ Time to claim: used to improve emergency response, referencing the "1-5-10" fault standard (1 minute to discover the problem, 5 minutes to locate it, 10 minutes to recover).

◾ Time to resolve: to better meet "1-5-10".

◾ Alert-tool usage: used to improve the accuracy of tool recommendations.
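A minimal aggregation of this operational data, assuming timestamped alert records (all field names invented for the example), could be:

```python
from statistics import mean

def alert_metrics(records: list[dict]) -> dict:
    """Aggregate the operational data above: claim rate, mean time
    to claim (MTTA), and mean time to resolve (MTTR), in seconds."""
    claimed = [r for r in records if r.get("claimed_at") is not None]
    resolved = [r for r in records if r.get("resolved_at") is not None]
    return {
        "claim_rate": len(claimed) / len(records) if records else 0.0,
        "mtta": mean(r["claimed_at"] - r["fired_at"] for r in claimed)
                if claimed else None,
        "mttr": mean(r["resolved_at"] - r["fired_at"] for r in resolved)
                if resolved else None,
    }

records = [
    {"fired_at": 0, "claimed_at": 60, "resolved_at": 600},   # handled
    {"fired_at": 0, "claimed_at": None, "resolved_at": None},  # ignored
]
alert_metrics(records)
# -> {"claim_rate": 0.5, "mtta": 60, "mttr": 600}
```

The ignored record dragging the claim rate down to 0.5 is exactly the kind of signal that feeds the red/black lists and the "1-5-10" comparison.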

With this operational data, governance has something concrete to grab hold of. Only continuous operation can move alerting in a healthier direction.


Finally, drawing on an internal practice of Alibaba Cloud Elastic Compute, here is how we combine the six elements in alert governance.

We built a unified alerting platform that engineers the six elements into alert lifecycle management.

First, we consolidated most of the alert channels. Internally we have multiple alert sources such as SLS, POP, ARMS, DAS, and Prometheus. These alerts are delivered to an alert gateway rather than to specific people, and the gateway analyzes, structures, and diagnoses the data.

The diagnosis stage outputs the impact surface, the root cause, and quick-recovery tools; it is the most complex part and requires real information-integration and diagnostic capability. First, we associate alert information with meta, the foundational dataset we continuously build internally, covering resource information (machines, databases, APIs, applications, etc.), organizational information (org structure, owners, on-duty schedules), and policy information (emergency procedures, escalation strategies, notification strategies, etc.). We also diagnose the impact surface of the alert content (affected customers, regions, resources, APIs, etc.), preserve the context, capture logs or trace links, and finally attach recovery tools based on the analysis of the alert content. All of this is rendered through a template engine and pushed to the specific person.

During diagnosis, if we find that an alert needs to be escalated, we automatically upgrade the basic alert and start the corresponding emergency-response process. For example, we identify customers covered by heightened-assurance commitments and start the emergency alert process for them.

Finally, the data is quantified into the corresponding red/black lists so that metrics falling short of the standard are continuously governed; through ongoing data analysis, we keep iterating and close the loop.
