Alibaba Cloud Elastic Computing SRE Practice with Hundreds of Millions of Calls

System stability is inseparable from an effective early warning mechanism. Murphy's law says that anything that can go wrong will go wrong. We cannot predict exactly when and where the system will fail, but we do know the symptoms when it does: interface responses slow down, services become unavailable, business traffic drops, customer operations fail, or customers even file complaints.

To counter Murphy's law, we must configure early warnings on the various system nodes in advance so that we are not caught off guard when problems occur. At the same time, in pursuit of a higher problem detection rate, we add more early warning items, set more unreasonable thresholds, and send more insignificant content. This is the paradox of early warning coverage, and it has led to today's widespread "early warning hell". As the figure below shows, this is a typical positive feedback loop that reinforces itself and produces more and more early warning problems.

Figure: Positive feedback loop of early warning

Will more early warning items improve the situation? By the law of increasing entropy, this process only leads to irreversible damage: eventually it becomes unclear which warnings need handling, and warnings may be ignored altogether. Is there a way out? Yes: pursue negative entropy. We will walk through the six elements of early warning one by one.

Some of these six elements are widely recognized and obvious, while others are often overlooked, and overlooking them leads to early warning hell. This article draws on day-to-day experience in defining, notifying and governing early warnings, and describes an ideal practice standard for early warning governance. It focuses on keeping early warnings well managed, continuously resolving hidden system risks, and promoting the stable, healthy development of the system.

01 Accurate: the warning itself is accurate and correct

Among the many alerts that get ignored, a large share can be called inaccurate: even if nobody handles them, no actual problem occurs. The first meaning of accuracy is that the alert genuinely reaches warning level. Warnings that do not need handling cause a "cry wolf" effect: warnings are ignored more and more, and eventually one that really needs handling is missed. We have seen teams where warnings go unread for hours, and only a short, dense burst of warnings attracts any attention. Such a team becomes increasingly immune to early warning, and increasingly exposed to risk. Invalid warning notices also waste resources, such as SMS and phone charges. So start now: eliminate invalid warnings.

The second meaning of accuracy is that alerts reach the correct recipients. Do not assume that the more people receive a warning, the more likely it is to be handled. Irrelevant recipients tend to wait and see, and in practice nobody acts, because these people have neither the motivation nor the ability to do so. I once saw a case in which colleagues who had nothing to do with a warning had to tell the responsible team to check whether their system had a problem. The unrelated notice worked that time, but it was both embarrassing and alarming for the team that should have received the warning and responded. The colleagues who do receive a warning also need to respond to the notice. On the one hand, this tells everyone concerned that the warning has been picked up and is being handled; on the other hand, it prepares the data needed to measure early warning later.

To sum up, the accuracy element means sending real warning information to the correct receiver. Both conditions are required: a warning that is not real, or one addressed to the wrong receiver, should not be sent. It is also a handshake process: after receiving the notification, the receiver should take over and prepare to handle it.
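The routing and handshake described above can be sketched as follows. This is a minimal illustration, not the platform's actual implementation; the `OWNERS` table, the `Alert` fields and the function names are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical ownership table: application name -> responsible on-call person.
OWNERS = {"order-api": "alice", "billing": "bob"}


@dataclass
class Alert:
    name: str
    app: str
    actionable: bool                 # False means nothing breaks if it is ignored
    acked_by: Optional[str] = None   # set once the receiver takes over


def route(alert: Alert) -> Optional[str]:
    """Return the one responsible receiver, or None if nothing should be sent."""
    if not alert.actionable:
        return None                  # not a real warning: notify nobody
    return OWNERS.get(alert.app)     # an unknown owner also yields None


def acknowledge(alert: Alert, user: str) -> None:
    """Handshake: only the routed receiver can take over the alert."""
    if route(alert) != user:
        raise PermissionError(f"{user} is not the receiver of {alert.name}")
    alert.acked_by = user
```

Recording `acked_by` is what later makes claim rate and claim time measurable.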

02 Timely: timely notification and emergency response

Judged by response rate alone, the sooner the notification and the response, the better. But do not forget negative entropy: in most cases it is enough to be timely in proportion to the problem, which avoids excessive tension and panic.

First, use notification channels of different strength at different times of day. For an urgent warning at night, SMS or IM is unlikely to reach anyone in time, so a stronger method such as a phone call is needed. During normal working hours everyone is online, so this is unnecessary. Very serious warnings still need urgent, strong notification to be timely, but we recommend using it as sparingly as possible.

Second, for emergency response, a warning that gets no timely response can be escalated to a stronger notification channel. The handlers can also be escalated, for example to the supervisor and the SRE, to ensure a timely response. Not every alert needs urgent handling. For example, a problem on one of many online machines does not affect business traffic and can be handled later.
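A minimal sketch of channel selection and escalation along these lines is shown below. The severity labels, the working hours and the channel ladder are assumptions chosen for illustration; the article does not specify them.

```python
from datetime import time

# Hypothetical channel ladder, weakest to strongest.
LADDER = ["im", "sms", "phone"]


def pick_channel(severity: str, now: time,
                 work_start: time = time(9), work_end: time = time(18)) -> str:
    """Choose the first notification channel from severity and time of day."""
    if severity == "critical":
        return "phone"                      # very serious: always strong notification
    in_working_hours = work_start <= now < work_end
    if severity == "urgent" and not in_working_hours:
        return "phone"                      # at night IM or SMS may not be seen in time
    return "im"                             # everyone is online, or it can wait


def escalate(channel: str, minutes_unclaimed: float, timeout_minutes: float) -> str:
    """Move one step up the ladder when nobody has responded in time."""
    if minutes_unclaimed < timeout_minutes:
        return channel
    index = LADDER.index(channel)
    return LADDER[min(index + 1, len(LADDER) - 1)]
```

The same ladder idea could be applied to people (owner, then supervisor, then SRE) instead of channels.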

Finally, the same warning should not be sent repeatedly within a short time, or the receiver will be buried under a warning bombardment. Typically, alerts are merged and counted, and then suppressed according to their classification, or sending is paused for a period by manual selection. Once the first warning has been claimed, the emergency response has started, and pushing further related warnings is mostly interference. The effect of the response can be observed through monitoring, so there is no need to keep reporting the same warning.
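The suppression rules above can be sketched as a small class. This is an illustrative sketch under assumed names (`Suppressor`, `should_send`, `claim`); a real platform would also handle expiry of claims and grouping of related alerts.

```python
class Suppressor:
    """Suppress repeats of the same alert within a window, and after it is claimed."""

    def __init__(self, window_seconds: float):
        self.window = window_seconds
        self.last_sent = {}    # alert key -> timestamp of the last notification
        self.suppressed = {}   # alert key -> number of suppressed repeats
        self.claimed = set()   # alert keys that someone is already handling

    def claim(self, key: str) -> None:
        self.claimed.add(key)

    def _count(self, key: str) -> None:
        self.suppressed[key] = self.suppressed.get(key, 0) + 1

    def should_send(self, key: str, now: float) -> bool:
        if key in self.claimed:
            self._count(key)               # response already started: do not interrupt
            return False
        last = self.last_sent.get(key)
        if last is not None and now - last < self.window:
            self._count(key)               # repeat within the window: merge and count
            return False
        self.last_sent[key] = now
        return True
```

The `suppressed` counts keep the statistics that the text mentions, so merged alerts are not silently lost.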

03 Detailed: give the scope of influence, context and diagnostic information

After receiving an alert, the first task is to identify the problem and take the corresponding isolation or stop-the-bleeding (hemostasis) measures. If the alert content is too terse, handling turns into a guessing game: you have to go back to the scene and verify and speculate by various means to pin down the problem. This is the most time-consuming part of handling an alert. It also relies heavily on experience, which makes it very hard for new colleagues to step in. If the early warning includes the scope of influence, context and diagnostic information, locating the problem takes half the effort.

First, the scope of influence of an early warning is an important indicator for judging emergency priority. It covers resource, user and business dimensions:

• Resources: single machine, cluster and related dependencies;

• Users: individual, partial or all customers;

• Business: core business, bypass business and non-core business.

If the scope of influence is an isolated case, it can be isolated quickly, for example by isolating a single machine, removing a non-core link, or applying flow control or degradation to a single customer. Conversely, if the impact is wide, more emergency response mechanisms are needed, such as pulling in more people (supervisors, SREs and other team members) or even escalating to an incident.
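As a rough sketch of how the three dimensions could drive the response, consider the function below. The category strings mirror the list above, but the decision thresholds and the plan names (`isolate`, `handle`, `escalate`) are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    resources: str   # "single", "cluster" or "dependencies"
    users: str       # "individual", "partial" or "all"
    business: str    # "core", "bypass" or "non-core"


def response_plan(impact: Impact) -> str:
    """Map the scope of influence to a coarse response plan."""
    if impact.resources == "single" and impact.users == "individual":
        return "isolate"     # isolated case: remove the machine or limit one customer
    wide_core = impact.business == "core" and impact.resources != "single"
    if impact.users == "all" or wide_core:
        return "escalate"    # wide impact: pull in supervisor, SRE, maybe declare an incident
    return "handle"          # in between: the owner handles it normally
```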

Second, context information makes locating the problem far more efficient. It helps in diagnosing the error and saves the trouble of reconstructing the scene. Context information includes:

◾ Trace: a link to the trace of the request that triggered the warning, which is equivalent to restoring the scene;

◾ Log: a link to the detailed error log, which locates the specific code and the stack at the time of the error;

◾ Association: related alerts or changes, which make it easy to quickly determine whether the alert was caused by another problem or by a change.

With context information, the path for further troubleshooting is largely determined. Without it, we have to collect information from various platforms, or even reconstruct the scene. These operations are time-consuming, and the required information may no longer be obtainable because it is incomplete or too much time has passed.

Finally, the diagnostic information in an early warning can even state the cause and boundary of the problem directly, which saves troubleshooting time. Locating a problem usually depends on the handler's understanding of the business, command of the tools, the platform's data and previous experience. Early warning diagnosis turns this mental process into a result through rules plus data and outputs it directly. Diagnosis requires building certain system capabilities. For example, the architecture used inside ECS to support diagnosis has the following requirements:

1. Early warning information from different channels must be connected. If a warning is sent out directly, the diagnosis step is lost. Warnings should be routed to the diagnosis layer first, so that the follow-up diagnosis can take place.

2. It needs information collection capabilities, covering meta information (application, database, API, personnel, on-duty schedule...), change information, operation and maintenance tools, logs and other early warning information.

3. It needs information integration capabilities: combine the early warning with the collected information and produce a diagnosis result through the diagnosis framework.
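The three requirements above can be sketched as a tiny pipeline. This is a toy illustration, not the ECS architecture: `META`, `CHANGES`, the field names and the single rule are hypothetical stand-ins for the real meta, change and rule systems.

```python
# Hypothetical lookup data standing in for the meta and change platforms.
META = {"order-api": {"owner": "alice", "on_duty": "bob"}}
CHANGES = [{"app": "order-api", "what": "release in progress"}]


def normalize(source, raw):
    """Step 1: bring alerts from any channel into one structure."""
    return {"source": source,
            "app": raw.get("app", "unknown"),
            "message": raw.get("msg", "")}


def collect(alert):
    """Step 2: gather meta and change information for the alerting application."""
    return {
        "meta": META.get(alert["app"], {}),
        "changes": [c for c in CHANGES if c["app"] == alert["app"]],
    }


def rule_recent_change(alert, info):
    """One example rule: a concurrent change is a likely suspect."""
    if info["changes"]:
        return "possibly related to change: " + info["changes"][0]["what"]
    return None


RULES = [rule_recent_change]


def diagnose(source, raw):
    """Step 3: integrate the alert with collected information and apply the rules."""
    alert = normalize(source, raw)
    info = collect(alert)
    results = (rule(alert, info) for rule in RULES)
    findings = [r for r in results if r is not None]
    return {**alert, "owner": info["meta"].get("owner"), "findings": findings}
```

Keeping rules as plain functions in a list is one simple way to let experience accumulate: each lesson learned becomes one more rule.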

04 Recovery: isolation, hemostasis and self-healing

Recovery is the first priority of early warning handling: eliminate the impact on the system, business and customers first, and troubleshoot afterwards. Element 3 (detailed: give the scope of influence, context and diagnostic information) serves the same goal: locating the problem so that the corresponding recovery measures can be taken.

Recovery usually requires performing some action. If the early warning goes one step further and offers the recovery operation itself, response speed improves greatly.

Recovery actions generally follow one of these execution paths:

◾ Fault self-healing: judge the impact from the early warning and complete self-healing by triggering a predefined action. First, early warnings must support binding callback actions, so that the correct self-healing operation can be selected from the warning content. Second, the scope of action execution must be controlled, to avoid secondary faults caused by automatic execution. For example, when we detect a single-machine problem, we can remove that machine. Self-healing must assess the scope and impact of execution, so that only actions we are confident in are executed automatically.

◾ Hemostasis actions: actions can be embedded in the alert content as links or chat buttons, and clicking one quickly eliminates the impact. For example, we include actions such as flow control, restart and switchover in the warning notice, so the operation can be completed with one click.

◾ When no hemostasis action is available: the alert gives best practices, an operation manual or a relevant contact, so that the receiver can proceed according to the manual or contact someone who can handle the problem.
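The scope control for self-healing described in the first path can be sketched as a guard on the fraction of machines an automatic action may touch. The function name, the `max_fraction` parameter and the outcome labels are assumptions for this example.

```python
def self_heal(unhealthy, pool, max_fraction):
    """Remove unhealthy machines from the pool only when the blast radius is small.

    Returns (new_pool, outcome), where outcome is "noop", "healed" or "manual".
    """
    targets = [m for m in unhealthy if m in pool]
    if not targets:
        return list(pool), "noop"
    if len(targets) / len(pool) > max_fraction:
        # Too many machines affected: automatic removal could itself cause
        # an outage, so hand over to a human with a one-click action instead.
        return list(pool), "manual"
    return [m for m in pool if m not in targets], "healed"
```

For example, with `max_fraction=0.25` and a pool of four machines, one bad machine is removed automatically, while three bad machines fall back to manual handling.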

With these four elements optimized, the whole emergency handling process for early warning is complete. The remaining two elements concern the operation of early warning: managing it further through efficient and effective operation.

05 Coverage: automatic coverage by template

In post-incident reviews, we find that many problems were not detected in time because the corresponding monitoring was missing. It is impossible to cover every monitoring item, but experience can be accumulated and passed on. General, standard early warnings apply to most businesses and can be rolled out more widely through templates.

One class of early warnings shares similar monitoring items and even similar threshold definitions. Such standard monitoring should be quickly applicable across multiple applications and even businesses, with an inspection mechanism ensuring that nothing is missed. Monitoring items are generally divided into basic monitoring and business monitoring, and the definitions of monitoring and early warning vary with business importance. For example, we define monitoring levels that correspond to application levels and apply monitoring by template according to level. This avoids forgotten configuration, and new resources or services are covered automatically.

General monitoring needs to be accumulated and turned into templates, so that experience is passed on and similar problems are not repeated. For example, a sister team once had the following failure. Because of a bug in the release script, machines removed from the VIP during a release were not re-attached after the release finished. When the last machine was released, the VIP had no servers left and the service became unavailable. In the review we distilled a general rule: when more than xx% of the machines attached to a VIP have been removed, issue an early warning. The rule was applied across multiple organizations to prevent similar incidents.
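The VIP rule above is simple enough to sketch directly. The article leaves the threshold as xx%, so the sketch takes it as a required parameter rather than guessing a value; the function and parameter names are assumptions.

```python
def vip_removal_alert(attached_before: int, attached_now: int,
                      threshold_fraction: float) -> bool:
    """Return True when the share of machines removed from a VIP exceeds the threshold.

    threshold_fraction is the article's unspecified "xx%" expressed as a fraction;
    the caller must choose it.
    """
    if attached_before <= 0:
        raise ValueError("attached_before must be positive")
    removed = max(attached_before - attached_now, 0)
    return removed / attached_before > threshold_fraction
```

Expressed as a template, the same check can be attached to every VIP automatically instead of being configured per application.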

06 Measurement: do negative entropy through data statistics

Finally, let's talk about the data statistics of early warning. The management master Drucker once said:

If you can't measure it, you can't manage it. - Peter Drucker

Measurement is an important part of closing the loop of early warning governance. In essence we apply the "lean" idea: find problems through data feedback, try to solve them, and then look at the data again.

The measured data can include the following aspects:

◾ Specific early warning data: used for follow-up analysis and improvement, including notification frequency and early warning counts (for example, no more than 3 per day on average), and red and black lists (best and worst rankings) of the top people, teams and applications.

◾ Marked invalid: used to clean up invalid alerts.

◾ Whether someone claims: the claim rate is published through the red and black lists.

◾ Claim time: used to improve emergency response, with reference to "1-5-10" (1 minute to find the problem, 5 minutes to locate it and 10 minutes to recover).

◾ Time to resolve: used to better achieve "1-5-10".

◾ Use of early warning tools: improve the accuracy of tool recommendations.
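Several of the measurements above can be computed from a plain list of alert records, as in the sketch below. The record fields (`claimed_after_s`, `resolved_after_s`, `invalid`) and the metric names are assumptions for illustration.

```python
from statistics import mean


def summarize(alerts, days):
    """Compute basic governance metrics from alert records.

    Each record is a dict with "claimed_after_s" and "resolved_after_s"
    (seconds, or None if it never happened) and a boolean "invalid".
    """
    if not alerts or days <= 0:
        raise ValueError("need at least one alert and a positive number of days")
    total = len(alerts)
    claim_times = [a["claimed_after_s"] for a in alerts
                   if a["claimed_after_s"] is not None]
    resolve_times = [a["resolved_after_s"] for a in alerts
                     if a["resolved_after_s"] is not None]
    within_10_min = [t for t in resolve_times if t <= 600]   # the "10" in 1-5-10
    return {
        "per_day": total / days,
        "invalid_rate": sum(1 for a in alerts if a["invalid"]) / total,
        "claim_rate": len(claim_times) / total,
        "mean_claim_s": mean(claim_times) if claim_times else None,
        "recovered_within_10_min": (len(within_10_min) / len(resolve_times)
                                    if resolve_times else None),
    }
```

Grouping the same records by person, team or application before calling `summarize` would give the inputs for the red and black lists.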

With this operational data, governance has something concrete to act on. Only through continuous operation can early warning evolve in a healthier direction.


Finally, we use an internal practice from Alibaba Cloud Elastic Computing to show how we combine the six elements in early warning governance.

We have built a unified early warning platform that, through engineering, integrates the six elements into early warning life cycle management.

First, we have consolidated most early warning channels. Internally we have multiple early warning sources, such as SLS, POP, ARMS, DAS and Prometheus. These sources notify the early warning gateway rather than specific people, and the gateway parses, structures and diagnoses the data.

In the diagnosis stage we output the scope of impact, the root cause and quick recovery tools. This part is the most complex and requires information integration and diagnosis capabilities. First, we associate the early warning with meta information, the basic data we continuously build internally, which includes resource information (machines, databases, APIs, applications, etc.), organization information (organizational structure, owners, on-duty schedules) and strategy information (emergency process, escalation strategy, notification strategy, etc.). In addition, we diagnose the scope of impact (affected customers, regions, resources, APIs, etc.) from the early warning content, preserve the context by capturing log or trace links, and finally provide recovery tools based on analysis of the content. This information is rendered by a template engine and pushed to specific people.
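The final rendering step can be illustrated with Python's standard `string.Template`. This is only a sketch of the idea; the template layout, the field names and the sample values are assumptions, and the article does not say which template engine the platform actually uses.

```python
from string import Template

# Hypothetical notice layout covering impact, context, diagnosis and recovery.
NOTICE = Template(
    "[$level] $title\n"
    "Impact: $impact\n"
    "Context: $trace\n"
    "Suspected cause: $cause\n"
    "Recovery: $recovery\n"
    "Owner: $owner"
)


def render(alert: dict, meta: dict) -> str:
    """Render a diagnosed alert into the text pushed to the receiver."""
    fields = {
        "level": alert.get("level", "unknown"),
        "title": alert.get("title", "untitled alert"),
        "impact": alert.get("impact", "n/a"),
        "trace": alert.get("trace", "n/a"),
        "cause": alert.get("cause", "n/a"),
        "recovery": alert.get("recovery", "n/a"),
        "owner": meta.get("owner", "unassigned"),
    }
    return NOTICE.substitute(fields)
```

Filling missing fields with "n/a" keeps the notice layout stable, so a receiver can see at a glance which piece of diagnosis is absent.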

If diagnosis finds that an alert needs to be escalated, we automatically raise the alert level and start the corresponding emergency process. For example, when we identify a customer under key assurance, we start the emergency response process for that alert.

Finally, by quantifying the data we produce the corresponding red and black lists and keep working on the indicators that fall short. We also use the data to analyze the weaknesses in our governance process and keep iterating, which closes the loop.
