SmartMetrics helps you configure alarms with algorithms

Introduction

A senior SRE once said, "I feel uneasy if I don't receive dozens of alarms every day," and, "Alarms fire every day, yet nothing is actually wrong with our applications." These remarks reflect a very common phenomenon: false alarms flood in, and the "true" alarms are easily buried. The ARMS AIOps team analyzed more than 60,000 alarms about sudden increases in response time and error rate and found that only 3.05% of them were "true" alarms. The team also found that the root cause of false alarm flooding is that it is difficult to configure effective alarm rules using only the capabilities of existing alarm products. Therefore, based on the managed version of Grafana, ARMS introduces SmartMetrics, an intelligent alarm plug-in that uses algorithms to help users solve the problem of "difficult alarm configuration and maintenance".

This article starts with two common types of ineffective alarm rules, analyzes why effective alarms are hard to configure and why false alarms are rampant, explains how SmartMetrics helps users solve the alarm configuration problem, and closes with some best practices. You are also welcome to join the SmartMetrics discussion group on DingTalk (group number 25125004458).

Alarm status analysis

False alarm flooding

By analyzing ARMS alarm data and interviewing users, we found that many users receive hundreds of alarms a day, but only a few of them are really useful. Worse, the "true" alarms are often buried under a large number of false alarms, so users fail to handle the real fault promptly. These false alarms are often caused by poor alarm configuration habits, of which the following two are typical:

"One size fits all" alarm mode:

For example, an SRE manages many interfaces and, to save configuration time, sets the same fixed thresholds for the response time, error rate, and call volume of all applications and interfaces. However, the normal level of these indicators differs from one application or interface to another. When hundreds of applications and interfaces share the same threshold, a large number of false alarms is the natural result.

Neglected alarm thresholds:

Some alarm rules work well at first, but as the business grows, the average level of business indicators such as response time and call volume, and of machine indicators such as CPU utilization, changes. If the SRE does not update the thresholds in time, alarms fire continuously even though the system is healthy.

Generally speaking, fixing these two alarm strategies would avoid a large number of false alarms. In practice, however, fixing them is not simple, because existing alarm products often only let users configure alarms with static thresholds.

Let's take a look at a typical alarm configuration page. It provides many common aggregation operators, such as mean, maximum, minimum, and difference, which users can combine to customize alarm rules.

Now let's look at a real O&M indicator, such as call volume (qps, shown as the number of calls per minute). Its curve fluctuates constantly over time.

Given such a curve, which of min, max, or avg should be used, and what should the threshold be?
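To make the difficulty concrete, here is a minimal Python sketch. It builds a synthetic daily call-volume curve and tests two static thresholds against it. The sine shape, the numbers, and all function names are illustrative assumptions, not ARMS data.

```python
import math

MINUTES_PER_DAY = 1440

def synthetic_call_volume(minute: int) -> float:
    """Hypothetical daily curve: trough at 00:00 (200), peak at 12:00 (1800)."""
    return 1000 + 800 * math.sin(2 * math.pi * (minute - 360) / MINUTES_PER_DAY)

normal_day = [synthetic_call_volume(m) for m in range(MINUTES_PER_DAY)]

# Inject one anomaly: call volume doubles at 03:00, when traffic is normally low.
incident_minute = 180
incident_day = list(normal_day)
incident_day[incident_minute] *= 2

def minutes_over(values, threshold):
    """Return the minutes at which a static 'value > threshold' rule fires."""
    return [m for m, v in enumerate(values) if v > threshold]

# A threshold above the daytime peak never fires falsely,
# but it misses the night-time incident entirely.
high_threshold = 1900
missed = incident_minute not in minutes_over(incident_day, high_threshold)

# A threshold low enough to catch the incident also fires
# for most of a perfectly normal day.
low_threshold = 600
caught = incident_minute in minutes_over(incident_day, low_threshold)
false_alarm_minutes = len(minutes_over(normal_day, low_threshold))

print(missed, caught, false_alarm_minutes)
```

In this toy setup, the low threshold fires for roughly two thirds of a normal day, while the high threshold stays silent during the incident. No single static value does both jobs.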

Reasons for difficult configuration and maintenance of effective alarms

Faced with real, constantly changing O&M indicators, even experienced O&M experts find it difficult to configure effective alarms. We summarize the main reasons why effective alarms are hard to configure and maintain:

1. Many O&M indicators fluctuate, making it difficult to set an appropriate static threshold.

They often show seasonal patterns with hourly, daily, or weekly periods. Because the indicators themselves fluctuate, neither a static threshold nor a period-over-period threshold (for example, a comparison with the same time yesterday) fits them well.

2. For the same indicator, different applications, interfaces, and hosts need different thresholds.

Take RT (response time) as an example. If an interface normally has an RT of about 200 ms, an RT above 300 ms can be judged abnormal. However, some interfaces consistently carry heavy traffic, and their RT normally fluctuates around 500 ms, so an appropriate alarm threshold for them may be around 600 ms. An application may have hundreds of interfaces, and O&M engineers need to configure an appropriate threshold for each of them, so the time cost is very high.

3. The normal level of an indicator changes as the business grows.

As the company's business develops and new applications are launched, the normal level of some indicators keeps changing. If thresholds are not updated in time, a large number of false alarms is easily generated.

To sum up, the static threshold approach of existing alarm products requires SREs to invest a lot of time and effort, and it is still difficult to achieve good results. To solve this problem, ARMS launched SmartMetrics, a Grafana-based intelligent alarm plug-in that uses algorithms to help users configure effective alarms.
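The second reason above (different interfaces need different thresholds) can be illustrated with a small sketch. The response-time samples are made up, and the "mean plus three standard deviations" rule is only a stand-in for a per-interface baseline, not the method SmartMetrics uses.

```python
from statistics import mean, stdev

# Hypothetical response-time samples (ms) for two healthy interfaces.
rt_samples = {
    "interface_a": [190, 200, 210, 205, 195, 200],  # normally around 200 ms
    "interface_b": [480, 500, 520, 510, 490, 500],  # normally around 500 ms
}

UNIFORM_THRESHOLD_MS = 300  # one threshold applied to every interface

def false_alarms(samples, threshold):
    """Count healthy samples that a 'rt > threshold' rule would flag."""
    return sum(1 for rt in samples if rt > threshold)

def own_threshold(samples, k=3.0):
    """Illustrative per-interface threshold: mean plus k sample standard deviations."""
    return mean(samples) + k * stdev(samples)

uniform = {name: false_alarms(s, UNIFORM_THRESHOLD_MS) for name, s in rt_samples.items()}
per_interface = {name: false_alarms(s, own_threshold(s)) for name, s in rt_samples.items()}

print(uniform)        # {'interface_a': 0, 'interface_b': 6}
print(per_interface)  # {'interface_a': 0, 'interface_b': 0}
```

With the uniform threshold, every healthy sample of the slower interface is flagged. Deriving the threshold from each interface's own history removes those false alarms, but doing that by hand for hundreds of interfaces is exactly the cost described above.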

Simpler alarm configuration - SmartMetrics

SmartMetrics is a "smart, easy to use, and visible" alarm plug-in. It learns the characteristics of an indicator from its historical data, predicts the indicator's normal range in the future, and generates upper and lower boundaries. By default, the interval enclosed by the upper and lower boundaries is the 90% confidence interval. That is, judging from the trend of the previous days, if nothing abnormal happens to the indicator, there is a 90% probability that its future value will fall between the predicted upper and lower boundaries.
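As a rough illustration of what a 90% band means, the sketch below derives upper and lower boundaries per time-of-day slot from a few days of history. It assumes the values in each slot are approximately normally distributed. This naive approximation, the function name normal_band, and the sample data are assumptions for illustration only. SmartMetrics' actual models are described later in this article.

```python
from statistics import NormalDist, mean, stdev

def normal_band(history_by_slot, confidence=0.90):
    """Return {slot: (lower, upper)} from per-slot history.

    history_by_slot maps a time-of-day slot (for example, an hour 0..23)
    to the values observed in that slot on previous days.
    """
    # A two-sided 90% interval leaves 5% in each tail,
    # so the z-score of the 95th percentile is used (about 1.645).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    band = {}
    for slot, values in history_by_slot.items():
        center, spread = mean(values), stdev(values)
        band[slot] = (center - z * spread, center + z * spread)
    return band

# Seven days of hypothetical call volume for two hourly slots.
history = {
    3: [410, 430, 445, 420, 440, 435, 425],           # 03:00, low traffic
    12: [1780, 1820, 1795, 1810, 1805, 1790, 1800],   # 12:00, peak traffic
}

band = normal_band(history)
for slot, (lower, upper) in band.items():
    print(f"{slot:02d}:00 normal range: {lower:.0f} to {upper:.0f}")
```

Each slot gets its own range, so a value that is ordinary at noon can still be flagged at 03:00.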

SmartMetrics works with Grafana's native alarm function: the upper and lower boundaries it generates can be used as thresholds when configuring alarms. A simple alarm strategy is to fire an alarm when the indicator rises above the upper boundary or falls below the lower boundary. You can also configure more complex strategies, such as firing an alarm only when the raw value exceeds 1.5 times the upper boundary and the value has not exceeded the upper boundary at any point in the past hour.
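The two strategies can be sketched in Python as follows. The Sample type and the function names are hypothetical. The sketch assumes one sample per minute and reads "1.5 times the upper boundary" as actual > 1.5 * upper. In practice these rules are configured in Grafana rather than written in code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Sample:
    actual: float  # observed value
    upper: float   # predicted upper boundary
    lower: float   # predicted lower boundary

def simple_rule(s: Sample) -> bool:
    """Fire when the value leaves the predicted band on either side."""
    return s.actual > s.upper or s.actual < s.lower

def strict_rule(series: List[Sample], i: int,
                factor: float = 1.5, lookback: int = 60) -> bool:
    """Fire at index i only if the value exceeds factor * upper and no sample
    in the preceding `lookback` samples was above its upper boundary."""
    current = series[i]
    if current.actual <= factor * current.upper:
        return False
    recent = series[max(0, i - lookback):i]
    return not any(s.actual > s.upper for s in recent)

# Two hours of per-minute samples with a constant band of 80..120.
series = [Sample(actual=100, upper=120, lower=80) for _ in range(120)]
series[10] = Sample(130, 120, 80)   # mild excursion
series[90] = Sample(200, 120, 80)   # large spike
series[91] = Sample(200, 120, 80)   # spike continues

simple_hits = [i for i, s in enumerate(series) if simple_rule(s)]
strict_hits = [i for i in range(len(series)) if strict_rule(series, i)]

print(simple_hits)  # [10, 90, 91]
print(strict_hits)  # [90]
```

The stricter rule ignores the mild excursion and reports the large spike once instead of once per minute.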

At present, SmartMetrics has been launched in the managed version of Grafana, and will be embedded in the alarm capability of ARMS as an additional function in the future.

How SmartMetrics generates upper and lower boundaries

SmartMetrics calculates upper and lower boundaries for different types of indicators through multi-model fusion. It first captures the key characteristics of the indicator with the Smart PLR algorithm and determines the type of the indicator curve with a classification algorithm. Based on that type, it then selects the most suitable time series prediction model and the best parameters. Finally, it generates the upper and lower boundaries.
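The "classify first, then pick a model" flow can be pictured with the toy sketch below. It does not implement Smart PLR or SmartMetrics' classifier. The trend test, the autocorrelation threshold of 0.5, the class labels, and the MODEL_FOR_TYPE mapping are all hypothetical stand-ins chosen only to show the shape of such a pipeline.

```python
import math
from statistics import mean, stdev

def autocorrelation(values, lag):
    """Autocorrelation of `values` at the given lag (biased estimator)."""
    mu = mean(values)
    denom = sum((v - mu) ** 2 for v in values)
    if denom == 0:
        return 0.0
    numer = sum((values[i] - mu) * (values[i + lag] - mu)
                for i in range(len(values) - lag))
    return numer / denom

def classify_curve(values, period):
    """Toy classifier returning 'trend', 'cyclical', or 'volatile' (fallback)."""
    third = len(values) // 3
    # Trend: the level of the last third differs clearly from the first third.
    shift = mean(values[-third:]) - mean(values[:third])
    if abs(shift) > stdev(values):
        return "trend"
    # Cyclical: the series resembles itself one period later.
    if autocorrelation(values, period) > 0.5:
        return "cyclical"
    return "volatile"

# Hypothetical mapping from curve type to a forecasting approach.
MODEL_FOR_TYPE = {
    "trend": "trend-following model",
    "cyclical": "seasonal model",
    "volatile": "robust wide-band model",
}

hours = range(72)  # three days of hourly samples
ramp = [10 + 2 * h for h in hours]
daily = [100 + 50 * math.sin(2 * math.pi * h / 24) for h in hours]
noisy = [100 + (7 * h) % 5 for h in hours]

for name, series in [("ramp", ramp), ("daily", daily), ("noisy", noisy)]:
    curve_type = classify_curve(series, period=24)
    print(name, curve_type, MODEL_FOR_TYPE[curve_type])
```

The point of the design is that no single forecasting model fits every curve, so the type of the curve decides which model and parameters are used.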

SmartMetrics builds on popular open-source algorithms in the industry, namely Prophet, STL, ARIMA, and BiLSTM. Based on Alibaba Cloud's internal big data practice, it optimizes single-period and multi-period identification, trend identification, outlier identification, spike identification, and change point identification, and integrates them into a multi-model Smart Prophet algorithm solution. SmartMetrics has the following characteristics:

a. Accuracy: The algorithm has been verified in multiple scenarios within Alibaba Cloud and provides accurate and comprehensive anomaly detection. Combined with an alarm duration setting, it produces accurate alarms.

b. Universality: The algorithm supports both business indicators and infrastructure indicators, and applies curve classification and model parameter configuration tailored to cyclical, trend, and volatile indicators.

c. Maintenance free: Users of SmartMetrics do not need to adjust the algorithm's parameters as the business changes. The algorithm adapts to business changes by itself by learning how the indicators change.

How SmartMetrics solves the difficulty of configuring and maintaining effective alarms

1. How SmartMetrics configures effective alarms for fluctuating O&M indicators

Based on 7 days of historical data, SmartMetrics automatically predicts the upper and lower boundaries of the curve under normal conditions for the next day, and writes the actual values of the indicator in real time. Users can then use Grafana's built-in alarm capability to configure alarms, for example to fire when the actual value falls outside the upper and lower boundaries, or when the actual value exceeds 1.5 times the upper boundary. Users can customize various alarm rules. For more best practices, refer to the official SmartMetrics documentation.

2. How SmartMetrics handles different alarm thresholds for the same indicator across applications, interfaces, and hosts

When configuring alarms on the indicators of different applications, interfaces, and hosts, use SmartMetrics to generate the upper and lower boundaries. SmartMetrics automatically learns the characteristics of each indicator and generates an appropriate dynamic baseline, so users do not need to enter static thresholds manually.

3. How SmartMetrics handles static thresholds that are hard to maintain as the business develops

By default, SmartMetrics updates its model every day to automatically learn the changes in the normal level of indicators caused by business changes, so no manual maintenance is required.
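The effect of a daily refresh can be seen in a small simulation. It compares a static threshold, set once, with an upper boundary that is recomputed every day from the previous seven days. The growth rate, the threshold of 1200, and the "mean plus three standard deviations" boundary are illustrative assumptions and not the SmartMetrics model.

```python
from collections import deque
from statistics import mean, stdev

def count_static_alarms(daily_peaks, threshold):
    """Alarms from a threshold that is set once and never updated."""
    return sum(1 for peak in daily_peaks if peak > threshold)

def count_rolling_alarms(daily_peaks, window=7, k=3.0):
    """Alarms when the upper boundary is refit each day from the previous `window` days."""
    history = deque(maxlen=window)
    alarms = 0
    for peak in daily_peaks:
        if len(history) == window:
            upper = mean(history) + k * stdev(history)
            if peak > upper:
                alarms += 1
        history.append(peak)  # the oldest day drops out automatically
    return alarms

# A healthy business whose daily peak grows 2% per day for 30 days.
peaks = [1000 * 1.02 ** day for day in range(30)]

static_alarms = count_static_alarms(peaks, threshold=1200)
rolling_alarms = count_rolling_alarms(peaks)

print(static_alarms)   # 20
print(rolling_alarms)  # 0
```

Nothing is wrong on any of the 30 days, yet the static threshold fires on every day from day 10 onward, while the boundary that is refreshed daily follows the growth and stays silent.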

SmartMetrics Best Practices

Step 1 Create Dynamic Threshold Task

• Select the right data source

• Select indicators for dynamic monitoring

Note: Currently, a task supports only a single indicator (one time series), so you need to specify the label values in the query, or use operators such as sum and count, so that the query returns a single indicator. Multi-indicator configuration is planned; whether it goes online will be decided based on user feedback.

• After selecting the indicator, run the query to display the corresponding indicator curve.

• Set appropriate model parameters and select a sensitivity. The default configuration is recommended.

• Fill in the correct name and description.

• Click Create Forecast to complete the creation.

Step 2 View Metrics

• After the task is created successfully, you can click to return to the task list.

Note: Once created, the task starts immediately and performs data pulling, calculation, storage, and so on. This process takes about 1-2 minutes.

• Click to view the dashboard, which shows the indicator and its corresponding upper and lower boundaries.

• The dashboard shows the original indicator time series together with the normal region enclosed by the corresponding upper and lower boundaries. If the indicator value lies within the boundaries, the algorithm considers it normal; if it goes beyond the upper or lower boundary, the algorithm considers it abnormal.

• Click Edit to enter the editing page.

• The indicator and its upper and lower boundaries are all stored in the Prometheus data source cloud_product_prometheus_cn-hangzhou_aiops_UserId. The indicator name is the name given when the task was created. The values of the label smart_metric (actual, upper, lower) correspond to the original indicator, the upper boundary, and the lower boundary, respectively. For example, to view the upper boundary on its own, find the corresponding indicator in the data source cloud_product_prometheus_cn-hangzhou_aiops_UserId.

Step 3 Anomaly detection and alarm configuration

• On the indicator dashboard page, click Edit to enter the editing page.

• On the Query tab, you can see query D. By default, it is preset to select the points where the indicator exceeds the upper boundary:

tuyang_test{smart_metric="actual"} > ignoring (smart_metric) tuyang_test{smart_metric="upper"}

• In the tabs below the Grafana panel, open the Alert tab and create an alert.

• Set the NULL (no data) case to OK, and configure the notifier and notification message at the same time.
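The preset query in this step can be adapted to other conditions by following the same pattern. The helper below is a hypothetical convenience, not part of SmartMetrics. It only assembles PromQL strings from the smart_metric label values described in Step 2, using the task name tuyang_test from the example query. The generated expressions are a sketch and should be checked in Grafana's query editor before being used in an alert.

```python
VALID_KINDS = {"actual", "upper", "lower"}

def selector(task_name: str, kind: str) -> str:
    """PromQL selector for one of the series a task writes."""
    if kind not in VALID_KINDS:
        raise ValueError(f"unknown smart_metric value: {kind}")
    return f'{task_name}{{smart_metric="{kind}"}}'

def above_upper(task_name: str, factor: float = 1.0) -> str:
    """Points where the actual value exceeds `factor` times the upper boundary."""
    actual = selector(task_name, "actual")
    upper = selector(task_name, "upper")
    # Parentheses keep the scaled boundary together as the right-hand operand.
    bound = upper if factor == 1.0 else f"({factor} * {upper})"
    return f"{actual} > ignoring (smart_metric) {bound}"

def below_lower(task_name: str) -> str:
    """Points where the actual value falls below the lower boundary."""
    actual = selector(task_name, "actual")
    lower = selector(task_name, "lower")
    return f"{actual} < ignoring (smart_metric) {lower}"

print(above_upper("tuyang_test"))
print(above_upper("tuyang_test", factor=1.5))
print(below_lower("tuyang_test"))
```

The first printed expression matches the preset query. The other two cover the "1.5 times the upper boundary" and "below the lower boundary" cases mentioned earlier in the article.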

Step 4 Alarm notification

• When an alarm fires, you receive it through the configured notification channel and can click the link to jump to Grafana and view it.

Step 5 Task management

• Dynamic threshold detection tasks that are no longer needed can be deleted from the task list.
