Anomaly Detection in Real-World Scenarios + Assistance from Prometheus

By Fandeng and Baiyu

Background

As a basic and important function of intelligent operation and maintenance (AIOps) systems, anomaly detection aims to automatically discover abnormal fluctuations in KPI time series data through algorithms. It provides a decision-making basis for subsequent alerting, automatic stop-loss, and root cause analysis. What is anomaly detection, and how should we use it in real-world scenarios? This article gives an in-depth explanation.

What Is Anomaly Detection?

First, we need to understand what anomaly detection is. Anomaly detection refers to identifying abnormal events and phenomena in time series or event logs. In this article, it refers specifically to anomaly detection on time series. Abnormal points on a curve are found by jointly judging the magnitude of the values and the shape of the curve. Abnormal behavior generally means that the time series rises, falls, or fluctuates in a way that does not meet expectations.

For example, the memory usage of a machine has been fluctuating at about 40% and suddenly soars to 100%. The number of connections to a Redis database is normally around 100 and suddenly drops to 0. The number of online users of a business fluctuates around 100,000 and suddenly drops to 50,000.

What Is a Time Series?

A time series is a sequence of data points arranged in chronological order. The points are usually sampled at a constant interval (such as 1 minute or 5 minutes).

How Does Open-Source Prometheus Detect Anomalies?

Currently, the detection capability of open-source Prometheus is still based on threshold rules. Relying on thresholds raises the following problems.

Common Errors

Question 1: Faced with Tens of Thousands of Metrics, How Can We Complete the Detection Configuration Quickly and Reasonably?

Since the meaning of different types of metrics varies significantly, their reasonable thresholds also differ. Even metrics of the same type cannot share a threshold, because the business state behind them is different. Therefore, O&M personnel need to configure thresholds they consider reasonable based on the corresponding business conditions. Because O&M personnel differ in knowledge and work experience, the thresholds configured by different people also differ. In addition, many metrics have no clearly defined reasonable range. As a result, many thresholds are set subjectively and are fairly arbitrary.

For example, to set a reasonable threshold for an online user count metric, you must carefully observe and analyze the numerical distribution and trend of its historical curve.

Question 2: With the Evolution of Business, How Can We Maintain Detection Rules?

For a relatively stable business, the metrics stay in a stable state for a long time, so a configured threshold remains effective for a long time. However, for a business that changes constantly, the watermark and trend of the metric keep changing as the business evolves. A threshold that was appropriate when it was set can easily stop meeting detection needs after a period of time. O&M experts then have to check regularly whether each detection threshold still meets current requirements, and maintain and modify unreasonable configurations. Therefore, the static threshold method has a high maintenance cost.

For example, an IO throughput metric initially fluctuates steadily around 10,000, and the detection rule is set to trigger an alarm when the value exceeds 20,000. As the business grows, IO throughput stabilizes at about 25,000. The threshold set at the beginning now produces a steady stream of alarms.
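The IO throughput example can be reproduced in a few lines of Python. This is only an illustrative sketch on synthetic data: `static_alerts` and `threshold_from_history` are hypothetical helper names, and the "twice the median level" rule is an assumption that mirrors how the 20,000 threshold relates to the 10,000 level, not a rule used by Prometheus or ARMS.

```python
from statistics import median

def static_alerts(series, threshold):
    """Return the indices whose value exceeds a fixed threshold."""
    return [i for i, value in enumerate(series) if value > threshold]

def threshold_from_history(history, margin=2.0):
    """Hypothetical rule of thumb: alert above `margin` times the median level."""
    return margin * median(history)

# Synthetic IO throughput: about 10,000 at first, about 25,000 after the business grows.
early = [9800, 10000, 10200] * 10
late = [24800, 25000, 25200] * 10

initial_threshold = threshold_from_history(early)              # 20000.0
stale = static_alerts(late, initial_threshold)                  # threshold never updated
refreshed = static_alerts(late, threshold_from_history(late))   # threshold re-derived by hand

print(len(stale), len(refreshed))  # prints "30 0"
```

Every point of the later period alerts against the stale threshold, and the alarms stop only after someone re-derives the threshold from recent data. That manual step is the maintenance cost described above.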

Question 3: How Can We Solve the Problem of Poor Data Quality?

Poor data quality shows up in several specific ways: large collection delay, many missing values, and many burr points (spikes that make the curve insufficiently smooth). The first two are addressed mainly by targeted optimization on the collection and aggregation sides, and ARMS-Prometheus continues to optimize its collection capability. However, the static threshold method cannot effectively cope with data that contains many burr points. The intelligent operator of ARMS-managed Prometheus identifies burr points so that they do not trigger invalid alarms, which reduces interference for users and O&M personnel.
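To make the effect of burr points concrete, the sketch below suppresses an isolated spike with a small median filter before applying a threshold. The article does not describe how the ARMS operator identifies burr points, so the median filter here is a common stand-in technique chosen for illustration, and `median_smooth` is a hypothetical helper name.

```python
from statistics import median

def median_smooth(series, window=3):
    """Replace each point with the median of a centered window.
    Points near the edges use a shorter window."""
    half = window // 2
    return [
        median(series[max(0, i - half): i + half + 1])
        for i in range(len(series))
    ]

# Synthetic memory usage (%) with one isolated burr point at index 3.
raw = [40, 41, 39, 95, 40, 42, 41, 40]
threshold = 80

raw_alerts = [i for i, v in enumerate(raw) if v > threshold]
smooth_alerts = [i for i, v in enumerate(median_smooth(raw)) if v > threshold]

print(raw_alerts, smooth_alerts)  # prints "[3] []"
```

A single outlier cannot dominate a three-point median, so the spike no longer triggers an alarm, while a sustained rise across several points would still pass through the filter.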

How Does Alibaba Cloud Prometheus Monitor Solve These Problems?

To address the problems above, the detection configuration of Alibaba Cloud Prometheus Service supports three modes: native threshold settings, template-based thresholds, and intelligent detection operators.

Business Value 1: Efficient and High-Quality Alert Configuration

(1) Configure detection rules for well-defined application scenarios. Alibaba Cloud Prometheus Monitoring provides mature alert configuration templates. Users do not need to set thresholds manually; they only need to select the corresponding template. For example, in the machine metric scenario, you can select the template "CPU usage > 80%". The template method addresses scenarios where the abnormal condition is clearly defined and the business is relatively stable.

(2) For scenarios where the metric is not well understood, or for business metrics whose thresholds are not easy to set, the intelligent detection operator is recommended.

For example, setting a threshold for the number of online users requires observing the historical curve for a long time before a reasonable value can be chosen. In this scenario, users can directly select an intelligent detection operator.

Business Value 2: Adaptive Tracking of Business Changes Reduces the Maintenance Cost of Detection Threshold Significantly

The intelligent detection operator of Alibaba Cloud Prometheus Monitoring lets the model adaptively track changes in metric trends. You only set a parameter that specifies the length of historical data to refer to, and you do not need to review the configured rules regularly.

Business Value 3: Intelligent Detection for Metrics with Poor Quality and Missing Values/Response Latency

In the intelligent detection operator, if historical data is missing, the algorithm automatically fills in the missing values by linear interpolation, polynomial interpolation, or other methods. For metric curves that are not smooth, the intelligent detection operator also adaptively selects the most suitable model for the scenario, which safeguards the overall detection quality.
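As a rough illustration of the gap-filling step, the following sketch implements plain linear interpolation over missing samples. It is not the operator's actual code: `fill_linear` is a hypothetical helper, missing samples are assumed to be represented as `None`, and copying the nearest known value into leading or trailing gaps is an assumption made here. Polynomial interpolation is not shown.

```python
def fill_linear(series):
    """Fill None gaps by linear interpolation between the nearest known
    neighbors. Leading and trailing gaps copy the nearest known value."""
    filled = list(series)
    known = [i for i, v in enumerate(filled) if v is not None]
    if not known:
        return filled  # nothing to interpolate from
    for i in range(known[0]):                    # leading gap
        filled[i] = filled[known[0]]
    for i in range(known[-1] + 1, len(filled)):  # trailing gap
        filled[i] = filled[known[-1]]
    for left, right in zip(known, known[1:]):    # interior gaps
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled

print(fill_linear([10, None, None, 16, None, 20]))
# prints "[10, 12.0, 14.0, 16, 18.0, 20]"
```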

How to Apply in Specific Business Scenarios

Sudden Increase/Decrease in Watermark: QPS Metric of a Business

When the service first goes live, a static threshold is typically chosen by observation, for example, that QPS should not exceed 150. However, the QPS metric changes in a variety of ways as the business iterates. After some time, the metric may suddenly rise to a new level and then stabilize there, in which case the static threshold can no longer meet detection requirements. The metric may also suddenly drop from its stable level, and a static threshold that only sets an upper limit cannot detect the drop. In these cases, the intelligent detection operator adaptively tracks changes in the service watermark and identifies sudden increases or decreases.
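The idea of tracking the watermark and flagging changes in both directions can be sketched as follows. This is a simplified stand-in, not the operator's algorithm: it compares each point with a band of `k` standard deviations around the mean of a trailing window. The function name, window size, and `k` value are assumptions, and the QPS series is synthetic.

```python
from statistics import mean, pstdev

def detect_level_shifts(series, window=15, k=3.0):
    """Flag points that leave a band of k standard deviations around the
    mean of the previous `window` points, and label the direction."""
    events = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        center = mean(history)
        spread = max(pstdev(history), 1e-9)  # avoid a zero-width band
        if series[i] > center + k * spread:
            events.append((i, "increase"))
        elif series[i] < center - k * spread:
            events.append((i, "decrease"))
    return events

# Synthetic QPS: about 100, a jump to about 150 at index 30,
# then a drop to about 60 at index 60.
wiggle = [0, 2, -2, 1, -1]
levels = [100] * 30 + [150] * 30 + [60] * 30
qps = [level + wiggle[i % 5] for i, level in enumerate(levels)]

events = detect_level_shifts(qps)
print(events[0])  # prints "(30, 'increase')"
```

Because the band is recomputed from recent history, it follows the metric to its new level once the window fills, so no one has to edit a threshold after the business changes. An upper-limit threshold of 150 alone would never report the later drop to 60.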

Periodic Metrics

In the metric profile module, if the current metric is identified as periodic, the corresponding period value, period offset, and overall trend curve are extracted. After periodicity and trend are removed from the original time series, the residuals are used for anomaly detection. In the periodic metric shown in the figure above, the cycle at about 11:30 differs significantly from the other cycles. Traditional static thresholds have difficulty detecting this kind of anomaly, while intelligent detection operators can identify it.
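The "remove periodicity and trend, then inspect the residuals" step can be illustrated with a small classical decomposition. The sketch below is a simplified assumption of how such a step might look, not the operator's implementation: the period is given rather than detected, the trend is a centered moving average over one period, the seasonal profile is the per-phase median, and residuals beyond `k` standard deviations are flagged.

```python
from statistics import mean, median, pstdev

def residual_anomalies(series, period, k=3.0):
    """Remove a moving-average trend and a per-phase seasonal profile,
    then flag residuals larger than k standard deviations.
    Assumes an odd `period` so the centered window is symmetric."""
    half = period // 2
    idx = range(half, len(series) - half)  # points with a full centered window
    trend = {i: mean(series[i - half:i + half + 1]) for i in idx}
    detrended = {i: series[i] - trend[i] for i in idx}
    # Seasonal profile: median of the detrended values at each phase of the cycle.
    seasonal = {
        phase: median([detrended[i] for i in idx if i % period == phase])
        for phase in range(period)
    }
    residual = {i: detrended[i] - seasonal[i % period] for i in idx}
    scale = max(pstdev(list(residual.values())), 1e-9)
    return [i for i in idx if abs(residual[i]) > k * scale]

# Synthetic metric: upward trend plus a repeating 7-point cycle,
# with one abnormal point injected at index 25.
pattern = [0, 5, 10, 5, 0, -10, -10]
series = [100 + 0.5 * i + pattern[i % 7] for i in range(56)]
series[25] += 30

print(residual_anomalies(series, period=7))  # prints "[25]"
```

The injected value of 142.5 is unremarkable next to the later, naturally higher part of the curve, so a static threshold would have to be set impractically tight to catch it. In the residuals it stands out clearly.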

Trend-Breaking Metrics

Another common type of anomaly occurs when a metric has been on an upward (or downward) trend for some time and the trend suddenly breaks at a certain point, so that the local trend differs from the overall trend. This type of anomaly is very common, but it is difficult to set a static threshold that catches it. The intelligent detection operator can accurately identify anomalies of this type.
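One simple way to express "the local trend differs from the overall trend" is to compare least-squares slopes. The sketch below is an illustrative assumption rather than the operator's method: `slope` and `trend_broken` are hypothetical helpers, and the `recent` window and `tolerance` values are arbitrary.

```python
def slope(values):
    """Least-squares slope of values against their index (needs >= 2 points)."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den

def trend_broken(series, recent=10, tolerance=0.5):
    """True if the slope of the last `recent` points deviates from the slope
    of the earlier points by more than `tolerance` times that earlier slope."""
    overall = slope(series[:-recent])
    local = slope(series[-recent:])
    return abs(local - overall) > tolerance * max(abs(overall), 1e-9)

rising = [2 * i for i in range(40)]                                # steady upward trend
broken = [2 * i for i in range(30)] + [60 - 3 * j for j in range(10)]  # reverses at the end

print(trend_broken(rising), trend_broken(broken))  # prints "False True"
```

In the second series every value stays within the range the metric has already visited, so no static threshold separates the reversal from normal behavior. Only the change in slope reveals it.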

Best Practices

Usage of Alibaba Cloud Prometheus Monitoring

Currently, Alibaba Cloud Prometheus Service supports the intelligent detection operator feature. You only need to log on to ARMS-Prometheus or Grafana and enter the corresponding PromQL.

Operator Definition:

"anomaly_detect": {
    Name:       "anomaly_detect",
    ArgTypes:   []ValueType{ValueTypeMatrix, ValueTypeScalar},
    ReturnType: ValueTypeVector,
},

Input: The time series of the metric, which must be a range vector, and a detection parameter. Use the default value 3 for the detection parameter.
Output: 1 for abnormal and 0 for normal.

Use Case:

anomaly_detect(node_memory_free_bytes[20m], 3)

The input must be a range vector, so you need to append a range selector such as [180m] to the metric name. The default time range is 180m, and the default parameter is 3.
If other aggregate functions are applied first, a subquery range such as [180m:] is required to turn the result into a range vector:

anomaly_detect(sum(node_memory_free_bytes)[180m:], 3)
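If you want to issue such a query from a script rather than from the Grafana query box, the expression can be assembled and URL-encoded as sketched below. The sketch assumes that your endpoint exposes the standard Prometheus HTTP API path `/api/v1/query` and accepts the operator there, which the article does not state. The host name is a placeholder, and `build_expr` and `build_query_url` are hypothetical helpers. No request is sent.

```python
from urllib.parse import urlencode

def build_expr(metric_expr, window="180m", param=3,
               aggregated=False, operator="anomaly_detect"):
    """Wrap a metric expression as a range vector and pass it to the operator.
    Aggregated expressions need the subquery form `[window:]`."""
    selector = f"{metric_expr}[{window}:]" if aggregated else f"{metric_expr}[{window}]"
    return f"{operator}({selector}, {param})"

def build_query_url(base_url, expr):
    """Build an instant-query URL for the standard Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"

plain = build_expr("node_memory_free_bytes")
summed = build_expr("sum(node_memory_free_bytes)", aggregated=True)

# Placeholder host; replace with your own Prometheus-compatible endpoint.
url = build_query_url("http://prometheus.example.com:9090", plain)

print(plain)   # anomaly_detect(node_memory_free_bytes[180m], 3)
print(summed)  # anomaly_detect(sum(node_memory_free_bytes)[180m:], 3)
```

Keeping the range-vector wrapping in one helper avoids the most common mistake with this operator, which is passing an instant vector or an aggregated expression without the `[180m:]` subquery suffix.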

Example:

Step 1: Log on to ARMS-Prometheus or Grafana and select the corresponding Prometheus data source:


Step 2: Select a metric and view it:

Step 3: Input anomaly detection operator:

Prometheus - Intelligent Detection Operator

The intelligent detection operator of Alibaba Cloud Prometheus Service was designed by drawing on dozens of leading algorithm solutions in the industry. It establishes metric profiles for common metric types and adaptively selects the best model for detection. When metric data enters the model, the model first builds a profile of the current metric, including smoothness, jitter, trend, periodicity, and whether a special holiday or event is involved. Once these profile features are built, the model adaptively selects the best single algorithm or combination of algorithms for the current metric, to optimize the overall detection result. Currently, the supported functions include surge detection, response latency detection, and period recognition (identifying periods and period offsets).

We hope that integrating intelligent detection operators into Alibaba Cloud Prometheus Monitoring provides users with out-of-the-box intelligent detection services that are updated continuously and iteratively. Currently, users can view and use intelligent detection operators in Alibaba Cloud Prometheus Service. Native intelligent detection and alerting configuration in ARMS, along with dynamic display in Grafana, will be launched in the near future.

Prometheus Service:
https://www.alibabacloud.com/help/en/application-real-time-monitoring-service/latest/what-is-prometheus-service

Prometheus Monitoring:
https://www.alibabacloud.com/help/en/application-real-time-monitoring-service/latest/prometheus-service
