Introducing the Alibaba Cloud Prometheus Intelligent Detection Operator


Anomaly detection is a fundamental function of intelligent operations and maintenance (AIOps) systems. It uses algorithms to automatically detect abnormal fluctuations in KPI time series data, providing a basis for subsequent alarms, automatic stop loss, root cause analysis, and so on. So what is anomaly detection, and how can it be used in practical scenarios? This article explains both in depth.

What is anomaly detection?

Before going further, we first need to understand what anomaly detection is. Anomaly detection refers to identifying abnormal events or phenomena in time series or event logs. Here we mean specifically anomaly detection on time series: abnormal points in a curve are identified by jointly judging the magnitude of the values and the shape of the curve. An anomaly generally shows up as an unexpected rise, fall, or fluctuation in the time series.

For example: the memory usage of a machine has been fluctuating around 40% and suddenly soars to 100%; the connection count of a Redis database normally sits around 100 and suddenly drops to 0; the number of online users of a business fluctuates around 100,000 and suddenly drops to 50,000.

What is a time series?

A time series refers to a sequence of data points arranged in chronological order, typically with a constant time interval (such as 1 minute, 5 minutes).

How does open source Prometheus currently perform anomaly detection?

At present, detection in open source Prometheus is still based on threshold rules, and this dependence on thresholds raises the following problems.

Common problems

Question 1: How can detection be configured quickly and reasonably across tens of thousands of indicators?

Different types of indicators mean very different things, so their reasonable thresholds differ too. Even indicators of the same type often cannot share one threshold, because the underlying businesses behave differently. Operation and maintenance personnel therefore have to configure thresholds they consider reasonable for each business situation, and because their knowledge and experience vary, so do the thresholds they choose. In addition, many indicators have no clearly defined reasonable range, so many thresholds end up being set by guesswork and are largely arbitrary.

For example, setting a reasonable threshold for an online user count indicator requires carefully observing and analyzing the value distribution and trend of its historical curves.

Question 2: How can detection rules be maintained as the business evolves?

For relatively stable businesses, indicators stay stable for a long time, so a configured threshold remains useful for a long time. For fast-changing businesses, however, the baseline level and trend of indicators keep shifting as the business evolves, so a threshold that was reasonable when it was set may no longer fit after a while. Operation and maintenance experts then have to check regularly whether each threshold still meets current detection requirements, and revise the ones that do not. Static thresholds therefore carry high maintenance costs.

For example, suppose an IO throughput metric initially stabilizes around 10000, and an alarm is configured to fire when it exceeds 20000. As the business grows, the throughput settles at around 25000, and from then on the original threshold produces a continuous stream of nuisance alarms.
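
The IO throughput scenario can be reproduced with a few lines of Python. This is only a minimal sketch: the rolling mean and standard deviation rule, the function names, and the `window` and `k` parameters are our own illustrative assumptions and are not the algorithm used by the intelligent detection operator.

```python
from statistics import mean, stdev


def static_alerts(values, threshold):
    """Return 1 where the value exceeds a fixed threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


def adaptive_alerts(values, window=10, k=3.0):
    """Flag a point that deviates more than k standard deviations
    from the mean of the preceding `window` points (illustrative rule)."""
    flags = []
    for i, v in enumerate(values):
        history = values[max(0, i - window):i]
        if len(history) < window:
            flags.append(0)  # not enough history yet
            continue
        mu = mean(history)
        # Floor the spread at 1% of the mean so a very flat history
        # does not make every tiny wiggle look abnormal.
        sigma = max(stdev(history), 0.01 * abs(mu))
        flags.append(1 if abs(v - mu) > k * sigma else 0)
    return flags


# Throughput hovers near 10000, then the business grows and it settles near 25000.
series = [10000 + (i % 5) * 50 for i in range(30)]
series += [25000 + (i % 5) * 50 for i in range(30)]

static = static_alerts(series, threshold=20000)
adaptive = adaptive_alerts(series)
print("static alarms:", sum(static), "adaptive alarms:", sum(adaptive))
```

In this sketch the static rule fires on every point after the shift, while the rolling baseline fires once at the moment of the jump and then adapts to the new level.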

Question 3: How can poor data quality be dealt with?

Poor data quality shows up in several ways: large collection delays, many missing values, and frequent glitches, also called burrs (the curve is not smooth). The first two are best addressed by targeted optimization on the collection and aggregation sides, and ARMS Prometheus continues to improve its collection capabilities. Glitch-heavy data, however, cannot be handled effectively by static thresholds. The intelligent operator in the ARMS managed version of Prometheus identifies these glitches so that they do not turn into invalid alarms, reducing noise for users and operations staff.
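
One simple way to keep single-point burrs from becoming alarms is to require that an abnormal state persists for several consecutive samples. The sketch below illustrates that idea only; the `debounce` function and its `min_run` parameter are hypothetical names of ours, and the source does not say how the operator itself recognizes burrs.

```python
def debounce(raw_flags, min_run=3):
    """Emit 1 only once the last `min_run` raw flags were all 1.

    Isolated one- or two-point spikes never reach `min_run`,
    so they are suppressed instead of raising an alarm.
    """
    out = []
    run = 0  # length of the current streak of raw anomaly flags
    for flag in raw_flags:
        run = run + 1 if flag else 0
        out.append(1 if run >= min_run else 0)
    return out


# A lone spike at index 1, then a sustained anomaly from index 4 to 7.
raw = [0, 1, 0, 0, 1, 1, 1, 1, 0]
print(debounce(raw))
```

The trade-off is latency: a real incident is reported `min_run - 1` samples later than it would be without debouncing.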

How does Alibaba Cloud Prometheus monitoring address these issues?

To address these issues, the detection configuration of Alibaba Cloud Prometheus monitoring supports not only native threshold-based detection but also template-based threshold configuration and intelligent detection operators.

Business Value 1: Efficient and high-quality alarm configuration

(1) For well-defined application scenarios, Alibaba Cloud Prometheus monitoring provides mature alarm configuration templates. Users do not need to set thresholds manually; they only select the matching template. For example, for machine metrics there is a template for "CPU usage > 80%". Templates address scenarios where the anomaly condition is clear and the business is relatively stable.

(2) For indicators whose normal range is unclear, or business indicators for which a threshold is hard to set, the intelligent detection operator is recommended.

For example, setting a threshold for an online user count indicator would otherwise require observing its historical curve for a long time. In this scenario, users can choose the intelligent detection operator directly.

Business Value 2: Adaptive tracking of business changes, greatly reducing detection threshold maintenance costs

The intelligent detection operator of Alibaba Cloud Prometheus monitoring takes a parameter that sets how much historical data the model refers to. This lets the model adaptively track changes in indicator trends, with no need for periodic manual review of configured rules.

Business Value 3: Intelligent detection also works for poor-quality indicators with many missing values or burrs

In the intelligent detection operator, if historical data is missing, the algorithm automatically fills the gaps using linear interpolation, polynomial interpolation, or similar methods. For curves that are not smooth, the operator also adaptively selects the model best suited to the scenario to preserve overall detection quality.
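
Linear interpolation, the first of the fill methods mentioned above, can be sketched as follows. The function name and the choice to mark missing samples with `None` are assumptions made for this example; how the operator represents gaps internally is not described in this article.

```python
def fill_missing_linear(values):
    """Fill None gaps by linear interpolation between the nearest known
    neighbours. Leading and trailing gaps copy the nearest known value."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no known values")

    filled = list(values)
    first, last = known[0], known[-1]

    # Gaps at the edges have only one neighbour, so extend that value.
    for i in range(first):
        filled[i] = values[first]
    for i in range(last + 1, len(values)):
        filled[i] = values[last]

    # Interior gaps: walk each pair of adjacent known points.
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


print(fill_missing_linear([None, 10, None, None, 40, None]))
```

Polynomial interpolation follows the same pattern but fits a curve through several neighbouring points instead of a straight line through two.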

How to apply in specific business scenarios

Sudden level increase/decrease indicators: the QPS indicator of a certain business

Early in the life of the business, observation may suggest that an upper threshold of 150 is sufficient. As the business iterates, however, the QPS indicator changes: it periodically jumps to a new level and then stabilizes there, so a static threshold struggles to keep meeting detection requirements. A stable indicator can also drop suddenly, and a static threshold with only an upper limit cannot detect such a drop. In these cases, the intelligent detection operator adaptively tracks changes in the business level and identifies sudden increases or decreases.

Periodic indicators:

In the indicator profiling module, if an indicator is identified as periodic, its period, period offset, and overall trend curve are extracted. After the periodic and trend components are removed from the original time series, the residual is used for anomaly detection. Taking the periodic indicator in the figure above as an example, the cycle at around 11.30 minutes differs significantly from the other cycles. Traditional static thresholds struggle in such scenarios, but the intelligent detection operator can identify this kind of anomaly.
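
The idea of detecting on the residual can be illustrated with a small sketch. It assumes the period is already known, estimates the periodic component as a per-phase median, and omits trend removal for brevity. The robust threshold based on the median absolute deviation (MAD) is our own choice for the example, not a description of the operator's model.

```python
from statistics import median


def seasonal_residual_flags(values, period, k=3.0):
    """Subtract a per-phase median profile, then flag residuals that lie
    more than k robust standard deviations (scaled MAD) from the median."""
    if len(values) < 2 * period:
        raise ValueError("need at least two full periods of data")

    # Periodic component: the typical value at each position in the cycle.
    profile = [median(values[p::period]) for p in range(period)]
    residuals = [v - profile[i % period] for i, v in enumerate(values)]

    center = median(residuals)
    mad = median(abs(r - center) for r in residuals)
    # 1.4826 * MAD approximates the standard deviation for normal data;
    # the tiny floor keeps the rule usable when residuals are all zero.
    scale = max(1.4826 * mad, 1e-9)
    return [1 if abs(r - center) > k * scale else 0 for r in residuals]


# Eight identical cycles, with one trough raised from 5 to 20.
cycle = [10, 20, 30, 20, 10, 5]
values = cycle * 8
values[29] = 20  # still below the normal peak of 30
flags = seasonal_residual_flags(values, period=6)
print("anomalous indexes:", [i for i, f in enumerate(flags) if f])
```

The altered point never exceeds the normal peak, so no single upper or lower static threshold would catch it, while the residual makes it stand out.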

Trend breaking indicators:

Another common type of anomaly occurs when an indicator trends steadily upward (or downward) for a period and then, at some point, the trend breaks abruptly, so that the local trend differs from the overall trend. This kind of anomaly is common, yet it is hard to catch with a static threshold. The intelligent detection operator can identify it accurately.
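
A trend break can be sketched by comparing the slope of the most recent points with the slope of the earlier history. The least-squares slope, the function names, and the `recent`, `tolerance`, and `min_slope` parameters are all illustrative assumptions of ours rather than the operator's actual method.

```python
def slope(values):
    """Least-squares slope of the values against their index."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def trend_break(values, recent=5, tolerance=0.5, min_slope=1e-9):
    """Return 1 if the slope of the last `recent` points reverses sign or
    differs from the historical slope by more than `tolerance` times its
    magnitude, else 0."""
    history, tail = values[:-recent], values[-recent:]
    if len(history) < 2 or len(tail) < 2:
        raise ValueError("need at least two points in each segment")
    s_hist, s_tail = slope(history), slope(tail)
    if s_hist * s_tail < 0:
        return 1  # local trend points the opposite way
    limit = tolerance * max(abs(s_hist), min_slope)
    return 1 if abs(s_tail - s_hist) > limit else 0


rising = [2 * i for i in range(20)]
broken = [2 * i for i in range(15)] + [28 - 3 * j for j in range(1, 6)]
print(trend_break(rising), trend_break(broken))
```

Note that every value in `broken` stays inside the range already seen in `rising`, which is why a static threshold has nothing to latch onto here.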

Best Practices

How to use the operator within Alibaba Cloud Prometheus monitoring

At present, Alibaba Cloud Prometheus monitoring already supports the intelligent detection operator. Simply log in to ARMS Prometheus or Grafana and enter the corresponding PromQL.

Operator Definition

Input: the time series of the indicator, of type range vector; and a detection parameter, with a default value of 3

Output: 1 if abnormal, 0 if normal

Use case:

anomaly_detect(node_memory_free_bytes[20m], 3)

1. The input must be a range vector, so a range selector such as [180m] must be added after the indicator name. The default time range is 180m, and the default parameter is 3.

2. If another aggregation or function is applied first, the subquery form [180m:] is required to turn the result into a range vector, for example anomaly_detect(sum(node_memory_free_bytes)[180m:], 3).
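
Besides the console and Grafana, the expression could also be evaluated from a script. The sketch below assumes that your instance exposes the standard Prometheus HTTP API instant-query endpoint (`/api/v1/query`) and that the operator returns an instant vector whose sample value is 1 or 0. The base URL is a placeholder, authentication is omitted, and the sample response is hand-written for illustration.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, expr):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": expr})


def anomalous_series(response_body):
    """Return the label sets of all series whose sample value is 1."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError("query failed: %s" % payload.get("error"))
    return [
        item["metric"]
        for item in payload["data"]["result"]
        if float(item["value"][1]) == 1.0  # sample values arrive as strings
    ]


# Placeholder endpoint; replace with the address of your own instance.
url = build_query_url(
    "http://prometheus.example:9090/",
    "anomaly_detect(node_memory_free_bytes[180m], 3)",
)

# Hand-written response in the instant-vector format, for illustration only.
sample_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "a"}, "value": [1700000000, "1"]},
            {"metric": {"instance": "b"}, "value": [1700000000, "0"]},
        ],
    },
})
print(url)
print(anomalous_series(sample_body))
```

In real use the body would come from an HTTP GET on `url`, for instance with `urllib.request.urlopen`, using whatever credentials your instance requires.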

Usage example:

Step 1: Log in to ARMS Prometheus or Grafana and select the corresponding Prometheus data source

Step 2: Select a metric and view it

Step 3: Enter the anomaly detection operator

About Prometheus - Intelligent Detection Operators

The intelligent detection operator of Alibaba Cloud Prometheus monitoring was designed by drawing on dozens of leading algorithm solutions in the industry. It builds indicator profiles for common indicator types and adaptively selects the best model for detection. When indicator data is fed in, the model first builds a profile of the indicator, covering stationarity, jitter, trend, periodicity, whether a special holiday or event is involved, and so on. Based on these profile features, the model adaptively selects the best single algorithm or combination of algorithms for the indicator, aiming for the best overall result. Currently supported functions include burst detection, burr detection, and cycle recognition (identifying periodicity and cycle offset).
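
To give a feel for one profile feature, periodicity, the sketch below estimates a cycle length from the autocorrelation of the series. It is a textbook technique swapped in for illustration. The function names and the `max_lag` and `min_corr` parameters are hypothetical, and nothing here is taken from the operator's implementation.

```python
def autocorr(values, lag):
    """Autocorrelation of the series with itself shifted by `lag` samples."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values)
    if var == 0:
        return 0.0  # a constant series has no periodic structure
    cov = sum((values[i] - mu) * (values[i + lag] - mu) for i in range(n - lag))
    return cov / var


def estimate_period(values, max_lag, min_corr=0.5):
    """Return the lag in [2, max_lag] with the highest autocorrelation,
    or None if no lag reaches `min_corr`."""
    best_lag, best_corr = None, min_corr
    for lag in range(2, max_lag + 1):
        corr = autocorr(values, lag)
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag


periodic = [10, 20, 30, 20, 10, 5] * 8
print(estimate_period(periodic, max_lag=10))
print(estimate_period([7] * 20, max_lag=10))
```

A profile built this way could then steer model choice, for example using residual-based detection when a period is found and a rolling baseline when it is not.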

By integrating the intelligent detection operator into Alibaba Cloud Prometheus monitoring, we hope to provide users with an out-of-the-box, continuously updated intelligent detection service. Users can already view and use the operator in Alibaba Cloud Prometheus monitoring; native ARMS configuration of intelligent detection alarms and dynamic display in Grafana will be launched in the near future.
