Anomaly detection for metric data based on dynamic thresholds - Application Real-Time Monitoring Service

To detect anomalies and configure alerting for metrics whose values fluctuate even in a normal state, such as the response time (RT) and queries per second (QPS), we recommend that you enable dynamic thresholds, which adapt to different periods of time. Anomaly detection based on dynamic thresholds is mainly used to monitor metrics whose trends are stable. If a metric value falls outside the dynamic thresholds, the system generates an exception event.

Scenarios

Application performance monitoring: monitors the key metrics of a website or service, such as the response time and request speed. If the response time of a service suddenly exceeds the dynamic thresholds, the system immediately issues an exception warning. This enables website administrators to quickly locate and solve the problem.
Server resource optimization: monitors the CPU utilization and memory usage of a server. If the resource usage of a server continuously exceeds the dynamic thresholds, the system automatically generates an exception event. This helps you adjust resource allocation in a timely manner to prevent system crashes.
Application connection pool analysis: monitors key metrics, such as the query speed and the number of concurrent connections. If some metrics of a thread exceed the dynamic thresholds, the system automatically triggers an exception event so that you can optimize program performance in a timely manner.
Microservice model monitoring: monitors resource usage and response performance of each microservice. The interactions and dependencies among microservices are complex. With dynamic thresholds, if an exception occurs in a microservice, you can quickly locate the problem to ensure the stability of the entire microservice system.

Example:

Assume that the normal page view count of a website from 10:00 to 18:00 is greater than 1,000. If the page view count is still greater than 1,000 from 22:00 to 06:00, the website is likely under attack. In this case, the expected data range of the page view count changes over time. If you configure a static threshold of 1,000, you receive alert notifications when the page view count drops below 1,000 during the day. However, if the website is attacked at night, no alert is triggered. In this case, you can use dynamic thresholds to intelligently update the data range and detect anomalies.
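The difference between the two approaches in this example can be sketched in a few lines of Python. This is only an illustration: the function names and the numeric ranges in expected_range are assumptions made up for the sketch, not values that ARMS uses.

```python
from datetime import time
from typing import Tuple

STATIC_MIN = 1000  # static rule from the example: alert when page views < 1,000


def static_alert(page_views: int) -> bool:
    """Static rule: alert whenever page views drop below the fixed threshold."""
    return page_views < STATIC_MIN


def expected_range(t: time) -> Tuple[int, int]:
    """Hypothetical time-dependent range (illustrative numbers only)."""
    if time(10, 0) <= t < time(18, 0):
        return (1000, 5000)  # daytime: high traffic is normal
    return (0, 300)          # other hours: low traffic is normal


def dynamic_alert(t: time, page_views: int) -> bool:
    """Dynamic rule: alert when the value leaves the range expected at that time."""
    low, high = expected_range(t)
    return not (low <= page_views <= high)


# A burst of 1,800 page views at 23:30 passes the static rule unnoticed
# but falls outside the range expected at night.
night_time, night_views = time(23, 30), 1800
static_result = static_alert(night_views)
dynamic_result = dynamic_alert(night_time, night_views)
```

In this sketch, the static rule stays silent for the night-time burst because the value is above 1,000, whereas the time-dependent range flags it.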

Prerequisites

The data of the application that you want to monitor is reported to Managed Service for OpenTelemetry. For more information, see Connection Description.

Procedure

Log on to the ARMS console.
In the left-side navigation pane, choose Application Monitoring > Application Monitoring Alert Rules.
On the Application Monitoring Alert Rules page, choose Create Alert Rule > Managed Service for OpenTelemetry Alert Rule.
On the Create Alert Rule page, enter an alert rule name and select Interval Detection for the Alert Detection Type parameter.

In the Alert Contact section, specify the application, metric type, and filter condition based on your business requirements.

Parameter	Description
Select Applications	The application that you want to monitor. You can select only one application for anomaly detection based on dynamic thresholds.
Metric Type	The type of the metric that you want to detect. For more information, see Alert rule metrics. After you select a metric, the system automatically calculates the upper and lower threshold boundaries and renders the metric in real time. You can preview the metric trends in the Alert Condition section. Note: The valid values of the Alert Condition and Filter Condition parameters vary based on the value of the Metric Type parameter. The initial rendering takes about 2 to 4 seconds. For more information about how the upper and lower threshold boundaries are calculated, see the Threshold calculation section of this topic.
Filter Condition	The method that is used to filter the metrics for which alerts are generated. This helps narrow down the monitoring scope. Valid values: Traverse: traverses all values of the specified metric type and displays the metric values that trigger the alert in the alert notification. No Dimension: aggregates all values of the specified metric type and displays the sum in the alert notification. =: filters the values of the specified metric type and displays only the data of the specified metric values in the alert notification. !=: filters the values of the specified metric type and displays only the data of the metric values that are not equal to the specified metric values in the alert notification. Contain: filters the values of the specified metric type and displays only the data of the metric values that contain the specified values in the alert notification. Do Not Contain: filters the values of the specified metric type and displays only the data of the metric values that do not contain the specified values in the alert notification. Match Regular Expression: filters the values of the specified metric type and displays only the data of the metric values that match the specified regular expression in the alert notification.
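The semantics of the Filter Condition values can be approximated with the following Python sketch. The select function, the "*" key used for the aggregated result, and the sample data are hypothetical. Whether ARMS treats a regular expression as a partial or a full match is not stated in this topic, so the sketch assumes a partial match.

```python
import re
from typing import Dict


def matches(op: str, value: str, target: str) -> bool:
    """Evaluate one filter operator against a single dimension value."""
    if op == "=":
        return value == target
    if op == "!=":
        return value != target
    if op == "Contain":
        return target in value
    if op == "Do Not Contain":
        return target not in value
    if op == "Match Regular Expression":
        # Assumption: a partial match is sufficient.
        return re.search(target, value) is not None
    raise ValueError(f"unknown filter operator: {op}")


def select(series: Dict[str, float], op: str, target: str = "") -> Dict[str, float]:
    """Return the per-dimension values that the alert rule would evaluate."""
    if op == "Traverse":
        return dict(series)                   # every value is checked separately
    if op == "No Dimension":
        return {"*": sum(series.values())}    # one aggregated value
    return {k: v for k, v in series.items() if matches(op, k, target)}


# Hypothetical call counts per interface name.
calls = {"/api/order": 120.0, "/api/user": 80.0, "/health": 5.0}
api_only = select(calls, "Contain", "/api")
total = select(calls, "No Dimension")
```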

In the Alert rules section, configure the Alert Condition parameter.

Parameter	Description
Alert trigger mode	Valid value: Single condition.
Alert Condition	The alert condition. The following elements are required: Last X minutes: the time period of monitoring. Maximum value: 60. Data: the data that you want to monitor. You can specify various data types, such as the number of calls or the response time. Calculation method: the method used to calculate data. You can specify various calculation methods such as calculating the average value, maximum value, or minimum value based on the metric and data type. Comparison method: the method used to compare calculated data to find anomalies. Valid values: Outside the range of the dynamic threshold: automatically calculates the upper and lower threshold boundaries during the specified time period. If a data point falls outside the range, the data is abnormal and an alert is triggered. Larger than the maximum value of the dynamic threshold: automatically calculates the upper and lower threshold boundaries during the specified time period. If a data point is larger than the upper boundary, the data is abnormal and an alert is triggered. Lower than the minimum value of the dynamic threshold: automatically calculates the upper and lower threshold boundaries during the specified time period. If a data point is less than the lower boundary, the data is abnormal and an alert is triggered. Alert level: the severity of the alert. Valid values: P1, P2, P3, and P4. In the data preview section, the color blue represents data points and the color green specifies an allowed data range.
Tolerance	The tolerance value determines the width of the allowed data range. A higher tolerance value results in a wider data range, so alerts are less likely to be triggered. A lower tolerance value results in a narrower data range, so alerts are more likely to be triggered.
Alert Quantity Prediction	You can view the number of alerts that are expected to be triggered within the specified time period. You can also click the number to query the data that is expected to trigger the alerts at the historical points in time. Each time you create or modify an alert rule, we recommend that you use the alert prediction feature. This feature uses a detection algorithm to analyze historical data and predict the number of alerts within the specified time period. Then, you can adjust thresholds based on the prediction results. For more information, see the Alert quantity prediction section of this topic.
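The three comparison methods and the effect of the tolerance value can be illustrated as follows. This is a minimal sketch: the method names are shortened, and the way widen scales the range around its midpoint is an assumption for illustration only, because this topic does not describe how ARMS applies the tolerance value.

```python
from typing import Tuple


def is_anomalous(value: float, lower: float, upper: float, method: str) -> bool:
    """Apply one of the three comparison methods to a single data point."""
    if method == "outside":   # Outside the range of the dynamic threshold
        return value < lower or value > upper
    if method == "above":     # Larger than the maximum value of the dynamic threshold
        return value > upper
    if method == "below":     # Lower than the minimum value of the dynamic threshold
        return value < lower
    raise ValueError(f"unknown comparison method: {method}")


def widen(lower: float, upper: float, tolerance: float) -> Tuple[float, float]:
    """Hypothetical tolerance model: scale the half-width around the midpoint."""
    mid = (lower + upper) / 2
    half = (upper - lower) / 2 * tolerance
    return (mid - half, mid + half)


# A response time of 120 ms against a predicted range of 50 ms to 100 ms.
strict = is_anomalous(120, 50, 100, "outside")
# Doubling the tolerance widens the range to 25 ms to 125 ms.
loose_lower, loose_upper = widen(50, 100, 2.0)
relaxed = is_anomalous(120, loose_lower, loose_upper, "outside")
```

With the narrower range the data point triggers an alert. With the wider range it does not, which matches the behavior described for the Tolerance parameter.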

Configure the parameters in the Alert Notification and Advanced Alert Settings sections.

Parameter	Description
Alert Notification
Simple Mode	Notification Objects: creates notification objects. For more information, see Notification objects. Notification Period: specifies the time period to send alert notifications. Whether to Resend Notifications: If you do not use an escalation policy, the alert notification is sent only once before an alert is cleared. If you need repeat notifications, specify the interval to resend alert notifications. Alert notifications are continuously sent at the specified interval until the alert is cleared.
Standard Mode	Notification Policy: Do Not Specify Notification Policy: If alerts are triggered, no notification is sent. Notifications are sent only if the matching rules of a notification policy are triggered. Specify a notification policy: Application Real-Time Monitoring Service (ARMS) sends notifications by using the notification method specified in the notification policy. You can select an existing notification policy or create a notification policy. For more information, see Create and manage a notification policy.
Advanced Alert Settings
No data	This parameter is used to fix data anomalies, such as no data, abnormal composite metrics, and abnormal period-over-period comparison results. If data anomalies can be fixed, the alert data is automatically changed to 0 or 1, or the alert is not triggered. For more information, see Terms.

Click Save.

Threshold calculation

The dynamic thresholds of ARMS are mainly developed based on the Prophet algorithm. After dynamic thresholds are enabled, ARMS analyzes the historical data of the last 7 days every 24 hours, extracts the trend and seasonality, and then draws a trend chart for the predicted data in the next 24 hours. At the same time, an expected data range is calculated based on the fluctuations of the metric. When you configure dynamic thresholds, you can preview the upper and lower boundaries calculated by the algorithm. In the following figure, the color blue represents data points, and the color green specifies an allowed data range.

Unlike static thresholds, dynamic thresholds do not need to be updated by manually editing alert rules, even if the expected data range of a metric changes over time. This is because ARMS analyzes metric trends once a day and predicts the upper and lower boundaries only for the next day.
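The daily cycle described above (7 days of history in, 24 hours of boundaries out) can be illustrated with a much simpler stand-in for the forecasting step. The following sketch is not Prophet and is not the ARMS implementation. It assumes hourly data points that start at hour 0 of the first day, and it derives the range for each hour from the mean and standard deviation of the same hour on the previous 7 days. The function name daily_bands and the multiplier k are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def daily_bands(history: List[float], k: float = 3.0) -> List[Tuple[float, float]]:
    """Return 24 (lower, upper) pairs for the next day.

    history holds 7 * 24 hourly values, oldest first, starting at hour 0.
    """
    if len(history) != 7 * 24:
        raise ValueError("expected 168 hourly data points")
    bands = []
    for hour in range(24):
        same_hour = history[hour::24]   # the 7 observations for this hour of day
        centre = mean(same_hour)        # crude stand-in for trend plus seasonality
        spread = pstdev(same_hour)      # fluctuation of the metric at this hour
        bands.append((centre - k * spread, centre + k * spread))
    return bands


# Synthetic history: 1,000 page views per hour from 10:00 to 18:00, 100 otherwise.
history = [1000.0 if 10 <= h % 24 < 18 else 100.0 for h in range(7 * 24)]
bands = daily_bands(history)
noon_band = bands[12]
night_band = bands[23]
```

Running such a calculation once a day on the most recent 7 days is what allows the boundaries to follow the metric without any manual change to the alert rule.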

Alert quantity prediction

The alert quantity prediction feature uses an algorithm to analyze historical data, displays the points in time at which alerts would have been triggered, and then predicts the number of alerts within a specified period of time. The feature helps you configure static thresholds or tune the alert sensitivity of dynamic thresholds.

Implementation

Based on metric data in the last 24 hours, ARMS calculates the number of times that each threshold of a metric is exceeded to predict the quantity of alerts in the future. In addition, ARMS provides the metric details, including the specific time when each threshold is exceeded. You can adjust thresholds based on your business requirements.
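The counting step described above can be sketched as follows. This is an illustration under stated assumptions, not the ARMS implementation: the function name predict_alerts, the data layout, and the "greater than" comparison are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

Point = Tuple[datetime, float]


def predict_alerts(
    points: List[Point], thresholds: List[float], now: datetime
) -> Dict[float, List[datetime]]:
    """For each candidate threshold, list when it was exceeded in the last 24 hours."""
    window_start = now - timedelta(hours=24)
    recent = [(ts, v) for ts, v in points if window_start <= ts <= now]
    return {t: [ts for ts, v in recent if v > t] for t in thresholds}


# Hypothetical response-time samples (ms); the last one is older than 24 hours.
now = datetime(2024, 1, 2, 0, 0)
samples = [
    (now - timedelta(hours=1), 900.0),
    (now - timedelta(hours=5), 1200.0),
    (now - timedelta(hours=10), 1500.0),
    (now - timedelta(hours=30), 2000.0),
]
hits = predict_alerts(samples, [1000.0, 1400.0], now)
counts = {threshold: len(times) for threshold, times in hits.items()}
```

Comparing the counts for several candidate thresholds shows how raising or lowering a threshold would have changed the number of alerts over the same period.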