Application Real-Time Monitoring Service (ARMS) Alert Management provides features such as alert convergence, alert notification, and automatic escalation to help you efficiently identify and handle alerts. This topic describes the architecture and benefits of Alert Management.

Architecture

Alert Management provides the following modules: integration management, alert event management, notification policy management, collaborative alert handling, and alert handling analysis.

Integration management

Alert Management provides two integration types: default alert integration and third-party service integration.

Default alert integration

You can integrate Alert Management with the alerts of ARMS sub-services, such as Application Monitoring, Browser Monitoring, Prometheus Service, and Synthetic Monitoring. Default alert integrations run periodic tasks to check whether monitoring data contains errors. If the monitoring data contains errors, the corresponding alert events are reported to Event Management Center.

For information about how to create alert rules for each sub-service of ARMS, see the following topics:

Third-party service integration

You can integrate Alert Management with a third-party alert source by configuring simple settings. This is a one-stop solution to handle alerts generated by self-managed data centers or virtual machines in ARMS. If an alert is reported from a third-party alert source to Alert Management, an alert event is generated. Alert events have the following limits:

Data structure of an alert event

The data structure of an ARMS alert event is similar to the data structure of the notification templates of open source AlertManager. The data structure includes the following fields:

  • Labels: the metadata of the alert. A set of labels uniquely identifies an alert event. Alert events with the same labels are compressed into one event. Example: "alertname: alert name".
  • Annotations: the additional information of the alert. Annotations are not metadata. Example: "message: alert content".
  • StartsAt: the start time of the alert.
  • EndsAt: the end time of the alert.
  • GeneratorUrl: the URL of the alert event.
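As a rough illustration, the fields above could be modeled as follows. This is a minimal Python sketch, not the ARMS schema: the class name AlertEvent, the snake_case attribute names, and the sample values are assumptions made for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class AlertEvent:
    """Hypothetical in-memory model of an alert event (not an ARMS API type)."""
    labels: dict[str, str]  # metadata that identifies the event
    annotations: dict[str, str] = field(default_factory=dict)  # extra info, not identity
    starts_at: Optional[datetime] = None  # start time of the alert
    ends_at: Optional[datetime] = None    # end time of the alert, if it has ended
    generator_url: str = ""               # URL of the alert event


# Sample values are invented for illustration.
event = AlertEvent(
    labels={"alertname": "CPU utilization is too high", "ip": "192.168.0.3"},
    annotations={"message": "The CPU utilization is 85%."},
    starts_at=datetime(2024, 1, 1, 10, 0),
)
```

Keeping labels and annotations in separate mappings mirrors the split described in the field list: only labels take part in identifying the event.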

Difference between labels and annotations

  • A set of labels specifies an alert event. If a label changes, a new alert event is generated.

    Example:

    {"hostname":"Host", "alertname":"CPU utilization is too high", "ip":"192.168.0.3"}. This set of labels identifies an alert indicating that the CPU utilization of the host (IP address: 192.168.0.3) is too high. If the ip label changes to {"ip":"192.168.0.4"}, a new alert is generated. The new alert indicates that the CPU utilization of the host (IP address: 192.168.0.4) is too high.

  • Annotation changes do not create new alert events. If alert events with the same labels have different annotations, they are treated as one alert that is reported multiple times.

    Example:

    If the {"value":"85","message":"The CPU utilization of the host (IP address: 192.168.0.3) is 85%, which is higher than the threshold value 80%."} annotation changes to {"value":"86","message":"The CPU utilization of the host (IP address: 192.168.0.3) is 86%, which is higher than the threshold value 80%."}, no new alert is generated. The two events are treated as one alert that is reported twice.

Note You can configure labels as deduplication fields for an integration. When the integration reports an alert, Alert Management identifies a unique alert event based only on the deduplication fields. If you do not configure deduplication fields, Alert Management uses all labels to identify a unique alert event.
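The identification rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name event_key and the tuple-based key are hypothetical and do not describe how Alert Management computes identity internally.

```python
def event_key(labels, dedup_fields=None):
    """Return a hashable identity for an alert event (hypothetical helper).

    If dedup_fields is given, only those labels count toward identity;
    otherwise all labels do. Annotations are never part of the key.
    """
    if dedup_fields:
        selected = {name: labels.get(name, "") for name in dedup_fields}
    else:
        selected = labels
    return tuple(sorted(selected.items()))


a = {"hostname": "Host", "alertname": "CPU utilization is too high", "ip": "192.168.0.3"}
b = {**a, "ip": "192.168.0.4"}

# With all labels, the changed ip label yields a different event.
different_events = event_key(a) != event_key(b)

# With alertname as the only deduplication field, both map to the same event.
same_event = event_key(a, ["alertname"]) == event_key(b, ["alertname"])
```

Sorting the label pairs makes the key independent of the order in which labels are reported.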

Alert event management

The alert event management module provides the following methods to process alert events:

  • Use event processing flows to orchestrate simple procedures and process alert events that are reported by an alert source. This meets your specific requirements on event handling in various scenarios.
  • Deduplicate, compress, denoise, and silence alerts that are reported by an alert source. This converges alerts and reduces alert storms.

Event compression

By default, the alert event management module compresses events based on labels or time.

Label-based event compression
When ARMS sends alert notifications to contacts, alert events are compressed based on the event grouping settings specified in the notification policy. If multiple alert events contain the same labels, the events are automatically compressed into one alert event. In the following figure, alert events are compressed based on two labels. (Figure: Label-based event compression)
Time-based event compression
Each alert event contains the start time and end time of the alert. If alert events with the same labels have overlapping time ranges, the alert events are compressed into one alert event. The resulting alert event spans the union of the time ranges of the original alert events. (Figure: Time-based event compression)
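The two compression rules can be combined in a short Python sketch: events are first grouped by their labels, and overlapping time ranges within each group are then merged. The function name compress, the dictionary layout, and the use of plain integers as timestamps are assumptions for illustration. Treating ranges that merely touch as overlapping is also an assumption, not documented behavior.

```python
from collections import defaultdict


def compress(events):
    """Merge events that share labels and whose [start, end] ranges overlap."""
    by_labels = defaultdict(list)
    for ev in events:
        by_labels[tuple(sorted(ev["labels"].items()))].append(ev)

    merged = []
    for key, group in by_labels.items():
        group.sort(key=lambda ev: ev["start"])
        current = None
        for ev in group:
            if current is not None and ev["start"] <= current["end"]:
                # Overlap: extend the merged event to cover both ranges.
                current["end"] = max(current["end"], ev["end"])
            else:
                # No overlap: start a new merged event for this label set.
                current = {"labels": dict(key), "start": ev["start"], "end": ev["end"]}
                merged.append(current)
    return merged


cpu = {"alertname": "HighCPU", "ip": "192.168.0.3"}
events = [
    {"labels": cpu, "start": 0, "end": 10},
    {"labels": cpu, "start": 5, "end": 20},   # overlaps the first event
    {"labels": cpu, "start": 30, "end": 40},  # separate time range
]
result = compress(events)
```

In this example, the first two events collapse into one event that covers 0 to 20, and the third stays separate.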

Notification policy management

You can configure conditions in notification policies the same way you configure subscription rules. If an alert event meets the specified conditions, ARMS sends alert notifications based on the notification policy.
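Condition matching can be pictured with the following minimal Python sketch. It assumes that conditions are simple label equality checks and that a policy without conditions matches every event. The policy names, the channels field, and the matches helper are hypothetical; actual notification policies may support other operators.

```python
def matches(policy, event):
    """True if every condition (label == expected value) holds for the event."""
    return all(
        event["labels"].get(name) == expected
        for name, expected in policy["conditions"].items()
    )


# Hypothetical policies; names and channels are invented for this example.
policies = [
    {"name": "prod-oncall", "conditions": {"env": "prod"}, "channels": ["phone", "dingtalk"]},
    {"name": "all-email", "conditions": {}, "channels": ["email"]},
]

event = {"labels": {"alertname": "HighCPU", "env": "prod"}}
matched = [p["name"] for p in policies if matches(p, event)]
```

Here the event matches both policies, so both would send notifications under the assumptions above.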

The following figure shows the relationships among event handling flows, events, and notification policies. (Figure: Notification policy management)

Collaborative alert handling

You can configure multiple collaboration policies. Then, you can handle alerts in the ARMS console, DingTalk, WeCom, and Lark. You can also configure group message synchronization, scheduling management, and escalation policies. This way, the contacts in a team can collaboratively handle alerts. The following figure shows how to collaboratively handle an alert. (Figure: Collaborative alert handling) For more information, see Manage alerts in a DingTalk group.

Alert handling analysis

The alert handling analysis module records and visualizes the alert handling process. You can review how a single alert was handled, or analyze how multiple alerts were handled over a specific period of time to find bottlenecks in the handling process. This helps you optimize the handling process and improve handling efficiency. If you use the alert handling analysis module along with Grafana, Loki, or Prometheus, you can view handled and unhandled alerts in real time on a Grafana dashboard. This helps you monitor the status of system alerts.

The alert handling analysis module provides a wide range of service level objective (SLO) metrics. You can manage teams based on metrics such as the alert takeover rate, average takeover duration, and average handling duration. You can analyze the alert handling process based on labels in multiple dimensions, such as team, application, service, and environment. This way, you can manage teams on a large scale.
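To make the three metrics concrete, the following Python sketch computes them from a small list of alert records. The metric definitions used here are assumptions, not the official ARMS formulas: takeover rate is the share of alerts that were claimed, takeover duration is the time from creation to claim, and handling duration is the time from creation to resolution. The record fields and the handling_metrics function are hypothetical, and timestamps are plain minute counts.

```python
from statistics import mean


def handling_metrics(alerts):
    """Compute assumed SLO metrics; needs at least one claimed and one resolved alert."""
    claimed = [a for a in alerts if a["claimed"] is not None]
    resolved = [a for a in alerts if a["resolved"] is not None]
    return {
        "takeover_rate": len(claimed) / len(alerts),
        "avg_takeover": mean(a["claimed"] - a["created"] for a in claimed),
        "avg_handling": mean(a["resolved"] - a["created"] for a in resolved),
    }


alerts = [
    {"created": 0, "claimed": 5, "resolved": 30},
    {"created": 10, "claimed": 25, "resolved": 70},
    {"created": 20, "claimed": None, "resolved": None},  # never claimed
]
metrics = handling_metrics(alerts)
```

Grouping the records by a label such as team or environment before calling the function would give the per-dimension view described above.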

Benefits

If you deploy your services in Alibaba Cloud and use ARMS to monitor your services, you can use Alert Management to handle alerts. Alert Management improves O&M efficiency in the following ways:

  • Alerting can be globalized.
    • You can globalize alert rule templates to configure alerting for global events.
    • You can globalize contacts and notification policies by configuring simple settings.
      Note You cannot send alert notifications by phone on the International site.
  • Events are collected from different monitoring services for higher management efficiency.
    • You can integrate Alert Management with common monitoring services of Alibaba Cloud. You can also integrate Alert Management with third-party monitoring services for centralized management.
    • Alert Management provides stable alert event handling capabilities. You can handle alert events 24/7.
    • Alert Management ensures low latency for handling a large number of alert events.
  • You can send alert notifications to contacts in a timely manner.
    • You can configure notification policies and compress alert events. This reduces the O&M workloads.
    • You can select one or more notification methods based on the urgency of an alert. For example, you can send alert notifications to contacts by email, SMS, phone call, or DingTalk to remind the contacts to handle the alert.
    • You can configure an escalation policy to send notifications to contacts multiple times if an alert remains unhandled for a long period of time.
  • Alerts can be managed in an efficient manner.
    • Contacts can use DingTalk to handle alerts anytime.
    • Alerts use a common format, which allows contacts to better analyze alerts.
    • Multiple contacts can work together through DingTalk to handle alerts.
  • Statistics on alerts are collected in real time to analyze how alerts are handled. This allows you to handle alerts in a more efficient manner.