The alarm service checks metrics against alarm rules so that you can detect metric exceptions and troubleshoot faults quickly.
- Products: the product to monitor, such as ECS, RDS, or OSS.
- Resource Range: the scope to which an alarm rule applies. Three ranges are available: All Resources, Application Group, and Instances. When you set Resource Range to All Resources, alarms can be reported for up to 1,000 resources. If you have more than 1,000 resources, alarms are not reported for some of them even if they exceed the threshold set in your alarm rules. To avoid this issue, we recommend that you use application groups to divide resources by service before you set up alarm rules.
  - All Resources: the alarm rule applies to all instances under your account. For example, if you set the resource range to All Resources and set the alarm threshold for MongoDB CPU usage to 80%, an alarm is triggered when the CPU usage of any MongoDB instance exceeds 80%.
  - Application Group: the alarm rule applies to all instances in an application group. For example, if you set the resource range to Application Group and set the alarm threshold for host CPU usage to 80%, an alarm is triggered when the CPU usage of any host in that application group exceeds 80%.
  - Instances: the alarm rule applies only to the specified instance. For example, if you set the resource range to Instances and set the alarm threshold for host CPU usage to 80%, an alarm is triggered when the CPU usage of the specified instance exceeds 80%.
- Alarm Rule: the alarm rule name.
- Rule Describe: the main content of the alarm rule, where you define the condition (the metric and its threshold) that triggers an alarm. For example, if you describe the rule as 1-minute average CPU usage >= 90%, the alarm service checks every minute whether the average CPU usage over the past minute is 90% or higher.
Consider the following example. In host monitoring, each metric of a single server reports one data point every 15 seconds, which amounts to 20 data points in five minutes. The following alarm rules are then interpreted as described.
- 5-minute average CPU usage > 90%: the average of the 20 CPU usage data points in the five-minute window exceeds 90%.
- 5-minute CPU usage always > 90%: all 20 CPU usage data points in the five-minute window exceed 90%.
- 5-minute CPU usage once > 90%: at least one of the 20 CPU usage data points in the five-minute window exceeds 90%.
- Total 5-minute Internet outbound traffic > 50 MB: the sum of the 20 outbound traffic data points in the five-minute window exceeds 50 MB.
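
The four rule forms above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the alarm service's implementation: the function names and the sample values are hypothetical, and the sketch assumes one data point every 15 seconds as in the example.

```python
from statistics import mean

INTERVAL_SECONDS = 15                           # reporting interval from the example above
POINTS_PER_WINDOW = 5 * 60 // INTERVAL_SECONDS  # 20 data points in five minutes


def average_exceeds(points, threshold):
    """'5-minute average > threshold': the mean of all points exceeds it."""
    return mean(points) > threshold


def always_exceeds(points, threshold):
    """'5-minute always > threshold': every point exceeds it."""
    return all(p > threshold for p in points)


def once_exceeds(points, threshold):
    """'5-minute once > threshold': at least one point exceeds it."""
    return any(p > threshold for p in points)


def total_exceeds(points, threshold):
    """'Total 5-minute > threshold': the sum of all points exceeds it."""
    return sum(points) > threshold


# Hypothetical samples: CPU usage (%) with a single spike, and outbound traffic (MB) per point.
cpu = [85.0] * 19 + [95.0]
traffic_mb = [3.0] * POINTS_PER_WINDOW

print(average_exceeds(cpu, 90))       # False: the average is 85.5
print(always_exceeds(cpu, 90))        # False: 19 points are below 90
print(once_exceeds(cpu, 90))          # True: one point is 95
print(total_exceeds(traffic_mb, 50))  # True: 20 x 3 MB = 60 MB
```

The same window of data can therefore satisfy one rule form and not another, so the aggregation you choose matters as much as the threshold value.
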
- Triggered when threshold is exceeded for: the number of consecutive checks in which the detected value must meet the alarm rule threshold before an alarm notification is sent.
- Effective Period: the period of time for which an alarm rule is valid. The alarm service checks metrics and determines whether to generate an alarm only during this period of time.
- Notification Contact: a group of contacts who receive alarm notifications.
- Notification Methods: The available notification methods depend on the alarm level. Three alarm levels are available: Critical, Warning, and Info.
  - Critical: voice calls, SMS messages, emails, and DingTalk chatbot
  - Warning: SMS messages, emails, and DingTalk chatbot
  - Info: emails and DingTalk chatbot
- Email Remark: supplementary information customized for an alarm email. Remarks are sent as part of the alarm notification email.
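
As a rough sketch of how the consecutive-trigger count, the effective period, and the alarm levels fit together, consider the Python below. It is not the alarm service's API: the names `should_notify`, `in_effective_period`, and `NOTIFICATION_METHODS` are hypothetical, and the effective period is assumed to start and end on the same day.

```python
from datetime import time

# Notification methods per alarm level, as listed above.
NOTIFICATION_METHODS = {
    "Critical": ["voice call", "SMS message", "email", "DingTalk chatbot"],
    "Warning": ["SMS message", "email", "DingTalk chatbot"],
    "Info": ["email", "DingTalk chatbot"],
}


def in_effective_period(now, start, end):
    """Return True if `now` lies within the rule's effective period.

    Assumption: the period does not wrap past midnight (start <= end).
    """
    return start <= now <= end


def should_notify(breaches, required_consecutive):
    """Return True if the latest `required_consecutive` checks all met the threshold.

    `breaches` holds one boolean per check, oldest first.
    """
    if required_consecutive <= 0 or len(breaches) < required_consecutive:
        return False
    return all(breaches[-required_consecutive:])


# Hypothetical rule: notify after 3 consecutive breaches, effective 09:00 to 18:00, level Warning.
checks = [False, True, True, True]
if in_effective_period(time(10, 30), time(9, 0), time(18, 0)) and should_notify(checks, 3):
    print(NOTIFICATION_METHODS["Warning"])  # ['SMS message', 'email', 'DingTalk chatbot']
```

Requiring several consecutive breaches filters out short spikes, at the cost of a later notification when a fault is real.
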