ApsaraDB for SelectDB integrates with Alibaba Cloud Application Real-Time Monitoring Service (ARMS) to provide built-in alerting. When a metric crosses its threshold, ARMS notifies every contact in the associated alert contact group.
For example, you might want an alert when:
- CPU utilization of a BE cluster stays above 80% for five minutes
- The 99th percentile query response time exceeds 60 seconds
- The query success rate drops below 90%
You can also manage alert rules in the CloudMonitor console. For more information, see Cloud service monitoring.
Prerequisites
Before you begin, ensure that you have:
- Created the AliyunServiceRoleForSelectDB service-linked role. The role has permission to access ARMS by default. For more information, see Manage the service-linked role for ApsaraDB for SelectDB.
- Activated ARMS, if you want to send monitoring data to the ARMS console for centralized alerting. For more information, see Activate ARMS.
How it works
Each alert rule is defined by three things:
- Condition — the metric and threshold that triggers the alert (for example, CPU utilization > 80%)
- Duration — how long the condition must hold before an alert event fires
- Notification — who to notify and how (via an alert contact group or a notification policy)
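All three parts come together in a single rule. The following is a minimal sketch of the CPU example above; the metric and label names are hypothetical, and the console generates the real expression for you:

```promql
# Condition: BE cluster CPU utilization above 80%.
# cpu_usage_percent and the cluster_id value are hypothetical, for illustration only.
avg(cpu_usage_percent{cluster_id="selectdb-cn-7213n****-be"}) > 80

# Duration: set on the rule itself (for example, "continuously for 5 minutes"),
# not inside the expression.
# Notification: an alert contact group or notification policy attached to the rule.
```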
Check types
Choose a check type based on what you want to monitor:
| Check type | When to use |
|---|---|
| Static threshold | Monitor a preset metric using a built-in semantic condition. Use this for common metrics such as CPU utilization, memory usage, and query latency. |
| Custom PromQL | Write a custom PromQL expression to monitor a non-preset metric or a complex condition. Use this when built-in metrics do not cover your requirements. |
Create an alert rule
1. Log on to the ApsaraDB for SelectDB console.
2. In the top navigation bar, select the region where your instance resides.
3. In the left-side navigation pane, click Instances. Find the target instance and click the instance ID.
4. In the left-side navigation pane, click Monitoring and Alerts.
5. Click the Alert Management tab, then click Create SelectDBAlert. It can take 3 to 5 minutes for the Create SelectDBAlert section to appear.
6. Configure the parameters. See the parameter tables below based on the check type you select.
7. Click Save. The alert rule takes effect automatically.
Parameters for static threshold rules
Use Static threshold to alert on preset metrics such as CPU utilization, memory usage, or query latency.
| Parameter | Description | Example |
|---|---|---|
| Alert rule name | A descriptive name for the alert rule. | CPU utilization alert |
| Check type | Select Static Threshold. | Static Threshold |
| Instance | The instance to monitor. Default: Traverse (applies to all instances). | selectdb-cn-7213n\*\*\*\* |
| Cluster | The cluster to monitor. Default: Traverse (applies to all clusters). | selectdb-cn-7213n\*\*\*\*-be |
| Alert contact group | The group that receives alert notifications. Available groups depend on the Prometheus instance type. | SelectDB |
| Alert metric | The metric to monitor. Available metrics depend on the selected alert contact group. | CPU usage rate |
| Alert condition | The threshold condition that triggers an alert event. | CPU utilization > 80% |
| Filter conditions | Narrows the scope of the alert rule to specific resources. | No Filter |
| Data preview | Displays the PromQL expression for the alert condition and plots metric values on a time series chart. The alert threshold appears as a red dashed line. Values above the threshold are dark red; values below are blue. Hover over the chart to inspect values at a specific point, or click and drag to zoom into a time range. | — |
| Duration | How long the condition must hold before an alert event fires. Select If the alert condition is met, an alert event is generated to fire immediately, or If the alert condition is met continuously for N minutes, an alert event is generated to require a sustained breach. | 1 |
| Alert level | The severity of the alert. Valid values: Default, P4, P3, P2, P1 (ascending severity; P1 is highest). Default: Default. | P2 |
| Alert message | The notification message sent when an alert fires. Supports Go template syntax for dynamic variables. Common variables: $labels.pod_name (pod name), $labels.cluster_id (cluster ID), $value (current metric value). | node: {{$labels.pod_name}} CPU usage rate {{$labels.metrics_params_opt_label_value}} {{$labels.metrics_params_value}}%, current value {{ printf "%.2f" $value }}% |
| Alert notification | The notification mode. Simple mode: configure notification objects, notification period, and resend settings. Standard mode: select or create a notification policy. For more information, see Create and manage a notification policy. | — |
| Advanced settings | Alert check cycle: how often the rule evaluates the condition. Default and minimum: 1 minute. Check when data is complete: whether to verify data completeness before evaluation. Default: Yes. Tags: labels for matching notification policies. Annotations: additional metadata for the alert. | Alert check cycle: 1 minute; Check when data is complete: Yes |
In Standard mode, an alert event can match, and therefore trigger, more than one notification policy if other policies use fuzzy matching.
Parameters for custom PromQL rules
Use Custom PromQL to monitor non-preset metrics or to write complex conditions using PromQL.
| Parameter | Description | Example |
|---|---|---|
| Alert rule name | A descriptive name for the alert rule. | Pod CPU utilization exceeds 80% |
| Check type | Select Custom PromQL. | Custom PromQL |
| Instance | The instance to monitor. | selectdb-cn-7213n\*\*\*\* |
| Cluster | The cluster to monitor. | selectdb-cn-7213n\*\*\*\*-be |
| Reference alert contact group | The group that receives alert notifications. Available groups depend on the Prometheus instance type. | SelectDB |
| Reference metrics | (Optional) Select a common metric to pre-populate the Custom PromQL statements field. Modify the template as needed. | 99th query response time |
| Custom PromQL statements | The PromQL expression that defines the alert condition. The reference metric template is a starting point — refine it for your requirements. | avg(doris_fe_query_latency_ms{quantile="0.99",pod=~"...",cluster_id=~"..."}) by (cluster_id) > 300 |
| Data preview | Displays the PromQL expression and plots metric values on a time series chart. The alert threshold appears as a red dashed line. | — |
| Duration | How long the condition must hold before an alert event fires. | 1 |
| Alert level | The severity of the alert. Valid values: Default, P4, P3, P2, P1 (ascending severity; P1 is highest). Default: Default. | Default |
| Alert message | The notification message sent when an alert fires. Supports Go template syntax. Common variables: $labels.namespace (namespace), $labels.pod_name (pod name), $labels.device (device), $value (current metric value). | Namespace: {{$labels.namespace}}/Pod: {{$labels.pod_name}}/Disk: {{$labels.device}} CPU utilization > 90%, current value {{ printf "%.2f" $value }}% |
| Alert notification | The notification mode. Simple mode or Standard mode (select or create a notification policy). | — |
| Advanced settings | Same as static threshold rules: alert check cycle, data completeness check, tags, and annotations. | Alert check cycle: 1 minute; Check when data is complete: Yes |
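As a concrete custom rule, the reference template from the table above can be completed into a working expression. This sketch assumes the metric name doris_fe_query_latency_ms shown in the table and uses the 60-second p99 threshold recommended later in this topic:

```promql
# Alert when the per-cluster p99 query latency exceeds 60 seconds (60,000 ms).
# Add pod or cluster_id matchers if you want to narrow the scope.
avg(doris_fe_query_latency_ms{quantile="0.99"}) by (cluster_id) > 60000
```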
Recommended alert configurations
Configure alert rules for these metrics to catch common issues early. The table below lists recommended thresholds and durations.
| Metric | Recommended threshold | Duration (minutes) | Notes |
|---|---|---|---|
| Average query time | > 5,000 ms | 5 | Average query latency. Adjust based on your workload. |
| 99th percentile query response time | > 60,000 ms | 5 | Long-tail query latency. Adjust based on your workload. |
| Query success rate | < 90% | 5 | SQL query success rate. |
| CPU usage | > 80% | 5 | BE cluster CPU utilization. |
| Memory usage | > 80% | 5 | BE cluster memory utilization. |
| FE CPU usage | > 60% | 15 | FE node CPU utilization. If CPU resources are insufficient, submit a ticket to apply for a free scale-out. |
| FE JVM memory usage | > 80% | 15 | FE cluster JVM memory usage. If memory resources are insufficient, submit a ticket for a free scale-out. |
| Number of failed nodes | > 0 | 1 | Number of underlying node restarts. |
| Base compaction score | > 1,500 | 15 | Higher values indicate greater data merge pressure on compute nodes. |
| Cumulative compaction score | > 1,500 | 15 | Higher values indicate greater data merge pressure on compute nodes. |
| Cache hit ratio | < 90% | 15 | A drop may indicate that a scale-out is needed. For more information, see Scale a cluster. |
| User connections | > 150 | 15 | Total active connections. The default connection limit is 200 per user. |
| QPS (queries per second) | — | — | Business metric. Configure based on your requirements. |
| Disk read IOPS | — | — | Underlying metric. Configure based on your requirements. |
| Disk write IOPS | — | — | Underlying metric. Configure based on your requirements. |
| Object storage capacity | — | — | Configure if you need to track object storage usage. |
| Data import speed | — | — | Configure if you need to track ingestion throughput. |
| Cache write throughput | — | — | Underlying metric. Configure based on your requirements. |
| Cache read throughput | — | — | Underlying metric. Configure based on your requirements. |
| Network inbound throughput | — | — | Underlying metric. Configure based on your requirements. |
| Network outbound throughput | — | — | Underlying metric. Configure based on your requirements. |
| Remote storage read throughput | — | — | Underlying metric. Configure based on your requirements. |
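Translated into custom PromQL conditions, two of the recommendations above might look like the following. The metric names here are assumptions for illustration; use the names shown in the console's Data preview panel or reference templates:

```promql
# Query success rate drops below 90% (metric name hypothetical;
# set Duration to 5 minutes in the rule).
query_success_rate_percent < 90

# Base compaction score exceeds 1,500, indicating data merge pressure
# on compute nodes (metric name hypothetical; set Duration to 15 minutes).
max(base_compaction_score) by (cluster_id) > 1500
```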