ApsaraDB for SelectDB: Configure an alert rule

Last Updated: Mar 28, 2026

ApsaraDB for SelectDB integrates with Alibaba Cloud Application Real-Time Monitoring Service (ARMS) to provide built-in alerting. When a metric crosses its threshold, ARMS notifies every contact in the associated alert contact group.

For example, you might want an alert when:

  • CPU utilization of a BE cluster stays above 80% for five minutes

  • The 99th percentile query response time exceeds 60 seconds

  • The query success rate drops below 90%

You can also manage alert rules in the CloudMonitor console. For more information, see Cloud service monitoring.

Prerequisites

Before you begin, ensure that you have:

How it works

Each alert rule is defined by three things:

  • Condition — the metric and threshold that triggers the alert (for example, CPU utilization > 80%)

  • Duration — how long the condition must hold before an alert event fires

  • Notification — who to notify and how (via an alert contact group or a notification policy)
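
The interaction between condition and duration can be sketched as follows. This is only an illustrative model, not the ARMS implementation: the `AlertRule` class and `fire_points` function are hypothetical names, and the sketch assumes one metric sample per check cycle, so a duration of five checks at a 1-minute cycle approximates "five minutes".

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class AlertRule:
    """Hypothetical in-memory model of an alert rule (not an ARMS API)."""
    name: str
    threshold: float       # condition: the metric value must exceed this
    duration_checks: int   # duration: consecutive breaching checks required
    contact_group: str     # notification: who is notified


def fire_points(rule: AlertRule, samples: Iterable[float]) -> List[int]:
    """Return the sample indexes at which the condition has held long enough."""
    fired = []
    streak = 0
    for i, value in enumerate(samples):
        # Any sample at or below the threshold resets the streak.
        streak = streak + 1 if value > rule.threshold else 0
        if streak >= rule.duration_checks:
            fired.append(i)
    return fired


rule = AlertRule("CPU utilization alert", threshold=80.0,
                 duration_checks=5, contact_group="SelectDB")
# Assumed data: one CPU sample per minute; the dip at index 2 resets the streak.
cpu = [85, 90, 70, 82, 84, 88, 91, 86, 60]
print(fire_points(rule, cpu))  # [7]
```

In this sample the rule fires only at index 7, the fifth consecutive sample above 80%. A short spike that does not last the full duration produces no alert event.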

Check types

Choose a check type based on what you want to monitor:

| Check type | When to use |
| --- | --- |
| Static threshold | Monitor a preset metric using a built-in semantic condition. Use this for common metrics such as CPU utilization, memory usage, and query latency. |
| Custom PromQL | Write a custom PromQL expression to monitor a non-preset metric or a complex condition. Use this when built-in metrics do not cover your requirements. |

Create an alert rule

  1. Log on to the ApsaraDB for SelectDB console.

  2. In the top navigation bar, select the region where your instance resides.

  3. In the left-side navigation pane, click Instances. Find the target instance and click the instance ID.

  4. In the left-side navigation pane, click Monitoring and Alerts.

  5. Click the Alert Management tab, then click Create SelectDBAlert.

    After you click Create SelectDBAlert, the Create SelectDBAlert section takes 3 to 5 minutes to appear.

  6. Configure the parameters. See the parameter tables below based on the check type you select.

  7. Click Save. The alert rule automatically takes effect.

Parameters for static threshold rules

Use Static threshold to alert on preset metrics such as CPU utilization, memory usage, or query latency.

| Parameter | Description | Example |
| --- | --- | --- |
| Alert rule name | A descriptive name for the alert rule. | CPU utilization alert |
| Check type | Select Static Threshold. | Static Threshold |
| Instance | The instance to monitor. Default: Traverse (applies to all instances). | selectdb-cn-7213n\*\*\*\* |
| Cluster | The cluster to monitor. Default: Traverse (applies to all clusters). | selectdb-cn-7213n\*\*\*\*-be |
| Alert contact group | The group that receives alert notifications. Available groups depend on the Prometheus instance type. | SelectDB |
| Alert metric | The metric to monitor. Available metrics depend on the selected alert contact group. | CPU usage rate |
| Alert condition | The threshold condition that triggers an alert event. | CPU utilization > 80% |
| Filter conditions | Narrows the scope of the alert rule to specific resources. | No Filter |
| Data preview | Displays the PromQL expression for the alert condition and plots metric values on a time series chart. The alert threshold appears as a red dashed line. Values above the threshold are dark red; values below are blue. Hover over the chart to inspect values at a specific point, or click and drag to zoom into a time range. | |
| Duration | How long the condition must hold before an alert event fires. Select "If the alert condition is met, an alert event is generated" to fire immediately, or "If the alert condition is met continuously for N minutes, an alert event is generated" to require a sustained breach. | 1 |
| Alert level | The severity of the alert. Valid values: Default, P4, P3, P2, P1 (ascending severity; P1 is highest). Default: Default. | P2 |
| Alert message | The notification message sent when an alert fires. Supports Go template syntax for dynamic variables. Common variables: $labels.pod_name (pod name), $labels.cluster_id (cluster ID), $value (current metric value). | node: {{$labels.pod_name}} CPU usage rate {{$labels.metrics_params_opt_label_value}} {{$labels.metrics_params_value}}%, current value{{ printf "%.2f" $value }}% |
| Alert notification | The notification mode. Simple mode: configure notification objects, notification period, and resend settings. Standard mode: select or create a notification policy. For more information, see Create and manage a notification policy. | |
| Advanced settings | Alert check cycle: how often the rule evaluates the condition. Default and minimum: 1 minute. Check when data is complete: whether to verify data completeness before evaluation. Default: Yes. Tags: labels for matching notification policies. Annotations: additional metadata for the alert. | Alert check cycle: 1 minute; Check when data is complete: Yes |

Important

In Standard mode, alert events may match multiple notification policies if other policies use fuzzy matching. One alert event can trigger more than one notification policy.
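
To make the multi-match behavior concrete, the following sketch shows how one alert event can satisfy several policies when some of them match tags with a regular expression. It is a simplified illustration, not how ARMS evaluates policies: the policy names, the tag names `service` and `severity`, and the `matching_policies` function are all hypothetical.

```python
import re
from typing import Dict, List

# Hypothetical policies: each maps a tag name to a regular expression
# that the event's tag value must match in full.
POLICIES: Dict[str, Dict[str, str]] = {
    "selectdb-oncall": {"service": r"selectdb"},                    # exact match
    "all-databases": {"service": r".*db"},                          # fuzzy match
    "p1-only": {"service": r"selectdb", "severity": r"P1"},
}


def matching_policies(event_tags: Dict[str, str]) -> List[str]:
    """Return every policy whose matchers all accept the event's tags."""
    matched = []
    for name, matchers in POLICIES.items():
        if all(re.fullmatch(pattern, event_tags.get(tag, ""))
               for tag, pattern in matchers.items()):
            matched.append(name)
    return matched


event = {"service": "selectdb", "severity": "P2"}
print(matching_policies(event))  # ['selectdb-oncall', 'all-databases']
```

Here the fuzzy `.*db` pattern also accepts the event, so two policies would send notifications. Narrowing the pattern, or adding a more specific tag to the alert rule, is one way to avoid duplicate notifications.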

Parameters for custom PromQL rules

Use Custom PromQL to monitor non-preset metrics or to write complex conditions using PromQL.

| Parameter | Description | Example |
| --- | --- | --- |
| Alert rule name | A descriptive name for the alert rule. | Pod CPU utilization exceeds 80% |
| Check type | Select Custom PromQL. | Custom PromQL |
| Instance | The instance to monitor. | selectdb-cn-7213n\*\*\*\* |
| Cluster | The cluster to monitor. | selectdb-cn-7213n\*\*\*\*-be |
| Reference alert contact group | The group that receives alert notifications. Available groups depend on the Prometheus instance type. | SelectDB |
| Reference metrics | (Optional) Select a common metric to pre-populate the Custom PromQL statements field. Modify the template as needed. | 99th query response time |
| Custom PromQL statements | The PromQL expression that defines the alert condition. The reference metric template is a starting point; refine it for your requirements. | avg(doris_fe_query_latency_ms{quantile="0.99",pod=~,cluster_id=~}) by (cluster_id) > 300 |
| Data preview | Displays the PromQL expression and plots metric values on a time series chart. The alert threshold appears as a red dashed line. | |
| Duration | How long the condition must hold before an alert event fires. | 1 |
| Alert level | The severity of the alert. Valid values: Default, P4, P3, P2, P1 (ascending severity; P1 is highest). Default: Default. | Default |
| Alert message | The notification message sent when an alert fires. Supports Go template syntax. Common variables: $labels.namespace (namespace), $labels.pod_name (pod name), $labels.device (device), $value (current metric value). | Namespace: {{$labels.namespace}}/Pod: {{$labels.pod_name}}/Disk: {{$labels.device}} CPU utilization > 90%, current value {{ printf "%.2f" $value }}% |
| Alert notification | The notification mode. Simple mode or Standard mode (select or create a notification policy). | |
| Advanced settings | Same as static threshold rules: alert check cycle, data completeness check, tags, and annotations. | Alert check cycle: 1 minute; Check when data is complete: Yes |
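
When refining the reference template, it can help to assemble the expression programmatically so that the label matchers and the threshold are filled in consistently. The sketch below rebuilds the example expression from the table with explicit regular expressions for `pod` and `cluster_id`. The function name `p99_latency_expr` and the regex values passed to it are placeholders chosen for illustration; substitute the values that apply to your instance.

```python
def p99_latency_expr(cluster_regex: str, pod_regex: str, threshold_ms: int) -> str:
    """Build a PromQL alert condition on the 99th percentile FE query latency.

    The metric name and label names are taken from the example expression in
    the table; the regex arguments are supplied by the caller.
    """
    return (
        'avg(doris_fe_query_latency_ms{'
        f'quantile="0.99",pod=~"{pod_regex}",cluster_id=~"{cluster_regex}"'
        '}) by (cluster_id) > '
        f'{threshold_ms}'
    )


# Placeholder regexes: ".+" matches any non-empty label value.
expr = p99_latency_expr(cluster_regex=".+", pod_regex=".+", threshold_ms=300)
print(expr)
```

Paste the printed expression into the Custom PromQL statements field and check the Data preview chart before saving the rule.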

Recommended alert configurations

Configure alert rules for these metrics to catch common issues early. The table below lists recommended thresholds and durations.

| Metric | Recommended threshold | Duration (minutes) | Notes |
| --- | --- | --- | --- |
| Average query time | > 5,000 ms | 5 | Average query latency. Adjust based on your workload. |
| 99th percentile query response time | > 60,000 ms | 5 | Long-tail query latency. Adjust based on your workload. |
| Query success rate | < 90% | 5 | SQL query success rate. |
| CPU usage | > 80% | 5 | BE cluster CPU utilization. |
| Memory usage | > 80% | 5 | BE cluster memory utilization. |
| FE CPU usage | > 60% | 15 | FE node CPU utilization. If CPU resources are insufficient, submit a ticket to apply for a free scale-out. |
| FE JVM memory usage | > 80% | 15 | FE cluster JVM memory usage. If memory resources are insufficient, submit a ticket to apply for a free scale-out. |
| Number of failed nodes | > 0 | 1 | Number of underlying node restarts. |
| Base compaction score | > 1,500 | 15 | Higher values indicate greater data merge pressure on compute nodes. |
| Cumulative compaction score | > 1,500 | 15 | Higher values indicate greater data merge pressure on compute nodes. |
| Cache hit ratio | < 90% | 15 | A drop may indicate that a scale-out is needed. For more information, see Scale a cluster. |
| User connections | > 150 | 15 | Total active connections. The default connection limit is 200 per user. |
| QPS (queries per second) | | | Business metric. Configure based on your requirements. |
| Disk read IOPS | | | Underlying metric. Configure based on your requirements. |
| Disk write IOPS | | | Underlying metric. Configure based on your requirements. |
| Object storage capacity | | | Configure if you need to track object storage usage. |
| Data import speed | | | Configure if you need to track ingestion throughput. |
| Cache write throughput | | | Underlying metric. Configure based on your requirements. |
| Cache read throughput | | | Underlying metric. Configure based on your requirements. |
| Network inbound throughput | | | Underlying metric. Configure based on your requirements. |
| Network outbound throughput | | | Underlying metric. Configure based on your requirements. |
| Remote storage read throughput | | | Underlying metric. Configure based on your requirements. |
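
Note that the comparison direction differs by metric: most rules alert when a value rises above the threshold, while the cache hit ratio alerts when it falls below. The sketch below encodes a few of the recommendations above as data and checks a metric snapshot against them. It is only an illustration for reasoning about thresholds: the dictionary keys, the `breached` function, and the snapshot values are made up for this example, and durations are ignored.

```python
import operator
from typing import Callable, Dict, List, Tuple

# A subset of the recommended thresholds: (comparison, limit).
RECOMMENDED: Dict[str, Tuple[Callable[[float, float], bool], float]] = {
    "CPU usage (%)": (operator.gt, 80),
    "Memory usage (%)": (operator.gt, 80),
    "FE CPU usage (%)": (operator.gt, 60),
    "Number of failed nodes": (operator.gt, 0),
    "Cache hit ratio (%)": (operator.lt, 90),   # alerts on a drop, not a rise
    "User connections": (operator.gt, 150),
}


def breached(snapshot: Dict[str, float]) -> List[str]:
    """Return the metrics in the snapshot that cross their recommended threshold."""
    return [
        name
        for name, (compare, limit) in RECOMMENDED.items()
        if name in snapshot and compare(snapshot[name], limit)
    ]


# Made-up sample values for illustration.
snapshot = {
    "CPU usage (%)": 72,
    "Memory usage (%)": 85,
    "Cache hit ratio (%)": 88,
    "User connections": 120,
}
print(breached(snapshot))  # ['Memory usage (%)', 'Cache hit ratio (%)']
```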

What's next