ApsaraDB for SelectDB integrates with Alibaba Cloud Application Real-Time Monitoring Service (ARMS) to provide built-in alerting. When a metric crosses its threshold, ARMS notifies every contact in the associated alert contact group.
For example, you might want an alert when:
- CPU utilization of a BE cluster stays above 80% for five minutes
- The 99th percentile query response time exceeds 60 seconds
- The query success rate drops below 90%
You can also manage alert rules in the CloudMonitor console. For more information, see Cloud service monitoring.
Prerequisites
Before you begin, ensure that you have:
- Created the AliyunServiceRoleForSelectDB service-linked role. The role has permission to access ARMS by default. For more information, see Manage the service-linked role for ApsaraDB for SelectDB.
- Activated ARMS, if you want to send monitoring data to the ARMS console for centralized alerting. For more information, see Activate ARMS.
How it works
Each alert rule is defined by three things:
- Condition — the metric and threshold that triggers the alert (for example, CPU utilization > 80%)
- Duration — how long the condition must hold before an alert event fires
- Notification — who to notify and how (via an alert contact group or a notification policy)
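All three parts come together in a single rule. The following is a minimal sketch of the CPU example above; the metric and label names are hypothetical, and the console generates the real expression for you:

```promql
# Condition: BE cluster CPU utilization above 80%.
# cpu_usage_percent and the cluster_id value are hypothetical, for illustration only.
avg(cpu_usage_percent{cluster_id="selectdb-cn-7213n****-be"}) > 80

# Duration: set on the rule itself (for example, "continuously for 5 minutes"),
# not inside the expression.
# Notification: an alert contact group or notification policy attached to the rule.
```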
Check types
Choose a check type based on what you want to monitor:
| Check type | When to use |
|---|---|
| Static threshold | Monitor a preset metric using a built-in semantic condition. Use this for common metrics such as CPU utilization, memory usage, and query latency. |
| Custom PromQL | Write a custom PromQL expression to monitor a non-preset metric or a complex condition. Use this when built-in metrics do not cover your requirements. |
Create an alert rule
1. Log on to the ApsaraDB for SelectDB console.
2. In the top navigation bar, select the region where your instance resides.
3. In the left-side navigation pane, click Instances. Find the target instance and click the instance ID.
4. In the left-side navigation pane, click Monitoring and Alerts.
5. Click the Alert Management tab, then click Create SelectDBAlert. It can take 3 to 5 minutes for the Create SelectDBAlert section to appear.
6. Configure the parameters. See the parameter tables below based on the check type you select.
7. Click Save. The alert rule takes effect automatically.
Parameters for static threshold rules
Use Static threshold to alert on preset metrics such as CPU utilization, memory usage, or query latency.
| Parameter | Description | Example |
|---|---|---|
| Alert rule name | A descriptive name for the alert rule. | CPU utilization alert |
| Check type | Select Static Threshold. | Static Threshold |
| Instance | The instance to monitor. Default: Traverse (applies to all instances). | selectdb-cn-7213n\*\*\*\* |
| Cluster | The cluster to monitor. Default: Traverse (applies to all clusters). | selectdb-cn-7213n\*\*\*\*-be |
| Alert contact group | The group that receives alert notifications. Available groups depend on the Prometheus instance type. | SelectDB |
| Alert metric | The metric to monitor. Available metrics depend on the selected alert contact group. | CPU usage rate |
| Alert condition | The threshold condition that triggers an alert event. | CPU utilization > 80% |
| Filter conditions | Narrows the scope of the alert rule to specific resources. | No Filter |
| Data preview | Displays the PromQL expression for the alert condition and plots metric values on a time series chart. The alert threshold appears as a red dashed line. Values above the threshold are dark red; values below are blue. Hover over the chart to inspect values at a specific point, or click and drag to zoom into a time range. | — |
| Duration | How long the condition must hold before an alert event fires. Select If the alert condition is met, an alert event is generated to fire immediately, or If the alert condition is met continuously for N minutes, an alert event is generated to require a sustained breach. | 1 |
| Alert level | The severity of the alert. Valid values: Default, P4, P3, P2, P1 (ascending severity; P1 is highest). Default: Default. | P2 |
| Alert message | The notification message sent when an alert fires. Supports Go template syntax for dynamic variables. Common variables: $labels.pod_name (pod name), $labels.cluster_id (cluster ID), $value (current metric value). | node: {{$labels.pod_name}} CPU usage rate {{$labels.metrics_params_opt_label_value}} {{$labels.metrics_params_value}}%, current value {{ printf "%.2f" $value }}% |
| Alert notification | The notification mode. Simple mode: configure notification objects, notification period, and resend settings. Standard mode: select or create a notification policy. For more information, see Create and manage a notification policy. | — |
| Advanced settings | Alert check cycle: how often the rule evaluates the condition. Default and minimum: 1 minute. Check when data is complete: whether to verify data completeness before evaluation. Default: Yes. Tags: labels for matching notification policies. Annotations: additional metadata for the alert. | Alert check cycle: 1 minute; Check when data is complete: Yes |
In Standard mode, an alert event can match, and therefore trigger, more than one notification policy if other policies use fuzzy matching.
Parameters for custom PromQL rules
Use Custom PromQL to monitor non-preset metrics or to write complex conditions using PromQL.
| Parameter | Description | Example |
|---|---|---|
| Alert rule name | A descriptive name for the alert rule. | Pod CPU utilization exceeds 80% |
| Check type | Select Custom PromQL. | Custom PromQL |
| Instance | The instance to monitor. | selectdb-cn-7213n\*\*\*\* |
| Cluster | The cluster to monitor. | selectdb-cn-7213n\*\*\*\*-be |
| Reference alert contact group | The group that receives alert notifications. Available groups depend on the Prometheus instance type. | SelectDB |
| Reference metrics | (Optional) Select a common metric to pre-populate the Custom PromQL statements field. Modify the template as needed. | 99th query response time |
| Custom PromQL statements | The PromQL expression that defines the alert condition. The reference metric template is a starting point — refine it for your requirements. | avg(doris_fe_query_latency_ms{quantile="0.99",pod=~"...",cluster_id=~"..."}) by (cluster_id) > 300 |
| Data preview | Displays the PromQL expression and plots metric values on a time series chart. The alert threshold appears as a red dashed line. | — |
| Duration | How long the condition must hold before an alert event fires. | 1 |
| Alert level | The severity of the alert. Valid values: Default, P4, P3, P2, P1 (ascending severity; P1 is highest). Default: Default. | Default |
| Alert message | The notification message sent when an alert fires. Supports Go template syntax. Common variables: $labels.namespace (namespace), $labels.pod_name (pod name), $labels.device (device), $value (current metric value). | Namespace: {{$labels.namespace}}/Pod: {{$labels.pod_name}}/Disk: {{$labels.device}} CPU utilization > 90%, current value {{ printf "%.2f" $value }}% |
| Alert notification | The notification mode. Simple mode or Standard mode (select or create a notification policy). | — |
| Advanced settings | Same as static threshold rules: alert check cycle, data completeness check, tags, and annotations. | Alert check cycle: 1 minute; Check when data is complete: Yes |
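As a concrete custom rule, the reference template from the table above can be completed into a working expression. This sketch assumes the metric name doris_fe_query_latency_ms shown in the table and uses the 60-second p99 threshold recommended later in this topic:

```promql
# Alert when the per-cluster p99 query latency exceeds 60 seconds (60,000 ms).
# Add pod or cluster_id matchers if you want to narrow the scope.
avg(doris_fe_query_latency_ms{quantile="0.99"}) by (cluster_id) > 60000
```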
Recommended alert configurations
Configure alert rules for these metrics to catch common issues early. The table below lists recommended thresholds and durations.
| Metric | Recommended threshold | Duration (minutes) | Notes |
|---|---|---|---|
| Average query time | > 5,000 ms | 5 | Average query latency. Adjust based on your workload. |
| 99th percentile query response time | > 60,000 ms | 5 | Long-tail query latency. Adjust based on your workload. |
| Query success rate | < 90% | 5 | SQL query success rate. |
| CPU usage | > 80% | 5 | BE cluster CPU utilization. |
| Memory usage | > 80% | 5 | BE cluster memory utilization. |
| FE CPU usage | > 60% | 15 | FE node CPU utilization. If CPU resources are insufficient, submit a ticket to apply for a free scale-out. |
| FE JVM memory usage | > 80% | 15 | FE cluster JVM memory usage. If memory resources are insufficient, submit a ticket for a free scale-out. |
| Number of failed nodes | > 0 | 1 | Number of underlying node restarts. |
| Base compaction score | > 1,500 | 15 | Higher values indicate greater data merge pressure on compute nodes. |
| Cumulative compaction score | > 1,500 | 15 | Higher values indicate greater data merge pressure on compute nodes. |
| Cache hit ratio | < 90% | 15 | A drop may indicate that a scale-out is needed. For more information, see Scale a cluster. |
| User connections | > 150 | 15 | Total active connections. The default connection limit is 200 per user. |
| QPS (queries per second) | — | — | Business metric. Configure based on your requirements. |
| Disk read IOPS | — | — | Underlying metric. Configure based on your requirements. |
| Disk write IOPS | — | — | Underlying metric. Configure based on your requirements. |
| Object storage capacity | — | — | Configure if you need to track object storage usage. |
| Data import speed | — | — | Configure if you need to track ingestion throughput. |
| Cache write throughput | — | — | Underlying metric. Configure based on your requirements. |
| Cache read throughput | — | — | Underlying metric. Configure based on your requirements. |
| Network inbound throughput | — | — | Underlying metric. Configure based on your requirements. |
| Network outbound throughput | — | — | Underlying metric. Configure based on your requirements. |
| Remote storage read throughput | — | — | Underlying metric. Configure based on your requirements. |
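Translated into custom PromQL conditions, two of the recommendations above might look like the following. The metric names here are assumptions for illustration; use the names shown in the console's Data preview panel or reference templates:

```promql
# Query success rate drops below 90% (metric name hypothetical;
# set Duration to 5 minutes in the rule).
query_success_rate_percent < 90

# Base compaction score exceeds 1,500, indicating data merge pressure
# on compute nodes (metric name hypothetical; set Duration to 15 minutes).
max(base_compaction_score) by (cluster_id) > 1500
```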