Cloud Monitor gives you a unified view of resource usage, service health, and performance across your Hologres instances. Set up alert rules to catch anomalies early and keep your applications running smoothly.
Prerequisites
Before you begin, ensure that you have an Alibaba Cloud account and at least one running Hologres instance.
Instance types and monitoring views
Cloud Monitor displays metrics by instance type. Each type exposes the metrics most relevant to its architecture:
- Hologres (Read-only Secondary Instance)
- Hologres (Lakehouse Acceleration)
- Hologres (General-purpose)
- Hologres (Compute Group)
If you currently use the generic Hologres monitoring view, switch to your specific instance type for a more accurate monitoring experience.
View metrics in Cloud Monitor
1. Log on to the Cloud Monitor console.
2. In the left navigation pane, click Cloud Service Monitoring.
3. In the Big Data Computing area, click your instance type: Hologres (Read-only Secondary Instance), Hologres (Lakehouse Acceleration), Hologres (General-purpose), or Hologres (Compute Group).
4. Click the icon next to the region name and select a region.
5. Click a target Instance ID, or click Monitoring Chart in the Actions column.

Use the time range selector to view historical data. Monitoring data is retained for a maximum of 30 days.
For a full list of available Hologres metrics, see Monitoring metrics in the Hologres console.
Set up alerts
Use one-click alerting
The one-click alerting feature creates a set of default alert rules for all Hologres instances under your Alibaba Cloud account, with no manual configuration needed. It covers the most critical resource metrics so you can detect common issues quickly.
Alerts are sent to the alert contact of the Alibaba Cloud account.
The default alert rules are:
| Metric | Severity | Condition | Check interval |
|---|---|---|---|
| Connection usage | Info | Average ≥ 95% for 3 consecutive checks | 5 minutes (customizable) |
| Storage usage | Warn | Average > 90% for 3 consecutive checks | 5 minutes (customizable) |
| Memory usage | Warn | Average ≥ 90% for 3 consecutive checks | 5 minutes (customizable) |
| CPU usage | Info | Average ≥ 99% for 3 consecutive checks | 5 minutes (customizable) |
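The "average ≥ threshold for 3 consecutive checks" semantics in the table above can be modeled with a small sketch. The function below is illustrative only, not Cloud Monitor's actual implementation; each sample stands for one check-interval average.

```python
def should_alert(samples, threshold, consecutive=3, strict=False):
    """Return True when the most recent `consecutive` check averages all
    breach the threshold (>= by default, > when strict=True), mimicking
    the 'for N consecutive checks' semantics in simplified form."""
    if len(samples) < consecutive:
        return False
    recent = samples[-consecutive:]
    if strict:
        return all(s > threshold for s in recent)
    return all(s >= threshold for s in recent)
```

For example, the storage-usage rule uses a strict comparison (> 90%), while the other default rules use ≥.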
Create custom alert rules
For metrics not covered by one-click alerting, create rules manually:
1. Log on to the Cloud Monitor console.
2. In the left navigation pane, choose Alert Service > Alert Rules.
3. On the Alert Rules page, click Create Alert Rule and follow the prompts.
For details, see Create an alert rule.
Recommended alert rules
The following rules use thresholds validated against typical Hologres behavior. Each entry explains what the metric detects and the common misconfiguration pitfalls to avoid.
Instance CPU usage (%)
What it detects: Resource bottlenecks at the instance level, and whether compute resources are fully utilized.
| Severity | Condition |
|---|---|
| Critical | ≥ 99% for 60 consecutive epochs (1 epoch = 1 minute) |
| Warning | ≥ 99% for 10 consecutive epochs (1 epoch = 1 minute) |
The Critical rule catches sustained high usage that signals the need to scale out. The Warning rule gives you an early heads-up when CPU maxes out due to workload changes.
Common mistakes:
- Don't trigger an alert on a single spike to 100%. Brief peaks are normal and indicate efficient utilization, not overload.
- Don't set the threshold too low. System components consume CPU even when no user queries are running.
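Both points can be seen in a sketch of the recommended evaluation: a lone spike to 100% fires nothing, while sustained saturation escalates from Warning to Critical. This is an illustrative model, not the alerting engine itself.

```python
def classify_cpu(series, threshold=99.0):
    """Classify a per-minute CPU series per the recommended rules:
    Critical when the last 60 samples are all >= threshold,
    Warning when the last 10 are, OK otherwise (simplified sketch)."""
    def sustained(epochs):
        return len(series) >= epochs and all(
            s >= threshold for s in series[-epochs:])
    if sustained(60):
        return "Critical"
    if sustained(10):
        return "Warning"
    return "OK"
```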
Worker node CPU usage (%)
What it detects: Resource bottlenecks on individual worker nodes, helping you identify hot spots before they affect the cluster.
| Severity | Condition |
|---|---|
| Critical | ≥ 99% for 60 consecutive epochs (1 epoch = 1 minute) |
| Warning | ≥ 99% for 10 consecutive epochs (1 epoch = 1 minute) |
Common mistakes:
- Don't trigger an alert on a single spike to 100%. Brief peaks are normal.
- Don't set the threshold too low. System components consume CPU even when no user queries are running.
Instance memory usage (%)
What it detects: Memory pressure at the instance level. Sustained high usage can lead to query failures or out-of-memory (OOM) conditions.
| Severity | Condition |
|---|---|
| Critical | ≥ 99% for 60 consecutive epochs (1 epoch = 1 minute) |
| Warning | ≥ 99% for 10 consecutive epochs (1 epoch = 1 minute) |
Common mistake: Don't set the threshold too low. Memory is used for running queries, metadata, and caching. Some baseline memory is consumed even when the instance is idle.
Worker node memory usage (%)
What it detects: Memory pressure on individual worker nodes.
| Severity | Condition |
|---|---|
| Critical | ≥ 99% for 60 consecutive epochs (1 epoch = 1 minute) |
| Warning | ≥ 99% for 10 consecutive epochs (1 epoch = 1 minute) |
Common mistake: Don't set the threshold too low. Memory is used for running queries, metadata, and caching. Some baseline memory is consumed even when the instance is idle.
Connection usage of the FE with the highest connection usage (%)
What it detects: Whether any Frontend (FE) node is running out of connection capacity. High connection usage can block new connections and degrade service.
| Severity | Condition |
|---|---|
| Warning | ≥ 95% for 5 consecutive epochs (1 epoch = 1 minute) |
When this alert fires, clear idle connections promptly.
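Hologres is PostgreSQL-compatible, so idle connections can typically be cleared with pg_terminate_backend against pg_stat_activity. The sketch below only builds the statement; the 30-minute idle cutoff is an assumed example, and you should confirm the columns against your instance before running it.

```python
def idle_terminate_sql(idle_minutes=30):
    """Build a statement that terminates sessions idle longer than the
    cutoff. The cutoff is an example value, not a Hologres default."""
    return (
        "SELECT pg_terminate_backend(pid) FROM pg_stat_activity "
        "WHERE state = 'idle' "
        f"AND state_change < now() - interval '{idle_minutes} minutes';"
    )
```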
WAL sender usage of the FE with the highest WAL sender usage (%)
What it detects: Write-Ahead Log (WAL) sender saturation on FE nodes. WAL senders are used for data replication; exhausting them can disrupt replication pipelines.
| Severity | Condition |
|---|---|
| Warning | ≥ 95% for 5 consecutive epochs (1 epoch = 1 minute) |
Longest duration of a running query in the instance (milliseconds)
What it detects: Long-running queries that may be stuck or blocking resources.
| Severity | Condition |
|---|---|
| Warning | ≥ 3,600,000 ms for 10 consecutive epochs (1 epoch = 1 minute) |
When this alert fires, check Active Queries in HoloWeb to identify and cancel the offending query.
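As a rough model of this rule, the sketch below flags queries whose elapsed time exceeds the 3,600,000 ms (1 hour) threshold. The query records and field layout here are hypothetical; in practice you would read them from Active Queries in HoloWeb.

```python
LONG_QUERY_THRESHOLD_MS = 3_600_000  # 1 hour, matching the Warning rule

def overlong_queries(active, now_ms):
    """Return the ids of queries (given as (id, start_ms) pairs) that
    have been running past the threshold -- candidates to inspect and,
    if stuck, cancel."""
    return [qid for qid, start_ms in active
            if now_ms - start_ms >= LONG_QUERY_THRESHOLD_MS]
```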
Longest duration of a running query in Serverless Computing (milliseconds)
What it detects: Runaway tasks in a serverless cluster. Canceling long-running tasks promptly helps control costs and free up resources.
| Severity | Condition |
|---|---|
| Warning | ≥ 3,600,000 ms for 10 consecutive epochs (1 epoch = 1 minute) |
Failed query QPS (counts)
What it detects: A surge in query failures, which can signal schema issues, resource exhaustion, or application bugs.
| Severity | Condition |
|---|---|
| Warning | ≥ 10 counts for 10 consecutive epochs (1 epoch = 1 minute) |
When this alert fires, check the slow query logs for failure details and take appropriate action.
FE replay latency (milliseconds)
What it detects: How long it takes each FE node to replay metadata changes. A stuck FE causes queries to stall and requires immediate attention.
| Severity | Condition |
|---|---|
| Warning | ≥ 300,000 ms for 10 consecutive epochs (1 epoch = 1 minute) |
When this alert fires, go to Active Queries in HoloWeb to check for long-running queries and try to cancel them.
Common mistake: Don't set the threshold too low. FE replay happens on every metadata change, and a replay time in the range of seconds is normal.
Primary-secondary synchronization latency (milliseconds)
What it detects: Replication lag between the primary and secondary instances. Applies only to read-only secondary instances.
| Severity | Condition |
|---|---|
| Warning | ≥ 600,000 ms for 10 consecutive epochs (1 epoch = 1 minute) |
Number of tables with missing statistics in each DB (counts)
What it detects: Degraded query plan quality caused by stale or missing table statistics. The Auto Analyze feature keeps statistics up to date automatically, but some tables may lag behind.
| Severity | Condition |
|---|---|
| Warning | ≥ 10 counts for 60 consecutive epochs (1 epoch = 1 minute) |
When this alert fires, manually run the ANALYZE command on the affected tables. For details, see ANALYZE and AUTO ANALYZE.
Common mistake: Don't set the threshold too low. Instances with many tables can slow down Auto Analyze, so some lag is expected.
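Running ANALYZE on the flagged tables can be scripted. The sketch below is a simple statement builder with hypothetical table names; identifier quoting is naive, so adapt it to your naming conventions.

```python
def analyze_statements(tables):
    """Build one ANALYZE statement per flagged table. Quoting here is
    simplified (double quotes, no escaping) -- adjust for real names."""
    return [f'ANALYZE "{table}";' for table in tables]
```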
Troubleshoot monitoring issues
If a metric fluctuates unexpectedly or an alert fires without a clear cause, see FAQ about monitoring metrics.
Access metrics via API or custom dashboards
You can access monitoring data programmatically or build custom views:
- API access: See Cloud service monitoring.
- Custom dashboards: See Manage custom dashboards.
- ARMS integration: Connect Hologres monitoring to Application Real-Time Monitoring Service (ARMS). For setup instructions, see the ARMS integration guide.
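For API access, metric data is typically fetched with the Cloud Monitor DescribeMetricList operation. The sketch below only assembles the request parameters, without signing or sending the HTTP call; the acs_hologres namespace, the cpu_usage metric name, and the instanceId dimension key are assumptions to verify against the metrics reference.

```python
import json

def metric_query_params(metric_name, instance_id, start_time, end_time,
                        period="60"):
    """Assemble DescribeMetricList parameters for one Hologres metric.
    Namespace and dimension key are assumed values -- confirm them in
    the Cloud Monitor metrics reference before use."""
    return {
        "Action": "DescribeMetricList",
        "Namespace": "acs_hologres",
        "MetricName": metric_name,
        "Period": period,
        "StartTime": start_time,
        "EndTime": end_time,
        "Dimensions": json.dumps([{"instanceId": instance_id}]),
    }
```

A call such as `metric_query_params("cpu_usage", "my-instance-id", "2024-01-01 00:00:00", "2024-01-01 01:00:00")` yields a parameter dict ready to pass to an SDK or signed HTTP client.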
Grant RAM users access to Cloud Monitor
By default, a Resource Access Management (RAM) user cannot view metric data in Cloud Monitor. Grant the appropriate policy based on what the user needs to do.
Use your Alibaba Cloud account to log on to the RAM console and attach one of the following policies. For instructions, see Manage RAM user permissions.
| Policy name | Access level |
|---|---|
| AliyunCloudMonitorFullAccess | Full management of Cloud Monitor |
| AliyunCloudMonitorReadOnlyAccess | Read-only access to Cloud Monitor |
| AliyunCloudMonitorMetricDataReadOnlyAccess | Read-only access to time series metric data |