
Hologres: Instance observability with Cloud Monitor

Last Updated: Mar 26, 2026

Cloud Monitor gives you a unified view of resource usage, service health, and performance across your Hologres instances. Set up alert rules to catch anomalies early and keep your applications running smoothly.

Prerequisites

Before you begin, ensure that you have:

Instance types and monitoring views

Cloud Monitor displays metrics by instance type. Each type exposes the metrics most relevant to its architecture:

  • Hologres (Read-only Secondary Instance)

  • Hologres (Lakehouse Acceleration)

  • Hologres (General-purpose)

  • Hologres (Compute Group)

If you currently use the generic Hologres monitoring view, switch to your specific instance type for a more accurate monitoring experience.

View metrics in Cloud Monitor

  1. Log on to the Cloud Monitor console.

  2. In the left navigation pane, click Cloud Service Monitoring.

  3. In the Big Data Computing area, click your instance type: Hologres (Read-only Secondary Instance), Hologres (Lakehouse Acceleration), Hologres (General-purpose), or Hologres (Compute Group).

  4. Click the region icon next to the region name and select a region.

  5. Click a target instance ID, or click Monitoring Chart in the Actions column.

Use the time range selector to view historical data. Monitoring data is retained for a maximum of 30 days.

For a full list of available Hologres metrics, see Monitoring metrics in the Hologres console.

Set up alerts

Use one-click alerting

The one-click alerting feature creates a set of default alert rules for all Hologres instances under your Alibaba Cloud account — no manual configuration needed. It covers the most critical resource metrics so you can detect common issues quickly.

Alerts are sent to the alert contact of the Alibaba Cloud account.

The default alert rules are:

| Metric | Severity | Condition | Check interval |
| Connection usage | Info | Average ≥ 95% for 3 consecutive checks | 5 minutes (customizable) |
| Storage usage | Warn | Average > 90% for 3 consecutive checks | 5 minutes (customizable) |
| Memory usage | Warn | Average ≥ 90% for 3 consecutive checks | 5 minutes (customizable) |
| CPU usage | Info | Average ≥ 99% for 3 consecutive checks | 5 minutes (customizable) |
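The "for N consecutive checks" semantics above can be sketched as follows. This is a simplified illustration of the evaluation logic, not Cloud Monitor's actual implementation; the thresholds mirror the default rules in the table.

```python
# Sketch of a "value at/above threshold for N consecutive checks" rule.
# Each sample is the average over one check interval (5 minutes by default).

def rule_fires(samples, threshold, consecutive=3):
    """Return True if samples meet or exceed threshold
    for `consecutive` checks in a row."""
    streak = 0
    for value in samples:
        streak = streak + 1 if value >= threshold else 0
        if streak >= consecutive:
            return True
    return False

# Memory usage: three checks in a row at >= 90% fire the Warn rule.
print(rule_fires([88, 91, 92, 93], threshold=90))      # True
# Isolated spikes between normal readings do not.
print(rule_fires([88, 95, 85, 96, 84], threshold=90))  # False
```

Note that the storage-usage rule uses a strict `>` comparison; the sketch uses `>=` as in the other three rules.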

Create custom alert rules

For metrics not covered by one-click alerting, create rules manually:

  1. Log on to the Cloud Monitor console.

  2. In the left navigation pane, choose Alert Service > Alert Rules.

  3. On the Alert Rules page, click Create Alert Rule and follow the prompts.

For details, see Create an alert rule.

Recommended alert rules

The following rules use thresholds tuned to typical Hologres behavior. Each entry explains what the metric detects and the common misconfiguration pitfalls to avoid.

Instance CPU usage (%)

What it detects: Resource bottlenecks at the instance level, and whether compute resources are fully utilized.

| Severity | Condition |
| Critical | ≥ 99% for 60 consecutive epochs (1 epoch = 1 minute) |
| Warning | ≥ 99% for 10 consecutive epochs (1 epoch = 1 minute) |

The Critical rule catches sustained high usage that signals the need to scale out. The Warning rule gives you an early heads-up when CPU maxes out due to workload changes.

Common mistakes:

  • Don't trigger an alert on a single spike to 100%. Brief peaks are normal and indicate efficient utilization, not overload.

  • Don't set the threshold too low. System components consume CPU even when no user queries are running.
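Both severities can be derived from the same series by looking at the longest streak of epochs at or above the threshold. The sketch below is an illustrative evaluator, not Cloud Monitor's real one; it shows why a brief spike to 100% fires nothing:

```python
# Recommended CPU rules: Warning after 10 consecutive minutes at >= 99%,
# Critical after 60. One sample per epoch (1 epoch = 1 minute).

def cpu_severity(per_minute_usage, threshold=99):
    """Return 'Critical', 'Warning', or None based on the longest
    streak of epochs at/above the threshold."""
    longest = streak = 0
    for usage in per_minute_usage:
        streak = streak + 1 if usage >= threshold else 0
        longest = max(longest, streak)
    if longest >= 60:
        return "Critical"
    if longest >= 10:
        return "Warning"
    return None

print(cpu_severity([100] * 5 + [40] * 55))  # None: a brief peak is normal
print(cpu_severity([99] * 15))              # Warning
print(cpu_severity([100] * 60))             # Critical
```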

Worker node CPU usage (%)

What it detects: Resource bottlenecks on individual worker nodes, helping you identify hot spots before they affect the cluster.

| Severity | Condition |
| Critical | ≥ 99% for 60 consecutive epochs (1 epoch = 1 minute) |
| Warning | ≥ 99% for 10 consecutive epochs (1 epoch = 1 minute) |

Common mistakes:

  • Don't trigger an alert on a single spike to 100%. Brief peaks are normal.

  • Don't set the threshold too low. System components consume CPU even when no user queries are running.

Instance memory usage (%)

What it detects: Memory pressure at the instance level. Sustained high usage can lead to query failures or out-of-memory (OOM) conditions.

| Severity | Condition |
| Critical | ≥ 99% for 60 consecutive epochs (1 epoch = 1 minute) |
| Warning | ≥ 99% for 10 consecutive epochs (1 epoch = 1 minute) |

Common mistake: Don't set the threshold too low. Memory is used for running queries, metadata, and caching. Some baseline memory is consumed even when the instance is idle.

Worker node memory usage (%)

What it detects: Memory pressure on individual worker nodes.

| Severity | Condition |
| Critical | ≥ 99% for 60 consecutive epochs (1 epoch = 1 minute) |
| Warning | ≥ 99% for 10 consecutive epochs (1 epoch = 1 minute) |

Common mistake: Don't set the threshold too low. Memory is used for running queries, metadata, and caching. Some baseline memory is consumed even when the instance is idle.

Connection usage of the FE with the highest connection usage (%)

What it detects: Whether any Frontend (FE) node is running out of connection capacity. High connection usage can block new connections and degrade service.

| Severity | Condition |
| Warning | ≥ 95% for 5 consecutive epochs (1 epoch = 1 minute) |

When this alert fires, clear idle connections promptly.

WAL sender usage of the FE with the highest WAL sender usage (%)

What it detects: Write-Ahead Log (WAL) sender saturation on FE nodes. WAL senders are used for data replication; exhausting them can disrupt replication pipelines.

| Severity | Condition |
| Warning | ≥ 95% for 5 consecutive epochs (1 epoch = 1 minute) |

Longest duration of a running query in the instance (milliseconds)

What it detects: Long-running queries that may be stuck or blocking resources.

| Severity | Condition |
| Warning | ≥ 3,600,000 ms for 10 consecutive epochs (1 epoch = 1 minute) |

When this alert fires, check Active Queries in HoloWeb to identify and cancel the offending query.
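Because Hologres is PostgreSQL-compatible, long-running queries can also be inspected through the standard `pg_stat_activity` view and canceled with `pg_cancel_backend()`. The helper below is an illustrative sketch: it takes `(pid, duration_ms)` rows, for example fetched with a PostgreSQL driver, and builds cancel statements for anything over the one-hour alert threshold.

```python
# 3,600,000 ms matches the alert condition above (1 hour).
ONE_HOUR_MS = 3_600_000

def cancel_statements(active_queries, threshold_ms=ONE_HOUR_MS):
    """Return SQL statements that cancel queries running longer
    than the threshold. Input rows are (pid, duration_ms) pairs."""
    return [
        f"SELECT pg_cancel_backend({pid});"
        for pid, duration_ms in active_queries
        if duration_ms >= threshold_ms
    ]

rows = [(101, 120_000), (102, 4_500_000)]  # pid, elapsed milliseconds
print(cancel_statements(rows))  # ['SELECT pg_cancel_backend(102);']
```

Review each candidate before canceling; a legitimately long batch job can also trip this alert.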

Longest duration of a running query in Serverless Computing (milliseconds)

What it detects: Runaway tasks in a serverless cluster. Canceling long-running tasks promptly helps control costs and free up resources.

| Severity | Condition |
| Warning | ≥ 3,600,000 ms for 10 consecutive epochs (1 epoch = 1 minute) |

Failed query QPS (counts)

What it detects: A surge in query failures, which can signal schema issues, resource exhaustion, or application bugs.

| Severity | Condition |
| Warning | ≥ 10 counts for 10 consecutive epochs (1 epoch = 1 minute) |

When this alert fires, check the slow query logs for failure details and take appropriate action.

FE replay latency (milliseconds)

What it detects: How long it takes each FE node to replay metadata changes. A stuck FE causes queries to stall and requires immediate attention.

| Severity | Condition |
| Warning | ≥ 300,000 ms for 10 consecutive epochs (1 epoch = 1 minute) |

When this alert fires, go to Active Queries in HoloWeb to check for long-running queries and try to cancel them.

Common mistake: Don't set the threshold too low. FE replay happens on every metadata change, and a replay time in the range of seconds is normal.

Primary-secondary synchronization latency (milliseconds)

What it detects: Replication lag between the primary and secondary instances. Applies only to read-only secondary instances.

| Severity | Condition |
| Warning | ≥ 600,000 ms for 10 consecutive epochs (1 epoch = 1 minute) |

Number of tables with missing statistics in each DB (counts)

What it detects: Degraded query plan quality caused by stale or missing table statistics. The Auto Analyze feature keeps statistics up to date automatically, but some tables may lag behind.

| Severity | Condition |
| Warning | ≥ 10 counts for 60 consecutive epochs (1 epoch = 1 minute) |

When this alert fires, manually run the ANALYZE command on the affected tables. For details, see ANALYZE and AUTO ANALYZE.
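`ANALYZE` in Hologres follows PostgreSQL syntax. When the alert reports several affected tables, a small helper can generate the statements to run; the table names below are hypothetical examples.

```python
def analyze_statements(tables):
    """Return one ANALYZE statement per qualified table name."""
    return [f"ANALYZE {table};" for table in tables]

for stmt in analyze_statements(["public.orders", "public.order_items"]):
    print(stmt)  # e.g. ANALYZE public.orders;
```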

Common mistake: Don't set the threshold too low. Instances with many tables can slow down Auto Analyze, so some lag is expected.

Troubleshoot monitoring issues

If a metric fluctuates unexpectedly or an alert fires without a clear cause, see FAQ about monitoring metrics.

Access metrics via API or custom dashboards

You can also access monitoring data programmatically or build custom dashboards on top of it.
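Programmatic access typically goes through CloudMonitor's DescribeMetricList API, which returns time-series data points per metric. The sketch below only builds the request parameters; the `acs_hologres` namespace, the metric name, and the instance ID are assumptions to verify against the metric list in the Cloud Monitor console. Send the request with an Alibaba Cloud SDK or a signed HTTP call.

```python
import json

def metric_list_params(instance_id, metric_name, period_seconds=60):
    """Build DescribeMetricList parameters for one Hologres instance.
    Parameter names follow the CloudMonitor API convention."""
    return {
        "Namespace": "acs_hologres",       # assumed Hologres namespace
        "MetricName": metric_name,         # e.g. a CPU usage metric name
        "Period": str(period_seconds),     # aggregation granularity in seconds
        "Dimensions": json.dumps([{"instanceId": instance_id}]),
    }

# "hgprecn-cn-example" is a placeholder instance ID.
params = metric_list_params("hgprecn-cn-example", "cpu_usage")
print(params["Namespace"])  # acs_hologres
```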

Grant RAM users access to Cloud Monitor

By default, a Resource Access Management (RAM) user cannot view metric data in Cloud Monitor. Grant the appropriate policy based on what the user needs to do.

Use your Alibaba Cloud account to log on to the RAM console and attach one of the following policies. For instructions, see Manage RAM user permissions.

| Policy name | Access level |
| AliyunCloudMonitorFullAccess | Full management of Cloud Monitor |
| AliyunCloudMonitorReadOnlyAccess | Read-only access to Cloud Monitor |
| AliyunCloudMonitorMetricDataReadOnlyAccess | Read-only access to time series metric data |