This topic explains the meaning of each Hologres monitoring metric. Understanding these metrics helps you select the most suitable ones for your business needs, monitor resource usage and SQL execution in real time, promptly detect system errors, and handle instance failures.
Important Notes
Notes about QE and FixedQE:
QE is a collective term for Hologres proprietary vector compute engines, such as HQE and SQE, under the XQE engine family. In slow query logs, queries with Engine Type={XQE} fall under the QE category in monitoring metrics.
FixedQE refers to queries that use the Fixed Plan path. In slow query logs, queries with Engine Type={FixedQE} (SDK in versions earlier than V2.2) fall under the FixedQE category in monitoring metrics.
Notes about Command Type:
The Command Type matches the SQL statement type. For example, INSERT xxx and INSERT xxx ON CONFLICT DO UPDATE/NOTHING are both classified as INSERT.
UNKNOWN: A classification for SQL statements that the DPI engine cannot recognize due to SQL syntax errors.
UTILITY: Administrative, definition, and control commands other than INSERT, UPDATE, DELETE, and SELECT.
These include the following:
Data Definition Language (DDL): CREATE, ALTER, DROP, TRUNCATE, and COMMENT.
Transaction Control Language (TCL): BEGIN, COMMIT, ROLLBACK, and SAVEPOINT.
Administration and maintenance: ANALYZE, VACUUM, EXPLAIN, SET, SHOW, COPY, and REFRESH.
Execution and procedural control: PREPARE, EXECUTE, DEALLOCATE, CALL, and DECLARE CURSOR.
Others: LOCK TABLE, LISTEN, and NOTIFY.
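The classification above can be approximated with a lookup on the leading keyword of a statement. The following Python sketch is illustrative only: the function name command_type and the keyword tables are assumptions that mirror the lists in this topic, not how Hologres classifies statements internally. It does not parse SQL, so a statement that starts with an unlisted keyword is simply reported as UNKNOWN.

```python
# Illustrative sketch: keyword tables mirror the lists in this topic.
# Hologres performs the real classification internally.
DML_AND_QUERY = {"INSERT", "UPDATE", "DELETE", "SELECT"}

UTILITY_KEYWORDS = {
    # Data Definition Language (DDL)
    "CREATE", "ALTER", "DROP", "TRUNCATE", "COMMENT",
    # Transaction Control Language (TCL)
    "BEGIN", "COMMIT", "ROLLBACK", "SAVEPOINT",
    # Administration and maintenance
    "ANALYZE", "VACUUM", "EXPLAIN", "SET", "SHOW", "COPY", "REFRESH",
    # Execution and procedural control
    "PREPARE", "EXECUTE", "DEALLOCATE", "CALL", "DECLARE",
    # Others
    "LOCK", "LISTEN", "NOTIFY",
}


def command_type(sql: str) -> str:
    """Return a Command Type guess based on the first keyword of the statement."""
    words = sql.strip().split(None, 1)
    if not words:
        return "UNKNOWN"
    keyword = words[0].upper()
    if keyword in DML_AND_QUERY:
        return keyword
    if keyword in UTILITY_KEYWORDS:
        return "UTILITY"
    # Assumption: anything this sketch does not recognize is UNKNOWN.
    return "UNKNOWN"
```

For example, a statement that begins with INSERT and ends with ON CONFLICT DO NOTHING is still reported as INSERT, which matches the rule described above.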
In Cloud Monitor, each metric has a unique ID that lets you find specific metrics more easily. Metric IDs for different instance types have different prefixes. For example, general-purpose instances, follower instances, Virtual Warehouse Instances, and Lakehouse Acceleration (Shared Cluster) use the prefixes standard_, follower_, warehouse_, and shared_, respectively. The metrics supported by each instance type are listed below:
General-purpose instance: Metrics supported by general-purpose instances.
Follower instance: Metrics supported by follower instances.
Compute group: Metrics supported by compute group instances.
Lakehouse acceleration (shared cluster): Metrics supported by lakehouse acceleration instances.
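When you process metric IDs in scripts, the prefix is enough to tell which instance type a metric belongs to. The following Python sketch shows one way to do this. The function name instance_type_for and the sample ID warehouse_example_metric are hypothetical and are not real Cloud Monitor identifiers. Only the four prefixes come from this topic.

```python
# Prefixes as listed in this topic; the display names are for illustration.
PREFIX_TO_INSTANCE_TYPE = {
    "standard_": "general-purpose instance",
    "follower_": "follower instance",
    "warehouse_": "Virtual Warehouse instance",
    "shared_": "Lakehouse Acceleration (Shared Cluster)",
}


def instance_type_for(metric_id: str) -> str:
    """Map a metric ID to its instance type by inspecting the prefix."""
    for prefix, instance_type in PREFIX_TO_INSTANCE_TYPE.items():
        if metric_id.startswith(prefix):
            return instance_type
    raise ValueError(f"unrecognized metric ID prefix: {metric_id!r}")


# Hypothetical metric ID, used only to exercise the lookup.
print(instance_type_for("warehouse_example_metric"))
```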
If a metric shows no data, it may be because the current instance version does not support it or there has been no activity for an extended period.
Monitoring data is retained for up to 30 days.
Access Control
The monitoring page in the Hologres console retrieves data from Cloud Monitor. If you use a Resource Access Management (RAM) user to view monitoring information, you must grant the appropriate permissions based on your business needs. These permissions include the following:
AliyunCloudMonitorFullAccess: Full management permissions for Cloud Monitor.
AliyunCloudMonitorReadOnlyAccess: Read-only access permissions for Cloud Monitor.
For more information about RAM user authorization, see Grant permissions to RAM users.
Monitoring Metrics Overview
The following monitoring metrics are available in Hologres:
Categorization | Metric | Description | Supported Instance Types | Notes |
CPU | The CPU usage of the instance. | General-purpose instance, follower instance, and compute group instance | None | |
The CPU usage of each Worker node in the instance. | ||||
CPU utilization for each Cluster in the compute group. | Compute group instance | Supported only in Hologres V4.0 and later. | ||
Memory | The total memory usage of the instance. | General-purpose instance, follower instance, and compute group instance | None | |
The memory usage of each Worker node in the instance. | ||||
Memory usage is broken down by system, meta, cache, query, and background categories. | General-purpose instance, follower instance, and compute group instance | Supported only in Hologres V2.0 and later. | ||
The amount of memory used by queries that are executed by the QE engine. | Supported only in Hologres V2.0.44 and later, and V2.1.22 and later. | |||
The percentage of memory used by queries that are executed by the QE engine. | ||||
The memory usage of each Cluster in the compute group. | Compute group instance | Supported only in Hologres V4.0 and later. | ||
Query QPS and RPS | The total queries per second (QPS) across the instance. | General-purpose instance, follower instance, compute group instance, and shared cluster instance | Query QPS ≥ QE QPS + FixedQE QPS. Note The total QPS includes all queries, such as UNKNOWN, UTILITY, and Engine Type={PG}. Therefore, the total QPS is greater than or equal to the sum of QE QPS and FixedQE QPS. | |
The QPS of queries that are executed by the QE engine. | General-purpose instance, follower instance, and compute group instance | Supported only in Hologres V2.2 and later. | ||
The QPS of queries that are executed by the FixedQE (formerly SDK) engine. | ||||
The total rows per second (RPS) for DML queries in the instance. | General-purpose instance and compute group instance | DML RPS = QE RPS + FixedQE RPS | ||
The RPS of DML operations that are executed by the QE engine. | Supported only in Hologres V2.2 and later. | |||
The RPS of DML operations that are executed by the FixedQE engine. | ||||
Query Latency | The latency of queries in the instance. | General-purpose instance, follower instance, compute group instance, and shared cluster instance | None | |
The latency of queries that are executed by the QE engine. | General-purpose instance, follower instance, and compute group instance | Supported only in Hologres V2.2 and later. | ||
The latency of queries that are executed by the FixedQE engine. | ||||
The duration of the Optimization phase for a query. | General-purpose instance, follower instance, compute group instance, and shared cluster instance | Supported only in Hologres V2.0.44 and later, and V2.1.22 and later. | ||
The duration of the Start Query phase for a query. | ||||
The duration of the Get Next phase for a query. | ||||
The P99 latency of queries. | None | |||
Longest Running Query Duration in This Instance (milliseconds) | The duration of the longest-running query among those that are currently executing in the instance. | |||
Failed Query QPS | The total number of failed queries per second in the instance. | General-purpose instance, follower instance, compute group instance, and shared cluster instance | Failed Query QPS ≥ QE Failed Query QPS + FixedQE Failed Query QPS. Note Failed Query QPS counts all failed queries, such as UNKNOWN, UTILITY, and Engine Type={PG}. Therefore, the total failed QPS is greater than or equal to the sum of QE Failed Query QPS and FixedQE Failed Query QPS. | |
The number of failed queries per second that are executed by the QE engine. | General-purpose instance, follower instance, and compute group instance | Supported only in Hologres V2.2 and later. | ||
The number of failed queries per second that are executed by the FixedQE engine. | General-purpose instance and compute group instance | |||
Locks | The wait time for DDL locks on FE nodes. | General-purpose instance, follower instance, and compute group instance | Supported only in Hologres V2.0.44 and later, and V2.1.22 and later. | |
The lock wait time for FixedQE, which is typically for HQE locks. | ||||
The delay for HQE locks in the instance, which includes FixedQE HQE lock delays. | ||||
Connection | The total number of connections used in the instance. | General-purpose instance, follower instance, compute group instance, and shared cluster instance | None | |
The number of connections used by each database in the instance. | General-purpose instance, follower instance, and compute group instance | |||
The number of connections used by each FE in the instance. | ||||
The connection usage rate in the instance, which defaults to the FE with the highest usage rate. | ||||
Query Queue | The number of query requests that are waiting to be executed but have not yet been processed. | General-purpose instance, follower instance, and compute group instance | Supported only in Hologres V3.0 and later. | |
The number of query requests submitted to the system queue per second. | ||||
The number of query requests that transition from the waiting state to the running state per second. | ||||
The per-second request count for queries that have started running but have not yet completed, grouped by execution state. | ||||
The average time between entering the queue and starting processing. This does not include the actual query execution time. | ||||
The maximum concurrency for auto-rate-limited query queues. | Compute group instance | Supported only in Hologres V3.1 and later. | ||
I/O | The I/O throughput when reading Standard storage data. | General-purpose instance, follower instance, and compute group instance | None | |
The I/O throughput when writing Standard storage data. | General-purpose instance and compute group instance | |||
The I/O throughput when reading IA storage data. | General-purpose instance, follower instance, and compute group instance | |||
The I/O throughput when writing IA storage data. | General-purpose instance and compute group instance | |||
Storage | The used capacity in Standard storage. | General-purpose instance and compute group instance | None | |
The usage percentage of Standard storage capacity. | ||||
The used capacity in IA storage. | ||||
The usage percentage of IA storage capacity. | ||||
The storage used by the recycle bin. | General-purpose instance and compute group instance | Supported only in Hologres V3.1 and later. | ||
Frameworks | The replay delay for each FE. | General-purpose instance, follower instance, and compute group instance | Supported only in Hologres V2.2 and later. | |
The sync delay between Shard replicas after replication is enabled. | None | |||
The delay that occurs when a follower instance reads data from the primary instance. This is visible only for follower instances. | ||||
The file sync delay between disaster recovery instances. | General-purpose instance | |||
Auto Analyze | The number of tables that are missing statistics in each database. | General-purpose instance and compute group instance | Supported only in Hologres V2.2 and later. | |
Serverless Computing | Longest Running Serverless Computing Query Duration (milliseconds) | The duration of the longest-running query in Serverless Computing after it is enabled. | General-purpose instance and compute group instance | Supported only in Hologres V2.1 and later. |
The number of queries that are queued in the Serverless Computing resource pool. | Supported only in Hologres V2.2 and later. | |||
The ratio of the actual Serverless Computing resources used to the maximum allocatable resources. | ||||
Binary Logging | The number of Binlog entries consumed per second. | General-purpose instance, follower instance, and compute group instance | Supported only in Hologres V2.2 and later. | |
The number of bytes consumed from Binlog per second. | ||||
The number of WAL senders used per FE. | ||||
The WAL sender usage rate of the FE with the highest usage rate. | ||||
Computing Resource | The number of cores that are elastically added by time-based scaling in the compute group. | Compute group instance | Supported only in Hologres V2.2.21 and later. | |
The number of cores that are elastically added by auto-scaling in the compute group. | Supported only in Hologres V4.0 and later. | |||
Gateway | The CPU usage of each Gateway in the instance. | Compute group instance | Supported only in Hologres V2.0 and later. | |
The memory usage of each Gateway in the instance. | Supported only in Hologres V2.0 and later. | |||
The maximum number of new connections that the system can accept and successfully establish per second. | Supported only in Hologres V2.1.12 and later. | |||
The volume of data that enters the system through the Gateway per second. | Supported only in Hologres V2.1 and later. | |||
The volume of data sent from the Gateway to external systems per second. | Supported only in Hologres V2.1 and later. | |||
Dynamic Table | The refresh failure QPS across all Dynamic Tables in the instance. You can use this metric to assess the overall health of the refresh process. | Supported only in Hologres V4.0.8 and later. | ||
The latency of each Dynamic Table relative to the latest upstream base table data or expected timestamp, in seconds. You can use this metric to assess data freshness. | ||||
The current duration of the ongoing refresh task for each Dynamic Table, in milliseconds. You can use this metric to detect whether refresh cycles are lengthening. | ||||
The number of refresh failures per minute for each Dynamic Table. You can use this metric to evaluate the refresh stability of each table. |
CPU
The following metrics relate to CPU usage.
Instance CPU Usage (%)
Instance CPU usage reflects the overall CPU load on the instance.
Even without active queries, background processes or asynchronous compaction tasks may consume CPU resources. A small amount of CPU usage during idle periods is normal.
Hologres efficiently leverages multi-core parallel computing. A single query can often push CPU usage to 100%, which indicates full utilization of compute resources.
If CPU usage remains near 100% for extended periods, such as three hours at 100% or twelve hours above 90%, the instance is under a heavy load. The CPU is likely the bottleneck in the system. You should investigate your workload and queries by considering the following questions:
Are large offline data imports (INSERT) occurring with growing data volumes?
Are high-QPS queries or writes consuming all CPU resources?
Is a mix of these workloads, or another type of workload, running at the same time?
If full CPU usage is required for your business needs, you can scale up the instance to handle more complex queries or larger datasets.
For more information, see FAQ for monitoring metrics.
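The sustained-load rule of thumb above (three hours at 100%, or twelve hours above 90%) can be checked against exported CPU samples. The following Python sketch assumes one sample per minute and uses hypothetical function names. It is not part of any Hologres or Cloud Monitor API. Reported values may hover slightly below 100, so you might relax the first threshold in practice.

```python
from typing import Callable, Iterable


def longest_run(samples: Iterable[float], matches: Callable[[float], bool]) -> int:
    """Return the length of the longest consecutive run of matching samples."""
    longest = current = 0
    for value in samples:
        current = current + 1 if matches(value) else 0
        longest = max(longest, current)
    return longest


def is_cpu_sustained_high(samples_per_minute: list) -> bool:
    """Apply the rule of thumb: 3 hours at 100% or 12 hours above 90%."""
    minutes_at_full = longest_run(samples_per_minute, lambda v: v >= 100.0)
    minutes_above_90 = longest_run(samples_per_minute, lambda v: v > 90.0)
    return minutes_at_full >= 3 * 60 or minutes_above_90 >= 12 * 60
```

A result of True indicates only that the workload deserves a closer look. It does not identify which of the causes listed above applies.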
Worker Node CPU Usage (%)
Worker node CPU usage reflects the CPU load on each Worker node. Hologres provides a varying number of Worker nodes depending on the instance type. For more information, see Instance management.
This metric is supported only in Hologres V1.1 and later.
If all Worker nodes show sustained CPU usage near 100%, the instance is heavily loaded. You can optimize resource usage or scale up the instance based on your workload.
If only some Worker nodes show high CPU usage while others have low usage, a resource skew exists. For common causes and troubleshooting steps, see FAQ for monitoring metrics.
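The two cases above, uniform heavy load and resource skew, can be told apart with a simple comparison across Worker nodes. The following Python sketch is a minimal illustration. The 90% and 50% thresholds and the function name classify_worker_cpu are assumptions chosen for the example and are not values defined by Hologres.

```python
def classify_worker_cpu(usage_by_worker: dict, high: float = 90.0, low: float = 50.0) -> str:
    """Classify per-Worker CPU usage as 'heavy_load', 'skew', or 'ok'.

    The thresholds are illustrative assumptions; tune them to your workload.
    """
    values = list(usage_by_worker.values())
    if not values:
        return "ok"
    if all(v >= high for v in values):
        return "heavy_load"  # every Worker is busy: optimize or scale up
    if any(v >= high for v in values) and any(v <= low for v in values):
        return "skew"  # some Workers busy while others are idle
    return "ok"
```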
Cluster CPU Usage (%)
The CPU usage of each Cluster in the compute group.
Memory
The following metrics relate to memory usage.
Instance Memory Usage (%)
Instance memory usage reflects the overall memory consumption.
Hologres reserves memory. Even without active queries, metadata, indexes, and data caches are loaded into memory to accelerate retrieval and computation. Therefore, non-zero memory usage during idle periods is normal. Typically, 30% to 40% usage is expected when the instance is idle.
If memory usage steadily climbs toward 80%, memory may become a bottleneck and affect stability or performance.
You can use memory distribution metrics along with QPS and other indicators to identify high-memory consumers and perform optimizations. For more information, see Troubleshooting guide for out-of-memory issues.
Worker Node Memory Usage (%)
Worker node memory usage reflects the memory load on each Worker node. Hologres provides a varying number of Worker nodes depending on the instance type. For more information, see Instance management.
This metric is supported only in Hologres V1.1 and later.
If all Worker nodes show sustained memory usage near 80%, the instance is heavily loaded. You can optimize resource usage or scale up the instance based on your workload.
If only some Worker nodes show high memory usage while others have low usage, a resource skew exists. For common causes and troubleshooting steps, see FAQ for monitoring metrics.
Detailed Compute Group Memory Usage (%)
Hologres divides memory into the following categories: system (System), metadata (Meta), cache (Cache), query (Query), and background process (Background). Starting in V2.0.15, memory distribution metrics can help you analyze usage patterns and optimize effectively. The key categories include the following:
System: The memory used by system components such as Holohub, Gateway, and Frontend (FE). The FE includes the FE Master and FE Query, so System memory fluctuates with query activity.
Cache: The memory used for caching. It includes the following:
SQL-related caches, such as the result cache and block cache. These caches change dynamically with query execution. Higher cache hit rates improve query performance. For example, smaller values in the Physical read bytes field of EXPLAIN ANALYZE indicate better cache hit rates. Caches have size limits.
Meta cache: Schema metadata and file metadata. To accelerate query execution, Hologres preloads relevant metadata into the cache, which reduces cold access and improves performance.
The cache size is fixed, typically at around 30% of the total instance memory. Some cache usage persists even when the instance is idle, which is mainly for Meta.
Meta: The memory used for metadata and files. Hologres uses a lazy open mode where frequently accessed metadata stays in memory, but infrequently accessed metadata does not. This mode reduces memory pressure. You should keep Meta usage under 30% of the total memory. High Meta usage suggests many files or partitioned tables. You can use Table statistics overview and analysis to manage tables.
Query: The memory consumed during SQL execution. The usage scales with query complexity and concurrency. This includes the memory used by Fixed Plan, HQE, and SQE.
Query memory uses elastic allocation. The minimum memory per Worker is 20 GB, and the maximum depends on the available free memory. Higher memory usage in other categories reduces the elastic memory available for Query.
High Query memory usage or out-of-memory (OOM) events suggest complex queries or high concurrency. You can optimize queries or scale up the instance. For more information, see Optimize query performance.
Background: The memory used by background tasks such as compaction and flush. Background memory usage is typically low, under 5%. It temporarily increases during index changes, bulk writes, or updates, and then drops as tasks are completed.
Memtable: The memory used for in-memory tables. Memtables store data after real-time writes, updates, or deletes. Memtable usage is typically under 5%.
QE Query Memory Usage (bytes)
The memory used by queries that are executed by HQE, SQE, or other XQE engines.
This metric is supported only in Hologres V2.0.44 and later, and V2.1.22 and later.
In memory breakdowns, Query memory usage is greater than or equal to QE Query memory usage because Query memory also includes the memory used by Fixed Plan queries.
QE Query memory usage helps you assess query complexity. Higher usage indicates more complex queries that require more memory.
QE Query Memory Usage (%)
The percentage of memory used by queries that are executed by the QE engine. This metric helps you assess the instance load. High usage may cause OOM errors. You can optimize queries or scale up the instance.
This metric is supported only in Hologres V2.0.44 and later, and V2.1.22 and later.
Cluster Memory Usage (%)
The memory usage of each Cluster in the compute group.
Query QPS and RPS
Query QPS (count/s)
Query QPS is the average number of SQL statements executed per second across the instance. It includes SELECT, INSERT, UPDATE, DELETE, UTILITY, and UNKNOWN statements. Query QPS ≥ QE Query QPS + FixedQE Query QPS.
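Because the total includes statements that belong to neither engine category (for example UTILITY, UNKNOWN, and Engine Type={PG} queries), the difference between the three metrics shows how much traffic falls outside QE and FixedQE. The following Python sketch computes that remainder from values read at the same timestamp. The function name other_qps is hypothetical.

```python
def other_qps(total_qps: float, qe_qps: float, fixedqe_qps: float) -> float:
    """Return the QPS that is counted in the total but not in QE or FixedQE."""
    remainder = total_qps - qe_qps - fixedqe_qps
    # Allow a tiny negative value caused by floating-point rounding.
    if remainder < -1e-9:
        raise ValueError("total QPS is lower than QE QPS + FixedQE QPS")
    return max(remainder, 0.0)
```

For example, a total of 120 with QE at 80 and FixedQE at 30 leaves 10 queries per second for the other categories.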
QE Query QPS (count/s)
The number of queries executed per second by the QE engine. This includes SELECT, INSERT, UPDATE, and DELETE statements.
This metric is supported only in Hologres V2.2 and later.
FixedQE Query QPS (count/s)
The number of queries executed per second by the FixedQE engine (Fixed Plan path, formerly SDK). This includes SELECT, INSERT, UPDATE, and DELETE statements.
This metric is supported only in Hologres V2.2 and later.
DML RPS (count/s)
DML RPS is the average number of data records imported or updated per second by INSERT, UPDATE, and DELETE statements. DML RPS = QE DML RPS + FixedQE DML RPS.
QE DML RPS (count/s)
The number of data records imported or updated per second by the QE engine. This includes INSERT, UPDATE, and DELETE statements.
This metric is supported only in Hologres V2.2 and later.
Common QE scenarios include the following:
Batch import or update from MaxCompute or OSS external tables.
Batch write or update using COPY.
Batch import between Hologres tables.
FixedQE DML RPS (count/s)
The number of data records imported or updated per second by INSERT, UPDATE, and DELETE statements that are executed by the FixedQE engine (formerly SDK).
This metric is supported only in Hologres V2.2 and later.
Common FixedQE scenarios include the following:
Offline writes using Data Integration (DataX).
Writes using SQL or JDBC with INSERT INTO VALUES().
Query Latency
Query Latency (milliseconds)
The average latency of all queries in the instance. This includes SELECT, INSERT, UPDATE, DELETE, UTILITY, and UNKNOWN statements. Query Latency ≥ MAX(QE Query Latency, FixedQE Query Latency).
QE Query Latency (milliseconds)
The average latency of queries that are executed by the QE engine. This includes SELECT, INSERT, UPDATE, and DELETE statements.
This metric is supported only in Hologres V2.2 and later.
To troubleshoot increased QE Query latency, you can check the Optimization duration, Start Query duration, Get Next duration, and QE QPS.
FixedQE Query Latency (milliseconds)
The average latency of queries that are executed by the FixedQE engine. This includes SELECT, INSERT, UPDATE, and DELETE statements.
This metric is supported only in Hologres V2.2 and later.
High FixedQE Query latency may result from the following reasons:
Occasional spikes: These may indicate HQE locks. You can check whether the FixedQE backend lock wait time has increased. If it has, you can use Query Insight to identify the locking queries.
Persistent high latency: This may result from a suboptimal table design or interference from complex queries. See Common issues and diagnostics for Blink and Flink.
Optimization Phase Duration (milliseconds)
The time spent in the Optimization phase for a query. During this phase, the optimizer parses the SQL statement and generates a physical plan for the execution engine.
This metric is supported only in Hologres V2.0.44 and later, and V2.1.22 and later.
Long Optimization durations suggest complex queries. If queries differ only in their parameters, you can use Prepared Statements to reduce optimization overhead. For more information, see JDBC.
Start Query Phase Duration (milliseconds)
The time spent in the Start Query phase, which is the initialization before the actual query execution. This includes locking and schema version alignment.
This metric is supported only in Hologres V2.0.44 and later, and V2.1.22 and later.
Long Start Query durations often result from lock waits or high CPU usage. You can use execution plans for deeper analysis.
Get Next Phase Duration (milliseconds)
The time from the end of the Start Query phase until all results are returned. This includes computation and result delivery.
This metric is supported only in Hologres V2.0.44 and later, and V2.1.22 and later.
Long Get Next durations often reflect complex computations. You can correlate this with QE memory usage and QE QPS. If no anomalies exist, the client may simply be waiting to receive the results.
Query P99 Latency (milliseconds)
The P99 latency of all queries in the instance. This includes SELECT, INSERT, UPDATE, UTILITY, and system queries.
Longest Running Query Duration in This Instance (milliseconds)
The duration of the longest-running query in the instance. This metric reports the longest-running query at the current moment. It includes SELECT, INSERT, UPDATE, DELETE, UTILITY, and UNKNOWN statements.
This metric is supported only in Hologres V1.1 and later.
Hologres is a distributed system. The number of Worker nodes varies by instance type. Queries are randomly distributed across Workers. This metric reports the longest-running query across all Workers. For example, if Workers run queries for 10 minutes, 5 minutes, and 30 seconds, the reported duration is 10 minutes.
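The example above amounts to taking the maximum over the longest-running query on each Worker. A minimal Python sketch, with hypothetical Worker names and a hypothetical function name:

```python
from datetime import timedelta


def longest_running(durations_by_worker: dict) -> timedelta:
    """Return the longest current query duration across all Workers."""
    return max(durations_by_worker.values())


# Durations from the example in the text; the Worker names are made up.
reported = longest_running({
    "worker-1": timedelta(minutes=10),
    "worker-2": timedelta(minutes=5),
    "worker-3": timedelta(seconds=30),
})
print(reported)  # 0:10:00
```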
You can combine this metric with active queries or slow query logs to assess query duration, diagnose long-running queries, and resolve deadlocks or hangs.
Metrics are reported every minute, so the reported duration can lag slightly behind the actual running time of the query. This metric aids in anomaly detection by helping you quickly locate long-running queries, but it does not provide precise timing.
Failed Query QPS
Failed Query QPS (count/s)
Failed Query QPS is the average number of failed SQL statements per second in the instance. It includes SELECT, INSERT, UPDATE, DELETE, UTILITY, and UNKNOWN statements. Failed Query QPS ≥ QE Failed Query QPS + FixedQE Failed Query QPS.
You can use the failed query type and frequency to find failing queries in the slow query logs. You can then analyze the root causes to improve availability.
QE Failed Query QPS (count/s)
The number of queries that fail per second when using the QE engine. This includes SELECT, INSERT, UPDATE, and DELETE statements.
This metric is supported only in Hologres V2.2 and later.
FixedQE Failed Query QPS (count/s)
The number of queries that fail per second when using the FixedQE engine. This includes SELECT, INSERT, UPDATE, and DELETE statements.
This metric is supported only in Hologres V2.2 and later.
Locks
Maximum FE Lock Wait Time (milliseconds)
Hologres is a distributed system. Multiple FE nodes parse, dispatch, and route SQL statements. When multiple connections are routed to the same FE and perform DDL operations on the same table, such as CREATE or DROP, FE locks occur. This metric shows how long each FE waits for DDL locks.
This metric is supported only in Hologres V2.2 and later.
DDL operations always incur a lock wait time. If the FE lock wait time exceeds five minutes and the FE replay delay also spikes, a DDL operation may be stuck. You can use Manage queries to find and terminate long-running queries.
FixedQE Backend Lock Wait Time (milliseconds)
INSERT, DELETE, or UPDATE queries that use HQE take table locks. Queries that use Fixed Plan take row locks. The FixedQE backend lock wait time increases when Fixed Plan queries wait for row locks while HQE queries hold table locks on the same table.
This metric is supported only in Hologres V2.2 and later.
If the FixedQE lock wait time is high, you can use slow query logs to find slow FixedQE queries. Then, you can use Query Insight to identify the locking HQE queries.
Instance Total Backend Lock Wait Time (milliseconds)
The total lock wait time for INSERT, DELETE, or UPDATE queries in the instance. This includes FixedQE and HQE lock wait times.
This metric is supported only in Hologres V2.2 and later.
If the lock wait time is high, you can use slow query logs to find slow INSERT, DELETE, or UPDATE queries. Then, you can use Query Insight to identify the locking HQE queries.
Connection
Total Connections (count)
Hologres sets default connection limits based on the instance type. For more information, see Instance management. Total connections represent all current connections, including those in the active, idle, and idle-in-transaction states. You can use Manage queries to view the current usage. If the number of available connections is low, terminate idle connections.
Connections by Database (count)
The number of connections aggregated by database. You can use this to assess the connection usage for each database. Note the following:
The default connection limit per database is 128. For more information, see Instance management.
If the number of connections approaches the limit, you should review the idle connections versus business connections. For more information, see Connection management. You can clean up idle connections or scale up to add capacity.
If the connection load skews across Workers, you can use Connection management to clean up idle connections and balance the load.
Connections by FE (count)
The number of connections aggregated by FE. You can use this to assess the connection usage for each FE. Note the following:
The default connection limit per FE node is 128. For more information, see Instance management.
If the number of connections approaches the limit, you should review the idle connections versus business connections. For more information, see Connection management. You can clean up idle connections or scale up to add capacity.
If the connection load skews across FE nodes, you can use Connection management to clean up idle connections and balance the load.
Connection Usage Rate of FE with Highest Usage (%)
This metric reports the highest connection usage rate among all FE (Frontend) nodes: Max(frontend_connection_used_rate). This helps you spot when connections are approaching the limit on any FE node and prevent connection failures. FE nodes use round-robin load balancing where new connections are distributed evenly across FEs. You can use Manage queries to view the current usage. You should kill idle connections if the number of available connections is low.
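The calculation behind this metric can be sketched as follows. The Python example assumes the default per-FE limit of 128 stated in this topic and uses hypothetical FE names. The function name max_fe_connection_usage is also an assumption, and your instance may have a different limit.

```python
FE_CONNECTION_LIMIT = 128  # default per-FE limit stated in this topic


def max_fe_connection_usage(connections_by_fe: dict, limit: int = FE_CONNECTION_LIMIT):
    """Return (fe_name, usage_percent) for the FE with the highest usage rate."""
    rates = {fe: used / limit * 100 for fe, used in connections_by_fe.items()}
    busiest = max(rates, key=rates.get)
    return busiest, rates[busiest]


# Hypothetical connection counts per FE node.
print(max_fe_connection_usage({"fe-0": 32, "fe-1": 96}))  # ('fe-1', 75.0)
```

Reporting the maximum instead of the average surfaces a single saturated FE even when the other FE nodes still have headroom.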
Query Queue
Queued Queries Count (count)
The number of query requests that are waiting to be executed but have not yet been processed.
This metric is supported only in Hologres V3.0 and later.
Query Queue Entry QPS (count/s)
The number of query requests submitted to the system queue per second. You can use this to gauge the system load and query frequency.
This metric is supported only in Hologres V3.0 and later.
Queries Transitioned from Queued to Running QPS (count/s)
The number of query requests that transition from the waiting state to the running state per second.
This metric is supported only in Hologres V3.0 and later.
QPS by State for Queries That Started Running (count/s)
The QPS for queries in the query queue, grouped by state. The states include the following:
kReadyToRun (qualified to run)
kQueueTimeout (failed due to queue timeout)
kCanceled (failed due to cancellation)
kExceedConcurrencyLimit (failed due to concurrency limit)
This metric is supported only in Hologres V3.0 and later.
Average Query Queue Wait Time (milliseconds)
The average time between entering the queue and starting processing, not including the actual query execution time, in milliseconds.
This metric is supported only in Hologres V3.0 and later.
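The definition above excludes execution time: only the interval between entering the queue and starting to run is counted. The following Python sketch illustrates this with hypothetical timestamps in milliseconds. The tuple layout and the function name are assumptions made for the example.

```python
def average_queue_wait_ms(queries: list) -> float:
    """Average time from enqueue to start, ignoring execution time.

    Each item is (enqueued_ms, started_ms, finished_ms).
    """
    waits = [started - enqueued for enqueued, started, _finished in queries]
    return sum(waits) / len(waits) if waits else 0.0


# Two hypothetical queries: waits of 100 ms and 300 ms.
# The long execution time of the first query does not affect the result.
print(average_queue_wait_ms([(0, 100, 5000), (50, 350, 400)]))  # 200.0
```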
Query Queue Auto-Rate-Limit Max Concurrency (count)
The maximum concurrency for auto-rate-limited query queues.
This metric is supported only in Hologres V3.1 and later.
I/O
I/O throughput measures the read and write volume of the instance. It reflects disk I/O activity and helps you assess the system load and diagnose issues. Note: 1 GiB = 1024 MiB = 1024 × 1024 KiB.
Standard storage (hot): The I/O throughput is not fixed. It mainly depends on the CPU load.
IA storage (cold): The maximum I/O throughput is 80 MB/s × (number of cores / 16).
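As a worked example of the IA storage formula, the following Python sketch computes the throughput ceiling for a given number of cores. The function name is hypothetical. The 80 MB/s and 16-core constants come from this topic.

```python
def max_ia_throughput_mb_per_s(cores: int) -> float:
    """Maximum IA (cold) storage I/O throughput: 80 MB/s * (cores / 16)."""
    return 80 * (cores / 16)


for cores in (16, 32, 64):
    print(cores, max_ia_throughput_mb_per_s(cores))
# 16 80.0
# 32 160.0
# 64 320.0
```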
Standard I/O Read Throughput (bytes/s)
The I/O throughput when queries read Standard storage data.
Standard I/O Write Throughput (bytes/s)
The I/O throughput when queries write Standard storage data.
Low-frequency I/O Read Throughput (bytes/s)
The I/O throughput when queries read IA storage data.
Low-frequency I/O Write Throughput (bytes/s)
The I/O throughput when queries write IA storage data.
Storage
The logical disk space used by instance data, which is the sum of all database storage, including the recycle bin. Note: 1 GiB = 1024 MiB = 1024 × 1024 KiB. Hologres storage usage grows continuously with no hard cap.
For subscription instances, storage that exceeds the purchased amount is automatically billed on a pay-as-you-go basis. This does not impact system stability or usability.
If usage exceeds the purchased storage capacity, promptly upgrade the storage or delete unused data to avoid unnecessary storage costs. The savings can be used to fund additional compute resources.
You can use the pg_relation_size function to view table and database storage sizes and details. You can also use Table Info for fine-grained table management.
Standard Storage Used Capacity (bytes)
The capacity used in Standard storage. You should scale up the storage if the usage exceeds the purchased capacity.
Standard Storage Usage (%)
The usage percentage of Standard storage capacity. You should scale up the storage if the usage exceeds the purchased capacity.
IA Storage Used Capacity (bytes)
The capacity used in IA storage. You should scale up the storage if the usage exceeds the purchased capacity.
IA Storage Usage (%)
The usage percentage of IA storage capacity. You should scale up the storage if the usage exceeds the purchased capacity.
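To relate the capacity metrics (bytes) to the usage metrics (%), the following sketch assumes that the usage percentage is the used capacity divided by the purchased capacity. That reading is an assumption, not something this topic states explicitly. The function name and sample values are hypothetical.

```python
from typing import Tuple

GIB = 1024 ** 3  # 1 GiB = 1024 MiB = 1024 * 1024 KiB, expressed in bytes


def storage_usage(used_bytes: int, purchased_gib: float) -> Tuple[float, float]:
    """Return (usage percent, overage in GiB) for one storage class.

    Assumes usage % = used capacity / purchased capacity * 100.
    Any overage is the portion that would be billed on a pay-as-you-go basis
    for a subscription instance.
    """
    if purchased_gib <= 0:
        raise ValueError("purchased_gib must be positive")
    used_gib = used_bytes / GIB
    usage_pct = used_gib * 100 / purchased_gib
    overage_gib = max(0.0, used_gib - purchased_gib)
    return usage_pct, overage_gib


pct, over = storage_usage(used_bytes=1200 * GIB, purchased_gib=1000)
print(f"usage {pct:.1f}%, overage {over:.0f} GiB")
```

A usage value above 100% in this sketch corresponds to the case where you should scale up the storage or delete unused data.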
Recycle Bin Storage Usage (bytes)
Hologres supports a table recycle bin starting in V3.1. Tables that are dropped using the DROP command remain in the recycle bin for a retention period. This lets you recover accidentally dropped tables. Tables in the recycle bin still consume instance storage. You should monitor the recycle bin usage for each database. If frequent table drops cause high recycle bin usage, you can configure tables to skip the recycle bin upon deletion.
Framework
FE Replay Delay (milliseconds)
Hologres is a distributed system. Multiple Frontend (FE) nodes handle SQL parsing, dispatch, and routing. For DDL operations, Hologres first executes the operation on one FE and then replays it on the others. Note the following:
FE replay takes time. Delays at the millisecond or second level are normal.
If an FE's replay delay exceeds several minutes, too many DDL operations may overwhelm the replay process. If the delay continues to increase, a query may be stuck. You can use hg_stat_activity to find and kill long-running queries.
This metric is supported only in Hologres V2.2 and later.
Shard Multi-Replica Sync Delay (milliseconds)
The sync delay between Shard replicas after Replication is enabled.
The typical Shard replica delay is in milliseconds.
Heavy data writes, updates, or frequent DDL operations may increase the sync delay.
Primary-Follower Sync Delay (milliseconds)
The delay that occurs when a follower instance reads data from the primary instance, in milliseconds. Note the following:
This metric appears only for follower instances, not primary instances.
Data appears only after a follower instance is bound to a primary instance (0 ms initially). The sync delay fluctuates when the primary instance receives writes.
The normal sync delay is in milliseconds. Occasional jitter, for example, from primary DDL operations, is safe to ignore. A persistent high delay of more than a few seconds may indicate a high instance load or a resource shortage. You should check the CPU and memory usage and scale up the instance if needed.
The sync delay may spike to several minutes during restarts or upgrades and then recover automatically.
Cross-Instance File Sync Delay (milliseconds)
The file sync delay between disaster recovery instances. This metric appears only on follower instances (read-only followers).
Auto Analyze
Tables Missing Statistics per Database (count)
The number of tables that are missing statistics in each database.
This metric is supported only in Hologres V2.2 and later.
For Hologres V2.0 and later, Auto Analyze runs by default. After a table is created or after bulk writes or updates, the statistics may lag. You should first observe the statistics for a short period.
If a database consistently lacks statistics for hours or days, Auto Analyze may not have been triggered. You can use the HG_STATS_MISSING view to list the affected tables and then manually run the ANALYZE command to update the statistics. For more information, see ANALYZE and AUTO ANALYZE.
Serverless Computing
Longest Running Serverless Computing Query Duration (milliseconds)
Hologres supports Serverless Computing. You can run specific queries in a dedicated Serverless Computing resource pool to isolate them from the main instance and ensure fast execution.
This metric is supported only in Hologres V2.1 and later.
This metric shows the longest-running query in Serverless Computing. You can use hg_stat_activity to inspect the status of Serverless queries.
Serverless Computing Query Queue Count (count)
The number of queries that are queued in the Serverless Computing resource pool.
This metric is supported only in Hologres V2.2 and later.
Serverless Computing Resource Quota Usage (%)
The ratio of the actual Serverless Computing resources used to the maximum allocatable resources over a given time.
This metric is supported only in Hologres V2.2 and later.
Binary Logging
Binlog Consumption Rate (count/s or bytes/s)
Hologres supports subscribing to Hologres Binlog. Binlog enables real-time data tiering and accelerates data forwarding.
Binlog Consumption Rate (count/s)
The number of Binlog entries consumed per second. This metric is supported only in Hologres V2.2 and later.
Binlog Consumption Rate (bytes/s)
The number of bytes consumed from Binlog per second. Larger fields or higher data volumes increase the byte count. This metric is supported only in Hologres V2.2 and later.
WAL Sender Count and Usage Rate
When you consume Binlog using JDBC, each shard of each table uses one WAL sender connection, much like a regular connection. WAL sender connections are counted separately from regular connections. The number of WAL senders has a default limit.
WAL Sender Count per FE (count)
The number of WAL senders used per FE node.
WAL Sender Usage Rate of FE with Highest Usage (%)
The utilization rate of the frontend (FE) that uses the most WAL senders.
You can use both metrics to assess WAL sender usage. If the usage reaches the limit, see Consume Hologres Binlog via JDBC for troubleshooting.
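The two WAL sender metrics can be reasoned about with simple arithmetic. The sketch below estimates how many WAL senders a JDBC Binlog consumer needs (one per shard per table, as described above) and derives the usage rate of the busiest FE. The per-FE limit is passed in as a parameter because this topic does not state the default value. The function names, table names, and the limit of 100 in the example are placeholders.

```python
from typing import Dict


def wal_senders_needed(shards_per_table: Dict[str, int]) -> int:
    """One WAL sender per shard of each consumed table."""
    return sum(shards_per_table.values())


def peak_fe_usage_pct(senders_per_fe: Dict[str, int], limit_per_fe: int) -> float:
    """Usage rate (%) of the FE that currently holds the most WAL senders.

    limit_per_fe is an assumed input; look up the real limit for your instance.
    """
    if limit_per_fe <= 0:
        raise ValueError("limit_per_fe must be positive")
    if not senders_per_fe:
        return 0.0
    return max(senders_per_fe.values()) * 100 / limit_per_fe


# Hypothetical tables and shard counts.
print(wal_senders_needed({"orders": 20, "events": 40}))
# Hypothetical per-FE counts with a placeholder limit of 100.
print(peak_fe_usage_pct({"fe-0": 30, "fe-1": 45}, limit_per_fe=100))
```

Watching the busiest FE rather than the average matters because a single FE can reach its limit while the others still have headroom.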
Computing Resource
Elastic Core Count (count) for Compute Groups
Hologres compute group instances support time-based elasticity. For more information, see Time-based elasticity (Beta). This metric shows the number of cores that are added using time-based elasticity.
Compute Group Auto-Elastic Core Count (count)
Hologres compute group instances support auto-elasticity. For more information, see Multi-cluster and auto-elasticity (Beta). This metric shows the number of cores that are added using auto-elasticity.
Gateway
Gateway CPU Usage (%)
The CPU usage of each Gateway in the instance.
This metric is supported only in Hologres V2.0 and later.
Gateways use round-robin traffic forwarding. CPU usage occurs even without new connections.
Starting in Hologres V2.2.22, Gateways launch more worker threads by default to improve the handling of new connections, which increases CPU usage.
Gateway Memory Usage (%)
The memory usage of each Gateway in the instance.
This metric is supported only in Hologres V2.0 and later.
Gateway New Connection Requests per Second (count/s)
The maximum number of new connections that the system can accept and successfully establish per second.
This metric is supported only in Hologres V2.1.12 and later.
A single Gateway handles up to approximately 100 new connections per second.
If the number of new connection requests approaches 100 × Gateway count, the Gateways become the bottleneck for handling new connections. You can configure a connection pool or scale up the number of Gateways.
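A minimal sketch of the new-connection check described above. It uses the approximate figure of 100 new connections per second per Gateway from the text. The 80% threshold used to decide what "approaches" means is an assumption made for this example, and the function names are hypothetical.

```python
CONNECTIONS_PER_GATEWAY_PER_S = 100  # approximate per-Gateway figure from the text


def new_connection_capacity(gateway_count: int) -> int:
    """Approximate new connections per second the Gateways can establish."""
    if gateway_count <= 0:
        raise ValueError("gateway_count must be positive")
    return CONNECTIONS_PER_GATEWAY_PER_S * gateway_count


def is_near_connection_limit(
    requests_per_s: float, gateway_count: int, threshold: float = 0.8
) -> bool:
    """True when the request rate reaches the chosen fraction of capacity.

    threshold=0.8 is an illustrative choice, not a documented value.
    """
    return requests_per_s >= threshold * new_connection_capacity(gateway_count)


print(new_connection_capacity(2))        # capacity with two Gateways
print(is_near_connection_limit(170, 2))  # 170/s against a capacity of 200/s
```

When the check returns true, a connection pool reduces the number of new connections, while adding Gateways raises the capacity.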
Gateway Inbound Traffic Rate (B/s)
The volume of data that enters the system through the Gateway per second.
This metric is supported only in Hologres V2.1 and later.
If the inbound traffic approaches 200 MiB/s × Gateway count, the Gateway network capacity becomes the bottleneck. You can scale up the number of Gateways.
Gateway Outbound Traffic Rate (B/s)
The volume of data sent from the Gateway to external systems per second.
This metric is supported only in Hologres V2.1 and later.
If the outbound traffic approaches 200 MiB/s × Gateway count, the Gateway network capacity becomes the bottleneck. You can scale up the number of Gateways.
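The traffic metrics are reported in B/s, while the capacity guideline is given in MiB/s, so a unit conversion is needed before comparing them. The sketch below assumes that B/s means bytes per second and applies the 200 MiB/s per Gateway figure to either direction. The function name is hypothetical.

```python
MIB = 1024 * 1024                 # bytes in one MiB
TRAFFIC_PER_GATEWAY_MIB_S = 200   # per-Gateway guideline from the text


def gateway_traffic_utilization(bytes_per_s: float, gateway_count: int) -> float:
    """Fraction of the combined Gateway network capacity in use.

    Works for inbound or outbound traffic; pass the metric value in bytes/s.
    """
    if gateway_count <= 0:
        raise ValueError("gateway_count must be positive")
    capacity_bytes_per_s = TRAFFIC_PER_GATEWAY_MIB_S * MIB * gateway_count
    return bytes_per_s / capacity_bytes_per_s


# 300 MiB/s of outbound traffic across two Gateways (capacity 400 MiB/s).
print(gateway_traffic_utilization(300 * MIB, gateway_count=2))  # 0.75
```

A value that stays close to 1.0 suggests that the Gateway network capacity is the limiting factor.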
Dynamic Table Monitoring and Alerting
Starting in Hologres V4.0.8, Dynamic Tables offer monitoring metrics to help you better manage refresh tasks. For more information, see Monitoring and alerting.
Common Monitoring Metric Issues
The FAQ for monitoring metrics topic covers common issues. It helps you diagnose problems, identify root causes, and apply fixes on your own.
Monitoring Metric Alerting
You can set alerts for monitoring metrics in Cloud Monitor to detect anomalies early and minimize the impact on your business. For more information, see Cloud Monitor.