FAQ and troubleshooting for Hologres monitoring metrics - Hologres

This page covers common questions about Hologres monitoring metrics, including high CPU usage, memory pressure, slow queries, and write performance.

How do I view and kill connections when the number of connections is too high?

Connections include all SQL connections in an instance — active and idle — from Java Database Connectivity (JDBC) or PSQL. If you see either of the following errors, the instance has reached its connection limit:

FATAL: sorry, too many clients already connection limit exceeded for superusers
FATAL: remaining connection slots are reserved for non-replication superuser connections

View current connections in HoloWeb or using SQL. For details, see Manage connections. Then use a superuser account to kill unexpected or idle connections.

What do I do if query latency is too high?

Start by identifying the slow SQL statements in the slow query log. For details, see Get and analyze slow query logs. Then address the issue based on your situation:

Low queries per second (QPS) but complex SQL: Optimize the SQL statements and set appropriate indexes. See Best practices: internal table performance tuning and Accelerate MaxCompute queries.
High QPS: If the SQL is already optimized but you need higher QPS and lower latency, scale up the instance. See Instances page overview.

Write operations affecting query performance: Perform write operations during off-peak query hours, or reduce write concurrency. For MaxCompute foreign tables, use the following parameters to reduce concurrency:

-- Limit MaxCompute execution concurrency. Default: 128. A lower value prevents one query from affecting others.
SET hg_experimental_foreign_table_executor_max_dop = 32;

-- Reduce the batch size for each read from a MaxCompute table. Default: 8192.
SET hg_experimental_query_batch_size = 1024;

-- Read ORC files directly.
SET hg_experimental_enable_access_odps_orc_via_holo = on;

-- Increase the split size for large MaxCompute tables to reduce the number of concurrent tasks. Default: 64 MB.
SET hg_experimental_foreign_table_split_size = 512MB;

What causes high memory usage and how do I resolve it?

Hologres uses a memory reservation model: even with no active queries, metadata, indexes, and data caches are loaded into memory to speed up retrieval. Memory usage of around 30%–40% is normal when no queries are running.

When memory usage steadily climbs toward 80%, it usually means one of two things:

Growing data volume: The number of tables and total data has far exceeded the current compute resources. Memory usage scales with the amount of metadata and indexes.
Over-indexed tables: Tables with many columns, most of which are TEXT columns, and excessive bitmap or dictionary indexes consume disproportionate memory.

Sustained memory usage near 80% can cause stability issues — errors like SERVER_INTERNAL_ERROR, ERPC_ERROR_CONNECTION_CLOSED, or Total memory used by all existing queries exceeded memory limitation — and performance degradation, such as a reduced cache hit rate and higher query latency.

To resolve this:

Delete unused data to release memory occupied by metadata.
Remove unnecessary indexes: If bitmap or dictionary indexes are not used in your workload, remove them using ALTER TABLE. Analyze your business requirements before removing any indexes.
Upgrade compute and storage resources:
- General scenarios: 1 compute unit (CU) — 1 vCPU + 4 GB memory — supports 50–100 GB of data storage.
- Low-latency serving: For best performance, hot data should fit in the memory cache. The cache occupies 30% of total memory, so a 1 CU instance provides 1.3 GB for the data cache. For 100 GB of hot data, you need at least 320 GB of memory (96 CUs or more).

Why does the CPU usage of a Hologres instance reach 100% with only one task?

Hologres is designed to fully use multi-core parallel computing. A single query can drive CPU usage to 100%, which simply means compute resources are fully utilized — this is expected behavior, not a problem. The issue arises only when high CPU usage causes slow queries or writes. See What do I do if CPU usage remains at 100% for a long time? for how to diagnose and resolve that.

How do I resolve slow writes?

If INSERT, INSERT ON CONFLICT, or UPDATE commands are taking too long, the most likely cause is that the SQL is not using a Fixed Plan. SQL without a Fixed Plan is subject to table locks: under concurrent execution, each statement waits for the lock, causing long runtimes. In the Real-time Write RPS monitoring metric, these appear as write type insert.

To resolve this, rewrite the SQL to use a Fixed Plan. The write type in the Real-time Write RPS metric changes to SDK, and performance improves. For details, see Accelerate SQL execution with fixed plans.

What do I do if CPU usage remains at 100% for a long time?

Sustained high CPU usage — 100% for 3 or more consecutive hours, or above 90% for 12 or more consecutive hours — means CPU has become a system bottleneck. Work through the following checks in order.

Check 1: Has QPS or RPS increased significantly?

Compare QPS and RPS metrics before and after the CPU spike. If there's a clear upward trend:

For SELECT-driven spikes: use the slow query log to find long-running queries and optimize them.
For INSERT/UPDATE/DELETE-driven spikes: check whether these operations are bypassing Fixed Plan. SQL without a Fixed Plan causes table locks and lock waits under concurrent execution.
```
-- Find INSERT/UPDATE/DELETE operations that did not use a Fixed Plan in the last 3 hours.
SELECT *
FROM hologres.hg_query_log
WHERE query_start >= now() - INTERVAL '3 h'
  AND command_tag IN ('INSERT', 'UPDATE', 'DELETE')
  AND 'HQE' = ANY(engine_type)
ORDER BY query_start DESC
LIMIT 500;
```
If applicable, rewrite these statements to use a Fixed Plan. If all SQL is already optimized, scale out the instance or deploy read/write splitting. See Upgrade an instance or Deploy primary and secondary instances for read/write splitting (shared storage).

Check 2: Are there long-running queries?

If QPS and RPS are stable but CPU suddenly spikes, check the Running Query Duration metric. Queries running for more than 30 minutes or an hour are likely consuming CPU continuously. Run the following to find and cancel them:

-- Find long-running queries.
SELECT current_timestamp - query_start AS runtime, datname::text, usename, query, pid::text
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY 1 DESC;

-- Cancel a query.
SELECT pg_cancel_backend(<pid>);

Check 3: Are there high-CPU-consuming queries?

Use the slow query log to find queries that consume the most CPU, memory, and I/O:

-- Find high-resource queries in the last 3 hours.
SELECT
  status                                    AS "Status",
  duration                                  AS "Duration (ms)",
  query_start                               AS "Start time",
  (read_bytes / 1048576)::text || ' MB'     AS "Read volume",
  (memory_bytes / 1048576)::text || ' MB'   AS "Memory",
  (shuffle_bytes / 1048576)::text || ' MB'  AS "Shuffle",
  (cpu_time_ms / 1000)::text || ' s'        AS "CPU time",
  physical_reads                            AS "Disk reads",
  query_id                                  AS "Query ID",
  query
FROM hologres.hg_query_log
WHERE query_start > current_timestamp - INTERVAL '3 h'
  AND command_tag IN ('SELECT', 'INSERT', 'UPDATE', 'DELETE')
  AND duration > 1000
ORDER BY duration DESC,
         read_bytes DESC,
         shuffle_bytes DESC,
         memory_bytes DESC,
         cpu_time_ms DESC,
         physical_reads DESC
LIMIT 500;

Optimize the identified queries.

Check 4: Are new PQE queries appearing?

Check whether new PQE queries appeared around the time CPU usage increased:

-- Find queries that used PQE in the last 3 hours.
SELECT *
FROM hologres.hg_query_log
WHERE query_start > current_timestamp - INTERVAL '3 h'
  AND 'PQE' = ANY(engine_type)
ORDER BY query_start DESC
LIMIT 500;

If PQE queries exist, optimize the SQL operators to reduce or eliminate PQE usage. See Optimize query performance.

Check 5: Were bitmap or dictionary indexes recently modified?

Modifying bitmap or dictionary indexes on a table triggers an asynchronous background compaction process that consumes CPU. Storage usage typically increases first, then decreases as compaction finishes. Check for recent index changes:

-- Find index modification records in the last 3 hours.
SELECT *
FROM hologres.hg_query_log
WHERE query_start >= now() - INTERVAL '3 h'
  AND command_tag IN ('CALL')
ORDER BY query_start DESC
LIMIT 500;

How do I handle long-running queries?

The Running Query Duration metric shows queries that have been running longer than expected — for example, over 1 hour. Check the Active Query Tasks page to see what's currently running. For details, see Manage queries.

Long-running queries usually fall into one of four patterns:

Continuous write operations: Check the Real-time Write RPS metric to see whether ongoing bulk writes are holding resources.

Idle in transaction: A client has opened a transaction and executed a DDL operation but hasn't committed it. The query status appears as idle in transaction in pg_stat_activity. To find these:

-- Find long-running queries, including idle-in-transaction sessions.
SELECT current_timestamp - query_start AS runtime, datname::text, usename, query, pid::text
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY 1 DESC;

Close the transaction on the client, or set an idle transaction timeout. See Modify idle query timeout. To cancel a stuck query:

SELECT pg_cancel_backend(<pid>);

Complex SQL using PQE: Run EXPLAIN <sql> to check the execution plan. If the output contains External SQL (Postgres), the query is using PQE and may run slowly. Kill the query, then optimize the SQL to reduce PQE usage. See Best practices: internal table performance tuning.

Concurrent DDL causing lock contention: DDL operations lock tables. When multiple DDL statements run concurrently, they create lock contention and long waits. Check for in-progress DDL:

SELECT datname::text, usename, query, pid::text, state
FROM pg_stat_activity
WHERE state != 'idle';

Kill any blocking DDL operations to release locks, then execute DDL statements sequentially.

How do I troubleshoot failed queries?

The Failed Queries metric shows the number of failed queries per second. Don't use QPS alone to estimate the total number of failures — the total is the area under the curve (time range x QPS). Use the slow query log to get the exact count and failure reasons, then resolve based on the specific errors. See View and analyze slow query logs.

How do I resolve uneven CPU load among workers?

In Hologres, data is partitioned into shards. A worker processes one or more shards at a time, and each shard can be accessed by only one worker at a time. Uneven load typically comes from three causes:

Data skew: If one shard holds far more data than others, the worker assigned to it carries a disproportionate load. Check for skew:

SELECT hg_shard_id, count(1) FROM <table_name> GROUP BY hg_shard_id;

-- Example output showing skew: shard 39 holds significantly more rows than others.
-- hg_shard_id | count
-- -------------+--------
--           53 |  29130
--           65 |  28628
--           66 |  26970
--           70 |  28767
--           77 |  28753
--           24 |  30310
--           15 |  29550
--           39 | 164983

Handle the skewed data or set an appropriate distribution key. See Optimize query performance.

Shard count not a multiple of worker count: When the shard count in a table group is not an integer multiple of the total worker count, some workers get more shards than others. Set an appropriate shard count based on instance specifications. See Manage table groups and shard counts. This mainly affects large instances (more than 256 cores). For smaller instances, the default shard count is sufficient.

Uneven redistribution after a worker restart: When a worker is terminated due to an out-of-memory (OOM) error or other reasons, the system quickly migrates the shards of that worker to other workers to promptly recover queries. When the worker is restarted, the system reallocates some shards to it. If the instance load is low, this imbalance is harmless. If the load is high, restart the instance to redistribute shards evenly.