This topic provides answers to frequently asked questions about the metrics of Hologres.

How do I view and close connections if the value of the Connections metric is large?

The Connections metric indicates the total number of SQL connections to a Hologres instance, including Java Database Connectivity (JDBC) connections to PostgreSQL databases in the active and idle states. The number of connections to an instance depends on the instance type. If the number of connections to an instance is large or exceeds the maximum number of connections allowed for the instance, or if one of the following errors is reported, you must check whether a connection leak occurs:
  • FATAL: sorry, too many clients already connection limit exceeded for superusers
  • FATAL: remaining connection slots are reserved for non-replication superuser connections
These errors indicate that the number of connections to an instance reaches the upper limit. You can obtain information about the current connections in the HoloWeb console or by executing SQL statements. For more information, see Manage connections. You can close unexpected or idle connections with a superuser account.

The latency of queries is high. How do I troubleshoot the issue?

High latency of queries may occur due to the following possible causes. You can find the corresponding slow SQL queries by querying slow query logs. For more information, see Query and analyze slow query logs.
  • Cause 1: The queries per second (QPS) is low, but the SQL statements are complex.

    Solution: Optimize the SQL statements and set proper indexes to improve query performance. For more information, see Optimize the performance of internal table queries and Optimize the performance of querying MaxCompute tables in Hologres.

  • Cause 2: The QPS is high.

    Solution: If you require a higher QPS and lower latency after the SQL statements are optimized, scale out the instance to obtain better performance. For more information, see Upgrade an instance.

  • Cause 3: A large amount of data is written during queries.
    Solutions:
    • Write data during off-peak hours to reduce the impact on queries.
    • Reduce the number of concurrent write operations to improve query efficiency. If you write data by using a foreign table, you can execute the following statements to decrease the concurrency:
      -- Set the maximum number of queries that you want to perform at a time in MaxCompute. Default value: 128. To prevent one query from affecting other queries or prevent system unavailability, we recommend that you set this parameter to a small value. 
      set hg_experimental_foreign_table_executor_max_dop = 32; -- We recommend that you set this parameter to a small value to decrease the concurrency.
      
      -- Set the number of data entries that you want to read at a time from a MaxCompute table. Default value: 8192. 
      set hg_experimental_query_batch_size = 1024;
      
      -- Read data from the ORC table.
      set hg_experimental_enable_access_odps_orc_via_holo = on;
      
      -- Set the size of each shard of a MaxCompute table. Default value: 64. Unit: MB. The shard size affects the concurrency. If a table is large in size, you can increase the value of this parameter to prevent excessive shards from deteriorating query performance. 
      set hg_experimental_foreign_table_split_size = 512MB;

What are the causes of high memory usage? How do I troubleshoot the issue?

The Memory Usage metric indicates the memory usage of an instance. In Hologres, memory resources are reserved by backend processes. Even if no queries are in progress, metadata, indexes, and data cache of data tables are loaded to the memory to accelerate data searching and computing. Therefore, the memory usage is not zero. If no queries are in progress, a memory usage of 30% to 40% is normal.

In some cases, memory usage may continue to rise or even approach 80%. The issue of high memory usage may occur due to the following possible causes:
  • The computing capacity of the instance becomes overwhelmed as the number of tables and the total size of data continue to grow. The memory usage proportionally increases with the size of metadata and the number of indexes. As the number of tables and indexes and the data size increase, the memory usage grows.
  • The index settings are not suitable. For example, a large number of bitmap indexes are created, or dictionary encoding is enabled for a table that contains a large number of columns of the TEXT data type. In this case, you can modify the bitmap index or the dictionary encoding property of the table. For more information, see ALTER TABLE.

If the memory usage continues to rise and remains close to 80%, memory resources may become a bottleneck for the instance, and the stability or the performance of the instance may be affected. For example, if a large amount of metadata occupies the memory space available for queries, errors such as SERVER_INTERNAL_ERROR, ERPC_ERROR_CONNECTION_CLOSED, and Total memory used by all existing queries exceeded memory limitation may occur during queries. If a large amount of metadata occupies the cache space available for queries, cache hits decrease, and the query latency increases.

If the memory usage remains close to 80%, we recommend that you perform the following operations:
  • Delete the data that is no longer queried to release the memory space that is occupied by data such as metadata.
  • Set proper indexes. You can delete unnecessary bitmap indexes or disable dictionary encoding in specific business scenarios.
  • Upgrade the specifications of the instance to increase computing and storage resources of the instance. We recommend that you upgrade an instance based on specific scenarios.
    • In scenarios where disk data can be read at a specific latency and the response time (RT) is not strict, we recommend that you select a proper instance type based on the size of your data. One compute unit (CU) that includes 1 CPU core and 4 GB of memory can support data storage of 50 GB to 100 GB.
    • In serving scenarios that require a short RT, we recommend that you cache all hot data in the memory. By default, the cache accounts for 30% of the total memory. In such scenarios, 1.3 GB of memory out of 1 CU is used to cache data, and metadata of tables is stored in the cache. For example, in a scenario that requires a short RT, 100 GB of hot data needs to be cached in the memory. After the data is read and decompressed, it takes up more than 100 GB of memory. In this case, at least 320 GB of memory is required, which corresponds to at least 96 CUs.

Why is the CPU utilization as high as 100% even if only one query is in progress?

The CPU Utilization metric indicates the CPU utilization of an instance. Hologres supports multi-core parallel computing. In most cases, the CPU utilization during a single query can increase to 100%. It indicates that computing resources are being fully utilized. High CPU utilization is not an issue. If the CPU utilization is high and data queries and writes are slow, this is an issue. You must comprehensively analyze the issue.

How do I troubleshoot the issue of slow writes?

If the execution of INSERT, INSERT ON CONFLICT, or UPDATE statements is time-consuming, the write performance is poor. Generally, this issue occurs because fixed plans are not used to execute the SQL statements and tables are locked. If queries are concurrently executed, lock waits occur. As a result, the execution consumes more time. The Real-time Import (RPS) metric shows the records per second (RPS) for the records that are imported or updated by using INSERT statements. You can check the features of the queries and optimize the SQL statements executed in the queries by using fixed plans. In this case, the Real-time Import (RPS) metric shows the RPS for the records that are imported or updated by using SDKs. This helps improve write performance. For more information, see Accelerate the execution of SQL statements by using fixed plans.

What do I do if the CPU utilization remains close to 100%?

If the CPU utilization of a Hologres instance remains close to 100%, the instance is under heavy load. For example, the CPU utilization remains 100% for 3 consecutive hours or remains more than 90% for 12 consecutive hours. In this case, CPU resources become a bottleneck for the instance. You must analyze specific business scenarios and data query statements to identify the cause of the high CPU utilization. You can troubleshoot this issue based on the following possible causes:
  • Cause 1: The QPS or the RPS significantly increases.

    Compare the QPS and RPS metrics before and after the issue of the high CPU utilization occurs. If the QPS or the RPS significantly increases, this is the cause of the high CPU utilization.

    Solutions:
    • If the execution of SELECT statements for querying data causes the issue of the high CPU utilization, you can view the corresponding slow queries by querying slow query logs and accordingly optimize the queries.
    • If the execution of INSERT, UPDATE, or DELETE statements causes the issue of the high CPU utilization, we recommend that you query slow query logs by executing the following statement to check whether the INSERT, UPDATE, or Delete statements use fixed plans. If fixed plans are not used to execute the INSERT, UPDATE, or DELETE statements, tables are locked. If queries are concurrently executed, lock waits occur. To prevent table locks and decrease the CPU utilization, you can check whether the SQL statements can be optimized by using fixed plans based on your business requirements.
      -- Example: Query the INSERT, UPDATE, or DELETE statements that are not executed by using fixed plans in the past hour.
      select *
      from hologres.hg_query_log
      where query_start >= now() - interval '3 h'
          and command_tag in ('INSERT','UPDATE','DELETE')
          and 'HQE'=ANY(engine_type)
      order by query_start desc limit 500;
    • If SQL statements are optimized, but the CPU utilization is still high, the instance resources reach a bottleneck. You can scale out the instance or deploy shared storage on multiple instances to enable read/write splitting. For more information, see Upgrade an instance or Configure multi-instance high-availability deployment.
  • Cause 2: The QPS or the RPS does not significantly increase, but long-running queries exist.
    The metric data shows that the QPS or the RPS does not significantly increase. However, the CPU utilization suddenly increases, and the issue of the high CPU utilization persists for a period of time. You can view the Ongoing Query Duration metric to check long-running queries. If the metric data shows that queries that last for more than half an hour or 1 hour exist, these queries cause the issue of the high CPU utilization. You can execute the following statements to view the long-running queries by querying the active queries and terminate the queries. This decreases the CPU utilization.
    -- Query the long-running queries.
    SELECT current_timestamp - query_start as runtime, datname::text, usename, query, pid::text
        FROM pg_stat_activity
        WHERE state != 'idle'
        order by 1 desc;
    
    -- Terminate the long-running queries.
    select pg_cancel_backend(<pid>);
  • Cause 3: The QPS or the RPS does not significantly increase, but queries that consume a large number of CPU resources exist.
    The metric data shows that the QPS or the RPS does not significantly increase. However, the CPU utilization suddenly increases, and the issue of the high CPU utilization persists for a period of time. You can execute the following statement to view the queries with the high CPU utilization by querying slow query logs, and then optimize the SQL statements used for data queries.
    -- Query the slow queries that consume a large number of CPU resources in the past 3 hours.
    select status as "Status",
    duration as "Time consumed (ms)",
    query_start as "Started at",
    (read_bytes/1048576)::text || ' MB' as "Bytes",
    (memory_bytes/1048576)::text || ' MB' as "Memory",
    (shuffle_bytes/1048576)::text || ' MB' as "Shuffle",
    (cpu_time_ms/1000)::text || ' s' as "CPU time",
    physical_reads as "Physical reads",
    query_id as "QueryID",
    query
    from hologres.hg_query_log
    where query_start > current_timestamp - interval'3 h'
    and command_tag in ('SELECT','INSERT','UPDATE','DELETE')
    and duration > 1000
    order by duration desc,
    read_bytes desc,
    shuffle_bytes desc,
    memory_bytes desc,
    cpu_time_ms desc,
    physical_reads desc
    limit 500;
  • Cause 4: The CPU utilization increases to 100% due to the SQL statements executed in PostgreSQL Query Engine (PQE).
    If the metric data shows that the QPS or the RPS does not significantly increase, you can execute the following SQL statement to check whether new SQL statements are executed in PQE by querying slow query logs. If SQL statements are executed in PQE, these statements cause the issue of the high CPU utilization. If SQL statements are executed in PQE, you must optimize the SQL operators that are executed in PQE. For more information, see Optimize the performance of internal table queries.
    -- Query the queries that are executed in PQE in the past 3 hours.
    select *
    from hologres.hg_query_log
    where query_start > current_timestamp - interval'3 h'
        and 'PQE'=ANY(engine_type)
    order by query_start desc limit 500;
  • Cause 5: The bitmap index or the dictionary encoding property of tables is modified.
    If the bitmap index or the dictionary encoding property of tables is modified, a compaction operation is asynchronously performed in the backend. In this case, CPU resources are consumed, and the storage capacity of the instance may increase and then decrease. You can execute the following SQL statement to check whether the bitmap index or the dictionary encoding property of tables is modified by querying slow query logs.
    -- Query the records of which the bitmap index or the dictionary encoding property is modified in the past 3 hours.
    select *
    from hologres.hg_query_log
    where query_start >= now() - interval '3 h'
    and command_tag in ('CALL')
    order by query_start desc limit 500;

How do I troubleshoot the issue of long-running queries?

The Ongoing Query Duration metric indicates the duration of the ongoing query that lasts for a long period of time. For example, if an ongoing query lasts more than 1 hour, the query is considered as a long-running query. If a long-running query exists, you can view the query on the Active Query page. For more information, see Use HoloWeb to view and manage active queries in a visualized way. Long-running queries may occur due to the following causes:
  • Cause 1: Long-running write operations exist.

    Solution: Monitor the Real-time Import (RPS) metric to check whether long-running write operations exist.

  • Cause 2: Some queries are in the idle in transaction state.
    • If the client starts a transaction but does not commit the transaction after data definition language (DDL) statements are executed, the corresponding queries enter the idle in transaction state. You can execute the following SQL statement to query active queries that are in the idle in transaction state and have been running for a long period of time:
    • Queries are running for a long period of time due to lock waits.

    Solution: Execute the following sample SQL statement to view the long-running queries. If the long-running queries are in the idle in transaction state, you can terminate the transactions on the client or set a proper timeout period for idle transactions. For more information, see Change the timeout period of idle queries.

  • Cause 3: Some queries include complex SQL statements that are executed in PQE.
    Solutions: Execute the following SQL statement to query active queries that have been running for a long period of time. Then, execute the EXPLAIN statement to check the execution plans of the queries. If the execution plans contain External SQL (Postgres), the queries include SQL statements that are executed in PQE.
    -- Query the long-running queries.
    SELECT current_timestamp - query_start as runtime, datname::text, usename, query, pid::text
        FROM pg_stat_activity
        WHERE state != 'idle'
        order by 1 desc;
    
    -- Query the execution plan of the queries.
    explain sql
  • Cause 4: Lock conflicts occur due to concurrent DDL operations.

    If multiple DDL statements are executed at the same time, tables are locked. As a result, lock conflicts occur, and the DDL statements must wait in a queue for the lock to become available.

    Solutions:
    • Execute the following SQL statement to check whether DDL statements are being executed. Terminate the DDL statements that are being executed to release the lock.
      SELECT datname::text,usename,query,pid::text,state
         FROM pg_stat_activity
         WHERE state != 'idle' ;
    • Execute the DDL statements one by one.

How do I troubleshoot the issue of failed queries?

The Failed Queries per Second metric indicates the average number of failed queries per second. The total number of failed queries in a duration measured in seconds is calculated by multiplying the value of the Failed Queries per Second metric by the duration. We recommend that you do not determine the total number of failed queries only based on the Failed Queries per Second metric. You can check the total number of failed queries and failure causes by querying slow query logs. Then, you can resolve the issue based on the error messages. For more information, see Query and analyze slow query logs.

How do I troubleshoot the issue of unbalanced CPU load among workers?

In Hologres, data is distributed among shards. A worker may access the data of one or more shards during computing. In each instance, a shard can be accessed only by one worker at a time. If the total number of shards accessed by each worker varies, loads on workers may be unbalanced. This issue may occur due to the following possible causes:
  • Cause 1: Data skew exists.

    If severe data skew occurs, a worker accesses a fixed shard. This causes unbalanced CPU load among workers.

    Solution: Execute the following SQL statement to check whether data skew exists. In the following sample result, the count value of a shard is much larger than that of other shards. It indicates that data skew exists. If data skew exists, you can process the unevenly distributed data or set a proper distribution key based on your business requirements. For more information, see Optimize the performance of internal table queries.
    select hg_shard_id,count(1) from <table_name> group by hg_shard_id;
    
    -- Sample result: The count value of shard 39 is larger than that of other shards. It indicates that data skew exists.
    hg_shard_id | count
    -------------+--------
              53 |  29130
              65 |  28628
              66 |  26970
              70 |  28767
              77 |  28753
              24 |  30310
              15 |  29550
              39 | 164983
  • Cause 2: The shard count is not a multiple of the number of workers in the instance.

    If the shard count in a table group is not a multiple of the number of workers, the shard count allocated to each worker varies. This causes the load imbalance among workers.

    Solution: Set the shard count based on the instance type. For more information, see User guide of table groups and shard counts. In most cases, this error occurs in an instance that has more than 256 CPU cores. For instances with smaller specifications, you can use the default shard count.

  • Cause 3: Shards are unevenly allocated to workers after a worker fails over.

    If a worker is terminated due to reasons such as out of memory (OOM), the system allocates the corresponding shards to other workers to recover queries. After the terminated worker is recovered, the system allocates some shards to this worker. This causes uneven allocation of shards among workers.

    Solution: If the instance load is low, ignore the issue of the load imbalance. If the instance load is high, restart the instance to evenly reallocate shards.