All Products
Search
Document Center

Hologres:Hologres metrics

Last Updated:Jan 30, 2024

This topic describes the metrics of Hologres. Metrics reflect resource usages of instances and the execution status of SQL statements. You can identify system exceptions and perform troubleshooting at the earliest opportunities based on metrics.

Hologres provides the following metrics:

Take note of the following items:

  • The Connections metric is not supported in Hologres V0.8.

  • If an instance of a version earlier than V0.9.27 contains more than 1,000 tables, the Real-time Import (RPS) metric may be lost or the metric value may be excessively low due to the large amount of data. In this case, data accuracy is affected.

Note

You can call API operations provided by CloudMonitor to query the metrics of Hologres instances. For more information, see Request methods.

Metric collection and reporting principle

On the Monitoring Information page for a Hologres instance in the Hologres console, you can view metrics about instance resources and query execution. You can view the resource usage of the instance and the execution status of SQL statements. This allows you to manage and monitor the instance at any time, identify system bottlenecks, and handle instance exceptions at the earliest opportunity. In September 2023, the underlying technology of the monitoring service in the Hologres console was upgraded. After the upgrade, metrics are reported based on the following rules:

  • Metrics of the current day are aggregated and reported at an interval of 1 minute.

  • Metrics of the previous 5 days are aggregated and reported at an interval of 5 minutes.

  • Metrics of the previous 10 days are aggregated and reported at an interval of 10 minutes.

  • Metrics of the previous 15 days are aggregated and reported at an interval of 15 minutes.

  • Metrics of the previous 30 days are aggregated and reported at an interval of 30 minutes.

CPU Utilization (%)

The CPU Utilization (%) metrics include CPU Utilization (%) - Instance and CPU Utilization (%) - Worker Nodes.

  • CPU Utilization (%) - Instance

    The CPU Utilization (%) - Instance metric indicates the overall CPU utilization of an instance, and reflects the computing load of the instance.

    • The daemon processes that are running in the background or the compaction task that is executed asynchronously may occupy the CPU of the instance. Therefore, a small amount of CPU utilization is normal when no query tasks are running in the instance.

    • Hologres supports multi-core parallel computing. In most cases, the CPU utilization during a single query can increase to 100%. This indicates that the computing resources are being fully utilized.

    • If the CPU utilization remains close to 100%, the instance is heavily loaded. For example, the CPU utilization remains 100% for 3 consecutive hours or remains 90% and higher for 12 consecutive hours. In this case, CPU resources have become a bottleneck of the instance. You must analyze specific business scenarios and queries to identify the cause of the high CPU utilization. You can identify the cause from the following aspects:

      • Check whether a large amount of offline data is imported by executing the INSERT statement, and whether the amount of data is still increasing.

      • Check whether data is queried or written with high queries per second (QPS) to exhaust the CPU resources.

      • Check whether hybrid loads exist in the preceding or other scenarios.

    • Scale the instance to accommodate more complex queries or larger data volumes.

  • CPU Utilization (%) - Worker Nodes

    The CPU Utilization (%) - Worker Nodes metric indicates the CPU utilization of each worker node in an instance, and reflects the computing load of each worker node. The number of worker nodes allocated to a Hologres instance varies based on the instance type. For more information about instance types, see Instance types.

    Note

    This metric is added in July 2022.

    • This metric is displayed only for Hologres instances of V1.1 and later.

    • If the CPU utilization of all worker nodes remains close to 100%, the instance is heavily loaded. You must optimize the resource usage or scale the instance based on the specific business scenario.

    • If the CPU utilization of some worker nodes is high and the CPU utilization of other worker nodes is low, the resources are unevenly allocated to the worker nodes. For more information about common causes for uneven resource allocation and troubleshooting methods, see Metric FAQ.

Memory Usage (%)

The Memory Usage (%) metrics include Memory Usage (%) - Instance and Memory Usage (%) - Worker Nodes.

  • Memory Usage (%) - Instance

    The Memory Usage (%) - Instance metric indicates the overall memory usage of an instance.

    • In Hologres, memory resources are reserved by backend processes. Even if no queries are in progress, metadata, indexes, and data cache of data tables are loaded to the memory to facilitate archiving and computing. Therefore, the memory usage is not zero. When no queries are in progress, a memory usage of 30% to 40% is normal.

    • However, if the memory usage continues to increase and remains close to 80%, memory resources may become a bottleneck of the instance, and the stability or performance of the instance may be negatively affected.

    • You can identify which one of metadata, caching, and computing occupies a large amount of memory resources based on your business scenario. For more information, see FAQ about OOM.

  • Memory Usage (%) - Worker Nodes

    The Memory Usage (%) - Worker Nodes metric indicates the memory usage of each worker node in an instance, and reflects the computing load of each worker node. The number of worker nodes allocated to a Hologres instance varies based on the instance type. For more information about instance types, see Instance types.

    Note

    This metric is added in July 2022.

    • This metric is displayed only for Hologres instances of V1.1 and later.

    • If the memory usage of all worker nodes remains close to 80%, the instance is heavily loaded. You must optimize the resource usage or scale out the instance based on business requirements.

    • If the memory usage of some worker nodes is high and the memory usage of other worker nodes is low, the resources are unevenly allocated to worker nodes. For more information about common causes for uneven resource allocation and troubleshooting methods, see Metric FAQ.

Instance Memory Distribution Usage (%)

The Instance Memory Distribution Usage (%) metrics provide memory usage data of the following parts in an instance: System, Metadata, Cache, Query, and Background. Hologres V2.0.15 and later support the Instance Memory Distribution Usage (%) metrics. The metrics provide memory distribution of an instance, based on which you can perform performance optimization.

  • System: indicates the memory that is occupied by system components during runtime. The system components include the HoloHub, gateways, and frontend (FE) nodes. FE nodes include FE master nodes and FE query nodes. The memory that is occupied by system components fluctuates when queries are performed.

  • Cache: indicates the memory that is occupied by the following cached data:

    • Cached SQL-related data: includes cached results and cached blocks. If an SQL query hits the cache, the query performance increases. For example, you can execute the EXPLAIN ANALYZE statement to obtain the value of the Physical read bytes field. If the value of this field is small, the SQL query hits a large amount of the cache. The amount of cached SQL-related data is limited.

    • Cached metadata: includes cached metadata of table schemas and cached metadata of files. The storage engine of Hologres loads metadata to the cache before SQL queries. This way, data is not directly accessed from disks, and query performance is improved.

    • The memory that is occupied by cached data is fixed. In most cases, about 30% of the total memory of an instance is occupied by cached data. If no queries are performed on an instance, cached metadata occupies the memory.

  • Metadata: indicates the memory that is occupied by metadata and files. The Lazy Open mode is used for storing metadata. In this mode, frequently accessed metadata is stored in memory, and metadata that is not accessed is not stored in memory. This helps reduce the memory that is occupied by metadata. In most cases, we recommend that less than 30% of the total memory of an instance be occupied by metadata. If metadata occupies a large amount of memory resources in an instance, a large number of files or partitioned tables in the instance are accessed. In this case, you can perform governance on tables in the instance based on table statistics. For more information, see Query and analyze table statistics.

  • Query: indicates the memory resources that are consumed when SQL statements are executed. The consumed memory resources are positively correlated with the complexity and concurrency of queries. Memory resources are consumed when SQL statements are executed by using fixed plans, the Hologres Query Engine (HQE), or the Seahawks Query Engine (SQE).

    • The memory that is available for SQL queries is elastic. A minimum of 20 GB of memory is required by a worker node. The maximum available memory varies based on the total available memory of the instance. If a large amount of memory is used by other parts, a small amount of memory is available for computing.

    • If queries on an instance occupy a large amount of memory or an out of memory (OOM) issue occurs on the instance, the queries are complex or have high concurrency. We recommend that you optimize the query performance or scale out the instance to prevent a high memory usage of the instance. For more information, see Optimize performance of queries on Hologres internal tables.

  • Background: indicates the memory resources that are occupied by background tasks, including compaction and flushing tasks. In most cases, background tasks occupy less than 5% of the total memory of an instance. If table indexes are modified, or a large amount of data is written or updated, the memory resources that are occupied by background tasks increase and will decrease after the tasks are complete.

  • Memtable: indicates the memory resources that are occupied by memory tables. In Hologres, a memory table is used to store table data that is written, updated, or deleted in real time. In most cases, memory tables occupy less than 5% of the total memory of an instance.

Storage Usage (bytes)

The Storage Usage (bytes) metric indicates the size of the logical disk that is used to store the data of all databases in an instance. 1 GiB = 1,024 MiB = 1,024 × 1,024 KiB. The storage usage of a Hologres instance can be continuously increased without limits.

  • After the storage quota of a subscription instance is used up, the excess storage is charged in pay-as-you-go mode. Excess storage usage does not affect the stability or use of the Hologres instance. We recommend that you update storage configurations or clean up unwanted data at the earliest opportunity after your storage quota is used up. This prevents unnecessary and more expensive storage costs.

  • You can execute the following SQL statements to query the storage usage of each database and internal tables.

    Note

    The unit of the execution results of the pg_relation_size and pg_database_size functions is byte. The unit of the execution results of the pg_size_pretty function is KB, MB, or GB. The unit is automatically converted based on the data volume. In the unit conversion standard of PostgreSQL, 1 GB = 1,024 MB = 1,024 × 1,024 KB.

    select pg_database_size('dbname'); -- Query the storage size of the database.
    select pg_relation_size('tablename'); -- Query the storage size of the table.
    
    -- Sort all tables by the storage size.
    select relname as "Table",
           pg_relation_size(relname)/1024/1024/1024 as "Size(GB)"
    from pg_tables
    where schemaname = 'public'
    order by pg_relation_size(relname)/1024/1024/1024;

Connections

The Connections metrics include Connections - Instance and Connections - FEs. The default maximum number of connections supported by an Hologres instance varies based on the instance type. For more information about instance types, see Instance types.

  • Connections - Instance

    The Connections - Instance metrics include Used Connections and Highest FE Connection Usage (%).

    • Used Connections: the total number of connections in an instance, including Java Database Connectivity (JDBC) and PostgreSQL connections in different states, such as active connections and idle connections.

    • Highest FE Connection Usage (%): the highest connection usage among FE nodes. Hologres collects the connection usage of each FE node and displays the highest connection usage by using the Max(frontend_connection_used_rate) function. This allows you to check whether the number of connections exceeds the maximum number supported by an FE node to prevent connection failures.

      Note

      This metric is added in July 2022.

      • The round robin policy is adopted among multiple FE nodes to establish physical connections to these FE nodes in sequence.

      • If the number of connections is large or even exceeds the maximum number of connections supported by your instance, or one of the following errors occurs, the number of connections in the Hologres instance has reached the upper limit. In this case, you must check whether a connection leak occurs. If a connection leak occurs, release connections by following instructions in Manage connections.

        FATAL: sorry, too many clients already connection limit exceeded for superusers 
        
        FATAL: remaining connection slots are reserved for non-replication superuser connections

  • Connections - FEs

    The Connections - FEs metric indicates the number of connections used by each FE node in an instance, including JDBC and PostgreSQL connections in different states, such as active connections and idle connections.

    Note

    This metric is added in July 2022.

    • This metric is displayed only for Hologres instances of V1.1 and later.

    • By default, an FE node supports a maximum of 128 connections. For more information, see Instance types.

    • If the number of connections used by each FE node in an instance remains close to the maximum number, a large number of connections exist in the instance. You can check whether the connections are idle or active based on your business scenario. For more information, see Manage connections. Then, release idle connections or scale the instance at the earliest opportunity based on your business scenario.

    • If the number of connections used by some worker nodes is high whereas the number of connections used by other worker nodes is low, the connections of the instance are unevenly distributed. You can find idle connections and release the idle connections at the earliest opportunity to mitigate the uneven connection load.

QPS

The QPS metric indicates the average number of SQL statements that are executed per second, including the SELECT, INSERT, UPDATE, and DELETE statements.

  • This metric is updated every 20 seconds. If only one SQL statement is executed within 20 seconds, the QPS of SQL statements is 0.05 (1/20 = 0.05).

  • For insert do update, one UPDATE statement is recorded. For insert do nothing, one INSERT statement is recorded.

Query Latency (ms)

The Query Latency (ms) metric indicates the average response time (RT) of SQL statements, including the SELECT, INSERT, UPDATE, and DELETE statements. You can query slow query logs to analyze the query latency. For more information, see Query and analyze slow query logs. For insert do update, one UPDATE statement is recorded. For insert do nothing, one INSERT statement is recorded.

Real-time Import (RPS)

The Real-time Import (RPS) metric indicates the number of entries that are imported or updated per second by using SQL statements or fixed plans. This metric is displayed only for primary instances.

The insert RPS indicates the number of entries that are imported per second by using foreign tables, COPY statements, or Hologres tables.

The update RPS indicates the number of entries that are updated or deleted per second by executing the UPDATE or DELETE statement. For insert do update, one UPDATE statement is recorded. For insert do nothing, one INSERT statement is recorded.

The SDK RPS indicates the number of entries that are imported or updated per second by using fixed plans. Data can be imported by using SDKs in the following ways:

I/O Throughput (Bytes)

The I/O Throughput (Bytes) metric indicates the I/O throughput of a Hologres instance. This metric describes the amount of data read from or written to the instance and reflects how busy the read or write operations are. You can understand the workloads on the instance based on the I/O Throughput (Bytes) metric and troubleshoot relevant issues in real time.

Ongoing Query Duration

The Ongoing Query Duration metric indicates the duration of an ongoing query in an instance. By default, this metric displays the duration of the ongoing query that is the longest at the current point in time.

Note

This metric is added in July 2022.

  • This metric is displayed only for Hologres instances of V1.1 and later.

  • Hologres is a distributed system. The numbers of worker nodes that are configured on an Hologres instance varies based on the instance type. Queries are randomly distributed to worker nodes. Hologres collects the durations of ongoing queries on each worker node and displays the duration of the ongoing query that is the longest at the current point in time. For example, three queries are being executed on different worker nodes at the current point in time, and the durations of these queries are 10 minutes, 5 minutes, and 30 seconds. In this case, the Ongoing Query Duration metric displays 10 minutes.

  • You can check whether the duration of the query is normal based on logs of active queries or slow queries. For more information, see Manage queries and Query and analyze slow query logs. Troubleshoot slow queries based on the preceding topics, and resolve deadlocks or stuck issues at the earliest opportunity.

Note

The metrics in the Hologres console are updated every 20 seconds. Therefore, the start time of a query on the X axis of the Ongoing Query Duration metric may not be the precise point in time when the query starts. Therefore, this metric is used only as a reference for troubleshooting exceptions. This metric allows you to easily find slow queries in an instance. As an auxiliary metric for self-O&M, this metric does not provide the accurate durations of ongoing queries.

Failed Queries per Second

The Failed Queries per Second metric indicates the average number of failed SQL statements per second in an instance, including the SELECT, INSERT, UPDATE, DELETE, and OTHER statements.

Note

This metric is added in July 2022.

  • This metric is updated every 20 seconds. If only a single SQL statement fails within 20 seconds, the failure rate of SQL statements is 0.05 (1/20 = 0.05).

  • You can view the failed queries in slow query logs and troubleshoot the failures to improve the availability of your Hologres instance. For more information, see Query and analyze slow query logs.

Primary/Secondary Latency (ms)

Hologres provides the multi-instance high-availability deployment mode. You can configure four read-only secondary instances for a primary instance. For more information, see Configure read/write splitting for primary and secondary instances (shared storage). The Primary/Secondary Latency (ms) metric indicates the latency of data synchronization from a primary instance to secondary instances. The latency is displayed in milliseconds.

Note

This metric is added in July 2022.

  • Only Hologres instances of V1.1 and later support primary/secondary deployment. This metric is displayed only for secondary instances. By default, this metric is not displayed for primary instances.

  • This metric is initially displayed as 0 ms after a secondary instance is associated with a primary instance. When data is written to the primary instance, this metric displays the fluctuating latency of data synchronization to the secondary instance.

  • In most cases, the synchronization latency between the primary and secondary instances is in milliseconds. The occasional latency jitters are usually caused by metadata modification operations, such as DDL operations, that are performed on the primary instance. These latency jitters can be ignored. If the latency is greater than one second in a long period of time, the possible cause is that the resource usage is high in the instances. You can check the resource usage such as CPU utilization and memory usage and scale the instances if necessary to reduce the latency. If an instance is restarting or updating, the synchronization latency may increase to minutes and will automatically recover after the restart or update is complete.

Metric FAQ

The metric FAQ provides answers to frequently asked questions about the metrics of Hologres to help you diagnose and troubleshoot issues. This improves your O&M capabilities. For more information, see Metric FAQ.