Frequently Asked Questions About Monitoring Alarms

Introduction

This article summarizes common monitoring and alarm issues encountered when using Hologres, to help you make better use of Hologres.

List of monitoring indicators

The Hologres management console provides instance resource management. On the monitoring page you can view the resource usage of the current instance and the execution of SQL statements, identify system errors in time, and handle instance exceptions.

The monitoring indicators report includes:
1. CPU usage (%): the overall CPU usage of the instance. Hologres reserves its computing resources, so even when no query is running there are background processes, and it is normal for CPU usage to be non-zero. At the same time, Hologres makes full use of multi-core parallel computing, so a single query can typically push CPU usage to 100% quickly, which indicates that computing resources are being fully utilized. However, CPU usage that stays at 100% for a long time indicates that the instance load is high.

2. Memory usage (%): the overall memory usage of the instance. Hologres reserves its memory resources: even when no query is running, some meta and index metadata is loaded into memory to speed up computation. With no queries running, memory usage of about 30% to 40% is therefore normal. If memory usage keeps rising, or even approaches 100%, the instance is under high load.
Note: Hologres is a distributed system; an instance has multiple nodes, and the memory upper limit of a node is 64 GB. Memory is usually divided into three parts: computation, metadata, and cache.

3. Instance storage usage (bytes): the logical disk size occupied by the data stored in the instance, which is the sum of the storage usage of all databases. If you purchased a subscription (monthly/annual) instance, usage beyond the purchased storage quota is automatically billed as pay-as-you-go; please expand storage in time once you exceed the purchased specification.

4. Number of connections (count): the total number of SQL connections in the instance, including JDBC and PSQL connections in both active and idle states. The default number of connections is usually related to the instance's specification; for details, see Instance Specifications Overview.

5. QPS (queries/second): the average number of SELECT, INSERT, UPDATE, and DELETE statements executed per second. This indicator is reported every 20 seconds, so if only one SQL statement is executed within a 20-second window, the QPS shown at that data point is 1/20 = 0.05.

6. Query latency (milliseconds): the average latency (i.e. response time) of executing SELECT, INSERT, UPDATE, and DELETE statements. You can view and analyze query latency through the slow query log.

7. Real-time import RPS (records/second): the number of data records imported or updated per second through SQL statements or the SDK.
1) Insert RPS: the import rate of batch imports through external tables, batch imports using the COPY statement, or inserts between Hologres tables.
2) Update RPS: the number of records updated or deleted per second by UPDATE or DELETE SQL statements.
3) SDK RPS: the number of records written or updated per second through the SDK. SDK write paths include:
•Writes via Flink.
•Reads and writes via Holo Client.
•Writes from Spark to Hologres.
•Real-time synchronization from DataHub to Hologres.
•Offline writes via Data Integration (DataX).
•Writes via the insert into values() statement through SQL or JDBC.

8. IO throughput (bytes/second): the volume of data read and written by the instance, reflecting how busy the instance is with reads and writes. From the I/O level you can understand the pressure and load changes on the instance and diagnose problems in time.
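The QPS reporting granularity described in item 5 can be sketched as follows. This is an illustration only, not Hologres code: the 20-second reporting interval is taken from the description above, and the helper function is hypothetical.

```python
# Illustrative only: how a count-per-window metric turns into the reported QPS value.
# The 20-second reporting interval comes from the description above; the function
# itself is a hypothetical helper, not part of any Hologres API.

REPORT_INTERVAL_SECONDS = 20

def reported_qps(statements_executed: int, interval: int = REPORT_INTERVAL_SECONDS) -> float:
    """Average statements per second over one reporting window."""
    return statements_executed / interval

# One statement in a 20-second window is reported as 0.05 QPS.
print(reported_qps(1))    # 0.05
print(reported_qps(100))  # 5.0
```

This is why a lightly used instance can legitimately show fractional QPS values such as 0.05.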

How to check the current connections, and how to kill connections when there are too many?

The number of connections is the total number of SQL connections in the instance, including JDBC/PSQL connections in both active and idle states, and is usually related to the instance's specification. If you find that the number of connections exceeds max_connections, or you encounter one of the following errors:

FATAL: sorry, too many clients already connection limit exceeded for superusers

FATAL: remaining connection slots are reserved for non-replication superuser connections

then the number of connections has reached the instance's upper limit, and you need to check the instance for connection leaks. In HoloWeb, on the active connection management page, you can kill connections that are idle or do not meet expectations. For details, see HoloWeb Visual Management Connections.
Note that a Superuser account can see all connections of the instance; other users can only see their own connections.
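As a sketch of how you might pick kill candidates when checking for connection leaks, the logic below filters connection rows by state and idle time. The row shape, idle threshold, and function are assumptions for illustration (e.g. rows as read from pg_stat_activity in a PostgreSQL-compatible system); this is not a Hologres API.

```python
# Illustrative sketch: given connection info (for example, rows read from
# pg_stat_activity in a PostgreSQL-compatible system), pick idle connections
# that have exceeded a threshold as candidates for a kill operation.
# The (pid, state, idle_seconds) row shape and threshold are assumptions.

from typing import List, Tuple

def idle_candidates(rows: List[Tuple[int, str, float]],
                    max_idle_seconds: float = 3600.0) -> List[int]:
    """Return pids of connections idle longer than max_idle_seconds."""
    return [pid for pid, state, idle_s in rows
            if state == "idle" and idle_s > max_idle_seconds]

conns = [
    (101, "active", 5.0),
    (102, "idle", 7200.0),   # idle for 2 hours -> kill candidate
    (103, "idle", 120.0),    # recently idle -> keep
]
print(idle_candidates(conns))  # [102]
```

In practice you would review the candidates before killing them, since some idle connections belong to legitimate connection pools.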

How to check slow queries, and how to reduce latency when it is too high?

You can view slow queries on the Historical Slow Query page of the HoloWeb Diagnosis and Optimization module. For details, see Slow Query Log Viewing and Analysis.
Common ways to reduce query latency are as follows:

•Schedule writes during off-peak query periods, or reduce write concurrency, to improve query efficiency. For external-table writes, you can use the following parameters to reduce concurrency.

-- Set the maximum concurrency for reading MaxCompute tables. The default is 128.
-- A smaller value is recommended (set this parameter first) so that a single query
-- does not affect other queries or overload the system and cause errors.
set hg_experimental_foreign_table_executor_max_dop = 32;

-- Set the number of rows read per batch from a MaxCompute table. The default is 8192.
set hg_experimental_query_batch_size = 1024;

-- Read MaxCompute ORC data directly.
set hg_experimental_enable_access_odps_orc_via_holo = on;

-- Set the split size for accessing MaxCompute tables, which controls read concurrency.
-- The default is 64MB; for large tables, increase it to avoid too many splits hurting performance.
set hg_experimental_foreign_table_split_size = 512MB;
•Optimize the query to make it more efficient and return results faster. See Performance Tuning.
•Upgrade the instance. For details, see Instance Upgrade.

How to see running queries?
You can view them on the Active Query page of the HoloWeb Diagnostics and Optimization module. For details, see View Active Query.

A query does not end for a long time: how to kill it and set a timeout?
You can view active queries on the Active Query page of the HoloWeb Diagnosis and Optimization module and kill a query there. You can also set the query timeout with SQL commands in the SQL editor module. For details, see Query Management.

Why is memory usage high?
The memory usage of a Hologres instance is the combined usage of all its parts. Hologres reserves its memory resources: even when no query is running, metadata of data tables, indexes, data caches, and so on is loaded into memory to speed up retrieval and computation, so non-zero memory usage is normal. In theory, with no queries running, memory usage of about 30% to 40% is normal.

In some cases memory usage keeps rising, even approaching 100%. The main reasons are:
•There are more and more tables and the total amount of data keeps growing, so the data scale far exceeds the current computing specification. Because memory usage is positively correlated with the amount of metadata and indexes, more tables, more data, and more indexes all lead to higher memory usage.
•The indexes are unreasonable. For example, a table has many columns, most of them TEXT, and too many Bitmap or Dictionary indexes are set. In this case, consider removing some Bitmap or Dictionary indexes; see ALTER TABLE for details.

When memory usage rises steadily and stays close to 100% for a long time, it usually means that memory has become the system's bottleneck, which can affect the stability and/or performance of the instance. The stability impact: when metadata grows too large and occupies the memory available to normal queries, errors such as SERVER_INTERNAL_ERROR, ERPC_ERROR_CONNECTION_CLOSED, and Total memory used by all existing queries exceeded memory limitation may occasionally be reported during queries. The performance impact: when metadata grows too large and over-occupies the cache space available to normal queries, cache hits decrease and query latency increases.

Therefore, when memory stays close to 100% for a long time, the following actions are suggested:
•Delete data that is no longer queried to release the memory occupied by metadata and the like.
•Set reasonable indexes. If Bitmap and Dictionary indexes are not used in your business scenario, you can remove them, but do not remove them blindly; analyze them against the business first.
•Upgrade the computing and storage resources of the instance, following these guidelines:
oCommon scenarios: if the latency of reading data from disk is tolerable and response-time requirements are not strict, 1 CU (1 core + 4 GB memory) can support 50-100 GB of data storage.
oServing scenarios with low-latency requirements: hot data should ideally be served from the in-memory cache. By default the cache accounts for about 30% of total memory, i.e. about 1.3 GB of 1 CU (1 core + 4 GB memory) is used for the data cache, and this cache is also shared by table metadata. For example, if the hot data is 100 GB, then ideally 100 GB should fit in the cache (in fact, after data is read and decompressed it occupies more than 100 GB of memory), so at least about 320 GB of memory is required, which corresponds to at least about 96 CU of computing resources.
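The sizing arithmetic above can be sketched roughly as follows. This is an estimate under the assumptions stated in the text (cache is ~30% of memory, 1 CU = 4 GB); the helper functions are hypothetical, and decompression overhead and spec-tier rounding push the real figure higher, toward the quoted 320 GB / 96 CU.

```python
import math

# Rough sizing sketch for the serving scenario described above.
# Assumptions (from the text): the data cache is ~30% of total memory,
# and 1 CU provides 4 GB of memory.

CACHE_SHARE = 0.3
GB_PER_CU = 4

def required_memory_gb(hot_data_gb: float) -> float:
    """Memory needed so that hot_data_gb fits in the ~30% cache share."""
    return hot_data_gb / CACHE_SHARE

def required_cu(hot_data_gb: float) -> int:
    """Round the memory requirement up to whole CUs."""
    return math.ceil(required_memory_gb(hot_data_gb) / GB_PER_CU)

# 100 GB of hot data -> roughly 333 GB of memory, i.e. on the order of the
# "at least about 320 GB / 96 CU" figure quoted above (before accounting
# for decompression overhead and rounding up to an available spec tier).
print(round(required_memory_gb(100)))  # 333
print(required_cu(100))                # 84
```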

Why does CPU usage easily reach 100%?
The indicator is the overall CPU usage of the instance. By design, Hologres makes full use of multi-core parallel computing, so a single query can generally push CPU usage to 100% quickly; this shows that computing resources are being fully utilized.
High CPU usage is not itself a problem. The problem is when queries or writes become slow while CPU is high; that requires comprehensive analysis.

How to solve CPU usage staying at 100% for a long time?
When the CPU usage of a Hologres instance stays close to 100% for a long time (for example, 100% for 3 consecutive hours, or 90% or more for 12 consecutive hours), the instance load is very high, which usually means CPU resources have become the system's bottleneck. You need to analyze the specific business scenarios and queries to determine the cause.
It can be checked from the following aspects:
•Check whether there is a relatively large offline data import (INSERT) whose data size is still growing. Consider reducing write concurrency or staggering writes so they do not affect queries, or use the read/write-splitting high-availability deployment mode of Hologres V1.1.
•Check whether high-QPS queries or writes are consuming the CPU. Consider reducing write concurrency to let queries complete first.
•Check whether there is an active query over a large amount of data. On the active query page of HoloWeb you can check whether a query has not ended; if it does not meet business expectations, you can kill it. For details, see Query Management.
•If the usage is confirmed to be required by the business and CPU resources are exhausted, upgrade the instance to handle more complex queries or larger data volumes. For details, see Instance Upgrade.

Cloud Monitoring Alarm Configuration Best Practices
Hologres is integrated with Cloud Monitoring, so through Cloud Monitoring you can get a comprehensive view of the resource usage, business operation, and health status of Hologres instances, receive alerts on anomalies and respond in time, and keep applications running smoothly. For detailed configuration and usage, see the Cloud Monitoring documentation.

How to set appropriate alarms for your business, so that unimportant events alert infrequently while important abnormal events are sensed in a timely manner, depends on your specific business situation and the position of Hologres in your overall architecture; set them according to your situation. However, there are some general indicators for which it is recommended to set appropriate alarms, as follows:

•CPU water level
The CPU water level reflects whether Hologres resources have hit a bottleneck and whether your resources are fully utilized. The core logic of a CPU water-level alarm is a rule that detects the CPU staying continuously at 100%: if that happens, the resources of the current instance size are fully used, it will be hard to support growth in data volume, query volume, or computation, and measures such as scaling up need to be considered. Suggested alarm rules:

oIt is not recommended to alarm as soon as CPU usage reaches 100%. If the CPU water level reaches 100% for a while and then returns to a medium or low level, it usually just means a large computation was running at the time, such as a large data import or a large query. Alerting on every instance of 100% CPU would therefore generate many false alarms.
oSet an alarm rule similar to "CPU water level >= 99%, Info, 3 consecutive times (5-minute intervals)", i.e. if CPU usage is at 100% three times within 15 minutes, an Info-level alarm is generated. This rule effectively avoids false alarms. Depending on your situation, you can reduce the interval or increase the number of detections.
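The "N consecutive breaches" logic behind such a rule can be sketched generically as follows. The threshold and sample values are illustrative; this is not a Cloud Monitoring API, just the evaluation idea.

```python
# Generic "alarm after N consecutive breaches" check, as in the rule
# "CPU water level >= 99% for 3 consecutive 5-minute samples".
# Threshold and sample values are illustrative.

from typing import Sequence

def breached_n_consecutive(samples: Sequence[float],
                           threshold: float,
                           n: int) -> bool:
    """True if any run of n consecutive samples are all >= threshold."""
    run = 0
    for value in samples:
        run = run + 1 if value >= threshold else 0
        if run >= n:
            return True
    return False

# Isolated 100% spikes do not alarm; a sustained run does.
spike = [100.0, 45.0, 100.0, 30.0, 100.0, 20.0]
sustained = [40.0, 100.0, 99.5, 100.0, 60.0]
print(breached_n_consecutive(spike, 99.0, 3))      # False
print(breached_n_consecutive(sustained, 99.0, 3))  # True
```

The same evaluator shape applies to the memory and connection-usage rules below, with different thresholds and severities.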

•Memory water level
The logic of the memory water level is roughly the same as that of the CPU water level. Unlike the CPU water level, which is almost 0 when no computation is running, the memory water level is not 0 without computation and can even be relatively high. Hologres divides memory into three parts: memory reserved for computation, about 30%, which is almost 0 when nothing is running; the data cache, about 30%, which can greatly reduce I/O reads and improve SQL computing efficiency; and the metadata cache of all tables and indexes in the instance, plus table handles and buffers in memory, about 30%. Suggested alarm rules:
oDo not set the memory water-level threshold too low.
When nothing is running, only the first ~30% is guaranteed to be free, and 60% or more of memory may be occupied, so the memory water level is not 0 and may even be high. Usage can exceed 60% because metadata is loaded into memory first: with a lot of metadata, plus the ~30% data cache, memory usage can exceed 60% with no computation running.
oSet an alarm rule similar to "Memory water level >= 90%, Warning, 3 consecutive times (5-minute intervals)", i.e. if memory usage exceeds 90% three times within 15 minutes, a Warning-level alarm is generated.

•Connection usage
The purpose of alarming on connection usage is to prevent the number of connections from filling up, which would cause connection failures and affect the service. Generally, staying below 95% of the total number of connections is considered safe, so you can set the alarm rule as: connection usage >= 95%, Info, 3 consecutive times.

•P99 latency
P99 latency is the time within which 99% of queries complete, measured from SQL submission to result return. An abnormal increase in response time indicates a possible trend toward slow queries, which may affect downstream business processing. There is no universal standard for average response time; it must be determined from your actual query patterns. Compared with the average, the P99 latency has a concrete meaning: 99% of queries run within that time.

For example, if you run a continuously serving application whose normal latency is within 1 s, then a response time suddenly reaching 5 s or 10 s is abnormal and should trigger an alarm. If you run an analytical application, query sizes and response times vary with the needs of different business users, so set the threshold according to your own scenario. Also note that when the instance load is high, SQL running times can fluctuate widely and alarms are then easily generated; the remedy is to adjust the threshold appropriately, or upgrade the instance to keep the load stable.
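To make the P99 notion concrete, here is a minimal sketch of computing a percentile over a window of observed latencies. The nearest-rank method and the sample values are assumptions for illustration; monitoring systems may use different percentile estimators.

```python
import math
from typing import Sequence

# How a P99 latency figure can be computed from a window of observed
# query latencies (milliseconds), using the nearest-rank method.

def percentile(latencies_ms: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the value below which p% of samples fall."""
    data = sorted(latencies_ms)
    rank = max(1, math.ceil(p / 100 * len(data)))
    return data[rank - 1]

# 100 queries: 99 fast ones around 100 ms and one slow 5000 ms outlier.
window = [100.0] * 99 + [5000.0]
print(percentile(window, 99))  # 100.0 -> P99 ignores a single outlier
print(percentile(window, 50))  # 100.0
print(max(window))             # 5000.0 -> the max still reveals it
```

This also shows why P99 is more informative than the average for alerting: a single extreme outlier barely moves P99, while a genuine slow-query trend shifts it clearly.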

•Query QPS
When SQL statements run continuously in your business, there is usually a continuous, stable QPS (Query Per Second) indicator. If QPS suddenly drops to a very low level or to 0 and the business was not deliberately stopped, it may indicate a system anomaly. If the business QPS is stable at some value A, it is recommended to set the alarm threshold to 80%, 50%, or 10% of A, i.e. an alarm rule such as: QPS <= A*0.8, Warning, 3 consecutive times (5-minute intervals).
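The "drop below a fraction of baseline" rule can be sketched as follows. The baseline, fraction, and sample values are illustrative, not real thresholds.

```python
# Sketch of the "QPS drops below a fraction of its baseline" rule above.
# Baseline, fraction, and samples are illustrative values.

from typing import Sequence

def qps_drop_alarm(samples: Sequence[float], baseline: float,
                   fraction: float = 0.8, n: int = 3) -> bool:
    """True if n consecutive samples fall to or below baseline * fraction."""
    limit = baseline * fraction
    run = 0
    for qps in samples:
        run = run + 1 if qps <= limit else 0
        if run >= n:
            return True
    return False

# Baseline A = 100 QPS, so the limit is 80. A brief dip does not alarm;
# a sustained drop (including a drop to 0) does.
print(qps_drop_alarm([95, 70, 96, 98], baseline=100))     # False
print(qps_drop_alarm([95, 60, 55, 0, 92], baseline=100))  # True
```

The same shape applies to the real-time write RPS alarm below, with the record rate in place of QPS.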

•Real-time write RPS
When external data is continuously imported in your business (through Flink, Data Integration, etc.), there is usually a continuous real-time import RPS (Record Per Second) metric. If RPS suddenly drops to a very low level or to 0 and the job was not deliberately stopped, it may indicate a system anomaly. For example, set the alarm as: RPS (cmdType = sdk) <= 10, Warning, 1 time.
