Alibaba Cloud Elasticsearch provides multiple basic monitoring metrics (such as cluster status, cluster query QPS, node CPU utilization, node disk usage) for running clusters to monitor their operational status. You can use these metrics to understand cluster operational status in real time, handle potential risks promptly, and ensure stable cluster operation. This topic describes how to view cluster monitoring details and provides the meanings of various monitoring metrics, exception causes, and exception handling suggestions.
Differences from other monitoring features
The cluster monitoring feature provided by Alibaba Cloud Elasticsearch may differ from the monitoring feature provided by Kibana or third-party services in the following aspects:
Sampling period differences: The cluster monitoring feature uses a sampling period that differs from that of Kibana or third-party monitoring. As a result, the collected data and the displayed values may differ.
Query algorithm differences: Both Alibaba Cloud Elasticsearch cluster monitoring and Kibana monitoring are affected by cluster stability when collecting data. The QPS metric in cluster monitoring may show sudden increases, negative values, or no monitoring data due to cluster jitter, while Kibana monitoring may show empty values.
Note: The cluster monitoring feature provides more metrics than the monitoring feature of Kibana. We recommend that you use both features together based on your business scenarios.
Collection interface differences: Kibana monitoring metrics depend on the Elasticsearch API, while some node-level metrics in cluster monitoring (such as CPU utilization, load_1m, disk usage) call the underlying system interfaces of Alibaba Cloud Elasticsearch. Therefore, the monitoring includes not only the Elasticsearch process but also the usage of system-level resources.
View cluster monitoring data
Log on to the Alibaba Cloud Elasticsearch console.
In the left-side navigation pane, click Elasticsearch Clusters.
Navigate to the desired cluster.
In the top navigation bar, select the resource group to which the cluster belongs and the region where the cluster resides.
On the Elasticsearch Clusters page, find the cluster and click its ID.
In the left-side navigation pane, choose .
View monitoring details.
View Infrastructure Monitoring details
On the Infrastructure Monitoring tab, select a Group category and a monitoring period as needed to view the monitoring details of resources in the corresponding category during the specified period.
Note: Click Custom to view monitoring details within a custom time period.
The monitoring and alerting feature for Elasticsearch instances is enabled by default. Therefore, you can view historical monitoring data on the Cluster Monitoring page. You can view monitoring data by minute, and monitoring data is retained only for 30 days.
For more information about infrastructure monitoring metrics, see Overview of infrastructure monitoring metrics.
Overview of infrastructure monitoring metrics
The following tables describe the categories of infrastructure monitoring metrics for clusters and provide an overview of each metric.
Overview
Metric | Description |
ClusterStatus(value) | Indicates the health status of a cluster. A value of 0.00 indicates that the cluster is normal. |
Snapshot status (value) | Indicates the snapshot status of the Auto Snapshot feature in the Elasticsearch console. A value of 0 indicates that snapshots are created. |
Total number of nodes in the cluster (count) | Indicates the total number of nodes in a cluster. |
Total number of disconnected nodes in the cluster (count) | Indicates the total number of disconnected nodes in a cluster. |
Cluster index count (count) | Indicates the number of indexes in a cluster. |
Cluster shard count (count) | Indicates the number of shards in a cluster. |
Cluster primary shard count (count) | Indicates the number of primary shards in a cluster. |
Number of slow queries in the cluster (count) | Indicates the number of slow queries in a cluster. |
Cluster write QPS (count/s) | Indicates the number of documents written to a cluster per second. |
Cluster query QPS (count/s) | Indicates the number of queries executed per second in a cluster. The query QPS is related to the number of primary shards in the index to be queried. |
Node CPU utilization_ES service (%) | Indicates the CPU utilization of each node in a cluster. |
Node heap memory usage_ES service (%) | Indicates the heap memory usage of each node in a cluster. |
Node disk usage (%) | Indicates the disk usage of each node in a cluster. We recommend that you set the threshold to a value less than 75%. |
Node Load_1m (value) | Indicates the load of each node in a cluster within 1 minute. |
Node network bandwidth_input (KiB/s) | Indicates the inbound traffic rate of each node in a cluster. The monitoring cycle of the metric is 1 minute. Unit: KiB/s. |
Node network bandwidth_output (KiB/s) | Indicates the outbound traffic rate of each node in a cluster. The monitoring cycle of the metric is 1 minute. Unit: KiB/s. |
Node network packets_input (count) | Indicates the number of inbound packets for each node in a cluster. The monitoring cycle of the metric is 1 minute. |
Node network packets_output (count) | Indicates the number of outbound packets for each node in a cluster. The monitoring cycle of the metric is 1 minute. |
Node TCP connections (count) | Indicates the number of TCP connection requests received by each node in a cluster from clients. |
IOUtil (%) | Indicates the I/O usage of each node in a cluster. |
Disk bandwidth_read (MiB/s) | Indicates the amount of data read from each node in a cluster per second. |
Disk bandwidth_write (MiB/s) | Indicates the amount of data written to each node in a cluster per second. |
Disk IOPS_read (count) | Indicates the number of read requests completed per second on each node in a cluster. |
Disk IOPS_write (count) | Indicates the number of write requests completed per second on each node in a cluster. |
Cluster metrics
Metric | Description |
ClusterStatus(value) | Indicates the health status of a cluster. A value of 0.00 indicates that the cluster is normal. |
Total number of nodes in the cluster (count) | Indicates the total number of nodes in a cluster. |
Total number of disconnected nodes in the cluster (count) | Indicates the total number of disconnected nodes in a cluster. |
Cluster index count (count) | Indicates the number of indexes in a cluster. |
Cluster shard count (count) | Indicates the number of shards in a cluster. |
Cluster primary shard count (count) | Indicates the number of primary shards in a cluster. |
Number of slow queries in the cluster (count) | Indicates the number of slow queries in a cluster. |
Cluster slow query time distribution | This metric is based on the index.search.slowlog.query and index.search.slowlog.fetch entries in the slow query log. It aggregates the time taken (took_millis) by slow queries and displays the distribution in intervals of 1 second. |
Snapshot status (value) | Indicates the snapshot status of the Auto Snapshot feature in the Elasticsearch console. A value of 0 indicates that snapshots are created. |
Cluster write QPS (count/s) | Indicates the number of documents written to a cluster per second. |
Cluster query QPS (count/s) | Indicates the number of queries executed per second in a cluster. The query QPS is related to the number of primary shards in the index to be queried. |
Cluster Fielddata memory usage (B) | Indicates the memory usage of Fielddata in a cluster. A higher monitoring curve indicates that a larger amount of Fielddata is cached in the heap memory. Excessive Fielddata memory usage triggers Fielddata circuit breaking, which affects cluster stability. |
Index metrics
Metric | Description |
Index Bulk write TPS (count/s) | Indicates the number of bulk requests per second for an index. |
Index query QPS (count/s) | Indicates the number of queries executed per second for an index. The query QPS is related to the number of primary shards in the index to be queried. |
Index end-to-end query latency_max (ms) | Indicates the maximum time taken by a query request on an index. Unit: milliseconds. |
Node resource metrics
Metric | Description |
Node CPU utilization_ES service (%) | Indicates the CPU utilization of each node in a cluster. If the CPU utilization is high or close to 100%, the services that run on the cluster are affected. |
Node heap memory usage_ES service (%) | Indicates the heap memory usage of each node in a cluster. If the heap memory usage is high or large objects are stored in the memory, the services that run on the cluster are affected and GC operations are automatically triggered. |
Node disk usage (%) | Indicates the disk usage of each node in a cluster. We recommend that you set the threshold to a value less than 75%. |
Node memory usage_total (%) | Indicates the system memory usage of a node. Note: This metric is supported only by cloud-native new control (v3) versions. |
Node CPU IO wait percentage (%) | Indicates the CPU I/O wait percentage of a node. Note: This metric is supported only by cloud-native new control (v3) versions. |
Node Load_1m (value) | Indicates the load of each node in a cluster within 1 minute. |
Node network metrics
Metric | Description | Note |
Node network bandwidth_input (KiB/s) | Indicates the inbound traffic rate of each node in a cluster. The monitoring cycle of the metric is 1 minute. Unit: KiB/s. | None. |
Node network bandwidth_output (KiB/s) | Indicates the outbound traffic rate of each node in a cluster. The monitoring cycle of the metric is 1 minute. Unit: KiB/s. | None. |
Node network bandwidth (KiB/s) | Node network bandwidth (KiB/s) = Node network bandwidth_input (KiB/s) + Node network bandwidth_output (KiB/s). | This metric is supported only by cloud-native new control (v3) versions. |
Node network bandwidth usage (%) | Node network bandwidth usage (%) = (Node network bandwidth_input (KiB/s) + Node network bandwidth_output (KiB/s)) / Node network base bandwidth (Gbit/s). | This metric is supported only by cloud-native new control (v3) versions. |
Node TCP connections (count) | Indicates the number of TCP connection requests received by each node in a cluster from clients. | None. |
Node network retransmission rate (%) | Indicates the network retransmission rate of a node. | This metric is supported only by cloud-native new control (v3) versions. |
Node network packets_input (count) | Indicates the number of inbound packets for each node in a cluster. The monitoring cycle of the metric is 1 minute. | None. |
Node network packets_output (count) | Indicates the number of outbound packets for each node in a cluster. The monitoring cycle of the metric is 1 minute. | None. |
Node network packets (count) | Node network packets (count) = Node network packets_output (count) + Node network packets_input (count). | This metric is supported only by cloud-native new control (v3) versions. |
Node network packet usage (%) | Node network packet usage (%) = (Node network packets_output (count) + Node network packets_input (count)) / Node network packet forwarding PPS. | None. |
Node disk metrics
Metric | Description | Note |
Disk bandwidth_read (MiB/s) | Indicates the amount of data read from each node in the cluster per second. | None. |
Disk bandwidth_write (MiB/s) | Indicates the amount of data written to each node in the cluster per second. | None. |
Disk bandwidth (MiB/s) | Disk bandwidth (MiB/s) = Disk bandwidth_read (MiB/s) + Disk bandwidth_write (MiB/s). | This metric is supported only by cloud-native new control (v3) versions. |
Disk bandwidth usage_disk (%) | Disk bandwidth usage_disk (%) = (Disk bandwidth_read (MiB/s) + Disk bandwidth_write (MiB/s)) / Single ESSD disk throughput (MB/s). | This metric is supported only by cloud-native new control (v3) versions. For the throughput of a single ESSD, see ESSD introduction. |
Disk bandwidth usage_node (%) | Disk bandwidth usage_node (%) = (Disk bandwidth_read (MiB/s) + Disk bandwidth_write (MiB/s)) / Node disk base bandwidth (Gbit/s). | This metric is supported only by cloud-native new control (v3) versions. |
IOUtil (%) | Indicates the I/O usage of each node in the cluster. | None. |
Disk IOPS_read (count) | Indicates the number of read requests completed per second on each node in the cluster. | None. |
Disk IOPS_write (count) | Indicates the number of write requests completed per second on each node in the cluster. | None. |
Disk IOPS (count) | Disk IOPS (count) = Disk IOPS_read (count) + Disk IOPS_write (count). | This metric is supported only by cloud-native new control (v3) versions. |
Disk IOPS usage_disk (%) | Disk IOPS usage_disk (%) = (Disk IOPS_read (count) + Disk IOPS_write (count)) / Single ESSD disk IOPS. | This metric is supported only by cloud-native new control (v3) versions. For the IOPS of a single ESSD, see ESSD introduction. |
Disk IOPS usage_node (%) | Disk IOPS usage_node (%) = (Disk IOPS_read (count) + Disk IOPS_write (count)) / Node disk base IOPS. | This metric is supported only by cloud-native new control (v3) versions. |
Average request queue length (count) | Indicates the average length of the request queue. | None. |
Node JVM metrics
Metric name | Metric description |
Node old generation usage (B) | Indicates the old generation heap memory usage of each node in the cluster. If the old generation usage is high or large memory objects exist, cluster services are affected and GC operations are automatically triggered. The collection of large objects may result in long GC durations or full GC. |
Full GC count (count) | Indicates the total number of GC operations in the cluster within 1 minute. |
Node Old GC frequency (count) | Indicates the number of old generation GC operations on each node in the cluster. If the old generation usage is high or large memory objects exist, cluster services are affected and GC operations are automatically triggered. The collection of large objects may result in long GC durations or full GC. |
Node Old GC duration (ms) | Indicates the average duration of old generation GC operations on each node in the cluster. If the old generation usage is high or large memory objects exist, GC operations are automatically triggered. The collection of large objects may result in long GC durations or full GC. |
Thread pool metrics
Metric name | Metric description |
Query thread pool running thread count (count) | Indicates the number of threads in the query thread pool that are currently executing tasks in the cluster. |
Query thread pool rejected requests (new version) (count) | Indicates the number of rejected requests in the query thread pool within the cluster. |
Other metrics
Metric name | Metric description |
NodeStatsExceptionLogCount(count) | Indicates the total number of warning-level logs that appear in the main log of the cluster within 1 minute. |
Deprecated metrics
Metric name | Metric description |
Query thread pool rejected requests (count) | Indicates the number of rejected requests in the query thread pool. The calculation method of this metric differs from that of the Query thread pool rejected requests (new version) (count) metric. This metric is deprecated. Use Query thread pool rejected requests (new version) (count) instead. |
ClusterStatus(value)
Description
This metric displays the health status of the cluster. A value of 0.00 indicates that the cluster is normal. You must configure this metric. For more information about how to configure this metric, see Configure cluster alerts. The following table describes the values of the metric.
Value | Color | Status | Description |
0.00 | Green | All primary and replica shards are available. | All the indexes stored on the cluster are healthy and do not have unassigned shards. |
1.00 | Yellow | All primary shards are available, but not all of the replica shards are available. | One or more indexes have unassigned replica shards. |
2.00 | Red | Not all of the primary shards are available. | One or more indexes have unassigned primary shards. |
The colors in the table refer to the colors of the cluster status displayed on the Basic Information page of the instance.
Exception causes
If the value of this metric is not 0.00 during monitoring, the cluster status is abnormal. Common causes include the following:
The CPU utilization or heap memory usage of the nodes in the cluster is excessively high or reaches 100%.
The disk usage of the nodes in the cluster is excessively high. For example, the disk usage is higher than 85% or reaches 100%.
The Load_1m of the nodes is excessively high.
The statuses of the indexes stored on the cluster are abnormal (not green).
Suggestions for handling exceptions
View the monitoring information on the Monitoring page of the Kibana console, or view the logs of the instance to obtain specific information about the issue and troubleshoot it (for example, if an index uses too much memory, you can delete some indexes).
For cluster exceptions caused by high disk usage, we recommend that you troubleshoot the issue based on Methods to troubleshoot and handle high disk usage and read_only issues.
For instances with 1 core and 2 GB of memory, if the instance status is abnormal, we recommend that you first upgrade the cluster to a specification with a CPU-to-memory ratio of 1:4 to increase the instance specifications. If the cluster is still abnormal after you increase the specifications, we recommend that you troubleshoot the issue based on the preceding two solutions.
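To confirm the current cluster status and locate the indexes that have unassigned shards, you can run the following requests in the Kibana console. This is a minimal sketch that uses standard Elasticsearch APIs; the output of _cluster/health maps to the 0.00, 1.00, and 2.00 values described in the preceding table.
```
# Returns the cluster status (green, yellow, or red), which maps to 0.00, 1.00, or 2.00.
GET _cluster/health

# Lists indexes whose primary shards are unassigned.
GET _cat/indices?v&health=red

# Lists indexes whose replica shards are unassigned.
GET _cat/indices?v&health=yellow

# Explains why a shard cannot be allocated.
GET _cluster/allocation/explain
```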
Snapshot status (value)
Description
This metric displays the snapshot status of the automatic backup feature in the Elasticsearch console. A value of 0 indicates that snapshots are created. The following table describes the values of the metric.
Snapshot status | Description |
0 | Snapshots are created. |
-1 | No snapshots are created. |
1 | The system is creating a snapshot. |
2 | The system failed to create a snapshot. |
Exception causes
If the value of this metric is 2, the service is abnormal. The common causes are as follows:
The disk usage of the nodes in the cluster is excessively high or close to 100%.
The cluster is abnormal.
Total number of nodes in the cluster (count)
This metric indicates the total number of nodes in the cluster, which is used to monitor whether the node scale meets expectations.
Total number of disconnected nodes in the cluster (count)
This metric indicates the total number of disconnected nodes in the cluster. Disconnected nodes may cause shards to be reassigned or increase query latency.
Cluster index count (count)
This metric indicates the number of indexes in the cluster. Too many indexes may lead to resource contention (for example, memory, CPU).
Cluster shard count (count)
This metric indicates the number of shards in a cluster. Too many shards increase management costs (for example, metadata operations). Too few shards may affect query performance (for example, uneven load distribution).
Cluster primary shard count
This metric indicates the number of primary shards in the cluster. Insufficient primary shards may cause write bottlenecks.
Number of slow queries in the cluster (count)
This metric indicates the number of slow queries in the cluster. You can use this metric to identify performance bottlenecks (such as complex queries or index design issues).
Cluster write QPS (count/s)
If the write QPS of a cluster spikes, the CPU utilization, heap memory usage, or minute-average node load of the cluster may reach a high level. This may affect your services that run on the cluster. You should avoid this situation.
This metric shows the number of documents written to a cluster per second. The details are as follows:
If the cluster receives a write request that contains only one document within 1 second, the value of this metric is 1. The value increases with the number of write requests received per second.
If multiple documents are written to the cluster in batch by using the _bulk API within 1 second, the write QPS is calculated based on the total number of documents in the request. If multiple batch write requests are sent by using the _bulk API within 1 second, the values are accumulated.
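For example, the following _bulk request writes three documents in a single call. If the cluster receives only this request within 1 second, the write QPS for that second is counted as 3, not 1. The index name test_index and the document contents are placeholders for illustration.
```
POST _bulk
{ "index": { "_index": "test_index" } }
{ "user": "user1", "message": "doc 1" }
{ "index": { "_index": "test_index" } }
{ "user": "user2", "message": "doc 2" }
{ "index": { "_index": "test_index" } }
{ "user": "user3", "message": "doc 3" }
```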
Cluster query QPS (count/s)
If the query QPS of a cluster spikes, the CPU utilization, heap memory usage, or minute-average node load of the cluster may reach a high level. This may affect your services that run on the cluster. Try to avoid this situation.
This metric shows the number of queries per second (QPS) that are executed on the cluster. The number of queries per second is related to the number of primary shards in the index that you want to query.
For example, if the index from which you want to query data has five primary shards, your cluster can process five queries per second.
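To check how many primary shards an index has, which determines how this metric grows with each query, you can use the cat APIs as shown in the following sketch. my_index is a placeholder index name.
```
# Shows the number of primary shards (pri column) and replicas (rep column) of the index.
GET _cat/indices/my_index?v&h=index,pri,rep

# Shows how the primary and replica shards are distributed across nodes.
GET _cat/shards/my_index?v
```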
Cluster slow query time distribution
Description
This metric is based on the index.search.slowlog.query and index.search.slowlog.fetch entries in the slow query log. It aggregates the time taken (took_millis) by slow queries and displays the distribution in intervals of 1 second (0~1s, 1~2s, up to 10s). You can configure the thresholds for slow logs. For the related parameters, see the index.search.slowlog.threshold.xxx parameters in Index template configuration.
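The following sketch shows how slow log thresholds are typically set on an existing index by using the index settings API. The index name my_index and the threshold values are examples only; set them based on your latency requirements.
```
PUT my_index/_settings
{
  "index.search.slowlog.threshold.query.warn": "5s",
  "index.search.slowlog.threshold.query.info": "2s",
  "index.search.slowlog.threshold.fetch.warn": "1s",
  "index.search.slowlog.threshold.fetch.info": "500ms"
}
```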
Exception causes
If slow queries shift toward longer time intervals and their number increases during monitoring, service exceptions may occur. Common causes are as follows:
Exception cause | Description |
QPS | Query QPS or Write QPS traffic surges or fluctuates significantly, causing high cluster pressure and longer query response times. |
Aggregate queries or script queries | Aggregate query scenarios require a large amount of computing resources, especially memory, for data aggregation. Use them with caution. |
Term queries on numeric fields | If you perform term queries on many numeric fields (byte, short, integer, long), constructing the bitset of matching document IDs is time-consuming and slows down queries. If a numeric field does not require range or aggregate queries, we recommend that you change it to a keyword field. For a mapping sketch, see the example after this table. |
Fuzzy matching | Wildcard queries, regular expression queries, and fuzzy queries need to traverse the term list in the inverted index to find all matching terms and then collect the document IDs for each term. Without prior stress testing, large-scale queries of this kind can consume a large amount of computing resources. We recommend that you run stress tests based on your scenario before you use these features and select an appropriate scale. |
The cluster receives a few slow query or write requests | In this case, the query and write QPS traffic fluctuates slightly or not noticeably. You can click Searching Slow Logs on the Query Logs page in the Alibaba Cloud Elasticsearch console to view and analyze the logs. |
The cluster stores many indexes or shards | Because Elasticsearch monitors the indexes in the cluster and writes logs, when the total number of indexes or shards is excessive, it can easily cause high CPU or HeapMemory utilization, or high load_1m, affecting the query speed of the entire cluster. |
Merge operations are performed on the cluster | Merge operations consume CPU resources, and the Segment Count of the corresponding node will drop sharply. You can check this on the Overview page of the node in the Kibana console. |
Garbage collection (GC) operations are performed on the cluster | GC operations attempt to release memory (such as full GC), consume CPU resources, and may cause CPU utilization to surge, affecting query speed. |
Scheduled tasks are performed on the cluster | Data backup or other custom tasks require a large amount of IO resources, affecting query speed. |
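As mentioned in the preceding table, if a numeric field is used only for exact-value filters and never for range or aggregate queries, mapping it as a keyword field usually makes term queries faster. The following mapping sketch uses placeholder index and field names (my_index, order_status, order_amount).
```
# order_status is used only for exact-value (term) filters, so it is mapped as keyword.
# order_amount still needs range and aggregate queries, so it stays numeric.
PUT my_index
{
  "mappings": {
    "properties": {
      "order_status": { "type": "keyword" },
      "order_amount": { "type": "long" }
    }
  }
}
```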
Cluster Fielddata memory usage (B)
Description
This metric shows the Fielddata memory usage in the cluster. The higher the monitoring curve, the more Fielddata data is cached in heap memory. Excessive Fielddata memory usage can trigger Fielddata memory circuit breaking, affecting cluster stability.
Exception causes
If this metric shows that Fielddata occupies a large amount of heap memory during monitoring, service exceptions may occur. Common causes include the following:
Queries contain many sort or aggregation operations on string (text) fields. Fielddata loaded for such queries is not evicted by default. We recommend that you sort and aggregate on keyword fields instead, as shown in the sketch after this list.
Query QPS or Write QPS traffic surges or fluctuates significantly, causing Fielddata to be cached frequently.
The cluster stores many indexes or shards. Because Elasticsearch monitors the indexes in the cluster and writes logs, an excessive total number of indexes or shards can easily cause high CPU utilization, heap memory usage, or Load_1m.
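As noted above, a common way to limit Fielddata growth is to sort and aggregate on a keyword sub-field instead of the text field itself, and to clear the Fielddata cache that has already been loaded. The following is a minimal sketch; the index and field names (my_index, city, city.raw) are placeholders.
```
# Map a keyword sub-field so that sorting and aggregations do not load text Fielddata.
PUT my_index
{
  "mappings": {
    "properties": {
      "city": {
        "type": "text",
        "fields": {
          "raw": { "type": "keyword" }
        }
      }
    }
  }
}

# Aggregate on the keyword sub-field instead of the text field.
GET my_index/_search
{
  "size": 0,
  "aggs": {
    "by_city": { "terms": { "field": "city.raw" } }
  }
}

# Clear the Fielddata cache that has already been loaded for this index.
POST my_index/_cache/clear?fielddata=true
```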
Index Bulk write TPS (count/s)
Description
This metric displays the number of Bulk requests per second for the index.
Exception causes
During the monitoring period, this metric may have no data. Common causes include the following:
High cluster pressure affects the normal collection of cluster monitoring data.
Monitoring data failed to be pushed.
Index query QPS (count/s)
Description
This metric shows the number of queries per second (QPS) executed on an index. The QPS value is related to the number of primary shards in the index being queried.
For example, if the index from which you want to query data has five primary shards, your cluster can process five queries per second.
Exception causes
During the monitoring period, this metric may have no data. Common causes include the following:
High cluster pressure affects the normal collection of cluster monitoring data.
Monitoring data failed to be pushed.
A sudden increase in index query QPS may cause high CPU utilization, HeapMemory usage, or Load_1m in the cluster, affecting the entire cluster service. You can optimize the index to address these issues.
Index end-to-end query latency_max (ms)
This metric indicates the maximum time consumed by query requests to the index, measured in milliseconds.
Node CPU utilization_ES service (%)
Description
This metric displays the CPU utilization percentage of each node in the cluster. If the CPU utilization is high or close to 100%, the services that run on the cluster are affected.
Exception causes
If the value of this metric spikes or fluctuates significantly, the service may be abnormal. This issue may be caused by one or more of the reasons described in the following table.
Exception cause | Description |
QPS | Query QPS or Write QPS traffic spikes or significantly fluctuates. |
The cluster receives a few slow query or write requests | In this case, the query and write QPS traffic fluctuates slightly or not noticeably. You can click Searching Slow Logs on the LogSearch page in the Alibaba Cloud Elasticsearch console to view and analyze the logs. |
The cluster stores many indexes or shards | Because Elasticsearch monitors the indexes in the cluster and writes logs, when the total number of indexes or shards is excessive, it can easily cause high CPU or HeapMemory utilization, or high Load_1m. |
Merge operations are performed on the cluster | Merge operations consume CPU resources. The Segment Count of the corresponding node drops sharply. You can view this on the Overview page of the node in the Kibana console. |
GC operations are performed | GC operations attempt to free up memory (such as full gc) and consume CPU resources. As a result, the CPU utilization may spike. |
Scheduled tasks are performed on the cluster | Scheduled tasks, such as data backup or custom tasks, are performed on the cluster. |
The NodeCPUUtilization(%) metric monitors the usage of system resources of Alibaba Cloud Elasticsearch and the resource usage of tasks that run on Elasticsearch clusters.
Node disk usage (%)
This metric displays the disk usage percentage of each node in the cluster. We recommend that you set the threshold to a value less than 75%. Do not set the threshold for this metric to a value greater than 85%. Otherwise, the following situations may occur, which can affect your services that run on the cluster.
Node disk usage | Description |
>85% | New shards cannot be assigned. |
>90% | The cluster attempts to migrate shards from the node to other data nodes with lower disk usage. |
>95% | Elasticsearch forcibly sets the index.blocks.read_only_allow_delete property of indexes on the node to true. The indexes become read-only, and data cannot be written to them. |
We recommend that you configure this metric. After the related alerts are triggered, you can resize disks, add nodes, or delete index data at the earliest opportunity to ensure that your services are not affected.
The NodeDiskUtilization(%) metric monitors the usage of system resources of Alibaba Cloud Elasticsearch and the resource usage of tasks that run on Elasticsearch clusters.
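After you resize disks or delete data to bring disk usage back below the thresholds, indexes that were set to read-only can be restored to a writable state. The following is a minimal sketch that uses the standard index settings API; on newer Elasticsearch versions the block may also be removed automatically after disk usage drops.
```
# Remove the read-only block from all indexes after disk usage drops below the flood-stage threshold.
PUT _all/_settings
{
  "index.blocks.read_only_allow_delete": null
}
```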
Node heap memory usage_ES service (%)
Description
This metric displays the heap memory usage percentage of each node in the cluster. When the heap memory usage is high or large memory objects exist, cluster services are affected, and garbage collection (GC) operations are automatically triggered.
Exception causes
If the value of this metric spikes or fluctuates significantly, the service may be abnormal. This issue may be caused by one or more of the reasons described in the following table.
Exception cause | Description |
QPS | Query QPS or Write QPS traffic spikes or significantly fluctuates. |
The cluster receives a few slow query or write requests | In this case, the query and write QPS traffic fluctuates slightly or not noticeably. You can click Searching Slow Logs on the LogSearch page in the Alibaba Cloud Elasticsearch console to view and analyze the logs. |
The cluster receives many slow query or write requests | In this case, the query and write QPS traffic fluctuates significantly or noticeably. You can click Indexing Slow Logs on the LogSearch page in the Alibaba Cloud Elasticsearch console to view and analyze the logs. |
The cluster stores many indexes or shards | Because Elasticsearch monitors the indexes in the cluster and writes logs, when the total number of indexes or shards is excessive, the CPU or heap memory usage, or Load_1m can become too high. |
Merge operations are performed on the cluster | Merge operations consume CPU resources, and the Segment Count of the corresponding node drops sharply. You can view this on the Overview page of the node in the Kibana console. |
GC operations are performed | GC operations attempt to free up memory (for example, Full GC) and consume CPU resources. This may cause the heap memory usage to drop sharply. |
Scheduled tasks are performed on the cluster | Data backup or other custom tasks. |
Node Load_1m (value)
Description
This metric shows the load of each node in the cluster within 1 minute, indicating how busy each node's system is. In normal cases, the value of this metric is less than the number of vCPUs on the node. The following table describes the values of the metric for a node that has only one vCPU.
Node Load_1m | Description |
< 1 | No pending processes exist. |
= 1 | The system does not have idle resources to run more processes. |
> 1 | Processes are queuing for resources. |
The Node Load_1m metric includes not only the resource usage at the system level of Alibaba Cloud Elasticsearch but also the resource usage of Elasticsearch tasks.
Fluctuations in the Node Load_1m metric might be normal. We recommend that you focus on analyzing the Node CPU Utilization metric.
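To compare Load_1m with the CPU utilization and heap usage of each node, the cat nodes API can serve as a quick cross-check, for example:
```
# Shows the 1-minute and 5-minute load, CPU utilization, and heap usage of each node.
GET _cat/nodes?v&h=name,load_1m,load_5m,cpu,heap.percent
```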
Exception causes
If the value of this metric exceeds the number of vCPUs on a node, the node may be overloaded. This issue may be caused by one or more of the following reasons:
The CPU utilization or heap memory usage of the nodes in the cluster is excessively high or reaches 100%.
The Query QPS or Write QPS traffic surges or increases significantly.
The cluster receives slow query requests.
You can go to the LogSearch page in the Alibaba Cloud Elasticsearch console to view and analyze the corresponding logs.
Node Load_1m includes not only the resource usage at the system level of Alibaba Cloud Elasticsearch but also the resource usage of Elasticsearch tasks.
Node memory usage_total (%)
This metric displays the system memory usage of the node.
Node CPU IO wait percentage (%)
This metric displays the CPU IO wait percentage of the node.
Node network packets_input (count)
This metric displays the number of inbound packets for each node in the cluster. The monitoring cycle of the metric is 1 minute.
Node network packets_output (count)
This metric displays the number of outbound packets for each node in the cluster. The monitoring cycle of the metric is 1 minute.
Node network bandwidth_input (KiB/s)
This metric displays the inbound traffic rate of each node in the cluster. The monitoring cycle of the metric is 1 minute. Unit: KiB/s.
Node network bandwidth_output (KiB/s)
This metric displays the outbound traffic rate of each node in the cluster. The monitoring cycle of the metric is 1 minute. Unit: KiB/s.
Node TCP connections (count)
Description
This metric displays the number of TCP connection requests initiated by clients to each node in the cluster.
Exception causes
If the value of this metric spikes or fluctuates significantly during monitoring, the service may be abnormal. A common cause is that TCP connections initiated by clients are not released for an extended period of time, which results in a sudden increase in the number of TCP connections on nodes. We recommend that you configure policies for your clients to release connections.
IOUtil (%)
Description
This metric displays the IO usage percentage of each node in the cluster.
Exception causes
If the value of this metric spikes or fluctuates significantly during monitoring, the service may be abnormal. This issue may be caused by high disk usage. High disk usage increases the average wait time of data read and write operations, which results in a sudden increase in I/O usage, possibly up to 100%. We recommend that you troubleshoot the issue based on your cluster configuration and other metrics. For example, you can upgrade the configuration of your cluster.
Node network retransmission rate (%)
This metric displays the network retransmission rate of the node.
Node network bandwidth (KiB/s)
Node network bandwidth (KiB/s) = Node network bandwidth_input (KiB/s) + Node network bandwidth_output (KiB/s).
Node network bandwidth usage (%)
Node network bandwidth usage (%) = (Node network bandwidth_input (KiB/s) + Node network bandwidth_output (KiB/s)) / Node network base bandwidth (Gbit/s).
Node network packets (count)
Node network packets (count) = Node network packets_output (count) + Node network packets_input (count).
Node network packet usage (%)
Node network packet usage (%) = (Node network packets_output (count) + Node network packets_input (count)) / Node network packet forwarding PPS.
Disk bandwidth_read (MiB/s)
This metric displays the amount of data read from each node in the cluster per second.
Disk bandwidth_write (MiB/s)
This metric displays the amount of data written to each node in the cluster per second.
Disk IOPS_read (count)
This metric displays the number of read requests completed per second on each node in the cluster.
Disk IOPS_write (count)
This metric displays the number of write requests completed per second by each node in the cluster.
Average request queue length
This metric displays the average length of the request queue.
Disk bandwidth (MiB/s)
Disk bandwidth (MiB/s) = Disk bandwidth_read (MiB/s) + Disk bandwidth_write (MiB/s).
Disk bandwidth usage_disk (%)
Disk bandwidth usage_disk (%) = (Disk bandwidth_read (MiB/s) + Disk bandwidth_write (MiB/s)) / Single ESSD disk throughput (MB/s).
Disk bandwidth usage_node (%)
Disk bandwidth usage_node (%) = (Disk bandwidth_read (MiB/s) + Disk bandwidth_write (MiB/s)) / Disk basic bandwidth (Gbit/s).
Disk IOPS (count)
Disk IOPS (count) = Disk IOPS_read (count) + Disk IOPS_write (count).
Disk IOPS usage_disk (%)
Disk IOPS usage_disk (%) = (Disk IOPS_read (count) + Disk IOPS_write (count)) / Single ESSD disk IOPS.
Disk IOPS usage_node (%)
Disk IOPS usage_node (%) = (Disk IOPS_read (count) + Disk IOPS_write (count)) / Disk basic IOPS.
Node old generation usage (B)
Description
This metric shows the size of the old generation heap memory usage for each node in the cluster. When the old generation occupies a high percentage or contains large memory objects, it affects cluster services and automatically triggers garbage collection (GC) operations. The collection of large objects may result in long GC durations or full GC.
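The old generation usage reported by this metric can be cross-checked against the node stats API, which exposes the used size of the old memory pool per node. This is a minimal sketch using the standard API:
```
# Returns JVM memory statistics per node, including jvm.mem.pools.old.used_in_bytes.
GET _nodes/stats/jvm

# Optionally limit the response to the old pool statistics.
GET _nodes/stats/jvm?filter_path=nodes.*.jvm.mem.pools.old
```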
Exception causes
If the value of this metric spikes or fluctuates significantly, the service may be abnormal. This issue may be caused by one or more of the reasons described in the following table.
Exception cause | Description |
QPS | Query QPS or Write QPS traffic spikes or significantly fluctuates. |
Aggregation queries or script queries | Aggregation query scenarios require a large amount of computing resources, especially memory, for data aggregation. Use them with caution. |
Term queries on numeric fields | If you perform term queries on many numeric fields (byte, short, integer, long), constructing the bitset of matching document IDs is time-consuming and slows down queries. If a numeric field does not require range or aggregation queries, we recommend that you change it to a keyword field. |
Fuzzy matching | Wildcard queries, regular expression queries, and fuzzy queries need to traverse the term list in the inverted index to find all matching terms and then collect the document IDs for each term. Without prior stress testing, large-scale queries of this kind can consume a large amount of computing resources. We recommend that you run stress tests based on your scenario before you use these features and select an appropriate scale. |
The cluster receives a few slow query or write requests | In this case, the query and write QPS traffic fluctuates slightly or not noticeably. You can go to the Query Logs page in the Alibaba Cloud Elasticsearch console and click Searching Slow Logs to view and analyze the logs. |
The cluster receives many slow query or write requests | In this case, the query and write QPS traffic fluctuates significantly or noticeably. You can go to the Query Logs page in the Alibaba Cloud Elasticsearch console and click Indexing Slow Logs to view and analyze the logs. |
The cluster stores many indexes or shards | The system monitors indexes stored on the cluster and logs index changes. If the cluster stores excessive indexes or shards, the CPU utilization, heap memory usage, or minute-average node load may reach a high level. |
Merge operations are performed on the cluster | Merge operations consume CPU resources, and the Segment Count of the corresponding node will drop sharply. You can check this on the Overview page of the node in the Kibana console. |
GC operations are performed | GC operations attempt to free up memory (for example, Full GC), consume CPU resources, and may cause a significant decrease in heap memory usage. |
Scheduled tasks are performed on the cluster | Scheduled tasks, such as data backup or custom tasks, are performed on the cluster. |
Full GC count (Count)
Frequent Full GC occurrences in the system affect cluster service performance.
Description
This metric displays the total number of GC operations in the cluster within 1 minute.
Exception causes
If the value of this metric is not 0, the cluster may be abnormal. This issue may be caused by one or more of the following reasons:
High heap memory usage in the cluster.
Large objects stored in the cluster memory.
Node Old GC frequency (count)
Metric description
This metric indicates the number of Old Generation garbage collections on each node in the cluster. When the Old Generation occupies a high percentage or contains large memory objects, it affects cluster services and automatically triggers garbage collection operations. The collection of large objects may result in long GC durations or Full GC.
The Full GC basic monitoring metric is obtained through logs, while memory metrics in advanced monitoring depend on ES engine collection. These two methods have differences in data acquisition and application. We recommend that you comprehensively evaluate cluster performance by combining all metrics.
Exception causes
For more information, see Node Old Area Usage (B).
Node Old GC duration (ms)
Description
This metric indicates the average time spent on Old generation garbage collection for each node in the cluster. When the Old generation area usage is high or large memory objects exist, GC operations are automatically triggered. The collection of large objects may result in longer GC durations or Full GC.
Exception causes
For more information, see Node Old generation usage (B).
Query thread pool running thread count (count)
Indicates the number of threads in the query thread pool that are currently executing tasks in the cluster.
Query thread pool rejected requests (count)
Indicates the number of rejected requests in the query thread pool within the cluster. This metric is deprecated. We recommend that you use Query thread pool rejected requests (new version) (count).
Query thread pool rejected requests (new version) (count)
Indicates the number of rejected requests in the query thread pool within the cluster. When all threads in the thread pool are processing tasks and the task queue is full, new query requests are rejected and exceptions are thrown.
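To see the current activity and cumulative rejections of the search thread pool on each node, you can use the cat thread pool API, for example:
```
# Shows active threads, queue size, and rejected requests of the search thread pool per node.
GET _cat/thread_pool/search?v&h=node_name,name,active,queue,rejected
```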
NodeStatsExceptionLogCount(Count)
Description
This metric shows the number of warning-level (exception) entries that appear in the main log of each node in the cluster within 1 minute. A larger value indicates that more exceptions have occurred, which may affect cluster services.
Exception causes
If the value of this metric is not 0 during monitoring, the service is abnormal. Common causes include the following:
The cluster receives abnormal query requests.
The cluster receives abnormal write requests.
Errors occur when the cluster runs tasks.
Garbage collection operations have been executed.
Suggestions for handling exceptions
You can go to the LogSearch page in the Alibaba Cloud Elasticsearch console, and click Main Logs. On the Main Logs page, you can view detailed exception information based on the time point and analyze the cause of the exception.
If there are GC records in the Main Logs, they will also be counted and displayed in the Exception Count monitoring metric.