Elasticsearch: Metrics and exception handling

Last Updated: Mar 01, 2026

Alibaba Cloud Elasticsearch provides monitoring metrics to help you understand cluster health, identify performance issues, and take corrective action. This topic describes available metrics, their meanings, common exception causes, and recommended handling procedures.

Note

Metric values in this document are for reference and may differ from those displayed in the console. Always refer to the console for the most current and authoritative information.

Quick reference: Recommended thresholds

Metric | Warning | Critical
Disk usage | > 75% | > 85%
CPU utilization | > 80% | > 95%
Heap memory | > 75% | > 85%
Node Load_1m | > vCPU count | > 2 × vCPU count
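
The thresholds above can be turned into a simple alerting check. The following is a minimal sketch; `classify` and `vcpu_count` are illustrative names, not part of any Alibaba Cloud API, and the vCPU count must match your actual node specification.

```python
# Hypothetical helper: classify a metric reading against the quick-reference
# thresholds. The threshold values mirror the table above; vcpu_count is an
# assumption you must supply for your node type.

def classify(metric: str, value: float, vcpu_count: int = 4) -> str:
    """Return 'ok', 'warning', or 'critical' for a single reading."""
    thresholds = {
        "disk_usage": (75, 85),          # percent
        "cpu_utilization": (80, 95),     # percent
        "heap_memory": (75, 85),         # percent
        "load_1m": (vcpu_count, 2 * vcpu_count),
    }
    warn, crit = thresholds[metric]
    if value > crit:
        return "critical"
    if value > warn:
        return "warning"
    return "ok"

print(classify("disk_usage", 88))  # critical
print(classify("load_1m", 5))      # warning (with the default 4 vCPUs)
```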

View metrics

  1. Log on to the Alibaba Cloud Elasticsearch console.

  2. In the left navigation menu, choose Elasticsearch Clusters.

  3. Navigate to the target cluster.

    1. In the top navigation bar, select the resource group to which the cluster belongs and the region where the cluster resides.

    2. On the Elasticsearch Clusters page, find the cluster and click its ID.

  4. In the left-side navigation pane, choose Monitoring and Logs > Cluster Monitoring.

  5. View monitoring details.

    • View Basic Monitoring details

      On the Basic Monitoring tab, select a Group Name category and a monitoring period as needed to view the monitoring details of resources in the corresponding category during the specified period.

      Note
      • Click Custom to view monitoring details within a custom time period as needed.

      • The monitoring and alerting feature for Elasticsearch clusters is enabled by default, so you can view historical monitoring data on the Cluster Monitoring page. Monitoring data is collected at one-minute granularity and retained for only 30 days.

Understand monitoring differences

The cluster monitoring feature provided by Alibaba Cloud Elasticsearch may differ from the monitoring feature provided by Kibana or third-party services in the following aspects:

  • Sampling period differences: Alibaba Cloud cluster monitoring samples at a different interval than Kibana or third-party monitoring, so the collected data points, and therefore the displayed values, can differ.

  • Query algorithm differences: Both Alibaba Cloud Elasticsearch cluster monitoring and Kibana monitoring are affected by cluster stability when collecting data. The QPS metric in cluster monitoring may show sudden increases, negative values, or no monitoring data due to cluster jitter, while Kibana monitoring may show empty values.

    Note

    Cluster monitoring may provide metrics that Kibana monitoring does not. Use both features together for comprehensive monitoring.

  • Collection interface differences: Kibana monitoring metrics depend on the Elasticsearch API, while some node-level metrics in cluster monitoring (such as CPU utilization, load_1m, disk usage) call the underlying system interfaces of Alibaba Cloud Elasticsearch. Therefore, the monitoring includes not only the Elasticsearch process but also the usage of system-level resources.

Comprehensive metric reference

This section provides detailed information for each available monitoring metric, organized by category.

Cluster metrics

ClusterStatus

Description

This metric displays the health status of the cluster. A value of 0.00 indicates that the cluster is normal. This metric is essential to cluster monitoring. For detailed instructions, see Configure cluster alerts. The following table describes the values of the metric.

Value | Color | Status | Description
0.00 | Green | All primary and replica shards are available. | All the indexes stored on the cluster are healthy and do not have unassigned shards.
1.00 | Yellow | All primary shards are available, but not all replica shards are available. | One or more indexes have unassigned replica shards.
2.00 | Red | Not all primary shards are available. | One or more indexes have unassigned primary shards.

Note

The colors listed above correspond to the cluster status displayed on your instance's Basic Information page.
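
The 0.00/1.00/2.00 values correspond to the standard `green`/`yellow`/`red` status reported by the Elasticsearch `_cluster/health` API. The following sketch maps a health response to the metric value; the sample response body is illustrative, but the `status` field itself is standard.

```python
import json

# Map the standard _cluster/health status to the ClusterStatus metric value
# described in the table above. The sample JSON is an illustrative response.

sample_health = json.loads("""
{
  "cluster_name": "es-cn-example",
  "status": "yellow",
  "unassigned_shards": 2
}
""")

STATUS_TO_METRIC = {"green": 0.00, "yellow": 1.00, "red": 2.00}

metric_value = STATUS_TO_METRIC[sample_health["status"]]
print(metric_value)  # 1.0
```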

Common causes for exceptions

During monitoring, when the metric value is not 0.00, it indicates an abnormal state. Common reasons for such exceptions include:

  • The CPU utilization or heap memory usage of the nodes in the cluster is excessively high or reaches 100%.

  • The disk usage of the nodes in the cluster is excessively high. For example, the disk usage is higher than 85% or reaches 100%.

  • The Load_1m of the nodes is excessively high.

  • The statuses of the indexes stored on the cluster are abnormal (not green).

Troubleshooting

  • View the monitoring information on the Monitoring page of the Kibana console, or view the logs of the instance to obtain specific information about the issue and troubleshoot it (for example, if an index uses too much memory, you can delete some indexes).

  • For cluster exceptions caused by high disk usage, troubleshoot based on Methods to troubleshoot and handle high disk usage and read_only issues.

  • For instances with 1 core and 2 GB of memory, if the instance status is abnormal, first upgrade the cluster to a specification with a CPU-to-memory ratio of 1:4. If the cluster is still abnormal after upgrading, troubleshoot based on the preceding two solutions.

ClusterAutoSnapshotLatestStatus

Description

This metric displays the snapshot status of the automatic backup feature in the Elasticsearch console. When the metric value is 0, it indicates that snapshots are created. The following table describes the values of the metric.

Snapshot status | Description
0 | Snapshots are created.
-1 | No snapshots are created.
1 | The system is creating a snapshot.
2 | The system failed to create a snapshot.

Common causes for exceptions

When the metric value is 2, the service is abnormal. The common causes are as follows:

  • The disk usage of the nodes in the cluster is excessively high or close to 100%.

  • The cluster is abnormal.

ClusterNodeCount

This metric indicates the total number of nodes in the cluster, which is used to monitor whether the node scale meets expectations.

ClusterDisconnectedNodeCount

This metric indicates the total number of disconnected nodes in the cluster. Disconnected nodes may cause shards to be reassigned or increase query latency.

Cluster Index Count

This metric indicates the number of indexes in the cluster. Too many indexes may lead to resource contention (for example, memory, CPU).

Cluster Shard Count

This metric indicates the number of shards in a cluster. Too many shards increase management costs (for example, metadata operations). Too few shards may affect query performance (for example, uneven load distribution).

Cluster Primary Shard Count

This metric indicates the number of primary shards in the cluster. Insufficient primary shards may cause write bottlenecks.

Cluster Slow Searching Count

This metric indicates the number of slow queries in the cluster. You can use this metric to identify performance bottlenecks (such as complex queries or index design issues).

ClusterIndexQPS

Important

If the write QPS of a cluster spikes, the CPU utilization, heap memory usage, or minute-average node load of the cluster may reach a high level. This may affect your services that run on the cluster. You should avoid this situation.

This metric shows the number of documents written to a cluster per second. The details are as follows:

  • If the cluster receives a write request that contains only one document within 1 second, the value of this metric is 1. The value increases with the number of write requests received per second.

  • If multiple documents are written to the cluster in batch by using the _bulk API within 1 second, the write QPS is calculated based on the total number of documents in the request. If multiple batch write requests are sent by using the _bulk API within 1 second, the values are accumulated.
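
The counting rules above can be sketched as follows. The request shapes are simplified stand-ins, not a real API payload; the point is that each document in a `_bulk` request contributes 1, and requests within the same second accumulate.

```python
# Illustrative sketch of write QPS accounting: each document counts once,
# and all requests received within the same second are summed.

requests_in_one_second = [
    {"api": "_doc", "docs": 1},     # single-document write
    {"api": "_bulk", "docs": 200},  # bulk request with 200 documents
    {"api": "_bulk", "docs": 300},  # second bulk in the same second
]

write_qps = sum(r["docs"] for r in requests_in_one_second)
print(write_qps)  # 501
```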

ClusterQueryQPS

Important

If the query QPS of a cluster spikes, the CPU utilization, heap memory usage, or minute-average node load of the cluster may reach a high level. This may affect your services that run on the cluster. Try to avoid this situation.

This metric shows the number of queries per second (QPS) that are executed on the cluster. The number of queries per second is related to the number of primary shards in the index that you want to query.

For example, if the index from which you want to query data has five primary shards, your cluster can process five queries per second.
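
One way to read the five-shard example above is that a single search request fans out to every primary shard of the queried index, so each request involves one shard-level query per primary shard. The sketch below illustrates that interpretation; it is not an official formula.

```python
# Hypothetical illustration of the shard relationship stated above: a search
# against an index with five primary shards executes five shard-level
# queries per request. This is an interpretation of the example, not an
# official formula from the metric definition.

primary_shards = 5
search_requests = 1

shard_level_queries = search_requests * primary_shards
print(shard_level_queries)  # 5
```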

Cluster Slow Searching Distribution

Description

This metric is based on the logs of index.search.slowlog.query and index.search.slowlog.fetch in the slow query log. It aggregates the time taken (took_millis) and displays the distribution in intervals of 1 second (0-1s, 1-2s, up to 10s). You can configure the threshold for slow logs. For related parameters, see the index.search.slowlog.threshold.xxx parameter in Index template configuration.
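
The slow-log thresholds mentioned above are standard Elasticsearch index settings (`index.search.slowlog.threshold.query.*` and `index.search.slowlog.threshold.fetch.*`). The sketch below only builds the settings body; the threshold values are illustrative, and you would apply them with a `PUT <index>/_settings` request against your own cluster and index.

```python
import json

# Build an illustrative slow-log settings body. The setting names are
# standard Elasticsearch index settings; the durations are example values.

slowlog_settings = {
    "index.search.slowlog.threshold.query.warn": "10s",
    "index.search.slowlog.threshold.query.info": "5s",
    "index.search.slowlog.threshold.fetch.warn": "1s",
    "index.search.slowlog.threshold.fetch.info": "800ms",
}

body = json.dumps(slowlog_settings, indent=2)
print(body)
```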

Common causes for exceptions

During the monitoring period, if queries shift into higher time intervals or the number of slow queries grows, service exceptions may occur. Common causes are as follows:

Exception cause

Description

QPS

Query or write QPS surges or fluctuates significantly, causing high cluster pressure and longer query response times.

Aggregate queries or script queries

Aggregate query scenarios require a large amount of computing resources for data aggregation, especially memory.

Term queries on numeric fields

During term queries on many numeric fields (byte, short, integer, long), constructing the bitset for document ID collections is time-consuming and affects query speed. If the numeric field does not require range or aggregate queries, it is recommended to change it to a keyword type field.

Fuzzy matching

Wildcard characters, regular expressions, and fuzzy queries need to traverse the term list in the inverted index to find all matching terms, and then collect the corresponding document IDs for each term. Especially without prior stress testing, large-scale queries will consume a lot of computing resources. It is recommended to conduct stress tests based on your scenario before using these features and select an appropriate scale.

The cluster receives a few slow query or write requests

In this case, the query and write QPS traffic fluctuations are small or not obvious. You can click Search Slow Logs on the Query logs page in the Alibaba Cloud Elasticsearch console to view and analyze the logs.

The cluster stores many indexes or shards

Because Elasticsearch monitors the indexes in the cluster and writes logs, when the total number of indexes or shards is excessive, it can easily cause high CPU or HeapMemory utilization, or high load_1m, affecting the query speed of the entire cluster.

Merge operations are performed on the cluster

Merge operations consume CPU resources, and the Segment Count of the corresponding node will drop sharply. Check this on the Overview page of the node in the Kibana console.

Garbage collection (GC) operations are performed on the cluster

GC operations attempt to release memory (such as full GC), consume CPU resources, and may cause CPU utilization to surge, affecting query speed.

Scheduled tasks are performed on the cluster

Data backup or other custom tasks require a large amount of IO resources, affecting query speed.
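
One of the causes above recommends remapping numeric fields that are only used in term queries to the keyword type. The sketch below shows what such a mapping looks like; the index field names (`order_id`, `amount`) are illustrative.

```python
import json

# Illustrative mapping: an identifier used only in term lookups is mapped
# as keyword, while a field that genuinely needs range or aggregation
# queries stays numeric. Field names are hypothetical.

mapping = {
    "mappings": {
        "properties": {
            "order_id": {"type": "keyword"},  # term lookups only
            "amount": {"type": "long"},       # needs range/aggregation
        }
    }
}

print(json.dumps(mapping))
```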

FielddataMemoryUsedBytes

Description

This metric shows the Fielddata memory usage in the cluster. The higher the monitoring curve, the more Fielddata data is cached in heap memory. Excessive Fielddata memory usage can trigger Fielddata memory circuit breaking, affecting cluster stability.

Common causes for exceptions

During the monitoring period, when the metric occupies a large amount of heap memory, service abnormalities may occur. Common causes include the following:

  • Queries contain many sort or aggregation operations on string (Text) fields. Fielddata built for such queries is not evicted by default. It is recommended to sort and aggregate on keyword or numeric fields instead.

  • Query or write QPS traffic surges or fluctuates significantly, causing Fielddata to be cached frequently.

  • The cluster stores many indexes or shards. Because Elasticsearch monitors the indexes in the cluster and writes logs, when there are too many total indexes or shards, this can easily cause high CPU or HeapMemory usage, or high Load_1m payload.
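
As a defensive measure against the fielddata pressure described above, the standard Elasticsearch circuit-breaker setting `indices.breaker.fielddata.limit` caps fielddata at a percentage of heap. The sketch below only builds the settings body; the 30% value is illustrative, and you would apply it with `PUT _cluster/settings` on your own cluster.

```python
import json

# Build an illustrative cluster-settings body that tightens the fielddata
# circuit breaker. The setting name is a standard Elasticsearch setting;
# the 30% value is an example, not a recommendation for every cluster.

settings_body = {
    "persistent": {
        "indices.breaker.fielddata.limit": "30%"
    }
}

print(json.dumps(settings_body))
```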

Index metrics

BulkTotalOperation

Description

This metric displays the number of bulk requests per second for the index.

Common causes for exceptions

During the monitoring period, this metric may have no data. Common causes include the following:

  • High cluster pressure affects the normal collection of cluster monitoring data.

  • Monitoring data failed to be pushed.

IndexSearchQPS

Description

This metric shows the number of QPS on an index. The QPS value is related to the number of primary shards in the index being queried.

For example, if the index that you query has five primary shards, your cluster can process five queries per second.

Common causes for exceptions

During the monitoring period, this metric may have no data. Common causes include the following:

  • High cluster pressure affects the normal collection of cluster monitoring data.

  • Monitoring data failed to be pushed.

Important

A sudden increase in index query QPS may cause high CPU utilization, HeapMemory usage, or Load_1m in the cluster, affecting the entire cluster service. You can optimize the index to address these issues.

IndexSearchDelayMax

Indicates the maximum time consumed by query requests to the index, measured in milliseconds.

Node resource metrics

Node CPU Utilization_ES Business

Description

This metric displays the CPU utilization percentage of each node in the cluster. If the CPU utilization is high or close to 100%, the services that run on the cluster are affected.

Common causes for exceptions

If the value of the metric spikes or significantly fluctuates, an error occurs. This issue may be caused by one or more reasons described in the following table.

Exception cause

Description

QPS

Query or write QPS spikes or significantly fluctuates.

The cluster receives a few slow query or write requests

In this case, the query and write QPS traffic fluctuates slightly or not noticeably. You can click Search Slow Logs on the LogSearch page in the Alibaba Cloud Elasticsearch console to view and analyze.

The cluster stores many indexes or shards

Because Elasticsearch monitors the indexes in the cluster and writes logs, when the total number of indexes or shards is excessive, it can easily cause high CPU or HeapMemory utilization, or high Load_1m.

Merge operations are performed on the cluster

Merge operations consume CPU resources. The Segment Count of the corresponding node drops sharply. You can view this on the Overview page of the node in the Kibana console.

GC operations are performed

GC operations attempt to free up memory (such as full gc) and consume CPU resources. As a result, the CPU utilization may spike.

Scheduled tasks are performed on the cluster

Scheduled tasks, such as data backup or custom tasks, are performed on the cluster.

Note

The NodeCPUUtilization(%) metric monitors the usage of system resources of Alibaba Cloud Elasticsearch and the resource usage of tasks that run on Elasticsearch clusters.

Node Disk Usage

This metric displays the disk usage percentage of each node. Keep disk usage below 75%. Do not exceed 85%. Otherwise, the following situations may occur, which can affect your services that run on the cluster.

Node disk usage | Description
>85% | New shards cannot be assigned.
>90% | The cluster attempts to migrate shards from the node to other data nodes with lower disk usage.
>95% | Elasticsearch forcibly sets the read_only_allow_delete property for each index in the cluster. Data cannot be written to the indexes; they can only be read or deleted.

Important
  • It is highly recommended to configure alerts for this metric. When alerts are triggered, resize disks, add nodes, or delete index data promptly to avoid service impact.

  • The NodeDiskUtilization(%) metric monitors the usage of system resources of Alibaba Cloud Elasticsearch and the resource usage of tasks that run on Elasticsearch clusters.
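
The stages in the table above can be summarized as a simple classifier. This is a sketch for interpreting the metric; the function and stage strings are illustrative, and the 75% warning level comes from the "keep disk usage below 75%" guidance rather than the table itself.

```python
# Hypothetical classifier mapping node disk usage to the behavior stages
# described above. Thresholds mirror the table; the 75% level is the
# general guidance stated in the text.

def disk_stage(usage_percent: float) -> str:
    if usage_percent > 95:
        return "read_only_allow_delete set; writes blocked"
    if usage_percent > 90:
        return "shards relocated to lower-usage nodes"
    if usage_percent > 85:
        return "new shards not assigned"
    if usage_percent > 75:
        return "warning: resize disks or clean up data"
    return "ok"

print(disk_stage(96))  # read_only_allow_delete set; writes blocked
print(disk_stage(80))  # warning: resize disks or clean up data
```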

Node Heap Memory Usage_ES Business

Description

This metric displays the heap memory usage percentage of each node in the cluster. When the heap memory usage is high or large memory objects exist, cluster services are affected, and GC operations are automatically triggered.

Common causes for exceptions

If the value of the metric spikes or significantly fluctuates, an error occurs. This issue may be caused by one or more reasons described in the following table.

Exception cause

Description

QPS

Query or write QPS traffic spikes or significantly fluctuates.

The cluster receives a few slow query or write requests

In this case, the query and write QPS traffic fluctuates slightly or not noticeably. You can click Search Slow Logs on the LogSearch page in the Alibaba Cloud Elasticsearch console to view and analyze the logs.

The cluster receives many slow query or write requests

In this case, the query and write QPS traffic fluctuates significantly or noticeably. You can click Indexing Slow Logs on the LogSearch page in the Alibaba Cloud Elasticsearch console to view and analyze the logs.

The cluster stores many indexes or shards

Because Elasticsearch monitors the indexes in the cluster and writes logs, when the total number of indexes or shards is excessive, the CPU or heap memory usage, or Load_1m can become too high.

Merge operations are performed on the cluster

Merge operations consume CPU resources, and the Segment Count of the corresponding node drops sharply. You can view this on the Overview page of the node in the Kibana console.

GC operations are performed

GC operations attempt to free up memory (for example, Full GC) and consume CPU resources. This may cause the heap memory usage to drop sharply.

Scheduled tasks are performed on the cluster

Data backup or other custom tasks.

Node Memory Usage_Total

This metric displays the system memory usage of the node.

NodeStatsCpuIOWaitPercentage

This metric displays the CPU IO wait percentage of the node.

NodeLoad_1m

Description

This metric shows the 1-minute load of each node, indicating system busyness. Normally, this value is less than the number of vCPUs on the node. The following table describes the values of the metric for a node that has only one vCPU.

Node Load_1m | Description
< 1 | No pending processes exist.
= 1 | The system does not have idle resources to run more processes.
> 1 | Processes are queuing for resources.

Note
  • The metric includes not only the resource usage at the system level of Alibaba Cloud Elasticsearch but also the resource usage of Elasticsearch tasks.

  • Fluctuations in the NodeLoad_1m metric are typically normal. The node CPU utilization metric provides more context for interpreting such fluctuations.
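
Generalizing the single-vCPU table above to a node with N vCPUs (an assumption consistent with the "less than the number of vCPUs" guidance), the load value can be interpreted as follows. The function name is illustrative.

```python
# Sketch of interpreting NodeLoad_1m against a node's vCPU count,
# generalizing the one-vCPU table above. Not an official formula.

def interpret_load(load_1m: float, vcpus: int) -> str:
    if load_1m < vcpus:
        return "no pending processes"
    if load_1m == vcpus:
        return "fully utilized, no idle capacity"
    return "processes queuing for resources"

print(interpret_load(0.7, 1))  # no pending processes
print(interpret_load(3.5, 2))  # processes queuing for resources
```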

Common causes for exceptions

If the value of the metric exceeds the number of vCPUs on a node, an error occurs. This issue may be caused by one or more of the following reasons:

  • The CPU utilization or heap memory usage of the nodes in the cluster is excessively high or reaches 100%.

  • The query or write QPS traffic surges or increases significantly.

  • The cluster receives slow query requests.

    You can go to the LogSearch page in the Alibaba Cloud Elasticsearch console to view and analyze the corresponding logs.

Node network metrics

Node Network Plan_Input

This metric displays the number of inbound traffic packets for each node in the cluster. The monitoring cycle of the metric is 1 minute.

Node Network Plan_Output

This metric displays the number of outbound traffic packets for each node in the cluster. The monitoring cycle of the metric is 1 minute.

Node Network Bandwidth_Input

This metric displays the inbound network traffic rate for each node in the cluster. The monitoring cycle of the metric is 1 minute. Unit: KB/s.

Node Network Bandwidth_Output

This metric displays the outbound network traffic rate for each node in the cluster. The monitoring cycle of the metric is 1 minute. Unit: KB/s.

NodeStatsTcpEstablished

Description

This metric displays the number of TCP connection requests initiated by clients to each node in the cluster.

Common causes for exceptions

During monitoring, when the metric value spikes or significantly fluctuates, a service error occurs. A common cause is that TCP connections initiated by clients are not released for an extended period, causing a sudden increase in TCP connections on nodes. Configure client policies to release connections.

NodeStatsIOUtil

Description

This metric displays the IO usage percentage of each node in the cluster.

Common causes for exceptions

If the metric value spikes or significantly fluctuates during monitoring, a service error occurs. This may be caused by high disk usage, which increases the average wait time for data read and write operations, causing IO usage to spike, potentially reaching 100%. Troubleshoot based on your cluster configuration and other metrics. For example, upgrade the cluster configuration.

NodeStatsNetworkRetransRate

This metric displays the network retransmission rate of the node.

Node Network Bandwidth

Node network bandwidth (KiB/s) = Node Network Bandwidth_Input (KiB/s) + Node Network Bandwidth_Output (KiB/s).

Node Network Bandwidth Usage

Node network bandwidth usage (%) = (Node Network Bandwidth_Input + Node Network Bandwidth_Output) / Node network base bandwidth (Gbit/s).

Node Network Plan

Node network plan (count) = Node Network Plan_Input (count) + Node Network Plan_Output (count).

Node Network Plan Usage

Node network plan usage (%) = (Node Network Plan_Input (count) + Node Network Plan_Output (count)) / Packet forwarding capability (PPS).
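
As a worked sketch of the bandwidth formulas above: the input values below are illustrative, and converting the base bandwidth from Gbit/s to KiB/s before dividing is an assumption the formulas leave implicit.

```python
# Worked example of the aggregate network bandwidth formulas. Input and
# output rates are illustrative readings; base bandwidth is converted
# from Gbit/s to KiB/s (decimal gigabit, binary KiB) before dividing.

input_kib_s = 20_000.0
output_kib_s = 30_000.0
base_bandwidth_gbit_s = 1.0

total_kib_s = input_kib_s + output_kib_s            # node network bandwidth
base_kib_s = base_bandwidth_gbit_s * 1e9 / 8 / 1024  # Gbit/s -> KiB/s
usage_percent = total_kib_s / base_kib_s * 100

print(round(total_kib_s))       # 50000
print(round(usage_percent, 1))  # ~41.0
```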

Node disk metrics

Disk Bandwidth_Read

This metric displays the amount of data read from each node in the cluster per second.

Disk Bandwidth_Write

This metric displays the amount of data written to each node in the cluster per second.

Disk IOPS_Read

This metric displays the number of read requests completed per second on each node in the cluster.

Disk IOPS_Write

This metric displays the number of write requests completed per second by each node in the cluster.

DiskAverageQueueSize

This metric displays the average length of the request queue.

Disk Bandwidth

Disk bandwidth (MiB/s) = Disk Bandwidth_Read + Disk Bandwidth_Write.

Disk Bandwidth Usage_Disk

Disk bandwidth usage_disk (%) = (Disk Bandwidth_Read + Disk Bandwidth_Write) / Maximum throughput of a single disk (MB/s), which depends on the disk category and size.

Disk Bandwidth Usage_Node

Disk bandwidth usage_node (%) = (Disk Bandwidth_Read + Disk Bandwidth_Write) / Disk basic bandwidth (Gbit/s).

NodeStatsDiskIops

Disk IOPS (count) = Disk IOPS_Read + Disk IOPS_Write.

Disk IOPS Usage_Disk

Disk IOPS usage_disk (%) = (Disk IOPS_Read + Disk IOPS_Write) / Maximum IOPS of a single disk, which depends on the disk category and size.

Disk IOPS Usage_Node

Disk IOPS usage_node (%) = (Disk IOPS_Read + Disk IOPS_Write) / Disk basic IOPS.

Node JVM metrics

JVMMemoryOldUsedBytes

Description

This metric shows the size of the old generation heap memory usage for each node in the cluster. When the old generation occupies a high percentage or contains large memory objects, it affects cluster services and automatically triggers GC. The collection of large objects may result in long GC durations or full GC.

Common causes for exceptions

If the value of the metric spikes or significantly fluctuates, an error occurs. This issue may be caused by one or more reasons described in the following table.

Cause

Description

QPS

Query or write QPS traffic spikes or significantly fluctuates.

Aggregation queries or script queries

Aggregation query scenarios require a large amount of computing resources for data aggregation, especially memory. Please be cautious when using them.

Term queries on numeric fields

When performing term queries on many numeric fields (byte, short, integer, long), constructing the bitset for document ID collections is time-consuming and affects query speed. If the numeric field does not require range or aggregation operations, change it to a keyword type field.

Fuzzy matching

Wildcard characters, regular expressions, and fuzzy queries need to traverse the term list in the inverted index to find all matching terms, and then collect the corresponding document IDs for each term. Especially without prior stress testing, large-scale queries will consume a lot of computing resources. It is recommended to conduct stress tests based on your scenario before using these features and select an appropriate scale.

The cluster receives a few slow query or write requests

In this case, the query and write QPS traffic fluctuations are small or not obvious. You can go to the Query logs page in the Alibaba Cloud Elasticsearch console and click Search Slow Logs to view and analyze the logs.

The cluster receives many slow query or write requests

In this case, the query and write QPS traffic fluctuates significantly or noticeably. You can go to the Query logs page in the Alibaba Cloud Elasticsearch console and click Indexing Slow Logs to view and analyze the logs.

The cluster stores many indexes or shards

The system monitors indexes stored on the cluster and logs index changes. If the cluster stores excessive indexes or shards, the CPU utilization, heap memory usage, or minute-average node load may reach a high level.

Merge operations are performed on the cluster

Merge operations consume CPU resources, and the Segment Count of the corresponding node will drop sharply. You can check this on the Overview page of the node in the Kibana console.

GC operations are performed

GC operations attempt to free up memory (for example, Full GC), consume CPU resources, and may cause a significant decrease in heap memory usage.

Scheduled tasks are performed on the cluster

Scheduled tasks, such as data backup or custom tasks, are performed on the cluster.

NodeStatsFullGcCollectionCount

Important

Frequent Full GC occurrences in the system affect cluster service performance.

Description

This metric displays the total number of full garbage collection (Full GC) operations in the cluster within 1 minute.

Common causes for exceptions

If the value of this metric is not 0, an error has occurred. This issue may be caused by one or more of the following reasons:

  • High heap memory usage in the cluster.

  • Large objects stored in the cluster memory.

JVMGCOldCollectionCount

Description

This metric indicates the number of Old Generation garbage collections on each node in the cluster. When the Old Generation occupies a high percentage or contains large memory objects, it affects cluster services and automatically triggers garbage collection operations. The collection of large objects may result in long GC durations or Full GC.

Note

The Full GC count in basic monitoring is obtained from logs, whereas the memory metrics in advanced monitoring are collected by the Elasticsearch engine. The two methods differ in how data is acquired and applied, so evaluate cluster performance by combining all metrics.

Common causes for exceptions

For more information, see JVMMemoryOldUsedBytes.

JVMGCOldCollectionDuration

Description

This metric indicates the average time spent on Old generation garbage collection for each node in the cluster. When the Old generation area usage is high or large memory objects exist, GC operations are automatically triggered. The collection of large objects may result in longer GC durations or Full GC.

Common causes for exceptions

For more information, see JVMMemoryOldUsedBytes.

Thread pool metrics

SearchThreadpoolActiveThreads

Indicates the number of threads in the query thread pool that are currently executing tasks in the cluster.

SearchThreadpoolRejectedV2

Indicates the number of rejected requests in the query thread pool within the cluster. When all threads in the thread pool are processing tasks and the task queue is full, new query requests are rejected and exceptions are thrown.
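
Rejections can also be checked directly with the standard `GET _cat/thread_pool?h=node_name,name,active,queue,rejected` API. The sketch below parses an illustrative sample of that output; the node names and numbers are made up, and the column order is an assumption of this sketch.

```python
# Parse illustrative _cat/thread_pool output (columns: node_name, name,
# active, queue, rejected) and flag nodes with search rejections.

sample = """\
node-1 search 13 0 0
node-2 search 49 87 366
"""

rejections = {}
for line in sample.strip().splitlines():
    node, pool, active, queue, rejected = line.split()
    rejections[node] = int(rejected)

overloaded = [n for n, r in rejections.items() if r > 0]
print(overloaded)  # ['node-2']
```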

Other metrics

NodeStatsExceptionLogCount

Description

This metric shows the number of exception log entries generated on each node in the cluster within 1 minute. The higher the value, the more exceptions have occurred, which may affect cluster services.

Common causes for exceptions

During monitoring, when the metric value is not 0, the service is abnormal. Common causes include the following:

  • The cluster receives abnormal query requests.

  • The cluster receives abnormal write requests.

  • Errors occur when the cluster runs tasks.

  • Garbage collection operations have been executed.

Troubleshooting

You can go to the Query logs page in the Alibaba Cloud Elasticsearch console, and click Cluster Logs. On the Cluster Logs page, you can view detailed exception information based on the time point and analyze the cause of the exception.

Note

If there are GC records in the Cluster Logs, they will also be counted and displayed in the NodeStatsExceptionLogCount monitoring metric.

Deprecated metric

SearchThreadpoolRejected

Indicates the number of rejected requests in the query thread pool within the cluster. This metric is deprecated. Use SearchThreadpoolRejectedV2 instead.