Elasticsearch: Metrics and exception handling suggestions

Last Updated: Jun 12, 2025

Alibaba Cloud Elasticsearch provides multiple basic monitoring metrics (such as cluster status, query QPS, node CPU utilization, and node disk usage) for running clusters. You can use these metrics to understand the operational status of a cluster in real time, handle potential risks promptly, and ensure stable cluster operation. This topic describes how to view cluster monitoring details and explains the meaning of each monitoring metric, common exception causes, and suggestions for handling exceptions.

Differences with other monitoring features

The cluster monitoring feature provided by Alibaba Cloud Elasticsearch may differ from the monitoring feature provided by Kibana or third-party services in the following aspects:

  • Sampling period differences: The sampling period of cluster monitoring differs from that of Kibana or third-party monitoring. As a result, the collected data also differs.

  • Query algorithm differences: Both Alibaba Cloud Elasticsearch cluster monitoring and Kibana monitoring are affected by cluster stability when collecting data. The QPS metric in cluster monitoring may show sudden increases, negative values, or no monitoring data due to cluster jitter, while Kibana monitoring may show empty values.

    Note

    The cluster monitoring feature provides more metrics than the monitoring feature of Kibana. We recommend that you use both features together to monitor your business scenarios.

  • Collection interface differences: Kibana monitoring metrics depend on the Elasticsearch API, while some node-level metrics in cluster monitoring (such as CPU utilization, load_1m, disk usage) call the underlying system interfaces of Alibaba Cloud Elasticsearch. Therefore, the monitoring includes not only the Elasticsearch process but also the usage of system-level resources.
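    For example, Kibana-style metrics are read from the Elasticsearch API, which reports both process-level and OS-level statistics. The following request is a minimal sketch that you can run in the Kibana console to see the data source for such metrics:

      GET _nodes/stats/os,process

    The os section in the response contains system-level values such as os.cpu.percent and os.cpu.load_average.1m, and the process section contains the CPU usage of the Elasticsearch process itself.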

View cluster monitoring data

  1. Log on to the Alibaba Cloud Elasticsearch console.

  2. In the left-side navigation pane, click Elasticsearch Clusters.

  3. Navigate to the desired cluster.

    1. In the top navigation bar, select the resource group to which the cluster belongs and the region where the cluster resides.

    2. On the Elasticsearch Clusters page, find the cluster and click its ID.

  4. In the left-side navigation pane, choose Monitoring And Logs > Cluster Monitoring.

  5. View monitoring details.

    • View Infrastructure Monitoring details

      On the Infrastructure Monitoring tab, select a Group category and a monitoring period as needed to view the monitoring details of resources in the corresponding category during the specified period.

      Note
      • Click Custom to view monitoring details within a custom time period as needed.

      • The monitoring and alerting feature for Elasticsearch instances is enabled by default. Therefore, you can view historical monitoring data on the Cluster Monitoring page. Monitoring data is collected at a 1-minute granularity and is retained for only 30 days.

      • For more information about infrastructure monitoring metrics, see Overview of infrastructure monitoring metrics.

Overview of infrastructure monitoring metrics

The following tables describe the categories of infrastructure monitoring metrics for clusters.


Overview

Metric

Description

ClusterStatus(value)

Indicates the health status of a cluster. A value of 0.00 indicates that the cluster is normal.

ClusterAutoSnapshotLatestStatus(value)

Indicates the snapshot status of the Auto Snapshot feature in the Elasticsearch console.

A value of 0 indicates that snapshots are created.

ClusterNodeCount(count)

Indicates the total number of nodes in a cluster.

ClusterDisconnectedNodeCount(count)

Indicates the total number of disconnected nodes in a cluster.

ClusterIndexCount(count)

Indicates the number of indexes in a cluster.

ClusterShardCount(count)

Indicates the number of shards in a cluster.

ClusterPrimaryShardCount(count)

Indicates the number of primary shards in a cluster.

ClusterSlowQueryCount(count)

Indicates the number of slow queries in a cluster.

ClusterWriteQps(count/s)

Indicates the number of documents written to a cluster per second.

ClusterQueryQps(count/s)

Indicates the number of queries executed per second in a cluster. The query QPS is related to the number of primary shards in the index to be queried.

NodeCPUUtilization_ESBusiness(%)

Indicates the CPU utilization of each node in a cluster.

NodeHeapMemoryUtilization_ESBusiness(%)

Indicates the heap memory usage of each node in a cluster.

NodeDiskUtilization(%)

Indicates the disk usage of each node in a cluster. We recommend that you set the threshold to a value less than 75%. Do not set the threshold for this metric to a value greater than 85%.

NodeLoad_1m(value)

Indicates the load of each node in a cluster within 1 minute, which reflects the system busyness of each node. In normal cases, the value of this metric is less than the number of vCPUs on the node.

NodeNetworkInTraffic(KiB/s)

Indicates the inbound traffic rate of each node in a cluster. The monitoring cycle of the metric is 1 minute.

NodeNetworkOutTraffic(KiB/s)

Indicates the outbound traffic rate of each node in a cluster. The monitoring cycle of the metric is 1 minute.

NodeNetworkInPackets(count)

Indicates the number of inbound packets for each node in a cluster. The monitoring cycle of the metric is 1 minute.

NodeNetworkOutPackets(count)

Indicates the number of outbound packets for each node in a cluster. The monitoring cycle of the metric is 1 minute.

NodeStatsTcpEstablished(count)

Indicates the number of TCP connection requests received by each node in a cluster from clients.

IOUtil(%)

Indicates the I/O usage of each node in a cluster.

DiskReadTraffic(MiB/s)

Indicates the amount of data read from each node in a cluster per second.

DiskWriteTraffic(MiB/s)

Indicates the amount of data written to each node in a cluster per second.

DiskReadIops(count)

Indicates the number of read requests completed per second for each node in a cluster.

DiskWriteIops(count)

Indicates the number of write requests completed per second for each node in a cluster.

Cluster metrics

Metric

Description

ClusterStatus(value)

Indicates the health status of a cluster. A value of 0.00 indicates that the cluster is normal.

ClusterNodeCount(count)

Indicates the total number of nodes in a cluster.

ClusterDisconnectedNodeCount(count)

Indicates the total number of disconnected nodes in a cluster.

ClusterIndexCount(count)

Indicates the number of indexes in a cluster.

ClusterShardCount(count)

Indicates the number of shards in a cluster.

ClusterPrimaryShardCount(count)

Indicates the number of primary shards in a cluster.

ClusterSlowQueryCount(count)

Indicates the number of slow queries in a cluster.

ClusterSlowQueryLatencyDistribution

This metric is based on the index.search.slowlog.query and index.search.slowlog.fetch logs in the slow log. It aggregates data based on the time taken (took_millis) and displays the distribution in 1-second interval windows (0~1s, 1~2s, up to 10s).

ClusterAutoSnapshotLatestStatus(value)

Indicates the snapshot status of the Auto Snapshot feature in the Elasticsearch console. A value of 0 indicates that snapshots are created.

ClusterWriteQps(count/s)

Indicates the number of documents written to a cluster per second.

ClusterQueryQps(count/s)

Indicates the number of queries executed per second in a cluster. The query QPS is related to the number of primary shards in the index to be queried.

ClusterFielddataMemory(B)

Indicates the memory usage of Fielddata in a cluster. A higher monitoring curve indicates that a large amount of Fielddata data is cached in the heap memory. Excessive Fielddata memory usage triggers Fielddata memory circuit breaking, which affects cluster stability.

Index metrics

Metric

Description

IndexBulkTps(count/s)

Indicates the number of bulk requests per second for an index.

IndexQueryQps(count/s)

Indicates the number of queries executed per second for an index. The query QPS is related to the number of primary shards in the index to be queried.

IndexQueryLatencyMax(ms)

Indicates the maximum time taken for a query request on an index. Unit: milliseconds.

Node resource metrics

Metric

Description

NodeCPUUtilization_ESBusiness(%)

Indicates the CPU utilization of each node in a cluster. If the CPU utilization is high or close to 100%, the services that run on the cluster are affected.

NodeHeapMemoryUtilization_ESBusiness(%)

Indicates the heap memory usage of each node in a cluster. If the heap memory usage is high or large objects are stored in the memory, the services that run on the cluster are affected and GC operations are automatically triggered.

NodeDiskUtilization(%)

Indicates the disk usage of each node in a cluster. We recommend that you set the threshold to a value less than 75%. Do not set the threshold for this metric to a value greater than 85%.

NodeMemoryUtilization_Total(%)

Indicates the system memory usage of a node.

Note

This metric is supported only by cloud-native new control (v3) versions.

NodeCPUIOWait(%)

Indicates the CPU I/O wait percentage of a node.

Note

This metric is supported only by cloud-native new control (v3) versions.

NodeLoad_1m(value)

Indicates the load of each node in a cluster within 1 minute, which reflects the system busyness of each node. In normal cases, the value of this metric is less than the number of vCPUs on the node.

Node network metrics

Metric

Description

NodeNetworkInTraffic(KiB/s)

Indicates the inbound traffic rate of each node in a cluster. The monitoring cycle of the metric is 1 minute.

NodeNetworkOutTraffic(KiB/s)

Indicates the outbound traffic rate of each node in a cluster. The monitoring cycle of the metric is 1 minute.

NodeNetworkTraffic(KiB/s)

NodeNetworkTraffic(KiB/s) = NodeNetworkInTraffic(KiB/s) + NodeNetworkOutTraffic(KiB/s). This metric is supported only by cloud-native new control (v3) versions.

NodeNetworkTrafficUtilization(%)

NodeNetworkTrafficUtilization(%) = (NodeNetworkInTraffic(KiB/s) + NodeNetworkOutTraffic(KiB/s)) / Node network base bandwidth (Gbit/s). This metric is supported only by cloud-native new control (v3) versions.

NodeStatsTcpEstablished(count)

Indicates the number of TCP connection requests received by each node in a cluster from clients.

NodeNetworkRetransmitRate(%)

Indicates the network retransmission rate of a node. This metric is supported only by cloud-native new control (v3) versions.

NodeNetworkInPackets(count)

Indicates the number of inbound packets for each node in a cluster. The monitoring cycle of the metric is 1 minute.

NodeNetworkOutPackets(count)

Indicates the number of outbound packets for each node in a cluster. The monitoring cycle of the metric is 1 minute.

NodeNetworkPackets(count)

NodeNetworkPackets(count) = NodeNetworkOutPackets(count) + NodeNetworkInPackets(count). This metric is supported only by cloud-native new control (v3) versions.

NodeNetworkPacketsUtilization(%)

NodeNetworkPacketsUtilization(%) = (NodeNetworkOutPackets(count) + NodeNetworkInPackets(count)) / Node network packet forwarding PPS.

Node disk metrics

Metric

Description

Disk bandwidth_read (MiB/s)

Indicates the amount of data read from each node in the cluster per second.

Disk bandwidth_write (MiB/s)

Indicates the amount of data written to each node in the cluster per second.

Disk bandwidth (MiB/s)

Disk bandwidth (MiB/s) = Disk bandwidth_read (MiB/s) + Disk bandwidth_write (MiB/s). This metric is supported only by cloud-native new control (v3) versions.

Disk bandwidth usage_disk (%)

Disk bandwidth usage_disk (%) = (Disk bandwidth_read (MiB/s) + Disk bandwidth_write (MiB/s)) / ESSD single disk throughput (MB/s). This metric is supported only by cloud-native new control (v3) versions. For ESSD single disk throughput, see ESSD introduction.

Disk bandwidth usage_node (%)

Disk bandwidth usage_node (%) = (Disk bandwidth_read (MiB/s) + Disk bandwidth_write (MiB/s)) / Node disk basic bandwidth (Gbit/s). This metric is supported only by cloud-native new control (v3) versions.

IOUtil (%)

Indicates the I/O usage of each node in the cluster.

Disk IOPS_read (count)

Indicates the number of read requests completed per second by each node in the cluster.

Disk IOPS_write (count)

Indicates the number of write requests completed per second by each node in the cluster.

Disk IOPS (count)

Disk IOPS (count) = Disk IOPS_read (count) + Disk IOPS_write (count). This metric is supported only by cloud-native new control (v3) versions.

Disk IOPS usage_disk (%)

Disk IOPS usage_disk (%) = (Disk IOPS_read (count) + Disk IOPS_write (count)) / ESSD single disk IOPS. This metric is supported only by cloud-native new control (v3) versions. For ESSD single disk IOPS, see ESSD introduction.

Disk IOPS usage_node (%)

Disk IOPS usage_node (%) = (Disk IOPS_read (count) + Disk IOPS_write (count)) / Node disk basic IOPS. This metric is supported only by cloud-native new control (v3) versions.

Average request queue length

Indicates the average length of the disk request queue.

Node JVM metrics

Metric name

Metric description

Node Old generation usage (B)

Indicates the old-generation heap memory usage of each node in the cluster. When old-generation usage is high or large memory objects exist, cluster services are affected and GC operations are automatically triggered. The collection of large objects may result in long GC durations or full GC.

NodeStatsFullGcCollectionCount(Count)

Indicates the total number of full GC operations in the cluster within 1 minute.

Node Old GC frequency (count)

Indicates the number of GC collections in the Old generation of each node in the cluster. When the Old generation occupancy is high or large memory objects exist, cluster services will be affected, and GC operations will be automatically triggered. The collection of large objects may result in long GC duration or Full GC.

Node Old GC duration (ms)

Indicates the average duration of Old generation GC collections for each node in the cluster. When the Old generation occupancy is high or large memory objects exist, GC operations will be automatically triggered. The collection of large objects may result in long GC duration or Full GC.

Thread pool metrics

Metric name

Metric description

Query thread pool active threads (count)

Indicates the number of threads in the query thread pool that are currently executing tasks in the cluster.

Rejected requests in query thread pool (new version) (count)

Indicates the number of rejected requests in the query thread pool within the cluster.

Other metrics

Metric name

Metric description

Exception count (count)

Indicates the total number of warning-level logs that appear in the main log of the cluster within one minute.

Deprecated metrics

Metric name

Metric description

Number of rejected requests in the query thread pool (count)

Indicates the number of rejected requests in the query thread pool. Its calculation method differs from that of the Rejected requests in query thread pool (new version) (count) metric. This metric is deprecated. Use Rejected requests in query thread pool (new version) (count) instead.

ClusterStatus(value)

Description

This metric displays the health status of the cluster. A value of 0.00 indicates that the cluster is normal. We recommend that you configure alerts for this metric. For more information, see Configure cluster alerts. The following table describes the values of the metric.

Value (Color)

Description

0.00 (Green)

All primary and replica shards are available. All the indexes stored on the cluster are healthy and do not have unassigned shards.

1.00 (Yellow)

All primary shards are available, but not all replica shards are available. One or more indexes have unassigned replica shards.

2.00 (Red)

Not all primary shards are available. One or more indexes have unassigned primary shards.

Note

The colors in the table refer to the colors of the cluster status displayed on the Basic Information page of the instance.
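To cross-check the value of this metric, you can query the cluster health API. The following request is a minimal sketch that you can run in the Kibana console:

  GET _cluster/health

The status field in the response (green, yellow, or red) corresponds to the metric values 0.00, 1.00, and 2.00, and the unassigned_shards field shows how many shards are not assigned.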

Exception causes

During monitoring, when the metric value is not 0.00, it indicates that the cluster status is abnormal. Common causes include the following:

  • The CPU utilization or heap memory usage of the nodes in the cluster is excessively high or reaches 100%.

  • The disk usage of the nodes in the cluster is excessively high. For example, the disk usage is higher than 85% or reaches 100%.

  • The Load_1m of the nodes is excessively high.

  • The statuses of the indexes stored on the cluster are abnormal (not green).

Suggestions for handling exceptions

  • View the monitoring information on the Monitoring page of the Kibana console, or view the instance logs to obtain details about the issue and troubleshoot it. For example, if an index uses too much memory, you can delete some indexes.

  • For cluster exceptions caused by high disk usage, we recommend that you troubleshoot the issue based on Methods to troubleshoot and handle high disk usage and read_only issues.

  • For instances with 1 core and 2 GB of memory, if the instance status is abnormal, we recommend that you first upgrade the cluster to a specification with a CPU-to-memory ratio of 1:4 to increase the instance specifications. If the cluster is still abnormal after you increase the specifications, we recommend that you troubleshoot the issue based on the preceding two solutions.
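To locate the cause of an abnormal cluster status, you can also ask Elasticsearch why a shard is unassigned. The following request is a minimal sketch that you can run in the Kibana console; without a request body, it explains the first unassigned shard that the system finds:

  GET _cluster/allocation/explain

The response contains the unassigned reason (for example, ALLOCATION_FAILED) and a per-node explanation of why the shard cannot be allocated.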

Snapshot status (value)

Description

This metric displays the snapshot status of the automatic backup feature in the Elasticsearch console. When the metric value is 0, it indicates that snapshots are created. The following table describes the values of the metric.

Snapshot status

Description

0

Snapshots are created.

-1

No snapshots are created.

1

The system is creating a snapshot.

2

The system failed to create a snapshot.

Exception causes

When the metric value is 2, the service is abnormal. The common causes are as follows:

  • The disk usage of the nodes in the cluster is excessively high or close to 100%.

  • The cluster is abnormal.

Total number of nodes in the cluster (count)

This metric indicates the total number of nodes in the cluster, which is used to monitor whether the node scale meets expectations.

Total number of disconnected nodes in the cluster (count)

This metric indicates the total number of disconnected nodes in the cluster. Disconnected nodes may cause shards to be reassigned or increase query latency.

Cluster index count (count)

This metric indicates the number of indexes in the cluster. Too many indexes may lead to resource contention (for example, memory, CPU).

Cluster shard count (count)

This metric indicates the number of shards in a cluster. Too many shards increase management costs (for example, metadata operations). Too few shards may affect query performance (for example, uneven load distribution).

Cluster primary shard count (count)

This metric indicates the number of primary shards in the cluster. Insufficient primary shards may cause write bottlenecks.

Number of slow queries in the cluster (count)

This metric indicates the number of slow queries in the cluster. You can use this metric to identify performance bottlenecks (such as complex queries or index design issues).

Cluster write QPS (count/s)

Important

If the write QPS of a cluster spikes, the CPU utilization, heap memory usage, or minute-average node load of the cluster may reach a high level. This may affect your services that run on the cluster. You should avoid this situation.

This metric shows the number of documents written to a cluster per second. The details are as follows:

  • If the cluster receives a write request that contains only one document within 1 second, the value of this metric is 1. The value increases with the number of write requests received per second.

  • If multiple documents are written to the cluster in batch by using the _bulk API within 1 second, the write QPS is calculated based on the total number of documents in the request. If multiple batch write requests are sent by using the _bulk API within 1 second, the values are accumulated.
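    For example, the following _bulk request is a minimal sketch (the index name my-index and the documents are hypothetical). Because the request contains three documents, it adds 3 to the value of this metric for the second in which it is processed:

      POST _bulk
      { "index": { "_index": "my-index" } }
      { "user": "u1" }
      { "index": { "_index": "my-index" } }
      { "user": "u2" }
      { "index": { "_index": "my-index" } }
      { "user": "u3" }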

Cluster query QPS (count/s)

Important

If the query QPS of a cluster spikes, the CPU utilization, heap memory usage, or minute-average node load of the cluster may reach a high level. This may affect your services that run on the cluster. Try to avoid this situation.

This metric shows the number of queries per second (QPS) that are executed on the cluster. The number of queries per second is related to the number of primary shards in the index that you want to query.

For example, if the index that you query has five primary shards, one search request on the index is counted as five queries.

Cluster slow query time distribution

Description

This metric is based on the logs of index.search.slowlog.query and index.search.slowlog.fetch in the slow query log. It aggregates the time taken (took_millis) and displays the distribution in intervals of 1 second (0~1s, 1~2s, up to 10s). You can configure the threshold for slow logs. For related parameters, see the index.search.slowlog.threshold.xxx parameter in Index template configuration.
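The following request is a minimal sketch of how such thresholds can be set on an index (the index name my-index and the threshold values are examples only); queries that exceed the thresholds are then recorded in the slow log that this metric aggregates:

  PUT my-index/_settings
  {
    "index.search.slowlog.threshold.query.warn": "5s",
    "index.search.slowlog.threshold.fetch.warn": "1s"
  }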

Exception causes

During the monitoring period, if slow queries fall into longer time intervals and the number of slow queries increases, service exceptions may occur. Common causes are as follows:

Exception cause

Description

QPS

Query QPS or Write QPS traffic surges or fluctuates significantly, causing high cluster pressure and longer query response times.

Aggregate queries or script queries

Aggregation queries require a large amount of computing resources, especially memory, for data aggregation. Use them with caution.

Term queries on numeric fields

When you perform term queries on many numeric fields (byte, short, integer, long), constructing the bitset of matching document IDs is time-consuming and slows queries. If a numeric field does not require range or aggregation queries, we recommend that you map it as a keyword field, as shown in the mapping sketch after this table.

Fuzzy matching

Wildcard, regular expression, and fuzzy queries must traverse the term list in the inverted index to find all matching terms and then collect the document IDs for each term. Large-scale queries of this kind consume a large amount of computing resources, especially if they have not been stress tested in advance. We recommend that you run stress tests for your scenario before you use these features and select an appropriate scale.

The cluster receives a few slow query or write requests

In this case, the query and write QPS traffic fluctuates slightly or not noticeably. You can click Searching Slow Logs on the Query logs page in the Alibaba Cloud Elasticsearch console to view and analyze the logs.

The cluster stores many indexes or shards

Elasticsearch monitors the indexes in the cluster and writes logs for them. If the total number of indexes or shards is excessive, the CPU utilization, heap memory usage, or load_1m can easily reach a high level, which slows queries across the entire cluster.

Merge operations are performed on the cluster

Merge operations consume CPU resources, and the Segment Count of the corresponding node will drop sharply. You can check this on the Overview page of the node in the Kibana console.

Garbage collection (GC) operations are performed on the cluster

GC operations attempt to release memory (such as full GC), consume CPU resources, and may cause CPU utilization to surge, affecting query speed.

Scheduled tasks are performed on the cluster

Data backup or other custom tasks require a large amount of IO resources, affecting query speed.
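As mentioned in the "Term queries on numeric fields" row above, mapping an identifier-like field as keyword instead of a numeric type can speed up term queries. The following mapping is a minimal sketch (the index name my-index and the field name status_code are hypothetical):

  PUT my-index
  {
    "mappings": {
      "properties": {
        "status_code": { "type": "keyword" }
      }
    }
  }

With this mapping, term queries on status_code use the inverted index directly. Keep a numeric type if you need range or aggregation queries on the field.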

Cluster Fielddata memory usage (B)

Description

This metric shows the Fielddata memory usage in the cluster. The higher the monitoring curve, the more Fielddata data is cached in heap memory. Excessive Fielddata memory usage can trigger Fielddata memory circuit breaking, affecting cluster stability.

Exception causes

During the monitoring period, when the metric occupies a large amount of heap memory, service abnormalities may occur. Common causes include the following:

  • Queries contain many sort or aggregation operations on string (text) fields. Fielddata generated by such queries is not evicted by default. We recommend that you use numeric field types instead.

  • Query QPS or Write QPS traffic surges or fluctuates significantly, causing Fielddata to be cached frequently.

  • The cluster stores many indexes or shards. Elasticsearch monitors the indexes in the cluster and writes logs for them. If the total number of indexes or shards is excessive, the CPU utilization, heap memory usage, or Load_1m can easily reach a high level.
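To see which fields consume fielddata memory, you can use the _cat/fielddata API. The following requests are a minimal sketch that you can run in the Kibana console (the index name my-index is hypothetical):

  GET _cat/fielddata?v

  POST my-index/_cache/clear?fielddata=true

The first request lists the fielddata size per node and field. The second request clears the fielddata cache of the index, which can relieve memory pressure temporarily; the cache is rebuilt when such queries run again.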

Index Bulk write TPS (count/s)

Description

This metric displays the number of Bulk requests per second for the index.

Exception causes

During the monitoring period, this metric may have no data. Common causes include the following:

  • High cluster pressure affects the normal collection of cluster monitoring data.

  • Monitoring data failed to be pushed.

Index query QPS (count/s)

Description

This metric shows the number of queries per second (QPS) executed on an index. The QPS value is related to the number of primary shards in the index being queried.

For example, if the index that you query has five primary shards, one search request on the index is counted as five queries.

Exception causes

During the monitoring period, this metric may have no data. Common causes include the following:

  • High cluster pressure affects the normal collection of cluster monitoring data.

  • Monitoring data failed to be pushed.

Important

A sudden increase in index query QPS may cause high CPU utilization, HeapMemory usage, or Load_1m in the cluster, affecting the entire cluster service. You can optimize the index to address these issues.

Index end-to-end query latency_max (ms)

This metric indicates the maximum time consumed by query requests to the index, measured in milliseconds.

Node CPU utilization_ES service (%)

Description

This metric displays the CPU utilization percentage of each node in the cluster. If the CPU utilization is high or close to 100%, the services that run on the cluster are affected.

Exception causes

If the value of the metric spikes or significantly fluctuates, an error occurs. This issue may be caused by one or more reasons described in the following table.

Exception cause

Description

QPS

Query QPS or Write QPS traffic spikes or significantly fluctuates.

The cluster receives a few slow query or write requests

In this case, the query and write QPS traffic fluctuates slightly or not noticeably. You can click Searching Slow Logs on the LogSearch page in the Alibaba Cloud Elasticsearch console to view and analyze the logs.

The cluster stores many indexes or shards

Because Elasticsearch monitors the indexes in the cluster and writes logs, when the total number of indexes or shards is excessive, it can easily cause high CPU or HeapMemory utilization, or high Load_1m.

Merge operations are performed on the cluster

Merge operations consume CPU resources. The Segment Count of the corresponding node drops sharply. You can view this on the Overview page of the node in the Kibana console.

GC operations are performed

GC operations attempt to free up memory (such as full gc) and consume CPU resources. As a result, the CPU utilization may spike.

Scheduled tasks are performed on the cluster

Scheduled tasks, such as data backup or custom tasks, are performed on the cluster.

Note

The NodeCPUUtilization(%) metric monitors the usage of system resources of Alibaba Cloud Elasticsearch and the resource usage of tasks that run on Elasticsearch clusters.

Node disk usage (%)

This metric displays the disk usage percentage of each node in the cluster. We recommend that you set the threshold to a value less than 75%. Do not set the threshold for this metric to a value greater than 85%. Otherwise, the following situations may occur, which can affect your services that run on the cluster.

Node disk usage

Description

>85%

New shards cannot be assigned.

>90%

The cluster attempts to migrate shards from the node to other data nodes with lower disk usage.

>95%

Elasticsearch forcibly sets the read_only_allow_delete property for each index in the cluster. In this case, data cannot be written to the indexes. The indexes can only be read or deleted.

Important
  • We recommend that you configure this metric. After the related alerts are triggered, you can resize disks, add nodes, or delete index data at the earliest opportunity to ensure that your services are not affected.

  • The NodeDiskUtilization(%) metric monitors the usage of system resources of Alibaba Cloud Elasticsearch and the resource usage of tasks that run on Elasticsearch clusters.
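If indexes were set to read-only because disk usage exceeded 95%, you can clear the write block after you free disk space or resize the disks. The following request is a minimal sketch that you can run in the Kibana console (recent Elasticsearch versions can also remove the block automatically after disk usage drops below the watermark):

  PUT _all/_settings
  {
    "index.blocks.read_only_allow_delete": null
  }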

Node heap memory usage_ES service (%)

Description

This metric displays the heap memory usage percentage of each node in the cluster. When the heap memory usage is high or large memory objects exist, cluster services are affected, and garbage collection (GC) operations are automatically triggered.

Exception causes

If the value of the metric spikes or significantly fluctuates, an error occurs. This issue may be caused by one or more reasons described in the following table.

Exception cause

Description

QPS

Query QPS or Write QPS traffic spikes or significantly fluctuates.

The cluster receives a few slow query or write requests

In this case, the query and write QPS traffic fluctuates slightly or not noticeably. You can click Searching Slow Logs on the LogSearch page in the Alibaba Cloud Elasticsearch console to view and analyze the logs.

The cluster receives many slow query or write requests

In this case, the query and write QPS traffic fluctuates significantly or noticeably. You can click Indexing Slow Logs on the LogSearch page in the Alibaba Cloud Elasticsearch console to view and analyze the logs.

The cluster stores many indexes or shards

Because Elasticsearch monitors the indexes in the cluster and writes logs, when the total number of indexes or shards is excessive, the CPU or heap memory usage, or Load_1m can become too high.

Merge operations are performed on the cluster

Merge operations consume CPU resources, and the Segment Count of the corresponding node drops sharply. You can view this on the Overview page of the node in the Kibana console.

GC operations are performed

GC operations attempt to free up memory (for example, Full GC) and consume CPU resources. This may cause the heap memory usage to drop sharply.

Scheduled tasks are performed on the cluster

Data backup or other custom tasks.

Node Load_1m (value)

Description

This metric shows the load of each node in the cluster within 1 minute, indicating how busy each node's system is. In normal cases, the value of this metric is less than the number of vCPUs on the node. The following table describes the values of the metric for a node that has only one vCPU.

Node Load_1m

Description

< 1

No pending processes exist.

= 1

The system does not have idle resources to run more processes.

> 1

Processes are queuing for resources.

Note
  • The Node Load_1m metric includes not only the resource usage at the system level of Alibaba Cloud Elasticsearch but also the resource usage of Elasticsearch tasks.

  • Fluctuations in the Node Load_1m metric might be normal. We recommend that you focus on analyzing the Node CPU Utilization metric.

Exception causes

If the value of the metric exceeds the number of vCPUs on a node, an error occurs. This issue may be caused by one or more of the following reasons:

  • The CPU utilization or heap memory usage of the nodes in the cluster is excessively high or reaches 100%.

  • The Query QPS or Write QPS traffic surges or increases significantly.

  • The cluster receives slow query requests.

    You can go to the LogSearch page in the Alibaba Cloud Elasticsearch console to view and analyze the corresponding logs.

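To check node load together with CPU usage, you can use the _cat/nodes API. The following request is a minimal sketch that you can run in the Kibana console:

  GET _cat/nodes?v&h=name,load_1m,cpu,heap.percent

The load_1m column can be compared against the number of vCPUs on each node, and the cpu and heap.percent columns help you decide whether the load comes from CPU-bound work.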

Node memory usage_total (%)

This metric displays the system memory usage of the node.

Node CPU IO wait percentage (%)

This metric displays the CPU IO wait percentage of the node.

Node network packets_input (count)

This metric displays the number of inbound packets for each node in the cluster. The monitoring cycle of the metric is 1 minute.

Node network packets_output (count)

This metric displays the number of outbound packets for each node in the cluster. The monitoring cycle of the metric is 1 minute.

Node network bandwidth_input (KiB/s)

This metric displays the inbound traffic rate of each node in the cluster. The monitoring cycle of the metric is 1 minute.

Node network bandwidth_output (KiB/s)

This metric displays the outbound traffic rate of each node in the cluster. The monitoring cycle of the metric is 1 minute.

Node TCP connections (count)

Description

This metric displays the number of TCP connection requests initiated by clients to each node in the cluster.

Exception causes

During monitoring, when the metric value spikes or significantly fluctuates, a service error occurs. A common cause is that TCP connections initiated by clients are not released for an extended period of time, resulting in a sudden increase in the number of TCP connections on nodes. We recommend that you configure related policies for your client to release connections.

IOUtil (%)

Description

This metric displays the IO usage percentage of each node in the cluster.

Exception causes

If the value of the metric spikes or significantly fluctuates during monitoring, a service error occurs. This issue may be caused by high disk usage. High disk usage increases the average wait time for data read and write operations, resulting in a sudden increase in IO usage, which may even reach 100%. We recommend that you troubleshoot the issue based on your cluster configuration and other metrics. For example, you can upgrade the configuration of your cluster.

Node network retransmission rate (%)

This metric displays the network retransmission rate of the node.

Node network bandwidth (KiB/s)

Node network bandwidth (KiB/s) = Node network bandwidth_input (KiB/s) + Node network bandwidth_output (KiB/s).

Node network bandwidth usage (%)

Node network bandwidth usage (%) = (Node network bandwidth_input (KiB/s) + Node network bandwidth_output (KiB/s)) / Node network base bandwidth (Gbit/s).

Node network packets (count)

Node network packets (count) = Node network packets_output (count) + Node network packets_input (count).

Node network packet usage (%)

Node network packet usage (%) = (Node network packets_output (count) + Node network packets_input (count)) / Node network packet forwarding PPS.

Disk bandwidth_read (MiB/s)

This metric displays the amount of data read from each node in the cluster per second.

Disk bandwidth_write (MiB/s)

This metric displays the amount of data written to each node in the cluster per second.

Disk IOPS_read (count)

This metric displays the number of read requests completed per second on each node in the cluster.

Disk IOPS_write (count)

This metric displays the number of write requests completed per second by each node in the cluster.

Average request queue length

This metric displays the average length of the request queue.

Disk bandwidth (MiB/s)

Disk bandwidth (MiB/s) = Disk bandwidth_read (MiB/s) + Disk bandwidth_write (MiB/s).

Disk bandwidth usage_disk (%)

Disk bandwidth usage_disk (%) = (Disk bandwidth_read (MiB/s) + Disk bandwidth_write (MiB/s)) / ESSD single disk throughput (MB/s).

Disk bandwidth usage_node (%)

Disk bandwidth usage_node (%) = (Disk bandwidth_read (MiB/s) + Disk bandwidth_write (MiB/s)) / Node disk basic bandwidth (Gbit/s).

Disk IOPS (count)

Disk IOPS (count) = Disk IOPS_read (count) + Disk IOPS_write (count).

Disk IOPS usage_disk (%)

Disk IOPS usage_disk (%) = (Disk IOPS_read (count) + Disk IOPS_write (count)) / ESSD single disk IOPS.

Disk IOPS usage_node (%)

Disk IOPS usage_node (%) = (Disk IOPS_read (count) + Disk IOPS_write (count)) / Node disk basic IOPS.

Node old generation usage (B)

Description

This metric shows the size of the old generation heap memory usage for each node in the cluster. When the old generation occupies a high percentage or contains large memory objects, it affects cluster services and automatically triggers garbage collection (GC) operations. The collection of large objects may result in long GC durations or full GC.

Exception causes

If the value of the metric spikes or significantly fluctuates, an error occurs. This issue may be caused by one or more reasons described in the following table.

Exception cause

Description

QPS

Query QPS or Write QPS traffic spikes or significantly fluctuates.

Aggregation queries or script queries

Aggregation queries require a large amount of computing resources, especially memory, for data aggregation. Use them with caution.

Term queries on numeric fields

When you perform term queries on many numeric fields (byte, short, integer, long), constructing the bitset of matching document IDs is time-consuming and slows queries. If a numeric field does not require range or aggregation queries, we recommend that you map it as a keyword field.

Fuzzy matching

Wildcard, regular expression, and fuzzy queries must traverse the term list in the inverted index to find all matching terms and then collect the document IDs for each term. Large-scale queries of this kind consume a large amount of computing resources, especially if they have not been stress tested in advance. We recommend that you run stress tests for your scenario before you use these features and select an appropriate scale.

The cluster receives a few slow query or write requests

In this case, the query and write QPS traffic fluctuations are small or not obvious. You can go to the Query logs page in the Alibaba Cloud Elasticsearch console and click Searching Slow Logs to view and analyze the logs.

The cluster receives many slow query or write requests

In this case, the query and write QPS traffic fluctuates significantly or noticeably. You can go to the Query logs page in the Alibaba Cloud Elasticsearch console and click Indexing Slow Logs to view and analyze the logs.

The cluster stores many indexes or shards

The system monitors indexes stored on the cluster and logs index changes. If the cluster stores excessive indexes or shards, the CPU utilization, heap memory usage, or minute-average node load may reach a high level.

Merge operations are performed on the cluster

Merge operations consume CPU resources, and the Segment Count of the corresponding node will drop sharply. You can check this on the Overview page of the node in the Kibana console.

GC operations are performed

GC operations attempt to free up memory (for example, Full GC), consume CPU resources, and may cause a significant decrease in heap memory usage.

Scheduled tasks are performed on the cluster

Scheduled tasks, such as data backup or custom tasks, are performed on the cluster.

Full GC count (Count)

Important

Frequent Full GC occurrences in the system affect cluster service performance.

Description

This metric displays the total number of full GC operations in the cluster within 1 minute.

Exception causes

If the value of this metric is not 0, an error has occurred. This issue may be caused by one or more of the following reasons:

  • High heap memory usage in the cluster.

  • Large objects stored in the cluster memory.
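To check GC activity on each node, you can query the JVM section of the node stats API. The following request is a minimal sketch that you can run in the Kibana console:

  GET _nodes/stats/jvm

In the response, jvm.gc.collectors.old.collection_count and jvm.gc.collectors.old.collection_time_in_millis show how often old-generation GC runs and how long it takes, and jvm.mem.heap_used_percent shows the current heap memory usage.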

Node Old GC frequency (count)

Metric description

This metric indicates the number of Old Generation garbage collections on each node in the cluster. When the Old Generation occupies a high percentage or contains large memory objects, it affects cluster services and automatically triggers garbage collection operations. The collection of large objects may result in long GC durations or Full GC.

Note

The Full GC basic monitoring metric is obtained through logs, while memory metrics in advanced monitoring depend on ES engine collection. These two methods have differences in data acquisition and application. We recommend that you comprehensively evaluate cluster performance by combining all metrics.

Exception causes

For more information, see Node old generation usage (B).

Node Old GC duration (ms)

Description

This metric indicates the average time spent on Old generation garbage collection for each node in the cluster. When the Old generation area usage is high or large memory objects exist, GC operations are automatically triggered. The collection of large objects may result in longer GC durations or Full GC.

Exception causes

For more information, see Node old generation usage (B).

Query thread pool active threads (count)

Indicates the number of threads in the query thread pool that are currently executing tasks in the cluster.

Number of rejected requests in the query thread pool (count)

Indicates the number of rejected requests in the query thread pool within the cluster. This metric is deprecated. We recommend that you use Rejected requests in query thread pool (new version) (count).

Rejected requests in query thread pool (new version) (count)

Indicates the number of rejected requests in the query thread pool within the cluster. When all threads in the thread pool are processing tasks and the task queue is full, new query requests are rejected and exceptions are thrown.
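To check rejections per node in real time, you can use the _cat/thread_pool API. The following request is a minimal sketch that you can run in the Kibana console:

  GET _cat/thread_pool/search?v&h=node_name,name,active,queue,rejected

The rejected column accumulates the number of search requests that each node rejected because the thread pool and its queue were full.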

NodeStatsExceptionLogCount(Count)

Description

This metric shows the total number of warning-level logs that appear in the main log of the cluster within 1 minute.

Exception causes

During monitoring, when the metric value is not 0, the service is abnormal. Common causes include the following:

  • The cluster receives abnormal query requests.

  • The cluster receives abnormal write requests.

  • Errors occur when the cluster runs tasks.

  • Garbage collection operations have been executed.

Suggestions for handling exceptions

You can go to the LogSearch page in the Alibaba Cloud Elasticsearch console, and click Main Logs. On the Main Logs page, you can view detailed exception information based on the time point and analyze the cause of the exception.

Note

If there are GC records in the Main Logs, they will also be counted and displayed in the Exception Count monitoring metric.