This topic describes the monitoring metrics for Alibaba Cloud Elasticsearch. These metrics, such as ClusterStatus, ClusterQueryQPS(Count/Second), NodeCPUUtilization(%), and NodeDiskUtilization(%), are used to monitor the running status of Elasticsearch clusters. You can obtain the running status of the clusters in real time and address potential issues to ensure the stable running of the clusters.

Overview

Monitoring metric details

The following monitoring metrics are provided: ClusterStatus, ClusterQueryQPS(Count/Second), ClusterIndexQPS(Count/Second), NodeCPUUtilization(%), NodeDiskUtilization(%), NodeHeapMemoryUtilization(%), NodeLoad_1m, NodeStatsFullGcCollectionCount(unit), NodeStatsExceptionLogCount(unit), and ClusterAutoSnapshotLatestStatus.

ClusterStatus

The ClusterStatus metric indicates the health status of an Elasticsearch cluster. The value 0.00 indicates that the cluster is normal. You must configure an alert rule for this metric. For more information, see Configure the monitoring and alerting feature in Cloud Monitor.

If the status is not displayed in green on the Basic Information page, the value of the metric is not 0.00. This indicates that the cluster is abnormal. The following content provides a few common reasons for this issue:
  • The CPU utilization or heap memory usage of the nodes in the cluster is too high or reaches 100%.
  • The disk usage of the nodes in the cluster is too high (higher than 85%) or reaches 100%.
  • The minute-average node workload (NodeLoad_1m) is too high.
  • The status of the indexes in the cluster is abnormal (not green).
The following table describes the values of the ClusterStatus metric.
Value | Color  | Status | Description
2.00  | Red    | Not all of the primary shards are available. | One or more indexes have unassigned primary shards.
1.00  | Yellow | All primary shards are available, but not all of the replica shards are available. | One or more indexes have unassigned replica shards.
0.00  | Green  | All primary and replica shards are available. | All indexes in the cluster are healthy and do not have unassigned shards.
Note The colors in the table indicate those of Status on the Basic Information page of a cluster.
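The table above can be sketched in code. The following is a minimal, illustrative example that maps the textual status returned by the Elasticsearch _cluster/health API to the numeric ClusterStatus values; the sample response fragment is hypothetical, not taken from a real cluster.

```python
import json

# ClusterStatus metric values for each health status, per the table above.
STATUS_TO_METRIC = {"green": 0.00, "yellow": 1.00, "red": 2.00}

# Illustrative fragment of a _cluster/health response (not from a real cluster).
sample_health = json.loads("""
{
  "cluster_name": "es-cn-example",
  "status": "yellow",
  "unassigned_shards": 3
}
""")

def cluster_status_metric(health: dict) -> float:
    """Translate the textual health status into the numeric metric value."""
    return STATUS_TO_METRIC[health["status"]]

print(cluster_status_metric(sample_health))  # 1.0
```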

ClusterQueryQPS(Count/Second)

Notice If the query QPS of an Elasticsearch cluster spikes, the CPU utilization, heap memory usage, or minute-average node workload of the cluster may reach a high level. This may affect your services that run on the cluster.

The ClusterQueryQPS(Count/Second) metric indicates the number of queries processed by the cluster per second.

The number of queries counted per second depends on the number of primary shards of the index that is queried. For example, if an index has five primary shards, one query on the index is counted as five queries, because the query is run on each primary shard.
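The per-shard counting described above can be sketched as follows. This is a simplified illustration of how the metric value scales with shard count; the request rates and shard counts are hypothetical.

```python
def query_qps(requests_per_second: int, primary_shards: int) -> int:
    # A search request is run on every primary shard of the index, so one
    # request on a five-shard index is counted as five queries.
    return requests_per_second * primary_shards

print(query_qps(1, 5))  # 5
```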

ClusterIndexQPS(Count/Second)

Notice If the write QPS of an Elasticsearch cluster spikes, the CPU utilization, heap memory usage, or minute-average node workload of the cluster may reach a high level. This may affect your services that run on the cluster.

The ClusterIndexQPS(Count/Second) metric is calculated based on the number of write requests that a cluster receives per second and the number of documents that these requests write.

If a cluster receives only one write request within one second and the request only writes one document, the write QPS is 1. The value of the metric increases with the number of write requests received per second.

If the cluster receives a _bulk request that writes multiple documents within one second, the write QPS equals the number of documents to be written. The value of the metric increases with the number of _bulk requests received per second.
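The document-based counting for _bulk requests can be sketched as follows. The sample _bulk body is illustrative; the sketch counts documents by counting the action lines that precede them.

```python
# Illustrative _bulk request body that writes three documents.
bulk_body = """\
{"index": {"_index": "logs"}}
{"message": "first"}
{"index": {"_index": "logs"}}
{"message": "second"}
{"index": {"_index": "logs"}}
{"message": "third"}
"""

def bulk_write_qps(body: str) -> int:
    # A _bulk body alternates action lines and document lines, so the number
    # of documents written equals the number of "index"/"create" action lines.
    lines = [l for l in body.splitlines() if l.strip()]
    return sum(1 for l in lines if '"index"' in l or '"create"' in l)

print(bulk_write_qps(bulk_body))  # 3
```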

NodeCPUUtilization(%)

The NodeCPUUtilization(%) metric indicates the CPU utilization of each data node in an Elasticsearch cluster. When the CPU utilization is high or close to 100%, your services that run on the cluster are affected.

If the CPU utilization spikes or significantly fluctuates, an issue may have occurred. The following content provides a few common reasons for this issue:
  • The query QPS or write QPS spikes or significantly fluctuates.
  • The cluster receives a few slow queries or write requests.

    In this case, you may not find spikes or fluctuations in the query QPS and write QPS. You can log on to the Elasticsearch console, go to the Logs page of the cluster, and then click the Search Slow Log tab to analyze log data.

  • The cluster has a large number of indexes or shards.

    The system monitors indexes in the cluster and logs index changes. If the cluster has excessive indexes or shards, the CPU utilization, heap memory usage, or minute-average node workload may reach a high level.

  • Merge operations are performed on the cluster.

    Merge operations consume CPU resources. However, the number of segments on the related node significantly decreases. You can check the number of segments on the Overview page of the node in the Kibana console.

  • Garbage collection (GC) operations are performed on the cluster.

    GC operations, such as full GC, can be used to release memory resources. However, these operations consume CPU resources. As a result, the CPU utilization may spike.

  • Scheduled tasks, such as backup or custom tasks, are performed on the cluster.
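To inspect the values behind this metric, you can read per-node CPU utilization from the Elasticsearch _nodes/stats API. The following is a minimal sketch; the response fragment and node names are illustrative, and the 95% threshold is an assumed example value.

```python
import json

# Illustrative fragment of a _nodes/stats response (os.cpu.percent is the
# field that reports per-node CPU utilization).
sample_stats = json.loads("""
{
  "nodes": {
    "node-1": {"name": "es-data-1", "os": {"cpu": {"percent": 42}}},
    "node-2": {"name": "es-data-2", "os": {"cpu": {"percent": 97}}}
  }
}
""")

def busy_nodes(stats: dict, threshold: int = 95) -> list:
    """Return the names of nodes whose CPU utilization exceeds the threshold."""
    return [n["name"] for n in stats["nodes"].values()
            if n["os"]["cpu"]["percent"] > threshold]

print(busy_nodes(sample_stats))  # ['es-data-2']
```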

NodeDiskUtilization(%)

The NodeDiskUtilization(%) metric indicates the disk usage of each data node in an Elasticsearch cluster. The disk usage must be less than 85%. We recommend that you configure an alert rule for this metric. Otherwise, the following situations may occur, which can affect your services that run on the cluster:
  • By default, if the disk usage of a data node exceeds 85%, new shards cannot be allocated to the data node.
  • By default, if the disk usage of a data node exceeds 90%, Elasticsearch attempts to migrate the shards on this node to data nodes with low disk usage.
  • By default, if the disk usage of a data node exceeds 95%, Elasticsearch adds the read_only_allow_delete attribute to all indexes in the cluster. As a result, you cannot write data to the indexes. These indexes can only be read or deleted.
Notice Do not set the threshold for this metric to a value greater than 80%. We recommend that you set the threshold to a value less than 75%. When alerts are triggered, you can resize disks, add nodes, or delete index data to ensure that your services are not affected.
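The three default thresholds above correspond to Elasticsearch's low, high, and flood-stage disk watermarks. The following sketch summarizes their effects; the usage values are hypothetical.

```python
def disk_watermark_effect(disk_usage_percent: float) -> str:
    # Default watermark behavior described above; checked from the highest
    # threshold down so each usage level maps to a single effect.
    if disk_usage_percent > 95:
        return "indexes become read_only_allow_delete; writes are rejected"
    if disk_usage_percent > 90:
        return "shards are migrated to nodes with lower disk usage"
    if disk_usage_percent > 85:
        return "no new shards are allocated to this node"
    return "normal"

for usage in (70, 88, 92, 96):
    print(usage, "->", disk_watermark_effect(usage))
```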

NodeHeapMemoryUtilization(%)

The NodeHeapMemoryUtilization(%) metric indicates the heap memory usage of each data node in an Elasticsearch cluster. If the heap memory usage is high or a large object is stored in the memory, your services that run on the cluster are affected. This also triggers a GC operation.

If the heap memory usage spikes or significantly fluctuates, an issue may have occurred. The following content provides a few common reasons for this issue:
  • The query QPS or write QPS spikes or significantly fluctuates.
  • The cluster receives a few slow queries or write requests.

    In this case, you may not find spikes or fluctuations in the query QPS and write QPS. You can log on to the Elasticsearch console, go to the Logs page of the cluster, and then click the Search Slow Log tab to analyze log data.

  • The cluster receives a large number of slow queries or write requests.

    In this case, you may find spikes or fluctuations in the query QPS and write QPS. You can log on to the Elasticsearch console, go to the Logs page of the cluster, and then click the Indexing Slow Log tab to analyze log data.

  • The cluster has a large number of indexes or shards.

    The system monitors indexes in the cluster and logs index changes. If the cluster has excessive indexes or shards, the CPU utilization, heap memory usage, or minute-average node workload may reach a high level.

  • Merge operations are performed on the cluster.

    Merge operations consume CPU resources. However, the number of segments on the related node significantly decreases. You can check the number of segments on the Overview page of the node in the Kibana console.

  • GC operations are performed on the cluster.

    GC operations, such as full GC, release memory resources. As a result, the heap memory usage significantly decreases after the operations are complete. However, these operations consume CPU resources.

  • Scheduled tasks, such as backup or custom tasks, are performed on the cluster.
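As with CPU utilization, you can read the values behind this metric from the _nodes/stats API, where jvm.mem.heap_used_percent reports per-node heap usage. The following is a minimal sketch; the response fragment and node names are illustrative, and the 85% alert threshold is an assumed example value.

```python
import json

# Illustrative fragment of a _nodes/stats/jvm response.
sample_jvm = json.loads("""
{
  "nodes": {
    "node-1": {"name": "es-data-1", "jvm": {"mem": {"heap_used_percent": 63}}},
    "node-2": {"name": "es-data-2", "jvm": {"mem": {"heap_used_percent": 91}}}
  }
}
""")

def heap_alerts(stats: dict, threshold: int = 85) -> list:
    """Return (node name, heap %) pairs for nodes above the alert threshold."""
    return [(n["name"], n["jvm"]["mem"]["heap_used_percent"])
            for n in stats["nodes"].values()
            if n["jvm"]["mem"]["heap_used_percent"] > threshold]

print(heap_alerts(sample_jvm))  # [('es-data-2', 91)]
```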

NodeLoad_1m

The NodeLoad_1m metric indicates the workload of each node in an Elasticsearch cluster within one minute. You can use this metric to determine whether a node is busy. In normal cases, the value of this metric is less than the number of CPU cores on the node.

If the value exceeds the number of CPU cores on the node, an issue may have occurred. The following content provides a few common reasons for this issue:
  • The CPU utilization or heap memory usage is high or reaches 100%.
  • The query QPS or write QPS spikes or significantly fluctuates.
  • The cluster receives slow queries.

    The cluster may receive a few or a large number of slow queries. You can log on to the Elasticsearch console, go to the Logs page of the cluster, and click the required tab to analyze log data.

The following content describes the values of this metric for a single-core node:
  • NodeLoad_1m < 1: No pending processes exist.
  • NodeLoad_1m = 1: The system does not have idle resources to run more processes.
  • NodeLoad_1m > 1: Processes are queuing for resources.
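The single-core interpretation above generalizes to multi-core nodes by comparing the load value to the number of CPU cores. The following sketch illustrates this; the load values are hypothetical.

```python
def load_status(load_1m: float, cpu_cores: int = 1) -> str:
    # Generalization of the single-core rules above: the load value is
    # compared against the number of CPU cores on the node.
    if load_1m < cpu_cores:
        return "no pending processes"
    if load_1m == cpu_cores:
        return "fully utilized, no idle resources"
    return "processes are queuing for resources"

print(load_status(0.4))     # no pending processes
print(load_status(2.5, 2))  # processes are queuing for resources
```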

NodeStatsFullGcCollectionCount(unit)

Warning If full GC is frequently triggered for an Elasticsearch cluster, your services that run on the cluster are affected.

The NodeStatsFullGcCollectionCount(unit) metric indicates the number of times that full GC is triggered for an Elasticsearch cluster within one minute.

If the value is not 0, an issue may have occurred. The following content provides a few common reasons for this issue:

  • The heap memory usage is high.
  • Large objects are stored in the memory.
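A per-minute full GC count like this one can be derived from the cumulative old-generation collection counter (jvm.gc.collectors.old.collection_count) exposed by the _nodes/stats API, by differencing two samples taken one minute apart. This is a sketch under that assumption; the counter values are illustrative.

```python
def full_gc_per_minute(prev_old_gc_count: int, curr_old_gc_count: int) -> int:
    # The stats API exposes a cumulative counter, so the per-minute value is
    # the difference between two samples taken one minute apart.
    return curr_old_gc_count - prev_old_gc_count

print(full_gc_per_minute(128, 131))  # 3
```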

NodeStatsExceptionLogCount(unit)

The NodeStatsExceptionLogCount(unit) metric indicates the number of warning-level entries generated in the logs of an Elasticsearch cluster within one minute.

If the value is not 0, an issue may have occurred. The following content provides a few common reasons for this issue:

  • The cluster receives abnormal queries.
  • The cluster receives abnormal write requests.
  • Errors occur when the cluster runs tasks.
  • GC operations are performed on the cluster.
Note
  • Log on to the Elasticsearch console, go to the Logs page of the cluster, and click the Cluster Log tab. On the Cluster Log tab, find exceptions that occurred at a specific point in time and analyze their causes.
  • The NodeStatsExceptionLogCount(unit) metric also counts the GC operations that are recorded in cluster logs.

ClusterAutoSnapshotLatestStatus

The ClusterAutoSnapshotLatestStatus metric indicates the status of the Auto Snapshot feature of an Elasticsearch cluster. If the value is -1 or 0, the feature is running normally.

If the value is 2, an issue has occurred. The following content provides a few common reasons for this issue:

  • The disk usage of nodes is high or close to 100%.
  • The cluster is abnormal.
The following content describes the values of this metric:
  • 0: Snapshots are created.
  • -1: No snapshots are created.
  • 1: The system is creating a snapshot.
  • 2: The system failed to create a snapshot.
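The value list above can be sketched as a simple lookup. This is an illustrative helper, not part of any API; it encodes the interpretation described in this section.

```python
# ClusterAutoSnapshotLatestStatus values, per the list above.
SNAPSHOT_STATUS = {
    0: "snapshots are created",
    -1: "no snapshots are created",
    1: "the system is creating a snapshot",
    2: "the system failed to create a snapshot",
}

def snapshot_ok(value: int) -> bool:
    # Per the description above, -1 and 0 indicate that the feature runs
    # normally; 2 indicates a failure that needs attention.
    return value in (-1, 0)

print(SNAPSHOT_STATUS[2], "->", "ok" if snapshot_ok(2) else "needs attention")
```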