Alibaba Cloud Elasticsearch provides a variety of metrics to monitor the statuses of clusters. The metrics include ClusterStatus, ClusterQueryQPS(Count/Second), NodeCPUUtilization(%), and NodeDiskUtilization(%). These metrics help you obtain the statuses of your clusters in real time and address potential issues at the earliest opportunity to ensure cluster stability. This topic describes how to view cluster monitoring data. This topic also describes the metrics and the causes of exceptions, and provides suggestions for handling the exceptions.

View cluster monitoring data

  1. Log on to the Elasticsearch console.
  2. In the left-side navigation pane, click Elasticsearch Clusters.
  3. Navigate to the desired cluster.
    1. In the top navigation bar, select the resource group to which the cluster belongs and the region where the cluster resides.
    2. In the left-side navigation pane, click Elasticsearch Clusters. On the Elasticsearch Clusters page, find the cluster and click its ID.
  4. In the left-side navigation pane of the page that appears, choose Monitoring and Logs > Cluster Monitoring. The Basic Monitoring tab appears.
  5. On the Basic Monitoring tab, select a period of time to view the detailed monitoring data that is generated within that period of time.
    You can also click the View the monitoring data within a customized period of time icon, configure Start At and End At, and then click OK to view the detailed monitoring data that is generated within the customized period of time.
    Note The monitoring and alerting features are enabled for Elasticsearch clusters by default. Therefore, you can view historical monitoring data on the Cluster Monitoring page. Monitoring data is collected at 1-minute intervals and is retained only for 30 days.

ClusterStatus

Description

The ClusterStatus metric indicates the health status of a cluster. The value 0.00 indicates that the cluster is normal. You must configure alerting for this metric. For more information, see Configure cluster alerting. The following table describes the values of the metric.
Value | Color | Status | Description
2.00 | Red | Not all of the primary shards are available. | One or more indexes have unassigned primary shards.
1.00 | Yellow | All primary shards are available, but not all of the replica shards are available. | One or more indexes have unassigned replica shards.
0.00 | Green | All primary and replica shards are available. | All the indexes stored on the cluster are healthy and do not have unassigned shards.
Note The colors in the table refer to the colors displayed by the Status parameter on the Basic Information page of a cluster.
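
The value-to-status mapping above can be sketched in code. The following helper is hypothetical and is not part of any Alibaba Cloud SDK; it only restates the table for use in a custom monitoring script.

```python
# Hypothetical helper that maps ClusterStatus metric values to the health
# states described in the table above. Not part of any Alibaba Cloud SDK.
STATUS_MAP = {
    0.0: ("Green", "All primary and replica shards are available."),
    1.0: ("Yellow", "All primary shards are available, but some replica shards are unassigned."),
    2.0: ("Red", "One or more primary shards are unassigned."),
}

def describe_cluster_status(value):
    color, detail = STATUS_MAP.get(value, ("Unknown", "Unexpected metric value."))
    return f"{color}: {detail}"

print(describe_cluster_status(2.0))  # Red: One or more primary shards are unassigned.
```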

Exception causes

If the value of the metric is not 0.00, the cluster is abnormal. This issue may be caused by one or more of the following reasons:
  • The CPU utilization or heap memory usage of the nodes in the cluster is excessively high or reaches 100%.
  • The disk usage of the nodes in the cluster is excessively high. For example, the disk usage is higher than 85% or reaches 100%.
  • The minute-average node load indicated by the NodeLoad_1m metric is excessively high.
  • The statuses of the indexes stored on the cluster are abnormal (not green).

Suggestions for handling exceptions

  • You can view monitoring data on the Monitoring page of the Kibana console or view the logs of the cluster to obtain the specific information of the issue and troubleshoot it. For example, if you identify that the indexes stored on your cluster occupy excessive memory, you can delete some indexes.
  • If the issue is caused by excessive disk usage, we recommend that you troubleshoot this issue based on the instructions provided in High disk usage and read-only indexes.
  • If the specifications of the cluster are 1 vCPU and 2 GiB of memory, we recommend that you increase the specifications based on the vCPU-to-memory ratio of 1:4. If the cluster is still abnormal after you increase the specifications, we recommend that you troubleshoot the issue based on the preceding two solutions.

ClusterIndexQPS(Count/Second)

Notice If the write QPS of a cluster spikes, the CPU utilization, heap memory usage, or minute-average node load of the cluster may reach a high level. This may affect your services that run on the cluster.

Description

The ClusterIndexQPS(Count/Second) metric indicates the number of documents that are written to the cluster per second. The following descriptions provide specific information about the value of the metric:
  • If the cluster receives a write request that contains only one document within 1 second, the value of this metric is 1. The value increases with the number of write requests received per second.
  • If the cluster receives a _bulk write request that contains multiple documents within 1 second, the value of this metric is the number of the documents in the request. The value increases with the number of _bulk write requests received per second.
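
The counting rules above can be sketched as follows. This is a hypothetical illustration of how the metric value is derived, not an actual API: each element represents one write request received within the same second, valued at the number of documents it carries.

```python
# Hypothetical sketch of how ClusterIndexQPS(Count/Second) is counted,
# based on the description above: a single-document write counts as 1,
# and a _bulk request counts as the number of documents it carries.

def index_qps(doc_counts_in_one_second):
    """Each element is the number of documents in one write request
    received within the same second."""
    return sum(doc_counts_in_one_second)

print(index_qps([1, 1, 1]))  # three single-document writes -> 3
print(index_qps([500, 1]))   # one _bulk with 500 documents plus one single write -> 501
```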

ClusterQueryQPS(Count/Second)

Notice If the query QPS of a cluster spikes, the CPU utilization, heap memory usage, or minute-average node load of the cluster may reach a high level. This may affect your services that run on the cluster.

Description

The ClusterQueryQPS(Count/Second) metric indicates the number of queries that the cluster processes per second. The value depends on the number of primary shards in the index that you query, because each query request is executed on all primary shards of the index.

For example, if the index that you query has five primary shards, one query request per second is counted as five queries per second.
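
The example above can be sketched as a small calculation. This is a hypothetical illustration that assumes each query request is counted once per primary shard of the queried index.

```python
# Hypothetical sketch: if each query request is executed on every primary
# shard of the index, the metric counts one query per shard.

def query_qps(requests_per_second, primary_shards):
    return requests_per_second * primary_shards

print(query_qps(1, 5))   # one request per second against a 5-shard index -> 5
print(query_qps(10, 3))  # ten requests per second against a 3-shard index -> 30
```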

NodeCPUUtilization(%)

Description

The NodeCPUUtilization(%) metric indicates the CPU utilization of each node in a cluster. If the CPU utilization is high or close to 100%, the services that run on the cluster are affected.

Exception causes

If the value of the metric spikes or significantly fluctuates, an error occurs. This issue may be caused by one or more reasons described in the following table.
Exception cause | Description
An exception occurs on the QPS of the cluster. | The query or write QPS spikes or significantly fluctuates.
The cluster receives a few slow query or write requests. | In this case, you may not observe spikes or fluctuations in the query or write QPS. You can log on to the Elasticsearch console, go to the Logs page of the cluster, and then click the Search Slow Log tab to view and analyze log information.
The cluster stores a large number of indexes or shards. | The system monitors the indexes stored on the cluster and logs index changes. If the cluster stores excessive indexes or shards, the CPU utilization, heap memory usage, or minute-average node load may reach a high level.
Merge operations are performed on the cluster. | Merge operations consume CPU resources, and the number of segments on the related node significantly decreases. You can check the number of segments on the Overview page of the node in the Kibana console.
Garbage collection (GC) operations are performed on the cluster. | GC operations, such as full GC, are used to release memory resources. However, these operations consume CPU resources. As a result, the CPU utilization may spike.
Scheduled tasks are performed on the cluster. | Scheduled tasks, such as data backup or custom tasks, are performed on the cluster.
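
The two symptom patterns named above, a spike and a significant fluctuation, can be sketched as simple heuristics over a series of per-minute CPU utilization samples. These functions and thresholds are illustrative assumptions, not an Alibaba Cloud API.

```python
# Illustrative heuristics (not an Alibaba Cloud API) for flagging the two
# patterns described above in per-minute CPU utilization samples.

def is_spiking(samples, threshold=90.0):
    """A spike: at least one sample reaches or exceeds the threshold."""
    return max(samples) >= threshold

def is_fluctuating(samples, band=30.0):
    """Significant fluctuation: the samples span a wide range."""
    return max(samples) - min(samples) >= band

cpu = [35.0, 40.0, 95.0, 38.0]
print(is_spiking(cpu), is_fluctuating(cpu))  # True True
```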

NodeHeapMemoryUtilization(%)

Description

The NodeHeapMemoryUtilization(%) metric indicates the heap memory usage of each node in a cluster. If the heap memory usage is high or a large object is stored in the memory, the services that run on your cluster are affected. This also triggers a GC operation.

Exception causes

If the value of the metric spikes or significantly fluctuates, an error occurs. This issue may be caused by one or more reasons described in the following table.
Exception cause | Description
An exception occurs on the QPS of the cluster. | The query or write QPS spikes or significantly fluctuates.
The cluster receives a few slow query or write requests. | In this case, you may not observe spikes or fluctuations in the query or write QPS. You can log on to the Elasticsearch console, go to the Logs page of the cluster, and then click the Search Slow Log tab to view and analyze log information.
The cluster receives a large number of slow query or write requests. | In this case, you may observe significant spikes or fluctuations in the query and write QPS. You can log on to the Elasticsearch console, go to the Logs page of the cluster, and then click the Indexing Slow Log tab to view and analyze log information.
The cluster stores a large number of indexes or shards. | The system monitors the indexes stored on the cluster and logs index changes. If the cluster stores excessive indexes or shards, the CPU utilization, heap memory usage, or minute-average node load may reach a high level.
Merge operations are performed on the cluster. | Merge operations consume CPU resources, and the number of segments on the related node significantly decreases. You can check the number of segments on the Overview page of the node in the Kibana console.
GC operations are performed on the cluster. | GC operations, such as full GC, are used to release memory resources. However, these operations consume CPU resources. As a result, the heap memory usage significantly decreases.
Scheduled tasks are performed on the cluster. | Scheduled tasks, such as data backup or custom tasks, are performed on the cluster.

NodeLoad_1m

Description

The NodeLoad_1m metric indicates the average load of each node in a cluster over the last minute. You can use this metric to determine whether a node is busy. In normal cases, the value of this metric is less than the number of vCPUs on the node. The following table describes the values of the metric for a node that has only one vCPU.
Value of NodeLoad_1m | Description
< 1 | No pending processes exist.
= 1 | The system does not have idle resources to run more processes.
> 1 | Processes are queuing for resources.

Exception causes

If the value of the metric exceeds the number of vCPUs on a node, an error occurs. This issue may be caused by one or more of the following reasons:
  • The CPU utilization or heap memory usage of the nodes in the cluster is excessively high or reaches 100%.
  • The query or write QPS spikes or significantly fluctuates.
  • The cluster receives slow query requests.

    You can log on to the Elasticsearch console, go to the Logs page of the cluster, and then click the required tab to view and analyze log information.

NodeDiskUtilization(%)

Description

The NodeDiskUtilization(%) metric indicates the disk usage of each node in a cluster. We recommend that you set the alert threshold for this metric to a value less than 75%, and do not set it to a value greater than 85%. If the disk usage of a node exceeds the values described in the following table, your services that run on the cluster may be affected.
Value of NodeDiskUtilization(%) | Description
> 85% | New shards cannot be allocated to the node.
> 90% | The system attempts to migrate the shards on a data node whose disk usage is greater than 90% to a data node with low disk usage.
> 95% | The system forcefully adds the read_only_allow_delete attribute to all the indexes stored on the cluster. As a result, you cannot write data to the indexes. You can only read data from or delete the indexes.
Notice We recommend that you configure this metric. After the related alerts are triggered, you can resize disks, add nodes, or delete index data at the earliest opportunity to ensure that your services are not affected.
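
The disk-usage thresholds above can be sketched as a small classifier. This is an illustrative restatement of the table for use in a custom alert handler, not an Alibaba Cloud API.

```python
# Hypothetical classifier for the disk-usage thresholds described above.

def disk_usage_effect(usage_percent):
    if usage_percent > 95:
        return "indexes forced read-only (read_only_allow_delete)"
    if usage_percent > 90:
        return "shards migrated to nodes with lower disk usage"
    if usage_percent > 85:
        return "new shards cannot be allocated"
    return "normal"

print(disk_usage_effect(88))  # new shards cannot be allocated
print(disk_usage_effect(96))  # indexes forced read-only (read_only_allow_delete)
```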

NodeStatsFullGcCollectionCount(unit)

Notice If full GC is frequently triggered on a cluster, the services that run on the cluster are affected.

Description

The NodeStatsFullGcCollectionCount(unit) metric indicates the number of times that full GC is triggered for a cluster within 1 minute.

Exception causes

If the value of the metric is not 0, an error occurs. This issue may be caused by one or more of the following reasons:
  • The heap memory usage of the cluster is high.
  • Large objects are stored in the memory of the cluster.

NodeStatsExceptionLogCount(unit)

Description

The NodeStatsExceptionLogCount(unit) metric indicates the number of warning-level entries generated in a cluster log within 1 minute.

Exception causes

If the value of the metric is not 0, an error occurs. This issue may be caused by one or more of the following reasons:
  • The cluster receives abnormal query requests.
  • The cluster receives abnormal write requests.
  • Errors occur when the cluster runs tasks.
  • GC operations are performed on the cluster.

Suggestions for handling exceptions

You can log on to the Elasticsearch console, go to the Logs page of the cluster, and then click the Cluster Log tab. On the Cluster Log tab, find the exceptions that occurred at a specific time and analyze their causes.
Note The NodeStatsExceptionLogCount(unit) metric also counts the GC operations that are recorded in cluster logs.

ClusterAutoSnapshotLatestStatus

Description

The ClusterAutoSnapshotLatestStatus metric indicates the status of the Auto Snapshot feature of a cluster. If the value of the metric is -1 or 0, the feature is running normally. The following table describes the values of the metric.
Value of ClusterAutoSnapshotLatestStatus | Description
0 | Snapshots are created.
-1 | No snapshots are created.
1 | The system is creating a snapshot.
2 | The system failed to create a snapshot.
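
The status codes above can be restated as a small lookup, with a health check that treats -1 and 0 as normal. This mapping is a hypothetical sketch for a custom monitoring script, not part of any Alibaba Cloud SDK.

```python
# Hypothetical mapping of ClusterAutoSnapshotLatestStatus values, based on
# the table above. Values -1 and 0 indicate that the feature runs normally.

SNAPSHOT_STATUS = {
    0: "snapshots are created",
    -1: "no snapshots are created",
    1: "a snapshot is being created",
    2: "snapshot creation failed",
}

def snapshot_feature_healthy(value):
    return value in (0, -1)

print(snapshot_feature_healthy(0))  # True
print(snapshot_feature_healthy(2))  # False
```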

Exception causes

If the value of the metric is 2, an error occurs. This issue may be caused by one or more of the following reasons:
  • The disk usage of the nodes in the cluster is excessively high or close to 100%.
  • The cluster is abnormal.