
Elasticsearch: Metrics and exception handling suggestions

Last Updated: Feb 21, 2024

Alibaba Cloud Elasticsearch provides a variety of basic metrics to monitor the statuses of clusters. The metrics include ClusterStatus(value), ClusterQueryQPS(Count/Second), NodeCPUUtilization(%), and NodeDiskUtilization(%). These metrics help you obtain the statuses of your clusters in real time and address potential issues at the earliest opportunity to ensure cluster stability. This topic describes how to view cluster monitoring data. This topic also describes the metrics and the causes of exceptions, and provides suggestions for handling the exceptions.

Differences from other monitoring features

The cluster monitoring feature provided by Alibaba Cloud Elasticsearch may be different from the monitoring feature provided by Kibana or a third-party service in the following aspects:

  • Data collection period: The cluster monitoring feature and the monitoring feature provided by Kibana or a third-party service use different data collection periods. As a result, the data that is collected by these features may differ.

  • Algorithms for monitoring data queries: Both the cluster monitoring feature and the monitoring feature provided by Kibana are affected by cluster stability when they collect data. If an Elasticsearch cluster is unstable, the values of the QPS metrics provided by the cluster monitoring feature may spike, be negative, or be empty, whereas the monitoring feature provided by Kibana may display no monitoring results.

    Note

    The cluster monitoring feature provides more metrics than the monitoring feature provided by Kibana. We recommend that you use both features for monitoring in your business scenarios.

  • APIs used to collect data: Data of the metrics provided by the monitoring feature of Kibana is collected by calling the Elasticsearch API. Data of some node-level metrics, such as NodeCPUUtilization(%), NodeLoad_1m(value), and NodeDiskUtilization(%), provided by the cluster monitoring feature is collected by calling the underlying system API of Alibaba Cloud Elasticsearch. Therefore, in addition to the monitoring data of Elasticsearch processes, the cluster monitoring feature also collects the usage of system resources.
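Although node-level metrics such as NodeCPUUtilization(%) are collected through the underlying system API, you can cross-check the process-level and OS-level figures that the Elasticsearch API itself exposes by calling the _nodes/stats API. The following is a minimal sketch that assumes the Elasticsearch Python client (8.x) and a placeholder endpoint and credentials:

```python
from elasticsearch import Elasticsearch

# Placeholder endpoint and credentials: replace them with your own cluster's values.
es = Elasticsearch(
    "https://es-cn-example.public.elasticsearch.aliyuncs.com:9200",
    basic_auth=("elastic", "your_password"),
)

# _nodes/stats exposes the OS- and process-level view that the Elasticsearch API
# can provide. The console metrics discussed in this topic come from the
# underlying system API, so small differences between the two are expected.
stats = es.nodes.stats(metric="os,process")
for node_id, node in stats["nodes"].items():
    print(
        node["name"],
        "os.cpu.percent =", node["os"]["cpu"]["percent"],
        "load_1m =", node["os"]["cpu"]["load_average"].get("1m"),
    )
```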

View cluster monitoring data

  1. Log on to the Alibaba Cloud Elasticsearch console.
  2. In the left-side navigation pane, click Elasticsearch Clusters.
  3. Navigate to the desired cluster.
    1. In the top navigation bar, select the resource group to which the cluster belongs and the region where the cluster resides.
    2. On the Elasticsearch Clusters page, find the cluster and click its ID.
  4. In the left-side navigation pane of the page that appears, choose Monitoring and Logs > Cluster Monitoring.

  5. On the Basic Monitoring tab, select a resource type and a period of time to view the detailed monitoring data that is generated for this type of resource within this period of time.

    You can also click Custom, configure Start Date and End Date, and then click Confirm to view the detailed monitoring data that is generated within the custom period of time.

    Note

    The alerting feature is enabled for Elasticsearch clusters by default. Therefore, you can view historical monitoring data on the Cluster Monitoring page. You can view monitoring data by minute, and monitoring data is retained only for 30 days.

ClusterStatus(value)

Description

The ClusterStatus(value) metric indicates the health status of a cluster. The value 0.00 indicates that the cluster is normal. An alert must be configured for this metric. For more information, see Configure cluster alerting. The following table describes the values of the metric.

| Value | Color | Status | Description |
| --- | --- | --- | --- |
| 2.00 | Red | Not all of the primary shards are available. | One or more indexes have unassigned primary shards. |
| 1.00 | Yellow | All primary shards are available, but not all of the replica shards are available. | One or more indexes have unassigned replica shards. |
| 0.00 | Green | All primary and replica shards are available. | All the indexes stored on the cluster are healthy and do not have unassigned shards. |

Note

The colors in the table refer to the colors displayed next to the Status parameter on the Basic Information page of a cluster.
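You can read the same health status directly from the _cluster/health API and map it to the values of this metric. The following is a minimal sketch that assumes the Elasticsearch Python client and placeholder connection details:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", basic_auth=("elastic", "your_password"))  # placeholder

health = es.cluster.health()
# Map the color reported by _cluster/health to the ClusterStatus(value) metric.
status_to_value = {"green": 0.00, "yellow": 1.00, "red": 2.00}
print("status:", health["status"], "-> ClusterStatus(value):", status_to_value[health["status"]])
print("unassigned shards:", health["unassigned_shards"])
```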

Exception causes

If the value of the metric is not 0.00, the cluster is abnormal. This issue may be caused by one or more of the following reasons:

  • The CPU utilization or heap memory usage of the nodes in the cluster is excessively high or reaches 100%.

  • The disk usage of the nodes in the cluster is excessively high. For example, the disk usage is higher than 85% or reaches 100%.

  • The minute-average node load indicated by the NodeLoad_1m(value) metric is excessively high.

  • The statuses of the indexes stored on the cluster are abnormal (not green).

Suggestions for handling exceptions

  • You can view monitoring data on the Monitoring page of the Kibana console or view the logs of the cluster to obtain details about the issue and troubleshoot it. For example, if you find that the indexes stored on your cluster occupy excessive memory, you can delete some of the indexes. A diagnostic sketch is provided after this list.

  • If the issue is caused by excessive disk usage, we recommend that you troubleshoot this issue based on the instructions provided in High disk usage and read-only indexes.

  • If the specifications of the cluster are 1 vCPU and 2 GiB of memory, we recommend that you increase the specifications based on the vCPU-to-memory ratio of 1:4. For more information about how to increase the specifications of an Elasticsearch cluster, see Upgrade the configuration of a cluster. If the cluster is still abnormal after you increase the specifications, we recommend that you troubleshoot the issue based on the preceding two solutions.
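Before you delete indexes or resize disks, it helps to identify which indexes are not green and why shards are unassigned. The following sketch, which assumes the Elasticsearch Python client and placeholder connection details, combines the _cat/indices API with the cluster allocation explain API:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", basic_auth=("elastic", "your_password"))  # placeholder

# List indexes that are not green, together with their shard counts and size.
for idx in es.cat.indices(format="json", h="index,health,pri,rep,store.size"):
    if idx["health"] != "green":
        print(idx)

# The allocation explain API explains an arbitrary unassigned shard. It returns
# an error if every shard is assigned, so call it only when the cluster is not green.
if es.cluster.health()["status"] != "green":
    explanation = es.cluster.allocation_explain()
    print("reason:", explanation["unassigned_info"]["reason"],
          "can_allocate:", explanation["can_allocate"])
```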

ClusterIndexQPS(Count/Second)

Important

If the write QPS of a cluster spikes, the CPU utilization, heap memory usage, or minute-average node load of the cluster may reach a high level. This may affect your services that run on the cluster.

Description

The ClusterIndexQPS(Count/Second) metric indicates the number of documents that are written to the cluster per second. The following descriptions provide specific information about the value of the metric:

  • If the cluster receives a write request that contains only one document within 1 second, the value of this metric is 1. The value increases with the number of write requests received per second.

  • If the cluster receives a _bulk write request that contains multiple documents within 1 second, the value of this metric is the number of the documents in the request. The value increases with the number of _bulk write requests received per second.
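Because every document in a _bulk request is counted, a single bulk request can raise this metric sharply. The following sketch, which assumes the Elasticsearch Python client, its bulk helper, and a hypothetical index named qps-demo, writes 500 documents in one _bulk request:

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("https://localhost:9200", basic_auth=("elastic", "your_password"))  # placeholder

# One _bulk request that carries 500 documents counts as 500 toward
# ClusterIndexQPS(Count/Second), not as 1.
actions = (
    {"_index": "qps-demo", "_source": {"seq": i, "message": f"doc {i}"}}
    for i in range(500)
)
ok, errors = helpers.bulk(es, actions)
print("indexed:", ok, "errors:", errors)
```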

ClusterQueryQPS(Count/Second)

Important

If the query QPS of a cluster spikes, the CPU utilization, heap memory usage, or minute-average node load of the cluster may reach a high level. This may affect your services that run on the cluster.

Description

The ClusterQueryQPS(Count/Second) metric indicates the number of queries processed by the cluster per second. This number depends on the number of primary shards for the index from which you want to query data.

For example, if the index from which you want to query data has five primary shards, your cluster can process five queries per second.
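Because the value of this metric scales with the number of primary shards of the queried index, check that number when you interpret spikes. The following sketch, which assumes the Elasticsearch Python client and a hypothetical index named qps-demo, reads the primary shard count from the index settings:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", basic_auth=("elastic", "your_password"))  # placeholder

# The number of primary shards determines how queries on this index are
# reflected in ClusterQueryQPS(Count/Second).
settings = es.indices.get_settings(index="qps-demo")
for name, cfg in settings.items():
    print(name, "primary shards:", cfg["settings"]["index"]["number_of_shards"])
```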

NodeCPUUtilization(%)

Description

The NodeCPUUtilization(%) metric indicates the CPU utilization of each node in a cluster. If the CPU utilization is high or close to 100%, the services that run on the cluster are affected.

Exception causes

If the value of the metric spikes or significantly fluctuates, an error occurs. This issue may be caused by one or more reasons described in the following table.

| Exception cause | Description |
| --- | --- |
| QPS | The query or write QPS spikes or significantly fluctuates. |
| The cluster receives a few slow query or write requests. | In this case, you may not observe spikes or fluctuations in the query QPS or write QPS. You can log on to the Elasticsearch console, go to the Logs page of the cluster, and then click the Search Slow Log tab to view and analyze log information on this tab. |
| The cluster stores a large number of indexes or shards. | The system monitors indexes stored on the cluster and logs index changes. If the cluster stores excessive indexes or shards, the CPU utilization, heap memory usage, or minute-average node load may reach a high level. |
| Merge operations are performed on the cluster. | Merge operations consume CPU resources, and the number of segments on the related node significantly decreases. You can check the number of segments on the Overview page of the node in the Kibana console. |
| Garbage collection (GC) operations are performed on the cluster. | GC operations, such as full GC, can be used to release memory resources. However, these operations consume CPU resources. As a result, the CPU utilization may spike. |
| Scheduled tasks are performed on the cluster. | Scheduled tasks, such as data backup or custom tasks, are performed on the cluster. |

Note

The NodeCPUUtilization(%) metric monitors the usage of system resources of Alibaba Cloud Elasticsearch and the resource usage of tasks that run on Elasticsearch clusters.
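When the CPU utilization spikes, the hot threads and thread pool APIs can show which of the causes in the preceding table, such as queries, merge operations, or GC operations, is consuming CPU resources. The following sketch assumes the Elasticsearch Python client and placeholder connection details:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", basic_auth=("elastic", "your_password"))  # placeholder

# Hot threads show what each node is busy with, for example search, merge, or GC.
print(es.nodes.hot_threads())

# Thread pool statistics reveal queued or rejected search and write tasks.
for row in es.cat.thread_pool(format="json", h="node_name,name,active,queue,rejected"):
    if row["name"] in ("search", "write"):
        print(row)
```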

NodeHeapMemoryUtilization(%)

Description

The NodeHeapMemoryUtilization(%) metric indicates the heap memory usage of each node in a cluster. If the heap memory usage is high or a large object is stored in the memory, the services that run on your cluster are affected. This also triggers a GC operation.

Exception causes

If the value of the metric spikes or significantly fluctuates, an error occurs. This issue may be caused by one or more reasons described in the following table.

| Exception cause | Description |
| --- | --- |
| QPS | The query or write QPS spikes or significantly fluctuates. |
| The cluster receives a few slow query or write requests. | In this case, you may not observe spikes or fluctuations in the query QPS or write QPS. You can log on to the Elasticsearch console, go to the Logs page of the cluster, and then click the Search Slow Log tab to view and analyze log information on this tab. |
| The cluster receives a large number of slow query or write requests. | In this case, you may observe significant spikes or fluctuations in the query QPS and write QPS. You can log on to the Elasticsearch console, go to the Logs page of the cluster, and then click the Indexing Slow Log tab to view and analyze log information on this tab. |
| The cluster stores a large number of indexes or shards. | The system monitors indexes stored on the cluster and logs index changes. If the cluster stores excessive indexes or shards, the CPU utilization, heap memory usage, or minute-average node load may reach a high level. |
| Merge operations are performed on the cluster. | Merge operations consume CPU resources, and the number of segments on the related node significantly decreases. You can check the number of segments on the Overview page of the node in the Kibana console. |
| GC operations are performed on the cluster. | GC operations, such as full GC, can be used to release memory resources. However, these operations consume CPU resources. As a result, the heap memory usage significantly decreases. |
| Scheduled tasks are performed on the cluster. | Scheduled tasks, such as data backup or custom tasks, are performed on the cluster. |
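You can read the per-node heap usage that this metric reports from the _nodes/stats API. The following sketch assumes the Elasticsearch Python client and placeholder connection details:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", basic_auth=("elastic", "your_password"))  # placeholder

# jvm.mem.heap_used_percent corresponds to NodeHeapMemoryUtilization(%).
stats = es.nodes.stats(metric="jvm")
for node_id, node in stats["nodes"].items():
    print(node["name"], "heap used %:", node["jvm"]["mem"]["heap_used_percent"])
```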

NodeLoad_1m(value)

Description

The NodeLoad_1m(value) metric indicates the load of each node in a cluster within 1 minute. You can determine whether a node is busy based on this metric. In normal cases, the value of this metric is less than the number of vCPUs for the node. The following table describes the values of the metric for a node that has only one vCPU.

| Value of NodeLoad_1m(value) | Description |
| --- | --- |
| < 1 | No pending processes exist. |
| = 1 | The system does not have idle resources to run more processes. |
| > 1 | Processes are queuing for resources. |
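You can compare the 1-minute load of each node with the number of vCPUs of the node by combining the _nodes/stats and _nodes/info APIs. The following sketch assumes the Elasticsearch Python client and placeholder connection details:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", basic_auth=("elastic", "your_password"))  # placeholder

info = es.nodes.info(metric="os")     # number of processors per node
stats = es.nodes.stats(metric="os")   # 1-minute load average per node

for node_id, node in stats["nodes"].items():
    load_1m = node["os"]["cpu"]["load_average"].get("1m")
    vcpus = info["nodes"][node_id]["os"]["available_processors"]
    # A healthy node keeps its 1-minute load below its vCPU count.
    print(node["name"], "load_1m:", load_1m, "vCPUs:", vcpus,
          "busy:", load_1m is not None and load_1m > vcpus)
```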

Note
  • The NodeLoad_1m(value) metric monitors the usage of system resources of Alibaba Cloud Elasticsearch and the resource usage of tasks that run on Elasticsearch clusters.

  • The value of the NodeLoad_1m(value) metric may fluctuate. This is normal. We recommend that you focus on the NodeCPUUtilization(%) metric for exception analysis.

Exception causes

If the value of the metric exceeds the number of vCPUs on a node, an error occurs. This issue may be caused by one or more of the following reasons:

  • The CPU utilization or heap memory usage of the nodes in the cluster is excessively high or reaches 100%.

  • The query or write QPS spikes or significantly fluctuates.

  • The cluster receives slow query requests.

    You can log on to the Elasticsearch console, go to the Logs page of the cluster, and then click the required tab to view and analyze log information.


NodeDiskUtilization(%)

Description

The NodeDiskUtilization(%) metric indicates the disk usage of each node in a cluster. We recommend that you set the threshold to a value less than 75%. Do not set the threshold for this metric to a value greater than 85%. Otherwise, the situations described in the following table may occur, which can affect your services that run on the cluster.

| Value of NodeDiskUtilization(%) | Description |
| --- | --- |
| > 85% | New shards cannot be allocated. |
| > 90% | The system attempts to migrate the shards on a data node whose disk usage is greater than 90% to a data node with low disk usage. |
| > 95% | The system forcefully adds the read_only_allow_delete attribute to all the indexes stored on the cluster. As a result, you cannot write data to the indexes. You can only read data from or delete the indexes. |

Important
  • We recommend that you configure this metric. After the related alerts are triggered, you can resize disks, add nodes, or delete index data at the earliest opportunity to ensure that your services are not affected.

  • The NodeDiskUtilization(%) metric monitors the usage of system resources of Alibaba Cloud Elasticsearch and the resource usage of tasks that run on Elasticsearch clusters.
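You can view per-node disk usage with the _cat/allocation API. After you resize disks or delete data, you can remove the read_only_allow_delete block that is added when disk usage exceeds 95%. The following sketch assumes the Elasticsearch 8.x Python client and placeholder connection details:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", basic_auth=("elastic", "your_password"))  # placeholder

# Per-node disk usage, the same quantity that NodeDiskUtilization(%) reports.
for row in es.cat.allocation(format="json", h="node,disk.percent,disk.used,disk.avail"):
    print(row)

# After resizing disks or deleting data, remove the read-only block that the
# system adds when disk usage exceeds 95%.
es.indices.put_settings(
    index="*",
    settings={"index.blocks.read_only_allow_delete": None},
)
```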

NodeStatsFullGcCollectionCount(Count)

Important

If full GC is frequently triggered on a cluster, the services that run on the cluster are affected.

Description

The NodeStatsFullGcCollectionCount(Count) metric indicates the number of times that full GC is triggered for a cluster within 1 minute.

Exception causes

If the value of the metric is not 0, an error occurs. This issue may be caused by one or more of the following reasons:

  • The heap memory usage of the cluster is high.

  • Large objects are stored in the memory of the cluster.
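You can read the old-generation GC counters of each node from the _nodes/stats API. The counters are cumulative, so sample them twice and take the difference to approximate the per-minute value of this metric. The following sketch assumes the Elasticsearch Python client and placeholder connection details:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", basic_auth=("elastic", "your_password"))  # placeholder

stats = es.nodes.stats(metric="jvm")
for node_id, node in stats["nodes"].items():
    old_gc = node["jvm"]["gc"]["collectors"]["old"]
    # Cumulative counters: sample twice and diff to approximate full GC per minute.
    print(node["name"],
          "old GC count:", old_gc["collection_count"],
          "old GC time (ms):", old_gc["collection_time_in_millis"],
          "heap used %:", node["jvm"]["mem"]["heap_used_percent"])
```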

NodeStatsExceptionLogCount(Count)

Description

The NodeStatsExceptionLogCount(Count) metric indicates the number of warning-level entries generated in a cluster log within 1 minute.

Exception causes

If the value of the metric is not 0, an error occurs. This issue may be caused by one or more of the following reasons:

  • The cluster receives abnormal query requests.

  • The cluster receives abnormal write requests.

  • Errors occur when the cluster runs tasks.

  • GC operations are performed on the cluster.

Suggestions for handling exceptions

Log on to the Elasticsearch console, go to the Logs page of the cluster, and then click the Cluster Log tab. On the Cluster Log tab, find the exceptions that occurred at a specific point in time and analyze their causes.

Note

The NodeStatsExceptionLogCount(Count) metric also counts the GC operations that are recorded in cluster logs.

ClusterAutoSnapshotLatestStatus(value)

Description

The ClusterAutoSnapshotLatestStatus(value) metric indicates the status of the Auto Snapshot feature of a cluster. If the value of the metric is -1 or 0, the feature is running normally. The following table describes the values of the metric.

| Value of ClusterAutoSnapshotLatestStatus(value) | Description |
| --- | --- |
| 0 | Snapshots are created. |
| -1 | No snapshots are created. |
| 1 | The system is creating a snapshot. |
| 2 | The system failed to create a snapshot. |

Exception causes

If the value of the metric is 2, an error occurs. This issue may be caused by one or more of the following reasons:

  • The disk usage of the nodes in the cluster is excessively high or close to 100%.

  • The cluster is abnormal.
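You can also check snapshot status with the _cat/snapshots API. The repository name used below is a placeholder assumption; list the repositories on your cluster first and replace the name with the repository that stores automatic snapshots. The sketch assumes the Elasticsearch Python client and placeholder connection details:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", basic_auth=("elastic", "your_password"))  # placeholder

# List the available repositories first; the repository name below is a
# placeholder and may differ on your cluster.
print(es.snapshot.get_repository())

for snap in es.cat.snapshots(repository="aliyun_auto_snapshot", format="json",
                             h="id,status,start_time,end_time"):
    print(snap)
```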

NodeStatsNetworkinPackages(count)

The NodeStatsNetworkinPackages(count) metric indicates the number of inbound packets for each node in the cluster. The monitoring cycle of the metric is 1 minute.

NodeStatsNetworkoutPackages(count)

The NodeStatsNetworkoutPackages(count) metric indicates the number of outbound packets for each node in the cluster. The monitoring cycle of the metric is 1 minute.

NodeStatsNetworkinRate(kB/s)

The NodeStatsNetworkinRate(kB/s) metric indicates the throughput of inbound packets for each node in the cluster. The monitoring cycle of the metric is 1 minute. Unit: KB/s.

NodeStatsNetworkoutRate(kB/s)

The NodeStatsNetworkoutRate(kB/s) metric indicates the throughput of outbound packets for each node in the cluster. The monitoring cycle of the metric is 1 minute. Unit: KB/s.

NodeStatsTcpEstablished(count)

Description

The NodeStatsTcpEstablished(count) metric indicates the number of requests that each node in the cluster receives from your client to establish TCP connections.

Exception causes

If the value of the metric spikes or significantly fluctuates, an error occurs. This issue may be caused by TCP connections that are not released for an extended period of time. We recommend that you configure related policies for your client to release TCP connections at the earliest opportunity.
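On the client side, a common way to avoid accumulating TCP connections is to create one long-lived client, reuse it across requests, and close it when your application exits, instead of constructing a new client for each request. The following sketch assumes the Elasticsearch Python client and a hypothetical index named qps-demo:

```python
from elasticsearch import Elasticsearch

# Create the client once and reuse it. Each client maintains its own connection
# pool, so constructing a client per request leaves many TCP connections open.
es = Elasticsearch("https://localhost:9200", basic_auth=("elastic", "your_password"),
                   request_timeout=10)  # placeholder endpoint and credentials

def search_once(keyword: str):
    # Reuse the shared client instead of constructing a new one here.
    return es.search(index="qps-demo", query={"match": {"message": keyword}})

try:
    print(search_once("doc")["hits"]["total"])
finally:
    es.close()  # release pooled TCP connections when the application exits
```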

NodeStatsDataDiskUtil(%)

Description

The NodeStatsDataDiskUtil(%) metric indicates the disk I/O usage of each node in the cluster.

Exception causes

If the value of the metric spikes or significantly fluctuates, an error occurs. This issue may be caused by high disk usage. High disk usage increases the average time of data read and write. As a result, the disk I/O usage spikes or even reaches 100%. We recommend that you troubleshoot the issue based on your cluster configuration and other metrics. For example, you can upgrade the configuration of your cluster.
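The device-level I/O counters behind this metric are exposed in the fs.io_stats section of the _nodes/stats API on Linux nodes. The counters are cumulative, so sample them twice and take the difference to obtain per-second rates. The following sketch assumes the Elasticsearch Python client and placeholder connection details:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", basic_auth=("elastic", "your_password"))  # placeholder

stats = es.nodes.stats(metric="fs")
for node_id, node in stats["nodes"].items():
    io = node["fs"].get("io_stats", {}).get("total")  # only present on Linux nodes
    if io:
        # Cumulative counters: sample twice and diff to get per-second rates.
        print(node["name"],
              "reads:", io["read_operations"], "writes:", io["write_operations"],
              "read KB:", io["read_kilobytes"], "write KB:", io["write_kilobytes"])
```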

NodeStatsDataDiskR(count)

The NodeStatsDataDiskR(count) metric indicates the number of data read requests that are completed by each node in the cluster per second.

NodeStatsDataDiskRm(MB/s)

The NodeStatsDataDiskRm(MB/s) metric indicates the amount of data that is read from each node in the cluster per second.

NodeStatsDataDiskW(count)

The NodeStatsDataDiskW(count) metric indicates the number of data write requests that are completed by each node in the cluster per second.

NodeStatsDataDiskWm(MB/s)

The NodeStatsDataDiskWm(MB/s) metric indicates the amount of data that is written to each node in the cluster per second.