This topic describes how to view the monitoring data of an Alibaba Cloud Elasticsearch cluster and explains the related monitoring metrics in detail.

View cluster monitoring data

  1. Log on to the Alibaba Cloud Elasticsearch console.
  2. Choose Instance ID/Name > Cluster Monitoring.
  3. On the Cluster Monitoring page, select a time period under Cluster Monitoring to view the detailed monitoring data generated within that period.
  4. To customize a time period, click the Custom icon, select a start date and an end date, and then click OK to view the detailed monitoring data generated within that period.
    Notice You can query monitoring data generated within the last 30 days at a granularity of one minute. A single query can cover a maximum of seven consecutive days.

    For more information about the monitor metrics, see Monitor metrics.

Monitor metrics

Monitor metric details

Alibaba Cloud Elasticsearch cluster monitoring supports the following metrics: ClusterStatus, ClusterQueryQPS(Count/Second), ClusterIndexQPS(Count/Second), NodeCPUUtilization(%), NodeHeapMemoryUtilization(%), NodeDiskUtilization(%), NodeLoad_1m, NodeStatsFullGcCollectionCount(unit), NodeStatsExceptionLogCount(unit), and ClusterAutoSnapshotLatestStatus.

ClusterStatus

The ClusterStatus metric indicates the status of an Elasticsearch cluster. The value 0.00 indicates that the cluster is in the Normal state. You must create an alert rule for this metric. For more information, see Create alert rules.

If the value of the metric is not 0 (the color is not green), the cluster is not in the Normal state. The common causes are as follows:
  • The CPU usage or heap memory usage of the nodes in the cluster is too high or reaches 100%.
  • The disk usage of the nodes in the cluster is too high (such as a usage higher than 85%) or reaches 100%.
  • The node workload within one minute (load_1m) is too high.
  • The status of the indexes in the cluster is abnormal (the color is not green).
The metric values and their definitions are as follows.
  • 2.00 (red): Not all primary shards are available. One or more indexes in the cluster have unassigned primary shards.
  • 1.00 (yellow): All primary shards are available, but not all replica shards are available. One or more indexes have unassigned replica shards.
  • 0.00 (green): All primary and replica shards are available. All indexes in the cluster are healthy, and no unassigned shards exist.
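
The same health information can be retrieved directly from the cluster through the _cluster/health API. The following Python sketch maps the reported status to the metric values listed above; the endpoint and credentials (ES_URL and AUTH) are placeholders that you must replace with your own.

    import requests

    ES_URL = "http://localhost:9200"      # placeholder: your cluster endpoint
    AUTH = ("elastic", "your_password")   # placeholder: your credentials

    # _cluster/health reports the same status that the ClusterStatus metric reflects.
    health = requests.get(f"{ES_URL}/_cluster/health", auth=AUTH).json()

    STATUS_TO_METRIC = {"green": 0.00, "yellow": 1.00, "red": 2.00}
    print("status:", health["status"], "-> metric value:", STATUS_TO_METRIC[health["status"]])
    print("unassigned shards:", health["unassigned_shards"])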

ClusterQueryQPS(Count/Second)

Notice When the cluster QPS spikes, the CPU usage, heap memory usage, or node workload within one minute may surge. This can adversely affect your businesses running in the cluster, so take measures to avoid such spikes.

The ClusterQueryQPS(Count/Second) metric indicates the number of queries processed by the cluster per second.

The number of queries processed per second is affected by the number of primary shards of the index that is queried. For example, if an index has five primary shards, the cluster can process five queries addressed to this index at the same time.
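
The monitoring service reports this metric for you, but a comparable figure can be approximated from the cluster's own statistics by sampling the cumulative search counter twice. The following is a rough sketch, not the exact formula that the monitoring service uses; ES_URL and AUTH are placeholders.

    import time
    import requests

    ES_URL = "http://localhost:9200"      # placeholder: your cluster endpoint
    AUTH = ("elastic", "your_password")   # placeholder: your credentials

    def query_total():
        # _stats/search exposes a cumulative query counter across all indexes.
        stats = requests.get(f"{ES_URL}/_stats/search", auth=AUTH).json()
        return stats["_all"]["total"]["search"]["query_total"]

    INTERVAL = 10                         # seconds between the two samples
    first = query_total()
    time.sleep(INTERVAL)
    second = query_total()
    print("approximate query QPS:", (second - first) / INTERVAL)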

ClusterIndexQPS(Count/Second)

Notice When the write QPS spikes, the CPU usage, heap memory usage, or node workload within one minute may surge. This can adversely affect your businesses running in the cluster, so take measures to avoid such spikes.

The value of the ClusterIndexQPS(Count/Second) metric is calculated based on the number of write requests that a cluster receives per second and the number of documents that these requests write.

If a cluster receives only one write request in one second and the request only writes one document, the write QPS is 1. The value of the metric increases with the number of write requests received per second.

If the cluster receives a _bulk request that writes multiple documents within one second, the write QPS equals the number of documents written by that request. The value of the metric also increases with the number of _bulk requests received per second.
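
The following sketch illustrates this counting rule with a single _bulk request that writes three documents, which adds 3 rather than 1 to the write QPS for that second. It assumes an Elasticsearch 7.x or later cluster; the index name test_index, as well as ES_URL and AUTH, are only placeholders.

    import requests

    ES_URL = "http://localhost:9200"      # placeholder: your cluster endpoint
    AUTH = ("elastic", "your_password")   # placeholder: your credentials

    # One _bulk request that writes three documents counts as 3 toward the write QPS.
    bulk_body = (
        '{"index":{"_index":"test_index"}}\n{"field":"doc1"}\n'
        '{"index":{"_index":"test_index"}}\n{"field":"doc2"}\n'
        '{"index":{"_index":"test_index"}}\n{"field":"doc3"}\n'
    )
    resp = requests.post(
        f"{ES_URL}/_bulk",
        data=bulk_body,
        headers={"Content-Type": "application/x-ndjson"},
        auth=AUTH,
    ).json()
    print("errors:", resp["errors"], "documents written:", len(resp["items"]))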

NodeCPUUtilization(%)

The NodeCPUUtilization(%) metric indicates the CPU usage of the nodes in a cluster. When the CPU usage is high or close to 100%, your businesses running in the cluster are adversely affected.

If the CPU usage spikes or fluctuates significantly and your businesses are affected, check for the following causes:
  • The cluster QPS or write QPS spikes or fluctuates significantly.
  • The cluster receives a small number of slow queries or slow write requests.

    In this case, you may not be able to find spikes or fluctuations in cluster QPS and write QPS. You can log on to the Elasticsearch console and go to the Logs page, and then click the Search Slow Log tab to analyze the log data.

  • The cluster has a large number of indexes or shards.

    Elasticsearch monitors indexes in the cluster and records index changes in the log. If the cluster has too many indexes or shards, the CPU usage, heap memory usage, or node workload within one minute may reach a high level.

  • Merge operations are performed on the cluster.

    Merge operations consume CPU resources and cause the number of segments on the corresponding node to drop sharply. You can check the number of segments on the Overview page of the node in the Kibana console, or query it directly as shown in the sketch after this list.

  • Garbage collection operations are performed on the cluster.

    Garbage collection operations, such as full garbage collection, attempt to release memory and consume CPU resources in the process. As a result, the CPU usage may spike.

  • Scheduled tasks, such as backup tasks or other custom tasks, are performed on the cluster.
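
To check the per-node CPU usage, one-minute load, and segment counts mentioned above, you can also query the cluster directly. The following is a minimal sketch; ES_URL and AUTH are placeholders.

    import requests

    ES_URL = "http://localhost:9200"      # placeholder: your cluster endpoint
    AUTH = ("elastic", "your_password")   # placeholder: your credentials

    # Per-node CPU usage and one-minute load.
    nodes = requests.get(
        f"{ES_URL}/_cat/nodes?h=name,cpu,load_1m&format=json", auth=AUTH
    ).json()
    for node in nodes:
        print(node["name"], "cpu:", node["cpu"], "load_1m:", node["load_1m"])

    # Per-node segment counts; a sharp drop usually indicates that a merge has finished.
    stats = requests.get(f"{ES_URL}/_nodes/stats/indices/segments", auth=AUTH).json()
    for node in stats["nodes"].values():
        print(node["name"], "segments:", node["indices"]["segments"]["count"])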

NodeHeapMemoryUtilization(%)

The NodeHeapMemoryUtilization(%) metric indicates the heap memory usage of the nodes in a cluster. If the heap memory usage is high or a large object is stored in the memory, your businesses running in the cluster are adversely affected, and garbage collection is triggered.

If the heap memory usage spikes or fluctuates significantly and your businesses are affected, check for the following causes:
  • The cluster QPS or write QPS spikes or fluctuates significantly.
  • The cluster receives a small number of slow queries or slow write requests.

    In this case, you may not be able to find spikes or fluctuations in cluster QPS and write QPS. You can log on to the Elasticsearch console and go to the Logs page, and then click the Search Slow Log tab to analyze the log data.

  • The cluster receives a large number of slow queries or slow write requests.

    In this case, you may find spikes or fluctuations in cluster QPS and write QPS. You can log on to the Elasticsearch console and go to the Logs page, and then click the Indexing Slow Log tab to analyze the log data.

  • The cluster has a large number of indexes or shards.

    Elasticsearch monitors indexes in the cluster and records index changes in the log. If the cluster has too many indexes or shards, the CPU usage, heap memory usage, or node workload within one minute may reach a high level.

  • Merge operations are performed on the cluster.

    Merge operations consume CPU resources and cause the number of segments on the corresponding node to drop sharply. You can check the number of segments on the Overview page of the node in the Kibana console.

  • Garbage collection operations are performed on the cluster.

    Garbage collection operations, such as full garbage collection, attempt to release memory and consume CPU resources in the process. As a result, the heap memory usage drops sharply.
  • Scheduled tasks, such as backup tasks or other custom tasks, are performed on the cluster.
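
To check the per-node heap memory usage directly, you can query the JVM statistics of each node. The following is a minimal sketch; ES_URL and AUTH are placeholders.

    import requests

    ES_URL = "http://localhost:9200"      # placeholder: your cluster endpoint
    AUTH = ("elastic", "your_password")   # placeholder: your credentials

    # Per-node heap usage, as reported by the JVM statistics of each node.
    stats = requests.get(f"{ES_URL}/_nodes/stats/jvm", auth=AUTH).json()
    for node in stats["nodes"].values():
        print(node["name"], "heap used:", node["jvm"]["mem"]["heap_used_percent"], "%")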

NodeDiskUtilization(%)

The NodeDiskUtilization(%) metric indicates the disk usage of the nodes in a cluster. Keep the disk usage below 85%. We recommend that you configure an alert rule for this metric. Otherwise, your businesses running in the cluster may be adversely affected in the following situations:
  • By default, if the disk usage of a data node exceeds 85%, new shards cannot be allocated to the data node. This may adversely affect your businesses.
  • By default, if the disk usage of a data node exceeds 90%, Elasticsearch attempts to move the shards on this node to data nodes with low disk usage. This may adversely affect your businesses.
  • By default, if the disk usage of a data node exceeds 95%, Elasticsearch adds the read_only_allow_delete attribute to all indexes in the cluster. This means that the indexes cannot be written. They can only be read or deleted. This may adversely affect your businesses.
Notice Do not set the alert threshold for the disk usage to a value higher than 80%. We recommend that you set the threshold to a value lower than 75%. This way, when alerts are triggered, you have time to expand disks, add nodes, or clear index data before your businesses are affected.
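
The per-node disk usage can also be checked with the _cat/allocation API, and the write block that Elasticsearch adds at the 95% threshold can be removed after disk space is freed. The following is a minimal sketch; ES_URL, AUTH, and the index name test_index are placeholders.

    import requests

    ES_URL = "http://localhost:9200"      # placeholder: your cluster endpoint
    AUTH = ("elastic", "your_password")   # placeholder: your credentials

    # Per-node disk usage, comparable to the NodeDiskUtilization(%) metric.
    allocation = requests.get(
        f"{ES_URL}/_cat/allocation?h=node,disk.percent&format=json", auth=AUTH
    ).json()
    for row in allocation:
        print(row["node"], "disk used:", row["disk.percent"], "%")

    # After freeing disk space, remove the read_only_allow_delete block from an
    # affected index. "test_index" is only an example name.
    requests.put(
        f"{ES_URL}/test_index/_settings",
        json={"index.blocks.read_only_allow_delete": None},
        auth=AUTH,
    )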

NodeLoad_1m

The NodeLoad_1m metric indicates the workload of a node within one minute. You can use this metric to determine whether the node is busy. Keep this value lower than the number of CPU cores of the node to ensure that your businesses run normally.

If this value exceeds the number of CPU cores of the node, your businesses are adversely affected. The common causes are as follows:
  • The CPU usage or heap memory usage is high or reaches 100%.
  • The cluster QPS or write QPS spikes or fluctuates significantly.
  • The cluster receives slow queries that take a long time to process.

    A small or large number of such slow queries can cause this. Log on to the Elasticsearch console, go to the Logs page, and click the corresponding log tab to analyze the log data.

Taking a single-core node as an example, the values of this metric and their definitions are described as follows:
  • Load < 1: No pending process exists.
  • Load = 1: The system does not have additional resources to run more processes.
  • Load > 1: Processes are queued up waiting for resources.
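
The comparison between the one-minute load and the number of CPU cores can be made directly from the node statistics. The following is a minimal sketch; ES_URL and AUTH are placeholders, and the load average is reported for Linux nodes.

    import requests

    ES_URL = "http://localhost:9200"      # placeholder: your cluster endpoint
    AUTH = ("elastic", "your_password")   # placeholder: your credentials

    # Number of CPU cores per node.
    info = requests.get(f"{ES_URL}/_nodes/os", auth=AUTH).json()
    cores = {nid: n["os"]["available_processors"] for nid, n in info["nodes"].items()}

    # One-minute load average per node, compared against the core count.
    stats = requests.get(f"{ES_URL}/_nodes/stats/os", auth=AUTH).json()
    for nid, node in stats["nodes"].items():
        load_1m = node["os"]["cpu"]["load_average"]["1m"]
        verdict = "busy" if load_1m > cores[nid] else "ok"
        print(node["name"], "load_1m:", load_1m, "cores:", cores[nid], "->", verdict)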

NodeStatsFullGcCollectionCount(unit)

Warning If full garbage collection is frequently triggered, your businesses running in the cluster are adversely affected.

The NodeStatsFullGcCollectionCount(unit) metric indicates the number of times full garbage collection is triggered within one minute.

If this value is not 0, your businesses are affected. The common causes are as follows:

  • The heap memory usage is high.
  • Large objects are stored in memory.
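
Full (old-generation) garbage collection counts are exposed in the node JVM statistics as a cumulative counter, so sampling the counter twice, one minute apart, approximates this metric. The following is a minimal sketch; ES_URL and AUTH are placeholders.

    import time
    import requests

    ES_URL = "http://localhost:9200"      # placeholder: your cluster endpoint
    AUTH = ("elastic", "your_password")   # placeholder: your credentials

    def old_gc_counts():
        # Cumulative old-generation (full) garbage collection count per node.
        stats = requests.get(f"{ES_URL}/_nodes/stats/jvm", auth=AUTH).json()
        return {n["name"]: n["jvm"]["gc"]["collectors"]["old"]["collection_count"]
                for n in stats["nodes"].values()}

    first = old_gc_counts()
    time.sleep(60)                        # one minute between the two samples
    second = old_gc_counts()
    for name in first:
        print(name, "full GCs in the last minute:", second[name] - first[name])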

NodeStatsExceptionLogCount(unit)

The NodeStatsExceptionLogCount(unit) metric indicates the number of warning-level entries generated in the instance log within one minute.

If this value is not 0, your businesses are affected. The common causes are as follows:

  • Abnormal cluster queries are received.
  • Abnormal write requests are received.
  • Errors occur when the cluster runs tasks.
  • A garbage collection is triggered.
Note
  • Log on to the Elasticsearch console, go to the Logs page, and click the Instance Log tab. On the Instance Log tab, look for exceptions that occurred at the specific time and analyze their causes.
  • The NodeStatsExceptionLogCount(unit) metric also counts the garbage collection events recorded in the Instance Log.

ClusterAutoSnapshotLatestStatus

The ClusterAutoSnapshotLatestStatus metric indicates the status of the auto snapshot feature of the cluster. If the value is -1 or 0, auto snapshot is running normally.

If the value is 2, an error has occurred in the auto snapshot feature. The common causes are as follows:

  • The disk usage of the nodes is high or close to 100%.
  • The cluster is not in the Normal state.
The values of this metric and their definitions are as follows:
  • 0: Snapshots are created.
  • -1: No snapshot is created.
  • 1: The system is creating a snapshot.
  • 2: The system failed to create snapshots.
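
Snapshot states can also be inspected through the snapshot APIs. The following sketch lists every repository and the state of each snapshot in it (SUCCESS, IN_PROGRESS, PARTIAL, or FAILED); ES_URL and AUTH are placeholders, and the name of the repository used by the auto snapshot feature may differ on your cluster.

    import requests

    ES_URL = "http://localhost:9200"      # placeholder: your cluster endpoint
    AUTH = ("elastic", "your_password")   # placeholder: your credentials

    # List snapshot repositories, then the snapshots in each one and their states.
    repos = requests.get(f"{ES_URL}/_snapshot", auth=AUTH).json()
    for repo in repos:
        snapshots = requests.get(f"{ES_URL}/_snapshot/{repo}/_all", auth=AUTH).json()
        for snap in snapshots["snapshots"]:
            print(repo, snap["snapshot"], snap["state"])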