Container Service for Kubernetes: Monitor etcd

Last Updated: Mar 01, 2024

This topic describes the metrics supported by etcd, provides usage notes for the dashboards of etcd, and suggests how to troubleshoot common metric anomalies.

Metrics

Metrics can indicate the status and parameter settings of a component. The following table describes the metrics supported by etcd.

Metric | Type | Description
cpu_utilization_core | Gauge | The CPU usage. Unit: vCores.
cpu_utilization_ratio | Gauge | CPU utilization = Number of used vCores / Total number of vCores. Unit: %.
etcd_server_has_leader | Gauge | Indicates whether the etcd member has a leader. Valid values: 1 (the member has a leader), 0 (the member does not have a leader).
etcd_server_is_leader | Gauge | Indicates whether the etcd member is the leader. Valid values: 1 (the member is the leader), 0 (the member is not the leader).
etcd_server_leader_changes_seen_total | Counter | The cumulative number of leader changes that the member has seen.
etcd_mvcc_db_total_size_in_bytes | Gauge | The total size of the etcd member DB that is physically allocated. Unit: bytes.
etcd_mvcc_db_total_size_in_use_in_bytes | Gauge | The size of the etcd member DB that is logically in use. Unit: bytes.
etcd_disk_backend_commit_duration_seconds_bucket | Histogram | The etcd backend commit delay. Unit: seconds. Buckets: 0.001, 0.002, 0.004, 0.008, 0.016, 0.032, 0.064, 0.128, 0.256, 0.512, 1.024, 2.048, 4.096, and 8.192.
etcd_debugging_mvcc_keys_total | Gauge | The total number of etcd keys.
etcd_server_proposals_committed_total | Gauge | The total number of raft proposals committed.
etcd_server_proposals_applied_total | Gauge | The total number of raft proposals applied.
etcd_server_proposals_pending | Gauge | The current number of pending raft proposals.
etcd_server_proposals_failed_total | Counter | The total number of failed raft proposals.
memory_utilization_byte | Gauge | The memory usage. Unit: bytes.
memory_utilization_ratio | Gauge | Memory utilization = Amount of used memory / Total amount of memory. Unit: %.
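
As an illustration of how these metrics can be read programmatically, the following minimal sketch parses a few samples in the Prometheus text exposition format and derives the share of the DB that is in use. The sample text, the parse_metrics helper, and the db_in_use_ratio name are hypothetical and exist only for this example. The parser assumes each sample line has the form "series value" with no trailing timestamp.

```python
def parse_metrics(text):
    """Parse 'series value' lines into a dict; skip comments and blank lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field (no timestamp assumed).
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Hypothetical scrape output, for illustration only.
SAMPLE = """\
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 1
# TYPE etcd_server_is_leader gauge
etcd_server_is_leader 0
# TYPE etcd_mvcc_db_total_size_in_bytes gauge
etcd_mvcc_db_total_size_in_bytes 2.097152e+07
# TYPE etcd_mvcc_db_total_size_in_use_in_bytes gauge
etcd_mvcc_db_total_size_in_use_in_bytes 1.048576e+07
"""

metrics = parse_metrics(SAMPLE)

# Share of the allocated DB file that is logically in use.
db_in_use_ratio = (
    metrics["etcd_mvcc_db_total_size_in_use_in_bytes"]
    / metrics["etcd_mvcc_db_total_size_in_bytes"]
)
```

A production setup would normally query Prometheus instead of parsing text by hand; the sketch only shows how the gauge values relate to each other.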

Usage notes for dashboards

Dashboards are generated based on metrics and Prometheus Query Language (PromQL). The following sections describe the observability and features of the dashboards of etcd.

Observability

etcd

Features

Dashboard | PromQL | Description
Etcd Cluster Healthy | etcd_server_has_leader | Indicates the liveness of the etcd member. A value of 1 indicates that the etcd member has a leader and is alive.
Etcd Cluster Healthy | etcd_server_is_leader == 1 | Indicates whether the etcd member is the leader. A healthy cluster has exactly one member elected as the leader.
Leader Changes for Latest Day | changes(etcd_server_leader_changes_seen_total{job="etcd"}[1d]) | The number of leader changes within the previous day.
Mem Usage | memory_utilization_byte{container="etcd"} | The memory usage. Unit: bytes.
CPU Usage | cpu_utilization_core{container="etcd"}*1000 | The CPU usage. Unit: millicores.
Mem Usage Rate | memory_utilization_ratio{container="etcd"} | The memory utilization. Unit: %.
CPU Usage Rate | cpu_utilization_ratio{container="etcd"} | The CPU utilization. Unit: %.
DB Size | etcd_mvcc_db_total_size_in_bytes | The size of the etcd backend DB.
DB Size | etcd_mvcc_db_total_size_in_use_in_bytes | The usage of the etcd backend DB.
kv total | etcd_debugging_mvcc_keys_total | The total number of key-value pairs in the etcd cluster.
Backend Commit Delay | histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket{job="etcd"}[5m])) by (instance, le)) | The 99th percentile of the DB commit delay for each instance, calculated over a 5-minute window.
Raft Proposals Status | rate(etcd_server_proposals_failed_total{job="etcd"}[1m]) | The rate of failed raft proposals, calculated over a 1-minute window (rate() returns a per-second average).
Raft Proposals Status | etcd_server_proposals_pending{job="etcd"} | The current number of pending raft proposals.
Raft Proposals Status | etcd_server_proposals_committed_total{job="etcd"} - etcd_server_proposals_applied_total{job="etcd"} | The difference between the number of committed raft proposals and the number of applied raft proposals.
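
To make the Backend Commit Delay query easier to interpret, the following sketch approximates how a quantile is estimated from cumulative histogram buckets by linear interpolation inside the bucket that contains the target rank. It is a simplified illustration and not the Prometheus implementation. The function signature and the bucket counts are hypothetical, and the counts stand in for the per-bucket rates that the query sums by instance and le.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for a few of the etcd commit-duration buckets.
example = [(0.008, 90), (0.016, 98), (0.032, 100), (math.inf, 100)]
p99 = histogram_quantile(0.99, example)  # about 0.024 seconds
```

Because the estimate is interpolated, its precision is limited by the bucket boundaries: any value that falls between 0.016 and 0.032 is reported somewhere on that line segment.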

Common metric anomalies

Etcd Cluster Healthy

Normal: All three etcd members have a leader, and exactly one of them is the leader. That is, sum(etcd_server_has_leader) = 3, and etcd_server_is_leader == 1 for exactly one member.

Abnormal: One etcd member is abnormal.

Anomaly description: etcd_server_has_leader != 1 for that member. This anomaly does not affect the external services provided by the etcd cluster, because the remaining two members still form a quorum.

Abnormal: Multiple etcd members are abnormal.

Anomaly description: etcd_server_has_leader != 1 for multiple etcd members. In this scenario, the etcd cluster cannot provide external services. Check whether etcd_server_is_leader == 1 for any etcd member. If not, the etcd cluster does not have a leader and cannot provide external services.
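
The health rules in this section can be sketched as a small classifier. This is a minimal illustration that assumes a three-member cluster. The function name, the member names, and the returned labels ("normal", "degraded", "unavailable") are hypothetical and are not part of the dashboard.

```python
def classify_cluster(members, has_leader, is_leader):
    """Classify cluster health from per-member gauge values.

    members:    names of all expected etcd members
    has_leader: member name -> etcd_server_has_leader value
    is_leader:  member name -> etcd_server_is_leader value
    A member missing from a dict (for example, a failed scrape) counts as 0.
    """
    abnormal = [m for m in members if has_leader.get(m, 0) != 1]
    leaders = [m for m in members if is_leader.get(m, 0) == 1]

    if len(abnormal) >= 2 or not leaders:
        # Multiple abnormal members, or no leader at all.
        return "unavailable"
    if not abnormal and len(leaders) == 1:
        return "normal"
    # One abnormal member (or an unexpected leader count) with a leader present.
    return "degraded"


# Hypothetical member names and values, for illustration only.
MEMBERS = ["etcd-a", "etcd-b", "etcd-c"]
healthy = classify_cluster(
    MEMBERS,
    {"etcd-a": 1, "etcd-b": 1, "etcd-c": 1},
    {"etcd-a": 0, "etcd-b": 1, "etcd-c": 0},
)
one_down = classify_cluster(
    MEMBERS,
    {"etcd-a": 1, "etcd-b": 1},  # etcd-c is not reporting
    {"etcd-a": 0, "etcd-b": 1},
)
```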

Backend Commit Delay

Normal: The metric indicates a delay of tens of milliseconds.

Abnormal: The metric indicates a delay of hundreds of milliseconds or even several seconds for a period of time.

Anomaly description: Disk reads and writes are abnormal.
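
The "for a period of time" condition can be sketched as a check for consecutive high readings, so that a single spike is not flagged. The 0.1-second threshold and the three-sample window below are assumptions chosen for illustration, not values taken from the dashboard. The function name is hypothetical.

```python
def commit_delay_abnormal(p99_samples, threshold_s=0.1, min_consecutive=3):
    """Return True if the p99 commit delay stays at or above threshold_s
    for at least min_consecutive consecutive samples."""
    run = 0
    for value in p99_samples:
        # Extend the current run of high readings, or reset it.
        run = run + 1 if value >= threshold_s else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical p99 readings in seconds, for illustration only.
spike = commit_delay_abnormal([0.02, 0.03, 0.5, 0.02])     # single spike
sustained = commit_delay_abnormal([0.02, 0.3, 0.4, 1.2])   # three high readings in a row
```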

Raft Proposals Status

Normal: The rate of failed raft proposals is 0.

Abnormal: The rate of failed raft proposals is greater than 0.

Anomaly description: Raft proposals failed. If a large number of raft proposals failed, troubleshoot the issue.

Normal: The number of pending raft proposals is 0.

Abnormal: The number of pending raft proposals is greater than 0.

Anomaly description: A large number of raft proposals are pending because raft proposals are applied slowly. Check the Backend Commit Delay metric and troubleshoot the issue.

Normal: The difference between the number of committed raft proposals and the number of applied raft proposals is 0.

Abnormal: The difference between the number of committed raft proposals and the number of applied raft proposals is greater than 0.

Anomaly description: etcd is overwhelmed by client requests. If the difference is greater than 5000, etcd rejects subsequent requests and returns the "too many requests" error. etcd does not accept new requests until the backlog of pending proposals is processed.
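
The three Raft Proposals Status checks can be combined into one routine, sketched below. The function name, parameter names, and finding messages are hypothetical. The only number taken from this topic is the gap limit of 5000.

```python
MAX_COMMIT_APPLY_GAP = 5000  # gap above which etcd rejects new requests


def raft_findings(failed_rate, pending, committed, applied):
    """Return a list of findings for the Raft Proposals Status panels."""
    findings = []
    if failed_rate > 0:
        findings.append("raft proposals are failing")
    if pending > 0:
        findings.append("raft proposals are pending; check Backend Commit Delay")
    gap = committed - applied
    if gap > MAX_COMMIT_APPLY_GAP:
        findings.append("etcd rejects new requests (too many requests)")
    elif gap > 0:
        findings.append("applied proposals lag behind committed proposals")
    return findings


# Hypothetical readings, for illustration only.
ok = raft_findings(failed_rate=0.0, pending=0, committed=1200, applied=1200)
overloaded = raft_findings(failed_rate=0.0, pending=4, committed=20000, applied=14000)
```

An empty list means that all three panels are in their normal state.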