Monitor etcd - Container Service for Kubernetes - Alibaba Cloud Documentation Center

This topic describes the metrics supported by etcd, provides usage notes for the dashboards of etcd, and suggests how to troubleshoot common metric anomalies.

Metrics

Metrics can indicate the status and parameter settings of a component. The following table describes the metrics supported by etcd.

Metric	Type	Description
cpu_utilization_core	Gauge	The CPU usage. Unit: vCores.
cpu_utilization_ratio	Gauge	CPU utilization = Number of used vCores/Total number of vCores. Unit: %.
etcd_server_has_leader	Gauge	Indicates whether the etcd member has a leader. Valid values: 1: The etcd member has a leader. 0: The etcd member does not have a leader.
etcd_server_is_leader	Gauge	Indicates whether the etcd member is a leader. Valid values: 1: The etcd member is a leader. 0: The etcd member is not a leader.
etcd_server_leader_changes_seen_total	Counter	The cumulative number of leader changes that the etcd member has seen.
etcd_mvcc_db_total_size_in_bytes	Gauge	The total size physically allocated to the etcd member DB. Unit: bytes.
etcd_mvcc_db_total_size_in_use_in_bytes	Gauge	The size of the etcd member DB that is logically in use. Unit: bytes.
etcd_disk_backend_commit_duration_seconds_bucket	Histogram	The etcd backend commit delay. Unit: seconds. Buckets: `0.001`, `0.002`, `0.004`, `0.008`, `0.016`, `0.032`, `0.064`, `0.128`, `0.256`, `0.512`, `1.024`, `2.048`, `4.096`, and `8.192`.
etcd_debugging_mvcc_keys_total	Gauge	The total number of etcd keys.
etcd_server_proposals_committed_total	Gauge	The total number of raft proposals committed.
etcd_server_proposals_applied_total	Gauge	The total number of raft proposals applied.
etcd_server_proposals_pending	Gauge	The current number of pending raft proposals.
etcd_server_proposals_failed_total	Counter	The total number of failed raft proposals.
memory_utilization_byte	Gauge	The memory usage. Unit: bytes.
memory_utilization_ratio	Gauge	Memory utilization = Amount of used memory/Total amount of memory. Unit: %.
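
The metrics in the preceding table are exposed in the Prometheus text format. The following is a minimal sketch, not an official tool, of how a scraped payload could be parsed and how the two DB size gauges relate. The `SAMPLE` payload, its values, and the helper names `parse_metrics` and `db_in_use_ratio` are assumptions made for illustration. The parser assumes each sample line has the form `name value` with no trailing timestamp.

```python
# Illustrative sample only: the values are made up, not taken from a real cluster.
SAMPLE = """\
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 1
etcd_server_is_leader 0
etcd_mvcc_db_total_size_in_bytes 4e+07
etcd_mvcc_db_total_size_in_use_in_bytes 1e+07
"""


def parse_metrics(text):
    """Parse 'name value' sample lines, skipping comments and blank lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field on the line.
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


def db_in_use_ratio(samples):
    """Return the share of the allocated DB size that is in use."""
    total = samples["etcd_mvcc_db_total_size_in_bytes"]
    in_use = samples["etcd_mvcc_db_total_size_in_use_in_bytes"]
    return in_use / total


samples = parse_metrics(SAMPLE)
print(db_in_use_ratio(samples))
```

A ratio well below 1 means that much of the allocated DB size is not in use.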

Usage notes for dashboards

Dashboards are generated based on metrics and Prometheus Query Language (PromQL). The following sections describe the observability and features of the dashboards of etcd.

Observability

etcd

Features

Dashboard	PromQL	Description
Etcd Cluster Healthy	(1) etcd_server_has_leader (2) etcd_server_is_leader == 1	(1) Indicates the liveness of the etcd members. In a three-member cluster, a sum of 3 indicates that all etcd members are alive. (2) Indicates which etcd member is the leader. In a healthy cluster, exactly one etcd member is elected as the leader.
Leader Changes for Latest Day	changes(etcd_server_leader_changes_seen_total{job="etcd"}[1d])	The number of leader changes within the previous day.
Mem Usage	memory_utilization_byte{container="etcd"}	The memory usage. Unit: bytes.
CPU Usage	cpu_utilization_core{container="etcd"}*1000	The CPU usage. Unit: millicores.
Mem Usage Rate	memory_utilization_ratio{container="etcd"}	The memory utilization. Unit: percentage.
CPU Usage Rate	cpu_utilization_ratio{container="etcd"}	The CPU utilization. Unit: percentage.
DB Size	(1) etcd_mvcc_db_total_size_in_bytes (2) etcd_mvcc_db_total_size_in_use_in_bytes	(1) The size of the etcd backend DB. (2) The usage of the etcd backend DB.
kv total	etcd_debugging_mvcc_keys_total	The total number of key-value pairs in the etcd cluster.
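Backend Commit Delay	histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket{job="etcd"}[5m])) by (instance, le))	The DB commit delay: the 99th percentile of the backend commit duration over the previous 5 minutes, per instance. Unit: seconds.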
Raft Proposals Status	(1) rate(etcd_server_proposals_failed_total{job="etcd"}[1m]) (2) etcd_server_proposals_pending{job="etcd"} (3) etcd_server_proposals_committed_total{job="etcd"} - etcd_server_proposals_applied_total{job="etcd"}	(1) The rate of failed raft proposals, calculated over the previous minute and expressed per second. (2) The current number of pending raft proposals. (3) The difference between the number of committed raft proposals and the number of applied raft proposals.
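
The Backend Commit Delay panel relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation. The following is a simplified sketch of that estimation, not the Prometheus implementation itself. The function name mirrors the PromQL function, and the bucket counts in `buckets` are made up for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Simplified sketch: assumes 0 < q <= 1 and that the list includes the
    +Inf bucket, as Prometheus histograms do.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The quantile falls into the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, prev_count = (0.0, 0.0) if i == 0 else buckets[i - 1]
    # Interpolate linearly inside the selected bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


# Hypothetical cumulative counts for a few of the commit-duration buckets.
buckets = [(0.008, 50), (0.016, 90), (0.032, 99), (0.064, 100), (math.inf, 100)]
p99 = histogram_quantile(0.99, buckets)
print(f"p99 commit delay: {p99 * 1000:.0f} ms")
```

Because the result is interpolated within a bucket, the reported delay is an estimate whose precision is limited by the bucket boundaries.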

Common metric anomalies

Etcd Cluster Healthy

Normal	Abnormal	Anomaly description
All three etcd members have a leader, and exactly one of them is the leader. That is, sum(etcd_server_has_leader) = 3, and etcd_server_is_leader == 1 is displayed for exactly one etcd member.	One etcd member is abnormal.	etcd_server_has_leader != 1 is displayed for one etcd member. This anomaly does not affect the external services provided by the etcd cluster, because the remaining two members still form a quorum.
sum(etcd_server_has_leader) = 3, and etcd_server_is_leader == 1 is displayed for exactly one etcd member.	Multiple etcd members are abnormal.	etcd_server_has_leader != 1 is displayed for multiple etcd members. In this scenario, the etcd cluster cannot provide external services. Check whether etcd_server_is_leader == 1 is displayed for any etcd member. If not, the etcd cluster does not have a leader and cannot provide external services.
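
The health rules above can be expressed as a small check. The following is a minimal sketch that assumes a three-member cluster. The function name `classify_cluster`, the member names, and the returned labels are hypothetical and are not part of etcd or of the dashboard.

```python
def classify_cluster(has_leader, is_leader):
    """Classify cluster health from per-member gauge values.

    has_leader and is_leader map a member name to the value of
    etcd_server_has_leader and etcd_server_is_leader (0 or 1).
    """
    abnormal = [m for m, v in has_leader.items() if v != 1]
    leaders = [m for m, v in is_leader.items() if v == 1]
    if not leaders or len(abnormal) >= 2:
        # No leader, or multiple abnormal members: the cluster cannot serve requests.
        return "unavailable"
    if len(abnormal) == 1:
        # One abnormal member: the cluster still provides external services.
        return "degraded"
    if len(leaders) == 1:
        return "normal"
    # More than one member reporting itself as leader is not covered by the rules above.
    return "unexpected"


status = classify_cluster(
    has_leader={"etcd-a": 1, "etcd-b": 1, "etcd-c": 1},
    is_leader={"etcd-a": 0, "etcd-b": 1, "etcd-c": 0},
)
print(status)
```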

Backend Commit Delay

Normal	Abnormal	Anomaly description
The metric indicates a delay of tens of milliseconds.	The metric indicates a delay of hundreds of milliseconds or even several seconds for a period of time.	Disk reads and writes are abnormal.

Raft Proposals Status

Normal	Abnormal	Anomaly description
The number of failed raft proposals per minute is 0.	The number of failed raft proposals per minute is greater than 0.	Raft proposals failed. If a large number of raft proposals failed, troubleshoot the issue.
The number of pending raft proposals is 0.	The number of pending raft proposals is greater than 0.	A large number of raft proposals are pending because raft proposals are applied slowly. Check the Backend Commit Delay metric and troubleshoot the issue.
The difference between the number of committed raft proposals and the number of applied raft proposals is 0.	The difference between the number of committed raft proposals and the number of applied raft proposals is greater than 0.	etcd is overwhelmed by client requests. If the difference is greater than 5000, etcd denies subsequent requests and returns the `too many requests` message. etcd accepts requests again only after the pending proposals are processed.
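
The three Raft Proposals Status checks can be combined into one routine. The following is a sketch under the thresholds stated in the preceding table. The function name `check_proposals`, its parameters, and the wording of the findings are assumptions made for illustration.

```python
# The gap above which etcd denies requests, as stated in the preceding table.
MAX_COMMIT_APPLY_GAP = 5000


def check_proposals(failed_rate, pending, committed, applied):
    """Return a list of findings for the Raft Proposals Status panel values."""
    findings = []
    if failed_rate > 0:
        findings.append("raft proposals failed")
    if pending > 0:
        findings.append("raft proposals pending: check Backend Commit Delay")
    gap = committed - applied
    if gap > MAX_COMMIT_APPLY_GAP:
        findings.append("commit/apply gap above 5000: etcd denies new requests")
    elif gap > 0:
        findings.append("commit/apply gap above 0: etcd is under request pressure")
    return findings


# Hypothetical panel values for one etcd member.
findings = check_proposals(failed_rate=0, pending=3, committed=120500, applied=114000)
for finding in findings:
    print(finding)
```

An empty list corresponds to the Normal column for all three rows.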