All Products
Search
Document Center

Container Service for Kubernetes:kube-controller-manager component monitoring metrics and dashboard usage

Last Updated:Mar 26, 2026

kube-controller-manager is a control plane component that runs the core Kubernetes controllers—the node controller, StatefulSet controller, and Deployment controller, among others. This topic describes the metrics exposed by kube-controller-manager and how to use the built-in dashboards to monitor its health.

Key concepts

Workqueue

Most controllers in kube-controller-manager process resource changes through a workqueue. When an event occurs—such as a Pod being created, updated, or deleted—the controller receives a notification and enqueues the resource identifier (Pod name and namespace) into the workqueue. During each reconcile loop, the controller dequeues identifiers and applies the corresponding business logic.

The workqueue is the primary signal for controller throughput. A growing queue depth indicates the controller is falling behind—events are arriving faster than the controller can process them.

Metrics

Metrics reflect the internal state of kube-controller-manager. The following table describes the available metrics.

Metric Type Description When to watch
workqueue_adds_total Counter Total number of Add events processed by the workqueue. Unusual spikes may indicate a burst of resource changes—correlate with cluster activity.
workqueue_depth Gauge Current length of the workqueue. A sustained high depth means the controller cannot keep up with incoming events. Investigate controller goroutine status and resource pressure.
workqueue_queue_duration_seconds_bucket Histogram Time a task spends waiting in the workqueue before processing. Bucket thresholds: {10⁻⁸, 10⁻⁷, 10⁻⁶, 10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 1, 10} seconds. High latency at the p95/p99 quantile indicates queue backlog. Use with histogram_quantile to set SLO-based alerts.
memory_utilization_byte Gauge Memory used by kube-controller-manager. Unit: bytes. Alert if memory approaches the container limit to prevent OOM restarts.
cpu_utilization_core Gauge CPU capacity used by kube-controller-manager. Unit: core. Sustained high CPU usage may signal excessive controller reconciliations.
rest_client_requests_total Counter Number of HTTP requests to kube-apiserver, grouped by status code, method, and host. A rise in 4xx or 5xx responses indicates API server errors affecting controller operations.
rest_client_request_duration_seconds_bucket Histogram Latency of HTTP requests from kube-controller-manager to kube-apiserver, broken down by verb and URL. Elevated p99 latency suggests API server slowness, which delays controller reconciliations.

Deprecated metrics

The following metrics are deprecated and will be removed in a future version. Remove any alerts or dashboards that depend on them.

Deprecated metric Replacement
cpu_utilization_ratio cpu_utilization_core
memory_utilization_ratio memory_utilization_byte

Access dashboards

For instructions on opening the kube-controller-manager dashboards, see View control plane component dashboards in ACK Pro clusters.

Dashboard reference

Dashboards are built from the metrics above using Prometheus Query Language (PromQL). Two parameters apply globally across all dashboards:

  • quantile: The request quantile to display (for example, 0.99 for p99).

  • interval: The PromQL sampling interval used in rate() calculations.

The following sections describe each dashboard panel and its PromQL.

Workqueue dashboard

kcm1
Panel PromQL What it shows
Workqueue entry rate sum(rate(workqueue_adds_total{job="ack-kube-controller-manager"}[$interval])) by (name) Rate of Add events entering the workqueue per controller, over the selected interval.
Workqueue depth sum(rate(workqueue_depth{job="ack-kube-controller-manager"}[$interval])) by (name) Change in workqueue length per controller over the selected interval. A rising trend indicates a queue backlog.
Workqueue processing delay histogram_quantile($quantile, sum(rate(workqueue_queue_duration_seconds_bucket{job="ack-kube-controller-manager"}[5m])) by (name, le)) Time events spend waiting in the workqueue at the selected quantile.

Resources dashboard

kcm2
Panel PromQL What it shows
Memory Usage memory_utilization_byte{container="kube-controller-manager"} Memory used by kube-controller-manager. Unit: bytes.
CPU Usage cpu_utilization_core{container="kube-controller-manager"}*1000 CPU used by kube-controller-manager. Unit: millicore.

Kube API dashboard

kcm3
Panel PromQL What it shows
Kube API request QPS See below Queries per second (QPS) from kube-controller-manager to kube-apiserver, broken down by HTTP method and status code.
Kube API request delay histogram_quantile($quantile, sum(rate(rest_client_request_duration_seconds_bucket{job="ack-kube-controller-manager"}[$interval])) by (verb,url,le)) Round-trip latency between kube-controller-manager and kube-apiserver at the selected quantile, broken down by verb and URL.

The Kube API request QPS panel uses four queries to separate responses by status code class:

sum(rate(rest_client_requests_total{job="ack-kube-controller-manager",code=~"2.."}[$interval])) by (method,code)
sum(rate(rest_client_requests_total{job="ack-cloud-controller-manager",code=~"3.."}[$interval])) by (method,code)
sum(rate(rest_client_requests_total{job="ack-cloud-controller-manager",code=~"4.."}[$interval])) by (method,code)
sum(rate(rest_client_requests_total{job="ack-cloud-controller-manager",code=~"5.."}[$interval])) by (method,code)

Troubleshooting

Dashboard panels show no data

Check that the metric collection job is active. The workqueue and Kube API panels use job="ack-kube-controller-manager". If no data appears, verify that the Prometheus scrape job is running and that kube-controller-manager is exposing metrics on the expected port.

Histogram metrics show incomplete data

Histogram metrics like workqueue_queue_duration_seconds_bucket require all three series in the family (_bucket, _sum, _count) to be collected. If any series is missing, histogram_quantile returns no result. Confirm that the scrape configuration collects all three.

workqueue_depth stays elevated

A sustained non-zero workqueue depth means the controller is not processing events fast enough. Check controller goroutine count and API server latency. High rest_client_request_duration_seconds_bucket values at the same time point to API server bottlenecks as the root cause.

What's next

For metrics, dashboard usage, and troubleshooting guidance for other control plane components, see:

References