Container Service for Kubernetes: Monitoring Dashboard Overview

Last Updated: Mar 25, 2026

GPU Monitoring uses the Exporter, Prometheus, and Grafana stack to support richer GPU monitoring scenarios. This topic describes the panels on the monitoring dashboard.

Panel overview

GPU Monitoring includes three dashboards: GPUs - Cluster Dimension, GPUs - Nodes, and GPUs - Pods. The following sections describe each dashboard.

GPUs - Cluster Dimension

| Panel name | Description |
| --- | --- |
| Total GPU Nodes | Total number of GPU nodes in the cluster or node pool. |
| Allocated GPUs | Total number of GPUs in the cluster or node pool, and how many are allocated. |
| Allocated GPU Memory | Percentage of total GPU memory that is allocated. |
| Used GPU Memory | Percentage of total GPU memory that is currently in use. |
| Average GPU Utilization | Average GPU utilization across the cluster or node pool. |
| GPU Memory Copy Utilization | Average memory copy utilization across the cluster or node pool. |
| The Last One XID Error | Most recent XID error on a GPU card in the cluster. |
| GPU Node Details | Details for GPU nodes in the cluster, including: Node Name, GPU Index, GPU Utilization, GPU Memory Copy Utilization, Used GPU Memory, Allocated GPU Memory, Total GPU Memory, Power, GPU Temperature, and GPU Memory Temperature. |
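
The two memory percentages above can be reproduced from per-card rows such as those in GPU Node Details. The following is a minimal sketch of that arithmetic, not the dashboard's actual query; the `GpuCard` type, its field names, and the MiB unit are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class GpuCard:
    """One per-card row; memory values in MiB (assumed unit, hypothetical type)."""
    node: str
    index: int
    used_mib: float
    allocated_mib: float
    total_mib: float


def cluster_memory_percentages(cards):
    """Return (allocated %, used %) of total GPU memory across all cards."""
    total = sum(c.total_mib for c in cards)
    if total == 0:
        return 0.0, 0.0
    allocated = sum(c.allocated_mib for c in cards)
    used = sum(c.used_mib for c in cards)
    return 100.0 * allocated / total, 100.0 * used / total


# Example data (made up): two 16 GiB cards on one node, one of them idle.
cards = [
    GpuCard("node-a", 0, used_mib=4096, allocated_mib=8192, total_mib=16384),
    GpuCard("node-a", 1, used_mib=0, allocated_mib=0, total_mib=16384),
]
allocated_pct, used_pct = cluster_memory_percentages(cards)
print(f"allocated={allocated_pct:.1f}% used={used_pct:.1f}%")
```

A large gap between the allocated and used percentages suggests that pods reserve more GPU memory than they consume.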

GPUs - Nodes

Overview

Summary metrics for a single node. Use these panels to assess overall GPU health and allocation status before drilling into detailed charts.

| Panel name | Description |
| --- | --- |
| GPU Mode | GPU allocation mode on the node: Exclusive (resources are allocated per GPU card), Share (resources are allocated by GPU memory and computing power), or None (no GPU application is running on the node). A node can switch between Exclusive and Share modes, so when no GPU program is running the system cannot detect which mode the node uses and reports None. |
| NVIDIA Driver Version | Version of the NVIDIA driver installed on the node. |
| Allocated GPUs | Number of GPUs allocated on the node, out of the total number of GPUs on the node. |
| GPU Utilization | Average GPU utilization across all GPU cards on the node. |
| Allocated GPU Memory | Percentage of total GPU memory that is allocated on the node. |
| Used GPU Memory | Percentage of total GPU memory that is currently in use on the node. |
| Allocated Computing Power (Valid in GPU Sharing) | Allocated computing power. Applies only when GPU sharing is enabled and computing power scheduling is requested. |
| The Last One XID Error | Most recent XID error on a GPU card on the node. |

Utilization

Time-series charts for GPU compute and encoding/decoding engine activity. Use these panels to identify whether workloads are compute-bound or encoder/decoder-bound.

| Panel name | Description |
| --- | --- |
| GPU Utilization | GPU card utilization on the node. |
| GPU Memory Copy Utilization | Memory copy utilization on the GPU card. |
| Encoder Engine Utilization | Encoder engine utilization on the GPU card. |
| Decoder Engine Utilization | Decoder engine utilization on the GPU card. |

Memory & BAR1

Memory usage and BAR1 details for GPU cards on the node. Use these panels to track memory pressure and identify cards approaching their memory limit.

| Panel name | Description |
| --- | --- |
| GPU Memory Details | Per-card GPU memory breakdown, including: UUID, GPU Index, Model Name (card model), Used Percentage, Used (amount of GPU memory in use), Allocated (percentage of total GPU memory allocated), and Total (total GPU memory on the card). |
| BAR1 Used | Amount of BAR1 memory in use on the GPU cards. |
| GPU Memory Used | Amount of GPU memory in use on the GPU cards on the node. |
| BAR1 Total | Total BAR1 memory available on the GPU cards. |

GPU process

Process-level GPU activity on the node. Use these panels to identify which pods and containers are consuming GPU resources and to detect unauthorized GPU usage.

| Panel name | Description |
| --- | --- |
| GPU Process Details | Details for GPU processes on the node, including: Pod Namespace, Pod Name, Container Name, Allocate Mode (Exclusive or Share), Process ID, Process Name, Process Type (C for compute, G for graphics), GPU Index, Used Memory, Streaming Multiprocessor (SM) Utilization, Memory Copy Utilization, Decode Utilization, and Encode Utilization. |
| Illegal GPU Process (GPU request not by k8s resources.limits) Details | Details for processes that access GPU resources without using Kubernetes resource limits. This includes: running a GPU application directly on a node; running a GPU application in a container started with the docker run command; requesting GPU resources by setting NVIDIA_VISIBLE_DEVICES=all or NVIDIA_VISIBLE_DEVICES=<GPU ID> directly in a Pod's env section; configuring privileged: true in a Pod's securityContext; or running a GPU program in a Pod whose container image has NVIDIA_VISIBLE_DEVICES=all set by default. |
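
Two of the conditions listed for illegal GPU processes are visible in a pod manifest before the pod is deployed: an NVIDIA_VISIBLE_DEVICES entry in a container's env section and privileged: true in a securityContext. The sketch below checks a manifest, already loaded as a Python dict, for those two signals. The function name `unmanaged_gpu_signals` is hypothetical, and this is not how the dashboard detects such processes. It also cannot see the other listed cases (processes started directly on the node, docker run containers, or image-level defaults).

```python
def unmanaged_gpu_signals(pod):
    """Return reasons a pod manifest (as a dict) may bypass resources.limits.

    Hypothetical helper: inspects only container env and securityContext.
    """
    reasons = []
    for container in pod.get("spec", {}).get("containers", []):
        name = container.get("name", "<unnamed>")
        # GPU visibility requested directly through the environment.
        for env in container.get("env") or []:
            if env.get("name") == "NVIDIA_VISIBLE_DEVICES":
                reasons.append(
                    f"{name}: NVIDIA_VISIBLE_DEVICES={env.get('value')} set in env"
                )
        # Privileged containers can reach host devices, including GPUs.
        security = container.get("securityContext") or {}
        if security.get("privileged") is True:
            reasons.append(f"{name}: privileged: true")
    return reasons


# Example manifest (made up) that triggers both checks.
pod = {
    "spec": {
        "containers": [
            {
                "name": "trainer",
                "env": [{"name": "NVIDIA_VISIBLE_DEVICES", "value": "all"}],
                "securityContext": {"privileged": True},
            }
        ]
    }
}
for reason in unmanaged_gpu_signals(pod):
    print(reason)
```

A check like this could run in a CI step or admission review so that such pods are caught before they appear in the panel.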

Profiling

Low-level GPU pipeline activity metrics collected by DCGM. Use these panels to diagnose memory-bound vs. compute-bound workloads and identify pipeline bottlenecks.

| Panel name | Description |
| --- | --- |
| Graphics Engine Active | Percentage of time during a monitoring cycle that the Graphics or Compute engine is active. |
| DRAM Active | Memory bandwidth utilization. |
| SM Active | Percentage of time that SM units are active. |
| SM Occupancy | SM occupancy rate. |
| Tensor Core Engine Active | Percentage of time during a monitoring cycle that the Tensor Core pipeline is active. |
| FP32 Engine Active | Percentage of time during a monitoring cycle that the FP32 pipeline is active. |
| FP16 Engine Active | Percentage of time during a monitoring cycle that the FP16 pipeline is active. |
| FP64 Engine Active | Percentage of time during a monitoring cycle that the FP64 pipeline is active. |
| PCIE TX Bytes (Device to Host) | Data transfer rate over the PCIe bus from the GPU device to the host. |
| PCIE RX Bytes (Host to Device) | Data transfer rate over the PCIe bus from the host to the GPU device. |
| NVLINK TX Bytes | Data sent over NVLink. |
| NVLINK RX Bytes | Data received over NVLink. |
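
One rough way to read SM Active and DRAM Active together is sketched below. This is an illustrative heuristic only: the function name, the 0.6 and 0.3 thresholds, and the assumption that both values are expressed as fractions between 0 and 1 are not taken from the dashboard or from DCGM. Treat the result as a starting point for investigation, not a diagnosis.

```python
def classify_bottleneck(sm_active, dram_active, high=0.6, low=0.3):
    """Label a workload from two activity ratios in the range 0..1.

    Thresholds are illustrative assumptions, not vendor guidance.
    """
    if dram_active >= high and sm_active < high:
        return "memory-bound"
    if sm_active >= high and dram_active < high:
        return "compute-bound"
    if sm_active < low and dram_active < low:
        return "underutilized"
    return "mixed"


# Example readings (made up).
print(classify_bottleneck(sm_active=0.9, dram_active=0.2))
print(classify_bottleneck(sm_active=0.4, dram_active=0.8))
```

If the panels display percentages, divide the values by 100 before passing them in.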

Temperature & Energy

Thermal and power metrics. Monitor these panels to detect thermal throttling or sustained high power draw.

| Panel name | Description |
| --- | --- |
| Power Usage | Power draw of the GPU card. |
| Total Energy Consumption (in J) | Total energy consumed by the GPU card since the driver was loaded. Unit: joules. |
| Memory Temperature | GPU memory temperature. |
| GPU Temperature | GPU temperature (compute unit). |
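
Because Total Energy Consumption is a running total in joules, the average power over an interval is the difference between two readings divided by the elapsed seconds (1 W = 1 J/s), and 1 kWh equals 3,600,000 J. The helper names below are hypothetical; this is a minimal sketch of the arithmetic.

```python
def average_power_watts(energy_start_j, energy_end_j, seconds):
    """Average power between two cumulative energy readings, in watts."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    if energy_end_j < energy_start_j:
        # The total restarts when the driver is reloaded.
        raise ValueError("energy counter went backwards (driver reload?)")
    return (energy_end_j - energy_start_j) / seconds


def joules_to_kwh(joules):
    """Convert joules to kilowatt-hours."""
    return joules / 3_600_000


# Example readings (made up): 18 kJ consumed over one minute.
print(average_power_watts(1_000.0, 19_000.0, 60))
print(joules_to_kwh(3_600_000))
```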

Clock

Clock frequencies and throttle reasons. A drop in clock frequency under high utilization typically indicates thermal or power throttling.

| Panel name | Description |
| --- | --- |
| SM CLOCK | SM clock frequency. |
| Memory Clock | Memory clock frequency. |
| APP SM Clock | SM application clock frequency. |
| APP Memory Clock | Application memory clock frequency. |
| Video Clock | Video engine clock frequency. |
| Clock Throttle Reasons | Reasons for clock throttling. |
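
The rule of thumb above, a clock drop under high utilization, can be expressed as a simple check. The sketch below compares the current SM clock with the SM application clock and flags a sample when utilization is high but the clock sits well below that target. Using the application clock as the baseline, and the 90% utilization and 0.9 ratio thresholds, are assumptions chosen for illustration. Confirm any hit against the Clock Throttle Reasons panel.

```python
def likely_throttled(sm_clock_mhz, app_sm_clock_mhz, gpu_util_pct,
                     min_util_pct=90.0, max_clock_ratio=0.9):
    """Return True when the GPU is busy but the SM clock is well below target.

    Hypothetical heuristic; thresholds are illustrative assumptions.
    """
    if app_sm_clock_mhz <= 0:
        return False  # No usable baseline to compare against.
    busy = gpu_util_pct >= min_util_pct
    clock_low = sm_clock_mhz / app_sm_clock_mhz < max_clock_ratio
    return busy and clock_low


# Example samples (made up).
print(likely_throttled(sm_clock_mhz=1000, app_sm_clock_mhz=1400, gpu_util_pct=98))
print(likely_throttled(sm_clock_mhz=1380, app_sm_clock_mhz=1400, gpu_util_pct=98))
```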

Retired pages

Memory pages retired due to hardware errors. Any non-zero value warrants investigation.

| Panel name | Description |
| --- | --- |
| Retired Pages (Single-bit Errors) | Number of memory pages retired due to single-bit errors. |
| Retired Pages (Double-bit Errors) | Number of memory pages retired due to double-bit errors. |

Violation

Time spent violating hardware limits, measured in microseconds. Sustained violations indicate the node is operating beyond its safe operating envelope.

| Panel name | Description |
| --- | --- |
| Power Violation | Time spent violating the power limit. Unit: microseconds. |
| Thermal Violation | Time spent violating the thermal limit. Unit: microseconds. |
| Sync Boost Violation | Time spent violating the sync boost limit. Unit: microseconds. |
| Board Limit Violation | Time spent violating the board limit. Unit: microseconds. |
| Board Reliability Violation | Time spent violating the board reliability limit. Unit: microseconds. |
| Low Util Violation | Time spent violating the low utilization limit. Unit: microseconds. |
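
A raw microsecond value is easier to judge as a share of the observation window. The sketch below assumes the violation value is a cumulative counter, so the difference between two readings is the violation time within the window; that assumption and the function name are illustrative, not confirmed by this topic.

```python
def violation_fraction(start_us, end_us, window_seconds):
    """Fraction of the window (0..1) spent in violation.

    Assumes start_us and end_us are cumulative readings in microseconds.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    delta_us = max(end_us - start_us, 0)  # Guard against a counter reset.
    return min(delta_us / (window_seconds * 1_000_000), 1.0)


# Example (made up): 15 s of power violation during a 60 s window.
print(violation_fraction(0, 15_000_000, 60))
```

A fraction that stays high across consecutive windows matches the "sustained violation" case described above.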

GPUs - Pods

Overview

| Panel name | Description |
| --- | --- |
| GPU Pod Details | Details for pods that request GPU resources, including: Pod Namespace, Pod Name, Node Name, Pod Source, Allocated Mode, Used GPU Memory, Allocated GPU Memory, Allocated Computing Power (blank if the pod requests only GPU memory or uses exclusive GPU mode), SM Utilization, GPU Memory Copy Utilization, Encode Utilization, and Decode Utilization. |

Pod Metrics (GPU Device)

GPU metrics scoped to individual pods. Use these panels to track per-pod GPU memory consumption and SM activity, and to detect pods approaching their GPU memory limit.

| Panel name | Description |
| --- | --- |
| Pods Used GPU Memory | Amount of GPU memory currently used by the pod. |
| Pods GPU Memory Used Percentage | Percentage of total available GPU memory used by the pod. |
| Pods GPU Memory Copy Utilization | Memory copy utilization for the pod. |
| Pods Average SM Utilization | Average SM utilization for the pod. |
| Pods GPU Decode Utilization | Decoder utilization for the pod. |
| Pods GPU Encode Utilization | Encoder utilization for the pod. |
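
To spot pods approaching their GPU memory limit, compare each pod's used GPU memory with the amount available to it. The sketch below treats the pod's allocated GPU memory as the denominator and flags pods at or above 90%; both the denominator and the threshold are assumptions, and the function name and input shape are hypothetical.

```python
def pods_near_memory_limit(pods, threshold_pct=90.0):
    """Return names of pods whose GPU memory use is at or above the threshold.

    pods maps a pod name to (used_mib, allocated_mib); hypothetical input shape.
    """
    flagged = []
    for name, (used_mib, allocated_mib) in sorted(pods.items()):
        if allocated_mib > 0 and 100.0 * used_mib / allocated_mib >= threshold_pct:
            flagged.append(name)
    return flagged


# Example data (made up).
pods = {
    "train-0": (7800, 8192),   # about 95% of its allocation
    "infer-0": (2048, 8192),   # 25% of its allocation
}
print(pods_near_memory_limit(pods))
```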

Pods Metrics (Host Resource)

Host-level resource metrics for pods running GPU workloads. Use these panels to identify CPU or memory bottlenecks that may limit GPU throughput.

| Panel name | Description |
| --- | --- |
| Memory Percent | Percentage of host memory in use. |
| Memory Usage | Amount of host memory in use. |
| CPU Usage By Cores | CPU usage per core. |
| CPU Usage Percent | Percentage of CPU in use. |
| Network Bandwidth Usage | Network bandwidth usage. |
| Network Socket | Number of active network sockets. |
| File System | File system usage. |
| Process Number | Number of processes. |

GPU Utilization (Associated with Pod)

GPU compute and encoding/decoding engine activity for the GPU cards associated with the pod.

| Panel name | Description |
| --- | --- |
| GPU Utilization | GPU card utilization for the application. |
| GPU Memory Copy Utilization | Memory copy utilization for the application's GPU card. |
| Encoder Engine Utilization | Encoder engine utilization for the application's GPU card. |
| Decoder Engine Utilization | Decoder engine utilization for the application's GPU card. |

GPU Memory & BAR1 (Associated with Pod)

GPU memory details for the GPU cards associated with the pod.

| Panel name | Description |
| --- | --- |
| GPU Memory Details | Per-card GPU memory breakdown for the application, including: UUID, Pod Source, Model Name (GPU model), Driver Version, Allocated Mode, Allocated Percentage, Used (amount of GPU memory in use), Used Percentage, and Total (total GPU memory on the card). |
| GPU Memory Used | Amount of GPU memory used by the application's GPU card. |
| GPU Memory Used Percentage | Percentage of GPU memory in use by the application. |
| BAR1 Used | Amount of BAR1 memory in use. |
| BAR1 Total | Total BAR1 memory available. |

GPU Profiling (Associated with Pod)

Low-level GPU pipeline metrics for the GPU cards associated with the pod. Use these panels to diagnose memory-bound vs. compute-bound behavior at the pod level.

| Panel name | Description |
| --- | --- |
| Graphics Engine Active | Percentage of time during a monitoring cycle that the Graphics or Compute engine is active. |
| DRAM Active | Memory bandwidth utilization. |
| SM Active | Percentage of time that SM units are active. |
| SM Occupancy | SM occupancy rate. |
| Tensor Core Engine Active | Percentage of time during a monitoring cycle that the Tensor Core pipeline is active. |
| FP32 Engine Active | Percentage of time during a monitoring cycle that the FP32 pipeline is active. |
| FP16 Engine Active | Percentage of time during a monitoring cycle that the FP16 pipeline is active. |
| FP64 Engine Active | Percentage of time during a monitoring cycle that the FP64 pipeline is active. |
| PCIE TX Bytes (Device to Host) | Data transfer rate over the PCIe bus from the application's GPU device to the host. |
| PCIE RX Bytes (Host to Device) | Data transfer rate over the PCIe bus from the host to the application's GPU device. |
| NVLINK TX Bytes | Data sent over NVLink. |
| NVLINK RX Bytes | Data received over NVLink. |

GPU Temperature & Energy (Associated with Pod)

Thermal and power metrics for the GPU cards associated with the pod.

| Panel name | Description |
| --- | --- |
| Power Usage | Power draw of the application's GPU card. |
| Total Energy Consumption (in J) | Total energy consumed by the GPU card since the driver was loaded. Unit: joules. |
| Memory Temperature | GPU memory temperature for the application. |
| GPU Temperature | GPU temperature (compute unit) for the application. |

GPU Clock (Associated with Pod)

Clock frequencies for the GPU cards associated with the pod.

| Panel name | Description |
| --- | --- |
| SM CLOCK | SM clock frequency. |
| Memory Clock | Memory clock frequency. |
| APP SM Clock | SM application clock frequency. |
| APP Memory Clock | Application memory clock frequency. |
| Video Clock | Video engine clock frequency. |
| Clock Throttle Reasons | Reasons for clock throttling. |