GPU Monitoring is built on an Exporter, Prometheus, and Grafana stack that collects, stores, and visualizes GPU metrics at the cluster, node, and pod level. This topic describes the panels on each monitoring dashboard.
Panel overview
GPU Monitoring includes three dashboards: GPUs - Cluster Dimension, GPUs - Nodes, and GPUs - Pods. The following sections describe each dashboard.
GPUs - Cluster Dimension
| Panel name | Description |
|---|---|
| Total GPU Nodes | Total number of GPU nodes in the cluster or node pool. |
| Allocated GPUs | Number of allocated GPUs, out of the total number of GPUs in the cluster or node pool. |
| Allocated GPU Memory | Percentage of total GPU memory that is allocated. |
| Used GPU Memory | Percentage of total GPU memory that is currently in use. |
| Average GPU Utilization | Average GPU utilization across the cluster or node pool. |
| GPU Memory Copy Utilization | Average memory copy utilization across the cluster or node pool. |
| The Last One XID Error | Most recent XID error on a GPU card in the cluster. |
| GPU Node Details | Details for GPU nodes in the cluster, including: Node Name, GPU Index, GPU Utilization, GPU Memory Copy Utilization, Used GPU Memory, Allocated GPU Memory, Total GPU Memory, Power, GPU Temperature, and GPU Memory Temperature. |
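The Allocated GPU Memory and Used GPU Memory panels both report a percentage of total GPU memory. As a rough illustration of that arithmetic, the sketch below aggregates per-GPU values. The `GpuSample` type and its field names are hypothetical and are not part of the dashboard or the exporter.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class GpuSample:
    """Hypothetical per-GPU sample; field names are illustrative only."""
    total_mib: float      # total GPU memory on the card
    allocated_mib: float  # GPU memory reserved by scheduled pods
    used_mib: float       # GPU memory actually in use


def memory_percentages(samples: Iterable[GpuSample]) -> Tuple[float, float]:
    """Return (allocated %, used %) of total GPU memory across all samples."""
    samples = list(samples)
    total = sum(s.total_mib for s in samples)
    if total == 0:
        return 0.0, 0.0  # no GPUs reported; avoid division by zero
    allocated = sum(s.allocated_mib for s in samples)
    used = sum(s.used_mib for s in samples)
    return 100.0 * allocated / total, 100.0 * used / total
```

A large gap between the two percentages suggests that pods hold GPU memory they are not using.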
GPUs - Nodes
Overview
Summary metrics for a single node. Use these panels to assess overall GPU health and allocation status before drilling into detailed charts.
| Panel name | Description |
|---|---|
| GPU Mode | GPU allocation mode on the node. Exclusive: resources are allocated per GPU card. Share: resources are allocated by GPU memory and computing power. None: no GPU applications are running on the node. Because a node can switch between Exclusive and Share modes, the system cannot detect which mode the node uses until a GPU program runs. |
| NVIDIA Driver Version | Version of the NVIDIA driver installed on the node. |
| Allocated GPUs | Number of GPUs allocated on the node, out of the total number of GPUs on the node. |
| GPU Utilization | Average GPU utilization across all GPU cards on the node. |
| Allocated GPU Memory | Percentage of total GPU memory that is allocated on the node. |
| Used GPU Memory | Percentage of total GPU memory that is currently in use on the node. |
| Allocated Computing Power (Valid in GPU Sharing) | Allocated computing power. Applies only when GPU sharing is enabled and computing power scheduling is requested. |
| The Last One XID Error | Most recent XID error on a GPU card on the node. |
Utilization
Time-series charts for GPU compute and encoding/decoding engine activity. Use these panels to identify whether workloads are compute-bound or encoder/decoder-bound.
| Panel name | Description |
|---|---|
| GPU Utilization | GPU card utilization on the node. |
| GPU Memory Copy Utilization | Memory copy utilization on the GPU card. |
| Encoder Engine Utilization | Encoder engine utilization on the GPU card. |
| Decoder Engine Utilization | Decoder engine utilization on the GPU card. |
Memory & BAR1
Memory usage and BAR1 details for GPU cards on the node. Use these panels to track memory pressure and identify cards approaching their memory limit.
| Panel name | Description |
|---|---|
| GPU Memory Details | Per-card GPU memory breakdown, including: UUID, GPU Index, Model Name (card model), Used Percentage, Used (amount of GPU memory in use), Allocated (percentage of total GPU memory allocated), and Total (total GPU memory on the card). |
| BAR1 Used | Amount of BAR1 memory in use on the GPU cards. |
| GPU Memory Used | Amount of GPU memory in use on the GPU cards on the node. |
| BAR1 Total | Total BAR1 memory available on the GPU cards. |
GPU Process
Process-level GPU activity on the node. Use these panels to identify which pods and containers are consuming GPU resources and to detect unauthorized GPU usage.
| Panel name | Description |
|---|---|
| GPU Process Details | Details for GPU processes on the node, including: Pod Namespace, Pod Name, Container Name, Allocate Mode (Exclusive or Share), Process ID, Process Name, Process Type (compute (C) or graphics (G)), GPU Index, Used Memory, Streaming Multiprocessor (SM) Utilization, Memory Copy Utilization, Decode Utilization, and Encode Utilization. |
| Illegal GPU Process (GPU request not by k8s resources.limits) Details | Details for processes that use GPU resources that were not requested through Kubernetes resources.limits. Such processes include: a GPU application run directly on a node; a GPU application run in a container started with the docker run command; a Pod that sets NVIDIA_VISIBLE_DEVICES=all or NVIDIA_VISIBLE_DEVICES=<GPU ID> directly in its env section; a Pod that sets privileged: true in a container's securityContext; and a GPU program run in a Pod whose container image sets NVIDIA_VISIBLE_DEVICES=all by default. |
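Some of the cases that the Illegal GPU Process panel lists can be spotted in a Pod spec before the Pod runs. The sketch below is one possible pre-check, not the dashboard's detection logic. The function name is hypothetical, the Pod spec is assumed to be a plain dictionary (for example, parsed from a manifest), and `nvidia.com/gpu` is assumed as the GPU resource name; pass the resource names your cluster uses.

```python
from typing import Any, Dict, List, Sequence


def illegal_gpu_reasons(
    pod_spec: Dict[str, Any],
    gpu_resources: Sequence[str] = ("nvidia.com/gpu",),  # assumed resource name
) -> List[str]:
    """List reasons a Pod might reach GPUs without using resources.limits."""
    reasons: List[str] = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        limits = (container.get("resources") or {}).get("limits") or {}
        if any(resource in limits for resource in gpu_resources):
            continue  # GPU requested through resources.limits
        env = {e.get("name"): e.get("value") for e in container.get("env") or []}
        visible = env.get("NVIDIA_VISIBLE_DEVICES")
        # "none" and "void" do not expose any GPU to the container.
        if visible and visible not in ("none", "void"):
            reasons.append(f"{name}: NVIDIA_VISIBLE_DEVICES={visible} set in env")
        if (container.get("securityContext") or {}).get("privileged") is True:
            reasons.append(f"{name}: privileged: true")
    return reasons
```

A static check like this cannot see GPU applications started directly on the node, containers started with docker run, or images that set NVIDIA_VISIBLE_DEVICES by default, so the panel remains the place to confirm actual usage.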
Profiling
Low-level GPU pipeline activity metrics collected by DCGM. Use these panels to diagnose memory-bound vs. compute-bound workloads and identify pipeline bottlenecks.
| Panel name | Description |
|---|---|
| Graphics Engine Active | Percentage of time during a monitoring cycle that the Graphics or Compute engine is active. |
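| DRAM Active | Percentage of cycles in which the device memory interface is actively sending or receiving data (memory bandwidth utilization). |
| SM Active | Percentage of time that SM units are active. |
| SM Occupancy | Ratio of warps resident on an SM to the maximum number of warps the SM can support. |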
| Tensor Core Engine Active | Percentage of time during a monitoring cycle that the Tensor Core pipeline is active. |
| FP32 Engine Active | Percentage of time during a monitoring cycle that the FP32 pipeline is active. |
| FP16 Engine Active | Percentage of time during a monitoring cycle that the FP16 pipeline is active. |
| FP64 Engine Active | Percentage of time during a monitoring cycle that the FP64 pipeline is active. |
| PCIE TX Bytes (Device to Host) | Data transfer rate over the PCIe bus from the GPU device to the host. |
| PCIE RX Bytes (Host to Device) | Data transfer rate over the PCIe bus from the host to the GPU device. |
| NVLINK TX Bytes | Data sent over NVLink. |
| NVLINK RX Bytes | Data received over NVLink. |
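To make the memory-bound versus compute-bound diagnosis concrete, the following sketch classifies one reading of the SM Active and DRAM Active panels. It is a heuristic only: the function name and both thresholds are hypothetical, and the inputs are assumed to be fractions between 0 and 1.

```python
def classify_bottleneck(
    sm_active: float,
    dram_active: float,
    high: float = 0.8,  # hypothetical "saturated" threshold
    low: float = 0.5,   # hypothetical "mostly idle" threshold
) -> str:
    """Roughly classify a workload from SM Active and DRAM Active fractions."""
    for value in (sm_active, dram_active):
        if not 0.0 <= value <= 1.0:
            raise ValueError("inputs must be fractions between 0 and 1")
    if dram_active >= high and sm_active < high:
        return "memory-bound"    # memory interface saturated, SMs are not
    if sm_active >= high and dram_active < high:
        return "compute-bound"   # SMs saturated, memory interface is not
    if sm_active < low and dram_active < low:
        return "underutilized"   # neither resource is busy
    return "mixed"
```

For a compute-bound result, the Tensor Core, FP16, FP32, and FP64 panels show which pipeline carries the load.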
Temperature & Energy
Thermal and power metrics. Monitor these panels to detect thermal throttling or sustained high power draw.
| Panel name | Description |
|---|---|
| Power Usage | Power draw of the GPU card. |
| Total Energy Consumption (in J) | Total energy consumed by the GPU card since the driver was loaded. Unit: joules. |
| Memory Temperature | GPU memory temperature. |
| GPU Temperature | GPU temperature (compute unit). |
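Because Total Energy Consumption is a running total in joules, the average power over an interval follows from two readings: power (W) = energy difference (J) / elapsed time (s). The sketch below shows that calculation; the function name and parameters are hypothetical.

```python
from typing import Optional


def average_power_watts(
    energy_start_j: float,
    energy_end_j: float,
    time_start_s: float,
    time_end_s: float,
) -> Optional[float]:
    """Average power in watts between two energy readings taken in joules."""
    elapsed = time_end_s - time_start_s
    if elapsed <= 0:
        raise ValueError("time_end_s must be later than time_start_s")
    if energy_end_j < energy_start_j:
        # The total restarts when the driver is reloaded, so the two
        # readings cannot be compared.
        return None
    return (energy_end_j - energy_start_j) / elapsed
```

For example, readings of 1,000 J and 16,000 J taken 60 seconds apart give an average of 250 W, which can be compared against the Power Usage panel.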
Clock
Clock frequencies and throttle reasons. A drop in clock frequency under high utilization typically indicates thermal or power throttling.
| Panel name | Description |
|---|---|
| SM CLOCK | SM clock frequency. |
| Memory Clock | Memory clock frequency. |
| APP SM Clock | SM application clock frequency. |
| APP Memory Clock | Application memory clock frequency. |
| Video Clock | Video engine clock frequency. |
| Clock Throttle Reasons | Reasons for clock throttling. |
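If the value behind Clock Throttle Reasons is the NVML clocks-throttle-reasons bitmask (an assumption; this topic does not state the encoding), it can be decoded as sketched below. The bit values follow the NVML `nvmlClocksThrottleReason*` constants; the function name is hypothetical.

```python
from typing import List

# Bit values of the NVML nvmlClocksThrottleReason* constants.
THROTTLE_REASONS = {
    0x001: "GpuIdle",
    0x002: "ApplicationsClocksSetting",
    0x004: "SwPowerCap",
    0x008: "HwSlowdown",
    0x010: "SyncBoost",
    0x020: "SwThermalSlowdown",
    0x040: "HwThermalSlowdown",
    0x080: "HwPowerBrakeSlowdown",
    0x100: "DisplayClockSetting",
}


def decode_throttle_reasons(mask: int) -> List[str]:
    """Return the names of all throttle reasons set in the bitmask."""
    return [name for bit, name in sorted(THROTTLE_REASONS.items()) if mask & bit]
```

Under that assumption, a value of 68 (0x44) decodes to `SwPowerCap` and `HwThermalSlowdown`, that is, both power and thermal throttling.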
Retired Pages
Memory pages retired due to hardware errors. Any non-zero value warrants investigation.
| Panel name | Description |
|---|---|
| Retired Pages (Single-bit Errors) | Number of memory pages retired due to single-bit errors. |
| Retired Pages (Double-bit Errors) | Number of memory pages retired due to double-bit errors. |
Violation
Time spent violating hardware limits, measured in microseconds. Sustained violations indicate the node is operating beyond its safe operating envelope.
| Panel name | Description |
|---|---|
| Power Violation | Time spent violating the power limit. Unit: microseconds. |
| Thermal Violation | Time spent violating the thermal limit. Unit: microseconds. |
| Sync Boost Violation | Time spent violating the sync boost limit. Unit: microseconds. |
| Board Limit Violation | Time spent violating the board limit. Unit: microseconds. |
| Board Reliability Violation | Time spent violating the board reliability limit. Unit: microseconds. |
| Low Util Violation | Time spent violating the low utilization limit. Unit: microseconds. |
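A violation time in microseconds is easier to judge as a share of the observation window. Assuming each violation panel reports a cumulative counter (an assumption; this topic does not say whether the value is cumulative), the following sketch converts two readings into a percentage of elapsed time. The function name and parameters are hypothetical.

```python
def violation_percent(
    violation_start_us: float,
    violation_end_us: float,
    time_start_s: float,
    time_end_s: float,
) -> float:
    """Percentage of the window spent in violation, from two counter readings."""
    elapsed_us = (time_end_s - time_start_s) * 1_000_000
    if elapsed_us <= 0:
        raise ValueError("time_end_s must be later than time_start_s")
    delta_us = violation_end_us - violation_start_us
    if delta_us < 0:
        raise ValueError("counter decreased; it was probably reset")
    return 100.0 * delta_us / elapsed_us
```

For example, a counter that rises from 1,000,000 to 4,000,000 microseconds over 60 seconds means the GPU spent 5% of that minute in violation.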
GPUs - Pods
Overview
| Panel name | Description |
|---|---|
| GPU Pod Details | Details for pods that request GPU resources, including: Pod Namespace, Pod Name, Node Name, Pod Source, Allocated Mode, Used GPU Memory, Allocated GPU Memory, Allocated Computing Power (blank if the pod requests only GPU memory or uses exclusive GPU mode), SM Utilization, GPU Memory Copy Utilization, Encode Utilization, and Decode Utilization. |
Pod Metrics (GPU Device)
GPU metrics scoped to individual pods. Use these panels to track per-pod GPU memory consumption and SM activity, and to detect pods approaching their GPU memory limit.
| Panel name | Description |
|---|---|
| Pods Used GPU Memory | Amount of GPU memory currently used by the pod. |
| Pods GPU Memory Used Percentage | Percentage of total available GPU memory used by the pod. |
| Pods GPU Memory Copy Utilization | Memory copy utilization for the pod. |
| Pods Average SM Utilization | Average SM utilization for the pod. |
| Pods GPU Decode Utilization | Decoder utilization for the pod. |
| Pods GPU Encode Utilization | Encoder utilization for the pod. |
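As one way to act on these panels, the sketch below flags pods whose GPU memory usage is close to their limit. It assumes the limit is the GPU memory allocated to the pod. The function name, the input shape, and the 90% default threshold are all hypothetical.

```python
from typing import Dict, List, Tuple


def pods_near_memory_limit(
    usage: Dict[str, Tuple[float, float]],
    threshold: float = 0.9,  # hypothetical alert threshold
) -> List[str]:
    """Return pods whose used GPU memory is at least `threshold` of allocated.

    `usage` maps a pod name to (used_mib, allocated_mib).
    """
    flagged = []
    for pod, (used_mib, allocated_mib) in sorted(usage.items()):
        # Pods with no allocated GPU memory have no limit to approach.
        if allocated_mib > 0 and used_mib / allocated_mib >= threshold:
            flagged.append(pod)
    return flagged
```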
Pods Metrics (Host Resource)
Host-level resource metrics for pods running GPU workloads. Use these panels to identify CPU or memory bottlenecks that may limit GPU throughput.
| Panel name | Description |
|---|---|
| Memory Percent | Percentage of host memory in use. |
| Memory Usage | Amount of host memory in use. |
| CPU Usage By Cores | CPU usage per core. |
| CPU Usage Percent | Percentage of CPU in use. |
| Network Bandwidth Usage | Network bandwidth usage. |
| Network Socket | Number of active network sockets. |
| File System | File system usage. |
| Process Number | Number of processes. |
GPU Utilization (Associated with Pod)
GPU compute and encoding/decoding engine activity for the GPU cards associated with the pod.
| Panel name | Description |
|---|---|
| GPU Utilization | GPU card utilization for the application. |
| GPU Memory Copy Utilization | Memory copy utilization for the application's GPU card. |
| Encoder Engine Utilization | Encoder engine utilization for the application's GPU card. |
| Decoder Engine Utilization | Decoder engine utilization for the application's GPU card. |
GPU Memory & BAR1 (Associated with Pod)
GPU memory details for the GPU cards associated with the pod.
| Panel name | Description |
|---|---|
| GPU Memory Details | Per-card GPU memory breakdown for the application, including: UUID, Pod Source, Model Name (GPU model), Driver version, Allocated Mode, Allocated Percentage, Used (amount of GPU memory in use), Used Percentage, and Total (total GPU memory on the card). |
| GPU Memory Used | Amount of GPU memory used by the application's GPU card. |
| GPU Memory Used Percentage | Percentage of GPU memory in use by the application. |
| BAR1 Used | Amount of BAR1 memory in use. |
| BAR1 Total | Total BAR1 memory available. |
GPU Profiling (Associated with Pod)
Low-level GPU pipeline metrics for the GPU cards associated with the pod. Use these panels to diagnose memory-bound vs. compute-bound behavior at the pod level.
| Panel name | Description |
|---|---|
| Graphics Engine Active | Percentage of time during a monitoring cycle that the Graphics or Compute engine is active. |
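| DRAM Active | Percentage of cycles in which the device memory interface is actively sending or receiving data (memory bandwidth utilization). |
| SM Active | Percentage of time that SM units are active. |
| SM Occupancy | Ratio of warps resident on an SM to the maximum number of warps the SM can support. |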
| Tensor Core Engine Active | Percentage of time during a monitoring cycle that the Tensor Core pipeline is active. |
| FP32 Engine Active | Percentage of time during a monitoring cycle that the FP32 pipeline is active. |
| FP16 Engine Active | Percentage of time during a monitoring cycle that the FP16 pipeline is active. |
| FP64 Engine Active | Percentage of time during a monitoring cycle that the FP64 pipeline is active. |
| PCIE TX Bytes (Device to Host) | Data transfer rate over the PCIe bus from the application's GPU device to the host. |
| PCIE RX Bytes (Host to Device) | Data transfer rate over the PCIe bus from the host to the application's GPU device. |
| NVLINK TX Bytes | Data sent over NVLink. |
| NVLINK RX Bytes | Data received over NVLink. |
GPU Temperature & Energy (Associated with Pod)
Thermal and power metrics for the GPU cards associated with the pod.
| Panel name | Description |
|---|---|
| Power Usage | Power draw of the application's GPU card. |
| Total Energy Consumption (in J) | Total energy consumed by the GPU card since the driver was loaded. Unit: joules. |
| Memory Temperature | GPU memory temperature for the application. |
| GPU Temperature | GPU temperature (compute unit) for the application. |
GPU Clock (Associated with Pod)
Clock frequencies for the GPU cards associated with the pod.
| Panel name | Description |
|---|---|
| SM CLOCK | SM clock frequency. |
| Memory Clock | Memory clock frequency. |
| APP SM Clock | SM application clock frequency. |
| APP Memory Clock | Application memory clock frequency. |
| Video Clock | Video engine clock frequency. |
| Clock Throttle Reasons | Reasons for clock throttling. |