GPU monitoring is built on an Exporter, Prometheus, and Grafana stack. This topic describes the panels on the GPU monitoring dashboards.
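The panels described below are Grafana visualizations of time series that the exporter publishes and Prometheus scrapes. As a rough sketch of that data path, the following example reads one such series directly from the Prometheus HTTP API. The endpoint, the port, and the metric name DCGM_FI_DEV_GPU_UTIL follow the open-source NVIDIA DCGM exporter convention and are assumptions; your exporter may expose different names.

```python
# Minimal sketch: read one of the exporter time series behind the Grafana
# panels, straight from the Prometheus HTTP API.
# Assumptions: Prometheus is reachable at PROM_URL, and the exporter uses
# DCGM-style metric names (e.g. DCGM_FI_DEV_GPU_UTIL); adjust both for
# your own deployment.
import requests

PROM_URL = "http://localhost:9090"   # assumed Prometheus endpoint
QUERY = "DCGM_FI_DEV_GPU_UTIL"       # assumed per-GPU utilization metric

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    labels = series["metric"]
    _, value = series["value"]        # instant vector sample: [timestamp, value]
    print(f"node={labels.get('Hostname', '?')} gpu={labels.get('gpu', '?')} util={value}%")
```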
Panel descriptions
GPU monitoring includes three dashboards: Cluster-level GPU monitoring, Node-level GPU monitoring, and Pod-level GPU monitoring. The following sections describe each dashboard in detail.
GPUs - Cluster Dimension
| Panel name | Description |
| --- | --- |
| Total GPU Node Instance | The total number of GPU nodes in the cluster or node pool. |
| Allocated GPUs | The total number of GPUs and the number of allocated GPUs in the cluster or node pool. |
| Allocated GPU Memory | The percentage of GPU memory allocated out of the total available in the cluster or node pool. |
| Used GPU Memory | The percentage of GPU memory in use out of the total available in the cluster or node pool. |
| Average GPU Utilization | The average GPU utilization across the cluster or node pool. |
| Used GPU Memory Copy Utilization | The average GPU memory copy utilization across the cluster or node pool. |
| The Last one XID Error | The most recent XID error on a GPU in the cluster. |
| GPU Node Details | Details for each GPU node in the cluster. |
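Cluster-level panels such as Total GPU Node Instance and Allocated GPUs are derived from Kubernetes scheduling data rather than from the GPUs themselves. The sketch below shows one way to reproduce those numbers with the official Kubernetes Python client; the extended resource name nvidia.com/gpu assumes the default exclusive-GPU device plugin and differs under shared GPU scheduling.

```python
# Minimal sketch: reproduce the cluster-level "Allocated GPUs" figure from
# Kubernetes scheduling data. Assumes the default NVIDIA device plugin
# resource name "nvidia.com/gpu"; shared-GPU setups expose other names.
from kubernetes import client, config

GPU_RESOURCE = "nvidia.com/gpu"   # assumption: default exclusive-GPU resource

config.load_kube_config()         # use load_incluster_config() inside a Pod
v1 = client.CoreV1Api()

# Total schedulable GPUs: sum of allocatable GPUs over all nodes.
total = sum(
    int((node.status.allocatable or {}).get(GPU_RESOURCE, "0"))
    for node in v1.list_node().items
)

# Allocated GPUs: sum of GPU limits of all running Pods.
allocated = 0
for pod in v1.list_pod_for_all_namespaces(field_selector="status.phase=Running").items:
    for c in pod.spec.containers:
        limits = (c.resources.limits or {}) if c.resources else {}
        allocated += int(limits.get(GPU_RESOURCE, "0"))

print(f"Allocated GPUs: {allocated}/{total}")
```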
GPUs - Nodes
| Panel group | Panel name | Description |
| --- | --- | --- |
| Overview | GPU Mode | The GPU mode, which can be Exclusive, Share, or None. |
| | NVIDIA Driver Version | The version of the NVIDIA driver installed on the node. |
| | Allocated GPUs | The number of allocated GPUs out of the total number of GPUs on the node. |
| | GPU Utilization | The average GPU utilization of all GPUs on the node. |
| | Allocated GPU Memory | The percentage of total GPU memory on the node that has been allocated. |
| | Used GPU Memory | The percentage of total GPU memory on the node that is currently in use. |
| | Allocated Computing Power (Valid in GPU Sharing) | The amount of allocated computing power. This metric only applies when using shared GPU scheduling with computing power requests. |
| | The Last one XID Error | The most recent XID error on a GPU on this node. |
| Utilization | GPU Utilization | The utilization of each GPU on the node. |
| | GPU Memory Utilization | The GPU memory copy utilization of each GPU on the node. |
| | Encoder Engine Utilization | The encoder engine utilization of each GPU on the node. |
| | Decoder Engine Utilization | The decoder engine utilization of each GPU on the node. |
| Memory & BAR1 | GPU Memory Used | Details about the GPU memory on the node. |
| | BAR1 Used | The amount of used BAR1 memory. |
| | GPU Memory Used | The amount of GPU memory used by the GPUs on the node. |
| | BAR1 Total | The total amount of BAR1 memory. |
| GPU Process | GPU Process Details | Details about the GPU processes running on the node. |
| | Illegal GPU Process (GPU request not by Kubernetes resources limits) Details | Details about illegal GPU processes, which are GPU processes whose resources were not requested through Kubernetes resource limits. |
| Profiling | Graphics Engine Active | The percentage of time the graphics or compute engine was active during the sampling period. |
| | DRAM Active | The percentage of time the DRAM was active, which corresponds to memory bandwidth utilization. |
| | SM Active | The percentage of time the Streaming Multiprocessors (SMs) were active. |
| | SM Occupancy | The ratio of active warps on an SM to the maximum number of warps the SM supports. |
| | Tensor Core Engine Active | The percentage of time the Tensor Core pipes were active during the sampling period. |
| | FP32 Engine Active | The percentage of time the FP32 pipes were active during the sampling period. |
| | FP16 Engine Active | The percentage of time the FP16 pipes were active during the sampling period. |
| | FP64 Engine Active | The percentage of time the FP64 pipes were active during the sampling period. |
| | PCIE TX Bytes (Device to Host) | The rate of data transmitted from the device (GPU) to the host over the PCIe bus. |
| | PCIE RX Bytes (Host to Device) | The rate of data received by the device (GPU) from the host over the PCIe bus. |
| | NVLINK TX Bytes | The rate of data transmitted over NVLink. |
| | NVLINK RX Bytes | The rate of data received over NVLink. |
| Temperature & Energy | Power Usage | The power usage of the GPUs on the node. |
| | Total Energy Consumption (in J) | The total energy consumed by the GPU in joules (J) since the driver was loaded. |
| | Memory Temperature | The temperature of the GPU memory on the node. |
| | GPU Temperature | The temperature of the GPU compute units on the node. |
| Clock | SM CLOCK | The clock speed of the SM (Streaming Multiprocessor). |
| | Memory Clock | The clock speed of the memory. |
| | App SM Clock | The application-level clock speed for the SM. |
| | App Memory Clock | The application-level clock speed for the memory. |
| | Video Clock | The clock speed of the video engine. |
| | Clock Throttle Reasons | The reasons for clock throttling. |
| Retired Pages | Retired Pages (Single-bit Errors) | The number of memory pages retired due to single-bit errors. |
| | Retired Pages (Double-bit Errors) | The number of memory pages retired due to double-bit errors. |
| Violation | Power Violation | The duration, in microseconds, of throttling due to power limits. |
| | Thermal Violation | The duration, in microseconds, of throttling due to thermal limits. |
| | Sync Boost Violation | The duration, in microseconds, of throttling due to sync boost limits. |
| | Board Limit Violation | The duration, in microseconds, of throttling due to board power limits. |
| | Board Reliability Violation | The duration, in microseconds, of throttling due to reliability limits. |
| | Low Util Violation | The duration, in microseconds, of throttling due to low utilization. |
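The Profiling panels are the quickest way to tell whether a workload is compute bound or bound by memory bandwidth: a high SM Active together with a low DRAM Active points to compute, and the reverse points to memory. The following sketch pulls both series for one node over the last hour through the Prometheus range-query API. The metric names DCGM_FI_PROF_SM_ACTIVE and DCGM_FI_PROF_DRAM_ACTIVE and the Hostname label follow the open-source DCGM exporter and are assumptions, as is the node name used for filtering.

```python
# Minimal sketch: compare SM Active and DRAM Active for one node over the
# last hour, the same data behind the node dashboard's Profiling panels.
# Metric and label names follow the open-source DCGM exporter and are
# assumptions; adjust them to whatever your exporter actually emits.
import time
import requests

PROM_URL = "http://localhost:9090"    # assumed Prometheus endpoint
NODE = "gpu-node-1"                   # hypothetical node name

end = time.time()
start = end - 3600                    # last hour
range_params = {"start": start, "end": end, "step": "60s"}

def fetch(metric: str):
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={"query": f'{metric}{{Hostname="{NODE}"}}', **range_params},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

for metric in ("DCGM_FI_PROF_SM_ACTIVE", "DCGM_FI_PROF_DRAM_ACTIVE"):
    for series in fetch(metric):
        values = [float(v) for _, v in series["values"]]
        avg = sum(values) / len(values) if values else 0.0
        print(f"{metric} gpu={series['metric'].get('gpu', '?')} avg={avg:.2f}")
```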
GPUs - Pods
| Panel group | Panel name | Description |
| --- | --- | --- |
| Overview | GPU Pod Details | Displays information about Pods that request GPU resources. |
| Pod Metrics (GPU Device) | Pods Used GPU Memory | The amount of GPU memory currently used by the Pod. |
| | Pods GPU Memory Used Percentage | The percentage of total available GPU memory used by the Pod. |
| | Pods GPU Memory Copy Utilization | The Pod's GPU memory copy utilization. |
| | Pods Average SM Utilization | The Pod's average SM utilization. |
| | Pods GPU Decode Utilization | The Pod's decoder utilization. |
| | Pods GPU Encode Utilization | The Pod's encoder utilization. |
| cGPU Pod Details | Memory Percent | The host memory usage as a percentage. |
| | Memory Usage | The amount of host memory used. |
| | CPU Usage By Cores | The CPU usage per core. |
| | CPU Usage Percent | The CPU usage as a percentage. |
| | Network Bandwidth Usage | The network bandwidth usage. |
| | Network Socket | Network socket information. |
| | File System | File system usage. |
| | Process Number | The number of processes. |
| GPU Utilization (Associated with Pod) | GPU Utilization | The utilization of the GPU used by the application. |
| | GPU Memory Copy Utilization | The GPU memory copy utilization of the GPU used by the application. |
| | Encoder Engine Utilization | The encoder engine utilization of the GPU used by the application. |
| | Decoder Engine Utilization | The decoder engine utilization of the GPU used by the application. |
| GPU Memory & BAR1 (GPU Cards Level) | GPU Memory Details | Memory details for the GPU used by the application. |
| | GPU Memory Used | The amount of memory used on the GPU used by the application. |
| | GPU Memory Used Percentage | The percentage of memory used on the GPU used by the application. |
| | BAR1 Used | The amount of used BAR1 memory. |
| | BAR1 Total | The total amount of BAR1 memory. |
| GPU Profiling (GPU Cards Level) | Graphics Engine Active | The percentage of time the graphics or compute engine was active during the sampling period. |
| | DRAM Active | The percentage of time the DRAM was active, which corresponds to memory bandwidth utilization. |
| | SM Active | The percentage of time the Streaming Multiprocessors (SMs) were active. |
| | SM Occupancy | The ratio of active warps on an SM to the maximum number of warps the SM supports. |
| | Tensor Core Engine Active | The percentage of time the Tensor Core pipes were active during the sampling period. |
| | FP32 Engine Active | The percentage of time the FP32 pipes were active during the sampling period. |
| | FP16 Engine Active | The percentage of time the FP16 pipes were active during the sampling period. |
| | FP64 Engine Active | The percentage of time the FP64 pipes were active during the sampling period. |
| | PCIE TX Bytes (Device to Host) | The rate of data transmitted from the device (GPU) to the host over the PCIe bus. |
| | PCIE RX Bytes (Host to Device) | The rate of data received by the device (GPU) from the host over the PCIe bus. |
| | NVLINK TX Bytes | The rate of data transmitted over NVLink. |
| | NVLINK RX Bytes | The rate of data received over NVLink. |
| GPU Temperature & Energy (GPU Cards Level) | Power Usage | The power usage of the GPU used by the application. |
| | Total Energy Consumption (in J) | The total energy consumed by the GPU in joules (J) since the driver was loaded. |
| | Memory Temperature | The memory temperature of the GPU used by the application. |
| | GPU Temperature | The temperature of the GPU compute units used by the application. |
| GPU Clock (Associated with Pod) | SM CLOCK | The clock speed of the SM (Streaming Multiprocessor). |
| | Memory Clock | The clock speed of the memory. |
| | App SM Clock | The application-level clock speed for the SM. |
| | App Memory Clock | The application-level clock speed for the memory. |
| | Video Clock | The clock speed of the video engine. |
| | Clock Throttle Reasons | The reasons for clock throttling. |
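Pod-level panels are possible because the exporter attaches the owning Pod's identity to each GPU series, which lets a panel filter a GPU metric down to a single Pod. The following minimal sketch applies the same filtering by hand. The metric name DCGM_FI_DEV_FB_USED and the namespace and pod label names follow the open-source DCGM exporter's Kubernetes mapping and are assumptions; the namespace and Pod name are hypothetical.

```python
# Minimal sketch: GPU memory in use by one Pod, filtered on the pod and
# namespace labels the exporter attaches to each GPU series.
# Metric and label names follow the open-source DCGM exporter's Kubernetes
# mapping and are assumptions; your exporter may label series differently.
import requests

PROM_URL = "http://localhost:9090"    # assumed Prometheus endpoint
NAMESPACE = "default"                 # hypothetical namespace
POD = "tf-training-0"                 # hypothetical Pod name

query = f'DCGM_FI_DEV_FB_USED{{namespace="{NAMESPACE}", pod="{POD}"}}'
resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    _, mib_used = series["value"]     # DCGM reports framebuffer usage in MiB
    print(f"pod={POD} gpu={series['metric'].get('gpu', '?')} used={mib_used} MiB")
```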