GPU monitoring 2.0 combines an exporter, Prometheus, and Grafana to meet business requirements in a variety of scenarios. This topic describes the panels on the dashboards that GPU monitoring 2.0 provides.
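
Because the dashboards read their data from Prometheus, a panel value can in principle also be fetched directly through the Prometheus HTTP API. The following is a minimal sketch that only builds the request URL. The Prometheus address and the metric name `DCGM_FI_DEV_GPU_UTIL` are assumptions for illustration; check which metric names the exporter in your cluster actually exposes.

```python
from urllib.parse import urlencode


def instant_query_url(prometheus_base_url, promql):
    """Build the URL of a Prometheus instant query (GET /api/v1/query)."""
    base = prometheus_base_url.rstrip("/")
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


# Assumption: the exporter exposes per-GPU utilization under this metric name.
ASSUMED_UTILIZATION_METRIC = "DCGM_FI_DEV_GPU_UTIL"

# Hypothetical Prometheus address; replace it with the address used in your cluster.
url = instant_query_url(
    "http://prometheus.example:9090/",
    f"avg({ASSUMED_UTILIZATION_METRIC})",
)
print(url)
```

The URL can be opened with any HTTP client. The sketch stops short of sending the request so that it runs without network access.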
Introduction to panels
GPU monitoring 2.0 provides a cluster dashboard and a node dashboard. The following sections describe the panels on each dashboard.
Panels on the cluster dashboard
Panel | Description |
---|---|
Total GPU Nodes | The total number of GPU-accelerated nodes in a cluster or node pool. |
Allocated GPUs | The total number of GPUs and the number of allocated GPUs in a cluster or node pool. |
Allocated GPU Memory | The ratio of allocated GPU memory to total GPU memory in a cluster or node pool. |
Used GPU Memory | The ratio of occupied GPU memory to total GPU memory in a cluster or node pool. |
Average GPU Utilization | The average GPU utilization in a cluster or node pool. |
GPU Memory Copy Utilization | The average utilization of memory copies in a cluster or node pool. |
The Last one XID Error | The most recent XID error that occurred in a GPU. |
GPU Pod Details | The information about a pod that requests GPU resources. |
GPU Node Details | The information about a GPU-accelerated node. |
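
The ratios and averages in the preceding table can be illustrated with a short sketch. The `GpuSample` structure and its field names are hypothetical and exist only for this example; the dashboard itself derives these values from exporter metrics in Prometheus.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class GpuSample:
    """State of one GPU at a single point in time (hypothetical structure)."""
    node: str
    allocated: bool              # True if the GPU is allocated to a pod
    memory_total_mib: float
    memory_allocated_mib: float  # memory requested by pods
    memory_used_mib: float       # memory actually occupied
    utilization_pct: float       # 0 to 100


def cluster_summary(samples):
    """Aggregate per-GPU samples into cluster-level panel values."""
    if not samples:
        raise ValueError("at least one GPU sample is required")
    total_memory = sum(s.memory_total_mib for s in samples)
    if total_memory <= 0:
        raise ValueError("total GPU memory must be positive")
    return {
        "total_gpu_nodes": len({s.node for s in samples}),
        "total_gpus": len(samples),
        "allocated_gpus": sum(1 for s in samples if s.allocated),
        "allocated_memory_ratio": sum(s.memory_allocated_mib for s in samples) / total_memory,
        "used_memory_ratio": sum(s.memory_used_mib for s in samples) / total_memory,
        "average_utilization_pct": mean(s.utilization_pct for s in samples),
    }


# Made-up sample values for two nodes with two GPUs each.
samples = [
    GpuSample("node-a", True, 16000, 16000, 8000, 80),
    GpuSample("node-a", False, 16000, 0, 0, 0),
    GpuSample("node-b", True, 16000, 8000, 4000, 40),
    GpuSample("node-b", True, 16000, 8000, 4000, 40),
]
summary = cluster_summary(samples)
print(summary)
```

Note that Allocated GPU Memory and Used GPU Memory can differ: a pod may be allocated more GPU memory than it currently occupies.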
Panels on the node dashboard
Panel group | Panel | Description |
---|---|---|
Overview | GPU Mode | The GPU scheduling mode of a node. The following GPU modes are supported: |
Overview | NVIDIA Driver Version | The version of the GPU driver installed on the node. |
Overview | Allocated GPUs | The number of allocated GPUs and the total number of GPUs. |
Overview | GPU Utilization | The average GPU utilization of a node, which equals the average of the utilization values of all GPUs on the node. |
Overview | Allocated GPU Memory | The ratio of allocated GPU memory to total GPU memory on a node. |
Overview | Used GPU Memory | The ratio of occupied GPU memory to total GPU memory on a node. |
Overview | Allocated Computing Power | The amount of allocated computing power on a node. This metric is displayed when the share mode is enabled for GPU scheduling and the pods on the node request computing power. |
Overview | The Last One XID Error | The most recent XID error that occurred in a GPU on a node. |
Utilization | GPU Utilization | The GPU utilization of a node. |
Utilization | Memory Copy Utilization | The utilization of memory copies of a node. |
Utilization | Encoder Engine Utilization | The utilization of GPU encoders on a node. |
Utilization | Decoder Engine Utilization | The utilization of GPU decoders on a node. |
Memory and BAR1 | GPU Memory Details | The memory information about a GPU. |
Memory and BAR1 | BAR1 Used | The amount of BAR1 memory in use. |
Memory and BAR1 | Memory Used | The total amount of occupied GPU memory on a node. |
Memory and BAR1 | BAR1 Total | The total amount of BAR1 memory. |
Profiling | SM Occupancy | The streaming multiprocessor (SM) occupancy, which is the ratio of warps resident on an SM to the maximum number of warps that the SM supports. |
Profiling | SM Active | The percentage of active SMs. |
Profiling | Tensor Core Engine Active | The percentage of time that a tensor core pipe remains active within a monitoring cycle. |
Profiling | FP32 Engine Active | The percentage of time that an FP32 pipe remains active within a monitoring cycle. |
Profiling | FP16 Engine Active | The percentage of time that an FP16 pipe remains active within a monitoring cycle. |
Profiling | FP64 Engine Active | The percentage of time that an FP64 pipe remains active within a monitoring cycle. |
Profiling | Graphics Engine Active | The percentage of time that the graphics or compute engine remains active within a monitoring cycle. |
Profiling | DRAM Active | The memory bandwidth utilization. |
Profiling | PCIE TX BYTES (Device to Host) | The Peripheral Component Interconnect Express (PCIe) TX rate of GPUs on a node. |
Profiling | PCIE RX BYTES (Host to Device) | The PCIe RX rate of GPUs on a node. |
Profiling | NVLINK Bandwidth Total | The total NVLink bandwidth for transmitting and receiving data. |
Profiling | NVLINK TX/RX BYTES | The NVLink TX or RX rate. |
GPU Process | GPU Process Details | The information about a GPU process on a node. |
Temperature and Energy | Power Usage | The power usage of the GPUs on a node. |
Temperature and Energy | Total Energy Consumption | The total amount of energy that the GPUs have consumed since the driver was loaded. Unit: joules. |
Temperature and Energy | Memory Temperature | The GPU memory temperature of a node. |
Temperature and Energy | GPU Temperature | The temperature of GPU compute units on a node. |
Clock | SM CLOCK | The SM clock. |
Clock | Memory Clock | The memory clock. |
Clock | APP SM Clock | The SM clock of an application. |
Clock | APP Memory Clock | The memory clock of an application. |
Clock | Video Clock | The video clock. |
Clock | Clock Throttle Reasons | The reason for clock throttling. |
Retired Pages | Retired Pages (Single-bit Errors) | The number of pages retired due to single-bit errors. |
Retired Pages | Retired Pages (Double-bit Errors) | The number of pages retired due to double-bit errors. |
Violation | POWER VIOLATION | The duration of violations caused by the power upper limit. Unit: microseconds. |
Violation | THERMAL VIOLATION | The duration of violations caused by the temperature upper limit. Unit: microseconds. |
Violation | BOARD RELIABILITY VIOLATION | The duration of violations caused by the circuit board reliability limit. Unit: microseconds. |
Violation | LOW UTIL VIOLATION | The duration of violations caused by low utilization. Unit: microseconds. |
Violation | SYNC BOOST VIOLATION | The duration of violations caused by the synchronization boost limit. Unit: microseconds. |
Violation | BOARD LIMIT VIOLATION | The duration of violations caused by the circuit board limit. Unit: microseconds. |
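
A few of the node-level values in the preceding table lend themselves to simple arithmetic. The sketch below shows how the node GPU utilization is averaged across GPUs, and how two readings of a violation value or of the total energy consumption could be turned into a share of time or an average power. All function names are hypothetical, and the sketch assumes that the violation and energy values are cumulative counters that only grow until the driver is reloaded.

```python
from statistics import mean


def node_gpu_utilization(per_gpu_utilization_pct):
    """Average the utilization values of all GPUs on a node."""
    if not per_gpu_utilization_pct:
        raise ValueError("the node has no GPU samples")
    return mean(per_gpu_utilization_pct)


def _counter_delta(start, end):
    """Difference between two readings of a cumulative counter.

    Assumption: if the counter went backwards, it was reset
    (for example, by a driver reload), so counting restarts from zero.
    """
    return end if end < start else end - start


def violation_share(start_us, end_us, interval_seconds):
    """Fraction of an interval spent in violation (counter in microseconds)."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return _counter_delta(start_us, end_us) / (interval_seconds * 1_000_000)


def average_power_watts(start_joules, end_joules, interval_seconds):
    """Average power over an interval: joules divided by seconds gives watts."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return _counter_delta(start_joules, end_joules) / interval_seconds


# Made-up readings taken 60 seconds apart on a node with four GPUs.
utilization = node_gpu_utilization([50, 70, 90, 30])
power_violation = violation_share(1_000_000, 4_000_000, 60)
power = average_power_watts(1_000, 4_000, 60)
print(utilization, power_violation, power)
```

Under these assumptions, a steadily growing violation share for POWER VIOLATION or THERMAL VIOLATION suggests that the GPU spends an increasing part of each interval constrained by the corresponding limit.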