GPU monitoring 2.0 uses a combination of an exporter, Prometheus, and Grafana to meet business requirements in various scenarios. This topic describes the panels on the dashboards provided by GPU monitoring 2.0.
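As a rough sketch of how this pipeline is consumed, the exporter publishes per-GPU metrics that Prometheus scrapes and Grafana queries over the Prometheus HTTP API. The snippet below builds an instant-query URL and parses a sample response into per-GPU values; the endpoint address, the `DCGM_FI_DEV_GPU_UTIL` metric name, and the `NodeName`/`gpu` labels are assumptions about the exporter's output, not details confirmed by this topic.

```python
import json
from urllib.parse import urlencode

# Hypothetical Prometheus address; replace with your cluster's endpoint.
PROMETHEUS_URL = "http://prometheus.example.com:9090"

def instant_query_url(promql: str) -> str:
    """Build a Prometheus HTTP API instant-query URL for a PromQL expression."""
    return f"{PROMETHEUS_URL}/api/v1/query?" + urlencode({"query": promql})

# Assumed per-GPU utilization metric exposed by the NVIDIA DCGM exporter.
url = instant_query_url("DCGM_FI_DEV_GPU_UTIL")

# A sample (made-up) instant-query response for two GPUs on one node.
sample_response = json.loads("""
{"status": "success",
 "data": {"resultType": "vector",
          "result": [
            {"metric": {"NodeName": "node-a", "gpu": "0"},
             "value": [1700000000, "35"]},
            {"metric": {"NodeName": "node-a", "gpu": "1"},
             "value": [1700000000, "65"]}]}}
""")

def vector_to_dict(resp: dict) -> dict:
    """Map (node name, GPU index) -> metric value for a vector result."""
    return {(r["metric"]["NodeName"], r["metric"]["gpu"]): float(r["value"][1])
            for r in resp["data"]["result"]}

per_gpu = vector_to_dict(sample_response)
# Averaging across GPUs mirrors what an average-utilization panel displays.
avg_util = sum(per_gpu.values()) / len(per_gpu)
print(avg_util)  # 50.0
```

The same pattern applies to any of the metrics behind the panels below; only the PromQL expression changes.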

Introduction to panels

GPU monitoring 2.0 provides a cluster dashboard and a node dashboard. The following sections describe the panels on each dashboard.

Panels on the cluster dashboard

• Total GPU Nodes: the total number of GPU-accelerated nodes in a cluster or node pool.
• Allocated GPUs: the total number of GPUs and the number of allocated GPUs in a cluster or node pool.
• Allocated GPU Memory: the ratio of allocated GPU memory to total GPU memory in a cluster or node pool.
• Used GPU Memory: the ratio of occupied GPU memory to total GPU memory in a cluster or node pool.
• Average GPU Utilization: the average GPU utilization in a cluster or node pool.
• GPU Memory Copy Utilization: the average memory copy utilization in a cluster or node pool.
• The Last One XID Error: the most recent XID error that occurred on a GPU in the cluster or node pool.
• GPU Pod Details: the information about a pod that requests GPU resources.
  • Pod Namespace: the namespace to which the pod belongs.
  • Pod Name: the name of the pod.
  • Node Name: the node where the pod is deployed.
  • Used GPU Mem: the amount of GPU memory occupied by the pod.
  • Allocated GPU Mem: the amount of GPU memory allocated to the pod.
  • Allocated Computing Power: the amount of computing power requested by the pod when the share mode is enabled for GPU scheduling. This metric is not displayed if the pod requests only whole GPUs or uses exclusive mode.
  • SM Utilization: the utilization of streaming multiprocessors (SMs).
  • Mem Copy Utilization: the utilization of memory copies.
  • Encode Utilization: the utilization of GPU encoders.
  • Decode Utilization: the utilization of GPU decoders.
• GPU Node Details: the information about a GPU-accelerated node.
  • Node Name: the name of the node.
  • GPU Index: the GPU index of the node.
  • GPU Utilization: the GPU utilization of the node.
  • Memory Copy Utilization: the memory copy utilization of the node.
  • Used GPU Memory: the amount of occupied GPU memory of the node.
  • Allocated GPU Memory: the ratio of allocated GPU memory to total GPU memory.
  • Total GPU Memory: the total amount of GPU memory.
  • Power: the power of the GPU.
  • GPU Temperature: the temperature of the GPU.
  • Memory Temperature: the temperature of the GPU memory.
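The memory-ratio panels above can be illustrated with a small worked example: allocated (or used) memory summed over all GPUs, divided by the summed total memory. This is only a sketch of the arithmetic; the node names and MiB figures are made up.

```python
# Per-GPU figures for a hypothetical two-node pool, in MiB.
gpus = [
    {"node": "node-a", "total": 16384, "allocated": 8192,  "used": 4096},
    {"node": "node-a", "total": 16384, "allocated": 0,     "used": 0},
    {"node": "node-b", "total": 32768, "allocated": 32768, "used": 20480},
]

def memory_ratio(key: str) -> float:
    """Pool-wide ratio of a memory figure to total GPU memory."""
    return sum(g[key] for g in gpus) / sum(g["total"] for g in gpus)

allocated_ratio = memory_ratio("allocated")  # as in the Allocated GPU Memory panel
used_ratio = memory_ratio("used")            # as in the Used GPU Memory panel
print(f"{allocated_ratio:.1%} {used_ratio:.1%}")  # 62.5% 37.5%
```

Note that the allocated ratio reflects scheduling requests, while the used ratio reflects memory actually occupied at runtime; the two can differ widely, as in this example.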

Panels on the node dashboard

Overview

• GPU Mode: the GPU scheduling mode of a node. The following GPU modes are supported:
  • Exclusive: In exclusive mode, pods on the node request whole GPUs.
  • Share: In share mode, pods on the node request GPU memory and computing power.
  • None: No GPU-accelerated application runs on the node. A node can switch between the exclusive and share modes, so when no GPU-accelerated application is running, the system cannot determine which mode is enabled for the node.
• NVIDIA Driver Version: the version of the GPU driver installed on the node.
• Allocated GPUs: the number of allocated GPUs and the total number of GPUs.
• GPU Utilization: the average GPU utilization of a node, which equals the average of the utilization values of all GPUs on the node.
• Allocated GPU Memory: the ratio of allocated GPU memory to total GPU memory on a node.
• Used GPU Memory: the ratio of occupied GPU memory to total GPU memory on a node.
• Allocated Computing Power: the amount of allocated computing power on a node. This metric is displayed only when the share mode is enabled for GPU scheduling and the pods on the node request computing power.
• The Last One XID Error: the most recent XID error that occurred on a GPU of the node.

Utilization

• GPU Utilization: the GPU utilization of a node.
• Memory Copy Utilization: the memory copy utilization of a node.
• Encoder Engine Utilization: the utilization of GPU encoders on a node.
• Decoder Engine Utilization: the utilization of GPU decoders on a node.

Memory and BAR1

• GPU Memory Details: the memory information about a GPU.
  • UUID: the UUID of the GPU.
  • GPU Index: the index of the GPU.
  • Model Name: the model of the GPU.
  • Used: the amount of occupied GPU memory.
  • Allocated: the ratio of allocated GPU memory to total GPU memory.
  • Total: the total amount of GPU memory.
• BAR1 Used: the amount of BAR1 memory in use.
• Memory Used: the total amount of occupied GPU memory on a node.
• BAR1 Total: the total amount of BAR1 memory.

Profiling

• SM Occupancy: the SM occupancy, which is the ratio of warps resident on the SMs to the maximum number of warps that the SMs support.
• SM Active: the percentage of time that at least one warp is active on an SM, averaged over all SMs.
• Tensor Core Engine Active: the percentage of time that a tensor core pipe remains active within a monitoring cycle.
• FP32 Engine Active: the percentage of time that an FP32 pipe remains active within a monitoring cycle.
• FP16 Engine Active: the percentage of time that an FP16 pipe remains active within a monitoring cycle.
• FP64 Engine Active: the percentage of time that an FP64 pipe remains active within a monitoring cycle.
• Graphics Engine Active: the percentage of time that the graphics or compute engine remains active within a monitoring cycle.
• DRAM Active: the memory bandwidth utilization.
• PCIE TX BYTES (Device to Host): the Peripheral Component Interconnect Express (PCIe) TX rate of GPUs on a node.
• PCIE RX BYTES (Host to Device): the PCIe RX rate of GPUs on a node.
• NVLINK Bandwidth Total: the total NVLink bandwidth that is used to transmit and receive data.
• NVLINK TX/RX BYTES: the NVLink TX or RX rate.

GPU Process

• GPU Process Details: the information about a GPU process on a node.
  • Pod Namespace: the namespace to which the pod of the GPU process belongs.
  • Pod Name: the pod name of the GPU process.
  • Container Name: the container name of the GPU process.
  • Allocate Mode: the GPU scheduling mode used by the pod of the GPU process to request GPU resources. Pods can request GPU resources in exclusive or share mode.
  • Process Id: the ID of the GPU process.
  • Process Name: the name of the GPU process.
  • Process Type: the type of GPU process. Valid values: C (compute) and G (graphics).
  • GPU Index: the GPU to which the GPU process is scheduled.
  • Used Memory: the amount of GPU memory occupied by the GPU process.
  • SM Utilization: the SM utilization of the GPU process.
  • Memory Copy Utilization: the utilization of memory copies.
  • Decode Utilization: the utilization of GPU decoders.
  • Encode Utilization: the utilization of GPU encoders.

Temperature and Energy

• Power Usage: the GPU power of a node.
• Total Energy Consumption: the amount of energy consumed since the GPU driver was loaded. Unit: joules.
• Memory Temperature: the GPU memory temperature of a node.
• GPU Temperature: the temperature of the GPU compute units on a node.

Clock

• SM CLOCK: the SM clock frequency.
• Memory Clock: the memory clock frequency.
• APP SM Clock: the SM clock frequency of an application.
• APP Memory Clock: the memory clock frequency of an application.
• Video Clock: the video clock frequency.
• Clock Throttle Reasons: the reasons why the GPU clocks are throttled.

Retired Pages

• Retired Pages (Single-bit Errors): the number of pages retired due to single-bit errors.
• Retired Pages (Double-bit Errors): the number of pages retired due to double-bit errors.

Violation

• POWER VIOLATION: the cumulative amount of time that the GPU was throttled because the power limit was reached. Unit: microseconds.
• THERMAL VIOLATION: the cumulative amount of time that the GPU was throttled because the temperature limit was reached. Unit: microseconds.
• BOARD RELIABILITY VIOLATION: the cumulative amount of time that the GPU was throttled due to the circuit board reliability limit. Unit: microseconds.
• LOW UTIL VIOLATION: the cumulative amount of time that the GPU clock was lowered due to low utilization. Unit: microseconds.
• SYNC BOOST VIOLATION: the cumulative amount of time that the GPU was throttled due to the synchronization boost limit. Unit: microseconds.
• BOARD LIMIT VIOLATION: the cumulative amount of time that the GPU was throttled due to the circuit board limit. Unit: microseconds.
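Assuming the violation panels report cumulative throttling durations in microseconds (as the microsecond unit suggests), the fraction of a scrape interval spent throttled can be derived from the counter delta. A minimal sketch with made-up sample values:

```python
# Assumed scrape interval between two samples of a violation counter.
SCRAPE_INTERVAL_S = 30

# Two hypothetical samples of a cumulative power-violation counter (microseconds).
power_violation_us = {"t0": 12_000_000, "t1": 15_000_000}

# The counter delta is the throttled time within the window; dividing by the
# window length gives the throttled fraction of that interval.
delta_us = power_violation_us["t1"] - power_violation_us["t0"]
throttled_fraction = delta_us / (SCRAPE_INTERVAL_S * 1_000_000)
print(f"{throttled_fraction:.0%}")  # 10%
```

A sustained nonzero delta on POWER VIOLATION or THERMAL VIOLATION indicates that the GPU is running below its configured clocks for that share of the time.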