Container Service for Kubernetes: Panels

Last Updated: Nov 24, 2025

GPU monitoring is built on an Exporter, Prometheus, and Grafana stack. This topic describes the panels on the GPU monitoring dashboards.
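
Because the dashboards read their data from the Prometheus instance that scrapes the exporter, any panel value can also be retrieved programmatically. The following minimal sketch queries the Prometheus HTTP API for an average GPU utilization series; the service address and the metric name (DCGM_FI_DEV_GPU_UTIL, a common DCGM exporter metric) are assumptions and may differ in your deployment.

```python
# Minimal sketch: query the Prometheus instance behind the dashboards for an
# average GPU utilization value. The URL and metric name below are assumptions;
# substitute the values used by your exporter and Prometheus deployment.
import requests

PROMETHEUS_URL = "http://prometheus-server.monitoring:9090"  # assumed service address
QUERY = "avg(DCGM_FI_DEV_GPU_UTIL)"  # assumed metric name for GPU utilization

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": QUERY},
    timeout=10,
)
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    # Each sample carries a label set and the latest [timestamp, value] pair.
    print(sample["metric"], sample["value"])
```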

Panel descriptions

GPU monitoring includes three dashboards, one for each level: GPUs - Cluster Dimension, GPUs - Nodes, and GPUs - Pods. The following sections describe the panels on each dashboard.

GPUs - Cluster Dimension

Panel name

Description

Total GPU Node Instance

The total number of GPU nodes in the cluster or node pool.

Allocated GPUs

The total number of GPUs and the number of allocated GPUs in the cluster or node pool.

Allocated GPU Memory

The percentage of GPU memory allocated out of the total available in the cluster or node pool.

Used GPU Memory

The percentage of GPU memory in use out of the total available in the cluster or node pool.

Average GPU Utilization

The average GPU utilization across the cluster or node pool.

Used GPU Memory Copy Utilization

The average GPU memory copy utilization across the cluster or node pool.

The Last one XID Error

The most recent XID error on a GPU in the cluster.

GPU Node Details

Details for each GPU node in the cluster, including:

  • Node Name: The name of the node.

  • GPU Index: The index number of the GPU on the node.

  • GPU Utilization: The utilization of the GPU.

  • GPU Memory Copy Utilization: The GPU memory copy utilization.

  • GPU Memory Used: The amount of GPU memory currently in use.

  • Allocated GPU memory: The percentage of GPU memory allocated out of the total available.

  • Total GPU memory: The total amount of GPU memory on the GPU.

  • Power Usage: The current power usage.

  • GPU Temperature: The temperature of the GPU.

  • Memory Temperature: The temperature of the GPU memory.

GPUs - Nodes

Panel group

Panel name

Description

Overview

GPU Mode

The GPU mode, which can be Exclusive, Share, or None.

  • Exclusive mode: Resources are requested on a per-GPU basis.

  • Share mode: Resources are requested based on GPU memory and computing power.

  • None: No GPU applications are currently running on the node. Because a node can switch between Exclusive and Share modes, the mode cannot be determined until a GPU application runs on the node.
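
The two modes correspond to two ways of declaring GPU resources on a container. The following sketch, built with the official Kubernetes Python client, illustrates both request styles; the shared-GPU resource names (aliyun.com/gpu-mem, aliyun.com/gpu-core.percentage) are assumptions based on common shared GPU scheduling setups and should be checked against the extended resources your cluster actually exposes.

```python
# Minimal sketch of the two request styles behind the GPU Mode panel,
# built with the official kubernetes Python client.
from kubernetes import client

# Exclusive mode: the container requests whole GPUs.
exclusive = client.V1Container(
    name="exclusive-worker",
    image="registry.example.com/cuda-app:latest",  # placeholder image
    resources=client.V1ResourceRequirements(
        limits={"nvidia.com/gpu": "1"},
    ),
)

# Share mode: the container requests GPU memory (and optionally computing
# power) instead of whole devices. Resource names below are assumptions.
shared = client.V1Container(
    name="shared-worker",
    image="registry.example.com/cuda-app:latest",  # placeholder image
    resources=client.V1ResourceRequirements(
        limits={
            "aliyun.com/gpu-mem": "4",               # assumed unit: GiB of GPU memory
            "aliyun.com/gpu-core.percentage": "30",  # assumed: % of one GPU's compute
        },
    ),
)
```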

NVIDIA Driver Version

The version of the NVIDIA driver installed on the node.

Allocated GPUs

The number of allocated GPUs out of the total number of GPUs on the node.

GPU Utilization

The average GPU utilization of all GPUs on the node.

Allocated GPU Memory

The percentage of total GPU memory on the node that has been allocated.

Used GPU Memory

The percentage of total GPU memory on the node that is currently in use.

Allocated Computing Power (Valid in GPU Sharing)

The amount of allocated computing power. This metric only applies when using shared GPU scheduling with computing power requests.

The Last one XID Error

The most recent XID error on a GPU on this node.

Utilization

GPU Utilization

The utilization of each GPU on the node.

GPU Memory Utilization

The GPU memory copy utilization of each GPU on the node.

Encoder Engine Utilization

The encoder engine utilization of each GPU on the node.

Decoder Engine Utilization

The decoder engine utilization of each GPU on the node.

Memory & BAR1

GPU Memory Used

Details about the GPU memory on the node:

  • UUID: The UUID of the GPU.

  • GPU index: The index number of the GPU.

  • GPU model: The model of the GPU.

  • Used percentage: The percentage of GPU memory in use.

  • Used: The amount of GPU memory currently used on this GPU.

  • Allocated: The percentage of GPU memory allocated out of the total.

  • Total: The total amount of GPU memory on this GPU.

BAR1 Used

The amount of used BAR1 memory.

GPU Memory Used

The amount of GPU memory used by the GPUs on the node.

BAR1 Total

The total amount of BAR1 memory.

GPU Process

GPU Process Details

Details about the GPU processes running on the node:

  • Pod namespace: The namespace of the Pod that owns the process.

  • Pod name: The name of the Pod that owns the process.

  • Container name: The name of the container that owns the process.

  • Allocate mode: The mode the Pod uses to request GPU resources, such as Exclusive Mode or Share Mode.

  • Process ID: The ID of the process.

  • Process name: The name of the process.

  • Process type: The type of the process, such as Compute (C) or Graphics (G).

  • GPU index: The index of the GPU on which the process is running.

  • Used memory: The amount of GPU memory used by the process.

  • SM utilization: The SM utilization of the process.

  • Memory copy utilization: The GPU memory copy utilization of the process.

  • Decode utilization: The decoder utilization of the process.

  • Encode utilization: The encoder utilization of the process.

Illegal GPU Process (GPU request not by Kubernetes resources limits) Details

Details about illegal GPU processes, which are processes that use GPU resources without requesting them through Kubernetes resource limits. This includes processes started in the following ways:

  • Run a GPU application directly on a Node.

  • Run a GPU application in a container started directly with the docker run command.

  • Request GPU resources for a Pod by setting the NVIDIA_VISIBLE_DEVICES=all or NVIDIA_VISIBLE_DEVICES=<GPU ID> environment variable directly in the Pod's env section and running a GPU program.

  • Configure privileged: true in a Pod's securityContext and run a GPU program.

  • Run a GPU program in a Pod where the NVIDIA_VISIBLE_DEVICES environment variable is not set, but the container image used by the Pod has NVIDIA_VISIBLE_DEVICES=all configured by default.
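
As one concrete illustration of the environment-variable case above, the sketch below (Kubernetes Python client, placeholder names) shows a container that gains GPU access only through NVIDIA_VISIBLE_DEVICES and declares no GPU resource limit, so the scheduler has no record of the allocation and the resulting processes appear in this panel.

```python
# Illustration only: the env-variable anti-pattern described above. The
# container sees all GPUs via NVIDIA_VISIBLE_DEVICES, but declares no
# nvidia.com/gpu (or shared GPU) limit, so its processes are "illegal"
# from the scheduler's point of view.
from kubernetes import client

illegal_container = client.V1Container(
    name="bypasses-gpu-scheduling",
    image="registry.example.com/cuda-app:latest",  # placeholder image
    env=[client.V1EnvVar(name="NVIDIA_VISIBLE_DEVICES", value="all")],
    # Note: resources.limits deliberately omits any GPU resource.
)
```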

Profiling

Graphics Engine Active

The percentage of time the graphics or compute engine was active during the sampling period.

DRAM Active

The percentage of time the DRAM was active, which corresponds to memory bandwidth utilization.

SM Active

The percentage of time the Streaming Multiprocessors (SMs) were active.

SM Occupancy

The ratio of active warps on an SM to the maximum number of warps the SM supports.
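
As a worked reading of this ratio (the numbers are illustrative; the per-SM warp limit depends on the GPU architecture):

```latex
\text{SM Occupancy} \;=\; \frac{\text{active warps per SM}}{\text{maximum warps supported per SM}},
\qquad \text{e.g. } \frac{32}{64} = 50\%.
```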

Tensor Core Engine Active

The percentage of time the Tensor Core pipes were active during the sampling period.

FP32 Engine Active

The percentage of time the FP32 pipes were active during the sampling period.

FP16 Engine Active

The percentage of time the FP16 pipes were active during the sampling period.

FP64 Engine Active

The percentage of time the FP64 pipes were active during the sampling period.

PCIE TX Bytes (Device to Host)

The rate of data transmitted from the device (GPU) to the host over the PCIe bus.

PCIE RX Bytes (Host to Device)

The rate of data received by the device (GPU) from the host over the PCIe bus.

NVLINK TX Bytes

The rate of data transmitted over NVLink.

NVLINK RX Bytes

The rate of data received over NVLink.

Temperature & Energy

Power Usage

The power usage of the GPUs on the node.

Total Energy Consumption (in J)

The total energy consumed by the GPU in joules (J) since the driver was loaded.
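
Because this panel reports a cumulative counter (joules since the driver was loaded), average power over a window can be derived as the energy delta divided by the elapsed time, as in the following sketch with made-up sample values:

```python
# Minimal sketch: derive average power (watts) from two readings of the
# cumulative energy counter. The sample values are made up for illustration.
e_start_j = 1_200_000.0  # joules at the start of the window (hypothetical)
e_end_j = 1_254_000.0    # joules at the end of the window (hypothetical)
elapsed_s = 300.0        # window length in seconds

avg_power_w = (e_end_j - e_start_j) / elapsed_s  # watts = joules per second
print(f"Average power over the window: {avg_power_w:.1f} W")  # -> 180.0 W
```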

Memory Temperature

The temperature of the GPU memory on the node.

GPU Temperature

The temperature of the GPU compute units on the node.

Clock

SM CLOCK

The clock speed of the SM (Streaming Multiprocessor).

Memory Clock

The clock speed of the memory.

App SM Clock

The application-level clock speed for the SM.

App Memory Clock

The application-level clock speed for the memory.

Video Clock

The clock speed of the video engine.

Clock Throttle Reasons

The reasons for clock throttling.

Retired Pages

Retired Pages (Single-bit Errors)

The number of memory pages retired due to single-bit errors.

Retired Pages (Double-bit Errors)

The number of memory pages retired due to double-bit errors.

Violation

Power Violation

The duration, in microseconds, of throttling due to power limits.

Thermal Violation

The duration, in microseconds, of throttling due to thermal limits.

Sync Boost Violation

The duration, in microseconds, of throttling due to sync boost limits.

Board Limit Violation

The duration, in microseconds, of throttling due to board power limits.

Board Reliability Violation

The duration, in microseconds, of throttling due to reliability limits.

Low Util Violation

The duration, in microseconds, of throttling due to low utilization.

GPUs - Pods

Panel group

Panel name

Description

Overview

GPU Pod Details

Displays information about Pods requesting GPU resources, including:

  • Pod namespace: The namespace of the Pod.

  • Pod name: The name of the Pod.

  • Node name: The node where the Pod is running.

  • Pod source: The source of the Pod.

  • Allocated mode: The allocation mode of the Pod.

  • Used GPU memory: The amount of GPU memory currently used by the Pod.

  • Allocated GPU memory: The amount of GPU memory allocated to the Pod.

  • Allocated computing power: The amount of computing power requested by the Pod in a shared GPU scheduling environment. This value is not displayed for Exclusive Mode Pods or Pods that only request GPU memory.

  • SM utilization: The SM (Streaming Multiprocessor) utilization.

  • GPU memory copy utilization: The GPU memory copy utilization.

  • Encode utilization: The encoder utilization.

  • Decode utilization: The decoder utilization.

Pod Metrics (GPU Device)

Pods Used GPU Memory

The amount of GPU memory currently used by the Pod.

Pods GPU Memory Used Percentage

The percentage of total available GPU memory used by the Pod.

Pods GPU Memory Copy Utilization

The Pod's GPU memory copy utilization.

Pods Average SM Utilization

The Pod's average SM utilization.

Pods GPU Decode Utilization

The Pod's decoder utilization.

Pods GPU Encode Utilization

The Pod's encoder utilization.

cGPU Pod Details

Memory Percent

The Pod's host (system) memory usage as a percentage.

Memory Usage

The amount of host (system) memory used by the Pod.

CPU Usage By Cores

The CPU usage per core.

CPU Usage Percent

The CPU usage as a percentage.

Network Bandwidth Usage

The network bandwidth usage.

Network Socket

Network socket information.

File System

File system usage.

Process Number

The number of processes.

GPU Utilization (Associated with Pod)

GPU Utilization

The utilization of the GPU used by the application.

GPU Memory Copy Utilization

The GPU memory copy utilization of the GPU used by the application.

Encoder Engine Utilization

The encoder engine utilization of the GPU used by the application.

Decoder Engine Utilization

The decoder engine utilization of the GPU used by the application.

GPU Memory & BAR1 (GPU Cards Level)

GPU Memory Details

Memory details for the GPU used by the application:

  • UUID: The UUID of the GPU.

  • Pod source: The source of the Pod.

  • GPU model: The model of the GPU.

  • Driver version: The driver version.

  • Allocated mode: The allocation mode of the Pod.

  • Allocated percentage: The percentage of GPU memory allocated.

  • Used: The amount of GPU memory currently in use.

  • Used percentage: The percentage of GPU memory in use.

  • Total: The total amount of GPU memory.

GPU Memory Used

The amount of memory used on the GPU used by the application.

GPU Memory Used Percentage

The percentage of memory used on the GPU used by the application.

BAR1 Used

The amount of used BAR1 memory.

BAR1 Total

The total amount of BAR1 memory.

GPU Profiling (GPU Cards Level)

Graphics Engine Active

The percentage of time the graphics or compute engine was active during the sampling period.

DRAM Active

The percentage of time the DRAM was active, which corresponds to memory bandwidth utilization.

SM Active

The percentage of time the Streaming Multiprocessors (SMs) were active.

SM Occupancy

The ratio of active warps on an SM to the maximum number of warps the SM supports.

Tensor Core Engine Active

The percentage of time the Tensor Core pipes were active during the sampling period.

FP32 Engine Active

The percentage of time the FP32 pipes were active during the sampling period.

FP16 Engine Active

The percentage of time the FP16 pipes were active during the sampling period.

FP64 Engine Active

The percentage of time the FP64 pipes were active during the sampling period.

PCIE TX Bytes (Device to Host)

The rate of data transmitted from the device (GPU) to the host over the PCIe bus.

PCIE RX Bytes (Host to Device)

The rate of data received by the device (GPU) from the host over the PCIe bus.

NVLINK TX Bytes

The rate of data transmitted over NVLink.

NVLINK RX Bytes

The rate of data received over NVLink.

GPU Temperature & Energy (GPU Cards Level)

Power Usage

The power usage of the GPU used by the application.

Total Energy Consumption (in J)

Total energy the GPU has consumed, in joules (J), since the driver was loaded.

Memory Temperature

The memory temperature of the GPU used by the application.

GPU Temperature

The temperature of the compute units of the GPU used by the application.

GPU Clock (Associated with Pod)

SM CLOCK

The clock speed of the SM (Streaming Multiprocessor).

Memory Clock

The clock speed of the memory.

App SM Clock

The application-level clock speed for the SM.

App Memory Clock

The application-level clock speed for the memory.

Video Clock

The clock speed of the video engine.

Clock Throttle Reasons

The reasons for clock throttling.