Panels - Container Service for Kubernetes - Alibaba Cloud Documentation Center

GPU monitoring 2.0 combines an exporter, Prometheus, and Grafana to monitor GPU resources in a variety of scenarios. This topic describes the panels on the dashboards provided by GPU monitoring 2.0.

Introduction to panels

GPU monitoring 2.0 provides a cluster dashboard and a node dashboard. The following sections describe the panels on each dashboard.

Panels on the cluster dashboard


Panel	Description
Total GPU Nodes	The total number of GPU-accelerated nodes in a cluster or node pool.
Allocated GPUs	The total number of GPUs and the number of allocated GPUs in a cluster or node pool.
Allocated GPU Memory	The ratio of allocated GPU memory to total GPU memory in a cluster or node pool.
Used GPU Memory	The ratio of occupied GPU memory to total GPU memory in a cluster or node pool.
Average GPU Utilization	The average GPU utilization in a cluster or node pool.
GPU Memory Copy Utilization	The average utilization of memory copies in a cluster or node pool.
The Last One XID Error	The most recent XID error that occurred on a GPU.
GPU Pod Details	The information about a pod that requests GPU resources. Pod Namespace: the namespace to which the pod belongs. Pod Name: the name of the pod. Node Name: the node where the pod is deployed. Used GPU Mem: the amount of GPU memory occupied by the pod. Allocated GPU Mem: the amount of GPU memory allocated to the pod. Allocated Computing Power: the amount of computing power requested by the pod when the share mode is enabled for GPU scheduling. This metric is not displayed if the pod requests only GPU resources or uses the GPU exclusive mode. SM Utilization: the utilization of streaming multiprocessors (SMs). Mem Copy Utilization: the utilization of memory copies. Encode Utilization: the utilization of GPU encoders. Decode Utilization: the utilization of GPU decoders.
GPU Node Details	The information about a GPU-accelerated node. Node Name: the name of the node. GPU Index: the GPU index of the node. GPU Utilization: the GPU utilization of the node. Memory Copy Utilization: the memory copy utilization of the node. Used GPU Memory: the amount of occupied GPU memory of the node. Allocated GPU Memory: the ratio of allocated GPU memory to total GPU memory. Total GPU Memory: the total amount of GPU memory. Power: the power of the GPU. GPU Temperature: the temperature of the GPU. Memory Temperature: the temperature of the GPU memory.
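
The cluster-level ratios and averages described in the table above can be illustrated with a minimal Python sketch. It assumes a list of per-GPU readings; the `GpuSample` type, its field names, and the MiB units are hypothetical and chosen only for illustration. They are not the metric names exposed by the exporter, and the dashboard may compute its values differently.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean


@dataclass
class GpuSample:
    """One hypothetical per-GPU reading (field names are illustrative only)."""
    node: str
    index: int
    utilization_pct: float    # GPU utilization, 0-100
    used_mem_mib: float       # GPU memory occupied by processes
    allocated_mem_mib: float  # GPU memory allocated to pods
    total_mem_mib: float      # total GPU memory


def ratio_pct(part: float, whole: float) -> float:
    """Return part/whole as a percentage, or 0.0 when whole is zero."""
    return 100.0 * part / whole if whole else 0.0


def cluster_summary(samples: list[GpuSample]) -> dict[str, float]:
    """Aggregate per-GPU readings into cluster-level panel values."""
    total = sum(s.total_mem_mib for s in samples)
    allocated = sum(s.allocated_mem_mib for s in samples)
    used = sum(s.used_mem_mib for s in samples)
    return {
        # Number of distinct nodes that report at least one GPU.
        "total_gpu_nodes": len({s.node for s in samples}),
        # Ratio of allocated GPU memory to total GPU memory.
        "allocated_gpu_memory_pct": ratio_pct(allocated, total),
        # Ratio of occupied GPU memory to total GPU memory.
        "used_gpu_memory_pct": ratio_pct(used, total),
        # Unweighted average of the utilization of all GPUs.
        "average_gpu_utilization_pct": (
            mean(s.utilization_pct for s in samples) if samples else 0.0
        ),
    }
```

Restricting `samples` to the GPUs of a single node yields the corresponding node-level values under the same assumptions.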

Panels on the node dashboard


Panel group	Panel	Description
Overview	GPU Mode	The GPU scheduling mode of a node. The following GPU modes are supported: Exclusive: In exclusive mode, pods on the node request GPUs. Share: In share mode, pods on the node request GPU memory and computing power. None: No GPU-accelerated application runs on the node. A node can switch between the exclusive and share modes. When no GPU-accelerated application runs on the node, the system cannot identify the mode that is enabled for the node.
	NVIDIA Driver Version	The version of the GPU driver installed on the node.
	Allocated GPUs	The number of allocated GPUs and the total number of GPUs.
	GPU Utilization	The average GPU utilization of a node, which equals the average of the utilization values of all GPUs on the node.
	Allocated GPU Memory	The ratio of allocated GPU memory to total GPU memory on a node.
	Used GPU Memory	The ratio of occupied GPU memory to total GPU memory on a node.
	Allocated Computing Power	The amount of allocated computing power on a node. This metric is displayed when the share mode is enabled for GPU scheduling and the pods on the node request computing power.
	The Last One XID Error	The most recent XID error that occurred in a GPU on a node.
Utilization	GPU Utilization	The GPU utilization of a node.
	Memory Copy Utilization	The utilization of memory copies of a node.
	Encoder Engine Utilization	The utilization of GPU encoders on a node.
	Decoder Engine Utilization	The utilization of GPU decoders on a node.
Memory and BAR1	GPU Memory Details	The memory information about a GPU. UUID: the UUID of the GPU. GPU Index: the index of the GPU. Model Name: the model of the GPU. Used: the amount of occupied GPU memory. Allocated: the ratio of allocated GPU memory to total GPU memory. Total: the total amount of GPU memory.
	BAR1 Used	The amount of BAR1 memory in use.
	Memory Used	The total amount of occupied GPU memory on a node.
	BAR1 Total	The total amount of BAR1 memory.
Profiling	SM Occupancy	The SM occupancy, which is the ratio of warps resident on an SM to the maximum number of warps that the SM supports.
	SM Active	The percentage of active SMs.
	Tensor Core Engine Active	The percentage of time that a tensor core pipe remains active within a monitoring cycle.
	FP32 Engine Active	The percentage of time that an FP32 pipe remains active within a monitoring cycle.
	FP16 Engine Active	The percentage of time that an FP16 pipe remains active within a monitoring cycle.
	FP64 Engine Active	The percentage of time that an FP64 pipe remains active within a monitoring cycle.
	Graphics Engine Active	The percentage of time that the graphics or compute engine remains active within a monitoring cycle.
	DRAM Active	The memory bandwidth utilization.
	PCIE TX BYTES (Device to Host)	The Peripheral Component Interconnect Express (PCIe) TX rate of GPUs on a node.
	PCIE RX BYTES (Host to Device)	The PCIe RX rate of GPUs on a node.
	NVLINK Bandwidth Total	The total NVLink bandwidth for transmitting and receiving data.
	NVLINK TX/RX BYTES	The NVLink TX or RX rate.
GPU Process	GPU Process Details	The information about a GPU process on a node. Pod Namespace: the namespace to which the pod of the GPU process belongs. Pod Name: the pod name of the GPU process. Container Name: the container name of the GPU process. Allocate Mode: the GPU scheduling mode used by the pod of the GPU process to request GPU resources. Pods can request GPU resources in exclusive or share mode. Process Id: the ID of the GPU process. Process Name: the name of the GPU process. Process Type: the type of GPU process. Valid values: C (compute) and G (graphics). GPU Index: the GPU to which the GPU process is scheduled. Used Memory: the amount of GPU memory occupied by the GPU process. SM Utilization: the SM utilization of the GPU process. Memory Copy Utilization: the utilization of memory copies. Decode Utilization: the utilization of GPU decoders. Encode Utilization: the utilization of GPU encoders.
Temperature and Energy	Power Usage	The GPU power of a node.
	Total Energy Consumption	The total amount of energy consumed by the GPUs since the driver was loaded. Unit: joules.
	Memory Temperature	The GPU memory temperature of a node.
	GPU Temperature	The temperature of GPU compute units on a node.
Clock	SM CLOCK	The SM clock frequency.
	Memory Clock	The memory clock frequency.
	APP SM Clock	The SM clock frequency of an application.
	APP Memory Clock	The memory clock frequency of an application.
	Video Clock	The video clock frequency.
	Clock Throttle Reasons	The reason for clock throttling.
Retired Pages	Retired Pages (Single-bit Errors)	The number of pages retired due to single-bit errors.
	Retired Pages (Double-bit Errors)	The number of pages retired due to double-bit errors.
Violation	POWER VIOLATION	The duration of violations caused by the power upper limit. Unit: microseconds.
	THERMAL VIOLATION	The duration of violations caused by the temperature upper limit. Unit: microseconds.
	BOARD RELIABILITY VIOLATION	The duration of violations caused by the circuit board reliability limit. Unit: microseconds.
	LOW UTIL VIOLATION	The duration of violations caused by low utilization. Unit: microseconds.
	SYNC BOOST VIOLATION	The duration of violations caused by the synchronization boost limit. Unit: microseconds.
	BOARD LIMIT VIOLATION	The duration of violations caused by the circuit board limit. Unit: microseconds.
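
Because the violation panels report time in microseconds, it can help to express a reading as a share of the sampling interval. The following Python sketch assumes that the violation value is a cumulative counter sampled at two points in time and that the counter can reset to zero, for example after a driver reload. Both assumptions, as well as the function name `violation_fraction` and its parameters, are hypothetical and not taken from this topic.

```python
def violation_fraction(prev_us: int, curr_us: int, interval_s: float) -> float:
    """Return the fraction of the interval spent in violation, in [0, 1].

    prev_us and curr_us are two readings of an assumed cumulative
    violation counter (microseconds); interval_s is the time between
    the two readings in seconds.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta_us = curr_us - prev_us
    if delta_us < 0:
        # Assumed counter reset: count only the time accumulated since then.
        delta_us = curr_us
    # Convert the interval to microseconds and clamp the result to 1.0.
    return min(1.0, delta_us / (interval_s * 1_000_000))
```

For example, under these assumptions, a counter that grows by 3,000,000 microseconds over a 30-second interval corresponds to a violation lasting about 10% of that interval.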