GPU monitoring uses an architecture that consists of an exporter, Prometheus, and Grafana to provide comprehensive GPU observability. You can use the monitoring metrics from the GPU Exporter for Container Service to build Grafana dashboards. This topic describes the GPU monitoring metrics.
Metric billing
GPU monitoring uses the GPU Exporter, which is compatible with the monitoring metrics provided by the open source DCGM Exporter. The following GPU monitoring metrics are basic metrics. No additional fees are charged for using these metrics in Prometheus. If you use other custom metrics, additional fees are charged. For more information about the billing policy, see Billing overview.
Metrics
DCGM metrics
You can filter DCGM-related metrics using the following resource dimensions:
namespace="{{pod_namespace}}"
pod="{{pod_name}}"
Hostname="{{pod_name}}"
NodeName="cn-wulanchabu-c.cr-xxx" (For GPU-HPN pods only)
UUID="GPU-example-uuid-abcd"
device="nvidia0"
gpu="0"
modelName="example-model"
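The dimensions above map directly to label matchers in a PromQL selector. As a minimal sketch (the namespace, pod name, and GPU index below are hypothetical, and any Prometheus client or the `/api/v1/query` HTTP endpoint could consume the resulting string):

```python
# Build a PromQL selector that filters a DCGM metric by the resource
# dimensions listed above. Label names follow the exporter's output.
def build_selector(metric: str, **labels: str) -> str:
    matchers = ",".join(f'{key}="{value}"' for key, value in labels.items())
    return f"{metric}{{{matchers}}}" if matchers else metric

# Example: GPU utilization for one GPU of one pod.
query = build_selector(
    "DCGM_FI_DEV_GPU_UTIL",
    namespace="ml-team",   # hypothetical namespace
    pod="train-worker-0",  # hypothetical pod name
    gpu="0",
)
print(query)
# -> DCGM_FI_DEV_GPU_UTIL{namespace="ml-team",pod="train-worker-0",gpu="0"}
```

The same pattern works for any metric in the tables below; omit labels you do not want to filter on.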
Metric dimension | Metric name | Type | Unit | Description |
GPU resource metrics | DCGM_FI_DEV_GPU_UTIL | Gauge | % | The GPU utilization. This is the percentage of time that one or more kernel functions are in an Active state during a sample period. The sample period is 1 second or 1/6 of a second, depending on the GPU product. This metric shows that a kernel function is using the GPU, but does not show specific usage details. |
| DCGM_FI_DEV_FB_USED | Gauge | MiB | The amount of used frame buffer (video memory). |
| DCGM_FI_DEV_FB_TOTAL | Gauge | MiB | The total amount of frame buffer (video memory). |
| DCGM_FI_DEV_ENC_UTIL | Gauge | % | The encoder utilization. |
| DCGM_FI_DEV_DEC_UTIL | Gauge | % | The decoder utilization. |
| DCGM_FI_DEV_MEM_COPY_UTIL | Gauge | % | The memory bandwidth utilization. For example, the maximum memory bandwidth of an NVIDIA V100 GPU is 900 GB/s. If the current memory bandwidth is 450 GB/s, the memory bandwidth utilization is 50%. |
Profiling | DCGM_FI_PROF_SM_ACTIVE | Gauge | % | The percentage of time that at least one warp is active on a Streaming Multiprocessor (SM) over a time interval. The value is the average for all SMs and is not sensitive to the number of threads per block. A warp is active after it is scheduled and its resources are allocated. It can be in a computing state or a non-computing state, such as waiting for a memory request. A value below 0.5 indicates inefficient GPU use. A value above 0.8 is necessary for high efficiency. For example, assume a GPU has N SMs: A kernel function uses N thread blocks to run on all SMs for the entire time interval. The value is 1 (100%). A kernel function runs N/5 thread blocks during the time interval. The value is 0.2. A kernel function uses N thread blocks but runs for only 1/5 of the time interval. The value is 0.2. |
| DCGM_FI_PROF_SM_OCCUPANCY | Gauge | % | The ratio of resident warps on an SM to the maximum number of warps that the SM can support over a time interval. The value is the average for all SMs over the time interval. Higher occupancy does not always mean higher GPU utilization. For workloads limited by GPU memory bandwidth (DCGM_FI_PROF_DRAM_ACTIVE), higher occupancy indicates more effective GPU use. |
| DCGM_FI_PROF_DRAM_ACTIVE | Gauge | % | The fraction of cycles that the device memory is busy sending or receiving data. This is also known as memory bandwidth utilization. The value is an average over the time interval, not an instantaneous value. A higher value indicates higher device memory utilization. A value of 1 (100%) means a DRAM instruction is executed every cycle during the time interval. In practice, the maximum achievable peak is about 0.8 (80%). A value of 0.2 (20%) means that 20% of the cycles in the time interval were spent reading from or writing to device memory. |
| | Counter | B/s | The rate of data transmitted or received over NVLink, excluding protocol headers. The value is an average over the time interval, not an instantaneous rate. For example, if 1 GB of data is transferred in 1 second, the rate is 1 GB/s, regardless of whether the transfer is at a constant rate or in a burst. The theoretical maximum NVLink Gen2 bandwidth is 25 GB/s per link in each direction. |
| | Counter | B/s | The rate of data transmitted or received over the PCIe bus, including protocol headers and data payload. The value is an average over the time interval, not an instantaneous rate. For example, if 1 GB of data is transferred in 1 second, the rate is 1 GB/s, regardless of whether the transfer is at a constant rate or in a burst. The theoretical maximum PCIe Gen3 bandwidth is 985 MB/s per lane. |
| DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | Gauge | % | The fraction of cycles that the Tensor (HMMA/IMMA) pipe is active. The value is an average over the time interval, not an instantaneous value. A higher value indicates higher Tensor Core utilization. A value of 1 (100%) means a Tensor instruction is issued every other instruction cycle during the time interval (one instruction takes two cycles to complete). A value of 0.2 (20%) can mean, for example, that the pipe was fully active for 20% of the interval, 20% active for the whole interval, or some combination in between. |
Frequency (Clock) | DCGM_FI_DEV_SM_CLOCK | Gauge | MHz | The SM clock frequency. |
GPU exceptions/XID errors | DCGM_FI_DEV_NVSWITCH_FATAL_ERRORS | Gauge | Error code | NVSwitch fatal error information. The value is the SXid error code. |
| DCGM_FI_DEV_ROW_REMAP_FAILURE | Gauge | - | A row remap error occurred. |
| DCGM_FI_DEV_ROW_REMAP_PENDING | Gauge | - | A row remap is pending. |
Temperature & Power | DCGM_FI_DEV_GPU_TEMP | Gauge | ℃ | The GPU temperature. |
| DCGM_FI_DEV_MEMORY_TEMP | Gauge | ℃ | The memory temperature. |
| DCGM_FI_DEV_POWER_USAGE | Gauge | W | The power consumption. |
Retired Pages | DCGM_FI_DEV_RETIRED_SBE | Gauge | - | The number of retired pages due to single-bit errors (SBE). |
| DCGM_FI_DEV_RETIRED_DBE | Gauge | - | The number of retired pages due to double-bit errors (DBE). |
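Several percentage metrics in the table are ratios of two raw values, which is useful to keep in mind when building dashboard panels. A minimal sketch of the arithmetic (the 16 GiB frame-buffer figure below is hypothetical; the 900 GB/s peak is the V100 figure from the DCGM_FI_DEV_MEM_COPY_UTIL row):

```python
# Derive percentage utilizations from raw gauge values.
def fb_utilization(fb_used_mib: float, fb_total_mib: float) -> float:
    """Frame-buffer (video memory) usage as a percentage,
    from DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_TOTAL."""
    return 100.0 * fb_used_mib / fb_total_mib

def bandwidth_utilization(current_gbps: float, peak_gbps: float) -> float:
    """Memory bandwidth usage as a percentage, mirroring the
    DCGM_FI_DEV_MEM_COPY_UTIL example in the table."""
    return 100.0 * current_gbps / peak_gbps

# The V100 example from the table: 450 GB/s of a 900 GB/s peak is 50%.
print(bandwidth_utilization(450, 900))  # 50.0
# A hypothetical 16 GiB card with 8 GiB in use.
print(fb_utilization(8192, 16384))      # 50.0
```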
RDMA metrics
You can filter RDMA-related metrics using the following resource dimensions:
app="nusa-exporter"
hostname="{{pod_name}}"
ip="172.16.17.114"
namespace="{{pod_namespace}}"
node="{{virtual-kubelet-nodename}}"
pod="{{pod_name}}"
Metric name | Type | Unit | Description |
| Gauge | bytes | The instantaneous outbound/inbound traffic of the pod RDMA network. |
| Counter | bytes | The cumulative outbound/inbound traffic of the pod RDMA network. |
| Gauge | packets | The instantaneous number of outbound/inbound packets on the pod RDMA network. |
| Counter | packets | The cumulative number of outbound/inbound packets on the pod RDMA network. |
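The cumulative Counter metrics above only increase, so dashboards typically convert them into per-second rates from consecutive samples (in PromQL this is what `rate()` does). A minimal sketch of that conversion, including the counter-reset case that occurs when the exporter restarts:

```python
# Derive a per-second rate from two samples of a monotonic counter,
# such as the cumulative RDMA byte or packet metrics above.
def counter_rate(prev_value: float, prev_ts: float,
                 curr_value: float, curr_ts: float) -> float:
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be strictly ordered in time")
    delta = curr_value - prev_value
    if delta < 0:
        # Counter reset (e.g. exporter restart): the counter started
        # again from zero, so only the new value has accumulated.
        delta = curr_value
    return delta / elapsed

# 1 GiB transferred over 2 seconds -> 512 MiB/s.
print(counter_rate(0, 0.0, 1 << 30, 2.0))  # 536870912.0
```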