GPU monitoring uses an architecture that consists of an exporter, Prometheus, and Grafana to provide comprehensive GPU observability. You can use the monitoring metrics from the GPU Exporter for Container Service to build Grafana dashboards. This topic describes the GPU monitoring metrics.
Metric billing
GPU monitoring uses the GPU Exporter, which is compatible with the monitoring metrics provided by the open source DCGM Exporter. The following GPU monitoring metrics are basic metrics. No additional fees are charged for using these metrics in Prometheus. If you use other custom metrics, additional fees are charged. For more information about the billing policy, see Billing overview.
Metrics
DCGM metrics
You can filter DCGM-related metrics using the following resource dimensions:
namespace="{{pod_namespace}}"
pod="{{pod_name}}"
Hostname="{{pod_name}}"
NodeName="cn-wulanchabu-c.cr-xxx" (For GPU-HPN pods only)
UUID="GPU-example-uuid-abcd"
device="nvidia0"
gpu="0"
modelName="example-model"
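The dimensions above map directly to label matchers in a PromQL selector. As a minimal sketch (the namespace, pod name, and GPU index below are hypothetical, and any Prometheus client or the `/api/v1/query` HTTP endpoint could consume the resulting string):

```python
# Build a PromQL selector that filters a DCGM metric by the resource
# dimensions listed above. Label names follow the exporter's output.
def build_selector(metric: str, **labels: str) -> str:
    matchers = ",".join(f'{key}="{value}"' for key, value in labels.items())
    return f"{metric}{{{matchers}}}" if matchers else metric

# Example: GPU utilization for one GPU of one pod.
query = build_selector(
    "DCGM_FI_DEV_GPU_UTIL",
    namespace="ml-team",   # hypothetical namespace
    pod="train-worker-0",  # hypothetical pod name
    gpu="0",
)
print(query)
# -> DCGM_FI_DEV_GPU_UTIL{namespace="ml-team",pod="train-worker-0",gpu="0"}
```

The same pattern works for any metric in the tables below; omit labels you do not want to filter on.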
Metric dimension | Metric name | Type | Unit | Description |
GPU resource metrics | DCGM_FI_DEV_GPU_UTIL | Gauge | % | The GPU utilization. This is the percentage of time that one or more kernel functions are in an Active state during a sample period. The sample period is 1 second or 1/6 of a second, depending on the GPU product. This metric shows that a kernel function is using the GPU, but does not show specific usage details. |
| DCGM_FI_DEV_FB_USED | Gauge | MiB | The amount of used frame buffer (video memory). |
| DCGM_FI_DEV_FB_TOTAL | Gauge | MiB | The total amount of frame buffer (video memory). |
| DCGM_FI_DEV_ENC_UTIL | Gauge | % | The encoder utilization. |
| DCGM_FI_DEV_DEC_UTIL | Gauge | % | The decoder utilization. |
| DCGM_FI_DEV_MEM_COPY_UTIL | Gauge | % | The memory bandwidth utilization. For example, the maximum memory bandwidth of an NVIDIA V100 GPU is 900 GB/s. If the current memory bandwidth is 450 GB/s, the memory bandwidth utilization is 50%. |
Profiling | DCGM_FI_PROF_SM_ACTIVE | Gauge | % | The percentage of time that at least one warp is active on a Streaming Multiprocessor (SM) over a time interval. The value is the average for all SMs and is not sensitive to the number of threads per block. A warp is active after it is scheduled and its resources are allocated. It can be in a computing state or a non-computing state, such as waiting for a memory request. A value below 0.5 indicates inefficient GPU use. A value above 0.8 is necessary for high efficiency. For example, assume a GPU has N SMs: A kernel function uses N thread blocks to run on all SMs for the entire time interval. The value is 1 (100%). A kernel function runs N/5 thread blocks during the time interval. The value is 0.2. A kernel function uses N thread blocks but runs for only 1/5 of the time interval. The value is 0.2. |
| DCGM_FI_PROF_SM_OCCUPANCY | Gauge | % | The ratio of resident warps on an SM to the maximum number of warps that the SM can support over a time interval. The value is the average for all SMs over the time interval. Higher occupancy does not always mean higher GPU utilization. For workloads limited by GPU memory bandwidth (DCGM_FI_PROF_DRAM_ACTIVE), higher occupancy indicates more effective GPU use. |
| DCGM_FI_PROF_DRAM_ACTIVE | Gauge | % | The fraction of cycles that the device memory is busy sending or receiving data. This is also known as memory bandwidth utilization. The value is an average over the time interval, not an instantaneous value. A higher value indicates higher device memory utilization. A value of 1 (100%) means a DRAM instruction is executed every cycle during the time interval. In practice, the maximum achievable peak is about 0.8 (80%). A value of 0.2 (20%) means that 20% of the cycles in the time interval were spent reading from or writing to device memory. |
| | Counter | B/s | The rate of data transmitted or received over NVLink, excluding protocol headers. The value is an average over the time interval, not an instantaneous rate. For example, if 1 GB of data is transferred in 1 second, the rate is 1 GB/s, regardless of whether the transfer is at a constant rate or in a burst. The theoretical maximum NVLink Gen2 bandwidth is 25 GB/s per link in each direction. |
| | Counter | B/s | The rate of data transmitted or received over the PCIe bus, including protocol headers and data payload. The value is an average over the time interval, not an instantaneous rate. For example, if 1 GB of data is transferred in 1 second, the rate is 1 GB/s, regardless of whether the transfer is at a constant rate or in a burst. The theoretical maximum PCIe Gen3 bandwidth is 985 MB/s per lane. |
| DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | Gauge | % | The fraction of cycles that the Tensor (HMMA/IMMA) pipe is active. The value is an average over the time interval, not an instantaneous value. A higher value indicates higher Tensor Core utilization. A value of 1 (100%) means a Tensor instruction is issued every other instruction cycle during the time interval (one instruction takes two cycles to complete). A value of 0.2 (20%) can mean, for example, that the pipe was fully active for 20% of the interval, 20% active for the whole interval, or some combination in between. |
Frequency (Clock) | DCGM_FI_DEV_SM_CLOCK | Gauge | MHz | The SM clock frequency. |
GPU exceptions/XID errors | DCGM_FI_DEV_NVSWITCH_FATAL_ERRORS | Gauge | Error code | NVSwitch fatal error information. The value is the SXid error code. |
| DCGM_FI_DEV_ROW_REMAP_FAILURE | Gauge | - | A row remap error occurred. |
| DCGM_FI_DEV_ROW_REMAP_PENDING | Gauge | - | A row remap is pending. |
Temperature & Power | DCGM_FI_DEV_GPU_TEMP | Gauge | ℃ | The GPU temperature. |
| DCGM_FI_DEV_MEMORY_TEMP | Gauge | ℃ | The memory temperature. |
| DCGM_FI_DEV_POWER_USAGE | Gauge | W | The power consumption. |
Retired Pages | DCGM_FI_DEV_RETIRED_SBE | Gauge | - | The number of retired pages due to single-bit errors (SBE). |
| DCGM_FI_DEV_RETIRED_DBE | Gauge | - | The number of retired pages due to double-bit errors (DBE). |
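Several percentage metrics in the table are ratios of two raw values, which is useful to keep in mind when building dashboard panels. A minimal sketch of the arithmetic (the 16 GiB frame-buffer figure below is hypothetical; the 900 GB/s peak is the V100 figure from the DCGM_FI_DEV_MEM_COPY_UTIL row):

```python
# Derive percentage utilizations from raw gauge values.
def fb_utilization(fb_used_mib: float, fb_total_mib: float) -> float:
    """Frame-buffer (video memory) usage as a percentage,
    from DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_TOTAL."""
    return 100.0 * fb_used_mib / fb_total_mib

def bandwidth_utilization(current_gbps: float, peak_gbps: float) -> float:
    """Memory bandwidth usage as a percentage, mirroring the
    DCGM_FI_DEV_MEM_COPY_UTIL example in the table."""
    return 100.0 * current_gbps / peak_gbps

# The V100 example from the table: 450 GB/s of a 900 GB/s peak is 50%.
print(bandwidth_utilization(450, 900))  # 50.0
# A hypothetical 16 GiB card with 8 GiB in use.
print(fb_utilization(8192, 16384))      # 50.0
```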
RDMA metrics
You can filter RDMA-related metrics using the following resource dimensions:
app="nusa-exporter"
hostname="{{pod_name}}"
ip="172.16.17.114"
namespace="{{pod_namespace}}"
node="{{virtual-kubelet-nodename}}"
pod="{{pod_name}}"
Metric name | Type | Unit | Description |
| Gauge | bytes | The instantaneous outbound/inbound traffic of the pod RDMA network. |
| Counter | bytes | The cumulative outbound/inbound traffic of the pod RDMA network. |
| Gauge | packets | The instantaneous number of outbound/inbound packets on the pod RDMA network. |
| Counter | packets | The cumulative number of outbound/inbound packets on the pod RDMA network. |
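The cumulative Counter metrics above only increase, so dashboards typically convert them into per-second rates from consecutive samples (in PromQL this is what `rate()` does). A minimal sketch of that conversion, including the counter-reset case that occurs when the exporter restarts:

```python
# Derive a per-second rate from two samples of a monotonic counter,
# such as the cumulative RDMA byte or packet metrics above.
def counter_rate(prev_value: float, prev_ts: float,
                 curr_value: float, curr_ts: float) -> float:
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be strictly ordered in time")
    delta = curr_value - prev_value
    if delta < 0:
        # Counter reset (e.g. exporter restart): the counter started
        # again from zero, so only the new value has accumulated.
        delta = curr_value
    return delta / elapsed

# 1 GiB transferred over 2 seconds -> 512 MiB/s.
print(counter_rate(0, 0.0, 1 << 30, 2.0))  # 536870912.0
```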