Container Compute Service: ACS GPU pod monitoring metrics

Last Updated: Sep 03, 2025

GPU monitoring uses an architecture that consists of Exporter, Prometheus, and Grafana to provide comprehensive GPU observability. You can use the monitoring metrics from the GPU Exporter for Container Service to build Grafana dashboards. This topic describes the GPU monitoring metrics.

Metric billing

GPU monitoring uses the GPU Exporter, which is compatible with the monitoring metrics provided by the open source DCGM Exporter. The GPU monitoring metrics described in this topic are basic metrics, and no additional fees are charged for using them in Prometheus. If you use other custom metrics, additional fees are charged. For more information about the billing policy, see Billing overview.

Metrics

DCGM metrics

You can filter DCGM-related metrics using the following resource dimensions (an example query follows the list):

  • namespace="{{pod_namespace}}"

  • pod="{{pod_name}}"

  • Hostname="{{pod_name}}"

  • NodeName="cn-wulanchabu-c.cr-xxx" (For GPU-HPN pods only)

  • UUID="GPU-example-uuid-abcd"

  • device="nvidia0"

  • gpu="0"

  • modelName="example-model"
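For example, these dimensions can be used as PromQL label selectors. The following Python sketch queries the Prometheus HTTP API for the GPU utilization of a single pod; the Prometheus endpoint URL, namespace, and pod name are placeholder values that you must replace with your own.

```python
import requests

# Placeholder values (assumptions): replace with your Prometheus endpoint and pod identity.
PROMETHEUS_URL = "http://prometheus.example.com:9090"
NAMESPACE = "default"
POD_NAME = "gpu-pod-example"

# Filter DCGM_FI_DEV_GPU_UTIL by the namespace and pod label dimensions listed above.
query = f'DCGM_FI_DEV_GPU_UTIL{{namespace="{NAMESPACE}", pod="{POD_NAME}"}}'

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()

# Each result carries the label set (gpu, device, modelName, ...) and the current value.
for series in resp.json()["data"]["result"]:
    labels = series["metric"]
    _, value = series["value"]
    print(f'gpu={labels.get("gpu")} device={labels.get("device")} util={value}%')
```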

The following metrics are grouped by metric dimension. Each entry lists the metric name, followed by its type and unit in parentheses, and then its description.

GPU resource metrics

DCGM_FI_DEV_GPU_UTIL (Gauge, unit: %)

The GPU utilization: the percentage of time during a sample period in which one or more kernel functions are in an Active state. The sample period is between 1/6 of a second and 1 second, depending on the GPU product.

This metric shows that a kernel function is using the GPU, but does not provide details about how the GPU is being used.

DCGM_FI_DEV_FB_USED (Gauge, unit: MiB)

The amount of used frame buffer (video memory).

DCGM_FI_DEV_FB_TOTAL (Gauge, unit: MiB)

The total amount of frame buffer (video memory).
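The two frame buffer metrics are often combined into a GPU memory utilization percentage. The sketch below is a minimal example, assuming a reachable Prometheus endpoint (the URL is a placeholder); the PromQL expression simply divides used by total frame buffer.

```python
import requests

PROMETHEUS_URL = "http://prometheus.example.com:9090"  # placeholder endpoint

# GPU memory utilization in percent: used frame buffer / total frame buffer * 100,
# assuming both series carry identical label sets so default vector matching pairs them.
MEMORY_UTIL_QUERY = "100 * DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL"

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query",
                    params={"query": MEMORY_UTIL_QUERY}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    m = series["metric"]
    print(f'{m.get("pod")} gpu={m.get("gpu")}: {float(series["value"][1]):.1f}% memory used')
```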

DCGM_FI_DEV_ENC_UTIL (Gauge, unit: %)

The encoder utilization.

DCGM_FI_DEV_DEC_UTIL (Gauge, unit: %)

The decoder utilization.

DCGM_FI_DEV_MEM_COPY_UTIL (Gauge, unit: %)

The memory bandwidth utilization.

For example, the maximum memory bandwidth of an NVIDIA V100 GPU is 900 GB/s. If the current memory bandwidth is 450 GB/s, the memory bandwidth utilization is 50%.

Profiling

DCGM_FI_PROF_SM_ACTIVE (Gauge, unit: %)

The percentage of time during the interval in which at least one warp is active on a Streaming Multiprocessor (SM). The value is the average across all SMs and is not sensitive to the number of threads per block.

A warp is active after it is scheduled and its resources are allocated. It can be in a computing state or a non-computing state, such as waiting for a memory request.

A value below 0.5 indicates inefficient GPU use, and a value above 0.8 is necessary for high efficiency.

For example, assume a GPU has N SMs:

  • A kernel function uses N thread blocks and runs on all SMs for the entire time interval: the value is 1 (100%).

  • A kernel function uses N/5 thread blocks and runs for the entire time interval: the value is 0.2 (20%).

  • A kernel function uses N thread blocks but runs for only 1/5 of the time interval: the value is 0.2 (20%).
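The three cases above follow from treating the value as occupied SM-time divided by total SM-time. The short sketch below is only a back-of-the-envelope model of that reasoning (the SM count of 80 is an arbitrary example), not an implementation of the DCGM measurement itself.

```python
# Back-of-the-envelope model of DCGM_FI_PROF_SM_ACTIVE:
# the fraction of (SMs x time) in which at least one warp is active.
def sm_active(sms_with_active_warps: int, total_sms: int, fraction_of_interval: float) -> float:
    return (sms_with_active_warps * fraction_of_interval) / total_sms

N = 80  # arbitrary example SM count

# N thread blocks keep all SMs busy for the whole interval -> 1.0 (100%).
print(sm_active(N, N, 1.0))
# N/5 thread blocks keep one fifth of the SMs busy for the whole interval -> 0.2 (20%).
print(sm_active(N // 5, N, 1.0))
# N thread blocks keep all SMs busy for only one fifth of the interval -> 0.2 (20%).
print(sm_active(N, N, 0.2))
```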

DCGM_FI_PROF_SM_OCCUPANCY (Gauge, unit: %)

The ratio of resident warps on an SM to the maximum number of warps that the SM can support, averaged across all SMs over the time interval.

Higher occupancy does not always mean higher GPU utilization. For workloads limited by GPU memory bandwidth (DCGM_FI_PROF_DRAM_ACTIVE), higher occupancy indicates more effective GPU use.

DCGM_FI_PROF_DRAM_ACTIVE (Gauge, unit: %)

The fraction of cycles during which the device memory interface is busy sending or receiving data. This is also known as memory bandwidth utilization. The value is an average over the time interval, not an instantaneous value; a higher value indicates higher device memory utilization.

A value of 1 (100%) means that a DRAM instruction is executed in every cycle of the time interval. In practice, the maximum achievable value is about 0.8 (80%). A value of 0.2 (20%) means that 20% of the cycles in the time interval were spent reading from or writing to device memory.

DCGM_FI_PROF_NVLINK_RX_BYTES, DCGM_FI_PROF_NVLINK_TX_BYTES (Counter, unit: B/s)

The rate of data received or transmitted over NVLink, excluding protocol headers. The value is an average over the time interval, not an instantaneous value: for example, if 1 GB of data is transferred in 1 second, the rate is 1 GB/s, regardless of whether the transfer occurs at a constant rate or in a burst. The theoretical maximum NVLink Gen2 bandwidth is 25 GB/s per link in each direction.

DCGM_FI_PROF_PCIE_RX_BYTES, DCGM_FI_PROF_PCIE_TX_BYTES (Counter, unit: B/s)

The rate of data received or transmitted over the PCIe bus, including protocol headers and data payload. The value is an average over the time interval, not an instantaneous value: for example, if 1 GB of data is transferred in 1 second, the rate is 1 GB/s, regardless of whether the transfer occurs at a constant rate or in a burst. The theoretical maximum PCIe Gen3 bandwidth is 985 MB/s per lane.

DCGM_FI_PROF_PIPE_TENSOR_ACTIVE (Gauge, unit: %)

The fraction of cycles during which the Tensor (HMMA/IMMA) pipe is active. The value is an average over the time interval, not an instantaneous value; a higher value indicates higher Tensor Core utilization.

A value of 1 (100%) means that a Tensor instruction is issued every other cycle for the entire time interval (one instruction takes two cycles to complete).

A value of 0.2 (20%) can mean one of the following:

  • The Tensor Cores on 20% of the SMs run at 100% utilization for the entire time interval.

  • The Tensor Cores on 100% of the SMs run at 20% utilization for the entire time interval.

  • The Tensor Cores on 100% of the SMs run at 100% utilization for 1/5 of the time interval.

  • Other combinations.

Frequency (Clock)

DCGM_FI_DEV_SM_CLOCK (Gauge, unit: MHz)

The SM clock frequency.

GPU exceptions/XID errors

DCGM_FI_DEV_NVSWITCH_FATAL_ERRORS (Gauge, unit: error code)

NVSwitch fatal error information. The value is the SXid error code.

DCGM_FI_DEV_ROW_REMAP_FAILURE (Gauge, no unit)

Indicates that a row remap error occurred.

DCGM_FI_DEV_ROW_REMAP_PENDING (Gauge, no unit)

Indicates that a row remap is pending.
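The exception metrics above lend themselves to simple automated checks. The sketch below queries Prometheus for GPUs that report a pending or failed row remap; the endpoint URL is a placeholder, and the query is an assumption about how you might surface these conditions, not an official alerting rule.

```python
import requests

PROMETHEUS_URL = "http://prometheus.example.com:9090"  # placeholder endpoint

# GPUs whose row-remap metrics are non-zero (pending or failed row remaps).
EXCEPTION_QUERY = "DCGM_FI_DEV_ROW_REMAP_PENDING > 0 or DCGM_FI_DEV_ROW_REMAP_FAILURE > 0"

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query",
                    params={"query": EXCEPTION_QUERY}, timeout=10)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    m = series["metric"]
    print(f'row remap issue: namespace={m.get("namespace")} pod={m.get("pod")} '
          f'gpu={m.get("gpu")} metric={m.get("__name__")}')
```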

Temperature & Power

DCGM_FI_DEV_GPU_TEMP (Gauge, unit: °C)

The GPU temperature.

DCGM_FI_DEV_MEMORY_TEMP (Gauge, unit: °C)

The memory temperature.

DCGM_FI_DEV_POWER_USAGE (Gauge, unit: W)

The power consumption.

Retired Pages

DCGM_FI_DEV_RETIRED_SBE (Gauge, no unit)

The number of retired pages due to single-bit errors (SBE).

DCGM_FI_DEV_RETIRED_DBE (Gauge, no unit)

The number of retired pages due to double-bit errors (DBE).

RDMA metrics

You can filter RDMA-related metrics using the following resource dimensions (an example query follows the list):

  • app="nusa-exporter"

  • hostname="{{pod_name}}"

  • ip="172.16.17.114"

  • namespace="{{pod_namespace}}"

  • node="{{virtual-kubelet-nodename}}"

  • pod="{{pod_name}}"
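As with the DCGM metrics, these dimensions work as PromQL label selectors. The sketch below reads the instantaneous outbound RDMA traffic of one pod; the Prometheus endpoint, namespace, and pod name are placeholders.

```python
import requests

PROMETHEUS_URL = "http://prometheus.example.com:9090"  # placeholder endpoint
NAMESPACE = "default"          # placeholder namespace
POD_NAME = "rdma-pod-example"  # placeholder pod name

# Instantaneous outbound RDMA traffic of a single pod.
query = f'rdma_service_monitor_tx_bytes_rate{{namespace="{NAMESPACE}", pod="{POD_NAME}"}}'

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(f'{series["metric"].get("pod")}: {series["value"][1]} bytes/s outbound')
```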

Each entry lists the metric names, followed by their type and unit in parentheses, and then their description.

rdma_service_monitor_tx_bytes_rate, rdma_service_monitor_rx_bytes_rate (Gauge, unit: bytes/s)

The instantaneous outbound/inbound traffic of the pod RDMA network.

rdma_service_monitor_tx_bytes, rdma_service_monitor_rx_bytes (Counter, unit: bytes)

The cumulative outbound/inbound traffic of the pod RDMA network.

rdma_service_monitor_tx_packets_rate, rdma_service_monitor_rx_packets_rate (Gauge, unit: packets/s)

The instantaneous outbound/inbound packet rate of the pod RDMA network.

rdma_service_monitor_tx_packets, rdma_service_monitor_rx_packets (Counter, unit: packets)

The cumulative number of outbound/inbound packets on the pod RDMA network.
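Because the cumulative metrics are counters, an average rate over a chosen window can also be derived from them in PromQL with rate(), which smooths short bursts compared to the instantaneous *_rate gauges. The sketch below is an example under the same placeholder-endpoint assumption; the 5-minute window is an arbitrary choice.

```python
import requests

PROMETHEUS_URL = "http://prometheus.example.com:9090"  # placeholder endpoint

# Average outbound RDMA throughput per pod over the last 5 minutes,
# derived from the cumulative byte counter rather than the instantaneous gauge.
query = "sum by (namespace, pod) (rate(rdma_service_monitor_tx_bytes[5m]))"

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    m = series["metric"]
    print(f'{m.get("namespace")}/{m.get("pod")}: {float(series["value"][1]):.0f} bytes/s tx (5m average)')
```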