GPU monitoring 2.0 combines a GPU exporter, Prometheus, and Grafana to build a GPU monitoring system that supports diverse scenarios. You can build Grafana dashboards from the GPU exporter metrics to monitor your Container Service for Kubernetes (ACK) clusters. This topic describes the metrics supported by GPU monitoring 2.0.

Metrics

The GPU exporter used by GPU monitoring 2.0 is compatible with the metrics provided by the DCGM exporter. The GPU exporter also provides custom metrics to meet the requirements of specific scenarios. For more information about the DCGM exporter, see DCGM exporter.

The GPU metrics used by GPU monitoring 2.0 fall into two groups: metrics supported by the DCGM exporter, and custom metrics.

Metrics supported by the DCGM exporter

Utilization metrics

Metric Type Unit Description
DCGM_FI_DEV_GPU_UTIL Gauge % The GPU utilization within a cycle of 1 second or 1/6 second, depending on the GPU model. The value is the percentage of the cycle during which one or more kernel functions are active.

This metric indicates only that one or more kernel functions are occupying GPU resources. It does not show how heavily those resources are used.

DCGM_FI_DEV_MEM_COPY_UTIL Gauge % The memory bandwidth utilization.

For example, the maximum memory bandwidth of GPU V100 is 900 GB/second. If the current memory bandwidth usage is 450 GB/second, the memory bandwidth utilization is 50%.

DCGM_FI_DEV_ENC_UTIL Gauge % The encoder utilization.
DCGM_FI_DEV_DEC_UTIL Gauge % The decoder utilization.
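
The memory bandwidth example given for DCGM_FI_DEV_MEM_COPY_UTIL can be checked with a one-line calculation. The following is a minimal sketch; the function name is hypothetical and is not part of the GPU exporter.

```python
def bandwidth_utilization_percent(current_gb_per_s: float, peak_gb_per_s: float) -> float:
    """Return the current bandwidth as a percentage of the peak bandwidth."""
    if peak_gb_per_s <= 0:
        raise ValueError("peak bandwidth must be positive")
    return 100.0 * current_gb_per_s / peak_gb_per_s


# V100 figures from the example above: 450 GB/second used, 900 GB/second peak.
print(bandwidth_utilization_percent(450, 900))  # 50.0
```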

Memory metrics

Metric Type Unit Description
DCGM_FI_DEV_FB_FREE Gauge MiB The amount of free framebuffer memory.
Note: Framebuffer memory is also known as GPU memory.
DCGM_FI_DEV_FB_USED Gauge MiB The amount of occupied framebuffer memory.

The value of this metric is the same as the value of Memory-Usage returned by the nvidia-smi command.
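
The two framebuffer gauges can be combined into a single usage ratio for a dashboard panel. The following sketch assumes that used plus free memory approximates the total, which ignores any memory reserved by the driver. The function name and the sample values are hypothetical.

```python
def framebuffer_used_ratio(fb_used_mib: float, fb_free_mib: float) -> float:
    """Return used / (used + free) for one GPU, as a value from 0 to 1."""
    total_mib = fb_used_mib + fb_free_mib  # assumption: reserved memory is ignored
    if total_mib <= 0:
        raise ValueError("used + free must be positive")
    return fb_used_mib / total_mib


# Hypothetical sample: 4096 MiB used (DCGM_FI_DEV_FB_USED) and
# 12288 MiB free (DCGM_FI_DEV_FB_FREE).
print(framebuffer_used_ratio(4096, 12288))  # 0.25
```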

Profiling metrics

Metric Type Unit Description
DCGM_FI_PROF_GR_ENGINE_ACTIVE Gauge % The ratio of cycles during which a graphics engine or compute engine remains active.

The value is an average of all graphics engines or compute engines.

A graphics engine or compute engine is active if a graphics context or compute context is bound to the thread and the context is busy.

DCGM_FI_PROF_SM_ACTIVE Gauge % The ratio of cycles during which at least one warp on a streaming multiprocessor (SM) remains active.

The value is an average of all SMs. The value does not vary with the number of warps included in the thread block.

A warp is considered active once it is scheduled and resources are allocated to it, whether or not it is computing. For example, a warp that is waiting for a memory request is still counted as active.

If the value of this metric drops below 0.5, GPU utilization is likely to be low. For high GPU utilization, keep the value above 0.8. A value above 0.8 is necessary for effective GPU use but does not guarantee it.

For example, assume that a GPU has N SMs:
  • If a kernel function uses N thread blocks and runs on all SMs for the entire cycle, the value of this metric is 1 (100%).
  • If a kernel function uses N/5 thread blocks and runs for the entire cycle, the value of this metric is 0.2.
  • If a kernel function uses N thread blocks but runs for only 20% of the cycle, the value of this metric is 0.2.
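
The three cases above follow from averaging activity over both the SMs and the cycle. The following sketch reproduces them under a simplifying assumption that each thread block occupies exactly one SM. The function name and the figure of 80 SMs are hypothetical.

```python
import math


def sm_active(total_sms: int, busy_sms: int, busy_time_fraction: float) -> float:
    """Average SM activity over all SMs and over the cycle (0 to 1)."""
    if total_sms <= 0 or not 0 <= busy_sms <= total_sms:
        raise ValueError("busy_sms must be between 0 and total_sms")
    if not 0.0 <= busy_time_fraction <= 1.0:
        raise ValueError("busy_time_fraction must be between 0 and 1")
    return (busy_sms / total_sms) * busy_time_fraction


N = 80  # hypothetical SM count
print(sm_active(N, N, 1.0))       # N blocks for the entire cycle
print(sm_active(N, N // 5, 1.0))  # N/5 blocks for the entire cycle
print(sm_active(N, N, 0.2))       # N blocks for 20% of the cycle
```

The second and third calls return the same value, which is why this metric alone cannot distinguish a partially occupied GPU from a fully occupied GPU that is idle most of the time.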
DCGM_FI_PROF_SM_OCCUPANCY Gauge % The ratio of the number of warps resident on an SM to the maximum number of warps that the SM supports.

The value is an average of all SMs within a cycle.

A larger value of this metric does not necessarily indicate higher GPU utilization. A larger value indicates higher GPU utilization only for workloads that are limited by GPU memory bandwidth, which you can identify from the DCGM_FI_PROF_DRAM_ACTIVE metric.

DCGM_FI_PROF_PIPE_TENSOR_ACTIVE Gauge % The ratio of cycles during which the tensor (HMMA/IMMA) pipe remains active.

The value is an average calculated within a cycle.

A larger value of this metric indicates higher tensor core utilization.

If the value is 1 (100%), a tensor instruction is issued every other clock cycle throughout the cycle.

If the value of this metric is 0.2 (20%), one of the following conditions may exist:
  • The tensor core utilization of 20% of the SMs within the cycle is 100%.
  • The tensor core utilization of all SMs within the cycle is 20%.
  • The tensor core utilization of all SMs within 20% of the cycle is 100%.
  • Other conditions.
DCGM_FI_PROF_PIPE_FP64_ACTIVE Gauge % The ratio of cycles during which the fp64 (double-precision) pipe remains active.

The value is an average calculated within a cycle.

A larger value of this metric indicates higher fp64 core utilization.

If the value is 1 (100%), an fp64 instruction is executed on every SM every fourth clock cycle throughout the cycle when a Volta GPU is used.

If the value of this metric is 0.2 (20%), one of the following conditions may exist:
  • The fp64 core utilization of 20% of the SMs within the cycle is 100%.
  • The fp64 core utilization of all SMs within the cycle is 20%.
  • The fp64 core utilization of all SMs within 20% of the cycle is 100%.
  • Other conditions.
DCGM_FI_PROF_PIPE_FP32_ACTIVE Gauge % The ratio of cycles during which the Fused Multiply-Add (FMA) operation pipe remains active. FMA operations include FP32 (single-precision) operations and integer operations.

The value is an average calculated within a cycle.

A larger value of this metric indicates higher fp32 core utilization.

If the value is 1 (100%), an fp32 instruction is executed every other clock cycle throughout the cycle when a Volta GPU is used.

If the value of this metric is 0.2 (20%), one of the following conditions may exist:
  • The fp32 core utilization of 20% of the SMs within the cycle is 100%.
  • The fp32 core utilization of all SMs within the cycle is 20%.
  • The fp32 core utilization of all SMs within 20% of the cycle is 100%.
  • Other conditions.
DCGM_FI_PROF_PIPE_FP16_ACTIVE Gauge % The ratio of cycles during which the fp16 (half-precision) pipe remains active.

The value is an average calculated within a cycle.

A larger value of this metric indicates higher fp16 core utilization.

If the value is 1 (100%), an fp16 instruction is executed every other clock cycle throughout the cycle when a Volta GPU is used.

If the value of this metric is 0.2 (20%), one of the following conditions may exist:
  • The fp16 core utilization of 20% of the SMs within the cycle is 100%.
  • The fp16 core utilization of all SMs within the cycle is 20%.
  • The fp16 core utilization of all SMs within 20% of the cycle is 100%.
  • Other conditions.
DCGM_FI_PROF_DRAM_ACTIVE Gauge % The ratio of cycles during which the device memory interface remains active to send or receive data.

The value is an average calculated within a cycle.

A larger value of this metric indicates higher device memory utilization.

If the value is 1 (100%), a DRAM instruction is executed every clock cycle throughout the cycle. In practice, the maximum achievable value is about 0.8 (80%).

If the value of this metric is 0.2 (20%), the device memory interface sends or receives data within 20% of the cycle.

  • DCGM_FI_PROF_PCIE_TX_BYTES
  • DCGM_FI_PROF_PCIE_RX_BYTES
Counter B/s The transmit (TX) rate and the receive (RX) rate of Peripheral Component Interconnect Express (PCIe). The bytes transmitted or received include both headers and payloads.

The values are averages calculated within a cycle.

For example, if 1 GB of data is transmitted within 1 second, the TX rate is 1 GB/second regardless of whether the data is transferred at a constant rate or in bursts. The theoretical maximum PCIe Gen 3 bandwidth is 985 MB/second per lane.

  • DCGM_FI_PROF_NVLINK_RX_BYTES
  • DCGM_FI_PROF_NVLINK_TX_BYTES
Counter B/s The transmit (TX) rate and the receive (RX) rate of NVLink. The bytes transmitted or received include both headers and payloads.

The values are averages calculated within a cycle.

For example, if 1 GB of data is transmitted within 1 second, the TX rate is 1 GB/second regardless of whether the data is transferred at a constant rate or in bursts. The theoretical maximum NVLink Gen 2 bandwidth is 25 GB/second per link in each direction.
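
The PCIe and NVLink rates are easier to interpret when compared with the theoretical maximums quoted above. The following sketch makes several assumptions that are not stated in this topic: decimal units (1 MB = 10^6 bytes), a x16 PCIe slot, and six NVLink connections per GPU. All names are hypothetical.

```python
PCIE_GEN3_BYTES_PER_S_PER_LANE = 985 * 10**6  # 985 MB/second, from the text above
NVLINK_GEN2_BYTES_PER_S = 25 * 10**9          # 25 GB/second in each direction


def link_utilization(rate_bytes_per_s: float, peak_bytes_per_s: float) -> float:
    """Return the observed rate as a fraction of the theoretical peak."""
    if peak_bytes_per_s <= 0:
        raise ValueError("peak must be positive")
    return rate_bytes_per_s / peak_bytes_per_s


# Assumption: a x16 slot, so the peak is 16 lanes times the per-lane figure.
pcie_peak = PCIE_GEN3_BYTES_PER_S_PER_LANE * 16
print(round(link_utilization(1 * 10**9, pcie_peak), 4))  # 1 GB/second TX

# Assumption: six NVLink connections on the GPU.
nvlink_peak = NVLINK_GEN2_BYTES_PER_S * 6
print(link_utilization(75 * 10**9, nvlink_peak))  # 75 GB/second TX
```

Real transfers rarely reach the theoretical maximum, so a fraction well below 1 can still mean that the link is saturated.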

Clock metrics

Metric Type Unit Description
DCGM_FI_DEV_SM_CLOCK Gauge MHz The SM clock.
DCGM_FI_DEV_MEM_CLOCK Gauge MHz The memory clock.
DCGM_FI_DEV_APP_SM_CLOCK Gauge MHz The SM application clock.
DCGM_FI_DEV_APP_MEM_CLOCK Gauge MHz The memory application clock.
DCGM_FI_DEV_CLOCK_THROTTLE_REASONS Gauge - The clock throttle reasons, reported as a bitmask.
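
The value of DCGM_FI_DEV_CLOCK_THROTTLE_REASONS is a bitmask rather than a frequency, so it must be decoded before it is useful on a dashboard. The following sketch assumes the bit assignments of the NVML clocks throttle reason constants. Verify them against the nvml.h header that ships with your driver before you rely on them. The function name is hypothetical.

```python
from typing import List

# Assumed bit assignments, taken from the NVML clocks throttle reason
# constants. Verify against the nvml.h header of your driver version.
THROTTLE_REASON_BITS = {
    0x001: "GPU_IDLE",
    0x002: "APPLICATIONS_CLOCKS_SETTING",
    0x004: "SW_POWER_CAP",
    0x008: "HW_SLOWDOWN",
    0x010: "SYNC_BOOST",
    0x020: "SW_THERMAL_SLOWDOWN",
    0x040: "HW_THERMAL_SLOWDOWN",
    0x080: "HW_POWER_BRAKE_SLOWDOWN",
    0x100: "DISPLAY_CLOCK_SETTING",
}


def decode_throttle_reasons(value: float) -> List[str]:
    """Return the names of all throttle reasons whose bit is set."""
    mask = int(value)  # Prometheus samples are floats
    return [name for bit, name in THROTTLE_REASON_BITS.items() if mask & bit]


print(decode_throttle_reasons(68.0))  # 68 == 0x44
# ['SW_POWER_CAP', 'HW_THERMAL_SLOWDOWN']
```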

XID error and violation metrics

Metric Type Unit Description
DCGM_FI_DEV_XID_ERRORS Gauge - The most recent XID error that occurred within a period of time.
DCGM_FI_DEV_POWER_VIOLATION Counter μs The power violation time.
DCGM_FI_DEV_THERMAL_VIOLATION Counter μs The thermal violation time.
DCGM_FI_DEV_SYNC_BOOST_VIOLATION Counter μs The sync boost violation time.
DCGM_FI_DEV_BOARD_LIMIT_VIOLATION Counter μs The board limit violation time.
DCGM_FI_DEV_LOW_UTIL_VIOLATION Counter μs The low utilization violation time.
DCGM_FI_DEV_RELIABILITY_VIOLATION Counter μs The board reliability violation time.
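
Because the violation metrics are counters in microseconds, the useful quantity is usually the share of a time window spent in violation. The following sketch computes that share from two samples of the same counter. The function name and the sample values are hypothetical, and the handling of a counter reset is an assumption modeled on how rate calculations usually treat resets.

```python
def violation_fraction(start_us: float, end_us: float, window_s: float) -> float:
    """Return the fraction of the window spent in violation (0 to 1)."""
    if window_s <= 0:
        raise ValueError("window must be positive")
    delta_us = end_us - start_us
    if delta_us < 0:
        # Assumption: the counter was reset inside the window, so only the
        # time accumulated since the reset is counted.
        delta_us = end_us
    return delta_us / (window_s * 1_000_000)


# Hypothetical DCGM_FI_DEV_POWER_VIOLATION samples taken 60 seconds apart.
print(violation_fraction(1_500_000, 4_500_000, 60))  # 0.05
```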

BAR1 metrics

Metric Type Unit Description
DCGM_FI_DEV_BAR1_USED Gauge MB The amount of used BAR1 memory.
DCGM_FI_DEV_BAR1_FREE Gauge MB The amount of free BAR1 memory.

Temperature and power metrics

Metric Type Unit Description
DCGM_FI_DEV_MEMORY_TEMP Gauge °C The memory temperature.
DCGM_FI_DEV_GPU_TEMP Gauge °C The GPU temperature.
DCGM_FI_DEV_POWER_USAGE Gauge W The power usage.
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION Counter J The total energy consumption since the driver was last reloaded.

Retired page metrics

Metric Type Unit Description
DCGM_FI_DEV_RETIRED_SBE Gauge - The number of pages retired because of single bit errors.
DCGM_FI_DEV_RETIRED_DBE Gauge - The number of pages retired because of double bit errors.

Custom metrics

Metric Type Unit Description
DCGM_CUSTOM_PROCESS_SM_UTIL Gauge % The SM utilization of GPU processes.
DCGM_CUSTOM_PROCESS_MEM_COPY_UTIL Gauge % The memory copy utilization of GPU processes.
DCGM_CUSTOM_PROCESS_ENCODE_UTIL Gauge % The encoder utilization of GPU processes.
DCGM_CUSTOM_PROCESS_DECODE_UTIL Gauge % The decoder utilization of GPU processes.
DCGM_CUSTOM_PROCESS_MEM_USED Gauge MiB The amount of GPU memory occupied by GPU processes.
DCGM_CUSTOM_CONTAINER_MEM_ALLOCATED Gauge MiB The amount of GPU memory allocated to containers.
DCGM_CUSTOM_CONTAINER_CP_ALLOCATED Gauge - The ratio of GPU computing power allocated to a container to total GPU computing power provided by the GPU. The value ranges from 0 to 1.

In exclusive GPU mode or shared GPU mode, the value of this metric is 0 because containers in these modes request only GPU memory, and the GPU computing power that they can use is not limited.

For example, if a GPU provides 100 CUs of GPU computing power and allocates 30 CUs to a container, the ratio of GPU computing power allocated to the container is 0.3 (30/100).

DCGM_CUSTOM_DEV_FB_TOTAL Gauge MiB The total memory of the GPU.
DCGM_CUSTOM_DEV_FB_ALLOCATED Gauge - The ratio of allocated GPU memory to total GPU memory. The value ranges from 0 to 1.
DCGM_CUSTOM_ALLOCATE_MODE Gauge - The mode in which the node runs. Valid values:
  • 0: No GPU-accelerated pods are running on the node.
  • 1: GPU-accelerated pods are running in exclusive GPU mode on the node.
  • 2: GPU-accelerated pods are running in shared GPU mode on the node.
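
The custom metrics DCGM_CUSTOM_ALLOCATE_MODE and DCGM_CUSTOM_CONTAINER_CP_ALLOCATED are numeric codes and ratios that usually need translating for display. The following is a minimal sketch of both conversions, using the valid values and the 30/100 example listed above. The function names are hypothetical.

```python
# Valid values of DCGM_CUSTOM_ALLOCATE_MODE, as listed above.
ALLOCATE_MODES = {
    0: "no GPU-accelerated pods",
    1: "exclusive GPU mode",
    2: "shared GPU mode",
}


def describe_allocate_mode(value: float) -> str:
    """Map a DCGM_CUSTOM_ALLOCATE_MODE sample to a readable label."""
    return ALLOCATE_MODES.get(int(value), "unknown mode ({})".format(value))


def computing_power_ratio(allocated_cus: float, total_cus: float) -> float:
    """Return the share of GPU computing power allocated to a container."""
    if total_cus <= 0:
        raise ValueError("total computing power must be positive")
    return allocated_cus / total_cus


print(describe_allocate_mode(2.0))     # shared GPU mode
print(computing_power_ratio(30, 100))  # 0.3
```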

Deprecated metrics

Deprecated metric Metric for replacement
nvidia_gpu_temperature_celsius DCGM_FI_DEV_GPU_TEMP
nvidia_gpu_power_usage_milliwatts DCGM_FI_DEV_POWER_USAGE
nvidia_gpu_sharing_memory DCGM_CUSTOM_DEV_FB_ALLOCATED
nvidia_gpu_memory_used_bytes DCGM_FI_DEV_FB_USED
nvidia_gpu_memory_total_bytes DCGM_CUSTOM_DEV_FB_TOTAL
nvidia_gpu_memory_allocated_bytes DCGM_CUSTOM_DEV_FB_ALLOCATED
nvidia_gpu_duty_cycle DCGM_FI_DEV_GPU_UTIL
nvidia_gpu_allocated_num_devices DCGM_CUSTOM_DEV_FB_ALLOCATED
nvidia_gpu_num_devices DCGM_FI_DEV_COUNT
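
When you migrate existing dashboards, the replacement table above can be applied to query strings mechanically. The following sketch only renames the metrics. It does not adjust units or labels, so review the results: the deprecated names suggest milliwatts and bytes, whereas the replacement metrics are listed in W and MiB, and label names may also differ. The function name and the sample query are hypothetical.

```python
import re

# Deprecated metric -> replacement metric, copied from the table above.
REPLACEMENTS = {
    "nvidia_gpu_temperature_celsius": "DCGM_FI_DEV_GPU_TEMP",
    "nvidia_gpu_power_usage_milliwatts": "DCGM_FI_DEV_POWER_USAGE",
    "nvidia_gpu_sharing_memory": "DCGM_CUSTOM_DEV_FB_ALLOCATED",
    "nvidia_gpu_memory_used_bytes": "DCGM_FI_DEV_FB_USED",
    "nvidia_gpu_memory_total_bytes": "DCGM_CUSTOM_DEV_FB_TOTAL",
    "nvidia_gpu_memory_allocated_bytes": "DCGM_CUSTOM_DEV_FB_ALLOCATED",
    "nvidia_gpu_duty_cycle": "DCGM_FI_DEV_GPU_UTIL",
    "nvidia_gpu_allocated_num_devices": "DCGM_CUSTOM_DEV_FB_ALLOCATED",
    "nvidia_gpu_num_devices": "DCGM_FI_DEV_COUNT",
}

# Match whole metric names only; the underscore counts as a word character.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, REPLACEMENTS)) + r")\b")


def migrate_query(query: str) -> str:
    """Replace every deprecated metric name in a query string."""
    return _PATTERN.sub(lambda m: REPLACEMENTS[m.group(1)], query)


print(migrate_query('avg(nvidia_gpu_duty_cycle{node="n1"})'))
# avg(DCGM_FI_DEV_GPU_UTIL{node="n1"})
```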