GPU Monitoring 2.0 uses a combination of GPU exporter, Prometheus, and Grafana to provide GPU observability for your Container Service for Kubernetes (ACK) clusters. The GPU exporter is compatible with DCGM exporter metrics and also provides custom metrics for GPU sharing scenarios.
Available metrics are organized into three categories: DCGM exporter metrics, custom metrics, and deprecated metrics.
Billing
You are charged for custom metrics. Fees vary based on cluster size and the number of applications. Before enabling custom metrics, read Billing overview to understand the pricing model. To monitor your usage, see View resource usage.
Metrics supported by the DCGM exporter
The GPU exporter is compatible with DCGM exporter metrics. For details on the upstream DCGM exporter, see DCGM exporter.
Utilization metrics
| Metric | Type | Unit | Description |
|---|---|---|---|
| DCGM_FI_DEV_GPU_UTIL | Gauge | % | GPU utilization sampled over a cycle of 1 second or 1/6 second, depending on the GPU model. A cycle is a period during which one or more kernel functions remain active. This metric indicates whether the GPU is occupied, not how intensively it is being used. |
| DCGM_FI_DEV_MEM_COPY_UTIL | Gauge | % | Memory bandwidth utilization. For example, an NVIDIA V100 GPU has a maximum memory bandwidth of 900 GB/s. If current usage is 450 GB/s, this metric reports 50%. |
| DCGM_FI_DEV_ENC_UTIL | Gauge | % | Encoder utilization. |
| DCGM_FI_DEV_DEC_UTIL | Gauge | % | Decoder utilization. |
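The bandwidth example above can be sketched as a small calculation. This is an illustration only; the peak bandwidth constant is the V100 figure quoted above, and other GPU models have different peaks.

```python
# Illustrates how DCGM_FI_DEV_MEM_COPY_UTIL relates current memory
# bandwidth to a GPU's peak bandwidth. The 900 GB/s peak is the V100
# value from the example above; it is not a universal constant.

V100_PEAK_BANDWIDTH_GBPS = 900.0

def mem_copy_util_percent(current_gbps: float,
                          peak_gbps: float = V100_PEAK_BANDWIDTH_GBPS) -> float:
    """Return memory bandwidth utilization as a percentage."""
    return 100.0 * current_gbps / peak_gbps

# 450 GB/s of a 900 GB/s peak -> 50%
print(mem_copy_util_percent(450.0))  # -> 50.0
```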
Memory metrics
| Metric | Type | Unit | Description |
|---|---|---|---|
| DCGM_FI_DEV_FB_FREE | Gauge | MiB | Free framebuffer memory (GPU memory). |
| DCGM_FI_DEV_FB_USED | Gauge | MiB | Used framebuffer memory. Matches the Memory-Usage value returned by nvidia-smi. |
Profiling metrics
All values are cycle averages, not instantaneous readings.
| Metric | Type | Unit | Description |
|---|---|---|---|
| DCGM_FI_PROF_GR_ENGINE_ACTIVE | Gauge | % | Ratio of cycles during which a graphics engine or compute engine is active, averaged across all engines. An engine is active when a graphics or compute context is bound to the thread and the context is busy. |
| DCGM_FI_PROF_SM_ACTIVE | Gauge | % | Ratio of cycles during which at least one warp on a streaming multiprocessor (SM) is active, averaged across all SMs. A warp is considered active when it is scheduled and has resources allocated, regardless of whether it is computing or waiting (for example, on memory). A value below 0.5 indicates low GPU utilization; target a value greater than 0.8 for effective GPU use. |
| DCGM_FI_PROF_SM_OCCUPANCY | Gauge | % | Ratio of warps resident on an SM to the maximum warps the SM supports per cycle, averaged across all SMs. A higher value does not necessarily mean higher GPU utilization; it indicates higher utilization only when DCGM_FI_PROF_DRAM_ACTIVE shows that GPU memory bandwidth is the bottleneck. |
| DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | Gauge | % | Ratio of cycles during which the tensor (HMMA/IMMA) pipe is active. A higher value indicates higher Tensor Core utilization. At 100%, a tensor instruction is issued every other cycle over the interval, each taking two cycles to complete. Use DCGM_FI_PROF_SM_ACTIVE to disambiguate whether low utilization is caused by fewer active SMs or shorter active periods. |
| DCGM_FI_PROF_PIPE_FP64_ACTIVE | Gauge | % | Ratio of cycles during which the FP64 (double-precision) pipe is active. Higher values indicate higher FP64 core utilization. On Volta GPUs at 100%, one FP64 instruction executes every four cycles over the interval. Use DCGM_FI_PROF_SM_ACTIVE to disambiguate partial utilization scenarios. |
| DCGM_FI_PROF_PIPE_FP32_ACTIVE | Gauge | % | Ratio of cycles during which the FMA (fused multiply-add) pipe is active. FMA operations include FP32 (single-precision) and integer operations. On Volta GPUs at 100%, one FP32 instruction executes every two cycles over the interval. |
| DCGM_FI_PROF_PIPE_FP16_ACTIVE | Gauge | % | Ratio of cycles during which the FP16 (half-precision) pipe is active. On Volta GPUs at 100%, one FP16 instruction executes every two cycles over the interval. |
| DCGM_FI_PROF_DRAM_ACTIVE | Gauge | % | Ratio of cycles during which the device memory interface is active sending or receiving data. The peak achievable value is approximately 0.8 (80%), not 1.0 (100%). Use this metric together with DCGM_FI_PROF_SM_OCCUPANCY to determine whether a workload is memory-bandwidth-limited. |
| DCGM_FI_PROF_PCIE_TX_BYTES / DCGM_FI_PROF_PCIE_RX_BYTES | Counter | B/s | PCIe (Peripheral Component Interconnect Express) transmit and receive rates, including both protocol headers and data payloads, averaged over a cycle. The theoretical maximum for PCIe Gen 3 is 985 MB/s per lane. |
| DCGM_FI_PROF_NVLINK_TX_BYTES / DCGM_FI_PROF_NVLINK_RX_BYTES | Counter | B/s | NVLink transmit and receive rates, including both protocol headers and data payloads, averaged over a cycle. The theoretical maximum for NVLink Gen 2 is 25 GB/s per lane per direction. |
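The guidance above on reading DCGM_FI_PROF_SM_ACTIVE, DCGM_FI_PROF_SM_OCCUPANCY, and DCGM_FI_PROF_DRAM_ACTIVE together can be sketched as a simple classifier. This is a hypothetical helper, not part of any exporter: the 0.5 and 0.8 thresholds come from the table above, and the 0.6 cutoff for the memory-bound case is an illustrative assumption (DRAM_ACTIVE peaks near 0.8, so 0.6 is already close to the limit).

```python
# Hypothetical classification of a GPU workload from DCGM profiling
# metrics, following the interpretation guidance in the table above.
# All inputs are ratios in [0, 1].

def classify_workload(sm_active: float,
                      sm_occupancy: float,
                      dram_active: float) -> str:
    if dram_active >= 0.6 and sm_occupancy >= 0.6:
        # High occupancy plus a busy memory interface suggests the
        # workload is limited by GPU memory bandwidth.
        return "memory-bandwidth-limited"
    if sm_active < 0.5:
        # Below 0.5 the table above calls GPU utilization low.
        return "underutilized"
    if sm_active >= 0.8:
        # The target for effective GPU use named above.
        return "compute-busy"
    return "partially-utilized"

print(classify_workload(sm_active=0.9, sm_occupancy=0.7, dram_active=0.7))
```

The thresholds are starting points; tune them against your own workloads before alerting on them.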
Clock metrics
| Metric | Type | Unit | Description |
|---|---|---|---|
| DCGM_FI_DEV_SM_CLOCK | Gauge | MHz | SM clock speed. |
| DCGM_FI_DEV_MEM_CLOCK | Gauge | MHz | Memory clock speed. |
| DCGM_FI_DEV_APP_SM_CLOCK | Gauge | MHz | SM application clock speed. |
| DCGM_FI_DEV_APP_MEM_CLOCK | Gauge | MHz | Memory application clock speed. |
| DCGM_FI_DEV_CLOCK_THROTTLE_REASONS | Gauge | - | Bitmask of reasons why the clock speed is throttled. |
XID error and violation metrics
XID errors are NVIDIA driver error codes reported when hardware or software faults occur. Violation metrics count the cumulative time (in microseconds) that the GPU clock has been throttled due to each cause.
| Metric | Type | Unit | Description |
|---|---|---|---|
| DCGM_FI_DEV_XID_ERRORS | Gauge | - | Most recent XID error code that occurred within a period of time. |
| DCGM_FI_DEV_POWER_VIOLATION | Counter | μs | Cumulative clock throttle time due to power limits. |
| DCGM_FI_DEV_THERMAL_VIOLATION | Counter | μs | Cumulative clock throttle time due to thermal limits. |
| DCGM_FI_DEV_SYNC_BOOST_VIOLATION | Counter | μs | Cumulative clock throttle time due to sync boost constraints. |
| DCGM_FI_DEV_BOARD_LIMIT_VIOLATION | Counter | μs | Cumulative clock throttle time due to board-level limits. |
| DCGM_FI_DEV_LOW_UTIL_VIOLATION | Counter | μs | Cumulative clock throttle time due to low utilization. |
| DCGM_FI_DEV_RELIABILITY_VIOLATION | Counter | μs | Cumulative clock throttle time due to reliability constraints. |
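Because the violation metrics are cumulative microsecond counters, the useful quantity is the fraction of a time window spent throttled, computed from two samples. A minimal sketch, with made-up sample values:

```python
# Violation metrics accumulate microseconds of throttle time, so the
# throttled fraction over a window is the counter delta divided by the
# window length. The sample values below are illustrative.

def throttle_fraction(prev_us: int, curr_us: int, window_seconds: float) -> float:
    """Fraction of the window during which the clock was throttled."""
    delta_us = curr_us - prev_us
    return delta_us / (window_seconds * 1_000_000)

# 3 s of power throttling observed over a 60 s scrape window -> 5%
print(throttle_fraction(prev_us=10_000_000, curr_us=13_000_000,
                        window_seconds=60.0))  # -> 0.05
```

In Prometheus, the equivalent is a `rate()` over the counter; the Python form just makes the arithmetic explicit.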
BAR1 metrics
| Metric | Type | Unit | Description |
|---|---|---|---|
| DCGM_FI_DEV_BAR1_USED | Gauge | MB | Used BAR1 memory. |
| DCGM_FI_DEV_BAR1_FREE | Gauge | MB | Free BAR1 memory. |
Temperature and power metrics
| Metric | Type | Unit | Description |
|---|---|---|---|
| DCGM_FI_DEV_MEMORY_TEMP | Gauge | °C | Memory temperature. |
| DCGM_FI_DEV_GPU_TEMP | Gauge | °C | GPU temperature. |
| DCGM_FI_DEV_POWER_USAGE | Gauge | W | Current power draw. |
| DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | Counter | J | Total energy consumed since the driver was last reloaded. |
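The energy counter can be turned into an average power reading over a window, since 1 W = 1 J/s. A minimal sketch with illustrative sample values:

```python
# DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION is a cumulative counter in
# joules, so average power over a window is the energy delta divided
# by the elapsed seconds. The numbers below are made up.

def average_power_watts(prev_joules: float, curr_joules: float,
                        elapsed_seconds: float) -> float:
    """Average power draw over the window, in watts."""
    return (curr_joules - prev_joules) / elapsed_seconds

# 18 000 J consumed over 60 s -> 300 W average draw
print(average_power_watts(prev_joules=500_000.0, curr_joules=518_000.0,
                          elapsed_seconds=60.0))  # -> 300.0
```

This averaged figure smooths out the spikes you would see in the instantaneous DCGM_FI_DEV_POWER_USAGE gauge.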
Retired page metrics
GPU memory pages are retired when uncorrectable errors are detected. A rising count indicates degrading GPU memory health.
| Metric | Type | Unit | Description |
|---|---|---|---|
| DCGM_FI_DEV_RETIRED_SBE | Gauge | - | Pages retired due to single-bit errors. |
| DCGM_FI_DEV_RETIRED_DBE | Gauge | - | Pages retired due to double-bit errors. |
Custom metrics
Custom metrics are provided by the ACK GPU exporter for GPU sharing scenarios and are not part of the upstream DCGM exporter. You are charged for these metrics.
Process-level metrics
| Metric | Type | Unit | Description |
|---|---|---|---|
| DCGM_CUSTOM_PROCESS_SM_UTIL | Gauge | % | SM utilization of GPU threads. |
| DCGM_CUSTOM_PROCESS_MEM_COPY_UTIL | Gauge | % | Memory copy utilization of GPU threads. |
| DCGM_CUSTOM_PROCESS_ENCODE_UTIL | Gauge | % | Encoder utilization of GPU threads. |
| DCGM_CUSTOM_PROCESS_DECODE_UTIL | Gauge | % | Decoder utilization of GPU threads. |
| DCGM_CUSTOM_PROCESS_MEM_USED | Gauge | MiB | GPU memory used by GPU threads. |
Container- and node-level metrics
| Metric | Type | Unit | Description |
|---|---|---|---|
| DCGM_CUSTOM_CONTAINER_MEM_ALLOCATED | Gauge | MiB | GPU memory allocated to a container. |
| DCGM_CUSTOM_CONTAINER_CP_ALLOCATED | Gauge | - | Ratio of GPU compute allocated to a container to total GPU compute. Value range: [0, 1]. For example, if the GPU provides 100 compute units and 30 are allocated to a container, this metric reports 0.3. In exclusive GPU mode and in shared GPU mode, this value is 0 because containers in those modes request only GPU memory and compute allocation is unconstrained. |
| DCGM_CUSTOM_DEV_FB_TOTAL | Gauge | MiB | Total GPU memory. |
| DCGM_CUSTOM_DEV_FB_ALLOCATED | Gauge | - | Ratio of allocated GPU memory to total GPU memory. Value range: [0, 1]. |
| DCGM_CUSTOM_ALLOCATE_MODE | Gauge | - | Scheduling mode of the node. Values: 0 = no GPU-accelerated pods running; 1 = exclusive GPU mode; 2 = shared GPU mode. |
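Two of the custom metrics above are ratios rather than absolute values, so dashboards usually combine them. A minimal sketch, assuming an illustrative 16 GiB GPU; the function and mapping names are hypothetical, not exporter API:

```python
# DCGM_CUSTOM_DEV_FB_ALLOCATED is a ratio in [0, 1], so the allocated
# amount in MiB is that ratio times DCGM_CUSTOM_DEV_FB_TOTAL. The mode
# mapping mirrors the DCGM_CUSTOM_ALLOCATE_MODE values listed above.

ALLOCATE_MODES = {0: "no GPU pods", 1: "exclusive", 2: "shared"}

def allocated_mib(fb_allocated_ratio: float, fb_total_mib: float) -> float:
    """GPU memory allocated on the device, in MiB."""
    return fb_allocated_ratio * fb_total_mib

# 30% of a 16 384 MiB GPU allocated -> roughly 4 915 MiB
print(allocated_mib(0.3, 16_384.0))
print(ALLOCATE_MODES[2])  # -> shared
```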
Deprecated metrics
The following metrics have been removed from GPU Monitoring 2.0. Replace them with the metrics in the Replacement column.
| Deprecated metric | Replacement | Notes |
|---|---|---|
| nvidia_gpu_temperature_celsius | DCGM_FI_DEV_GPU_TEMP | |
| nvidia_gpu_power_usage_milliwatts | DCGM_FI_DEV_POWER_USAGE | The unit changes from milliwatts to watts. |
| nvidia_gpu_sharing_memory | DCGM_CUSTOM_DEV_FB_ALLOCATED × DCGM_CUSTOM_DEV_FB_TOTAL | GPU memory ratio × total GPU memory = GPU memory requested. |
| nvidia_gpu_memory_used_bytes | DCGM_FI_DEV_FB_USED | The unit changes from bytes to MiB. |
| nvidia_gpu_memory_total_bytes | DCGM_CUSTOM_DEV_FB_TOTAL | The unit changes from bytes to MiB. |
| nvidia_gpu_memory_allocated_bytes | DCGM_CUSTOM_DEV_FB_ALLOCATED × DCGM_CUSTOM_DEV_FB_TOTAL | GPU memory ratio × total GPU memory = GPU memory requested. |
| nvidia_gpu_duty_cycle | DCGM_FI_DEV_GPU_UTIL | |
| nvidia_gpu_allocated_num_devices | sum(DCGM_CUSTOM_DEV_FB_ALLOCATED) | Sum of GPU memory ratios across all GPUs on a node = total number of GPUs requested on the node. |
| nvidia_gpu_num_devices | DCGM_FI_DEV_COUNT | |
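The replacement arithmetic in the table can be sketched directly. These helpers are illustrative only, with made-up per-GPU sample values:

```python
# Sketch of the replacement arithmetic from the table above: two of the
# deprecated metrics are derived from the custom ratio metrics rather
# than read directly.

def allocated_bytes(fb_allocated_ratio: float, fb_total_mib: float) -> float:
    """nvidia_gpu_memory_allocated_bytes replacement: ratio x total, in bytes."""
    return fb_allocated_ratio * fb_total_mib * 1024 * 1024

def allocated_num_devices(fb_allocated_ratios: list[float]) -> float:
    """nvidia_gpu_allocated_num_devices replacement: sum of per-GPU ratios."""
    return sum(fb_allocated_ratios)

# Two fully allocated GPUs and one half-allocated GPU on a node
# -> 2.5 GPUs requested
print(allocated_num_devices([1.0, 1.0, 0.5]))  # -> 2.5
```

In Prometheus, the second helper corresponds to a `sum()` of DCGM_CUSTOM_DEV_FB_ALLOCATED grouped by node.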