
Container Service for Kubernetes:Introduction to metrics

Last Updated:Mar 25, 2026

GPU Monitoring 2.0 uses a combination of GPU exporter, Prometheus, and Grafana to provide GPU observability for your Container Service for Kubernetes (ACK) clusters. The GPU exporter is compatible with DCGM exporter metrics and also provides custom metrics for GPU sharing scenarios.

Available metrics are organized into three categories: DCGM exporter metrics, custom metrics, and deprecated metrics.

Billing

Custom metrics incur fees, which vary based on cluster size and the number of applications. Before enabling custom metrics, read Billing overview to understand the pricing model. To monitor your usage, see View resource usage.

Metrics supported by the DCGM exporter

The GPU exporter is compatible with DCGM exporter metrics. For details on the upstream DCGM exporter, see DCGM exporter.

Utilization metrics

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| DCGM_FI_DEV_GPU_UTIL | Gauge | % | GPU utilization sampled over a cycle of 1 second or 1/6 second, depending on GPU model. A cycle is a period during which one or more kernel functions remain active. This metric indicates whether the GPU is occupied, not how intensively it is being used. |
| DCGM_FI_DEV_MEM_COPY_UTIL | Gauge | % | Memory bandwidth utilization. For example, GPU V100 has a maximum memory bandwidth of 900 GB/s. If current usage is 450 GB/s, this metric reports 50%. |
| DCGM_FI_DEV_ENC_UTIL | Gauge | % | Encoder utilization. |
| DCGM_FI_DEV_DEC_UTIL | Gauge | % | Decoder utilization. |
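The bandwidth arithmetic behind DCGM_FI_DEV_MEM_COPY_UTIL can be sketched as follows. This is an illustrative helper, not part of the exporter; the 900 GB/s figure is the V100 peak cited above, and the 450 GB/s reading is a sample value.

```python
# Hypothetical sketch of how DCGM_FI_DEV_MEM_COPY_UTIL relates to raw
# bandwidth numbers, mirroring the V100 example in the table above.

def mem_copy_util_percent(current_gbps: float, peak_gbps: float) -> float:
    """Memory bandwidth utilization as a percentage of the device peak."""
    return 100.0 * current_gbps / peak_gbps

# 450 GB/s of a 900 GB/s peak on a V100:
print(mem_copy_util_percent(450.0, 900.0))  # 50.0
```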

Memory metrics

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| DCGM_FI_DEV_FB_FREE | Gauge | MiB | Free framebuffer memory (GPU memory). |
| DCGM_FI_DEV_FB_USED | Gauge | MiB | Used framebuffer memory. Matches the Memory-Usage value returned by nvidia-smi. |

Profiling metrics

All values are cycle averages, not instantaneous readings.

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| DCGM_FI_PROF_GR_ENGINE_ACTIVE | Gauge | % | Ratio of cycles during which a graphics engine or compute engine is active, averaged across all engines. An engine is active when a graphics or compute context is bound to the thread and the context is busy. |
| DCGM_FI_PROF_SM_ACTIVE | Gauge | % | Ratio of cycles during which at least one warp on a streaming multiprocessor (SM) is active, averaged across all SMs. A warp is considered active when it is scheduled and resources are allocated to it, regardless of whether it is computing or waiting (for example, waiting on memory). A value below 0.5 indicates low GPU utilization. Target a value greater than 0.8 for effective GPU use. |
| DCGM_FI_PROF_SM_OCCUPANCY | Gauge | % | Ratio of warps resident on an SM to the maximum warps the SM supports per cycle, averaged across all SMs. A higher value does not necessarily mean higher GPU utilization; it indicates higher utilization only when DCGM_FI_PROF_DRAM_ACTIVE shows that GPU memory bandwidth is the bottleneck. |
| DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | Gauge | % | Ratio of cycles during which the tensor (HMMA/IMMA) pipe is active. A higher value indicates higher tensor core utilization. At 100%, a tensor instruction is issued every other cycle throughout the period. Use DCGM_FI_PROF_SM_ACTIVE to disambiguate whether low utilization is caused by fewer active SMs or shorter active periods. |
| DCGM_FI_PROF_PIPE_FP64_ACTIVE | Gauge | % | Ratio of cycles during which the FP64 (double-precision) pipe is active. Higher values indicate higher FP64 core utilization. On Volta GPUs at 100%, one FP64 instruction executes every four cycles. Use DCGM_FI_PROF_SM_ACTIVE to disambiguate partial utilization scenarios. |
| DCGM_FI_PROF_PIPE_FP32_ACTIVE | Gauge | % | Ratio of cycles during which the FMA (fused multiply-add) pipe is active. FMA operations include FP32 (single-precision) and integer operations. On Volta GPUs at 100%, one FP32 instruction executes every two cycles. |
| DCGM_FI_PROF_PIPE_FP16_ACTIVE | Gauge | % | Ratio of cycles during which the FP16 (half-precision) pipe is active. On Volta GPUs at 100%, one FP16 instruction executes every two cycles. |
| DCGM_FI_PROF_DRAM_ACTIVE | Gauge | % | Ratio of cycles during which the device memory interface is active sending or receiving data. The peak achievable value is approximately 0.8 (80%), not 1.0 (100%). Use this metric together with DCGM_FI_PROF_SM_OCCUPANCY to determine whether a workload is memory-bandwidth-limited. |
| DCGM_FI_PROF_PCIE_TX_BYTES / DCGM_FI_PROF_PCIE_RX_BYTES | Counter | B/s | PCIe (Peripheral Component Interconnect Express) transmit and receive rates, including both protocol headers and data payloads, averaged over a cycle. The theoretical maximum for PCIe Gen 3 is 985 MB/s per lane. |
| DCGM_FI_PROF_NVLINK_TX_BYTES / DCGM_FI_PROF_NVLINK_RX_BYTES | Counter | B/s | NVLink transmit and receive rates, including both protocol headers and data payloads, averaged over a cycle. The theoretical maximum for NVLink Gen 2 is 25 GB/s per lane per direction. |
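The interpretation rules above (SM_ACTIVE below 0.5 means low utilization, SM_OCCUPANCY only matters when DRAM_ACTIVE shows a bandwidth bottleneck) can be combined into a rough classifier. This is an illustrative sketch; the function and its thresholds are drawn from the descriptions in this table, not from the exporter itself.

```python
# Hypothetical helper that applies the profiling-metric interpretation
# rules from the table above. Inputs are the metric values scaled to 0..1.

def interpret_profiling(sm_active: float, sm_occupancy: float,
                        dram_active: float) -> str:
    """Rough classification of a GPU workload from profiling metrics."""
    if sm_active < 0.5:
        # Few warps are active: the GPU is mostly idle.
        return "low GPU utilization: few active warps"
    if dram_active > 0.6 and sm_occupancy > 0.6:
        # High occupancy implies high utilization only when memory
        # bandwidth (DCGM_FI_PROF_DRAM_ACTIVE) is the bottleneck.
        return "memory-bandwidth-limited workload"
    if sm_active > 0.8:
        # Above the 0.8 target for effective GPU use.
        return "compute engines are well utilized"
    return "moderate utilization"

print(interpret_profiling(0.85, 0.7, 0.7))  # memory-bandwidth-limited workload
```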

Clock metrics

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| DCGM_FI_DEV_SM_CLOCK | Gauge | MHz | SM clock speed. |
| DCGM_FI_DEV_MEM_CLOCK | Gauge | MHz | Memory clock speed. |
| DCGM_FI_DEV_APP_SM_CLOCK | Gauge | MHz | SM application clock speed. |
| DCGM_FI_DEV_APP_MEM_CLOCK | Gauge | MHz | Memory application clock speed. |
| DCGM_FI_DEV_CLOCK_THROTTLE_REASONS | Gauge | - | Bitmask of the current reasons for clock speed throttling. |

XID error and violation metrics

XID errors are NVIDIA driver error codes reported when hardware or software faults occur. Violation metrics count the cumulative time (in microseconds) that the GPU clock has been throttled due to each cause.

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| DCGM_FI_DEV_XID_ERRORS | Gauge | - | Most recent XID error code that occurred within a period of time. |
| DCGM_FI_DEV_POWER_VIOLATION | Counter | μs | Cumulative clock throttle time due to power limits. |
| DCGM_FI_DEV_THERMAL_VIOLATION | Counter | μs | Cumulative clock throttle time due to thermal limits. |
| DCGM_FI_DEV_SYNC_BOOST_VIOLATION | Counter | μs | Cumulative clock throttle time due to sync boost constraints. |
| DCGM_FI_DEV_BOARD_LIMIT_VIOLATION | Counter | μs | Cumulative clock throttle time due to board-level limits. |
| DCGM_FI_DEV_LOW_UTIL_VIOLATION | Counter | μs | Cumulative clock throttle time due to low utilization. |
| DCGM_FI_DEV_RELIABILITY_VIOLATION | Counter | μs | Cumulative clock throttle time due to reliability constraints. |
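Because violation metrics are cumulative counters in microseconds, the useful quantity is usually the delta between two samples divided by the sampling interval, which gives the fraction of time the clock was throttled. The following sketch is illustrative; the counter values are sample numbers, not real readings.

```python
# Hypothetical conversion of a violation counter (cumulative microseconds,
# e.g. DCGM_FI_DEV_THERMAL_VIOLATION) into a throttle ratio over an interval.

def throttle_ratio(counter_start_us: int, counter_end_us: int,
                   interval_s: float) -> float:
    """Fraction of the interval during which the clock was throttled."""
    return (counter_end_us - counter_start_us) / (interval_s * 1_000_000)

# Counter grew by 3,000,000 us over a 60 s scrape interval:
print(throttle_ratio(10_000_000, 13_000_000, 60.0))  # 0.05 -> throttled 5% of the time
```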

BAR1 metrics

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| DCGM_FI_DEV_BAR1_USED | Gauge | MB | Used BAR1 memory. |
| DCGM_FI_DEV_BAR1_FREE | Gauge | MB | Free BAR1 memory. |

Temperature and power metrics

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| DCGM_FI_DEV_MEMORY_TEMP | Gauge | °C | Memory temperature. |
| DCGM_FI_DEV_GPU_TEMP | Gauge | °C | GPU temperature. |
| DCGM_FI_DEV_POWER_USAGE | Gauge | W | Current power draw. |
| DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | Counter | J | Total energy consumed since the driver was last reloaded. |

Retired page metrics

GPU memory pages are retired when uncorrectable errors are detected. A rising count indicates degrading GPU memory health.

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| DCGM_FI_DEV_RETIRED_SBE | Gauge | - | Pages retired due to single-bit errors. |
| DCGM_FI_DEV_RETIRED_DBE | Gauge | - | Pages retired due to double-bit errors. |

Custom metrics

Custom metrics are provided by the ACK GPU exporter for GPU sharing scenarios and are not part of the upstream DCGM exporter. These metrics incur fees.

Process-level metrics

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| DCGM_CUSTOM_PROCESS_SM_UTIL | Gauge | % | SM utilization of GPU threads. |
| DCGM_CUSTOM_PROCESS_MEM_COPY_UTIL | Gauge | % | Memory copy utilization of GPU threads. |
| DCGM_CUSTOM_PROCESS_ENCODE_UTIL | Gauge | % | Encoder utilization of GPU threads. |
| DCGM_CUSTOM_PROCESS_DECODE_UTIL | Gauge | % | Decoder utilization of GPU threads. |
| DCGM_CUSTOM_PROCESS_MEM_USED | Gauge | MiB | GPU memory used by GPU threads. |

Container- and node-level metrics

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| DCGM_CUSTOM_CONTAINER_MEM_ALLOCATED | Gauge | MiB | GPU memory allocated to a container. |
| DCGM_CUSTOM_CONTAINER_CP_ALLOCATED | Gauge | - | Ratio of GPU compute allocated to a container relative to total GPU compute. Value range: [0, 1]. For example, if the GPU provides 100 compute units and 30 are allocated to a container, this metric reports 0.3. In exclusive GPU mode or shared GPU mode, this value is 0 because containers in those modes request only GPU memory, with compute allocation unconstrained. |
| DCGM_CUSTOM_DEV_FB_TOTAL | Gauge | MiB | Total GPU memory. |
| DCGM_CUSTOM_DEV_FB_ALLOCATED | Gauge | - | Ratio of allocated GPU memory to total GPU memory. Value range: [0, 1]. |
| DCGM_CUSTOM_ALLOCATE_MODE | Gauge | - | Scheduling mode of the node. Values: 0 = no GPU-accelerated pods running; 1 = exclusive GPU mode; 2 = shared GPU mode. |
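The 100-compute-unit example for DCGM_CUSTOM_CONTAINER_CP_ALLOCATED reduces to a simple ratio. This sketch is illustrative only; the function name and variables are not part of the exporter.

```python
# Hypothetical computation of DCGM_CUSTOM_CONTAINER_CP_ALLOCATED, following
# the compute-unit example in the table above.

def cp_allocated(container_units: int, total_units: int) -> float:
    """Ratio of GPU compute allocated to a container, in [0, 1]."""
    return container_units / total_units

# 30 of 100 compute units allocated to the container:
print(cp_allocated(30, 100))  # 0.3
```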

Deprecated metrics

The following metrics have been removed from GPU Monitoring 2.0. Replace them with the metrics listed below.

| Deprecated metric | Replacement | Notes |
| --- | --- | --- |
| nvidia_gpu_temperature_celsius | DCGM_FI_DEV_GPU_TEMP | |
| nvidia_gpu_power_usage_milliwatts | DCGM_FI_DEV_POWER_USAGE | |
| nvidia_gpu_sharing_memory | DCGM_CUSTOM_DEV_FB_ALLOCATED x DCGM_CUSTOM_DEV_FB_TOTAL | GPU memory ratio x total GPU memory = GPU memory requested. |
| nvidia_gpu_memory_used_bytes | DCGM_FI_DEV_FB_USED | |
| nvidia_gpu_memory_total_bytes | DCGM_CUSTOM_DEV_FB_TOTAL | |
| nvidia_gpu_memory_allocated_bytes | DCGM_CUSTOM_DEV_FB_ALLOCATED x DCGM_CUSTOM_DEV_FB_TOTAL | GPU memory ratio x total GPU memory = GPU memory requested. |
| nvidia_gpu_duty_cycle | DCGM_FI_DEV_GPU_UTIL | |
| nvidia_gpu_allocated_num_devices | sum(DCGM_CUSTOM_DEV_FB_ALLOCATED) | Sum of GPU memory ratios across all GPUs on a node = total number of GPUs requested on the node. |
| nvidia_gpu_num_devices | DCGM_FI_DEV_COUNT | |
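The composite replacement expressions in the table combine two metrics. The arithmetic can be sketched for a hypothetical node with two GPUs; all sample values below are illustrative.

```python
# Illustrative arithmetic for the composite replacement expressions above,
# for a hypothetical node with two 32 GiB GPUs.

fb_total_mib = 32768          # DCGM_CUSTOM_DEV_FB_TOTAL, per GPU
fb_allocated = [0.5, 0.25]    # DCGM_CUSTOM_DEV_FB_ALLOCATED, per GPU

# Replacement for nvidia_gpu_memory_allocated_bytes:
# GPU memory ratio x total GPU memory = GPU memory requested, per GPU.
allocated_mib = [ratio * fb_total_mib for ratio in fb_allocated]
print(allocated_mib)          # [16384.0, 8192.0]

# Replacement for nvidia_gpu_allocated_num_devices:
# sum of GPU memory ratios = total number of GPUs requested on the node.
print(sum(fb_allocated))      # 0.75
```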