
Container Service for Kubernetes: Enable GPU monitoring for a cluster

Last Updated: Dec 03, 2025

GPU monitoring leverages NVIDIA Data Center GPU Manager (DCGM) to create a powerful monitoring system for GPUs. This topic describes how to enable GPU monitoring for a cluster.

Background information

To manage large-scale GPU devices in a Kubernetes cluster, you need a comprehensive monitoring system. By monitoring GPU metrics, you can understand the GPU usage, health status, and workload performance of the entire cluster. This helps you quickly diagnose issues, optimize GPU resource allocation, and improve resource utilization. In addition to O&M engineers, other roles, such as data scientists and AI algorithm engineers, can also use these metrics to understand the GPU usage of their services. This information aids in capacity planning and task scheduling.

NVIDIA provides DCGM to manage GPUs in large-scale clusters. A GPU monitoring system built on NVIDIA DCGM offers powerful features and a variety of GPU monitoring metrics. Its main features include the following:

  • GPU behavior monitoring

  • GPU configuration management

  • GPU policy management

  • GPU health diagnostics

  • GPU-level and thread-level statistics

  • NVSwitch configuration and monitoring

Limits

  • The NVIDIA driver on the node must be version 418.87.01 or later. To check the driver version, log on to a GPU node and run the nvidia-smi command, as shown in the example after this list.

  • To use GPU Profiling Metrics, the NVIDIA driver on the node must be version 450.80.02 or later. For more information about GPU Profiling Metrics, see Feature Overview.

  • Monitoring for NVIDIA MIG is not supported.
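
The following commands are a minimal sketch for checking the driver version against these requirements. They assume that you have already logged on to the GPU node; the exact nvidia-smi output format can vary across driver versions.

    # Print the full nvidia-smi summary; the driver version appears in the header.
    nvidia-smi

    # Alternatively, query only the driver version in a script-friendly format.
    nvidia-smi --query-gpu=driver_version --format=csv,noheader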

Billing

For more information about the billing policy for Alibaba Cloud Prometheus, see Billing overview.

1. Enable Prometheus monitoring

Important

Ensure that the ack-arms-prometheus component is version 1.1.7 or later. You can view the version of the ack-arms-prometheus component and upgrade it if necessary.
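
If you prefer the command line over the console, the following sketch lists the image tags of the monitoring workloads so that you can confirm the installed version. It assumes that the component's workloads run in the arms-prom namespace, which might differ in your cluster.

    # List the image (and therefore the version tag) used by each monitoring workload.
    kubectl get deployments,daemonsets -n arms-prom \
      -o custom-columns=KIND:.kind,NAME:.metadata.name,IMAGE:.spec.template.spec.containers[0].image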

Enable monitoring for an existing cluster

  1. (Optional) For an ACK dedicated cluster, first grant the cluster authorization for monitoring policies.

  2. On the Clusters page, click the name of the target cluster. In the navigation pane on the left of the cluster details page, choose Operations > Prometheus Monitoring.

  3. On the Prometheus Monitoring page, select a container monitoring version and click Install.

    After you enable monitoring, default basic metrics are automatically collected. For information about collecting custom metrics, see Collect custom metrics. You can also view several preset monitoring dashboards on this page, such as Cluster Overview, Node Monitoring, Application Monitoring, Network Monitoring, and Storage Monitoring.

Enable monitoring when creating a cluster

  • ACK managed cluster Pro Edition:

    On the Component Configuration page, in the Container Monitoring section, select Container Cluster Monitoring Pro Edition or Container Cluster Monitoring Basic Edition. For more information, see Create an ACK managed cluster.

    Clusters in Auto Mode (smart hosting) enable Container Monitoring Basic Edition by default.
  • ACK managed cluster Basic Edition, ACS clusters, and ACK Serverless clusters:

    On the Component Configurations page of the create cluster wizard, in the Monitor containers section, select Enable Managed Service for Prometheus to install Container Monitoring Basic Edition.

    After monitoring is enabled, default basic metrics are automatically collected. To collect custom metrics, see Collect custom metrics. On the details page of the target cluster, in the navigation pane on the left, select Operations Management > Prometheus Monitoring. You can then view pre-configured monitoring dashboards such as Cluster Monitoring Overview, Node Monitoring, Application Monitoring, Network Monitoring, and Storage Monitoring.

For more information about how to enable Prometheus monitoring, see Enable Prometheus monitoring for ACK.

If you use a self-managed, open-source Prometheus service and require GPU monitoring capabilities, you must install the ack-gpu-exporter component.

2. Deploy a sample application

  1. Create a file named tensorflow-benchmark.yaml with the following content.

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: tensorflow-benchmark
    spec:
      parallelism: 1
      template:
        metadata:
          labels:
            app: tensorflow-benchmark
        spec:
          containers:
          - name: tensorflow-benchmark
            image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.2.3
            command:
            - bash
            - run.sh
            - --num_batches=50000
            - --batch_size=8
            resources:
              limits:
                nvidia.com/gpu: 1 # Request one GPU.
            workingDir: /root
          restartPolicy: Never
  2. Run the following command to deploy the tensorflow-benchmark application on a GPU node.

    kubectl apply -f tensorflow-benchmark.yaml
  3. Run the following command to check the pod status.

    kubectl get pod

    Expected output:

    NAME                         READY   STATUS    RESTARTS   AGE
    tensorflow-benchmark-k***    1/1     Running   0          114s
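
Optionally, you can confirm that the job was scheduled onto a GPU node and holds the requested nvidia.com/gpu resource. The following is a quick sketch; replace <node-name> with the node shown in the output of the first command.

    # Show which node hosts the benchmark pod. The app label comes from the Job manifest above.
    kubectl get pod -l app=tensorflow-benchmark -o wide

    # Check the node's GPU capacity and how much of it is currently allocated.
    kubectl describe node <node-name> | grep nvidia.com/gpu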

3. View GPU monitoring data for the cluster

  1. Log on to the ACK console. In the navigation pane on the left, click Clusters.

  2. On the Clusters page, find the cluster you want and click its name. In the left-side pane, choose Operations > Prometheus Monitoring.

  3. On the Prometheus Monitoring page, click the GPU Monitoring tab and then the GPUs-Pods tab.

    The monitoring data shows that the GPU pod is running on the node cn-beijing.10.131.xx.xxx.

  4. Click the GPUs-Nodes tab and set GPUNode to cn-beijing.10.131.xx.xxx to view the detailed GPU information for the node. For more information about the parameters, see Dashboard description.

FAQ

DCGM memory leak

  • Background: DCGM is a tool provided by NVIDIA for managing and monitoring GPUs. The ack-prometheus-gpu-exporter is a DaemonSet pod that starts after you install the Managed Service for Prometheus component.

  • Cause: A DCGM memory leak occurs when the memory occupied by DCGM is not released correctly during runtime, causing memory usage to increase continuously.

  • Solution: To work around this known memory leak, a resources.limits setting is configured for the pod in which ack-prometheus-gpu-exporter runs. When memory usage reaches the limit, the pod restarts, which typically occurs about once a month, and then resumes reporting metrics as normal. For a few minutes after the restart, Grafana might display some metrics abnormally, such as a sudden increase in the number of nodes, before the display returns to normal. For more information about this issue, see The DCGM has a memory leak?.

ack-prometheus-gpu-exporter experiences an OOM kill

  • Background: The ack-prometheus-gpu-exporter is a DaemonSet pod that starts after you install the Managed Service for Prometheus component. The pod might be terminated by an out-of-memory (OOM) kill, which interrupts GPU metric reporting.

  • Cause: The ack-prometheus-gpu-exporter on an ACK cluster uses DCGM in embedded mode. In this mode, DCGM consumes a large amount of memory on multi-GPU nodes and is prone to memory leaks. Therefore, if you run multiple GPU processes on an instance with multiple GPUs and allocate a small amount of memory to ack-prometheus-gpu-exporter, the exporter pod might be killed by an out-of-memory (OOM) event.

  • Solution: In this case, the pod typically resumes reporting metrics after it restarts. If OOM kills occur frequently, you can manually increase the memory limits for the ack-prometheus-gpu-exporter DaemonSet in the arms-prom namespace to resolve the issue.
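
    The following is a minimal sketch of raising the memory limit with kubectl patch. The DaemonSet name and namespace come from the description above; the container index (0) and the 1Gi value are assumptions, so adjust them to match your cluster before applying the patch.

    # Raise the memory limit of the exporter container. The DaemonSet then rolls out pods with the updated limit.
    kubectl -n arms-prom patch daemonset ack-prometheus-gpu-exporter --type='json' \
      -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "1Gi"}]'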

ack-prometheus-gpu-exporter reports an error

  • Background: The ack-prometheus-gpu-exporter is a DaemonSet pod that starts after you install the Managed Service for Prometheus component. An error from this pod can cause monitoring issues.

  • Cause: The issue occurs if the pod logs for ack-prometheus-gpu-exporter contain an error message similar to the following:

    failed to get all process informations of gpu nvidia1,reason: failed to get gpu utilizations for all processes on device 1,reason: Not Found

    This error occurs because older versions of ack-prometheus-gpu-exporter cannot retrieve GPU metrics for the relevant containers when no specific tasks are running on certain GPU cards. You can confirm the error in the exporter logs, as shown in the sketch after this list.

  • Solution: This issue is fixed in the latest version. To resolve this issue, upgrade the ack-arms-prometheus component to the latest version.
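
The following is a minimal sketch for locating the message above in the exporter logs before you upgrade. It assumes that the exporter pods run in the arms-prom namespace and contain "gpu-exporter" in their names; replace <pod-name> with the pod that runs on the affected GPU node.

    # Find the exporter pod that runs on the affected GPU node.
    kubectl -n arms-prom get pods -o wide | grep gpu-exporter

    # Search the pod logs for the "failed to get" error.
    kubectl -n arms-prom logs <pod-name> | grep "failed to get"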