Container Service for Kubernetes: Enable GPU monitoring for ACK clusters

Last Updated: Apr 16, 2025

GPU monitoring 2.0 is a sophisticated GPU monitoring system developed based on NVIDIA Data Center GPU Manager (DCGM). This topic describes how to enable GPU monitoring for a Container Service for Kubernetes (ACK) cluster.

Prerequisites

Background information

Monitoring large numbers of GPU devices in Kubernetes is important to O&M engineers. By collecting GPU metrics from a cluster, you can gain insights into the GPU usage, health status, workloads, and performance of the cluster. The monitoring data can help you quickly diagnose issues, optimize GPU resource allocation, and increase resource utilization. GPU monitoring also helps data scientists and AI algorithm engineers optimize GPU resource allocation and task scheduling.

GPU monitoring 1.0 uses the NVIDIA Management Library (NVML) to collect GPU metrics and uses Prometheus and Grafana to visualize them. It allows you to monitor the usage of GPU resources in your cluster. However, new-generation NVIDIA GPUs use more complex architectures to meet requirements in diverse scenarios, and the limited set of GPU metrics provided by GPU monitoring 1.0 can no longer meet the growing monitoring demand.

New-generation NVIDIA GPUs support Data Center GPU Manager (DCGM), which can manage GPUs at scale. GPU monitoring 2.0 is built on NVIDIA DCGM. DCGM provides a rich set of GPU metrics and supports the following features:

  • GPU behavior monitoring

  • GPU configuration management

  • GPU policy management

  • GPU health diagnostics

  • GPU accounting and process statistics

  • NVSwitch configuration and monitoring

Limits

  • The version of the NVIDIA GPU driver must be 418.87.01 or later. To use GPU profiling metrics, make sure that NVIDIA GPU driver 450.80.02 or later is installed.

    Note

    To check the version of the GPU driver installed on a node, use SSH to log on to the node and run the nvidia-smi command, as shown in the example after this list. For more information, see Connect to the master node of an ACK dedicated cluster by using SSH.

  • You cannot use GPU monitoring 2.0 to monitor the NVIDIA Multi-Instance GPU (MIG) feature.
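
To check only the driver version from the command line, you can run the following nvidia-smi query on the GPU-accelerated node itself. The flags used here are standard nvidia-smi query options:

    # Print only the installed NVIDIA driver version, without the full nvidia-smi table.
    nvidia-smi --query-gpu=driver_version --format=csv,noheader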

Billing rules

By default, the Managed Service for Prometheus metrics collected by the ack-gpu-exporter component in an ACK cluster are classified as basic metrics and are free of charge. However, if you increase the default retention period that Alibaba Cloud defines for basic monitoring data, additional fees may be charged. For more information about the billing of custom metrics in Managed Service for Prometheus, see Billing overview.

Procedure

  1. Enable Managed Service for Prometheus.

    Make sure the ack-arms-prometheus version is 1.1.7 or later and the GPU dashboard version is V2 or later.

    Note
    • Check and update the ack-arms-prometheus version: Log on to the ACK console and go to the details page of the cluster. In the left-side navigation pane, choose Operations > Add-ons. On the Add-ons page, enter arms in the search box and click the search icon. After the search result appears, you can check and update the ack-arms-prometheus version.

    • Check and update the GPU dashboard version: Log on to the ACK console and go to the details page of the cluster. In the left-side navigation pane, choose Operations > Prometheus Monitoring. In the upper-right corner of the Prometheus Monitoring page, click Go to Prometheus Service. On the Dashboards page, you can check and update the GPU dashboard version.
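
    As an alternative to the console checks in the preceding note, you can list the installed components and their image tags from the command line. This is only a quick, hedged check: it assumes that the components run in the arms-prom namespace (the same namespace referenced in the solutions at the end of this topic), and the image tag does not necessarily match the version number shown in the console.

      # List Deployments and DaemonSets in the arms-prom namespace together with the
      # image of their first container, to get a rough view of the installed versions.
      kubectl get deploy,ds -n arms-prom \
        -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'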

  2. Verify the GPU monitoring capability of Managed Service for Prometheus.

    1. Deploy an application named tensorflow-benchmark.

      1. Create a YAML file named tensorflow-benchmark.yaml and add the following content to the file:

        apiVersion: batch/v1
        kind: Job
        metadata:
          name: tensorflow-benchmark
        spec:
          parallelism: 1
          template:
            metadata:
              labels:
                app: tensorflow-benchmark
            spec:
              containers:
              - name: tensorflow-benchmark
                image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.2.3
                command:
                - bash
                - run.sh
                - --num_batches=50000
                - --batch_size=8
                resources:
                  limits:
                    nvidia.com/gpu: 1 # Request one GPU for the pod.
                workingDir: /root
              restartPolicy: Never
      2. Run the following command to deploy the tensorflow-benchmark application on a GPU-accelerated node:

        kubectl apply -f tensorflow-benchmark.yaml
      3. Run the following command to query the status of the pod that runs the application:

        kubectl get po

        Expected output:

        NAME                         READY   STATUS    RESTARTS   AGE
        tensorflow-benchmark-k***   1/1     Running   0          114s

        The output indicates that the pod is in the Running state.
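
        Optionally, before you open the dashboards, you can stream the logs of the job to confirm that the benchmark is actually exercising the GPU. The command uses the job name from the YAML file above; the exact log output depends on the image, but the benchmark typically prints throughput lines once it has warmed up.

        # Stream the benchmark logs; press Ctrl+C to stop following.
        kubectl logs -f job/tensorflow-benchmark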

    2. View GPU dashboards.

      1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

      2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose Operations > Prometheus Monitoring.

      3. On the Prometheus Monitoring page, click the GPU Monitoring tab and then click the GPUs - Cluster Dimension tab.

        The cluster dashboard shows that the GPU pod runs on the cn-beijing.192.168.10.163 node.

      4. Click the GPUs - Nodes tab, and then select cn-beijing.192.168.10.163 from the gpu_node drop-down list to view the GPU information of the node.

Solutions to known issues

Memory leaks in a DCGM cluster

Memory leaks may occur in a cluster that uses DCGM to manage GPUs. This issue can be mitigated by setting the resources.limits parameter for the exporter pod: when the memory usage of the exporter pod reaches the limit, the exporter is restarted (usually about once a month). After the exporter pod is restarted, metrics are collected as normal. However, metrics on the Grafana dashboards may be abnormal for a few minutes after the restart. For example, the number of nodes may suddenly increase and then return to the correct value within a few minutes. For more information, see The DCGM has a memory leak?
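
To check whether the exporter has recently been restarted because of this limit, you can inspect the RESTARTS column of its pods. This assumes that the exporter runs in the arms-prom namespace, as described in the following sections:

    # List the pods in the arms-prom namespace and check the RESTARTS column and age
    # of the GPU exporter pods.
    kubectl -n arms-prom get pods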

OOM errors in ack-prometheus-gpu-exporter

The ack-prometheus-gpu-exporter component in ACK clusters runs DCGM in embedded mode. In this mode, DCGM consumes a large amount of memory, and memory leaks may occur when multiple GPUs are used. As a result, if you run multiple processes on an instance to which multiple GPUs are attached and only a small amount of memory is allocated to ack-prometheus-gpu-exporter, the exporter pod may be terminated by the out-of-memory (OOM) killer.

When this issue occurs, wait for the exporter to be restarted. If this issue occurs frequently, you can manually increase the value of the limits parameter of the ack-prometheus-gpu-exporter DaemonSet in the arms-prom namespace, as shown in the following sketch.
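
The following command is only a sketch of how the memory limit could be raised from the command line. The DaemonSet name and namespace are taken from the paragraph above; the 1Gi value is an example, not a recommended default, and if the DaemonSet has more than one container, add -c <container-name> to target the exporter container:

    # Raise the memory limit on the exporter DaemonSet (example value; adjust as needed).
    kubectl -n arms-prom set resources daemonset ack-prometheus-gpu-exporter --limits=memory=1Gi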

Errors in the logs of ack-prometheus-gpu-exporter

The following error may be reported in the logs of the ack-prometheus-gpu-exporter pod:

failed to get all process informations of gpu nvidia1,reason: failed to get gpu utilizations for all processes on device 1,reason: Not Found

The cause may be that no task is running on the relevant GPUs. In this case, earlier versions of ack-prometheus-gpu-exporter cannot collect container-level GPU metrics. This issue is fixed in the latest version of arms-prometheus. Update the component to the latest version to resolve the issue.
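
After updating the component, you can check whether the error still appears in the exporter logs. This is a hedged sketch that reuses the DaemonSet name and namespace from the preceding sections; note that kubectl picks only one pod of the DaemonSet, so to inspect a specific node, list the pods with -o wide and pass the pod name instead:

    # Show the most recent log lines of one exporter pod selected from the DaemonSet.
    kubectl -n arms-prom logs ds/ack-prometheus-gpu-exporter --tail=100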