How to enable GPU monitoring for an ACK cluster - Container Service for Kubernetes

GPU monitoring built on NVIDIA Data Center GPU Manager (DCGM) gives you visibility into GPU utilization, health, and workload performance across your cluster. Use GPU metrics to diagnose issues, optimize resource allocation, and inform capacity planning.

How it works

DCGM is NVIDIA's tool for managing GPUs in large-scale clusters. A monitoring system built on DCGM provides:

GPU behavior monitoring
GPU configuration management
GPU policy management
GPU health diagnostics
GPU-level and thread-level statistics
NVSwitch configuration and monitoring

Prerequisites

Before you begin, make sure that you have:

Limitations

The NVIDIA driver on each node must be version 418.87.01 or later. Log on to a GPU node and run nvidia-smi to check the driver version.
Profiling Metrics require NVIDIA driver version 450.80.02 or later. For details, see Feature Overview.
NVIDIA Multi-Instance GPU (MIG) monitoring is not supported.

Billing

GPU monitoring metrics are collected through Alibaba Cloud Managed Service for Prometheus. For pricing details, see Billing overview.

Step 1: Enable Prometheus monitoring

Important

The ack-arms-prometheus component must be version 1.1.7 or later. Check the component version and upgrade if necessary.

Option A: Enable monitoring for an existing cluster

(Optional) For an ACK dedicated cluster, first grant authorization for monitoring policies to the cluster.
On the Clusters page, click the name of the target cluster. In the left-side navigation pane, choose Operations > Prometheus Monitoring.
On the Prometheus Monitoring page, select a container monitoring version and click Install.

After monitoring is enabled, default basic metrics are collected automatically. To collect custom metrics, see Collect custom metrics. Several preset dashboards are also available on this page, including Cluster Overview, Node Monitoring, Application Monitoring, Network Monitoring, and Storage Monitoring.

Option B: Enable monitoring when creating a cluster

ACK managed cluster Pro Edition:

On the Component Configuration page, in the Container Monitoring section, select Container Cluster Monitoring Pro Edition or Container Cluster Monitoring Basic Edition. For details, see Create an ACK managed cluster.

Note: Auto Mode for smart hosting enables Container Monitoring Basic Edition by default.

ACK managed cluster Basic Edition, ACS clusters, and ACK Serverless clusters:

On the Component Configurations page of the cluster creation wizard, in the Monitor containers section, select Enable Managed Service for Prometheus to install Container Monitoring Basic Edition.

After monitoring is enabled, default basic metrics are collected automatically. To collect custom metrics, see Collect custom metrics. On the cluster details page, choose Operations > Prometheus Monitoring to view preset dashboards such as Cluster Overview, Node Monitoring, Application Monitoring, Network Monitoring, and Storage Monitoring.

For more information about enabling Prometheus monitoring, see Enable Prometheus monitoring for ACK.

If you use a self-managed, open-source Prometheus service and need GPU monitoring, install the ack-gpu-exporter component.

Step 2: Verify GPU monitoring components

After enabling Prometheus monitoring, verify that the GPU exporter pods are running:

kubectl get pods -n arms-prom -l k8s-app=ack-prometheus-gpu-exporter

Each GPU node in the cluster should have a running ack-prometheus-gpu-exporter pod. This DaemonSet collects DCGM metrics from your GPU nodes.

Step 3: Deploy a sample application

To generate GPU metrics, deploy a sample workload on a GPU node.

Create a file named tensorflow-benchmark.yaml with the following content:

apiVersion: batch/v1
kind: Job
metadata:
  name: tensorflow-benchmark
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        app: tensorflow-benchmark
    spec:
      containers:
      - name: tensorflow-benchmark
        image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.2.3
        command:
        - bash
        - run.sh
        - --num_batches=50000
        - --batch_size=8
        resources:
          limits:
            nvidia.com/gpu: 1 # Request one GPU.
        workingDir: /root
      restartPolicy: Never

Deploy the application:

kubectl apply -f tensorflow-benchmark.yaml

Check the pod status:

kubectl get pod

Expected output:

NAME                         READY   STATUS    RESTARTS   AGE
tensorflow-benchmark-k***    1/1     Running   0          114s

Step 4: View GPU monitoring data

Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, click the name of the target cluster. In the left-side navigation pane, choose Operations > Prometheus Monitoring.
Click the GPU Monitoring tab, then click the GPUs-Pods tab. The monitoring data shows which node each GPU pod is running on (for example, cn-beijing.10.131.xx.xxx).
Click the GPUs-Nodes tab and set GPUNode to a specific node (for example, cn-beijing.10.131.xx.xxx) to view detailed GPU information for that node. For parameter descriptions, see Dashboard description.

FAQ

DCGM memory leak

The ack-prometheus-gpu-exporter DaemonSet starts automatically when you install Managed Service for Prometheus. DCGM sometimes does not release memory correctly during runtime, causing memory usage to increase over time.

A resources.limits setting is configured on the exporter pod to mitigate this. When the memory limit is reached, the pod restarts automatically. This typically happens about once a month. After a restart, metrics resume reporting normally. Grafana may display anomalies for a few minutes (for example, a sudden spike in node count), but the display corrects itself.

For more information, see The DCGM has a memory leak? on GitHub.

`ack-prometheus-gpu-exporter` is killed by an out-of-memory event

The ack-prometheus-gpu-exporter uses DCGM in embedded mode, which consumes a large amount of memory on multi-GPU nodes and is prone to memory leaks. If you run multiple GPU processes on an instance with multiple GPUs and allocate too little memory to the exporter, the pod may be killed by an out-of-memory (OOM) event.

The pod typically resumes reporting metrics after it restarts. If out-of-memory kills happen frequently, increase the memory limits for the ack-prometheus-gpu-exporter DaemonSet in the arms-prom namespace.

`ack-prometheus-gpu-exporter` reports an error

If the pod logs contain an error similar to the following:

failed to get all process informations of gpu nvidia1,reason: failed to get gpu utilizations for all processes on device 1,reason: Not Found

This error occurs because older versions of ack-prometheus-gpu-exporter cannot retrieve GPU metrics when no tasks are running on certain GPU cards.

Upgrade the ack-arms-prometheus component to the latest version to resolve this issue.