GPU monitoring uses NVIDIA Data Center GPU Manager (DCGM) to build a comprehensive monitoring system for the GPUs in a cluster. This topic describes how to enable GPU monitoring for a cluster.
Prerequisites
You have added GPU nodes to the cluster.
You have activated ARMS.
Background information
To manage large-scale GPU devices in a Kubernetes cluster, you need a comprehensive monitoring system. By monitoring GPU metrics, you can understand the GPU usage, health status, and workload performance of the entire cluster. This helps you quickly diagnose issues, optimize GPU resource allocation, and improve resource utilization. In addition to O&M engineers, other roles, such as data scientists and AI algorithm engineers, can also use these metrics to understand the GPU usage of their services. This information aids in capacity planning and task scheduling.
NVIDIA provides DCGM to manage GPUs in large-scale clusters. A GPU monitoring system built on NVIDIA DCGM offers powerful features and a variety of GPU monitoring metrics. Its main features include the following:
GPU behavior monitoring
GPU configuration management
GPU policy management
GPU health diagnostics
GPU-level and thread-level statistics
NVSwitch configuration and monitoring
Limits
The NVIDIA driver on the node must be version 418.87.01 or later. You can log on to a GPU node and run the nvidia-smi command to check the driver version, as shown in the sketch after this list.
To use GPU Profiling Metrics, the NVIDIA driver on the node must be version 450.80.02 or later. For more information about GPU Profiling Metrics, see Feature Overview.
Monitoring for NVIDIA MIG is not supported.
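For example, the following command prints only the installed driver version on a GPU node. This is a minimal sketch that uses standard nvidia-smi query options; the sample output is illustrative.

nvidia-smi --query-gpu=driver_version --format=csv,noheader

Sample output:

470.82.01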
Billing
For more information about the billing policy for Alibaba Cloud Prometheus, see Billing overview.
1. Enable Prometheus monitoring
Ensure that the ack-arms-prometheus component is version 1.1.7 or later. You can view the version of the ack-arms-prometheus component and upgrade it if necessary.
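If you prefer the command line, the following sketch lists the Prometheus-related workloads in the arms-prom namespace together with their container images, so that you can read the component version from the image tags. The namespace and workload names are assumptions and may differ by component version; the console remains the authoritative place to check and upgrade the component.

# The IMAGES column in wide output includes the image tags, which carry the version.
kubectl get deployments,daemonsets -n arms-prom -o wide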
Enable monitoring for an existing cluster
(Optional) For an ACK dedicated cluster, you must first grant authorization for monitoring policies to the cluster.
On the Clusters page, click the name of the target cluster. In the navigation pane on the left of the cluster details page, open the Prometheus Monitoring page.
On the Prometheus Monitoring page, select a container monitoring version and click Install.
After you enable monitoring, default basic metrics are automatically collected. For information about collecting custom metrics, see Collect custom metrics. You can also view several preset monitoring dashboards on this page, such as Cluster Overview, Node Monitoring, Application Monitoring, Network Monitoring, and Storage Monitoring.
Enable monitoring when creating a cluster
ACK managed cluster Pro Edition:
On the Component Configuration page, in the Container Monitoring section, select Container Cluster Monitoring Pro Edition or Container Cluster Monitoring Basic Edition. For more information, see Create an ACK managed cluster.
Clusters in smart hosting (Auto) mode enable Container Monitoring Basic Edition by default.
ACK managed cluster Basic Edition, ACS clusters, and ACK Serverless clusters:
On the Component Configurations page of the create cluster wizard, in the Monitor containers section, select Enable Managed Service for Prometheus to install Container Monitoring Basic Edition.
After monitoring is enabled, default basic metrics are automatically collected. To collect custom metrics, see Collect custom metrics. On the details page of the target cluster, in the navigation pane on the left, open the Prometheus Monitoring page. You can then view pre-configured monitoring dashboards such as Cluster Monitoring Overview, Node Monitoring, Application Monitoring, Network Monitoring, and Storage Monitoring.
For more information about how to enable Prometheus monitoring, see Enable Prometheus monitoring for ACK.
If you use a self-managed, open-source Prometheus service and require GPU monitoring capabilities, you must install the ack-gpu-exporter component.
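For reference, a self-managed Prometheus instance then needs a scrape job that discovers the exporter's endpoints. The following is only a hedged sketch: the job name is arbitrary and the Service name regex is an assumption, so check the actual Service and labels that ack-gpu-exporter creates in your cluster before using it.

scrape_configs:
  - job_name: gpu-exporter                      # arbitrary job name
    kubernetes_sd_configs:
      - role: endpoints                         # discover Service endpoints in the cluster
    relabel_configs:
      # Keep only endpoints whose Service name looks like the GPU exporter.
      # The regex is an assumption; verify the Service name created by ack-gpu-exporter.
      - source_labels: [__meta_kubernetes_service_name]
        action: keep
        regex: .*gpu-exporter.*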
2. Deploy a sample application
Create a file named tensorflow-benchmark.yaml with the following content.
apiVersion: batch/v1
kind: Job
metadata:
  name: tensorflow-benchmark
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        app: tensorflow-benchmark
    spec:
      containers:
      - name: tensorflow-benchmark
        image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.2.3
        command:
        - bash
        - run.sh
        - --num_batches=50000
        - --batch_size=8
        resources:
          limits:
            nvidia.com/gpu: 1 # Request one GPU.
        workingDir: /root
      restartPolicy: Never

Run the following command to deploy the tensorflow-benchmark application on a GPU node.
kubectl apply -f tensorflow-benchmark.yaml

Run the following command to check the pod status.
kubectl get pod

Expected output:
NAME                        READY   STATUS    RESTARTS   AGE
tensorflow-benchmark-k***   1/1     Running   0          114s
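Optionally, check the job's logs to confirm that the benchmark is running and generating GPU load before you open the dashboards. The command below selects the pod by the app label defined in the YAML file above.

kubectl logs -l app=tensorflow-benchmark --tail=20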
3. View GPU monitoring data for the cluster
Log on to the ACK console. In the navigation pane on the left, click Clusters.
On the Clusters page, find the cluster you want and click its name. In the left-side pane, open the Prometheus Monitoring page.
On the Prometheus Monitoring page, click the GPU Monitoring tab and then the GPUs-Pods tab.
The monitoring data shows that the GPU pod is running on the node cn-beijing.10.131.xx.xxx.

Click the GPUs-Nodes tab and set GPUNode to cn-beijing.10.131.xx.xxx to view the detailed GPU information for the node. For more information about the parameters, see Dashboard description.
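If you want to query the underlying data directly, monitoring systems built on DCGM typically expose metric names such as DCGM_FI_DEV_GPU_UTIL (GPU utilization) and DCGM_FI_DEV_FB_USED (used GPU memory). The PromQL sketches below assume these default metric names and use illustrative label names, which may differ from the labels that your exporter version attaches.

# Average GPU utilization per node (the NodeName label is an assumption).
avg by (NodeName) (DCGM_FI_DEV_GPU_UTIL)

# Used GPU memory per pod, in MiB (the PodName label is an assumption).
sum by (PodName) (DCGM_FI_DEV_FB_USED)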

FAQ
DCGM memory leak
Background: DCGM is a tool provided by NVIDIA for managing and monitoring GPUs. ack-prometheus-gpu-exporter is a DaemonSet pod that starts after you install the Managed Service for Prometheus component.
Cause: A DCGM memory leak occurs when the memory occupied by DCGM is not released correctly at runtime, which causes memory usage to increase continuously.
Solution: DCGM may experience memory leaks. To work around this issue, a resources.limits setting is configured for the pod in which ack-prometheus-gpu-exporter runs. When memory usage reaches the limit, ack-prometheus-gpu-exporter restarts, which typically occurs about once a month. After the restart, it reports metrics as normal. However, for a few minutes after the restart, Grafana might display some metrics abnormally, such as a sudden increase in the number of nodes. The display returns to normal afterward. For more information about this issue, see The DCGM has a memory leak?.
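To see whether and how often this restart has happened in your cluster, you can check the restart count of the exporter pods. This sketch assumes the pods run in the arms-prom namespace and that their names contain gpu-exporter.

# The RESTARTS column shows how many times each exporter pod has restarted.
kubectl get pods -n arms-prom | grep gpu-exporter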
ack-prometheus-gpu-exporter experiences an OOM kill
Background: ack-prometheus-gpu-exporter is a DaemonSet pod that starts after you install the Managed Service for Prometheus component. It might cause issues when you enable monitoring.
Cause: ack-prometheus-gpu-exporter on an ACK cluster uses DCGM in embedded mode. In this mode, DCGM consumes a large amount of memory on multi-GPU nodes and is prone to memory leaks. Therefore, if you run multiple GPU processes on an instance with multiple GPUs and allocate only a small amount of memory to ack-prometheus-gpu-exporter, the exporter pod might be killed by an out-of-memory (OOM) event.
Solution: In this case, the pod typically resumes reporting metrics after it restarts. If OOM kills occur frequently, you can manually increase the memory limits for the ack-prometheus-gpu-exporter DaemonSet in the arms-prom namespace to resolve the issue, as shown in the sketch below.
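A hedged sketch of raising the limit from the command line is shown below; the 1Gi value is only an example, so choose a limit that fits the number of GPUs on your nodes. You can also edit the DaemonSet manifest directly instead.

# Raise the memory limit of the exporter DaemonSet (the 1Gi value is an example).
kubectl -n arms-prom set resources daemonset ack-prometheus-gpu-exporter --limits=memory=1Gi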
ack-prometheus-gpu-exporter reports an error
Background: ack-prometheus-gpu-exporter is a DaemonSet pod that starts after you install the Managed Service for Prometheus component. An error from this pod can cause monitoring issues.
Cause: The issue occurs if the pod logs for ack-prometheus-gpu-exporter contain an error message similar to the following:

failed to get all process informations of gpu nvidia1,reason: failed to get gpu utilizations for all processes on device 1,reason: Not Found

This error occurs because older versions of ack-prometheus-gpu-exporter cannot retrieve GPU metrics for the relevant containers when no tasks are running on certain GPU cards.
Solution: This issue is fixed in the latest version. To resolve it, upgrade the ack-arms-prometheus component to the latest version.
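Before upgrading, you can confirm that your exporter pods log this error. The sketch below assumes the arms-prom namespace described above; replace the placeholder with an actual pod name from the first command.

# List the exporter pods, then search one pod's logs for the error message.
kubectl -n arms-prom get pods | grep gpu-exporter
kubectl -n arms-prom logs <gpu-exporter-pod-name> | grep "failed to get"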