GPU monitoring uses NVIDIA Data Center GPU Manager (DCGM) to monitor the GPU Nodes in your cluster. This document shows you how to view monitoring results for applications that request GPU resources in three different ways.
Prerequisites
You have created an ACK managed cluster.
You have enabled GPU monitoring for the cluster.
You have installed the GPU sharing component.
Background information
GPU monitoring comprehensively monitors the GPU Nodes in your cluster and provides dashboards at the Cluster, Node, and Pod levels. For more details, see Dashboard description.
The Cluster-level GPU Monitoring Dashboard shows information for the entire cluster or a specific Node Pool, such as cluster-wide utilization, GPU memory usage, and XID error detection.
The Node-level GPU Monitoring Dashboard shows node-specific information, such as GPU details, utilization, and GPU memory usage for a particular Node.
The Pod-level GPU Monitoring Dashboard shows Pod-specific information, such as the GPU resources requested by a Pod and its utilization.
This document uses the following example workflow to show how different GPU request methods affect monitoring results.
Important notes
GPU monitoring metrics are collected at a 15-second interval, which can cause a slight delay on the Grafana dashboard. As a result, the dashboard might show that a Node has no available GPU memory even though a Pod is successfully scheduled to it: if a Pod completes its task and releases its GPU resources between two scrapes, the scheduler can place a pending Pod on that Node before the next metric update.
The Monitoring Dashboard only monitors GPU resources requested through `resources.limits` in a Pod. For more information, see Resource Management for Pods and Containers. The data on the Monitoring Dashboard may be inaccurate if you use GPU resources in any of the following ways (a brief sketch contrasting these patterns follows the list):
- Run a GPU application directly on a Node.
- Run a GPU application in a container started directly with the `docker run` command.
- Request GPU resources for a Pod by setting the `NVIDIA_VISIBLE_DEVICES=all` or `NVIDIA_VISIBLE_DEVICES=<GPU ID>` environment variable directly in the Pod's `env` section and running a GPU program.
- Configure `privileged: true` in a Pod's `securityContext` and run a GPU program.
- Run a GPU program in a Pod whose `NVIDIA_VISIBLE_DEVICES` environment variable is not set, but whose container image has `NVIDIA_VISIBLE_DEVICES=all` configured by default.
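For illustration only, the two minimal Pod manifests below contrast a GPU request that the dashboard tracks with one that it does not. This is a sketch based on the patterns listed above; the names `gpu-app-tracked`, `gpu-app-untracked`, and `your-gpu-image` are placeholders and not part of this example.

```yaml
# Pattern 1 (tracked by the dashboard): the GPU is requested through
# resources.limits, so the request is recorded and can be attributed to the Pod.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-app-tracked        # placeholder name
spec:
  containers:
  - name: app
    image: your-gpu-image      # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1      # request one full GPU
---
# Pattern 2 (not tracked reliably): the GPU is exposed only through an
# environment variable, so no GPU resource request is recorded.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-app-untracked      # placeholder name
spec:
  containers:
  - name: app
    image: your-gpu-image      # placeholder image
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "all"
```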
The allocated GPU memory and used GPU memory are not always the same. For example, a GPU card has 16 GiB of total GPU memory, and you allocate 5 GiB of it to a Pod whose startup command is `sleep 1000`. In this case, the Pod is in the `Running` state but does not use the GPU for 1000 seconds. As a result, 5 GiB of GPU memory is allocated, but the used GPU memory is 0 GiB.
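To see this difference for yourself, you can compare what the dashboard reports with what the GPU driver reports inside a running Pod. This assumes the container image ships the `nvidia-smi` tool (the CUDA base images used by most GPU workloads do); the Pod name is a placeholder.

```bash
# Used memory as seen by the driver inside the Pod. For a Pod that only
# sleeps, memory.used stays at 0 MiB even though memory has been allocated to it.
kubectl exec <pod-name> -- nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```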
Step 1: Create a node pool
The GPU Monitoring Dashboard displays metrics for Pods that request GPU resources either as a full card or by a specific amount of GPU memory, optionally with computing power.
This example creates three Node Pools in a cluster to demonstrate Pod scheduling and resource usage for different GPU request models. For detailed instructions on creating a Node Pool, see Create a node pool. The configurations for the Node Pools are as follows:
| Configuration item | Description | Example value |
| --- | --- | --- |
| Node Pool Name | Name for the first Node Pool. | exclusive |
| | Name for the second Node Pool. | share-mem |
| | Name for the third Node Pool. | share-mem-core |
| Instance Type | The instance type for the nodes. This example runs a TensorFlow benchmark that requires 10 GiB of GPU memory, so the instance type must provide a GPU with more than 10 GiB of memory. | ecs.gn7i-c16g1.4xlarge |
| Expected Node Count | The total number of nodes that the Node Pool should maintain. | 1 |
| Node Labels | No label is added to the first Node Pool, which indicates that GPU resources are requested as a full card. | None |
| | The label added to the second Node Pool. It indicates that GPU resources are requested by GPU memory. | ack.node.gpu.schedule=cgpu |
| | The label added to the third Node Pool. It indicates that GPU resources are requested by GPU memory and that computing power requests are supported. | ack.node.gpu.schedule=core_mem |
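Optionally, once the Node Pools are ready, you can verify that the scheduling labels were applied as expected. This check is not part of the original procedure; it only lists the nodes together with the shared-GPU scheduling label.

```bash
# Show each node's ack.node.gpu.schedule label.
# Nodes from the "exclusive" pool show an empty value because no label is set.
kubectl get nodes -L ack.node.gpu.schedule
```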
Step 2: Deploy GPU applications
After creating the Node Pools, run GPU test Jobs on the Nodes to verify that GPU metrics are collected correctly. For information about the labels and scheduling relationships required for each Job, see GPU node types and scheduling labels. The configurations for the three Jobs are as follows:
| Job name | Node Pool for the task | GPU resource request |
| --- | --- | --- |
| tensorflow-benchmark-exclusive | exclusive | Requests 1 full GPU card. |
| tensorflow-benchmark-share-mem | share-mem | Requests 10 GiB of GPU memory. |
| tensorflow-benchmark-share-mem-core | share-mem-core | Requests 10 GiB of GPU memory and 30% of the computing power of one GPU card. |
Create the Job manifest files.
Create a file named `tensorflow-benchmark-exclusive.yaml` with the following YAML content.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: tensorflow-benchmark-exclusive
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        app: tensorflow-benchmark-exclusive
    spec:
      containers:
      - name: tensorflow-benchmark
        image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.2.3
        command:
        - bash
        - run.sh
        - --num_batches=5000000
        - --batch_size=8
        resources:
          limits:
            nvidia.com/gpu: 1 # Apply for one full GPU.
        workingDir: /root
      restartPolicy: Never
```

Create a file named `tensorflow-benchmark-share-mem.yaml` with the following YAML content.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: tensorflow-benchmark-share-mem
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        app: tensorflow-benchmark-share-mem
    spec:
      containers:
      - name: tensorflow-benchmark
        image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.2.3
        command:
        - bash
        - run.sh
        - --num_batches=5000000
        - --batch_size=8
        resources:
          limits:
            aliyun.com/gpu-mem: 10 # Apply for 10 GiB of GPU memory.
        workingDir: /root
      restartPolicy: Never
```

Create a file named `tensorflow-benchmark-share-mem-core.yaml` with the following YAML content.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: tensorflow-benchmark-share-mem-core
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        app: tensorflow-benchmark-share-mem-core
    spec:
      containers:
      - name: tensorflow-benchmark
        image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.2.3
        command:
        - bash
        - run.sh
        - --num_batches=5000000
        - --batch_size=8
        resources:
          limits:
            aliyun.com/gpu-mem: 10 # Apply for 10 GiB of GPU memory.
            aliyun.com/gpu-core.percentage: 30 # Apply for 30% of the computing power of a GPU.
        workingDir: /root
      restartPolicy: Never
```
Run the following commands to deploy the Jobs.

```bash
kubectl apply -f tensorflow-benchmark-exclusive.yaml
kubectl apply -f tensorflow-benchmark-share-mem.yaml
kubectl apply -f tensorflow-benchmark-share-mem-core.yaml
```

Run the following command to check the status of the Pods.

```bash
kubectl get pod
```

Expected output:

```
NAME                                        READY   STATUS    RESTARTS   AGE
tensorflow-benchmark-exclusive-7dff2        1/1     Running   0          3m13s
tensorflow-benchmark-share-mem-core-k24gz   1/1     Running   0          4m22s
tensorflow-benchmark-share-mem-shmpj        1/1     Running   0          3m46s
```

The output shows that all Pods are in the `Running` state, which means the Jobs were deployed successfully.
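Optionally, you can confirm that each Job was scheduled onto its intended Node Pool by checking which node each Pod landed on. This check is not part of the original procedure.

```bash
# Show the node that each benchmark Pod was scheduled to.
kubectl get pods -o wide | grep tensorflow-benchmark
```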
Step 3: View the GPU monitoring dashboard
View the GPUs - Cluster Dimension dashboard
Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, find the cluster you want and click its name. In the left-side pane, choose Operations > Prometheus Monitoring.
On the Prometheus monitoring page, click the GPU Monitoring tab, then click the GPUs - Cluster Dimension tab. The information on the cluster-level Monitoring Dashboard is described below. For more information, see Cluster dimension monitoring dashboard.

| No. | Panel name | Description |
| --- | --- | --- |
| 1 | Total GPU Nodes | The cluster has 3 GPU nodes. |
| 2 | Allocated GPUs | 1.9 of 3 total GPUs are allocated. Note: If a Pod requests a GPU as a full card, the allocation for that card is 1. For shared GPU scheduling, the allocation is the ratio of the allocated GPU memory to the total GPU memory of that card. In this example, 1.9 ≈ 1 (the full card on the exclusive node) + 0.45 + 0.45 (the 10 GiB allocated on each of the two shared cards). |
| 3 | Allocated GPU Memory | 63.0% of the total GPU memory is allocated. |
| 4 | Used GPU Memory | 35.5% of the total GPU memory is used. |
| 5 | Average GPU Utilization | The average utilization rate across all cards is 74%. |
| 6 | GPU Memory Copy Utilization | The average memory copy utilization rate across all cards is 43.7%. |
| 7 | GPU Node Details | Information about the GPU Nodes in the cluster, including node name, GPU card index, GPU utilization, and memory controller utilization. |
View the GPUs - Nodes dashboard
On the Prometheus monitoring page, click the GPU Monitoring tab, then click the GPUs - Nodes tab. Select the target Node from the GPUNode dropdown list. This example uses cn-hangzhou.10.166.154.xxx. The information on the node-level Monitoring Dashboard is described below:




| Panel group | No. | Panel name | Description |
| --- | --- | --- | --- |
| Overview | 1 | GPU Mode | The GPU mode is shared mode, where Pods request GPU resources by GPU memory and computing power. |
| | 2 | NVIDIA Driver Version | The installed GPU driver version is 535.161.07. |
| | 3 | Allocated GPUs | The total number of GPUs is 1, and the number of allocated GPUs is 0.45. |
| | 4 | GPU Utilization | The average GPU utilization rate is 26%. |
| | 5 | Allocated GPU Memory | The allocated GPU memory is 45.5% of the total GPU memory. |
| | 6 | Used GPU Memory | The currently used GPU memory is 36.4% of the total GPU memory. |
| | 7 | Allocated Computing Power | 30% of the computing power of GPU card 0 is allocated. Note: The Allocated Computing Power panel displays data only if computing power allocation is enabled on the Node. Among the three nodes in this example, only the Node with the ack.node.gpu.schedule=core_mem label shows data on this panel. |
| Utilization | 8 | GPU Utilization | For GPU card 0, the minimum utilization is 0%, the maximum is 33%, and the average is 12%. |
| | 9 | Memory Copy Utilization | For GPU card 0, the minimum memory copy utilization is 0%, the maximum is 22%, and the average is 8%. |
| Memory&BAR1 | 10 | GPU Memory Details | Details about GPU memory, including the GPU card's UUID, index number, and model. |
| | 11 | BAR1 Used | The used BAR1 memory is 4 MB. |
| | 12 | Memory Used | The used GPU memory on the card is 8.17 GB. |
| | 13 | BAR1 Total | The total BAR1 memory is 32.8 GB. |
| GPU Process | 14 | GPU Process Details | Details about each GPU process, including its Pod namespace and name. |
You can also view more advanced metrics at the bottom of the page. For details, see Node-level monitoring dashboard.
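If you want to cross-check the node dashboard's allocation figures, you can inspect the node's GPU resources directly with kubectl. The node name below is a placeholder for one of the nodes created in Step 1; the extended resource names (for example aliyun.com/gpu-mem) are the ones requested in the Job manifests above.

```bash
# Replace <node-name> with a node from one of the GPU node pools.
# The Capacity, Allocatable, and "Allocated resources" sections list the
# node's GPU resources and how much of them Pods currently request.
kubectl describe node <node-name>
```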

View the GPUs - Application Pod Dimension dashboard
On the Prometheus monitoring page, click the GPU Monitoring tab, then click the GPUs - Application Pod Dimension tab. The information on the Pod-level Monitoring Dashboard is described below:
| No. | Panel name | Description |
| --- | --- | --- |
| 1 | GPU Pod Details | Information about Pods in the cluster that have requested GPU resources, including the Pod's namespace, name, Node name, and used GPU memory. |
You can also view more advanced metrics at the bottom of the page. For details, see GPUs - Application Pod Dimension.
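To compare the per-Pod figures on this dashboard with what a Pod actually requested, you can read the resource limits straight from the Pod spec. The Pod name below is one of the example Pods from Step 2; the generated suffix in your cluster will differ.

```bash
# Print the GPU-related resource limits of one benchmark Pod.
# Replace the Pod name with the one shown by "kubectl get pod".
kubectl get pod tensorflow-benchmark-share-mem-shmpj \
  -o jsonpath='{.spec.containers[0].resources.limits}{"\n"}'
```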
