
Container Service for Kubernetes: Best practices for monitoring GPU resources

Last Updated: Nov 18, 2025

GPU monitoring uses NVIDIA Data Center GPU Manager (DCGM) to monitor the GPU Nodes in your cluster. This document shows you how to view monitoring results for applications that request GPU resources in three different ways.

Prerequisites

Background information

GPU monitoring comprehensively monitors the GPU Nodes in your cluster and provides dashboards at the Cluster, Node, and Pod levels. For more details, see Dashboard description.

  • The Cluster-level GPU Monitoring Dashboard shows information for the entire cluster or a specific Node Pool, such as cluster-wide utilization, GPU memory usage, and XID error detection.

  • The Node-level GPU Monitoring Dashboard shows node-specific information, such as GPU details, utilization, and GPU memory usage for a particular Node.

  • The Pod-level GPU Monitoring Dashboard shows Pod-specific information, such as the GPU resources requested by a Pod and its utilization.

This document uses the following example workflow to show how different GPU request methods affect monitoring results: create Node Pools for each request method (Step 1), deploy GPU test applications to them (Step 2), and then view the cluster-, node-, and Pod-level monitoring dashboards (Step 3).

Important notes

  • GPU monitoring metrics are collected at a 15-second interval, which can cause a slight data delay on the Grafana dashboard. As a result, the dashboard might show no available GPU memory on a Node, but a Pod is still successfully scheduled to it. This can happen if a Pod completes its task and releases GPU resources within a 15-second collection interval (between two scrapes), allowing the scheduler to place a pending Pod on that Node before the next metric update.

  • The Monitoring Dashboard only monitors GPU resources requested through resources.limits in a Pod. For more information, see Resource Management for Pods and Containers.

    The data on the Monitoring Dashboard may be inaccurate if you use GPU resources in the following ways:

    • Run a GPU application directly on a Node.

    • Run a GPU application in a container started directly with the docker run command.

    • Request GPU resources for a Pod by setting the NVIDIA_VISIBLE_DEVICES=all or NVIDIA_VISIBLE_DEVICES=<GPU ID> environment variable directly in the Pod's env section and running a GPU program (a sketch of this pattern follows these notes).

    • Configure privileged: true in a Pod's securityContext and run a GPU program.

    • Run a GPU program in a Pod where the NVIDIA_VISIBLE_DEVICES environment variable is not set, but the container image used by the Pod has NVIDIA_VISIBLE_DEVICES=all configured by default.

  • The allocated GPU memory and used GPU memory are not always the same. For example, a GPU card has 16 GiB of total GPU memory. You allocate 5 GiB of it to a Pod whose startup command is sleep 1000. In this case, the Pod is in a Running state but will not use the GPU for 1000 seconds. As a result, 5 GiB of GPU memory is allocated, but the used GPU memory is 0 GiB.
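As an illustration of the environment-variable pattern mentioned in the notes above, the following is a minimal sketch of a Pod that exposes GPUs only through NVIDIA_VISIBLE_DEVICES. The Pod name, image, and command are placeholders, not part of the example workflow. A Pod defined this way can run GPU programs, but because it declares no GPU resource in resources.limits, its GPU usage is not attributed correctly on the monitoring dashboards.

    # Pattern to avoid: the Pod uses a GPU without declaring it in resources.limits.
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-env-only-example        # hypothetical name
    spec:
      restartPolicy: Never
      containers:
      - name: cuda-app
        image: nvidia/cuda:12.1.1-base-ubuntu22.04   # placeholder image
        command: ["nvidia-smi"]                      # placeholder GPU command
        env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"    # exposes all GPUs on the node to the container
        # No nvidia.com/gpu or aliyun.com/gpu-mem limit is set, so the dashboards
        # cannot attribute this Pod's GPU usage.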

Step 1: Create a node pool

The GPU Monitoring Dashboard displays metrics for Pods that request GPU resources either as a full card or by a specific amount of GPU memory, optionally with computing power.

This example creates three Node Pools in a cluster to demonstrate Pod scheduling and resource usage for different GPU request models. For detailed instructions on creating a Node Pool, see Create a node pool. The configurations for the Node Pools are as follows:

| Configuration item | Description | Example value |
| --- | --- | --- |
| Node Pool Name | Name for the first Node Pool. | exclusive |
| Node Pool Name | Name for the second Node Pool. | share-mem |
| Node Pool Name | Name for the third Node Pool. | share-mem-core |
| Instance Type | The instance type for the nodes. This example uses a TensorFlow Benchmark project that requires 10 GiB of GPU memory, so the node's instance type must provide more than 10 GiB of GPU memory. | ecs.gn7i-c16g1.4xlarge |
| Expected Node Count | The total number of nodes that each Node Pool should maintain. | 1 |
| Node Labels | The label added to the first Node Pool. Indicates that GPU resources are requested as a full card. | None |
| Node Labels | The label added to the second Node Pool. Indicates that GPU resources are requested by GPU memory. | ack.node.gpu.schedule=cgpu |
| Node Labels | The label added to the third Node Pool. Indicates that GPU resources are requested by GPU memory and that computing power requests are supported. | ack.node.gpu.schedule=core_mem |
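After the Node Pools are created, one way to confirm that the labels were applied to the nodes, assuming kubectl access to the cluster:

    kubectl get nodes -L ack.node.gpu.schedule

The output includes an additional column with the value of the ack.node.gpu.schedule label for each node: cgpu for the share-mem Node Pool, core_mem for the share-mem-core Node Pool, and empty for the exclusive Node Pool.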

Step 2: Deploy GPU applications

After creating the Node Pools, run GPU test Jobs on the Nodes to verify that GPU metrics are collected correctly. For information about the labels and scheduling relationships required for each Job, see GPU node types and scheduling labels. The configurations for the three Jobs are as follows:

| Job name | Node Pool for the task | GPU resource request |
| --- | --- | --- |
| tensorflow-benchmark-exclusive | exclusive | nvidia.com/gpu: 1 (requests 1 full GPU card) |
| tensorflow-benchmark-share-mem | share-mem | aliyun.com/gpu-mem: 10 (requests 10 GiB of GPU memory) |
| tensorflow-benchmark-share-mem-core | share-mem-core | aliyun.com/gpu-mem: 10 and aliyun.com/gpu-core.percentage: 30 (requests 10 GiB of GPU memory and 30% of the computing power of one GPU card) |

  1. Create the Job manifest files.

    • Create a file named tensorflow-benchmark-exclusive.yaml with the following YAML content.

      apiVersion: batch/v1
      kind: Job
      metadata:
        name: tensorflow-benchmark-exclusive
      spec:
        parallelism: 1
        template:
          metadata:
            labels:
              app: tensorflow-benchmark-exclusive
          spec:
            containers:
            - name: tensorflow-benchmark
              image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.2.3
              command:
              - bash
              - run.sh
              - --num_batches=5000000
              - --batch_size=8
              resources:
                limits:
                  nvidia.com/gpu: 1 # Request one full GPU card.
              workingDir: /root
            restartPolicy: Never
    • Create a file named tensorflow-benchmark-share-mem.yaml with the following YAML content.

      apiVersion: batch/v1
      kind: Job
      metadata:
        name: tensorflow-benchmark-share-mem
      spec:
        parallelism: 1
        template:
          metadata:
            labels:
              app: tensorflow-benchmark-share-mem
          spec:
            containers:
            - name: tensorflow-benchmark
              image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.2.3
              command:
              - bash
              - run.sh
              - --num_batches=5000000
              - --batch_size=8
              resources:
                limits:
                  aliyun.com/gpu-mem: 10 # Request 10 GiB of GPU memory.
              workingDir: /root
            restartPolicy: Never
    • Create a file named tensorflow-benchmark-share-mem-core.yaml with the following YAML content.

      apiVersion: batch/v1
      kind: Job
      metadata:
        name: tensorflow-benchmark-share-mem-core
      spec:
        parallelism: 1
        template:
          metadata:
            labels:
              app: tensorflow-benchmark-share-mem-core
          spec:
            containers:
            - name: tensorflow-benchmark
              image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.2.3
              command:
              - bash
              - run.sh
              - --num_batches=5000000
              - --batch_size=8
              resources:
                limits:
                  aliyun.com/gpu-mem: 10 # Request 10 GiB of GPU memory.
                  aliyun.com/gpu-core.percentage: 30 # Request 30% of the computing power of one GPU card.
              workingDir: /root
            restartPolicy: Never
  2. Run the following commands to deploy the Jobs.

    kubectl apply -f tensorflow-benchmark-exclusive.yaml
    kubectl apply -f tensorflow-benchmark-share-mem.yaml
    kubectl apply -f tensorflow-benchmark-share-mem-core.yaml
  3. Run the following command to check the status of the Pods.

    kubectl get pod

    Expected output:

    NAME                                        READY   STATUS    RESTARTS   AGE
    tensorflow-benchmark-exclusive-7dff2        1/1     Running   0          3m13s
    tensorflow-benchmark-share-mem-core-k24gz   1/1     Running   0          4m22s
    tensorflow-benchmark-share-mem-shmpj        1/1     Running   0          3m46s

    The output shows that all Pods are in the Running state, which means the Jobs were deployed successfully.
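
    To confirm that each Job was scheduled to the intended Node Pool, you can also list the Pods together with the nodes they run on:

    kubectl get pod -o wide

    The NODE column shows the node for each Pod. Each of the three Pods should be running on a node from its corresponding Node Pool (exclusive, share-mem, or share-mem-core).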

Step 3: View the GPU monitoring dashboard

View the GPUs - Cluster Dimension dashboard

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, find the cluster you want and click its name. In the left-side pane, choose Operations > Prometheus Monitoring.

  3. On the Prometheus monitoring page, click the GPU Monitoring tab, then click the GPUs - Cluster Dimension tab. The information on the cluster-level Monitoring Dashboard is described below. For more information, see Cluster dimension monitoring dashboard.


    | No. | Panel name | Description |
    | --- | --- | --- |
    | 1 | Total GPU Nodes | The cluster has 3 GPU nodes. |
    | 2 | Allocated GPUs | 1.9 of the 3 GPUs in the cluster are allocated. Note: if a Pod requests a GPU as a full card, the allocation for that card is 1. For shared GPU scheduling, the allocation is the ratio of the allocated GPU memory to the total GPU memory of that card. |
    | 3 | Allocated GPU Memory | 63.0% of the total GPU memory is allocated. |
    | 4 | Used GPU Memory | 35.5% of the total GPU memory is used. |
    | 5 | Average GPU Utilization | The average utilization rate across all cards is 74%. |
    | 6 | GPU Memory Copy Utilization | The average memory copy utilization rate across all cards is 43.7%. |
    | 7 | GPU Node Details | Information about the GPU Nodes in the cluster, including node name, GPU card index, GPU utilization, and memory controller utilization. |
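
    To cross-check the Allocated GPUs panel against the scheduler's view, you can inspect the extended resources allocated on a node. Replace the placeholder with a real node name:

    kubectl describe node <NODE_NAME> | grep -A 10 "Allocated resources"

    On the node in the exclusive Node Pool, the output lists nvidia.com/gpu requests. On the nodes in the shared Node Pools, it lists aliyun.com/gpu-mem and, for the share-mem-core Node Pool, aliyun.com/gpu-core.percentage requests.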

View the GPUs - Nodes dashboard

On the Prometheus monitoring page, click the GPU Monitoring tab, then click the GPUs - Nodes tab. Select the target Node from the GPUNode dropdown list. This example uses cn-hangzhou.10.166.154.xxx. The information on the node-level Monitoring Dashboard is described below:


| Panel group | No. | Panel name | Description |
| --- | --- | --- | --- |
| Overview | 1 | GPU Mode | The GPU mode is shared mode. Pods request GPU resources by GPU memory and computing power. |
| Overview | 2 | NVIDIA Driver Version | The installed GPU driver version is 535.161.07. |
| Overview | 3 | Allocated GPUs | The total number of GPUs is 1, and the number of allocated GPUs is 0.45. |
| Overview | 4 | GPU Utilization | The average GPU utilization rate is 26%. |
| Overview | 5 | Allocated GPU Memory | The allocated GPU memory is 45.5% of the total GPU memory. |
| Overview | 6 | Used GPU Memory | The currently used GPU memory is 36.4% of the total GPU memory. |
| Overview | 7 | Allocated Computing Power | 30% of the computing power of GPU card 0 is allocated. Note: this panel only displays data if computing power allocation is enabled on the Node. Among the three nodes in this example, only the Node with the ack.node.gpu.schedule=core_mem label shows data for this metric. |
| Utilization | 8 | GPU Utilization | For GPU card 0, the minimum utilization is 0%, the maximum is 33%, and the average is 12%. |
| Utilization | 9 | Memory Copy Utilization | For GPU card 0, the minimum memory copy utilization is 0%, the maximum is 22%, and the average is 8%. |
| Memory&BAR1 | 10 | GPU Memory Details | Details about GPU memory, including the GPU card's UUID, index number, and model. |
| Memory&BAR1 | 11 | BAR1 Used | The used BAR1 memory is 4 MB. |
| Memory&BAR1 | 12 | Memory Used | The used GPU memory on the card is 8.17 GB. |
| Memory&BAR1 | 13 | BAR1 Total | The total BAR1 memory is 32.8 GB. |
| GPU Process | 14 | GPU Process Details | Details about each GPU process, including its Pod namespace and name. |

You can also view more advanced metrics at the bottom of the page. For details, see Node-level monitoring dashboard.
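
To compare the dashboard values with what the GPU driver itself reports, you can also run nvidia-smi inside one of the GPU Pods. The Pod name below comes from the earlier example output; replace it with a name from your own kubectl get pod output:

    kubectl exec -it tensorflow-benchmark-exclusive-7dff2 -- nvidia-smi

The used GPU memory that nvidia-smi reports should roughly match the Used GPU Memory panel. Small differences are expected because the dashboard metrics are scraped every 15 seconds.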


View the GPUs - Application Pod Dimension dashboard

On the Prometheus monitoring page, click the GPU Monitoring tab, then click the GPUs - Application Pod Dimension tab. The information on the Pod-level Monitoring Dashboard is described below:

| No. | Panel name | Description |
| --- | --- | --- |
| 1 | GPU Pod Details | Information about Pods in the cluster that have requested GPU resources, including the Pod's namespace, name, Node name, and used GPU memory. |

Note
  • Allocated GPU Memory represents the GPU memory allocated to the Pod. In shared GPU scheduling, a Node can report only an integer value for total GPU memory to the API Server, so the Node rounds the actual value down when reporting it.

    • For example, if a card has 31.7 GiB of GPU memory, the Node reports 31 GiB to the API Server. If a Pod requests 10 GiB, the actual GPU memory allocated to the Pod is 31.7 * (10 / 31) = 10.2 GiB.

  • Allocated Computing Power represents the computing power allocated to the Pod. If a Pod does not request computing power, this value is "-". In this example, the Pod named tensorflow-benchmark-share-mem-core-k24gz requested 30% of the computing power of a GPU card.

You can also view more advanced metrics at the bottom of the page. For details, see GPUs - Application Pod Dimension.
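
To list from the command line which Pods have requested GPU resources, similar to what the GPU Pod Details panel shows, one option is a custom-columns query. The column names here are arbitrary, and only the first container of each Pod is inspected:

    kubectl get pod -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,GPU_LIMITS:.spec.containers[0].resources.limits

Pods that did not request GPU resources show <none> or only non-GPU limits in the GPU_LIMITS column.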
