Container Service for Kubernetes: Best practices for monitoring GPU resources

Last Updated: Mar 26, 2026

GPU monitoring uses NVIDIA Data Center GPU Manager (DCGM) to collect metrics from GPU nodes in your cluster and surfaces them in Grafana dashboards at the cluster, node, and Pod level. This guide walks through a complete example: you'll create three node pools with different GPU scheduling modes, deploy test Jobs on each, and then read the monitoring dashboards to see how each request mode affects the metrics you observe.

Prerequisites

Before you begin, ensure that you have:

GPU request modes

ACK supports three GPU scheduling modes. Each affects how GPU resources appear on the monitoring dashboard. Choose your mode before creating node pools.

Mode | Node label | Resource request | Use when
Full card | None | nvidia.com/gpu: <count> | A workload needs exclusive access to one or more physical GPU cards
Memory only | ack.node.gpu.schedule=cgpu | aliyun.com/gpu-mem: <GiB> | Multiple workloads share a GPU card; you want to limit GPU memory per workload
Memory + computing power | ack.node.gpu.schedule=core_mem | aliyun.com/gpu-mem: <GiB> + aliyun.com/gpu-core.percentage: <pct> | Multiple workloads share a GPU card; you want to limit both memory and compute percentage
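
As a rough illustration of the table above, the following Python sketch builds the resources.limits mapping for each mode. The helper name gpu_limits and the mode keys (full_card, memory, memory_core) are hypothetical names chosen for this example; only the resource names and node labels come from the table.

```python
# Hypothetical lookup: node label required by each GPU request mode.
NODE_LABELS = {
    "full_card": None,
    "memory": "ack.node.gpu.schedule=cgpu",
    "memory_core": "ack.node.gpu.schedule=core_mem",
}


def gpu_limits(mode, count=1, mem_gib=None, core_pct=None):
    """Return the resources.limits mapping for the given request mode."""
    if mode == "full_card":
        return {"nvidia.com/gpu": count}
    if mode == "memory":
        if mem_gib is None:
            raise ValueError("memory mode needs mem_gib")
        return {"aliyun.com/gpu-mem": mem_gib}
    if mode == "memory_core":
        if mem_gib is None or core_pct is None:
            raise ValueError("memory_core mode needs mem_gib and core_pct")
        return {
            "aliyun.com/gpu-mem": mem_gib,
            "aliyun.com/gpu-core.percentage": core_pct,
        }
    raise ValueError(f"unknown mode: {mode}")


print(NODE_LABELS["memory_core"], gpu_limits("memory_core", mem_gib=10, core_pct=30))
```

The values 10 and 30 match the requests used by the test Jobs later in this guide.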

Limitations

GPU monitoring only tracks resources declared in resources.limits. Dashboard data may be inaccurate if GPU resources are accessed through any of the following methods:

  • Running a GPU application directly on a node (outside a container)

  • Starting a container with docker run rather than Kubernetes

  • Setting NVIDIA_VISIBLE_DEVICES=all or NVIDIA_VISIBLE_DEVICES=<GPU ID> directly in the Pod's env section

  • Setting privileged: true in a container's securityContext in the Pod spec

  • Using a container image that sets NVIDIA_VISIBLE_DEVICES=all by default, without declaring GPU resources in resources.limits
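
Two of the cases above can be spotted by inspecting a Pod spec. The following Python sketch is one possible check, not an official tool: the function name monitoring_risks is hypothetical, and it assumes the Pod spec has already been loaded as a dictionary (for example, from a manifest). It cannot detect an image that sets NVIDIA_VISIBLE_DEVICES by default, or workloads started outside Kubernetes.

```python
def monitoring_risks(pod_spec):
    """Return reasons why GPU usage by this Pod may be reported inaccurately.

    pod_spec is the `spec` section of a Pod, as a plain dictionary.
    """
    risks = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "unnamed")
        # Case: NVIDIA_VISIBLE_DEVICES is set directly in the env section.
        env_names = {e.get("name") for e in container.get("env") or []}
        if "NVIDIA_VISIBLE_DEVICES" in env_names:
            risks.append(f"{name}: sets NVIDIA_VISIBLE_DEVICES in env")
        # Case: the container runs in privileged mode.
        security = container.get("securityContext") or {}
        if security.get("privileged") is True:
            risks.append(f"{name}: runs with privileged: true")
    return risks


example = {
    "containers": [
        {
            "name": "app",
            "env": [{"name": "NVIDIA_VISIBLE_DEVICES", "value": "all"}],
            "securityContext": {"privileged": True},
        }
    ]
}
for reason in monitoring_risks(example):
    print(reason)
```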

Allocated GPU memory vs. used GPU memory: These two values are not the same. For example, if you allocate 5 GiB to a Pod whose only command is sleep 1000, the allocated GPU memory is 5 GiB but the used GPU memory is 0 GiB for the duration of the sleep.

15-second scrape interval: Metrics are collected every 15 seconds. During this window, a Pod may complete its task, release GPU resources, and allow the scheduler to place another Pod on that node — all before the next metric update. As a result, the dashboard may briefly show no available GPU memory on a node even though a new Pod was successfully scheduled there.
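
The effect of the scrape interval can be illustrated with a small simulation. This is a simplified model under stated assumptions: scrapes happen exactly at multiples of 15 seconds, the timeline values are invented for illustration, and dashboard_value is a hypothetical helper rather than part of any monitoring API.

```python
SCRAPE_INTERVAL = 15  # seconds, as described above


def dashboard_value(events, now, interval=SCRAPE_INTERVAL):
    """Return the free GPU memory (GiB) the dashboard shows at time `now`.

    events is a list of (time_s, free_gib) state changes sorted by time.
    The dashboard only knows the node state as of the most recent scrape.
    """
    last_scrape = (now // interval) * interval
    shown = None
    for time_s, free_gib in events:
        if time_s <= last_scrape:
            shown = free_gib
    return shown


# Hypothetical timeline: the node is full at t=0, a Pod finishes at t=17
# and frees 10 GiB, and a new Pod is scheduled at t=20 and takes it again.
events = [(0, 0), (17, 10), (20, 0)]
print(dashboard_value(events, 25))  # value from the scrape at t=15
print(dashboard_value(events, 30))  # value from the scrape at t=30
```

In this timeline both scrapes miss the window in which memory was free, so the dashboard reports no available GPU memory throughout even though the second Pod was scheduled.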

How it works

GPU monitoring provides dashboards at three levels:

  • Cluster level — Cluster-wide GPU utilization, memory usage, and XID error detection for the entire cluster or a specific node pool.

  • Node level — Per-node details: GPU mode, driver version, utilization, memory, and running processes.

  • Pod level — Per-Pod GPU resources requested and actual memory used.

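This guide uses the following workflow to show how different GPU request methods affect monitoring results: create one node pool per GPU scheduling mode, deploy a test Job on each node pool, and then read the cluster-level, node-level, and Pod-level dashboards.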

Step 1: Create node pools

Create three node pools — one for each GPU scheduling mode. For detailed instructions, see Create a node pool.

Configure each node pool with the following settings. This example uses a TensorFlow Benchmark workload that requires 10 GiB of GPU memory, so the instance type must provide more than 10 GiB of GPU memory.

Configuration item | Node pool 1 | Node pool 2 | Node pool 3
Node pool name | exclusive | share-mem | share-mem-core
Instance type | ecs.gn7i-c16g1.4xlarge | ecs.gn7i-c16g1.4xlarge | ecs.gn7i-c16g1.4xlarge
Expected node count | 1 | 1 | 1
Node labels | None | ack.node.gpu.schedule=cgpu | ack.node.gpu.schedule=core_mem

For the label reference and scheduling behavior of each mode, see GPU node types and scheduling labels.

Step 2: Deploy GPU applications

Deploy three test Jobs — one per node pool — to generate GPU metrics.

Job name | Target node pool | GPU resource request
tensorflow-benchmark-exclusive | exclusive | nvidia.com/gpu: 1 (requests 1 full GPU card)
tensorflow-benchmark-share-mem | share-mem | aliyun.com/gpu-mem: 10 (requests 10 GiB of GPU memory)
tensorflow-benchmark-share-mem-core | share-mem-core | aliyun.com/gpu-mem: 10 + aliyun.com/gpu-core.percentage: 30 (requests 10 GiB of GPU memory and 30% of computing power)

  1. Create the Job manifest files.

    • Create tensorflow-benchmark-exclusive.yaml:

      apiVersion: batch/v1
      kind: Job
      metadata:
        name: tensorflow-benchmark-exclusive
      spec:
        parallelism: 1
        template:
          metadata:
            labels:
              app: tensorflow-benchmark-exclusive
          spec:
            containers:
            - name: tensorflow-benchmark
              image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.2.3
              command:
              - bash
              - run.sh
              - --num_batches=5000000
              - --batch_size=8
              resources:
                limits:
                  nvidia.com/gpu: 1  # Request one full GPU card.
              workingDir: /root
            restartPolicy: Never
    • Create tensorflow-benchmark-share-mem.yaml:

      apiVersion: batch/v1
      kind: Job
      metadata:
        name: tensorflow-benchmark-share-mem
      spec:
        parallelism: 1
        template:
          metadata:
            labels:
              app: tensorflow-benchmark-share-mem
          spec:
            containers:
            - name: tensorflow-benchmark
              image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.2.3
              command:
              - bash
              - run.sh
              - --num_batches=5000000
              - --batch_size=8
              resources:
                limits:
                  aliyun.com/gpu-mem: 10  # Request 10 GiB of GPU memory.
              workingDir: /root
            restartPolicy: Never
    • Create tensorflow-benchmark-share-mem-core.yaml:

      apiVersion: batch/v1
      kind: Job
      metadata:
        name: tensorflow-benchmark-share-mem-core
      spec:
        parallelism: 1
        template:
          metadata:
            labels:
              app: tensorflow-benchmark-share-mem-core
          spec:
            containers:
            - name: tensorflow-benchmark
              image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.2.3
              command:
              - bash
              - run.sh
              - --num_batches=5000000
              - --batch_size=8
              resources:
                limits:
                  aliyun.com/gpu-mem: 10  # Request 10 GiB of GPU memory.
                  aliyun.com/gpu-core.percentage: 30  # Request 30% of the computing power of a GPU.
              workingDir: /root
            restartPolicy: Never
  2. Deploy all three Jobs:

    kubectl apply -f tensorflow-benchmark-exclusive.yaml
    kubectl apply -f tensorflow-benchmark-share-mem.yaml
    kubectl apply -f tensorflow-benchmark-share-mem-core.yaml
  3. Verify that all Pods are running:

    kubectl get pod

    Expected output:

    NAME                                        READY   STATUS    RESTARTS   AGE
    tensorflow-benchmark-exclusive-7dff2        1/1     Running   0          3m13s
    tensorflow-benchmark-share-mem-core-k24gz   1/1     Running   0          4m22s
    tensorflow-benchmark-share-mem-shmpj        1/1     Running   0          3m46s

    When all three Pods are in the Running state, the Jobs are deployed and generating GPU metrics.
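
The same check can be scripted. The following Python sketch assumes kubectl is installed and configured for the cluster, and that the Jobs run in the default namespace; the function names are hypothetical. It reads the status.phase field from the JSON output of kubectl get pod, which is a close but not exact equivalent of the STATUS column.

```python
import json
import subprocess


def pods_not_running(pod_list):
    """Return names of Pods whose phase is not Running.

    pod_list is the parsed output of `kubectl get pod -o json`.
    """
    return [
        item["metadata"]["name"]
        for item in pod_list.get("items", [])
        if item.get("status", {}).get("phase") != "Running"
    ]


def fetch_pods(namespace="default"):
    """Run kubectl and parse its JSON output (requires cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "pod", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling pods_not_running(fetch_pods()) returns an empty list when every Pod in the namespace is running.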

Step 3: View the GPU monitoring dashboard

View the GPUs - Cluster Dimension dashboard

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of your cluster. In the left-side pane, choose Operations > Prometheus Monitoring.

  3. On the Prometheus Monitoring page, click the GPU Monitoring tab, then click the GPUs - Cluster Dimension tab. If all three Jobs are running and GPU metrics appear for all three nodes, the cluster-level dashboard is working correctly. The panels on the cluster-level dashboard show the following information. For the full panel reference, see Cluster dimension monitoring dashboard.

    No. | Panel name | Description | Example value
    1 | Total GPU Nodes | Number of GPU nodes in the cluster | 3
    2 | Allocated GPUs | Number of allocated GPUs. For full-card requests, 1 card = 1 allocation. For shared scheduling, allocation = allocated GPU memory / total GPU memory on that card. | 1.9 of 3
    3 | Allocated GPU Memory | Percentage of total GPU memory allocated across the cluster | 63.0%
    4 | Used GPU Memory | Percentage of total GPU memory currently in use | 35.5%
    5 | Average GPU Utilization | Average utilization rate across all GPU cards | 74%
    6 | GPU Memory Copy Utilization | Average memory copy utilization rate across all GPU cards | 43.7%
    7 | GPU Node Details | Per-node table: node name, GPU card index, GPU utilization, and memory controller utilization |
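
The Allocated GPUs rule can be worked through for this example. The sketch below assumes each card reports 22 GiB of GPU memory, a figure inferred from the node-level example in which a 10 GiB request shows as 45.5% allocated; it is not stated directly in this guide. The function name allocated_gpus is hypothetical.

```python
def allocated_gpus(full_cards, shared_allocations):
    """Apply the Allocated GPUs rule from the panel description.

    full_cards: number of cards allocated through full-card requests.
    shared_allocations: list of (allocated_gib, card_total_gib) pairs,
    one per card used in shared scheduling.
    """
    shared = sum(allocated / total for allocated, total in shared_allocations)
    return full_cards + shared


CARD_TOTAL_GIB = 22  # assumption inferred from the node-level example

# One full-card Job plus two shared Jobs that each request 10 GiB.
value = allocated_gpus(1, [(10, CARD_TOTAL_GIB), (10, CARD_TOTAL_GIB)])
print(f"{value:.1f} of 3")  # 1.9 of 3
```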

View the GPUs - Nodes dashboard

  1. On the Prometheus Monitoring page, click the GPU Monitoring tab, then click the GPUs - Nodes tab.

  2. From the GPUNode dropdown list, select the node you want to inspect. This example uses cn-hangzhou.10.166.154.xxx. The node-level dashboard is organized into four panel groups, listed in the following table. The Allocated Computing Power panel shows data only on the node with the ack.node.gpu.schedule=core_mem label. If you select one of the other two nodes, this panel is empty, which is expected. For advanced metrics available at the bottom of the page, see Node-level monitoring dashboard.

    Panel group | No. | Panel name | Description | Example value
    Overview | 1 | GPU Mode | The GPU scheduling mode on this node | Shared mode
    Overview | 2 | NVIDIA Driver Version | Installed GPU driver version | 535.161.07
    Overview | 3 | Allocated GPUs | Total GPUs on the node and how many are allocated | 1 total, 0.45 allocated
    Overview | 4 | GPU Utilization | Average GPU utilization across all cards on this node | 26%
    Overview | 5 | Allocated GPU Memory | Percentage of total GPU memory allocated on this node | 45.5%
    Overview | 6 | Used GPU Memory | Percentage of total GPU memory currently in use on this node | 36.4%
    Overview | 7 | Allocated Computing Power | Percentage of computing power allocated on this node. Only appears on nodes with the ack.node.gpu.schedule=core_mem label. | 30% of GPU card 0
    Utilization | 8 | GPU Utilization | Per-card utilization: min, max, and average over the selected time range | Card 0: min 0%, max 33%, avg 12%
    Utilization | 9 | Memory Copy Utilization | Per-card memory copy utilization: min, max, and average | Card 0: min 0%, max 22%, avg 8%
    Memory & BAR1 | 10 | GPU Memory Details | Per-card memory information including UUID, index, and model |
    Memory & BAR1 | 11 | BAR1 Used | Used BAR1 memory on this card | 4 MB
    Memory & BAR1 | 12 | Memory Used | Used GPU memory on this card | 8.17 GB
    Memory & BAR1 | 13 | BAR1 Total | Total BAR1 memory capacity on this card | 32.8 GB
    GPU Process | 14 | GPU Process Details | Per-process information including Pod namespace and name |

View the GPUs - Application Pod Dimension dashboard

On the Prometheus Monitoring page, click the GPU Monitoring tab, then click the GPUs - Application Pod Dimension tab.

No. | Panel name | Description
1 | GPU Pod Details | A table of all Pods in the cluster that have requested GPU resources, showing namespace, Pod name, node name, allocated GPU memory, allocated computing power, and used GPU memory.

Two notes on the values in this panel:

  • Allocated GPU Memory: In shared GPU scheduling, a node can only report an integer value for total GPU memory to the API Server (rounded down from the actual value). For example, if a card has 31.7 GiB of GPU memory, the node reports 31 GiB to the API Server. If a Pod requests 10 GiB, the actual allocated memory is 31.7 × (10 / 31) ≈ 10.2 GiB.

  • Allocated Computing Power: Shown as a percentage for Pods that requested compute via aliyun.com/gpu-core.percentage. Pods that did not request computing power show - in this column. In this example, the Pod named tensorflow-benchmark-share-mem-core-k24gz shows 30%.
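
The rounding effect described in the first note can be reproduced with a few lines of Python. This is a minimal sketch of the arithmetic only; the function name actual_allocated_gib is hypothetical.

```python
import math


def actual_allocated_gib(card_gib, requested_gib):
    """Scale a request by the ratio of real to reported card memory.

    The node reports total GPU memory rounded down to an integer, so a
    request is a fraction of the reported total and is applied to the
    real total.
    """
    reported_gib = math.floor(card_gib)
    return card_gib * (requested_gib / reported_gib)


# Card with 31.7 GiB (reported as 31 GiB), Pod requests 10 GiB.
print(round(actual_allocated_gib(31.7, 10), 1))  # 10.2
```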

If all three test Pods appear in the GPU Pod Details table with non-zero allocated GPU memory, the Pod-level dashboard is capturing resources across all three scheduling modes correctly.

For advanced Pod-level metrics available at the bottom of the page, see GPUs - Application Pod Dimension.
