Best practices for monitoring GPU resources - Container Service for Kubernetes

GPU monitoring uses NVIDIA Data Center GPU Manager (DCGM) to monitor the GPU Nodes in your cluster. This document shows you how to view monitoring results for applications that request GPU resources in three different ways.

Prerequisites

You have created an ACK managed cluster.
You have enabled GPU monitoring for the cluster.
You have installed the GPU sharing component.

Background information

GPU monitoring comprehensively monitors the GPU Nodes in your cluster and provides dashboards at the Cluster, Node, and Pod levels. For more details, see Dashboard description.

The Cluster-level GPU Monitoring Dashboard shows information for the entire cluster or a specific Node Pool, such as cluster-wide utilization, GPU memory usage, and XID error detection.
The Node-level GPU Monitoring Dashboard shows node-specific information, such as GPU details, utilization, and GPU memory usage for a particular Node.
The Pod-level GPU Monitoring Dashboard shows Pod-specific information, such as the GPU resources requested by a Pod and its utilization.

This document uses the following example workflow to show how different GPU request methods affect monitoring results.

Important notes

GPU monitoring metrics are collected at 15-second intervals, so data on the Grafana dashboard can lag slightly. As a result, the dashboard might show no available GPU memory on a Node even though a Pod is successfully scheduled to it. This can happen when a Pod completes its task and releases its GPU resources between two scrapes, which allows the scheduler to place a pending Pod on that Node before the next metric update.
The Monitoring Dashboard only monitors GPU resources requested through resources.limits in a Pod. For more information, see Resource Management for Pods and Containers.
The data on the Monitoring Dashboard may be inaccurate if you use GPU resources in the following ways:
- Run a GPU application directly on a Node.
- Run a GPU application in a container started directly with the docker run command.
- Request GPU resources for a Pod by setting the NVIDIA_VISIBLE_DEVICES=all or NVIDIA_VISIBLE_DEVICES=<GPU ID> environment variable directly in the Pod's env section and running a GPU program.
- Configure privileged: true in a Pod's securityContext and run a GPU program.
- Run a GPU program in a Pod where the NVIDIA_VISIBLE_DEVICES environment variable is not set, but the container image used by the Pod has NVIDIA_VISIBLE_DEVICES=all configured by default.
The allocated GPU memory and the used GPU memory are not always the same. For example, suppose a GPU card has 16 GiB of total GPU memory and you allocate 5 GiB of it to a Pod whose startup command is sleep 1000. The Pod is in the Running state but does not use the GPU for 1000 seconds. As a result, 5 GiB of GPU memory is allocated, but the used GPU memory is 0 GiB.
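Some of the conditions listed above can be checked mechanically before a workload is deployed. The following is a minimal sketch, assuming the Pod spec is already loaded as a Python dictionary. The function names and the `GPU_RESOURCES` tuple are hypothetical helpers for illustration, not part of ACK. The sketch cannot detect an image that sets NVIDIA_VISIBLE_DEVICES by default, or GPU programs started outside Kubernetes.

```python
# Hypothetical helpers: flag Pod spec patterns that the notes above describe
# as producing inaccurate data on the GPU monitoring dashboards.

# Resource names used in this document's examples (assumed, not exhaustive).
GPU_RESOURCES = (
    "nvidia.com/gpu",
    "aliyun.com/gpu-mem",
    "aliyun.com/gpu-core.percentage",
)


def requests_gpu_via_limits(container):
    """Return True if the container requests a GPU resource in resources.limits."""
    limits = (container.get("resources") or {}).get("limits") or {}
    return any(name in limits for name in GPU_RESOURCES)


def find_monitoring_blind_spots(pod_spec):
    """Return one warning string per risky setting found in a Pod spec dict."""
    warnings = []
    for container in pod_spec.get("containers") or []:
        name = container.get("name", "<unnamed>")

        # GPU access granted through the environment variable, not limits.
        env_names = {item.get("name") for item in container.get("env") or []}
        if "NVIDIA_VISIBLE_DEVICES" in env_names:
            warnings.append(f"{name}: sets NVIDIA_VISIBLE_DEVICES in env")

        # Privileged containers can reach GPUs without requesting them.
        security = container.get("securityContext") or {}
        if security.get("privileged") is True:
            warnings.append(f"{name}: runs with privileged: true")
    return warnings


if __name__ == "__main__":
    spec = {
        "containers": [
            {
                "name": "demo",
                "env": [{"name": "NVIDIA_VISIBLE_DEVICES", "value": "all"}],
            }
        ]
    }
    for warning in find_monitoring_blind_spots(spec):
        print(warning)
```

A container for which `requests_gpu_via_limits` returns True and `find_monitoring_blind_spots` returns no warnings matches the request model that the dashboards are designed to monitor.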

Step 1: Create node pools

The GPU Monitoring Dashboard displays metrics for Pods that request GPU resources either as a full card or by a specific amount of GPU memory, optionally with computing power.

This example creates three Node Pools in a cluster to demonstrate Pod scheduling and resource usage for different GPU request models. For detailed instructions on creating a Node Pool, see Create a node pool. The configurations for the Node Pools are as follows:

Configuration item	Description	Example value
Node Pool Name	Name for the first Node Pool.	exclusive
Node Pool Name	Name for the second Node Pool.	share-mem
Node Pool Name	Name for the third Node Pool.	share-mem-core
Instance Type	The instance type for the nodes. This example uses a TensorFlow Benchmark project that requires 10 GiB of GPU memory, so the instance type must provide more than 10 GiB of GPU memory.	ecs.gn7i-c16g1.4xlarge
Expected Node Count	The total number of nodes that each Node Pool should maintain.	1
Node Labels	No label is added to the first Node Pool. GPU resources are requested as a full card.	None
Node Labels	The label added to the second Node Pool. This indicates that GPU resources are requested by GPU memory.	ack.node.gpu.schedule=cgpu
Node Labels	The label added to the third Node Pool. This indicates that GPU resources are requested by GPU memory and that computing power requests are supported.	ack.node.gpu.schedule=core_mem

Step 2: Deploy GPU applications

After creating the Node Pools, run GPU test Jobs on the Nodes to verify that GPU metrics are collected correctly. For information about the labels and scheduling relationships required for each Job, see GPU node types and scheduling labels. The configurations for the three Jobs are as follows:

Job name	Node Pool	GPU resource request	Description
`tensorflow-benchmark-exclusive`	exclusive	`nvidia.com/gpu: 1`	Requests 1 full GPU card.
`tensorflow-benchmark-share-mem`	share-mem	`aliyun.com/gpu-mem: 10`	Requests 10 GiB of GPU memory.
`tensorflow-benchmark-share-mem-core`	share-mem-core	`aliyun.com/gpu-mem: 10` `aliyun.com/gpu-core.percentage: 30`	Requests 10 GiB of GPU memory and 30% of the computing power of one GPU card.

Create the Job manifest files.

Create a file named tensorflow-benchmark-exclusive.yaml with the following YAML content.

apiVersion: batch/v1
kind: Job
metadata:
  name: tensorflow-benchmark-exclusive
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        app: tensorflow-benchmark-exclusive
    spec:
      containers:
      - name: tensorflow-benchmark
        image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.2.3
        command:
        - bash
        - run.sh
        - --num_batches=5000000
        - --batch_size=8
        resources:
          limits:
            nvidia.com/gpu: 1  # Request one full GPU card.
        workingDir: /root
      restartPolicy: Never

Create a file named tensorflow-benchmark-share-mem.yaml with the following YAML content.

apiVersion: batch/v1
kind: Job
metadata:
  name: tensorflow-benchmark-share-mem
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        app: tensorflow-benchmark-share-mem
    spec:
      containers:
      - name: tensorflow-benchmark
        image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.2.3
        command:
        - bash
        - run.sh
        - --num_batches=5000000
        - --batch_size=8
        resources:
          limits:
            aliyun.com/gpu-mem: 10  # Request 10 GiB of GPU memory.
        workingDir: /root
      restartPolicy: Never

Create a file named tensorflow-benchmark-share-mem-core.yaml with the following YAML content.

apiVersion: batch/v1
kind: Job
metadata:
  name: tensorflow-benchmark-share-mem-core
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        app: tensorflow-benchmark-share-mem-core
    spec:
      containers:
      - name: tensorflow-benchmark
        image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.2.3
        command:
        - bash
        - run.sh
        - --num_batches=5000000
        - --batch_size=8
        resources:
          limits:
            aliyun.com/gpu-mem: 10  # Request 10 GiB of GPU memory.
            aliyun.com/gpu-core.percentage: 30  # Request 30% of the computing power of one GPU card.
        workingDir: /root
      restartPolicy: Never
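
The three manifests above differ only in the Job name and the entries under resources.limits. As an optional alternative to maintaining three files by hand, the following sketch builds the same structure from a single template. The function name `build_benchmark_job` and the `JOBS` mapping are hypothetical names introduced for illustration. The output is JSON, which kubectl apply -f accepts in the same way as YAML.

```python
import json

# Image and command copied from the manifests above.
IMAGE = (
    "registry.cn-beijing.aliyuncs.com/ai-samples/"
    "gpushare-sample:benchmark-tensorflow-2.2.3"
)
COMMAND = ["bash", "run.sh", "--num_batches=5000000", "--batch_size=8"]

# Job name -> resources.limits, as listed in the Job table above.
JOBS = {
    "tensorflow-benchmark-exclusive": {"nvidia.com/gpu": 1},
    "tensorflow-benchmark-share-mem": {"aliyun.com/gpu-mem": 10},
    "tensorflow-benchmark-share-mem-core": {
        "aliyun.com/gpu-mem": 10,
        "aliyun.com/gpu-core.percentage": 30,
    },
}


def build_benchmark_job(name, gpu_limits):
    """Return a Job manifest (as a dict) with the given name and GPU limits."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "parallelism": 1,
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [
                        {
                            "name": "tensorflow-benchmark",
                            "image": IMAGE,
                            "command": list(COMMAND),
                            "resources": {"limits": dict(gpu_limits)},
                            "workingDir": "/root",
                        }
                    ],
                    "restartPolicy": "Never",
                },
            },
        },
    }


if __name__ == "__main__":
    # Print one manifest; redirect the output to a .json file to apply it.
    name = "tensorflow-benchmark-share-mem-core"
    print(json.dumps(build_benchmark_job(name, JOBS[name]), indent=2))
```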

Run the following commands to deploy the Jobs.

kubectl apply -f tensorflow-benchmark-exclusive.yaml
kubectl apply -f tensorflow-benchmark-share-mem.yaml
kubectl apply -f tensorflow-benchmark-share-mem-core.yaml

Run the following command to check the status of the Pods.

kubectl get pod

Expected output:

NAME                                        READY   STATUS    RESTARTS   AGE
tensorflow-benchmark-exclusive-7dff2        1/1     Running   0          3m13s
tensorflow-benchmark-share-mem-core-k24gz   1/1     Running   0          4m22s
tensorflow-benchmark-share-mem-shmpj        1/1     Running   0          3m46s

The output shows that all Pods are in the Running state, which means the Jobs were deployed successfully.
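
If you want to script this check instead of reading the table, the following sketch parses the default table output of kubectl get pod and confirms that every Pod reports Running. The function names are hypothetical, and the parser assumes the default column layout shown above, in which NAME and STATUS contain no spaces.

```python
def parse_pod_status(table_text):
    """Map Pod name to STATUS from default `kubectl get pod` table output."""
    lines = [line for line in table_text.strip().splitlines() if line.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    if "NAME" not in header or "STATUS" not in header:
        return {}
    name_idx = header.index("NAME")
    status_idx = header.index("STATUS")

    statuses = {}
    for line in lines[1:]:
        fields = line.split()
        # Skip rows that are too short to contain both columns.
        if len(fields) > max(name_idx, status_idx):
            statuses[fields[name_idx]] = fields[status_idx]
    return statuses


def all_running(table_text):
    """Return True if at least one Pod is listed and all are Running."""
    statuses = parse_pod_status(table_text)
    return bool(statuses) and all(s == "Running" for s in statuses.values())


if __name__ == "__main__":
    sample = """\
NAME                                        READY   STATUS    RESTARTS   AGE
tensorflow-benchmark-exclusive-7dff2        1/1     Running   0          3m13s
tensorflow-benchmark-share-mem-core-k24gz   1/1     Running   0          4m22s
tensorflow-benchmark-share-mem-shmpj        1/1     Running   0          3m46s
"""
    print(all_running(sample))
```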

Step 3: View the GPU monitoring dashboard

View the GPUs - Cluster Dimension dashboard

Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the name of the target cluster. In the left navigation pane, choose Operations > Prometheus Monitoring.

On the Prometheus monitoring page, click the GPU Monitoring tab, then click the GPUs - Cluster Dimension tab. The information on the cluster-level Monitoring Dashboard is described below. For more information, see Cluster dimension monitoring dashboard.

No.	Panel name	Description
1	Total GPU Nodes	The cluster has 3 GPU nodes.
2	Allocated GPUs	1.9 of 3 total GPUs are allocated. Note: If a Pod requests a GPU as a full card, the allocation for one card is 1. For shared GPU scheduling, the allocation is the ratio of the allocated GPU memory to the total GPU memory for that card.
3	Allocated GPU Memory	63.0% of the total GPU memory is allocated.
4	Used GPU Memory	35.5% of the total GPU memory is used.
5	Average GPU Utilization	The average utilization rate across all cards is 74%.
6	GPU Memory Copy Utilization	The average memory copy utilization rate across all cards is 43.7%.
7	GPU Node Details	Information about the GPU Nodes in the cluster, including node name, GPU card index, GPU utilization, and memory controller utilization.
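
The Allocated GPUs value in the table can be reproduced with a short calculation that follows the rule in the note: a full-card request counts as 1, and a shared request counts as allocated memory divided by total memory. The sketch below is illustrative only. The function name `allocated_gpus` is hypothetical, and the 22 GiB total per card is an assumption inferred from the 45.5% allocation that the node-level example in this document reports for a 10 GiB request.

```python
def allocated_gpus(cards):
    """Sum the per-card allocation for a list of card descriptions.

    Each card is a dict with "exclusive" (bool). Shared cards also carry
    "allocated_gib" and "total_gib".
    """
    total = 0.0
    for card in cards:
        if card["exclusive"]:
            # A full-card request always counts as one GPU.
            total += 1.0
        else:
            # Shared scheduling counts the allocated fraction of the card.
            total += card["allocated_gib"] / card["total_gib"]
    return total


# One card per Node Pool in this example; 22 GiB per card is an assumption.
cards = [
    {"exclusive": True},
    {"exclusive": False, "allocated_gib": 10, "total_gib": 22},
    {"exclusive": False, "allocated_gib": 10, "total_gib": 22},
]

print(round(allocated_gpus(cards), 1))  # 1.9
```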

View the GPUs - Nodes dashboard

On the Prometheus monitoring page, click the GPU Monitoring tab, then click the GPUs - Nodes tab. Select the target Node from the GPUNode dropdown list. This example uses cn-hangzhou.10.166.154.xxx. The information on the node-level Monitoring Dashboard is described below:

Panel group	No.	Panel name	Description
Overview	1	GPU Mode	The GPU mode is shared mode, where Pods request GPU resources by GPU memory and computing power.
Overview	2	NVIDIA Driver Version	The installed GPU driver version is 535.161.07.
Overview	3	Allocated GPUs	The total number of GPUs is 1, and the number of allocated GPUs is 0.45.
Overview	4	GPU Utilization	The average GPU utilization rate is 26%.
Overview	5	Allocated GPU Memory	The allocated GPU memory is 45.5% of the total GPU memory.
Overview	6	Used GPU Memory	The currently used GPU memory is 36.4% of the total GPU memory.
Overview	7	Allocated Computing Power	30% of the computing power of GPU card 0 is allocated. Note: The Allocated Computing Power panel displays data only if computing power allocation is enabled on the Node. Therefore, among the three nodes in this example, only the Node with the `ack.node.gpu.schedule=core_mem` label shows data for this metric.
Utilization	8	GPU Utilization	For GPU card 0, the minimum utilization is 0%, the maximum is 33%, and the average is 12%.
Utilization	9	Memory Copy Utilization	For GPU card 0, the minimum memory copy utilization is 0%, the maximum is 22%, and the average is 8%.
Memory&BAR1	10	GPU Memory Details	Details about GPU memory, including the GPU card's UUID, index number, and model.
Memory&BAR1	11	BAR1 Used	The used BAR1 memory is 4 MB.
Memory&BAR1	12	Memory Used	The used GPU memory on the card is 8.17 GB.
Memory&BAR1	13	BAR1 Total	The total BAR1 memory is 32.8 GB.
GPU Process	14	GPU Process Details	Details about each GPU process, including its Pod namespace and name.

You can also view more advanced metrics at the bottom of the page. For details, see Node-level monitoring dashboard.

View the GPUs - Application Pod Dimension dashboard

On the Prometheus monitoring page, click the GPU Monitoring tab, then click the GPUs - Application Pod Dimension tab. The information on the Pod-level Monitoring Dashboard is described below:

No.	Panel name	Description
1	GPU Pod Details	Information about Pods in the cluster that have requested GPU resources, including each Pod's namespace, name, Node name, and used GPU memory. Note: Allocated GPU Memory is the GPU memory allocated to the Pod. In shared GPU scheduling, a Node can report only an integer value for total GPU memory to the API Server, so it rounds the actual value down. For example, if a card has 31.7 GiB of GPU memory, the Node reports 31 GiB to the API Server. If a Pod requests 10 GiB, the GPU memory actually allocated to the Pod is 31.7 × (10 / 31) ≈ 10.2 GiB. Allocated Computing Power is the computing power allocated to the Pod. If a Pod does not request computing power, this value is "-". The figure shows that the Pod named `tensorflow-benchmark-share-mem-core-k24gz` requested 30% of the computing power.
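
The rounding effect described in the Allocated GPU Memory note can be written as a small calculation. This is a sketch of the arithmetic in the note only, not of the scheduler's implementation, and the function name `actual_allocated_gib` is hypothetical.

```python
import math


def actual_allocated_gib(card_total_gib, requested_gib):
    """Return the GPU memory actually allocated for a shared-GPU request.

    The Node reports the card's total memory rounded down to an integer,
    so each requested GiB maps to slightly more than 1 GiB of real memory.
    """
    reported_gib = math.floor(card_total_gib)
    if reported_gib <= 0:
        raise ValueError("card must report at least 1 GiB of GPU memory")
    return card_total_gib * (requested_gib / reported_gib)


# Values from the note: a 31.7 GiB card and a 10 GiB request.
print(round(actual_allocated_gib(31.7, 10), 1))  # 10.2
```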

You can also view more advanced metrics at the bottom of the page. For details, see GPUs - Application Pod Dimension.