
Container Compute Service:Topology-aware scheduling for GPU-HPN nodes

Last Updated:Jul 15, 2025

In an ACS cluster that runs on GPU devices, you can schedule GPU-accelerated pods to the same GPU-HPN node, where the pods can communicate with each other through methods such as NVLink. To ensure communication efficiency and fairness between GPU devices, ACS conforms to the partition constraints of each GPU model when it allocates devices. This topic describes the GPU partition scheduling mechanism of ACS and its usage scenarios.

Prerequisites

This feature applies only to pods of the gpu-hpn compute class and their corresponding nodes.

Background information

GPU devices communicate through one or more channels. ACS also allows pods with different GPU specifications to run on the same GPU-HPN node. To ensure GPU communication efficiency and fairness and avoid inter-pod interference, ACS schedules pods based on the GPU topology. To accomplish this, ACS divides the GPU topology into multiple partitions based on the number of GPUs requested by each pod.

In the following figure, the node has eight GPUs that are divided into two groups of four. The GPUs in each group are interconnected, and the two groups are connected through PCIe.

(Figure: topology of an eight-GPU node with two four-GPU groups connected through PCIe)

The following table describes the partitions created based on different GPU specifications.

Number of GPUs requested by a pod    Optional device allocation results
8                                    [0,1,2,3,4,5,6,7]
4                                    [0,1,2,3], [4,5,6,7]
2                                    [0,1], [2,3], [4,5], [6,7]
1                                    [0], [1], [2], [3], [4], [5], [6], [7]
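The contiguous partitioning shown above can be sketched with a small helper. This is an illustration of the layout in the table, not ACS's actual implementation; real models can use non-contiguous pairings, as the P16EN example later in this topic shows.

```python
def partitions(total_gpus: int, requested: int) -> list[list[int]]:
    """Candidate device partitions for a pod that requests `requested` GPUs
    on a node with `total_gpus` devices, assuming the simple contiguous
    block layout from the table above."""
    if total_gpus % requested != 0:
        raise ValueError("request size does not evenly divide the node")
    return [list(range(i, i + requested))
            for i in range(0, total_gpus, requested)]

print(partitions(8, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(partitions(8, 2))  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```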

After pods are repeatedly created and deleted, partition fragments may appear on the GPU devices of a node. As a result, new pods may fail to be scheduled and remain in the Pending state. You can check the scheduling results of existing pods and the priorities of your workloads, and then evict certain pods to free the resources that the pending pods require.
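The fragmentation problem can be illustrated with a short sketch (illustrative only, not ACS code). On an eight-GPU node with contiguous partitions, two 2-GPU pods holding [0,1] and [4,5] leave four GPUs free, yet a 4-GPU pod still cannot be scheduled because neither 4-GPU partition is fully free:

```python
def can_schedule(free: set[int], requested: int, total: int = 8) -> bool:
    """A k-GPU pod needs one whole aligned partition to be free."""
    blocks = [set(range(i, i + requested)) for i in range(0, total, requested)]
    return any(block <= free for block in blocks)

# Two 2-GPU pods occupy [0,1] and [4,5]; devices 2, 3, 6, 7 are free.
free = {2, 3, 6, 7}
print(can_schedule(free, 4))           # False: neither [0-3] nor [4-7] is free
# Evicting the pod on [0,1] frees the [0-3] partition.
print(can_schedule(free | {0, 1}, 4))  # True
```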

Query partitions on a GPU-HPN node

Partitions on GPU-HPN nodes of different models may vary.

gpu.p16en-16XL

The node has 16 GPUs of the P16EN model. The following table describes how the 16 GPUs can be allocated to pods with different GPU specifications.

Number of GPUs requested by a pod    Optional device allocation results
16                                   [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
8                                    [0,1,2,3,4,5,6,7], [8,9,10,11,12,13,14,15]
4                                    [0,1,2,3], [4,5,6,7], [8,9,10,11], [12,13,14,15]
2                                    [0,3], [1,2], [4,7], [5,6], [8,11], [9,10], [12,15], [13,14]
1                                    [0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15]
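Note that the 2-GPU partitions on this model are not contiguous (for example, [0,3] is valid but [0,1] is not). The following sketch transcribes the table above into a lookup you could use to validate an allocation; the structure is illustrative, not an ACS API:

```python
# Candidate partitions per request size on a gpu.p16en-16XL node,
# transcribed from the table above.
P16EN_PARTITIONS = {
    16: [list(range(16))],
    8: [list(range(0, 8)), list(range(8, 16))],
    4: [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]],
    2: [[0, 3], [1, 2], [4, 7], [5, 6], [8, 11], [9, 10], [12, 15], [13, 14]],
    1: [[i] for i in range(16)],
}

def valid_allocation(devices: list[int]) -> bool:
    """Check whether a device list matches one of the node's partitions."""
    candidates = P16EN_PARTITIONS.get(len(devices), [])
    return sorted(devices) in [sorted(p) for p in candidates]

print(valid_allocation([0, 3]))  # True: a valid 2-GPU pair on this model
print(valid_allocation([0, 1]))  # False: not a 2-GPU partition on this model
```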

Query pod scheduling results

GPU allocation results

You can view the GPU allocation results of GPU-HPN pods in the pod annotations. The following code block shows the format:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    alibabacloud.com/device-allocation: '{"gpus": {"minor": [0,1,2,3]}}'
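If you read pod metadata programmatically, the annotation value is a JSON string that must be parsed before use. A minimal sketch, assuming the annotation dictionary has already been retrieved from the Kubernetes API:

```python
import json

# Hypothetical annotations as returned for a GPU-HPN pod; only the
# device-allocation key is relevant here.
annotations = {
    "alibabacloud.com/device-allocation": '{"gpus": {"minor": [0,1,2,3]}}'
}

alloc = json.loads(annotations["alibabacloud.com/device-allocation"])
gpu_minors = alloc["gpus"]["minor"]  # device numbers assigned to the pod
print(gpu_minors)  # [0, 1, 2, 3]
```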

Notifications for pod scheduling failures due to partition fragments

If a pod is unschedulable, it remains in the Pending state. If you run the kubectl describe pod command in this case, a message such as 0/5 nodes are available: xxx is returned. The cause Insufficient Partitioned GPU Devices indicates that pod scheduling failed due to partition fragments. Example:

kubectl describe pod pod-demo

Expected output (other content omitted):

...
Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  26m    default-scheduler  0/5 nodes are available: 2 Node(s) Insufficient Partitioned GPU Devices, 1 Node(s) xxx, 2 Node(s) xxx.

FAQ

How do I plan node resources and scheduling policies to avoid partition fragments?

  • Set different group tags for nodes based on the number of GPUs requested by application pods to manage resources. For example, you can schedule pods that request eight GPUs and pods that request one GPU to different nodes.

  • When pending pods appear due to partition fragments, you can use the descheduling mechanism to evict pods with lower priorities to free resources for the pending pods.

  • If the node scale is small or you cannot plan group tags, and applications have various GPU specifications, we recommend that you use GPU Pod capacity reservation to meet application resource requirements.
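As a sketch of the first approach, you can constrain pods to node groups with a nodeSelector. The label key and values below (gpu-group, eight-gpu) are hypothetical examples, not an ACS convention; the GPU resource request is omitted for brevity:

```yaml
# Label nodes by the request size they serve, for example:
#   kubectl label node node-a gpu-group=eight-gpu
#   kubectl label node node-b gpu-group=one-gpu
apiVersion: v1
kind: Pod
metadata:
  name: train-demo
spec:
  # Pods that request eight GPUs land only on nodes reserved for them,
  # so one-GPU pods cannot fragment those nodes' partitions.
  nodeSelector:
    gpu-group: eight-gpu
  containers:
    - name: main
      image: registry.example.com/train:latest
```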

How do I select pods for eviction to resolve partition fragments?

  • Determine the resource specifications of the pending pod, such as eight GPUs.

  • Check the annotations of the pods on the target node and view the device allocation results in the alibabacloud.com/device-allocation annotation.

  • Determine which pods to evict based on the allocation results. Ensure that the devices freed by the eviction satisfy both the resource requirements and the partition constraints of the pending pod. For example, an eight-GPU request on a P16EN node requires that device numbers [0,1,2,3,4,5,6,7] or [8,9,10,11,12,13,14,15] are all unallocated.

  • Evict the selected pods, for example by deleting them with kubectl delete or by using the Kubernetes Eviction API.
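The selection step above can be sketched as follows. This is an illustrative helper over allocations read from pod annotations (the pod names and allocations are hypothetical), not an ACS tool:

```python
def pods_to_evict(pod_allocs: dict[str, list[int]],
                  target_partition: list[int]) -> list[str]:
    """Return the pods whose allocated devices overlap the partition
    that must be freed for the pending pod."""
    needed = set(target_partition)
    return [pod for pod, devs in pod_allocs.items() if needed & set(devs)]

# Hypothetical allocations read from pod annotations on a P16EN node:
allocs = {"pod-a": [0, 3], "pod-b": [4, 5, 6, 7], "pod-c": [8, 11]}
# To free the eight-GPU partition [0..7], every pod on devices 0-7 must go:
print(pods_to_evict(allocs, list(range(8))))  # ['pod-a', 'pod-b']
```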

What are the considerations for partitions if I use a custom scheduler?

After the custom scheduler binds pods to nodes, ACS allocates the corresponding GPUs to the pods. During GPU allocation, ACS attempts to allocate devices for all pods in one pass to avoid creating partition fragments.

A custom scheduler sees only the overall GPU capacity of a node, not the partition topology. We recommend that you use the MostAllocated policy when you schedule GPU resources. This packs pods onto fewer nodes and helps reduce partition fragments.
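The effect of a MostAllocated-style score can be sketched as follows. This mirrors the idea of kube-scheduler's MostAllocated scoring strategy (prefer nodes whose resources are already heavily used); the scoring formula and node data here are simplified illustrations, not the scheduler's exact implementation:

```python
def most_allocated_score(requested: int, allocated: int, capacity: int) -> float:
    """Higher score for nodes that would be more fully used after placement,
    so free GPUs concentrate on fewer nodes instead of spreading out."""
    used = allocated + requested
    return used / capacity if used <= capacity else -1.0

# (allocated GPUs, total GPUs) per node; both can fit a 2-GPU pod.
nodes = {"node-a": (6, 8), "node-b": (2, 8)}
best = max(nodes, key=lambda n: most_allocated_score(2, *nodes[n]))
print(best)  # node-a: filling it leaves node-b's partitions intact
```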

How do different schedulers handle topology awareness for ACS GPU HPN?

ACS default scheduler

Conditions (all of the following must be met):

  • The cluster is an ACS cluster.

  • The schedulerName of the pod is default-scheduler.

  • The Enable GPU-HPN Node Custom Tags And Scheduler option is not selected. For more information, see kube-scheduler.

Behavior: The scheduler is aware of the partition allocation on each node. Nodes that do not meet the partition requirements are excluded from scheduling. The cause "Insufficient Partitioned GPU Devices" is displayed in the pod scheduling failure event.

ACK scheduler

Conditions (all of the following must be met):

  • The cluster is an ACK managed cluster, an ACK One registered cluster, or an ACK One cluster for distributed Argo workflows.

  • The schedulerName of the pod is default-scheduler.

Behavior: The scheduler is unaware of the partition topology. The GPU-HPN node attempts to allocate GPUs in a centralized manner. If the partition requirements are not met, the pod remains in the Pending state until they are met, and the corresponding message displays "Insufficient Partitioned GPU Devices". For more information, see How do I plan node resources and scheduling policies to avoid partition fragments?

Custom scheduler

Conditions (any of the following is met):

  • The cluster is not an ACK managed cluster, an ACK One registered cluster, or an ACK One cluster for distributed Argo workflows.

  • The schedulerName of the pod is not default-scheduler.

Behavior: Same as the ACK scheduler.