In an ACS cluster that runs on GPU devices, you can schedule GPU-accelerated pods to the same GPU-HPN node, and the pods can communicate with each other through methods such as NVLink. To ensure communication efficiency and fairness between GPU devices, ACS enforces the partition constraints of each GPU model when it schedules devices. This topic describes the GPU partition scheduling mechanism of ACS and its usage scenarios.
Prerequisites
This feature applies only to pods of the gpu-hpn compute class and their corresponding nodes.
Background information
GPU devices communicate through one or more channels. ACS also allows pods with different GPU specifications to run on the same GPU-HPN node. To ensure GPU communication efficiency and fairness and avoid inter-pod interference, ACS schedules pods based on the GPU topology. To accomplish this, ACS divides the GPU topology into multiple partitions based on the number of GPUs requested by each pod.
In the following figure, the node has eight GPUs. Each group contains four GPUs. The GPUs in each group are interconnected, and the groups are connected through PCIe.

The following table describes the partitions created based on different GPU specifications.
| Number of GPUs requested by a pod | Optional device allocation results |
| --- | --- |
| 8 | [0,1,2,3,4,5,6,7] |
| 4 | [0,1,2,3], [4,5,6,7] |
| 2 | [0,1], [2,3], [4,5], [6,7] |
| 1 | [0], [1], [2], [3], [4], [5], [6], [7] |
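The partition layout in the table above follows an aligned, equal-size split of the device numbers. The following is a minimal sketch of that rule, not the ACS implementation; real layouts are model-specific, as the two-GPU pairs of the P16EN model later in this topic show:

```python
def partitions(total_gpus: int, request: int) -> list[list[int]]:
    """Illustrative sketch: split a node's GPUs into equal, aligned
    partitions of the requested size (request must divide total_gpus)."""
    assert total_gpus % request == 0
    return [list(range(i, i + request)) for i in range(0, total_gpus, request)]

# For the 8-GPU node described above:
print(partitions(8, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(partitions(8, 2))  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```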
After pods are repeatedly created and deleted, partition fragments may accumulate on GPU devices. As a result, new pods may fail to be scheduled and remain in the Pending state. You can check the scheduling results of existing pods, weigh the priorities of your workloads, and evict certain pods to free up resources for the pending pods.
Query partitions on a GPU-HPN node
Partitions on GPU-HPN nodes of different models may vary.
gpu.p16en-16XL
The node has 16 GPUs of the P16EN model. The following table describes how the 16 GPUs can be allocated to pods with different GPU specifications.
| Number of GPUs requested by a pod | Optional device allocation results |
| --- | --- |
| 16 | [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15] |
| 8 | [0,1,2,3,4,5,6,7], [8,9,10,11,12,13,14,15] |
| 4 | [0,1,2,3], [4,5,6,7], [8,9,10,11], [12,13,14,15] |
| 2 | [0,3], [1,2], [4,7], [5,6], [8,11], [9,10], [12,15], [13,14] |
| 1 | [0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15] |
Query pod scheduling results
GPU allocation results
You can view the GPU allocation results of GPU-HPN pods in the pod annotations. The following code block shows the format:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    alibabacloud.com/device-allocation: '{"gpus": {"minor": [0,1,2,3]}}'
Notifications for pod scheduling failures due to partition fragments
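To read this allocation programmatically, you can parse the annotation value, which is a JSON string. A minimal sketch in Python, where the pod object is a plain dict standing in for a pod retrieved from the Kubernetes API:

```python
import json

def allocated_gpus(pod: dict) -> list[int]:
    # The annotation value is a JSON string such as
    # '{"gpus": {"minor": [0,1,2,3]}}'; "minor" lists the device numbers.
    raw = pod["metadata"]["annotations"]["alibabacloud.com/device-allocation"]
    return json.loads(raw)["gpus"]["minor"]

pod = {"metadata": {"annotations": {
    "alibabacloud.com/device-allocation": '{"gpus": {"minor": [0,1,2,3]}}'}}}
print(allocated_gpus(pod))  # [0, 1, 2, 3]
```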
If a pod remains in the Pending state, the pod cannot be scheduled. In this case, run the kubectl describe pod command. If the output contains a message such as 0/5 nodes are available: xxx, and the message includes Insufficient Partitioned GPU Devices, pod scheduling failed due to partition fragments. Example:
kubectl describe pod pod-demo
Expected output (other content omitted):
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 26m default-scheduler 0/5 nodes are available: 2 Node(s) Insufficient Partitioned GPU Devices, 1 Node(s) xxx, 2 Node(s) xxx.
FAQ
How do I plan node resources and scheduling policies to avoid partition fragments?
Set different group labels on nodes based on the number of GPUs that application pods request. For example, you can schedule pods that request eight GPUs and pods that request one GPU to different node groups.
When pods become pending due to partition fragments, you can use the descheduling mechanism to evict lower-priority pods and free resources for the pending pods.
If the cluster has only a few nodes, you cannot plan group labels, or applications request a wide range of GPU specifications, we recommend that you use GPU Pod capacity reservation to meet application resource requirements.
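As a sketch of the group-label approach, the following pod spec pins an eight-GPU workload to nodes labeled for eight-GPU jobs. The label key, image name, and GPU resource name are illustrative, not ACS conventions; substitute the names used in your cluster:

```yaml
# Hypothetical example: nodes are pre-labeled
# node.example.com/gpu-group=8gpu, and eight-GPU pods select only that group,
# so one-GPU pods never fragment these nodes' partitions.
apiVersion: v1
kind: Pod
metadata:
  name: train-8gpu
spec:
  nodeSelector:
    node.example.com/gpu-group: "8gpu"
  containers:
  - name: main
    image: your-training-image
    resources:
      limits:
        nvidia.com/gpu: 8   # illustrative resource name
```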
How do I select pods for eviction when resolving partition fragments?
1. Determine the resource specifications of the pending pod, such as eight GPUs.
2. Check the annotations of the pods on the target node and view the device allocation results in the alibabacloud.com/device-allocation annotation.
3. Determine which pods to evict based on the allocation results. Make sure that the devices freed by the eviction meet the resource requirements and partition constraints of the pending pod. For example, an eight-GPU request on a P16EN node requires that devices [0,1,2,3,4,5,6,7] or [8,9,10,11,12,13,14,15] are all unallocated.
4. Evict the selected pods by using commands such as evict or delete.
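The selection step above can be sketched as a small search: given each pod's allocated GPUs on the node, find the smallest set of pods whose eviction frees a full target partition. This is an illustrative sketch, not an ACS tool; the pod names and layout are made up:

```python
from itertools import combinations

def pods_to_evict(allocations: dict[str, set[int]],
                  free: set[int],
                  target_partitions: list[set[int]]) -> list[str]:
    """Return the smallest set of pods whose eviction frees a full partition."""
    for k in range(len(allocations) + 1):          # prefer evicting fewer pods
        for victims in combinations(allocations, k):
            freed = set(free)
            for v in victims:
                freed |= allocations[v]
            if any(p <= freed for p in target_partitions):
                return list(victims)
    return []                                      # unreachable if a full node can be freed

# Hypothetical node state: pod-c blocks the [8..15] partition.
allocations = {"pod-a": {0, 1, 2, 3}, "pod-b": {4, 5}, "pod-c": {8, 9}}
free = {6, 7, 10, 11, 12, 13, 14, 15}
targets = [set(range(0, 8)), set(range(8, 16))]
print(pods_to_evict(allocations, free, targets))  # ['pod-c']
```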
What are the considerations for partitions if I use a custom scheduler?
After a custom scheduler schedules pods to nodes, ACS allocates the corresponding GPUs to the pods. During GPU allocation, ACS attempts to allocate devices for all pods in a single pass to avoid creating partition fragments.
A custom scheduler sees only the overall GPU capacity of a node. We recommend that you use the MostAllocated policy when you schedule GPU resources. This helps reduce partition fragments.
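As a sketch, a custom scheduler built on kube-scheduler can enable MostAllocated scoring through the NodeResourcesFit plugin, which packs GPU pods onto fewer nodes. The scheduler name and GPU resource name below are illustrative; use the names exposed in your cluster:

```yaml
# Hypothetical kube-scheduler profile that scores nodes with MostAllocated
# so GPU pods are packed densely, reducing partition fragments.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: custom-scheduler    # illustrative name
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated
        resources:
        - name: nvidia.com/gpu       # illustrative resource name
          weight: 1
```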
How do different schedulers handle topology awareness for ACS GPU HPN?
| Scheduler | Condition | Description |
| --- | --- | --- |
| ACS default scheduler | All of the following conditions are met: | The scheduler is aware of the partition allocation on the current node. Nodes that do not meet the partition requirements are excluded from scheduling. The cause Insufficient Partitioned GPU Devices is displayed in the pod scheduling failure event. |
| ACK scheduler | All of the following conditions are met: | The scheduler is unaware of the partition topology. The GPU-HPN node attempts to allocate GPUs in a centralized manner. If the partition requirements are not met, the pod remains in the Pending state until they are met. The corresponding message displays Insufficient Partitioned GPU Devices. For more information, see How do I plan node resources and scheduling policies to avoid partition fragments? |
| Custom scheduler | Any of the following conditions is met: | Same as the ACK scheduler. |