Machine learning and big data jobs require frequent communication between pods. The default Kubernetes scheduler spreads pods evenly across nodes—a sensible default for availability, but a poor fit for latency-sensitive workloads. When pods land on nodes in different zones or racks, inter-pod communication travels through multiple network hops, increasing job duration and network cost.
Topology-aware scheduling solves this by combining gang scheduling with topology constraints. Rather than placing pods wherever resources are available, kube-scheduler loops through topology domains—zones, racks, or individual nodes—until it finds one where all pods in the group can run together. If no single domain satisfies the requirements, the pods wait instead of scattering across suboptimal nodes.
This topic covers two scheduling scenarios:
- Loop through topology domains: Schedule all pods in a job to the same topology domain using gang scheduling labels and a topology-aware constraint annotation.
- Schedule to a deployment set: Reduce network latency by pinning pods to Elastic Compute Service (ECS) instances within the same ECS deployment set.
Why topology-aware scheduling instead of native Kubernetes affinity
Native Kubernetes node affinity and pod affinity do not loop through topology domains. When a job's first pod is scheduled to a node, subsequent pods use affinity rules to target the same zone or node. If that node cannot accommodate the remaining pods, they become pending—the scheduler does not automatically try a different topology domain. This means some pods can remain stuck even when another domain could satisfy the requirements.
Topology-aware scheduling in ACK addresses both limitations:
| Limitation | Native Kubernetes affinity | Topology-aware scheduling |
|---|---|---|
| Domain traversal | No: stuck on first domain | Yes: loops through all domains |
| Sub-zone granularity | Zone-level only | Node pool, rack, hostname, or custom key |
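To see the first limitation concretely, the following sketch (all names are hypothetical) uses native podAffinity to pin replicas to the zone of the first scheduled pod. If that zone cannot fit the remaining replicas, they stay Pending; kube-scheduler does not retry another zone:

```yaml
# Hypothetical example: native podAffinity pins every replica to the zone
# of the first scheduled pod. If that zone lacks capacity for the rest,
# those pods stay Pending instead of trying a different zone.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zone-affinity-job    # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: zone-affinity-job
  template:
    metadata:
      labels:
        app: zone-affinity-job
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: zone-affinity-job
            topologyKey: topology.kubernetes.io/zone
      containers:
      - name: worker
        image: nginx:1.25    # placeholder workload
```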
Loop through topology domains during pod scheduling
Gang scheduling ensures that kube-scheduler fulfills the resource requests of all pods in a job at the same time. Combined with a topology-aware constraint, the scheduler evaluates each topology domain in turn until it finds one that can accommodate the entire pod group.
Step 1: Add gang scheduling labels
Add the following labels to the pod spec. For more information about gang scheduling, see Work with gang scheduling.
labels:
pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu # Name of the PodGroup
pod-group.scheduling.sigs.k8s.io/min-available: "3" # Total number of pods in the job
Step 2: Add the topology-aware constraint
Add the following annotation to the pod spec:
annotations:
alibabacloud.com/topology-aware-constraint: '{"name":"test","required":{"topologies":[{"key":"kubernetes.io/hostname"}],"nodeSelectors":[{"matchLabels":{"test":"abc"}}]}}'
The value must be a valid JSON string with this structure:
| Field | Description |
|---|---|
| name | A custom name for this constraint |
| required.topologies[].key | The topology domain key for affinity scheduling (for example, kubernetes.io/hostname) |
| required.nodeSelectors[].matchLabels | Label selector to restrict scheduling to specific nodes. Follows the Kubernetes LabelSelector format. |
| required.nodeSelectors[].matchExpressions | Expression-based label selector. Follows the Kubernetes LabelSelector format. |
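Putting Step 1 and Step 2 together, a complete manifest might look like the following sketch. The tf-smoke-gpu PodGroup name and the test: abc node label reuse the values from the snippets above; the workload image is a placeholder:

```yaml
# Sketch combining the gang scheduling labels (Step 1) with the
# topology-aware constraint annotation (Step 2) in one pod template.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-smoke-gpu
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tf-smoke-gpu
  template:
    metadata:
      labels:
        app: tf-smoke-gpu
        pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
        pod-group.scheduling.sigs.k8s.io/min-available: "3"
      annotations:
        alibabacloud.com/topology-aware-constraint: '{"name":"test","required":{"topologies":[{"key":"kubernetes.io/hostname"}],"nodeSelectors":[{"matchLabels":{"test":"abc"}}]}}'
    spec:
      containers:
      - name: worker
        image: nginx:1.25    # placeholder workload
```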
Verify the result
After applying the configuration, run the following command to confirm all pods were scheduled to nodes matching the constraint:
kubectl get pod -o json | jq '.items[] | {"name": .metadata.name, "ann": .metadata.annotations["alibabacloud.com/topology-aware-constraint"], "node": .spec.nodeName}'
Expected output: all pods show the same node or topology domain:
{
  "name": "nginx-deployment-basic-69f47fc6db-6****",
  "ann": "{\"name\":\"test\",\"required\":{\"topologies\":[{\"key\":\"kubernetes.io/hostname\"}],\"nodeSelectors\":[{\"matchLabels\":{\"test\":\"abc\"}}]}}",
  "node": "cn-shenzhen.10.0.2.4"
}
{
  "name": "nginx-deployment-basic-69f47fc6db-h****",
  "ann": "{\"name\":\"test\",\"required\":{\"topologies\":[{\"key\":\"kubernetes.io/hostname\"}],\"nodeSelectors\":[{\"matchLabels\":{\"test\":\"abc\"}}]}}",
  "node": "cn-shenzhen.10.0.2.4"
}
Schedule pods to the same deployment set
ECS allows you to create deployment sets that reduce network latency among the ECS instances they contain. For more information about how to use deployment sets in ACK, see Best practices for associating deployment sets with node pools.
Step 1: Create a node pool associated with a deployment set
Create a node pool that is associated with an ECS deployment set and add a custom node label to identify it. The following figure shows where to add the custom label in the ACK console.
Step 2: Add gang scheduling labels
Add the following labels to the pod spec. For more information about gang scheduling, see Work with gang scheduling.
labels:
pod-group.scheduling.sigs.k8s.io/name: <podgroup-name> # Name of the PodGroup
pod-group.scheduling.sigs.k8s.io/min-available: "<n>" # Total number of pods in the job
Step 3: Add the topology-aware constraint
Add the following annotation to the pod spec.
Replace the matchLabels value with the custom node label that you added to the node pool in Step 1, and set name to a custom name for this constraint.
annotations:
alibabacloud.com/topology-aware-constraint: '{"name":"test","required":{"topologies":[{"key":"alibabacloud.com/nodepool-id"}],"nodeSelectors":[{"matchLabels":{"np-type":"low-latency"}}]}}'
The topology key alibabacloud.com/nodepool-id scopes scheduling to the node pool associated with the deployment set. Combined with the node selector, kube-scheduler places all pods on nodes within that node pool—and therefore within the same deployment set.
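For reference, the pod template metadata for this scenario could look like the following sketch, assuming the custom node label from Step 1 is np-type: low-latency and the job runs three pods (both placeholder values):

```yaml
# Sketch: pod template metadata for the deployment-set scenario.
# low-latency-job, np-type: low-latency, and the pod count are placeholders.
template:
  metadata:
    labels:
      pod-group.scheduling.sigs.k8s.io/name: low-latency-job
      pod-group.scheduling.sigs.k8s.io/min-available: "3"
    annotations:
      alibabacloud.com/topology-aware-constraint: '{"name":"test","required":{"topologies":[{"key":"alibabacloud.com/nodepool-id"}],"nodeSelectors":[{"matchLabels":{"np-type":"low-latency"}}]}}'
```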