In distributed AI training and big data workloads, pods communicate frequently. The default Kubernetes scheduler spreads pods evenly across nodes, so pods may end up on nodes separated by multiple switch hops—each additional hop increases latency and reduces throughput. Network topology-aware scheduling solves this by assigning all pods in a job to nodes within the same Layer 1 or Layer 2 forwarding domain, minimizing switch hops and accelerating job completion.
How it works
Network topology-aware scheduling uses a greedy algorithm to place each job on the set of Lingjun nodes with the smallest topological span.
ACK Lingjun clusters have a two-level network topology:
- Access Switch (ASW): the switch that Lingjun nodes connect to directly. Nodes under the same ASW require at least one hop to communicate.
- Point of Delivery (Pod): the upper-level topology that groups multiple ASWs. Nodes under different ASWs in the same Pod require at least two hops. In this topic, the capitalized term Pod refers to a Point of Delivery, not to a Kubernetes pod.
The fewer switch layers between nodes, the lower the latency. The scheduler tries to fit all pods within the lowest possible topology level first:
- 2-node job: assigned to nodes within the same ASW (for example, nodes A and B, or nodes E and F).
- 4-node job: assigned to nodes within the same Pod (for example, nodes A through D, or nodes E through H).
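The lowest-level-first preference described above can be sketched in Python. This is an illustrative model only, not the scheduler's actual implementation: the node inventory, the assumption of one free slot per node, and the function name `smallest_fitting_domain` are all hypothetical, and the sketch takes the first domain that fits rather than scoring candidates.

```python
from collections import defaultdict

# Hypothetical inventory: node -> (ASW id, Pod id). One free slot per node is assumed.
NODES = {
    "A": ("asw-1", "pod-1"),
    "B": ("asw-1", "pod-1"),
    "C": ("asw-2", "pod-1"),
    "D": ("asw-2", "pod-1"),
    "E": ("asw-3", "pod-2"),
    "F": ("asw-3", "pod-2"),
}


def smallest_fitting_domain(nodes, job_size):
    """Return (level, domain, chosen_nodes) for the lowest level that fits the job."""
    by_asw = defaultdict(list)
    by_pod = defaultdict(list)
    for name, (asw, pod) in sorted(nodes.items()):
        by_asw[asw].append(name)
        by_pod[pod].append(name)

    # Try the lowest level (ASW) first, then fall back to the Pod level.
    for level, groups in (("ASW", by_asw), ("Pod", by_pod)):
        for domain, members in sorted(groups.items()):
            if len(members) >= job_size:
                return level, domain, members[:job_size]
    return None  # no single ASW or Pod can hold the whole job


print(smallest_fitting_domain(NODES, 2))  # fits under one ASW
print(smallest_fitting_domain(NODES, 4))  # needs a whole Pod
```

In this toy inventory a 2-node job stays under one ASW, a 4-node job widens to one Pod, and a 5-node job finds no single domain at all.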
Node labels indicate which ASW and Pod a node belongs to:
- alibabacloud.com/asw-id: identifies the ASW.
- alibabacloud.com/point-of-delivery: identifies the Pod.
In ACK Lingjun clusters, the lingjun-networktopology-collector component automatically collects this information and applies these labels to Lingjun nodes. For other node types or cluster types, add the labels manually and make sure the label keys match the labelKey values in your ClusterNetworkTopology configuration.
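For clusters where the labels must be added by hand, a small script can generate the `kubectl label node` commands from a topology mapping. The sketch below only prints the commands and runs nothing; the node names and the ASW and Pod identifiers are hypothetical placeholders that you would replace with your real topology.

```python
import shlex

ASW_KEY = "alibabacloud.com/asw-id"
POD_KEY = "alibabacloud.com/point-of-delivery"

# Hypothetical mapping: node name -> (ASW id, Pod id). Replace with your real topology.
TOPOLOGY = {
    "node-a": ("asw-1", "pod-1"),
    "node-b": ("asw-1", "pod-1"),
    "node-c": ("asw-2", "pod-1"),
}


def label_commands(topology):
    """Build one `kubectl label node` command per node, without executing anything."""
    commands = []
    for node, (asw, pod) in sorted(topology.items()):
        args = ["kubectl", "label", "node", node,
                f"{ASW_KEY}={asw}", f"{POD_KEY}={pod}"]
        commands.append(shlex.join(args))  # shell-safe quoting if ever needed
    return commands


for command in label_commands(TOPOLOGY):
    print(command)
```

Review the printed commands before running them, and confirm that the label keys match the labelKey values in your ClusterNetworkTopology.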
Scheduling strategies
Two strategies control how strictly pods are grouped within a topology level:
| Strategy | Behavior |
|---|---|
| PreferGather | Groups pods within a single domain of the layer when possible; allows pods to spread across domains of that layer if resources are insufficient |
| MustGather | Requires all pods to be within a single domain of the layer; scheduling fails if no single domain can accommodate the job |
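The difference between the two strategies can be modeled for a single topology layer as follows. This is a simplified sketch, not the scheduler's code: the `place` function, its inputs, and the largest-first spill order are assumptions made for illustration.

```python
def place(free_slots, job_size, strategy):
    """Distribute job_size workers over the domains of one topology layer.

    free_slots: dict of domain name -> free slot count (hypothetical input).
    Returns a dict of domain -> workers placed, or None if scheduling fails.
    """
    if strategy not in ("PreferGather", "MustGather"):
        raise ValueError(f"unknown strategy: {strategy}")

    # Largest domains first, so the fewest domains are touched.
    ordered = sorted(free_slots.items(), key=lambda kv: -kv[1])

    # Both strategies first look for a single domain that holds the whole job.
    for domain, free in ordered:
        if free >= job_size:
            return {domain: job_size}

    if strategy == "MustGather":
        return None  # no single domain fits, and spreading is forbidden

    # PreferGather: spread the remainder across additional domains.
    plan, remaining = {}, job_size
    for domain, free in ordered:
        take = min(free, remaining)
        if take:
            plan[domain] = take
            remaining -= take
        if remaining == 0:
            return plan
    return None  # not enough capacity in total


# Hypothetical capacities: two Pods with 4 and 1 free slots.
SLOTS = {"pod-1": 4, "pod-2": 1}
print(place(SLOTS, 5, "MustGather"))
print(place(SLOTS, 5, "PreferGather"))
```

With these capacities a 5-worker job is rejected under MustGather but spreads across both domains under PreferGather.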
Prerequisites
Before you begin, make sure you have:
- An ACK Lingjun cluster
- kubectl configured to connect to the cluster
- Sufficient nodes with topology labels applied
Configure and deploy network topology-aware scheduling
Configuring network topology-aware scheduling requires three YAML files: one that defines the cluster-level topology structure, one that declares the scheduling constraints for a specific job, and one for the job itself.
Step 1: Define the cluster topology
Create cluster-network-topology.yaml to declare the two-level topology structure for the cluster:
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: ClusterNetworkTopology
metadata:
  # Keep unchanged.
  name: default
spec:
  networkTopologySpec:
  # parentTopologyLayer declares the upper topology structure.
  - parentTopologyLayer: ASWTopologyLayer
    # The lowest level must be NodeTopologyLayer for Lingjun nodes.
    topologyLayer: NodeTopologyLayer
  # The following defines the cross-layer topology mapping. Normally, no modification is needed.
  - labelKey:
    - alibabacloud.com/point-of-delivery
    topologyLayer: PoDTopologyLayer
  - labelKey:
    - alibabacloud.com/asw-id
    parentTopologyLayer: PoDTopologyLayer
    topologyLayer: ASWTopologyLayer
Step 2: Declare the job topology constraints
Create sample-network-topology.yaml to specify the scheduling strategy for each topology level:
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: JobNetworkTopology
metadata:
  labels:
    network-topology-permit-wait-time: "999999"
  # The job name.
  name: sample-network-topology
  # The namespace the job belongs to.
  namespace: sample-network-topology
spec:
  topologyStrategy:
  # PreferGather: allows cross-ASW scheduling when resources are insufficient.
  - layer: ASWTopologyLayer
    strategy: PreferGather
  - layer: NodeTopologyLayer
    strategy: PreferGather
  # MustGather: cross-Pod scheduling is not allowed.
  - layer: PoDTopologyLayer
    strategy: MustGather
  # Must match spec.parallelism and pod-group.scheduling.sigs.k8s.io/min-available in the Job manifest.
  workerNum: 2
Step 3: Create the job
Create pi.yaml to define the job and reference the topology constraints:
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  # Must match workerNum in JobNetworkTopology.
  parallelism: 2
  template:
    metadata:
      labels:
        # Must match workerNum in JobNetworkTopology.
        pod-group.scheduling.sigs.k8s.io/min-available: "2"
        pod-group.scheduling.sigs.k8s.io/name: sample-gang
        # Reference the JobNetworkTopology created in Step 2.
        network-topology-job-name: sample-network-topology
        network-topology-job-namespace: sample-network-topology
    spec:
      schedulerName: default-scheduler
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
        resources:
          limits:
            # This example uses a single GPU. Adjust as needed.
            nvidia.com/gpu: 1
      restartPolicy: Never
  backoffLimit: 4
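Because workerNum, parallelism, and the min-available label must all agree, and the Job labels must point at the right JobNetworkTopology, a quick pre-flight check can catch mismatches before you apply the files. The sketch below works on plain dictionaries that mirror only the relevant fields of the two manifests; the `check_worker_counts` helper is hypothetical, and loading the real YAML files is left out to keep the example dependency-free.

```python
MIN_AVAILABLE = "pod-group.scheduling.sigs.k8s.io/min-available"

# Only the fields relevant to the check, mirroring the manifests from Steps 2 and 3.
JOB_TOPOLOGY = {
    "metadata": {"name": "sample-network-topology",
                 "namespace": "sample-network-topology"},
    "spec": {"workerNum": 2},
}
JOB = {
    "spec": {
        "parallelism": 2,
        "template": {"metadata": {"labels": {
            MIN_AVAILABLE: "2",
            "network-topology-job-name": "sample-network-topology",
            "network-topology-job-namespace": "sample-network-topology",
        }}},
    },
}


def check_worker_counts(job_topology, job):
    """Return a list of mismatch descriptions; an empty list means consistent."""
    worker_num = job_topology["spec"]["workerNum"]
    labels = job["spec"]["template"]["metadata"]["labels"]
    problems = []

    parallelism = job["spec"]["parallelism"]
    if parallelism != worker_num:
        problems.append(f"parallelism={parallelism} but workerNum={worker_num}")

    # Label values are strings in Kubernetes, so convert before comparing.
    min_available = int(labels[MIN_AVAILABLE])
    if min_available != worker_num:
        problems.append(f"min-available={min_available} but workerNum={worker_num}")

    expected = (job_topology["metadata"]["name"],
                job_topology["metadata"]["namespace"])
    actual = (labels["network-topology-job-name"],
              labels["network-topology-job-namespace"])
    if actual != expected:
        problems.append(f"Job references {actual}, expected {expected}")
    return problems


print(check_worker_counts(JOB_TOPOLOGY, JOB) or "manifests are consistent")
```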
Step 4: Deploy the configuration
Apply all three files to the cluster:
kubectl apply -f cluster-network-topology.yaml
kubectl apply -f sample-network-topology.yaml
kubectl apply -f pi.yaml
Verify scheduling results
Check the cluster topology
Retrieve the topology labels on your nodes to confirm the topology structure:
# Network topology:
#            test-pod-1              test-pod-2
#           /     |     \                 |
#     test-1   test-2   test-3         test-4
#     /    \      |        |              |
#   0.12  0.14  0.15     0.16           0.17
kubectl get no -l alibabacloud.com/asw-id,alibabacloud.com/point-of-delivery -ojson | jq '.items[] | {"Name":.metadata.name, "ASW":.metadata.labels."alibabacloud.com/asw-id", "POD":.metadata.labels."alibabacloud.com/point-of-delivery"}'
Expected output:
{
"Name": "cn-hongkong.10.1.0.12",
"ASW": "test-1",
"POD": "test-pod-1"
}
{
"Name": "cn-hongkong.10.1.0.14",
"ASW": "test-1",
"POD": "test-pod-1"
}
{
"Name": "cn-hongkong.10.1.0.15",
"ASW": "test-2",
"POD": "test-pod-1"
}
{
"Name": "cn-hongkong.10.1.0.16",
"ASW": "test-3",
"POD": "test-pod-1"
}
{
"Name": "cn-hongkong.10.1.0.17",
"ASW": "test-4",
"POD": "test-pod-2"
}
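To see how many nodes each Pod and ASW offers, the jq output can be grouped into a tree. The sketch below is one way to do that in Python, assuming the output has been captured as text; the sample input is the expected output above, compacted to one object per line, and the helper names are hypothetical.

```python
import json
from collections import defaultdict

# Sample input copied from the expected output above (one object per line).
JQ_OUTPUT = """
{"Name": "cn-hongkong.10.1.0.12", "ASW": "test-1", "POD": "test-pod-1"}
{"Name": "cn-hongkong.10.1.0.14", "ASW": "test-1", "POD": "test-pod-1"}
{"Name": "cn-hongkong.10.1.0.15", "ASW": "test-2", "POD": "test-pod-1"}
{"Name": "cn-hongkong.10.1.0.16", "ASW": "test-3", "POD": "test-pod-1"}
{"Name": "cn-hongkong.10.1.0.17", "ASW": "test-4", "POD": "test-pod-2"}
"""


def parse_json_stream(text):
    """Parse a sequence of whitespace-separated JSON objects, as jq prints them."""
    decoder = json.JSONDecoder()
    items, pos = [], 0
    while True:
        # raw_decode does not skip leading whitespace, so advance past it manually.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            return items
        obj, pos = decoder.raw_decode(text, pos)
        items.append(obj)


def group_by_topology(items):
    """Build {Pod: {ASW: [node names]}} from the parsed objects."""
    tree = defaultdict(lambda: defaultdict(list))
    for item in items:
        tree[item["POD"]][item["ASW"]].append(item["Name"])
    return {pod: dict(asws) for pod, asws in tree.items()}


TREE = group_by_topology(parse_json_stream(JQ_OUTPUT))
for pod, asws in TREE.items():
    node_count = sum(len(nodes) for nodes in asws.values())
    print(f"{pod}: {node_count} node(s)")
    for asw, nodes in asws.items():
        print(f"  {asw}: {', '.join(nodes)}")
```

For this cluster the tree shows four nodes under test-pod-1 and one under test-pod-2, which is the capacity the scenarios below rely on.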
Scenario 1: 2-pod job
With workerNum: 2, both pods are scheduled onto nodes within the same ASW (test-1):
kubectl get pod -owide
Expected output:
NAME       READY   STATUS              RESTARTS   AGE   IP               NODE                    NOMINATED NODE   READINESS GATES
pi-8p89l   1/1     Running             0          4s    172.30.240.197   cn-hongkong.10.1.0.14   <none>           <none>
pi-p8swv   0/1     ContainerCreating   0          4s    <none>           cn-hongkong.10.1.0.12   <none>           <none>
Both pods land on test-1 (ASW), sharing the lowest-hop network path.
Scenario 2: 4-pod job
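Set parallelism and pod-group.scheduling.sigs.k8s.io/min-available in pi.yaml, and workerNum in sample-network-topology.yaml, to 4, then reapply both files. If the Job from Scenario 1 still exists, delete it first (kubectl delete job pi), because the pod template of an existing Job cannot be changed in place.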
The scheduler places all 4 pods within test-pod-1, because MustGather on PoDTopologyLayer prevents cross-Pod scheduling. The node under test-pod-2 remains unused:
NAME       READY   STATUS              RESTARTS   AGE   IP               NODE                    NOMINATED NODE   READINESS GATES
pi-2kwq9   1/1     Running             0          4s    172.30.241.123   cn-hongkong.10.1.0.12   <none>           <none>
pi-87hm5   0/1     ContainerCreating   0          4s    <none>           cn-hongkong.10.1.0.16   <none>           <none>
pi-bsvx8   1/1     Running             0          4s    172.30.240.198   cn-hongkong.10.1.0.14   <none>           <none>
pi-dvwhl   0/1     ContainerCreating   0          4s    <none>           cn-hongkong.10.1.0.15   <none>           <none>
Scenario 3: 5-pod job (scheduling failure)
Update the same parameters to 5. The job fails because test-pod-1 has only 4 available slots and MustGather on PoDTopologyLayer prohibits using the single node in test-pod-2.
All pods remain in Pending state:
NAME       READY   STATUS    RESTARTS   AGE
pi-75qf5   0/1     Pending   0          2s
pi-8k4nd   0/1     Pending   0          2s
pi-b2pmc   0/1     Pending   0          2s
pi-n7c2b   0/1     Pending   0          2s
pi-wf4zn   0/1     Pending   0          2s
Inspect the scheduling failure message:
kubectl get pod -ojson | jq '.items[].status'
The scheduling failure message contains:
0/6 nodes are available: 1 Insufficient nvidia.com/gpu, 1 [NetworkTopology begin] cluster total nodes:6, 5 node provide 5 freeSlot, 1 node unavailable cause Insufficient nvidia.com/gpu, job desireNum:5, all fail topology paths by MustGather reason: [path:RootNode->test-pod-1, freeSlotNum:4], [path:RootNode->DefaultTopologyName, freeSlotNum:0], [path:RootNode->test-pod-2, freeSlotNum:1] [NetworkTopology end], 4 NetworkTopology bestPlan empty. network topology job sample-network-topology/sample-network-topology gets rejected due to pod is unschedulable...
Key fields in the message:
| Field | Value | Meaning |
|---|---|---|
| job desireNum | 5 | The scheduler needs 5 slots within a single Pod-level domain |
| path:RootNode->test-pod-1, freeSlotNum | 4 | test-pod-1 has only 4 available slots, not enough for 5 pods |
| path:RootNode->test-pod-2, freeSlotNum | 1 | test-pod-2 has 1 slot, but MustGather forbids splitting the job across Pods |
| NetworkTopology bestPlan empty | N/A | No valid topology plan exists; scheduling is blocked |
To resolve this, either reduce workerNum to fit within a single Pod, or change the PoDTopologyLayer strategy from MustGather to PreferGather to allow cross-Pod scheduling.
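When triaging many pending jobs, the key fields can be pulled out of the failure message with regular expressions. The sketch below assumes the message format matches the single example shown in this topic, which may differ between scheduler versions; the `summarize` helper is hypothetical.

```python
import re

# Excerpt of the failure message shown above.
MESSAGE = (
    "job desireNum:5, all fail topology paths by MustGather reason: "
    "[path:RootNode->test-pod-1, freeSlotNum:4], "
    "[path:RootNode->DefaultTopologyName, freeSlotNum:0], "
    "[path:RootNode->test-pod-2, freeSlotNum:1] [NetworkTopology end]"
)

DESIRED_RE = re.compile(r"job desireNum:(\d+)")
PATH_RE = re.compile(r"\[path:RootNode->([^,\]]+), freeSlotNum:(\d+)\]")


def summarize(message):
    """Extract the requested worker count and the free slots per topology path."""
    match = DESIRED_RE.search(message)
    if match is None:
        raise ValueError("message does not contain 'job desireNum'")
    desired = int(match.group(1))
    free_slots = {name: int(count) for name, count in PATH_RE.findall(message)}
    largest = max(free_slots.values(), default=0)
    return {
        "desired": desired,
        "free_slots": free_slots,
        # How many more slots the largest domain would need to hold the job.
        "shortfall": max(0, desired - largest),
    }


print(summarize(MESSAGE))
```

A shortfall of 1 here means the largest Pod is one slot short of the job, so reducing workerNum by 1 is enough to satisfy MustGather.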