Reduce ML Latency with Network Topology-Aware Scheduling in Lingjun - Container Service for Kubernetes

In distributed AI training and big data workloads, pods communicate frequently. The default Kubernetes scheduler spreads pods evenly across nodes, so pods in the same job may land on nodes separated by multiple switch hops, and each additional hop increases latency and reduces throughput. Network topology-aware scheduling addresses this by assigning all pods in a job to nodes within the same Layer 1 or Layer 2 forwarding domain, which minimizes switch hops and speeds up job completion.

How it works

Network topology-aware scheduling uses a greedy algorithm to place each job on the set of Lingjun nodes with the smallest topological span.

ACK Lingjun clusters have a two-level network topology:

Access Switch (ASW): the direct interface for Lingjun nodes. Nodes under the same ASW require at least one hop to communicate.
Point of Delivery (Pod): the upper-level topology grouping multiple ASWs. Nodes under different ASWs in the same Pod require at least two hops.

The fewer switch layers between nodes, the lower the latency. The scheduler tries to fit all pods within the lowest possible topology level first:

2-node job: assigned to two nodes under the same ASW (for example, nodes A and B, or E and F).
4-node job: assigned to four nodes within the same Pod (for example, nodes A through D, or E through H).
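
The lowest-level-first idea can be sketched in a few lines of Python. This is an illustration only, not the scheduler's actual implementation: the eight-node layout (A-B and C-D under two ASWs in one Pod, E-F and G-H under two ASWs in another), the one-free-slot-per-node assumption, and the function name pick_nodes are all hypothetical.

```python
from collections import defaultdict

# Hypothetical inventory: node name -> (ASW id, Pod id), one free slot per node.
NODES = {
    "A": ("asw-1", "pod-1"), "B": ("asw-1", "pod-1"),
    "C": ("asw-2", "pod-1"), "D": ("asw-2", "pod-1"),
    "E": ("asw-3", "pod-2"), "F": ("asw-3", "pod-2"),
    "G": ("asw-4", "pod-2"), "H": ("asw-4", "pod-2"),
}


def pick_nodes(nodes, worker_num):
    """Return worker_num node names from the lowest level that fits, or None."""
    by_asw, by_pod = defaultdict(list), defaultdict(list)
    for name in sorted(nodes):
        asw, pod = nodes[name]
        by_asw[asw].append(name)
        by_pod[pod].append(name)
    # Try ASW domains first (fewest hops), then fall back to Pod domains.
    for level in (by_asw, by_pod):
        for domain in sorted(level):
            if len(level[domain]) >= worker_num:
                return level[domain][:worker_num]
    return None  # no single Pod can hold the whole job


print(pick_nodes(NODES, 2))
print(pick_nodes(NODES, 4))
```

With this layout, a 2-worker request is satisfied inside one ASW, a 4-worker request needs a whole Pod, and a 5-worker request finds no single domain.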

Node labels indicate which ASW and Pod a node belongs to:

alibabacloud.com/asw-id: identifies the ASW.
alibabacloud.com/point-of-delivery: identifies the Pod.

In ACK Lingjun clusters, the lingjun-networktopology-collector component automatically collects this information and applies these labels to Lingjun nodes. For other node types or cluster types, add the labels manually and make sure the label keys match the labelKey values in your ClusterNetworkTopology configuration.
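
For nodes that need manual labels, a small helper can generate the kubectl label node commands from a mapping of nodes to topology values. The sketch below only builds the argument lists and prints them; it runs nothing against a cluster. The node names, the topology values, and the helper name label_commands are hypothetical, while the two label keys are the ones listed above.

```python
LABEL_ASW = "alibabacloud.com/asw-id"
LABEL_POD = "alibabacloud.com/point-of-delivery"


def label_commands(placement):
    """Build one `kubectl label node` argv list per node; nothing is executed."""
    commands = []
    for node, (asw, pod) in sorted(placement.items()):
        commands.append([
            "kubectl", "label", "node", node,
            f"{LABEL_ASW}={asw}",
            f"{LABEL_POD}={pod}",
        ])
    return commands


# Hypothetical node names and topology values, for illustration only.
example = {"node-1": ("asw-a", "pod-x"), "node-2": ("asw-a", "pod-x")}
for argv in label_commands(example):
    print(" ".join(argv))
```

Review the printed commands before running them, and keep the keys identical to the labelKey values in your ClusterNetworkTopology configuration.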

Scheduling strategies

Two strategies control how strictly pods are grouped within a topology level:

Strategy	Behavior
`PreferGather`	Groups pods within the layer when possible; allows cross-layer scheduling if resources are insufficient
`MustGather`	Requires all pods to be within the same layer; scheduling fails if no single layer can accommodate the job
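
The difference between the two strategies can be illustrated with a simplified Python model of a single topology layer. This is a sketch under stated assumptions, not the scheduler's code: the function name gather, the best-fit choice among domains that can hold the whole job, and the largest-first spill order for PreferGather are all assumptions made for the example.

```python
def gather(free_slots, worker_num, strategy):
    """Place worker_num pods given {domain: free slots}.

    Returns {domain: pods placed} or None when the job cannot be placed.
    """
    if strategy not in ("PreferGather", "MustGather"):
        raise ValueError(f"unknown strategy: {strategy}")

    # Both strategies first look for a single domain that holds the whole job.
    fitting = [d for d, free in free_slots.items() if free >= worker_num]
    if fitting:
        best = min(fitting, key=lambda d: (free_slots[d], d))
        return {best: worker_num}

    if strategy == "MustGather":
        return None  # no single domain fits, so scheduling fails

    # PreferGather: spill across domains if the total capacity is enough.
    if sum(free_slots.values()) < worker_num:
        return None
    plan, remaining = {}, worker_num
    for domain in sorted(free_slots, key=lambda d: (-free_slots[d], d)):
        take = min(free_slots[domain], remaining)
        if take:
            plan[domain] = take
            remaining -= take
        if remaining == 0:
            break
    return plan


slots = {"test-pod-1": 4, "test-pod-2": 1}
print(gather(slots, 5, "MustGather"))
print(gather(slots, 5, "PreferGather"))
```

With 4 free slots in one domain and 1 in another, a 5-pod job is rejected under MustGather but split across both domains under PreferGather.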

Prerequisites

Before you begin, make sure you have:

An ACK Lingjun cluster
kubectl configured to connect to the cluster
Sufficient nodes with topology labels applied

Configure and deploy network topology-aware scheduling

Configuring network topology-aware scheduling requires three YAML files: one that defines the cluster-level topology structure, one that declares the scheduling constraints for a specific job, and one for the job itself.

Step 1: Define the cluster topology

Create cluster-network-topology.yaml to declare the two-level topology structure for the cluster:

apiVersion: scheduling.koordinator.sh/v1alpha1
kind: ClusterNetworkTopology
metadata:
  # Keep unchanged.
  name: default
spec:
  networkTopologySpec:
  # parentTopologyLayer declares the upper topology structure.
  - parentTopologyLayer: ASWTopologyLayer
    # The lowest level must be NodeTopologyLayer for Lingjun nodes.
    topologyLayer: NodeTopologyLayer
  # The following defines the cross-layer topology mapping. Normally, no modification is needed.
  - labelKey:
    - alibabacloud.com/point-of-delivery
    topologyLayer: PoDTopologyLayer
  - labelKey:
    - alibabacloud.com/asw-id
    parentTopologyLayer: PoDTopologyLayer
    topologyLayer: ASWTopologyLayer

Step 2: Declare the job topology constraints

Create sample-network-topology.yaml to specify the scheduling strategy for each topology level:

apiVersion: scheduling.koordinator.sh/v1alpha1
kind: JobNetworkTopology
metadata:
  labels:
    network-topology-permit-wait-time: "999999"
  # The job name.
  name: sample-network-topology
  # The namespace the job belongs to.
  namespace: sample-network-topology
spec:
  topologyStrategy:
  # PreferGather: allows cross-ASW scheduling when resources are insufficient.
  - layer: ASWTopologyLayer
    strategy: PreferGather
  - layer: NodeTopologyLayer
    strategy: PreferGather
  # MustGather: cross-Pod scheduling is not allowed.
  - layer: PoDTopologyLayer
    strategy: MustGather
  # Must match spec.parallelism and pod-group.scheduling.sigs.k8s.io/min-available in the Job manifest.
  workerNum: 2

Step 3: Create the job

Create pi.yaml to define the job and reference the topology constraints:

apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  # Must match workerNum in JobNetworkTopology.
  parallelism: 2
  template:
    metadata:
      labels:
        # Must match workerNum in JobNetworkTopology.
        pod-group.scheduling.sigs.k8s.io/min-available: "2"
        pod-group.scheduling.sigs.k8s.io/name: sample-gang
        # Reference the JobNetworkTopology created in Step 2.
        network-topology-job-name: sample-network-topology
        network-topology-job-namespace: sample-network-topology
    spec:
      schedulerName: default-scheduler
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
        resources:
          limits:
            # This example uses a single GPU. Adjust as needed.
            nvidia.com/gpu: 1
      restartPolicy: Never
  backoffLimit: 4
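
Because workerNum, parallelism, and the min-available label must agree, and the Job must reference the right JobNetworkTopology, a quick pre-flight check can catch mismatches before you apply the files. The Python sketch below assumes both manifests have already been loaded into dictionaries (for example, by a YAML parser of your choice); the function name manifest_mismatches is hypothetical.

```python
MIN_AVAILABLE = "pod-group.scheduling.sigs.k8s.io/min-available"


def manifest_mismatches(job_topology, job):
    """Return human-readable mismatches; an empty list means the manifests agree."""
    worker_num = job_topology["spec"]["workerNum"]
    labels = job["spec"]["template"]["metadata"]["labels"]
    problems = []
    if job["spec"]["parallelism"] != worker_num:
        problems.append("spec.parallelism differs from workerNum")
    if int(labels[MIN_AVAILABLE]) != worker_num:
        problems.append("min-available label differs from workerNum")
    if labels.get("network-topology-job-name") != job_topology["metadata"]["name"]:
        problems.append("network-topology-job-name does not match")
    if labels.get("network-topology-job-namespace") != job_topology["metadata"]["namespace"]:
        problems.append("network-topology-job-namespace does not match")
    return problems


# Values copied from the two manifests above.
topology = {
    "metadata": {"name": "sample-network-topology",
                 "namespace": "sample-network-topology"},
    "spec": {"workerNum": 2},
}
job = {"spec": {"parallelism": 2, "template": {"metadata": {"labels": {
    MIN_AVAILABLE: "2",
    "network-topology-job-name": "sample-network-topology",
    "network-topology-job-namespace": "sample-network-topology",
}}}}}
print(manifest_mismatches(topology, job))  # prints []
```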

Step 4: Deploy the configuration

Apply all three files to the cluster:

kubectl apply -f cluster-network-topology.yaml
kubectl apply -f sample-network-topology.yaml
kubectl apply -f pi.yaml

Verify scheduling results

Check the cluster topology

Retrieve the topology labels on your nodes to confirm the topology structure:

# Network topology:
#              test-pod-1                     test-pod-2
#        /          |           \                   |
#    test-1      test-2      test-3               test-4
#     /   \         |           |                   |
#   0.12  0.14     0.15        0.16                0.17

kubectl get no -l alibabacloud.com/asw-id,alibabacloud.com/point-of-delivery -ojson | jq '.items[] | {"Name":.metadata.name, "ASW":.metadata.labels."alibabacloud.com/asw-id", "POD":.metadata.labels."alibabacloud.com/point-of-delivery"}'

Expected output:

{
  "Name": "cn-hongkong.10.1.0.12",
  "ASW": "test-1",
  "POD": "test-pod-1"
}
{
  "Name": "cn-hongkong.10.1.0.14",
  "ASW": "test-1",
  "POD": "test-pod-1"
}
{
  "Name": "cn-hongkong.10.1.0.15",
  "ASW": "test-2",
  "POD": "test-pod-1"
}
{
  "Name": "cn-hongkong.10.1.0.16",
  "ASW": "test-3",
  "POD": "test-pod-1"
}
{
  "Name": "cn-hongkong.10.1.0.17",
  "ASW": "test-4",
  "POD": "test-pod-2"
}
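
The jq command prints a stream of separate JSON objects rather than one array. To see the topology as a tree, the stream can be folded into a Pod, ASW, node hierarchy. The following Python sketch assumes the output has been saved as text with the same Name, ASW, and POD keys; the function names are hypothetical.

```python
import json
from collections import defaultdict


def parse_json_stream(text):
    """Parse back-to-back JSON objects (as printed by jq) into a list."""
    decoder = json.JSONDecoder()
    objects, pos = [], 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        objects.append(obj)
    return objects


def topology_tree(objects):
    """Group node names as {Pod: {ASW: [node, ...]}}."""
    tree = defaultdict(lambda: defaultdict(list))
    for obj in objects:
        tree[obj["POD"]][obj["ASW"]].append(obj["Name"])
    return {pod: dict(asws) for pod, asws in tree.items()}


# Two entries from the expected output above.
sample = """
{"Name": "cn-hongkong.10.1.0.12", "ASW": "test-1", "POD": "test-pod-1"}
{"Name": "cn-hongkong.10.1.0.17", "ASW": "test-4", "POD": "test-pod-2"}
"""
print(topology_tree(parse_json_stream(sample)))
```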

Scenario 1: 2-pod job

With workerNum: 2, both pods are scheduled onto nodes within the same ASW (test-1):

kubectl get pod -owide

NAME       READY   STATUS              RESTARTS   AGE   IP               NODE                    NOMINATED NODE   READINESS GATES
pi-8p89l   1/1     Running             0          4s    172.30.240.197   cn-hongkong.10.1.0.14   <none>           <none>
pi-p8swv   0/1     ContainerCreating   0          4s    <none>           cn-hongkong.10.1.0.12   <none>           <none>

Both pods land on nodes under ASW test-1, so they share the lowest-hop network path.

Scenario 2: 4-pod job

Set parallelism and pod-group.scheduling.sigs.k8s.io/min-available in pi.yaml, and workerNum in sample-network-topology.yaml, to 4. A Job's pod template is immutable, so delete the existing Job (kubectl delete -f pi.yaml) before you reapply both files.

The scheduler places all 4 pods within test-pod-1, because MustGather on PoDTopologyLayer prevents cross-Pod scheduling. The node under test-pod-2 remains unused:

NAME       READY   STATUS              RESTARTS   AGE   IP               NODE                    NOMINATED NODE   READINESS GATES
pi-2kwq9   1/1     Running             0          4s    172.30.241.123   cn-hongkong.10.1.0.12   <none>           <none>
pi-87hm5   0/1     ContainerCreating   0          4s    <none>           cn-hongkong.10.1.0.16   <none>           <none>
pi-bsvx8   1/1     Running             0          4s    172.30.240.198   cn-hongkong.10.1.0.14   <none>           <none>
pi-dvwhl   0/1     ContainerCreating   0          4s    <none>           cn-hongkong.10.1.0.15   <none>           <none>

Scenario 3: 5-pod job (scheduling failure)

Update the same parameters to 5. Scheduling fails because test-pod-1 has only 4 available slots and MustGather on PoDTopologyLayer prohibits placing the fifth pod on the single node in test-pod-2.

All pods remain in Pending state:

NAME       READY   STATUS    RESTARTS   AGE
pi-75qf5   0/1     Pending   0          2s
pi-8k4nd   0/1     Pending   0          2s
pi-b2pmc   0/1     Pending   0          2s
pi-n7c2b   0/1     Pending   0          2s
pi-wf4zn   0/1     Pending   0          2s

Inspect the scheduling failure message:

kubectl get pod -ojson | jq '.items[].status'

The scheduling failure message contains:

0/6 nodes are available: 1 Insufficient nvidia.com/gpu, 1 [NetworkTopology begin] cluster total nodes:6, 5 node provide 5 freeSlot, 1 node unavailable cause Insufficient nvidia.com/gpu, job desireNum:5, all fail topology paths by MustGather reason: [path:RootNode->test-pod-1, freeSlotNum:4], [path:RootNode->DefaultTopologyName, freeSlotNum:0], [path:RootNode->test-pod-2, freeSlotNum:1] [NetworkTopology end], 4 NetworkTopology bestPlan empty. network topology job sample-network-topology/sample-network-topology gets rejected due to pod is unschedulable...

Key fields in the message:

Field	Value	Meaning
`job desireNum`	`5`	The scheduler needs 5 slots within a single Pod-level domain
`path:RootNode->test-pod-1, freeSlotNum`	`4`	`test-pod-1` has only 4 available slots, which is not enough for 5 pods
`path:RootNode->test-pod-2, freeSlotNum`	`1`	`test-pod-2` has 1 slot, but `MustGather` forbids splitting the job across Pods
`NetworkTopology bestPlan empty`	—	No valid topology plan exists; scheduling is blocked

To resolve this, either reduce workerNum to fit within a single Pod, or change the PoDTopologyLayer strategy from MustGather to PreferGather to allow cross-Pod scheduling.
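
When triaging such failures, it can help to pull the key numbers out of the message programmatically. The Python sketch below assumes the message keeps the format shown above (job desireNum:N and [path:..., freeSlotNum:N] entries); that format is taken from this example and is not a documented contract, and the function name summarize_failure is hypothetical.

```python
import re


def summarize_failure(message):
    """Return (desired worker count, {topology path: free slots}), or None."""
    desired = re.search(r"job desireNum:(\d+)", message)
    if desired is None:
        return None
    paths = re.findall(r"\[path:([^,\]]+), freeSlotNum:(\d+)\]", message)
    return int(desired.group(1)), {path: int(free) for path, free in paths}


# Excerpt of the failure message shown above.
MESSAGE = (
    "job desireNum:5, all fail topology paths by MustGather reason: "
    "[path:RootNode->test-pod-1, freeSlotNum:4], "
    "[path:RootNode->DefaultTopologyName, freeSlotNum:0], "
    "[path:RootNode->test-pod-2, freeSlotNum:1] [NetworkTopology end]"
)

desired, free_by_path = summarize_failure(MESSAGE)
for path, free in free_by_path.items():
    print(f"{path}: {free} free, short by {max(desired - free, 0)}")
```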