Enable Topology-Aware GPU Scheduling for TensorFlow Jobs - ACK

Distributed training workloads spread across multiple GPU nodes incur significant cross-node communication overhead during gradient synchronization. ACK supports topology-aware GPU scheduling based on the scheduling framework, which selects GPUs based on their physical interconnect topology—placing workers on GPUs that share high-bandwidth NVLink connections—to reduce this overhead and maximize training throughput. This topic shows how to apply topology-aware GPU scheduling to TensorFlow distributed training jobs and compare the results against regular GPU scheduling.

Prerequisites

Before you begin, ensure that you have:

An ACK Pro cluster with the instance type set to Elastic GPU Service. See Create an ACK managed cluster
Arena installed
The topology-aware GPU scheduling component installed
Component versions that meet the following requirements:

Component	Required version	Check command
Kubernetes	1.18.8 and later	`kubectl version --short`
NVIDIA driver	418.87.01 and later	`nvidia-smi --query-gpu=driver_version --format=csv,noheader`
NCCL (NVIDIA Collective Communications Library)	2.7 and later	`python3 -c "import torch; print(torch.cuda.nccl.version())"`
Operating system	CentOS 7.6, CentOS 7.7, Ubuntu 16.04, Ubuntu 18.04, Alibaba Cloud Linux 2, Alibaba Cloud Linux 3	`cat /etc/os-release`
GPU	V100	`nvidia-smi --query-gpu=name --format=csv,noheader`

Important

Complete the prerequisites in order. Create the ACK Pro cluster first, then install Arena, and finally install the topology-aware GPU scheduling component. Installing components out of order may cause failures.

Limitations

Topology-aware GPU scheduling applies only to Message Passing Interface (MPI) jobs trained with a distributed framework.

Regular GPU scheduling assigns GPUs based on availability alone, without considering physical interconnect topology. This means workers can be placed on GPUs across different nodes connected only by slower network links, making cross-GPU communication the primary bottleneck. Topology-aware scheduling solves this by grouping workers on GPUs that share NVLink connections within the same node, significantly reducing gradient synchronization latency.

Pods are created only when all requested resources can be satisfied simultaneously (gang scheduling). If resources are insufficient, the job remains pending until enough GPUs are available.

Configure nodes

Label each node to enable topology-aware GPU scheduling:

kubectl label node <your-node-name> ack.node.gpu.schedule=topology

Note

Enabling topology-aware scheduling on a node disables regular GPU scheduling for that node. To revert to regular GPU scheduling, run:

kubectl label node <your-node-name> ack.node.gpu.schedule=default --overwrite

Submit a job

Submit an MPI job with --gputopology=true and --gang:

arena submit mpi --gputopology=true --gang <other-parameters>

Both flags are required: --gputopology=true enables topology-aware GPU selection, and --gang enforces gang scheduling so all workers start simultaneously.

Example 1: Train VGG16

The cluster in this example has two nodes, each with eight V100 GPUs.

Topology-aware GPU scheduling

Submit the training job:

arena submit mpi \
  --name=tensorflow-topo-4-vgg16 \
  --gpus=1 \
  --workers=4 \
  --gang \
  --gputopology=true \
  --image=registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/tensorflow-benchmark:tf2.3.0-py3.7-cuda10.1 \
  "mpirun --allow-run-as-root -np 4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x PATH --mca pml ob1 --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca orte_keep_fqdn_hostnames t --mca btl ^openib python /tensorflow/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=vgg16 --batch_size=64 --variable_update=horovod"

Check the job status:

arena get tensorflow-topo-4-vgg16 --type mpijob

Expected output:

Name:      tensorflow-topo-4-vgg16
Status:    RUNNING
Namespace: default
Priority:  N/A
Trainer:   MPIJOB
Duration:  2m

Instances:
  NAME                                    STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
  ----                                    ------   ---  --------  --------------  ----
  tensorflow-topo-4-vgg16-launcher-lmhjl  Running  2m   true      0               cn-shanghai.192.168.16.172
  tensorflow-topo-4-vgg16-worker-0        Running  2m   false     1               cn-shanghai.192.168.16.173
  tensorflow-topo-4-vgg16-worker-1        Running  2m   false     1               cn-shanghai.192.168.16.173
  tensorflow-topo-4-vgg16-worker-2        Running  2m   false     1               cn-shanghai.192.168.16.173
  tensorflow-topo-4-vgg16-worker-3        Running  2m   false     1               cn-shanghai.192.168.16.173

All four workers are placed on the same node, sharing its high-bandwidth GPU interconnect.

View the training log:

arena logs -f tensorflow-topo-4-vgg16

Expected output:

total images/sec: 991.92

Regular GPU scheduling

Submit the training job without topology flags:

arena submit mpi \
  --name=tensorflow-4-vgg16 \
  --gpus=1 \
  --workers=4 \
  --image=registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/tensorflow-benchmark:tf2.3.0-py3.7-cuda10.1 \
  "mpirun --allow-run-as-root -np 4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x PATH --mca pml ob1 --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca orte_keep_fqdn_hostnames t --mca btl ^openib python /tensorflow/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=vgg16 --batch_size=64 --variable_update=horovod"

Check the job status:

arena get tensorflow-4-vgg16 --type mpijob

Expected output:

Name:      tensorflow-4-vgg16
Status:    RUNNING
Namespace: default
Priority:  N/A
Trainer:   MPIJOB
Duration:  9s

Instances:
  NAME                               STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
  ----                               ------   ---  --------  --------------  ----
  tensorflow-4-vgg16-launcher-xc28k  Running  9s   true      0               cn-shanghai.192.168.16.172
  tensorflow-4-vgg16-worker-0        Running  9s   false     1               cn-shanghai.192.168.16.172
  tensorflow-4-vgg16-worker-1        Running  9s   false     1               cn-shanghai.192.168.16.173
  tensorflow-4-vgg16-worker-2        Running  9s   false     1               cn-shanghai.192.168.16.172
  tensorflow-4-vgg16-worker-3        Running  9s   false     1               cn-shanghai.192.168.16.173

Workers are spread across two nodes, requiring cross-node communication for each gradient synchronization step.

View the training log:

arena logs -f tensorflow-4-vgg16

Expected output:

total images/sec: 200.47

Example 2: Train ResNet50

Topology-aware GPU scheduling

Submit the training job:

arena submit mpi \
  --name=tensorflow-topo-4-resnet50 \
  --gpus=1 \
  --workers=4 \
  --gang \
  --gputopology=true \
  --image=registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/tensorflow-benchmark:tf2.3.0-py3.7-cuda10.1 \
  "mpirun --allow-run-as-root -np 4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x PATH --mca pml ob1 --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca orte_keep_fqdn_hostnames t --mca btl ^openib python /tensorflow/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=resnet50 --batch_size=64 --variable_update=horovod"

Check the job status:

arena get tensorflow-topo-4-resnet50 --type mpijob

Expected output:

Name:      tensorflow-topo-4-resnet50
Status:    RUNNING
Namespace: default
Priority:  N/A
Trainer:   MPIJOB
Duration:  8s

Instances:
  NAME                                       STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
  ----                                       ------   ---  --------  --------------  ----
  tensorflow-topo-4-resnet50-launcher-7ln8j  Running  8s   true      0               cn-shanghai.192.168.16.172
  tensorflow-topo-4-resnet50-worker-0        Running  8s   false     1               cn-shanghai.192.168.16.173
  tensorflow-topo-4-resnet50-worker-1        Running  8s   false     1               cn-shanghai.192.168.16.173
  tensorflow-topo-4-resnet50-worker-2        Running  8s   false     1               cn-shanghai.192.168.16.173
  tensorflow-topo-4-resnet50-worker-3        Running  8s   false     1               cn-shanghai.192.168.16.173

View the training log:

arena logs -f tensorflow-topo-4-resnet50

Expected output:

total images/sec: 1471.55

Regular GPU scheduling

Submit the training job without topology flags:

arena submit mpi \
  --name=tensorflow-4-resnet50 \
  --gpus=1 \
  --workers=4 \
  --image=registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/tensorflow-benchmark:tf2.3.0-py3.7-cuda10.1 \
  "mpirun --allow-run-as-root -np 4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x PATH --mca pml ob1 --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca orte_keep_fqdn_hostnames t --mca btl ^openib python /tensorflow/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=resnet50 --batch_size=64 --variable_update=horovod"

Check the job status:

arena get tensorflow-4-resnet50 --type mpijob

Expected output:

Name:      tensorflow-4-resnet50
Status:    RUNNING
Namespace: default
Priority:  N/A
Trainer:   MPIJOB
Duration:  9s

Instances:
  NAME                                  STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
  ----                                  ------   ---  --------  --------------  ----
  tensorflow-4-resnet50-launcher-q24hv  Running  9s   true      0               cn-shanghai.192.168.16.172
  tensorflow-4-resnet50-worker-0        Running  9s   false     1               cn-shanghai.192.168.16.172
  tensorflow-4-resnet50-worker-1        Running  9s   false     1               cn-shanghai.192.168.16.173
  tensorflow-4-resnet50-worker-2        Running  9s   false     1               cn-shanghai.192.168.16.172
  tensorflow-4-resnet50-worker-3        Running  9s   false     1               cn-shanghai.192.168.16.173

View the training log:

arena logs -f tensorflow-4-resnet50

Expected output:

total images/sec: 745.38

Performance comparison

The following chart shows throughput for VGG16 and ResNet50 training under topology-aware and regular GPU scheduling.

Model	Topology-aware (images/sec)	Regular (images/sec)	Improvement
VGG16	991.92	200.47	~4.9x
ResNet50	1471.55	745.38	~2.0x

Important

The performance values in this topic are theoretical values. Actual results vary based on your model architecture, cluster configuration, and network conditions. Run the examples above in your own cluster to measure actual gains.

Container Service for Kubernetes:Use GPU topology-aware scheduling (TensorFlow)

Prerequisites

Limitations

Configure nodes

Submit a job

Example 1: Train VGG16

Topology-aware GPU scheduling

Regular GPU scheduling

Example 2: Train ResNet50

Topology-aware GPU scheduling

Regular GPU scheduling

Performance comparison

What's next