This topic describes GPU topology and the benefits of topology-aware GPU scheduling.
Benefits of topology-aware GPU scheduling
Each NVLink provides a one-way communication bandwidth of 25 GB/s and a two-way communication bandwidth of 50 GB/s, whereas the PCIe bandwidth is about 16 GB/s. In a training job, the training speed depends on how the allocated GPUs are interconnected. Selecting the optimal combination of GPUs during the GPU scheduling process ensures the optimal training speed.
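To make the idea of an "optimal combination" concrete, the sketch below brute-forces the GPU set whose slowest pairwise link is fastest. The 4-GPU node, its NVLink pairing (GPUs 0-1 and 2-3), and the bandwidth matrix are hypothetical, using the illustrative figures above; a real scheduler would read the topology from the hardware.

```python
from itertools import combinations

# Hypothetical bidirectional bandwidth matrix (GB/s) for a 4-GPU node:
# GPUs 0-1 and 2-3 are linked by NVLink (50 GB/s); every other pair
# communicates over PCIe (16 GB/s). Values are illustrative only.
NVLINK, PCIE = 50, 16
bandwidth = [
    [0,      NVLINK, PCIE,   PCIE],
    [NVLINK, 0,      PCIE,   PCIE],
    [PCIE,   PCIE,   0,      NVLINK],
    [PCIE,   PCIE,   NVLINK, 0],
]

def best_combination(k):
    """Return the k-GPU set whose slowest pairwise link is fastest."""
    def min_link(gpus):
        # The slowest link in the set bounds collective communication.
        return min(bandwidth[a][b] for a, b in combinations(gpus, 2))
    return max(combinations(range(len(bandwidth)), k), key=min_link)

print(best_combination(2))  # prints (0, 1), an NVLink-connected pair
```

Ranking candidate sets by their slowest link reflects that collective operations such as all-reduce are bounded by the weakest interconnect in the set.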
Open source Kubernetes does not support topology-aware GPU scheduling, so GPUs are selected at random and the training speed varies with the combination of GPUs that happens to be allocated. Container Service for Kubernetes (ACK) supports topology-aware GPU scheduling based on the Kubernetes scheduling framework. This feature selects the combination of GPUs on GPU-accelerated nodes that achieves the optimal GPU acceleration for training jobs.
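The spread caused by random selection can be illustrated with the same kind of hypothetical bandwidth matrix (a 4-GPU node where GPUs 0-1 and 2-3 share NVLink; all numbers are illustrative, not measured):

```python
from itertools import combinations

# Hypothetical 4-GPU node: GPUs 0-1 and 2-3 share NVLink (50 GB/s
# two-way); any other pair falls back to PCIe (16 GB/s).
NVLINK, PCIE = 50, 16
bandwidth = [
    [0,      NVLINK, PCIE,   PCIE],
    [NVLINK, 0,      PCIE,   PCIE],
    [PCIE,   PCIE,   0,      NVLINK],
    [PCIE,   PCIE,   NVLINK, 0],
]

# A scheduler that picks 2 GPUs at random may land on any of these pairs.
pair_bw = {pair: bandwidth[pair[0]][pair[1]]
           for pair in combinations(range(4), 2)}
worst, best = min(pair_bw.values()), max(pair_bw.values())
print(f"worst random pick: {worst} GB/s, best pick: {best} GB/s")
# prints: worst random pick: 16 GB/s, best pick: 50 GB/s
```

In this toy topology, a random pick can land on a pair whose interconnect is roughly one third the bandwidth of the best pair, which is why training speed fluctuates without topology awareness.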