This topic describes GPU topology and explains the benefits of GPU topology-aware scheduling.
A single NVLink provides 25 GB/s of bandwidth in each direction, or 50 GB/s bidirectionally, whereas PCIe provides 16 GB/s. Because the GPUs on a node can be connected by either type of link, the speed of a training job depends on which combination of GPUs the job is assigned. Selecting the optimal combination of GPUs during scheduling therefore gives the job the optimal training speed.
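The idea of choosing a GPU combination by link bandwidth can be illustrated with a minimal sketch. It is not the algorithm that ACK uses: the four-GPU topology, the `LINKS` table, and the functions `bandwidth`, `score`, and `best_combination` are all hypothetical, and the scoring rule (maximize the slowest pairwise link in the group) is only one plausible heuristic. The bandwidth figures are the ones quoted above.

```python
from itertools import combinations

# Per-link bandwidth figures quoted in the text (one-way NVLink vs. PCIe).
NVLINK_BW = 25
PCIE_BW = 16

# Hypothetical node topology (an assumption for illustration only):
# GPU 0 and GPU 1 share an NVLink, GPU 2 and GPU 3 share an NVLink,
# and every other pair communicates over PCIe.
LINKS = {
    (0, 1): NVLINK_BW,
    (2, 3): NVLINK_BW,
}


def bandwidth(a, b):
    """Return the bandwidth between two GPUs, falling back to PCIe."""
    key = (min(a, b), max(a, b))
    return LINKS.get(key, PCIE_BW)


def score(gpus):
    """Score a group by its slowest pairwise link (the likely bottleneck)."""
    return min(bandwidth(a, b) for a, b in combinations(gpus, 2))


def best_combination(free_gpus, count):
    """Pick `count` GPUs from `free_gpus` with the highest score."""
    free = sorted(free_gpus)
    if count > len(free):
        raise ValueError("not enough free GPUs")
    if count < 2:
        # A single GPU has no peer links to compare.
        return tuple(free[:count])
    # max() keeps the first best candidate in enumeration order.
    return max(combinations(free, count), key=score)


if __name__ == "__main__":
    print(best_combination([0, 1, 2, 3], 2))  # an NVLink-connected pair
    print(best_combination([0, 2, 3], 2))     # avoids the PCIe-only pairs
```

In this sketch, a random choice such as GPUs 0 and 2 would be limited to the PCIe figure, while the selected pair communicates over NVLink. An exhaustive search is practical here only because a node holds a small number of GPUs.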
Kubernetes does not support GPU topology-aware scheduling. It selects GPUs at random, so the training speed varies with the combination of GPUs that happens to be chosen. To address this issue, Container Service for Kubernetes (ACK) supports GPU topology-aware scheduling based on the scheduling framework. You can use this feature to select the optimal combination of GPUs on GPU nodes and achieve the optimal training speed.