Best practices for optimizing GPU costs for Container Service for Kubernetes (ACK) clusters

cGPU allows you to schedule multiple application containers that each have low GPU utilization onto the same GPU. This improves overall GPU utilization, reduces resource costs, and helps ensure that GPU resources remain available for applications with heavy loads.

Background Information

The cGPU solution provided by Alibaba Cloud uses a kernel driver, together with a lightweight runtime library in user mode, to create virtual GPUs in containers and to isolate GPU memory and computing power among them. To schedule computing power and isolate GPU memory, cGPU does not need to replace any static or dynamic CUDA library, and CUDA applications do not need to be recompiled. CUDA and cuDNN can be updated at any time without adaptation.

Solution

Flexibly share GPUs.
Improve GPU utilization and reduce the total cost of ownership (TCO).
Share one GPU among multiple isolated containers without the need to modify container configurations.
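
The cost effect of sharing can be illustrated with a small calculation. The following Python sketch compares the number of GPUs needed when each container occupies a whole GPU with the number needed when containers are packed onto shared GPUs by their GPU memory requests. It is only an illustration under stated assumptions: the function names, the memory requests, and the 16 GiB GPU capacity are hypothetical, and the first-fit-decreasing packing is a stand-in heuristic, not the scheduling algorithm that cGPU or ACK actually uses. It also considers GPU memory only and ignores computing power.

```python
from typing import List


def gpus_needed_exclusive(requests_gib: List[int]) -> int:
    """Without sharing, every container occupies one whole GPU."""
    return len(requests_gib)


def gpus_needed_shared(requests_gib: List[int], gpu_capacity_gib: int) -> int:
    """Pack containers onto GPUs by memory request (first-fit decreasing).

    This is an illustrative heuristic, not the real cGPU/ACK scheduler.
    """
    free: List[int] = []  # remaining memory (GiB) on each GPU already in use
    for req in sorted(requests_gib, reverse=True):
        if req > gpu_capacity_gib:
            raise ValueError(f"request of {req} GiB exceeds GPU capacity")
        for i, remaining in enumerate(free):
            if req <= remaining:
                free[i] -= req  # place the container on the first GPU that fits
                break
        else:
            # No GPU in use has enough free memory, so allocate a new one.
            free.append(gpu_capacity_gib - req)
    return len(free)


if __name__ == "__main__":
    # Hypothetical workload: six containers with small GPU memory requests.
    requests = [4, 6, 3, 8, 2, 5]
    capacity = 16  # hypothetical GPU memory size in GiB

    print("exclusive:", gpus_needed_exclusive(requests))
    print("shared:   ", gpus_needed_shared(requests, capacity))
```

With these example numbers, exclusive allocation needs 6 GPUs, while packing by memory request needs 2, because the 28 GiB requested in total fits on two 16 GiB GPUs. Actual savings depend on the real workload mix and on how the scheduler places containers.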

Architecture

References

For more information about how to optimize GPU costs for ACK clusters, see Optimize GPU costs for ACK clusters.