This topic describes how to install and use the cGPU service by using Container Service for Kubernetes (ACK).
Install the cGPU service
Before you can use ACK to implement GPU sharing and scheduling, you must install a resource isolation module and a GPU scheduling inspection tool on each GPU node. For more information, see Install a shared GPU.
Use the cGPU service
The following operations describe how to use the cGPU service by using ACK. Select the operation that matches your business scenario.
- Run the cGPU service
To use GPU resources efficiently, you can deploy a YAML file to create containers that share a GPU with cGPU memory isolation enabled. For more information, see Enable GPU sharing.
- Monitor and isolate GPU resources
The cGPU solution isolates the GPU resources that are allocated to containers that share a single GPU. You do not need to modify existing GPU programs. The linked topic describes how to monitor GPU memory usage by using the managed Prometheus plug-in and how to isolate GPU resources by using cGPU. For more information, see Monitor and isolate GPU resources.
- Upgrade the Docker runtime of a GPU node
To use cGPU resource isolation on the nodes of a Kubernetes cluster, the nodes must run Docker 19.03.5 and the corresponding nvidia-container-runtime binary. If the Docker runtime version of a node is earlier than 19.03.5, you must upgrade it to Docker 19.03.5. Otherwise, the node cannot support the cGPU service.
For more information about how to upgrade Docker and the corresponding nvidia-container-runtime, see Upgrade the Docker runtime of a GPU node.
- Use node pools to manage the cGPU service
You can use node pools to control the GPU sharing and memory isolation policies of cGPU. For example, you can create two node pools with different labels so that each pool applies a different combination of GPU sharing and memory isolation. For more information, see Use node pools to control cGPU.
- Disable cGPU memory isolation
cGPU isolates the GPU resources that are allocated to different containers when the containers share the same physical GPU. You can disable this memory isolation in the YAML file that you use to create a container that uses a GPU. For more information, see Disable cGPU.
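
Before upgrading a GPU node, it can help to confirm whether its Docker runtime already meets the 19.03.5 requirement listed under "Upgrade the Docker runtime of a GPU node". The following is a minimal sketch, not part of the cGPU tooling: the helper names `parse_version` and `meets_minimum` are hypothetical, and the script assumes that the `docker` CLI is installed on the node and that the server version string starts with `major.minor.patch`.

```python
import re
import subprocess

# Minimum Docker version named in this topic (19.03.5).
MIN_DOCKER = (19, 3, 5)


def parse_version(text):
    """Return the leading 'X.Y.Z' of a version string as a tuple of ints."""
    match = re.match(r"\s*(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized Docker version: {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(text, minimum=MIN_DOCKER):
    """Return True if the version string is at least the minimum version."""
    return parse_version(text) >= minimum


if __name__ == "__main__":
    # Ask the local Docker daemon for its server version.
    # Run this on the GPU node itself.
    result = subprocess.run(
        ["docker", "version", "--format", "{{.Server.Version}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    version = result.stdout.strip()
    if meets_minimum(version):
        print(f"Docker {version}: meets the 19.03.5 requirement")
    else:
        print(f"Docker {version}: earlier than 19.03.5, upgrade required")
```

The comparison uses integer tuples rather than plain string comparison, so a version such as 19.03.15 is correctly treated as later than 19.03.5. The script checks only the Docker version. It does not verify the nvidia-container-runtime binary.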