This topic introduces the cGPU solution provided by Alibaba Cloud, describes the benefits of cGPU Professional Edition, and compares the features and use scenarios of cGPU Basic Edition and cGPU Professional Edition. This helps you better understand and use cGPU.
Background information
Container Service for Kubernetes (ACK) provides the open source cGPU solution, which allows multiple containers in a Kubernetes cluster to share one GPU. You can enable cGPU for container clusters that are deployed in Alibaba Cloud, Amazon Web Services (AWS), Google Compute Engine (GCE), or data centers. Sharing a GPU reduces GPU costs. However, when multiple containers run on one GPU, the stability of the containers cannot be ensured.
To ensure container stability, you must isolate the GPU resources that are allocated to each container. When you run multiple containers on one GPU, GPU resources are allocated to each container as requested. However, if one container occupies excessive GPU resources, the performance of the other containers may be affected. The industry offers several solutions to this problem. For example, NVIDIA vGPU, Multi-Process Service (MPS), and vCUDA enable fine-grained sharing of GPUs.
ACK provides the cGPU solution to meet the preceding requirements. cGPU enables a GPU to be shared by multiple tasks. cGPU also allows you to isolate the GPU memory that is allocated to each application and partition the computing capacity of the GPU.
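As a rough illustration of what GPU sharing looks like from the workload side, the following sketch builds a pod manifest in which a container requests a slice of GPU memory instead of a whole GPU. The extended resource name `aliyun.com/gpu-mem` (in GiB) is an assumption that is not stated in this topic, and the pod name, image, and helper function are hypothetical. Verify the resource name against your cluster before you use it.

```python
import json

# Assumption: extended resource name for shared GPU memory, in GiB.
# This name is not given in this topic; check it against your cluster.
GPU_MEM_RESOURCE = "aliyun.com/gpu-mem"


def shared_gpu_pod(name: str, image: str, gpu_mem_gib: int) -> dict:
    """Build a pod manifest whose container requests part of a GPU's memory."""
    if gpu_mem_gib <= 0:
        raise ValueError("gpu_mem_gib must be a positive integer")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Kubernetes extended resources are declared under limits
                    # and must be whole numbers.
                    "resources": {"limits": {GPU_MEM_RESOURCE: gpu_mem_gib}},
                }
            ]
        },
    }


if __name__ == "__main__":
    # Hypothetical pod name and image, for illustration only.
    pod = shared_gpu_pod("cgpu-demo", "example.com/train:latest", 3)
    print(json.dumps(pod, indent=2))
```

Several pods built this way could be scheduled to the same GPU as long as their combined memory requests fit on it.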
Features
- High compatibility: cGPU is compatible with standard open source solutions, such as Kubernetes and NVIDIA Docker.
- Ease of use: You do not need to recompile AI applications or rebuild container images to replace Compute Unified Device Architecture (CUDA) libraries.
- Stability: cGPU works on the underlying operations of NVIDIA GPUs, which are stable. By contrast, the API operations of CUDA libraries change frequently, and some private API operations of CUDA Deep Neural Network (cuDNN) are difficult to intercept.
- Resource isolation: cGPU isolates the allocated GPU memory and computing capacity.
cGPU provides a cost-effective, reliable, and user-friendly solution that allows you to enable GPU scheduling and memory isolation.
Benefits of cGPU Professional Edition
Benefit | Description |
---|---|
Supports GPU sharing, scheduling, and memory isolation. | |
Supports flexible GPU sharing and memory isolation policies. | |
Supports comprehensive monitoring of GPU resources. | Supports monitoring of both exclusive GPUs and shared GPUs. |
Comparison between cGPU Basic Edition and cGPU Professional Edition
Feature | cGPU Professional Edition | cGPU Basic Edition |
---|---|---|
GPU sharing and scheduling on one GPU | Supported | Supported |
GPU sharing and scheduling on multiple GPUs | Supported | Not supported |
Memory isolation on one GPU | Supported | Supported |
Memory isolation on multiple GPUs | Supported | Not supported |
Monitoring and auto scaling of exclusive GPUs and shared GPUs | Supported | Supported |
Node pools that support flexible policy configurations | Supported. Allows you to create different GPU policies for a node pool. You can enable GPU sharing with or without memory isolation for a node pool. | Supported. You can configure different GPU policies for a node pool. You can enable GPU sharing with or without memory isolation for a node pool. In addition, you can use the binpack or spread algorithm to allocate GPUs. |
Allocate GPU memory to pods by using algorithms | Supported. GPUs can be allocated by using the binpack and spread algorithms. You can choose binpack or spread to meet your business requirements. | Supported. By default, GPUs are allocated by using the binpack algorithm. |
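
The difference between the binpack and spread algorithms in the preceding table can be sketched as follows. This is a minimal model under simplifying assumptions, not the ACK scheduler: GPUs are represented only by their free memory in GiB, and the function names `pick_gpu` and `place_all` are hypothetical.

```python
from typing import List, Optional


def pick_gpu(free_gib: List[int], request_gib: int, policy: str = "binpack") -> Optional[int]:
    """Return the index of the GPU to place a request on, or None if none fits."""
    candidates = [i for i, free in enumerate(free_gib) if free >= request_gib]
    if not candidates:
        return None
    if policy == "binpack":
        # Pack tightly: choose the GPU with the least free memory that still fits.
        return min(candidates, key=lambda i: free_gib[i])
    if policy == "spread":
        # Balance load: choose the GPU with the most free memory.
        return max(candidates, key=lambda i: free_gib[i])
    raise ValueError(f"unknown policy: {policy}")


def place_all(total_gib: List[int], requests: List[int], policy: str) -> List[Optional[int]]:
    """Place requests in order and return the chosen GPU index for each one."""
    free = list(total_gib)  # copy so the caller's list is not modified
    placements: List[Optional[int]] = []
    for req in requests:
        gpu = pick_gpu(free, req, policy)
        if gpu is not None:
            free[gpu] -= req
        placements.append(gpu)
    return placements


if __name__ == "__main__":
    print(place_all([16, 16], [4, 4, 4], "binpack"))
    print(place_all([16, 16], [4, 4, 4], "spread"))
```

In this model, with two 16 GiB GPUs and three 4 GiB requests, binpack places all three requests on GPU 0 and leaves GPU 1 free for a large job, whereas spread alternates between the two GPUs to reduce contention on either one.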
Usage notes
- Dedicated Kubernetes clusters: cGPU Basic Edition. For more information, see the following topics:
- Professional Kubernetes clusters: cGPU Professional Edition. For more information, see the following topics:
  - Install and use ack-ai-installer and the GPU inspection tool
  - Enable GPU sharing
  - Use cGPU to achieve GPU sharing based on multiple GPUs
  - Use node pools to control cGPU
- If you migrate workloads from a dedicated Kubernetes cluster installed with cGPU Basic Edition to a professional managed Kubernetes cluster, you must upgrade to cGPU Professional Edition in the professional managed Kubernetes cluster after the migration is completed. For more information, see Upgrade cGPU Basic Edition to cGPU Professional Edition in an ACK Pro cluster.
- ACK edge clusters: cGPU Professional Edition. For more information, see the following topics: