This topic introduces cGPU and describes the benefits of cGPU Professional Edition by comparing it with cGPU Basic Edition.
Benefits of cGPU Professional Edition
|Supports graphics processing unit (GPU) sharing, scheduling, and memory isolation.||
|Supports flexible GPU sharing and memory isolation policies.||
|Supports comprehensive monitoring of GPU resources.||Supports monitoring of both exclusive GPUs and shared GPUs.|
Comparison between cGPU Basic Edition and cGPU Professional Edition
|Feature||cGPU Professional Edition||cGPU Basic Edition|
|GPU sharing and scheduling on one GPU||Supported||Supported|
|GPU sharing and scheduling on multiple GPUs||Supported||Not supported|
|Memory isolation on one GPU||Supported||Supported|
|Memory isolation on multiple GPUs||Supported||Not supported|
|Monitoring and auto scaling of exclusive GPUs and shared GPUs||Supported||Supported|
|Node pools that support flexible policy configurations||Supported. Allows you to create different GPU policies for a node pool. You can enable GPU sharing with or without memory isolation for a node pool.||Supported. You can configure different GPU policies for a node pool. You can enable GPU sharing with or without memory isolation for a node pool. In addition, you can use the binpack or spread algorithm to allocate GPUs.|
|Allocation of GPU memory to pods by using algorithms||Supported. GPUs can be allocated by using the binpack or spread algorithm. Choose the algorithm that meets your business requirements.||Supported. By default, GPUs are allocated by using the binpack algorithm.|
- cGPU Basic Edition is the edition that you use after you install ack-ai-installer in a dedicated Kubernetes cluster that has GPU-accelerated nodes. For more information, see Install the cGPU component.
- cGPU Professional Edition is the edition that you use after you install ack-ai-installer in a professional Kubernetes cluster that has GPU-accelerated nodes. For more information, see Install and use ack-ai-installer and the GPU scheduling inspection tool.
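The difference between the binpack and spread algorithms named in the comparison table can be illustrated with a minimal sketch. It is not the ACK scheduler: the function names `pick_gpu` and `schedule`, the GiB units, and the tie-breaking rule (lowest GPU index) are assumptions made for illustration only. Binpack is modeled as "prefer the GPU with the least free memory that still fits the request", and spread as "prefer the GPU with the most free memory".

```python
from typing import Dict, List, Optional


def pick_gpu(free_mem: Dict[int, int], request: int, policy: str = "binpack") -> Optional[int]:
    """Return the index of the GPU that should host a request, or None if none fits.

    free_mem maps a GPU index to its free memory in GiB (hypothetical units).
    """
    # Only GPUs with enough free memory are candidates.
    candidates = {gpu: mem for gpu, mem in free_mem.items() if mem >= request}
    if not candidates:
        return None
    if policy == "binpack":
        # Fill one GPU before moving on: least free memory wins, ties go to the lowest index.
        return min(candidates, key=lambda gpu: (candidates[gpu], gpu))
    if policy == "spread":
        # Balance the load: most free memory wins, ties go to the lowest index.
        return min(candidates, key=lambda gpu: (-candidates[gpu], gpu))
    raise ValueError(f"unknown policy: {policy}")


def schedule(requests: List[int], gpu_count: int, gpu_mem: int, policy: str) -> List[Optional[int]]:
    """Place each request in order and return the chosen GPU index per request."""
    free = {i: gpu_mem for i in range(gpu_count)}
    placement = []
    for req in requests:
        gpu = pick_gpu(free, req, policy)
        placement.append(gpu)
        if gpu is not None:
            free[gpu] -= req
    return placement


if __name__ == "__main__":
    # Four 4 GiB requests on two 16 GiB GPUs.
    print("binpack:", schedule([4, 4, 4, 4], gpu_count=2, gpu_mem=16, policy="binpack"))
    print("spread: ", schedule([4, 4, 4, 4], gpu_count=2, gpu_mem=16, policy="spread"))
```

In this sketch, binpack keeps the second GPU completely free for a later large request, whereas spread distributes the pods evenly so that fewer pods are affected if one GPU fails.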
GPU sharing solution by Alibaba Cloud
A key requirement of GPU sharing among multiple pods is to isolate the GPU memory and computing power that are allocated to each pod. When you run multiple containers on one GPU, GPU resources are allocated to each container as required. However, if one container occupies excessive GPU resources, the performance of the other containers may be affected. The computing industry has developed many solutions to this issue. Technologies such as NVIDIA virtual GPU (vGPU), NVIDIA Multi-Process Service (MPS), rCUDA, and vCUDA all contribute to fine-grained GPU resource allocation.
cGPU, the GPU sharing solution by Alibaba Cloud, has the following advantages:
- High compatibility: cGPU is compatible with standard open source solutions, such as Kubernetes and NVIDIA Docker.
- Ease of use: cGPU adopts a user-friendly design. You do not need to recompile an artificial intelligence (AI) application, replace its Compute Unified Device Architecture (CUDA) library, or create a new container image.
- Stability: cGPU works on the underlying operations of NVIDIA GPUs, which are stable. In contrast, the API operations of CUDA libraries change frequently, and some private API operations of cuDNN are difficult to intercept.
- Resource isolation: cGPU ensures that the GPU memory and computing capacity allocated to one container do not affect those allocated to other containers.
Based on cGPU, ACK can schedule multiple tasks to one GPU and isolate the GPU memory of each task, both at the level of scheduled Kubernetes resources and in the container runtime. This provides low-cost, reliable, and user-friendly GPU sharing and memory isolation for large-scale business.
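The memory isolation requirement described in this section can be pictured with a toy model. The sketch below is not how cGPU is implemented; the `SharedGPU` class, its method names, and the GiB units are hypothetical. It only shows the intended behavior: each container receives a fixed quota on a shared GPU, and a container that asks for more than its quota is refused instead of taking memory from its neighbors.

```python
class SharedGPU:
    """Toy model of one GPU shared by containers with fixed memory quotas (GiB)."""

    def __init__(self, total_mem: int):
        self.total_mem = total_mem
        self.quota = {}  # container name -> memory quota
        self.used = {}   # container name -> memory currently in use

    def attach(self, container: str, quota: int) -> None:
        """Reserve a quota for a container on this GPU."""
        # The quotas together may not exceed the physical memory of the GPU.
        if sum(self.quota.values()) + quota > self.total_mem:
            raise ValueError("not enough GPU memory left for this quota")
        self.quota[container] = quota
        self.used[container] = 0

    def allocate(self, container: str, amount: int) -> bool:
        """Try to allocate memory for a container; return False if the quota would be exceeded."""
        # A container can only consume memory from its own quota.
        if self.used[container] + amount > self.quota[container]:
            return False
        self.used[container] += amount
        return True


if __name__ == "__main__":
    gpu = SharedGPU(total_mem=16)
    gpu.attach("pod-a", quota=8)
    gpu.attach("pod-b", quota=8)
    print(gpu.allocate("pod-a", 8))  # pod-a uses its whole quota
    print(gpu.allocate("pod-a", 1))  # refused: pod-a is at its limit
    print(gpu.allocate("pod-b", 8))  # pod-b is unaffected by pod-a
```

Without the per-container check in `allocate`, the first container could consume all 16 GiB and starve the second, which is the interference problem that memory isolation is meant to prevent.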