cGPU is an Alibaba Cloud module that provides GPU memory and computing power isolation, allowing multiple containers to share a single GPU without interfering with each other's resources. This topic covers compatibility requirements and known issues to check before deploying or upgrading cGPU.
Prerequisites
A GPU node has cGPU isolation enabled if it carries any of the following labels:
ack.node.gpu.schedule=cgpu
ack.node.gpu.schedule=core_mem
cgpu=true
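In a live cluster you would typically read these labels with `kubectl get node <name> --show-labels`. As a minimal sketch, the check itself can be written as a shell function; the label names come from the list above, while the comma-separated input format (kubectl's `--show-labels` output) is an assumption for illustration:

```shell
# Return 0 (true) if the given node-label string enables cGPU isolation.
# Input is assumed to be in kubectl's comma-separated --show-labels form.
has_cgpu_isolation() {
  case ",$1," in
    *",ack.node.gpu.schedule=cgpu,"* | *",ack.node.gpu.schedule=core_mem,"* | *",cgpu=true,"*)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

# Example: check a label string as printed by `kubectl get node --show-labels`
has_cgpu_isolation "kubernetes.io/os=linux,ack.node.gpu.schedule=cgpu" && echo "cgpu enabled"
```

The surrounding commas ensure only whole labels match, so an unrelated label such as `my.cgpu=true2` is not mistaken for `cgpu=true`.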
For the version mapping between the ack-ai-installer and cGPU components, see the ack-ai-installer release notes.
For more information about cGPU, see the Alibaba Cloud cGPU documentation.
Limitations
Maximum pods per GPU: cGPU supports a maximum of 20 pods on a single GPU. Pods scheduled to a GPU beyond this limit cannot run.
Container-optimized OS compatibility: cGPU versions 1.5.18 and earlier can cause the first pod on a cGPU node to fail to start when using an Alibaba Cloud container-optimized OS image. Upgrade to ack-ai-installer ≥ 1.12.6 to resolve this.
Kernel panic risk: cGPU versions 1.5.7 and earlier have a known deadlock in the kernel driver that causes Linux kernel panics. Upgrade to cGPU ≥ 1.5.10.
cGPU version compatibility
NVIDIA driver compatibility
cGPU version | Supported NVIDIA drivers | Not supported |
1.5.3 – 1.5.20 | 460, 470, 510, 515, 525, 535, 550, 560, 565, 570, 575 series | — |
1.0.5 – 1.5.2 | 460; 470 ≤ 470.161.03; 510 ≤ 510.108.03; 515 ≤ 515.86.01; 525 ≤ 525.89.03 | 535, 550, 560, 565, 570, 575 series |
0.8.13 – 1.0.3 | 460; 470 ≤ 470.161.03 | 510, 515, 525, 535, 550, 560, 565, 570, 575 series |
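Bounds such as "470 ≤ 470.161.03" mean the 470 driver series is supported only up to that patch release, so checking a node requires a numeric (not lexicographic) version comparison. A minimal sketch using GNU `sort -V` (the helper name and usage are illustrative, not part of cGPU):

```shell
# version_le A B: true if dotted version A <= B.
# Relies on version sort (`sort -V`, GNU coreutils).
version_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

# Example: is driver 470.129.06 within the "470 <= 470.161.03" bound
# listed for cGPU 1.0.5 - 1.5.2?
if version_le "470.129.06" "470.161.03"; then
  echo "driver within supported bound"
fi
```

On a node, the installed driver version can be read from `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and fed to the same comparison.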
Instance family compatibility
cGPU version | Supported instance families | Not supported |
1.5.19 – 1.5.20 | gn6i/gn6e/gn6v/gn6t/ebmgn6i/ebmgn6t/ebmgn6e; gn7i/gn7/gn7e/ebmgn7i/ebmgn7e; gn8t/ebmgn8t; gn8is/gn8v/ebmgn8is/ebmgn8v; gn8ia/ebmgn8ia; ebmgn9t | — |
1.5.9 – 1.5.18 | gn6i/gn6e/gn6v/gn6t/ebmgn6i/ebmgn6t/ebmgn6e; gn7i/gn7/gn7e/ebmgn7i/ebmgn7e; gn8t/ebmgn8t; gn8is/gn8v/ebmgn8is/ebmgn8v; gn8ia/ebmgn8ia | ebmgn9t |
1.5.7 – 1.5.8 | gn6i/gn6e/gn6v/gn6t/ebmgn6i/ebmgn6t/ebmgn6e; gn7i/gn7/gn7e/ebmgn7i/ebmgn7e; gn8t/ebmgn8t; gn8is/gn8v/ebmgn8is/ebmgn8v | gn8ia/ebmgn8ia; ebmgn9t |
1.5.5 – 1.5.6 | gn6i/gn6e/gn6v/gn6t/ebmgn6i/ebmgn6t/ebmgn6e; gn7i/gn7/gn7e/ebmgn7i/ebmgn7e; gn8t/ebmgn8t | gn8is/gn8v/ebmgn8is/ebmgn8v; gn8ia/ebmgn8ia; ebmgn9t |
1.0.3 – 1.5.3 | gn6i/gn6e/gn6v/gn6t/ebmgn6i/ebmgn6t/ebmgn6e; gn7i/gn7/gn7e/ebmgn7i/ebmgn7e | gn8t/ebmgn8t; gn8is/gn8v/ebmgn8is/ebmgn8v; gn8ia/ebmgn8ia; ebmgn9t |
0.8.13 – 0.8.17 | gn6i/gn6e/gn6v/gn6t/ebmgn6i/ebmgn6t/ebmgn6e | gn7i/gn7/gn7e/ebmgn7i/ebmgn7e; gn8t/ebmgn8t; gn8is/gn8v/ebmgn8is/ebmgn8v; gn8ia/ebmgn8ia; ebmgn9t |
nvidia-container-toolkit compatibility
cGPU version | Supported nvidia-container-toolkit |
1.0.10 – 1.5.20 | ≤ 1.17 |
0.8.13 – 1.0.9 | ≤ 1.10 only (1.11 – 1.17 not supported) |
Kernel version compatibility
cGPU version | Supported kernel versions |
1.5.9 – 1.5.20 | 3.x, 4.x, 5.x ≤ 5.15 |
1.5.3 – 1.5.8 | 3.x, 4.x, 5.x ≤ 5.10 |
1.0.3 – 1.5.2 | 3.x, 4.x, 5.x ≤ 5.1 |
0.8.17 | 3.x, 4.x, 5.x ≤ 5.0 |
0.8.10 – 0.8.13 | 3.x, 4.x (5.x not supported) |
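The kernel ceiling can be checked against a node's `uname -r` output. A minimal sketch for the 3.x/4.x/5.x ≤ 5.15 range supported by cGPU 1.5.9 – 1.5.20 (the function name is illustrative; it only parses the major and minor components of the release string):

```shell
# Check whether a kernel release (as printed by `uname -r`) falls in the
# 3.x / 4.x / 5.x (<= 5.15) range supported by cGPU 1.5.9 - 1.5.20.
kernel_supported() {
  major=${1%%.*}          # e.g. "5" from "5.10.134-16.al8.x86_64"
  rest=${1#*.}
  minor=${rest%%.*}       # e.g. "10"
  case "$major" in
    3|4) return 0 ;;
    5)   [ "$minor" -le 15 ] ;;
    *)   return 1 ;;
  esac
}

if kernel_supported "$(uname -r)"; then
  echo "kernel in supported range for cGPU 1.5.9 - 1.5.20"
fi
```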
Troubleshooting
A Linux kernel panic occurs when using cGPU
A deadlock in the cGPU kernel driver causes concurrent processes to block each other, resulting in a Linux kernel panic. This affects cGPU versions 1.5.7 and earlier.
Upgrade to cGPU ≥ 1.5.10 to prevent kernel errors in new services. For instructions, see Upgrade the cGPU version of a node.
A cGPU pod fails to start on a container-optimized OS node
When using an Alibaba Cloud container-optimized OS image, the first cGPU pod on a node may fail to start with an error similar to:
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 2, stdout: , stderr: Auto-detected mode as 'legacy': unknown
This affects cGPU versions 1.5.18 and earlier. Upgrade to ack-ai-installer ≥ 1.12.6. For instructions, see Upgrade the shared GPU scheduling component.
A modprobe error occurs when creating a cGPU pod
If you see either of the following errors when creating a cGPU pod:
modprobe: ERROR: could not insert 'cgpu_procfs': Operation not permitted
modprobe: ERROR: could not insert 'km': Operation not permitted
The operating system version is not compatible with cGPU. Upgrade the cGPU component to the latest version. For instructions, see Upgrade the shared GPU scheduling component.
A cGPU pod container fails to create or exits due to timeout
This is caused by an incompatibility between cGPU versions 1.0.10 and earlier and nvidia-container-toolkit versions 1.11 and later. Upgrade the cGPU component to the latest version. For instructions, see Upgrade the shared GPU scheduling component.
The "Error occurs when creating cGPU instance: unknown" error appears
cGPU supports a maximum of 20 pods per GPU. If the number of pods scheduled to a GPU exceeds this limit, subsequent pods cannot run and this error appears. Keep the number of pods per GPU at 20 or fewer.
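The per-GPU pod count can be audited before it hits the limit. A minimal sketch: given lines of "pod-name gpu-index" (in practice derived from your scheduler's allocation records; the input format and helper name are assumptions for illustration), flag any GPU carrying more than 20 pods:

```shell
# Read "<pod-name> <gpu-index>" lines on stdin; print any GPU whose pod
# count exceeds the 20-pod cGPU limit and exit non-zero if one is found.
check_gpu_pod_limit() {
  awk '{ n[$2]++ }
       END {
         for (g in n)
           if (n[g] > 20) { print "GPU " g ": " n[g] " pods (over limit)"; bad = 1 }
         exit bad
       }'
}

# Example usage with two pods on GPU 0:
printf 'pod-a 0\npod-b 0\n' | check_gpu_pod_limit && echo "all GPUs within limit"
```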
The "Failed to initialize NVML" error appears when running nvidia-smi
Running nvidia-smi in a pod that uses shared GPU scheduling resources returns:
Failed to initialize NVML: GPU access blocked by operating system
This is caused by an incompatibility between cGPU versions 1.5.2 and earlier and GPU driver versions released after July 2023. For a list of GPU driver release dates, see GPU Driver Release Dates. For the default GPU driver versions supported by ACK cluster versions, see List of NVIDIA driver versions supported by ACK.
Upgrade the cGPU component to the latest version. For instructions, see Upgrade the shared GPU scheduling component.