
Container Service for Kubernetes: cGPU FAQ

Last Updated: Mar 31, 2026

cGPU is an Alibaba Cloud module that provides GPU memory and computing power isolation, allowing multiple containers to share a single GPU without interfering with each other's resources. This topic covers compatibility requirements and known issues to check before deploying or upgrading cGPU.

Prerequisites

A GPU node has cGPU isolation enabled if it carries any of the following labels (a quick check command follows this list):

  • ack.node.gpu.schedule=cgpu

  • ack.node.gpu.schedule=core_mem

  • cgpu=true
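
To verify which nodes carry these labels, you can query them with kubectl. A minimal sketch, using the label keys from the list above (<NODE_NAME> is a placeholder):

# List the nodes that carry each cGPU isolation label.
kubectl get nodes -l ack.node.gpu.schedule=cgpu
kubectl get nodes -l ack.node.gpu.schedule=core_mem
kubectl get nodes -l cgpu=true

# Or print one node's labels and filter for the cGPU keys.
kubectl get node <NODE_NAME> --show-labels | grep -E 'ack.node.gpu.schedule|cgpu'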

For the version mapping between the ack-ai-installer and cGPU components, see the ack-ai-installer release notes.

For more information about cGPU, see the official cGPU documentation.

Limitations

  • Maximum pods per GPU: cGPU supports a maximum of 20 pods on a single GPU. Pods scheduled to a GPU beyond this limit cannot run.

  • Container-optimized OS compatibility: cGPU versions 1.5.18 and earlier can cause the first pod on a cGPU node to fail to start when the node uses an Alibaba Cloud container-optimized OS image (a check command follows this list). Upgrade to ack-ai-installer ≥ 1.12.6 to resolve this.

  • Kernel panic risk: cGPU versions 1.5.7 and earlier have a known deadlock in the kernel driver that causes Linux kernel panics. Upgrade to cGPU ≥ 1.5.10.
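
For the container-optimized OS limitation above, you can check which OS image each node runs before upgrading. A minimal check using standard kubectl output:

# The OS-IMAGE column shows each node's operating system image.
kubectl get nodes -o wide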

cGPU version compatibility

NVIDIA driver compatibility

| cGPU version | Supported NVIDIA drivers | Not supported |
| --- | --- | --- |
| 1.5.3 – 1.5.20 | 460, 470, 510, 515, 525, 535, 550, 560, 565, 570, 575 series | None |
| 1.0.5 – 1.5.2 | 460; 470 ≤ 470.161.03; 510 ≤ 510.108.03; 515 ≤ 515.86.01; 525 ≤ 525.89.03 | 535, 550, 560, 565, 570, 575 series |
| 0.8.13 – 1.0.3 | 460; 470 ≤ 470.161.03 | 510, 515, 525, 535, 550, 560, 565, 570, 575 series |
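
To match a node against this table, you can query the installed NVIDIA driver version directly on the GPU node. A minimal check:

# Prints only the driver version, for example 535.161.08 (example value).
nvidia-smi --query-gpu=driver_version --format=csv,noheader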

Instance family compatibility

| cGPU version | Supported instance families | Not supported |
| --- | --- | --- |
| 1.5.19 – 1.5.20 | gn6i/gn6e/gn6v/gn6t/ebmgn6i/ebmgn6t/ebmgn6e; gn7i/gn7/gn7e/ebmgn7i/ebmgn7e; gn8t/ebmgn8t; gn8is/gn8v/ebmgn8is/ebmgn8v; gn8ia/ebmgn8ia; ebmgn9t | None |
| 1.5.9 – 1.5.18 | gn6i/gn6e/gn6v/gn6t/ebmgn6i/ebmgn6t/ebmgn6e; gn7i/gn7/gn7e/ebmgn7i/ebmgn7e; gn8t/ebmgn8t; gn8is/gn8v/ebmgn8is/ebmgn8v; gn8ia/ebmgn8ia | ebmgn9t |
| 1.5.7 – 1.5.8 | gn6i/gn6e/gn6v/gn6t/ebmgn6i/ebmgn6t/ebmgn6e; gn7i/gn7/gn7e/ebmgn7i/ebmgn7e; gn8t/ebmgn8t; gn8is/gn8v/ebmgn8is/ebmgn8v | gn8ia/ebmgn8ia; ebmgn9t |
| 1.5.5 – 1.5.6 | gn6i/gn6e/gn6v/gn6t/ebmgn6i/ebmgn6t/ebmgn6e; gn7i/gn7/gn7e/ebmgn7i/ebmgn7e; gn8t/ebmgn8t | gn8is/gn8v/ebmgn8is/ebmgn8v; gn8ia/ebmgn8ia; ebmgn9t |
| 1.0.3 – 1.5.3 | gn6i/gn6e/gn6v/gn6t/ebmgn6i/ebmgn6t/ebmgn6e; gn7i/gn7/gn7e/ebmgn7i/ebmgn7e | gn8t/ebmgn8t; gn8is/gn8v/ebmgn8is/ebmgn8v; gn8ia/ebmgn8ia; ebmgn9t |
| 0.8.13 – 0.8.17 | gn6i/gn6e/gn6v/gn6t/ebmgn6i/ebmgn6t/ebmgn6e | gn7i/gn7/gn7e/ebmgn7i/ebmgn7e; gn8t/ebmgn8t; gn8is/gn8v/ebmgn8is/ebmgn8v; gn8ia/ebmgn8ia; ebmgn9t |
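
To find the instance family of each GPU node, you can print the standard Kubernetes instance-type label; the family is the prefix of the instance type (for example, an ecs.gn7i-... instance type belongs to the gn7i family). A minimal check:

# Shows an INSTANCE-TYPE column next to each node.
kubectl get nodes -L node.kubernetes.io/instance-type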

nvidia-container-toolkit compatibility

| cGPU version | Supported nvidia-container-toolkit versions |
| --- | --- |
| 1.0.10 – 1.5.20 | ≤ 1.17 |
| 0.8.13 – 1.0.9 | ≤ 1.10 only (1.11 – 1.17 not supported) |
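
To check the nvidia-container-toolkit version installed on a node, you can query the node's package manager (this sketch assumes the standard package name; adjust for your OS):

# RPM-based systems (for example, Alibaba Cloud Linux):
rpm -q nvidia-container-toolkit
# Debian/Ubuntu-based systems:
dpkg -s nvidia-container-toolkit | grep Version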

Kernel version compatibility

| cGPU version | Supported kernel versions |
| --- | --- |
| 1.5.9 – 1.5.20 | 3.x, 4.x, 5.x ≤ 5.15 |
| 1.5.3 – 1.5.8 | 3.x, 4.x, 5.x ≤ 5.10 |
| 1.0.3 – 1.5.2 | 3.x, 4.x, 5.x ≤ 5.1 |
| 0.8.17 | 3.x, 4.x, 5.x ≤ 5.0 |
| 0.8.10 – 0.8.13 | 3.x, 4.x (5.x not supported) |
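
To compare a node's kernel against this table, print the kernel release on the node:

# Prints the kernel release, for example 5.10.134-16.al8.x86_64 (example value).
uname -r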

Troubleshooting

A Linux kernel panic occurs when using cGPU

A deadlock in the cGPU kernel driver causes concurrent processes to block each other, resulting in a Linux kernel panic. This affects cGPU versions 1.5.7 and earlier.

Upgrade to cGPU ≥ 1.5.10 to prevent kernel errors in new services. For instructions, see Upgrade the cGPU version of a node.

A cGPU pod fails to start on a container-optimized OS node

When using an Alibaba Cloud container-optimized OS image, the first cGPU pod on a node may fail to start with an error similar to:

Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 2, stdout: , stderr: Auto-detected mode as 'legacy': unknown

This affects cGPU versions 1.5.18 and earlier. Upgrade to ack-ai-installer ≥ 1.12.6. For instructions, see Upgrade the shared GPU scheduling component.

A modprobe error occurs when creating a cGPU pod

If you see either of the following errors when creating a cGPU pod:

modprobe: ERROR: could not insert 'cgpu_procfs': Operation not permitted
modprobe: ERROR: could not insert 'km': Operation not permitted

These errors indicate that the node's operating system version is not compatible with the installed cGPU version. Upgrade the cGPU component to the latest version. For instructions, see Upgrade the shared GPU scheduling component.
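
To confirm whether the cGPU kernel modules were loaded on the node, you can list the loaded modules (the module names come from the error messages above; naming may vary by cGPU version):

# Lists loaded kernel modules whose names contain "cgpu".
lsmod | grep cgpu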

A cGPU pod container fails to create or exits due to timeout

This is caused by an incompatibility between cGPU versions earlier than 1.0.10 and nvidia-container-toolkit versions 1.11 and later. Upgrade the cGPU component to the latest version. For instructions, see Upgrade the shared GPU scheduling component.

The "Error occurs when creating cGPU instance: unknown" error appears

cGPU supports a maximum of 20 pods per GPU. If the number of pods scheduled to a GPU exceeds this limit, subsequent pods cannot run and this error appears. Keep the number of pods per GPU at 20 or fewer.
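
To see how many pods are scheduled to a node, you can filter pods by node name (<NODE_NAME> is a placeholder; how pods map to individual GPUs depends on your GPU sharing configuration):

kubectl get pods --all-namespaces --field-selector spec.nodeName=<NODE_NAME> -o wide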

The "Failed to initialize NVML" error appears when running nvidia-smi

Running nvidia-smi in a pod that uses shared GPU scheduling resources returns:

Failed to initialize NVML: GPU access blocked by operating system

This is caused by an incompatibility between cGPU versions 1.5.2 and earlier and GPU driver versions released after July 2023. For a list of GPU driver release dates, see GPU Driver Release Dates. For the default GPU driver versions supported by ACK cluster versions, see List of NVIDIA driver versions supported by ACK.

Upgrade the cGPU component to the latest version. For instructions, see Upgrade the shared GPU scheduling component.