Container Service for Kubernetes:Usage notes for the memory isolation capability of cGPU

Last Updated:Apr 17, 2025

cGPU is a GPU memory and computing power isolation module developed by Alibaba Cloud. It ensures that when multiple containers share a single GPU, the memory and compute resources used by each container do not interfere with one another. This topic answers some frequently asked questions (FAQs) about using cGPU.

Before you begin

Before you use cGPU, take note of the following items:

  • If the GPU-accelerated nodes in your cluster have the ack.node.gpu.schedule=cgpu, ack.node.gpu.schedule=core_mem, or cgpu=true label, the cGPU isolation feature is enabled on those nodes. You can verify these labels with the command shown after this list.

  • See the release notes for ack-ai-installer to find the mapping between ack-ai-installer versions and cGPU versions.

  • For more information about cGPU, see the cGPU documentation.
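
To confirm which nodes have the isolation feature enabled, you can query these labels directly. The following example only assumes that kubectl is configured for your ACK cluster:

# Show each node with the cGPU-related label values as extra columns
kubectl get nodes -L ack.node.gpu.schedule -L cgpu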

FAQs

What should I do if modprobe: ERROR: could not insert 'cgpu_procfs': Operation not permitted occurs when I create a cGPU pod?

The following error message is returned:

Error: failed to create containerd task: failed to create shim: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 2, stdout: , stderr: modprobe: ERROR: could not insert 'cgpu_procfs': Operation not permitted modprobe: ERROR: could not insert 'cgpu_procfs': Operation not permitted Auto-detected mode as 'legacy': unknown

This error typically indicates incompatibility between the OS version and the cGPU driver requirements. To resolve it, update the GPU sharing component to the latest version.
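
Before you update, you can check the node's kernel version and whether the cGPU kernel modules are loaded. This is only a quick diagnostic; the module names shown are examples and may differ across cGPU versions:

# Print the kernel version of the GPU node
uname -r
# Check whether cGPU-related kernel modules, such as cgpu_km or cgpu_procfs, are loaded
lsmod | grep cgpu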

What should I do if a Linux Kernel Panic occurs when I use cGPU?

If cGPU 1.5.7 is installed, a deadlock may occur in the cGPU kernel driver, in which concurrent processes block one another and trigger a Linux kernel panic. To prevent this issue, we recommend that you install or update to cGPU 1.5.10 or later. For more information about how to update, see Update the cGPU version on a node.
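
Whether a node is affected depends on the cGPU version installed on it. As a rough check, assuming the cGPU kernel module on the node is named cgpu_km and exposes a version field, you can run the following command on the node:

# Print kernel module information and keep only the version field, if the module provides one
modinfo cgpu_km | grep -i version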

What should I do if a Failed to initialize NVML error occurs when I execute nvidia-smi in a cGPU pod?

If you install cGPU version 1.5.2 or earlier with a driver version released after July 2023, you may encounter incompatibility issues between the cGPU and GPU driver versions. To verify the release date of your GPU driver, find it in the Linux AMD64 Display Driver Archive. For a list of default GPU driver versions supported by each ACK cluster type, see NVIDIA driver versions supported by ACK.
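
To find out which GPU driver version a node runs before you look up its release date, you can query the driver directly on the node:

# Print only the installed NVIDIA driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader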

After a pod that requests shared GPU resources enters the Running state, you can run the nvidia-smi command in the pod to check whether the following output is returned:

Failed to initialize NVML: GPU access blocked by operating system

If this output is returned, update the GPU sharing component to the latest version to resolve the issue.
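
For reference, the same check can be run from outside the pod. The namespace and pod name below are placeholders:

# Run nvidia-smi inside the GPU-sharing pod
kubectl exec -n <namespace> <pod-name> -- nvidia-smi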

What should I do if a failure or timeout occurs when I create a container for a cGPU pod?

If you install a cGPU version earlier than 1.0.10 together with NVIDIA Container Toolkit 1.11 or later, container creation may fail or time out.

To resolve this issue, update the GPU sharing component to the latest version.
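
To confirm which NVIDIA Container Toolkit version is installed on a node, you can run one of the following commands on the node. The nvidia-ctk command ships only with newer toolkit releases, so it may be missing on older nodes:

# Print the NVIDIA Container Toolkit (libnvidia-container) version
nvidia-container-cli --version
# Newer releases also provide the nvidia-ctk command
nvidia-ctk --version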

Why do I see an Error occurs when creating cGPU instance: unknown message during pod creation?

To maintain performance, cGPU allows a maximum of 20 pods per GPU card. If this per-card limit is exceeded, subsequent pods scheduled to the same GPU card fail to initialize and return the following error: Error occurs when creating cGPU instance: unknown (Code: 500).
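
To see how many pods are already scheduled to a node before this error appears, you can list the pods on that node. The node name is a placeholder; if the kubectl-inspect-cgpu plugin provided for GPU sharing in ACK is installed, it additionally shows the allocation per GPU card:

# List all pods scheduled to a specific GPU node
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>
# If the kubectl-inspect-cgpu plugin is installed, show GPU sharing allocation
kubectl inspect cgpu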