
Container Service for Kubernetes:Disable the memory isolation feature of cGPU

Last Updated: Mar 26, 2026

cGPU's GPU sharing scheduler allocates a fixed amount of GPU memory to each container and enforces that limit at runtime. When memory isolation is disabled, the container can see and use the full physical GPU memory of the node, while the scheduler still tracks the allocated quota for scheduling purposes.

Disable memory isolation when a workload requires the full GPU memory at startup — for example, when a deep learning framework like TensorFlow probes available GPU memory during initialization and pre-allocates it. Without disabling isolation, such frameworks see only the allocated quota (for example, 3 GiB) and may fail or underperform.

This procedure applies to ACK dedicated clusters and ACK Pro clusters that have the memory isolation feature of cGPU enabled.

Prerequisites

Before you begin, ensure that you have:

  • An ACK dedicated cluster or ACK Pro cluster with the memory isolation feature of cGPU enabled.

  • A kubectl client that is configured to connect to the cluster, with the plugin that provides the `kubectl inspect cgpu` command installed.

Disable memory isolation for a container

Step 1: Check current GPU sharing status

Run the following command to see how GPU memory is currently allocated across your cluster nodes:

kubectl inspect cgpu

The expected output is similar to:

NAME                      IPADDRESS      GPU0(Allocated/Total)  GPU Memory(GiB)
cn-beijing.192.16x.x.xx3  192.16x.x.xx3  0/15                   0/15
cn-beijing.192.16x.x.xx1  192.16x.x.xx1  0/15                   0/15
cn-beijing.192.16x.x.xx2  192.16x.x.xx2  0/15                   0/15
---------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
0/45 (0%)
To get detailed GPU sharing information, run kubectl inspect cgpu -d.
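The `GPU Memory(GiB)` column above can be summed per node to reproduce the cluster-level `Allocated/Total` line. As an illustrative sketch (the parsing logic below is not part of the cGPU tooling; the format is taken from the sample output above):

```python
import re

# Sample `kubectl inspect cgpu` output, copied from the step above.
sample = """\
NAME                      IPADDRESS      GPU0(Allocated/Total)  GPU Memory(GiB)
cn-beijing.192.16x.x.xx3  192.16x.x.xx3  0/15                   0/15
cn-beijing.192.16x.x.xx1  192.16x.x.xx1  0/15                   0/15
cn-beijing.192.16x.x.xx2  192.16x.x.xx2  0/15                   0/15
"""

def cluster_totals(output: str):
    """Sum the per-node GPU Memory(GiB) column into (allocated, total)."""
    allocated = total = 0
    for line in output.splitlines()[1:]:             # skip the header row
        match = re.search(r"(\d+)/(\d+)\s*$", line)  # last "a/t" pair on the line
        if match:
            allocated += int(match.group(1))
            total += int(match.group(2))
    return allocated, total
```

For the sample above, `cluster_totals(sample)` yields `(0, 45)`, matching the `0/45` cluster line in the output.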

Step 2: Deploy a container with memory isolation disabled

Create a Kubernetes Job that requests GPU memory through cGPU but opts out of memory isolation enforcement. Set the CGPU_DISABLE environment variable to "true" in the container spec.

apiVersion: batch/v1
kind: Job
metadata:
  name: disable-cgpu
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        app: disable-cgpu
    spec:
      containers:
      - name: disable-cgpu
        image: registry.cn-hangzhou.aliyuncs.com/ai-samples/gpushare-sample:tensorflow-1.5
        env:
        - name: CGPU_DISABLE   # Set to "true" to disable memory isolation
          value: "true"
        command:
        - python
        - tensorflow-sample-code/tfjob/docker/mnist/main.py
        - --max_steps=100000
        - --data_dir=tensorflow-sample-code/data
        resources:
          limits:
            aliyun.com/gpu-mem: 3   # Requests 3 GiB of GPU memory for scheduling
        workingDir: /root
      restartPolicy: Never

Key fields in this configuration:

  • `CGPU_DISABLE: "true"`: Disables memory isolation enforcement. The container sees the full physical GPU memory instead of only the allocated quota.

  • `aliyun.com/gpu-mem: 3`: Requests 3 GiB of GPU memory from the cGPU scheduler. This value is used for scheduling and allocation tracking, not for enforcement when `CGPU_DISABLE` is set.
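The interaction of the two fields can be modeled in a few lines. This is a conceptual sketch of the observable behavior described above, not cGPU internals:

```python
def visible_gpu_memory_gib(physical_gib: int, quota_gib: int,
                           cgpu_disable: bool) -> int:
    """What a container observes: with isolation enforced it sees only
    its quota; with CGPU_DISABLE="true" it sees the node's full
    physical GPU memory, while the quota is still used for scheduling."""
    return physical_gib if cgpu_disable else min(quota_gib, physical_gib)
```

For a 16 GiB GPU and a 3 GiB request, the container observes 3 GiB with isolation on and 16 GiB with isolation off.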

Step 3: Verify GPU scheduling

After the Job is scheduled, confirm that the cGPU scheduler allocated the requested GPU memory to a node:

kubectl inspect cgpu

The expected output is similar to:

NAME                      IPADDRESS      GPU0(Allocated/Total)  GPU Memory(GiB)
cn-beijing.192.16x.x.xx1  192.16x.x.xx1  0/15                   0/15
cn-beijing.192.16x.x.xx2  192.16x.x.xx2  0/15                   0/15
cn-beijing.192.16x.x.xx3  192.16x.x.xx3  3/15                   3/15
---------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
3/45 (6%)

Check these fields to confirm scheduling succeeded:

  • `GPU0(Allocated/Total)` on `cn-beijing.192.16x.x.xx3`: Shows 3/15, confirming 3 GiB was allocated from this node's 15 GiB total.

  • `Allocated/Total GPU Memory In Cluster`: Shows 3/45 (6%), confirming the cluster-level allocation is updated.
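The `6%` figure appears to be the allocated-to-total ratio with the fractional part truncated (3/45 is about 6.7%). A minimal sketch of that assumption:

```python
def allocation_percent(allocated_gib: int, total_gib: int) -> int:
    # Integer division truncates, matching the displayed 3/45 -> 6%.
    return allocated_gib * 100 // total_gib
```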

Verify that memory isolation is disabled

Use either of the following methods to confirm that the container can see the full physical GPU memory, not just the allocated quota.

Method 1: Check the application log

Query the last line of the pod's log. Replace `disable-cgpu-xxxx` with the actual pod name from `kubectl get pods`:

kubectl logs disable-cgpu-xxxx --tail=1

The expected output is similar to:

2020-08-25 08:14:54.927965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15024 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:07.0, compute capability: 7.0)

Confirm these values:

  • `15024 MB memory`: The container sees 15,024 MiB — the full physical GPU memory, not the 3 GiB quota. If memory isolation were enabled, this value would show 3 GiB.

Method 2: Run nvidia-smi in the container

kubectl exec disable-cgpu-xxxx -- nvidia-smi

The expected output is similar to:

Tue Aug 25 08:23:33 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:07.0 Off |                    0 |
| N/A   33C    P0    55W / 300W |  15453MiB / 16130MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Confirm these values:

  • `15453MiB / 16130MiB`: The container is using 15,453 MiB out of the host's 16,130 MiB total. If memory isolation were enabled, the container would be limited to 3 GiB.
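The memory-usage cell can be checked programmatically against the 3 GiB (3072 MiB) quota. The regex-based check below is an illustrative sketch over the sample row from the table above:

```python
import re

# One row of the nvidia-smi table shown above.
line = "| N/A   33C    P0    55W / 300W |  15453MiB / 16130MiB |      1%      Default |"

def memory_usage_mib(row: str):
    """Extract (used, total) MiB from an nvidia-smi memory-usage cell."""
    used, total = map(int, re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", row).groups())
    return used, total

used, total = memory_usage_mib(line)
quota_mib = 3 * 1024               # the aliyun.com/gpu-mem: 3 request
exceeds_quota = used > quota_mib   # True only if isolation is disabled
```

Usage well above `quota_mib` confirms that the container is not being held to its 3 GiB allocation.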

What's next