cGPU's GPU sharing scheduler allocates a fixed amount of GPU memory to each container and enforces that limit at runtime. When memory isolation is disabled, the container can see and use the full physical GPU memory of the node, while the scheduler still tracks the allocated quota for scheduling purposes.
Disable memory isolation when a workload requires the full GPU memory at startup — for example, when a deep learning framework like TensorFlow probes available GPU memory during initialization and pre-allocates it. Without disabling isolation, such frameworks see only the allocated quota (for example, 3 GiB) and may fail or underperform.
This procedure applies to ACK dedicated clusters and ACK Pro clusters that have the memory isolation feature of cGPU enabled.
Prerequisites
Before you begin, ensure that you have:
- The ack-cgpu component installed in your cluster. See Install ack-cgpu or Install and use ack-ai-installer and the GPU inspection tool.
Disable memory isolation for a container
Step 1: Check current GPU sharing status
Run the following command to see how GPU memory is currently allocated across your cluster nodes:
```
kubectl inspect cgpu
```

The expected output is similar to:

```
NAME                      IPADDRESS      GPU0(Allocated/Total)  GPU Memory(GiB)
cn-beijing.192.16x.x.xx3  192.16x.x.xx3  0/15                   0/15
cn-beijing.192.16x.x.xx1  192.16x.x.xx1  0/15                   0/15
cn-beijing.192.16x.x.xx2  192.16x.x.xx2  0/15                   0/15
---------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
0/45 (0%)
```

To get detailed GPU sharing information, run `kubectl inspect cgpu -d`.

Step 2: Deploy a container with memory isolation disabled
Create a Kubernetes Job that requests GPU memory through cGPU but opts out of memory isolation enforcement. Set the CGPU_DISABLE environment variable to "true" in the container spec.
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: disable-cgpu
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        app: disable-cgpu
    spec:
      containers:
      - name: disable-cgpu
        image: registry.cn-hangzhou.aliyuncs.com/ai-samples/gpushare-sample:tensorflow-1.5
        env:
        - name: CGPU_DISABLE # Set to "true" to disable memory isolation
          value: "true"
        command:
        - python
        - tensorflow-sample-code/tfjob/docker/mnist/main.py
        - --max_steps=100000
        - --data_dir=tensorflow-sample-code/data
        resources:
          limits:
            aliyun.com/gpu-mem: 3 # Requests 3 GiB of GPU memory for scheduling
        workingDir: /root
      restartPolicy: Never
```

Key fields in this configuration:
| Field | Description |
|---|---|
| `CGPU_DISABLE: "true"` | Disables memory isolation enforcement. The container sees the full physical GPU memory instead of only the allocated quota. |
| `aliyun.com/gpu-mem: 3` | Requests 3 GiB of GPU memory from the cGPU scheduler. This value is used for scheduling and allocation tracking, not for enforcement when CGPU_DISABLE is set. |
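The CGPU_DISABLE variable is consumed by the cGPU runtime, but application code in the container can read the same variable, for example to log which mode it is running in. A minimal sketch (the helper name is hypothetical; it matches the literal `"true"` used in the manifest above):

```python
import os

def cgpu_isolation_disabled() -> bool:
    """Hypothetical helper: True when the container was started with
    CGPU_DISABLE="true", the literal value set in the Job manifest."""
    return os.environ.get("CGPU_DISABLE") == "true"

# Simulate the environment the Job above injects into the container:
os.environ["CGPU_DISABLE"] = "true"
print(cgpu_isolation_disabled())  # -> True
```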
Step 3: Verify GPU scheduling
After the Job is scheduled, confirm that the cGPU scheduler allocated the requested GPU memory to a node:
```
kubectl inspect cgpu
```

The expected output is similar to:

```
NAME                      IPADDRESS      GPU0(Allocated/Total)  GPU Memory(GiB)
cn-beijing.192.16x.x.xx1  192.16x.x.xx1  0/15                   0/15
cn-beijing.192.16x.x.xx2  192.16x.x.xx2  0/15                   0/15
cn-beijing.192.16x.x.xx3  192.16x.x.xx3  3/15                   3/15
---------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
3/45 (6%)
```

Check these fields to confirm scheduling succeeded:
- `GPU0(Allocated/Total)` on `cn-beijing.192.16x.x.xx3`: Shows `3/15`, confirming 3 GiB was allocated from this node's 15 GiB total.
- `Allocated/Total GPU Memory In Cluster`: Shows `3/45 (6%)`, confirming the cluster-level allocation is updated.
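If you want to double-check the cluster-level arithmetic, the summary line is easy to parse. A small sketch, assuming the `allocated/total (pct%)` format shown in the output above:

```python
def parse_cluster_summary(line: str):
    """Split a summary line like '3/45 (6%)' into three integers:
    allocated GiB, total GiB, and the reported percentage."""
    frac, pct = line.split()
    allocated, total = (int(x) for x in frac.split("/"))
    return allocated, total, int(pct.strip("()%"))

allocated, total, pct = parse_cluster_summary("3/45 (6%)")
print(allocated, total, pct)            # -> 3 45 6
# 3/45 is 6.67%; the reported 6% suggests the tool truncates rather than rounds.
print(allocated * 100 // total == pct)  # -> True
```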
Verify that memory isolation is disabled
Use either of the following methods to confirm that the container can see the full physical GPU memory, not just the allocated quota.
Method 1: Check the application log
```
kubectl logs disable-cgpu-xxxx --tail=1
```

The expected output is similar to:

```
2020-08-25 08:14:54.927965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15024 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:07.0, compute capability: 7.0)
```

Confirm this value:

- `15024 MB memory`: The container sees 15,024 MiB, the full physical GPU memory, not the 3 GiB quota. If memory isolation were enabled, this value would show 3 GiB.
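A quick unit conversion makes the comparison concrete. The 16 GiB figure below comes from the GPU model name in the log (Tesla V100-SXM2-16GB); the exact amount TensorFlow can reserve is slightly less because of driver and runtime overhead:

```python
GIB_IN_MIB = 1024

quota_mib = 3 * GIB_IN_MIB   # the 3 GiB quota expressed in MiB
card_mib = 16 * GIB_IN_MIB   # nominal capacity of a 16 GiB V100

print(quota_mib)             # -> 3072: what the log would show with isolation enabled
print(card_mib)              # -> 16384
# The log reports 15024 MiB: close to the full card and far above the quota,
# so memory isolation is clearly disabled.
print(15024 > quota_mib)     # -> True
```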
Method 2: Run nvidia-smi in the container
```
kubectl exec disable-cgpu-xxxx -- nvidia-smi
```

The expected output is similar to:

```
Tue Aug 25 08:23:33 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:07.0 Off |                    0 |
| N/A   33C    P0    55W / 300W |  15453MiB / 16130MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
```

Confirm this value:
- `15453MiB / 16130MiB`: The container is using 15,453 MiB out of the host's 16,130 MiB total. If memory isolation were enabled, the container would be limited to 3 GiB.
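To check this programmatically, the memory column of the nvidia-smi row can be parsed. A sketch, assuming the row format shown above:

```python
import re

def parse_memory_usage(row: str):
    """Extract (used_mib, total_mib) from the memory column of an
    nvidia-smi table row, e.g. '... |  15453MiB / 16130MiB | ...'."""
    used, total = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", row).groups()
    return int(used), int(total)

row = "| N/A   33C    P0    55W / 300W |  15453MiB / 16130MiB |      1%      Default |"
used, total = parse_memory_usage(row)
print(used, total)      # -> 15453 16130
print(used > 3 * 1024)  # -> True: usage far exceeds the 3 GiB quota
```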