GPU sharing lets multiple pods run on the same GPU card in an ACK Lingjun managed cluster. Depending on whether your workloads need strict memory boundaries, choose between two modes:
| Mode | How it works | GPU memory isolated? | Use when |
|---|---|---|---|
| Sharing without isolation | Pods share the GPU; memory is not partitioned between them | No | Workloads that manage their own memory limits (for example, Java apps with `-Xmx`) |
| Sharing with isolation (eGPU) | Pods share the GPU; each pod gets a hard memory boundary enforced by the eGPU module | Yes | Multiple containers on one GPU where one must not starve others |
Prerequisites
Before you begin, ensure that you have:
- An ACK Lingjun managed cluster with at least one GPU-accelerated Lingjun node. See Create a Lingjun cluster with ACK activated.
- The GPU sharing component, which is installed by default in ACK Lingjun managed clusters.
Enable GPU sharing without isolation
Use this mode when your workloads handle GPU memory limits at the application layer.
Step 1: Label the node
- Confirm that the node is a Lingjun node by checking whether `/etc/lingjun_metadata` exists on the node. If the file exists, run `nvidia-smi` to verify that the GPU is accessible. If the file does not exist, the node is not a Lingjun node and you cannot enable GPU sharing for it. Create a Lingjun node pool first. See Overview of Lingjun node pools.

- Add the GPU sharing label to the node:

  ```shell
  kubectl label node <NODE_NAME> ack.node.gpu.schedule=share
  ```
Step 2: Submit a GPU-sharing job
- Create a file named `tensorflow.yaml` with the following content:

  ```yaml
  apiVersion: batch/v1
  kind: Job
  metadata:
    name: tensorflow-mnist-share
  spec:
    parallelism: 1
    template:
      metadata:
        labels:
          app: tensorflow-mnist-share
      spec:
        containers:
        - name: tensorflow-mnist-share
          image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:tensorflow-1.5
          command:
          - python
          - tensorflow-sample-code/tfjob/docker/mnist/main.py
          - --max_steps=100000
          - --data_dir=tensorflow-sample-code/data
          resources:
            limits:
              aliyun.com/gpu-mem: 4 # Request 4 GiB of GPU memory
          workingDir: /root
        restartPolicy: Never
  ```

  The key field is `aliyun.com/gpu-mem: 4` under `resources.limits`, which requests 4 GiB of GPU memory for the pod.

- Submit the job:

  ```shell
  kubectl apply -f tensorflow.yaml
  ```
Step 3: Verify GPU sharing without isolation
- Get the pod name:

  ```shell
  kubectl get pod | grep tensorflow
  ```

- Run `nvidia-smi` inside the pod:

  ```shell
  kubectl exec -ti tensorflow-mnist-share-xxxxx -- nvidia-smi
  ```

  Expected output:

  ```
  Wed Jun 14 06:45:56 2023
  +-----------------------------------------------------------------------------+
  | NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
  |-------------------------------+----------------------+----------------------+
  | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
  | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
  |                               |                      |               MIG M. |
  |===============================+======================+======================|
  |   0  Tesla V100-SXM2...  On   | 00000000:00:09.0 Off |                    0 |
  | N/A   35C    P0    59W / 300W |    334MiB / 16384MiB |      0%      Default |
  |                               |                      |                  N/A |
  +-------------------------------+----------------------+----------------------+

  +-----------------------------------------------------------------------------+
  | Processes:                                                                  |
  |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
  |        ID   ID                                                   Usage      |
  |=============================================================================|
  +-----------------------------------------------------------------------------+
  ```

  Key fields to check:
- Memory-Usage shows `334MiB / 16384MiB`: the pod sees the full 16,384 MiB of GPU memory, not just the 4 GiB it requested. This confirms that isolation is not active.
- If the GPU isolation module were installed, the memory field would show only the requested 4 GiB.

In this mode, memory limits are not enforced at the GPU driver level. The scheduler tracks memory allocations using two environment variables injected into the container:

```
ALIYUN_COM_GPU_MEM_CONTAINER=4   # GPU memory allocated to this pod (GiB)
ALIYUN_COM_GPU_MEM_DEV=16        # Total GPU memory per card (GiB)
```

Applications that need to respect the allocation can calculate the permitted fraction: 4 / 16 = 0.25 (25% of total GPU memory).
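The fraction calculation above can be sketched in code. This is a minimal sketch, assuming the two variables are present in the container environment; the helper name `gpu_memory_fraction` is ours, not part of the GPU sharing component:

```python
import os

def gpu_memory_fraction(env=None):
    """Return the fraction of GPU memory this pod may use, based on the
    environment variables injected by the GPU sharing scheduler."""
    env = os.environ if env is None else env
    container_gib = float(env["ALIYUN_COM_GPU_MEM_CONTAINER"])  # e.g. "4"
    device_gib = float(env["ALIYUN_COM_GPU_MEM_DEV"])           # e.g. "16"
    return container_gib / device_gib

# Simulate the values injected for the example pod above.
fraction = gpu_memory_fraction({"ALIYUN_COM_GPU_MEM_CONTAINER": "4",
                                "ALIYUN_COM_GPU_MEM_DEV": "16"})
print(fraction)  # 0.25
```

In a TensorFlow 1.x workload like the sample job, such a fraction could be passed to `tf.GPUOptions(per_process_gpu_memory_fraction=...)` so that the application stays within its allocation.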
Enable GPU sharing with isolation (eGPU)
Use this mode to enforce hard GPU memory boundaries between pods on the same GPU card.
Step 1: Label the node
- Confirm that the node is a Lingjun node by checking whether `/etc/lingjun_metadata` exists on the node. If the file exists, run `nvidia-smi` to verify that the GPU is accessible. If the file does not exist, the node is not a Lingjun node and you cannot enable GPU sharing for it. Create a Lingjun node pool first. See Overview of Lingjun node pools.

- Add the GPU sharing label to the node. Choose the label value based on which isolation you need:

  | Label value | What it enables |
  |---|---|
  | `egpu_mem` | GPU memory isolation only |
  | `egpu_core_mem` | GPU memory isolation and computing power isolation |

  To enable memory isolation:

  ```shell
  kubectl label node <NODE_NAME> ack.node.gpu.schedule=egpu_mem
  ```

  GPU computing power must always be requested together with GPU memory. Requesting computing power alone is not supported.
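As an illustration of the rule that computing power is only requested together with memory, a pod on an `egpu_core_mem` node might declare both resources in its limits. This is a sketch only: the resource name `aliyun.com/gpu-core.percentage` is an assumption borrowed from ACK's GPU compute-sharing resources and may differ in your cluster, so check the resource names your node actually reports:

```yaml
# Sketch only: the compute-power resource name is an assumption.
resources:
  limits:
    aliyun.com/gpu-mem: 10               # GPU memory (always required)
    aliyun.com/gpu-core.percentage: 30   # computing power; never request this without gpu-mem
```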
Step 2: Confirm node resources are ready
After labeling the node, wait for the node to report its GPU resources, then verify:
```shell
kubectl get node <NODE_NAME> -o yaml
```

Look for `aliyun.com/gpu-mem` and `aliyun.com/gpu-count` in the `allocatable` and `capacity` sections:
```yaml
allocatable:
  aliyun.com/gpu-count: "1"
  aliyun.com/gpu-mem: "80"
  ...
  nvidia.com/gpu: "0"
  ...
capacity:
  aliyun.com/gpu-count: "1"
  aliyun.com/gpu-mem: "80"
  ...
  nvidia.com/gpu: "0"
  ...
```
Key fields to check:
- `aliyun.com/gpu-count: "1"`: the node has one GPU card.
- `aliyun.com/gpu-mem: "80"`: the node has 80 GB of total GPU memory.
- `nvidia.com/gpu: "0"`: the whole GPU is not exposed as an independent schedulable resource; GPU memory is allocated through `aliyun.com/gpu-mem`.

To schedule a pod to an entire GPU device, add the label `ack.gpushare.placement=require-whole-device` to the pod and specify the amount of GPU memory using `aliyun.com/gpu-mem`.
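The whole-device placement described above could look like the following pod spec. This is a minimal sketch under the stated assumptions: the pod and container names are illustrative, and the 80 GB request matches the single-GPU node shown earlier:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: whole-device-pod          # illustrative name
  labels:
    ack.gpushare.placement: require-whole-device  # request an entire GPU card
spec:
  containers:
  - name: main                    # illustrative name
    image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:tensorflow-1.5
    resources:
      limits:
        aliyun.com/gpu-mem: 80    # all 80 GB reported by the example node
```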
Step 3: Run a benchmarking job to verify isolation
- Create a file named `benchmark.yaml` with the following content:

  ```yaml
  apiVersion: batch/v1
  kind: Job
  metadata:
    name: benchmark-job
  spec:
    parallelism: 1
    template:
      spec:
        containers:
        - name: benchmark-job
          image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.2.3
          command:
          - bash
          - run.sh
          - --num_batches=500000000
          - --batch_size=8
          resources:
            limits:
              aliyun.com/gpu-mem: 10 # Request 10 GB of GPU memory
          workingDir: /root
        restartPolicy: Never
        hostNetwork: true
        tolerations:
        - operator: Exists
  ```
- Submit the job:

  ```shell
  kubectl apply -f benchmark.yaml
  ```

- After the pod starts, open a shell in the pod:

  ```shell
  kubectl exec -ti benchmark-job-xxxx -- bash
  ```

- Run `vgpu-smi` to check the GPU isolation status:

  ```shell
  vgpu-smi
  ```

  Expected output:

  ```
  +------------------------------------------------------------------------------+
  | VGPU_SMI 460.91.03     DRIVER_VERSION: 460.91.03     CUDA Version: 11.2      |
  +-------------------------------------------+----------------------------------+
  | GPU  Name                Bus-Id           |        Memory-Usage     GPU-Util |
  |===========================================+==================================|
  |   0  xxxxxxxx            00000000:00:07.0 | 8307MiB / 10782MiB   100% / 100% |
  +-------------------------------------------+----------------------------------+
  ```

  Key fields to check:

  - Memory-Usage shows `8307MiB / 10782MiB`: the pod is capped at approximately 10 GB, confirming that GPU memory isolation is active.
  - Unlike `nvidia-smi` in the non-isolated mode, `vgpu-smi` shows only the memory allocated to this pod, not the total GPU memory.
FAQ
How do I check whether the GPU sharing component is installed?
Run the following command:
```shell
kubectl get ds -n kube-system | grep gpushare
```

If the component is installed, the output lists the following DaemonSets:

```
NAME                                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
gpushare-egpu-device-plugin-ds       0         0         0       0            0           <none>
gpushare-egpucore-device-plugin-ds   0         0         0       0            0           <none>
```
What's next
- Labels for enabling GPU scheduling policies: learn about all available node labels for GPU scheduling.
- Overview of Lingjun node pools: add Lingjun nodes to your cluster.