When a single model training job needs more GPU memory than one physical card provides, multi-GPU sharing lets a Pod draw equal allocations from multiple GPU cards simultaneously — without exclusively occupying any card. ACK Pro clusters support multi-GPU sharing with GPU memory isolation, enabling finer-grained resource utilization during model development.
Prerequisites
Before you begin, ensure that you have an ACK Pro cluster. As noted above, multi-GPU sharing with GPU memory isolation is available only in ACK Pro clusters.
Limitations
Multi-GPU sharing supports GPU memory isolation with computing power sharing only. GPU memory isolation with computing power allocation is not supported.
How it works
Multi-GPU sharing lets a single Pod request GPU memory from multiple physical GPU cards simultaneously. Each card contributes an equal share.
| Mode | Description |
|---|---|
| Single-GPU sharing | A Pod uses a portion of one GPU card's resources. |
| Multi-GPU sharing | A Pod spans multiple GPU cards, with each card contributing the same amount of GPU memory. |
Allocation formula: If a Pod requests N GiB of GPU memory from M GPU cards, each card allocates N/M GiB.
For example, a Pod requesting 8 GiB across 2 GPU cards receives 4 GiB from each card.
Constraints:

- N/M must be an integer.
- All M GPU cards must be on the same Kubernetes node.
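The allocation rule above can be sketched as a small helper. This is an illustrative Python function, not part of ACK; it only mirrors the N/M arithmetic and the integer-split constraint described in this section:

```python
def per_card_allocation(total_gib: int, card_count: int) -> int:
    """Split a Pod's total GPU memory request evenly across cards.

    Mirrors the multi-GPU sharing rule: a request for N GiB across
    M cards gives N/M GiB per card, and N/M must be an integer.
    """
    if card_count < 1:
        raise ValueError("card_count must be at least 1")
    if total_gib % card_count != 0:
        raise ValueError(
            f"{total_gib} GiB cannot be split evenly across {card_count} cards"
        )
    return total_gib // card_count

# The example from this section: 8 GiB across 2 cards.
print(per_card_allocation(8, 2))  # 4 (GiB from each card)
```

A request such as 8 GiB across 3 cards would be rejected, because 8/3 is not an integer.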
Configure multi-GPU sharing
1. Log on to the ACK console. In the left navigation pane, click Clusters.

2. On the Clusters page, click the name of your cluster. In the left navigation pane, click Workloads > Jobs.

3. On the Jobs page, click Create from YAML. Copy the following YAML into the Template area, then click Create.
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: tensorflow-mnist-multigpu
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        app: tensorflow-mnist-multigpu
        aliyun.com/gpu-count: "2"  # Number of GPU cards to use
    spec:
      containers:
      - name: tensorflow-mnist-multigpu
        image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:tensorflow-1.5
        command:
        - python
        - tensorflow-sample-code/tfjob/docker/mnist/main.py
        - --max_steps=100000
        - --data_dir=tensorflow-sample-code/data
        resources:
          limits:
            aliyun.com/gpu-mem: 8  # Total GPU memory in GiB across all cards
        workingDir: /root
      restartPolicy: Never
```

Key parameters:
| Parameter | Type | Description |
|---|---|---|
| aliyun.com/gpu-count | String (Pod label) | Number of GPU cards to use. Set in metadata.labels. In this example, "2" means the Pod requests GPU memory from 2 cards. |
| aliyun.com/gpu-mem | Integer (resource limit) | Total GPU memory in GiB to request across all GPU cards. Set in resources.limits. In this example, 8 means 8 GiB total: each of the 2 cards provides 4 GiB. |
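Note that the two settings live in different places and have different types: the card count is a string-valued Pod label, while the memory request is an integer resource limit. The following hypothetical Python helper (illustrative only; the real validation is performed by the ACK scheduler, not by user code) shows how the two values combine:

```python
def check_multi_gpu_request(pod_labels: dict, resource_limits: dict) -> int:
    """Derive the per-card GiB share from the two multi-GPU settings.

    aliyun.com/gpu-count is a string-valued label in metadata.labels;
    aliyun.com/gpu-mem is an integer limit in resources.limits.
    Illustrative check only -- not part of the ACK API.
    """
    card_count = int(pod_labels["aliyun.com/gpu-count"])
    total_gib = int(resource_limits["aliyun.com/gpu-mem"])
    if total_gib % card_count != 0:
        raise ValueError("total GPU memory must divide evenly across cards")
    return total_gib // card_count

# Values taken from the example Job above:
labels = {"aliyun.com/gpu-count": "2"}
limits = {"aliyun.com/gpu-mem": 8}
print(check_multi_gpu_request(labels, limits))  # 4 (GiB per card)
```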
Verify GPU memory isolation
After the Job starts, verify that the Pod can access only its allocated GPU memory.
1. On the Clusters page, click the name of your cluster. In the left navigation pane, click Workloads > Pods.

2. In the row for the Pod (for example, tensorflow-mnist-multigpu-***), click Actions > Terminal to open a terminal session. Run the following command:

   ```
   nvidia-smi
   ```

   The expected output is similar to:
```
Wed Jun 14 03:24:14 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03    Driver Version: 470.161.03    CUDA Version: 11.4   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:09.0 Off |                    0 |
| N/A   38C    P0    61W / 300W |    569MiB /  4309MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:0A.0 Off |                    0 |
| N/A   36C    P0    61W / 300W |    381MiB /  4309MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
```

Confirm the following in the output:
- Two GPU cards are listed (GPU 0 and GPU 1), matching aliyun.com/gpu-count: "2".
- Each card shows 4309 MiB of total memory, which corresponds to the requested 4 GiB per card rather than the physical 16,160 MiB. This confirms that GPU memory isolation is active.
3. In the row for the same Pod, click Actions > Logs to view the container logs. Confirm that the following output appears twice (once per card):

   ```
   totalMemory: 4.21GiB freeMemory: 3.91GiB
   ```

   The totalMemory value of approximately 4 GiB per card, rather than the physical 16,160 MiB, confirms that GPU memory isolation works from the application's perspective.
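The two readings above are consistent with each other. Converting the per-card figure that nvidia-smi reports in MiB to GiB reproduces the value TensorFlow logs (a quick arithmetic check, assuming 1 GiB = 1024 MiB):

```python
# Per-card values taken from the outputs shown above.
smi_mib = 4309        # per-card total reported by nvidia-smi, in MiB
physical_mib = 16160  # physical memory of one card in this example, in MiB

total_gib = smi_mib / 1024
print(f"totalMemory: {total_gib:.2f}GiB")  # totalMemory: 4.21GiB
print(smi_mib < physical_mib)              # True: far below physical memory
```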