Some AI/ML workloads manage GPU memory through framework-level APIs or custom allocation logic. For those workloads, ACK's GPU memory isolation layer is redundant and may interfere with the application's own memory management. This topic shows you how to enable GPU sharing on a node pool without installing the GPU memory isolation module.
Use this mode only when your application already manages GPU memory limits internally.
If you need both GPU sharing and memory isolation, see Configure GPU sharing with memory isolation.
Prerequisites
Before you begin, ensure that you have:
- An ACK Pro cluster. See Create an ACK Pro cluster.
- The GPU inspection tool installed. See Install the GPU inspection tool.
How it works
When GPU sharing is enabled without memory isolation:
- The node label `ack.node.gpu.schedule=share` activates GPU sharing on the node pool.
- The `aliyun.com/gpu-mem` resource limit tells the scheduler how much GPU memory the pod requests. ACK uses this value for scheduling decisions and ratio calculations, but does not enforce it as a hard memory cap.
- The pod sees the full physical GPU memory (for example, 16,384 MiB on a V100). ACK injects two environment variables, `ALIYUN_COM_GPU_MEM_CONTAINER` and `ALIYUN_COM_GPU_MEM_DEV`, that the application uses to calculate its share and stay within the requested limit.
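The application-side half of this contract can be sketched as follows. This is a minimal illustration, not ACK-provided code: the helper name and the fallback defaults are assumptions; only the two environment variable names come from the ACK documentation above.

```python
import os

def gpu_mem_fraction(default_container_gib=4, default_dev_gib=16):
    """Return the fraction of the physical GPU this pod requested.

    Reads the ALIYUN_COM_GPU_MEM_CONTAINER and ALIYUN_COM_GPU_MEM_DEV
    environment variables that ACK injects into the pod. The fallback
    defaults are illustrative, for running this sketch outside a cluster.
    """
    container_gib = int(os.environ.get("ALIYUN_COM_GPU_MEM_CONTAINER",
                                       default_container_gib))
    dev_gib = int(os.environ.get("ALIYUN_COM_GPU_MEM_DEV", default_dev_gib))
    return container_gib / dev_gib

# The workload would pass this fraction to its framework's memory-limit
# option (for example, a per-process GPU memory fraction in TensorFlow 1.x).
fraction = gpu_mem_fraction()
```

With the values used in this topic (4 GiB requested on a 16 GiB V100), the fraction is 0.25. How the fraction is enforced is entirely up to the workload; ACK does not cap memory in this mode.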
Step 1: Create a node pool
1. Log on to the ACK console and click Clusters in the left-side navigation pane.
2. Click the name of the cluster, then choose Nodes > Node Pools in the left-side navigation pane.
3. On the Node Pools page, click Create Node Pool.
4. In the Create Node Pool dialog box, configure the following parameters, then click Confirm Order. For all other parameters, see Create and manage a node pool.

   | Parameter | Value |
   | --- | --- |
   | Instance type | Set Architecture to GPU-accelerated and select one or more GPU instance types. This example uses V100 instances. |
   | Expected nodes | Set to `0` if you do not want to provision nodes immediately. |
   | Node labels | Click Add Label, set Key to `ack.node.gpu.schedule`, and set Value to `share`. |

   Setting `ack.node.gpu.schedule=share` enables GPU sharing on the node pool without installing the GPU memory isolation module. For all supported GPU scheduling labels, see Labels for enabling GPU scheduling policies.
Step 2: Submit a job
1. Log on to the ACK console and click Clusters in the left-side navigation pane.
2. Click the name of the cluster, then choose Workloads > Jobs in the left-side navigation pane.
3. Click Create from YAML, paste the following YAML into the Template section, and click Create.

   ```yaml
   apiVersion: batch/v1
   kind: Job
   metadata:
     name: tensorflow-mnist-share
   spec:
     parallelism: 1
     template:
       metadata:
         labels:
           app: tensorflow-mnist-share
       spec:
         containers:
         - name: tensorflow-mnist-share
           image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:tensorflow-1.5
           command:
           - python
           - tensorflow-sample-code/tfjob/docker/mnist/main.py
           - --max_steps=100000
           - --data_dir=tensorflow-sample-code/data
           resources:
             limits:
               aliyun.com/gpu-mem: 4 # Request 4 GiB of GPU memory
           workingDir: /root
         restartPolicy: Never
   ```

   The `aliyun.com/gpu-mem: 4` limit requests 4 GiB of GPU memory for scheduling purposes. In this mode, it does not prevent other pods from using the remaining GPU memory on the same device.
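The scheduling role of `aliyun.com/gpu-mem` can be modeled in a few lines. This is a deliberately simplified illustration of the bookkeeping described above, not ACK's actual scheduler logic; the function name and list-based representation are assumptions for the sketch.

```python
def fits(requested_gib, scheduled_gib, device_capacity_gib=16):
    """Simplified model: a pod fits on a GPU if its aliyun.com/gpu-mem
    request plus the requests already placed on that device stay within
    the device's capacity. (Illustration only; not ACK scheduler code.)"""
    return sum(scheduled_gib) + requested_gib <= device_capacity_gib

# A 16 GiB V100 already hosting pods that requested 4 GiB and 8 GiB:
print(fits(4, [4, 8]))  # 4 + 8 + 4 = 16 <= 16 -> True
print(fits(8, [4, 8]))  # 4 + 8 + 8 = 20 >  16 -> False
```

The key point the model captures: the sum of requests on a device is bounded, but nothing at runtime stops a scheduled pod from allocating beyond its own request.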
Verify the configuration
To confirm that GPU sharing is active without memory isolation, check that the pod sees the full physical GPU memory rather than only the requested amount.
1. On the Clusters page, click the cluster name, then choose Workloads > Pods in the left-side navigation pane.
2. In the Actions column of the pod (for example, `tensorflow-mnist-share-***`), click Terminal and run `nvidia-smi`. Expected output:

   ```
   Wed Jun 14 06:45:56 2023
   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                               |                      |               MIG M. |
   |===============================+======================+======================|
   |   0  Tesla V100-SXM2...  On   | 00000000:00:09.0 Off |                    0 |
   | N/A   35C    P0    59W / 300W |    334MiB / 16384MiB |      0%      Default |
   |                               |                      |                  N/A |
   +-------------------------------+----------------------+----------------------+

   +-----------------------------------------------------------------------------+
   | Processes:                                                                  |
   |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
   |        ID   ID                                                   Usage      |
   |=============================================================================|
   +-----------------------------------------------------------------------------+
   ```

   What to check: the denominator in the Memory-Usage field must show the full physical GPU memory, `16384MiB` for a V100. If it shows `4096MiB` instead, GPU memory isolation is active on this node.
3. Verify the environment variables ACK injects into the pod:

   ```
   ALIYUN_COM_GPU_MEM_CONTAINER=4   # GPU memory requested by this pod (GiB)
   ALIYUN_COM_GPU_MEM_DEV=16        # Total physical GPU memory (GiB)
   ```

   The application uses these variables to calculate its memory usage ratio and stay within the requested limit:
   ```
   percentage = ALIYUN_COM_GPU_MEM_CONTAINER / ALIYUN_COM_GPU_MEM_DEV = 4 / 16 = 0.25
   ```
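The ratio can also be expressed as an absolute budget against the physical memory that `nvidia-smi` reports. The helper below is a hypothetical sketch, assuming the injected GiB values and the 16384 MiB V100 figure from this topic; the function name is not part of any ACK API.

```python
def mem_budget_mib(container_gib=4, dev_gib=16, physical_mib=16384):
    """Convert the pod's requested share (GiB values from the injected
    environment variables) into a MiB budget against the physical device
    memory shown by nvidia-smi. Illustrative helper; names are assumed."""
    return int(physical_mib * (container_gib / dev_gib))

print(mem_budget_mib())  # 16384 * 0.25 = 4096
```

An application managing its own memory would keep its allocations under this budget, since in this mode nothing else prevents it from consuming the neighboring pods' shares.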