GPU-accelerated nodes are expensive, and a single inference service rarely saturates an entire GPU. Shared GPU scheduling lets you run multiple inference services on one GPU by slicing its memory into fixed-size allocations. This guide shows how to deploy two Qwen1.5-0.5B-Chat inference services on a single V100 GPU using KServe and Arena, with each service receiving a 6 GB memory slice.
How it works
ACK's shared GPU scheduling component implements GPU memory slicing. Each inference service declares how much GPU memory it needs via the --gpumemory flag, and the scheduler places multiple pods on the same GPU node as long as the total requested memory stays within the node's physical GPU capacity.
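Conceptually, the `--gpumemory` flag becomes an extended resource request on the pod, which the scheduler counts against the node's capacity. The following is a sketch of the pod spec fragment involved; the `aliyun.com/gpu-mem` resource name is the one used by ACK's shared GPU scheduling component, but treat the exact field layout here as illustrative rather than a manifest to apply:

```yaml
# Sketch: pod spec fragment the scheduler evaluates when placing a
# 6 GB slice. aliyun.com/gpu-mem is ACK's extended resource for
# shared GPU scheduling; the value is in GB.
apiVersion: v1
kind: Pod
metadata:
  name: qwen1-predictor-example   # hypothetical name, for illustration
spec:
  containers:
    - name: kserve-container
      image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1
      resources:
        limits:
          aliyun.com/gpu-mem: 6   # 6 GB slice of the node's GPU memory
```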
Use shared GPU scheduling when maximizing GPU utilization matters more than fault isolation between services. For workloads that require strict isolation, use dedicated GPU nodes.
Limitations
- The total GPU memory requested across all pods on a node must not exceed the node's physical GPU memory.
- GPU-accelerated nodes use CUDA 11 by default. This guide requires CUDA 12.0 or later.
- ack-kserve must be deployed in Raw Deployment mode.
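The first limitation is simple arithmetic you can check before deploying. A minimal sketch, using the numbers from this guide's setup (two 6 GB slices on a 16 GB V100); this is illustrative bookkeeping, not part of the scheduler:

```python
# Sketch: verify that the GPU memory slices requested on one node fit
# within the node's physical GPU memory, as the scheduler requires.
def fits_on_node(requested_gb, node_gpu_gb):
    """Return True if all requested slices fit on the node's GPU."""
    return sum(requested_gb) <= node_gpu_gb

# Two 6 GB services on a 16 GB V100: 12 GB <= 16 GB, so both fit.
print(fits_on_node([6, 6], 16))     # True
# A third 6 GB service would exceed capacity: 18 GB > 16 GB.
print(fits_on_node([6, 6, 6], 16))  # False
```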
Prerequisites
Before you begin, ensure that you have:
- An ACK managed cluster or ACK dedicated cluster with GPU-accelerated nodes, running Kubernetes 1.22 or later. For more information, see Add GPU-accelerated nodes to a cluster or Create an ACK dedicated cluster with GPU-accelerated nodes.
- CUDA 12.0 or later on the GPU nodes. By default, GPU-accelerated nodes use CUDA 11. To use CUDA 12, add the tag ack.aliyun.com/nvidia-driver-version:525.105.17 to the GPU-accelerated node pool. For more information, see Customize the NVIDIA GPU driver version on nodes.
- The shared GPU scheduling component installed on the cluster.
- Arena client version 0.9.15 or later. For more information, see Configure the Arena client.
- cert-manager and ack-kserve installed, with ack-kserve in Raw Deployment mode.
Step 1: Prepare model data
Store the model in an Object Storage Service (OSS) bucket or an Apsara File Storage NAS file system. This guide uses OSS. For more information, see Use an ossfs 1.0 statically provisioned volume or Mount a statically provisioned NAS volume.
- Download the Qwen1.5-0.5B-Chat model.

  git lfs install
  GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/Qwen1.5-0.5B-Chat.git
  cd Qwen1.5-0.5B-Chat
  git lfs pull

- Upload the model files to your OSS bucket. For installation and usage of ossutil, see Install ossutil.

  ossutil mkdir oss://<your-bucket-name>/models/Qwen1.5-0.5B-Chat
  ossutil cp -r ./Qwen1.5-0.5B-Chat oss://<your-bucket-name>/models/Qwen1.5-0.5B-Chat
- Create a persistent volume (PV) for the cluster using the following configuration.

  | Configuration item | Value |
  |---|---|
  | Persistent volume type | OSS |
  | Name | llm-model |
  | Access certificate | AccessKey ID and AccessKey secret for the OSS bucket |
  | Bucket ID | The OSS bucket created in the previous step |
  | OSS path | /models/Qwen1.5-0.5B-Chat |
- Create a persistent volume claim (PVC) bound to the PV.

  | Configuration item | Value |
  |---|---|
  | Persistent volume claim type | OSS |
  | Name | llm-model |
  | Allocation mode | Existing persistent volume |
  | Existing persistent volume | Click Select Existing persistent volume and select the PV created in the previous step |
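If you prefer to create the PV and PVC with kubectl instead of the console, the tables above correspond roughly to manifests like the following. This is a sketch only: the CSI `volumeAttributes` keys, the secret name, the endpoint URL, and the storage size are assumptions for illustration; consult the linked OSS volume documentation for the authoritative fields.

```yaml
# Sketch only: OSS CSI driver field names and values are assumptions.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
spec:
  capacity:
    storage: 30Gi               # illustrative size
  accessModes: ["ReadOnlyMany"]
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret          # holds the AccessKey ID and secret
      namespace: default
    volumeAttributes:
      bucket: "<your-bucket-name>"
      path: "/models/Qwen1.5-0.5B-Chat"
      url: "oss-cn-beijing.aliyuncs.com"   # illustrative endpoint
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes: ["ReadOnlyMany"]
  resources:
    requests:
      storage: 30Gi
  volumeName: llm-model         # binds the claim to the PV above
```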
Step 2: Deploy the inference services
Deploy two Qwen inference services, each requesting 6 GB of GPU memory. The commands are identical except for --name.
Run the following command to start the first service:
arena serve kserve \
--name=qwen1 \
--image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
--gpumemory=6 \
--cpu=3 \
--memory=8Gi \
--data="llm-model:/mnt/models/Qwen1.5-0.5B-Chat" \
"python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen1.5-0.5B-Chat --dtype=half --max-model-len=4096"
To start the second service, run the same command with --name=qwen2.
The following table describes the key parameters.
| Parameter | Type | Required | Description |
|---|---|---|---|
| --name | String | Yes | Name of the inference service. Must be globally unique. |
| --image | String | Yes | Container image for the inference service. |
| --gpumemory | Integer (GB) | No | GPU memory to allocate to this service, in GB. For example, --gpumemory=6 allocates 6 GB. The total GPU memory requested by all services on a node must not exceed the node's physical GPU memory. |
| --cpu | Integer | No | Number of vCPUs for the inference service. |
| --memory | String | No | Amount of RAM for the inference service, for example, 8Gi. |
| --data | String | No | PVC-to-container mount path in the format <pvc-name>:<container-path>. In this example, the llm-model PVC is mounted to /mnt/models/ in the container. |
Step 3: Verify the inference services
- Check that both pods are running on the same GPU node.

  kubectl get pod -owide | grep qwen

  Expected output:

  qwen1-predictor-856568bdcf-5pfdq   1/1   Running   0   7m10s   10.130.XX.XX   cn-beijing.172.16.XX.XX   <none>   <none>
  qwen2-predictor-6b477b587d-dpdnj   1/1   Running   0   4m3s    10.130.XX.XX   cn-beijing.172.16.XX.XX   <none>   <none>

  Both pods appear on the same node (cn-beijing.172.16.XX.XX), confirming that GPU sharing is active.
- Check the GPU memory allocated to each pod. Run the following commands, one per service:

  kubectl exec -it qwen1-predictor-856568bdcf-5pfdq -- nvidia-smi   # First service
  kubectl exec -it qwen2-predictor-6b477b587d-dpdnj -- nvidia-smi   # Second service

  Expected output for each pod (timestamps differ):

  Fri Jun 28 06:20:43 2024
  +---------------------------------------------------------------------------------------+
  | NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
  |-----------------------------------------+----------------------+----------------------+
  | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
  | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
  |                                         |                      |               MIG M. |
  |=========================================+======================+======================|
  |   0  Tesla V100-SXM2-16GB           On  | 00000000:00:07.0 Off |                    0 |
  | N/A   39C    P0              53W / 300W |   5382MiB /  6144MiB |      0%      Default |
  |                                         |                      |                  N/A |
  +-----------------------------------------+----------------------+----------------------+
  +---------------------------------------------------------------------------------------+
  | Processes:                                                                            |
  |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
  |        ID   ID                                                             Usage      |
  |=======================================================================================|
  +---------------------------------------------------------------------------------------+

  Each pod's GPU memory limit is 6 GB (6,144 MiB). This confirms that both services are sharing the node's GPU memory as configured.
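The 6,144 MiB figure reported by nvidia-smi is simply the 6 GB slice expressed in binary units, and the 5,382 MiB in use is vLLM's actual footprint within that slice. The arithmetic, as a quick check:

```python
# The --gpumemory flag is specified in GB; nvidia-smi reports MiB.
slice_gb = 6
slice_mib = slice_gb * 1024   # 6 GB slice = 6144 MiB
used_mib = 5382               # from the nvidia-smi output above

print(slice_mib)              # 6144
print(used_mib <= slice_mib)  # True: vLLM stays within its slice
```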
- Send a test request to the inference service through the NGINX Ingress gateway.

  curl -H "Host: $(kubectl get inferenceservice qwen1 -o jsonpath='{.status.url}' | cut -d "/" -f 3)" \
       -H "Content-Type: application/json" \
       http://$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'):80/v1/chat/completions \
       -d '{"model": "qwen", "messages": [{"role": "user", "content": "This is a test."}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}'

  Expected output:

  {"id":"cmpl-bbca59499ab244e1aabfe2c354bf6ad5","object":"chat.completion","created":1719303373,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"OK. What do you want to test?"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":21,"total_tokens":31,"completion_tokens":10}}

  The model returns a response, confirming that the inference service is working correctly.
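Because the service exposes vLLM's OpenAI-compatible API, the response follows the standard chat completions schema and any OpenAI-style client can consume it. A minimal sketch that parses the JSON body shown in the expected output above:

```python
import json

# Response body copied from the expected output above.
raw = ('{"id":"cmpl-bbca59499ab244e1aabfe2c354bf6ad5","object":"chat.completion",'
       '"created":1719303373,"model":"qwen","choices":[{"index":0,"message":'
       '{"role":"assistant","content":"OK. What do you want to test?"},'
       '"logprobs":null,"finish_reason":"length","stop_reason":null}],'
       '"usage":{"prompt_tokens":21,"total_tokens":31,"completion_tokens":10}}')

resp = json.loads(raw)
answer = resp["choices"][0]["message"]["content"]
print(answer)                              # OK. What do you want to test?
print(resp["usage"]["completion_tokens"])  # 10, matching max_tokens in the request
```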
(Optional) Step 4: Clean up
Delete the resources when they are no longer needed.
Delete the inference services:
arena serve delete qwen1
arena serve delete qwen2
Delete the PVC and PV:
kubectl delete pvc llm-model
kubectl delete pv llm-model