To improve GPU utilization, you can run multiple model inference tasks on the same GPU. This topic uses the Qwen1.5-0.5B-Chat model and a V100 GPU as an example to describe how to use KServe to deploy model inference services that share a GPU.
Prerequisites
A Container Service for Kubernetes (ACK) managed cluster or an ACK dedicated cluster with GPU-accelerated nodes is created. The cluster runs Kubernetes 1.22 or later and uses Compute Unified Device Architecture (CUDA) 12.0 or later. For more information, see Add GPU-accelerated nodes to a cluster or Create an ACK dedicated cluster with GPU-accelerated nodes.
Note: By default, GPU-accelerated nodes use CUDA 11. To use CUDA 12, add the ack.aliyun.com/nvidia-driver-version:525.105.17 label to the GPU-accelerated node pool. For more information, see Customize the NVIDIA GPU driver version on nodes.
The shared GPU scheduling component is installed, which enables GPU sharing and scheduling.
The Arena client of version 0.9.15 or later is installed. For more information, see Configure the Arena client.
The cert-manager and ack-kserve components are installed, and the ack-kserve component is in Raw Deployment mode.
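Before moving on, you can roughly verify these prerequisites from a machine that has kubectl and Arena access to the cluster. The following is a minimal sketch; the namespaces and pod name patterns (cert-manager, kserve, gpushare) are assumptions and may differ in your installation.
```bash
# Check the Arena client version (0.9.15 or later is required).
arena version

# Check that cert-manager and the ack-kserve controller are running.
# The namespaces are assumptions; adjust them to your installation.
kubectl get pods -n cert-manager
kubectl get pods -n kserve

# Look for the shared GPU scheduling component in kube-system.
# The "gpushare" name pattern is an assumption.
kubectl get pods -n kube-system | grep -i gpushare
```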
Step 1: Prepare model data
You can use an OSS bucket or a File Storage NAS (NAS) file system to prepare model data. For more information, see Use an ossfs 1.0 statically provisioned volume or Mount a statically provisioned NAS volume. In this example, an OSS bucket is used.
Download the model. This topic uses the Qwen1.5-0.5B-Chat model as an example.
```bash
git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/Qwen1.5-0.5B-Chat.git
cd Qwen1.5-0.5B-Chat
git lfs pull
```
Upload the downloaded Qwen1.5-0.5B-Chat files to OSS.
Note: For more information about how to install and use ossutil, see Install ossutil.
```bash
ossutil mkdir oss://<your-bucket-name>/models/Qwen1.5-0.5B-Chat
ossutil cp -r ./Qwen1.5-0.5B-Chat oss://<your-bucket-name>/models/Qwen1.5-0.5B-Chat
```
Configure a persistent volume (PV) and a persistent volume claim (PVC) for the target cluster.
The following table describes the basic configuration of the example PV.
| Configuration item | Description |
| --- | --- |
| Persistent volume type | OSS |
| Name | llm-model |
| Access Certificate | Configure the AccessKey ID and AccessKey secret used to access OSS. |
| Bucket ID | Select the OSS bucket that you created in the previous step. |
| OSS path | Select the path where the model is stored, such as /models/Qwen1.5-0.5B-Chat. |
The following table describes the basic configuration of the example PVC.
| Configuration item | Description |
| --- | --- |
| Persistent volume claim type | OSS |
| Name | llm-model |
| Allocation mode | Select Existing Persistent Volume. |
| Existing persistent volume | Click the Select Existing Persistent Volume link and select the PV that you created. |
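If you prefer to create the PV and PVC with kubectl instead of the console, the following is a minimal sketch of equivalent manifests. It assumes the ACK OSS CSI driver (ossplugin.csi.alibabacloud.com) and a Secret named oss-secret that stores the AccessKey pair under the akId and akSecret keys; the bucket name, endpoint, capacity, and namespace are placeholders that you should adjust to your environment.
```bash
# Create the Secret that holds the AccessKey pair.
# The akId/akSecret key names are assumptions based on the OSS CSI driver.
kubectl create secret generic oss-secret \
  --from-literal=akId=<your-access-key-id> \
  --from-literal=akSecret=<your-access-key-secret>

# Create the PV and PVC. The storage size is nominal; OSS does not enforce capacity.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name>
      url: oss-cn-beijing-internal.aliyuncs.com   # replace with your region endpoint
      path: /models/Qwen1.5-0.5B-Chat
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
  namespace: default
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
EOF
```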
Step 2: Deploy the inference services
Start two Qwen inference services. Each service requires 6 GB of GPU memory.
To start the second Qwen inference service, run the same command and change --name=qwen1 to --name=qwen2. The complete command for the second service is shown after the parameter table below.
```bash
arena serve kserve \
    --name=qwen1 \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
    --gpumemory=6 \
    --cpu=3 \
    --memory=8Gi \
    --data="llm-model:/mnt/models/Qwen1.5-0.5B-Chat" \
    "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen1.5-0.5B-Chat --dtype=half --max-model-len=4096"
```
The following table describes the parameters.
| Parameter | Required | Description |
| --- | --- | --- |
| --name | Yes | The name of the inference service to submit. The name must be globally unique. |
| --image | Yes | The image address of the inference service. |
| --gpumemory | No | The amount of GPU memory to request, in GB. In this example, each service requests 6 GB so that the two services can share one GPU. |
| --cpu | No | The number of vCPUs to use for the inference service. |
| --memory | No | The amount of memory to use for the inference service. |
| --data | No | The model path of the inference service. In this example, the PVC named llm-model is mounted to the /mnt/models/Qwen1.5-0.5B-Chat directory of the container. |
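For reference, the command for the second service is identical except for the service name:
```bash
arena serve kserve \
    --name=qwen2 \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
    --gpumemory=6 \
    --cpu=3 \
    --memory=8Gi \
    --data="llm-model:/mnt/models/Qwen1.5-0.5B-Chat" \
    "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen1.5-0.5B-Chat --dtype=half --max-model-len=4096"
```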
Step 3: Verify the inference services
You can check the deployment status of the two Qwen inference services.
```bash
kubectl get pod -owide | grep qwen
```
Expected output:
```
qwen1-predictor-856568bdcf-5pfdq   1/1   Running   0   7m10s   10.130.XX.XX   cn-beijing.172.16.XX.XX   <none>   <none>
qwen2-predictor-6b477b587d-dpdnj   1/1   Running   0   4m3s    10.130.XX.XX   cn-beijing.172.16.XX.XX   <none>   <none>
```
The output shows that both qwen1 and qwen2 are deployed on the same GPU node, cn-beijing.172.16.XX.XX.
Run the following two commands to access the pods of the two inference services and check the GPU memory allocated to each pod.
```bash
kubectl exec -it qwen1-predictor-856568bdcf-5pfdq -- nvidia-smi   # Enter the pod of the first inference service.
kubectl exec -it qwen2-predictor-6b477b587d-dpdnj -- nvidia-smi   # Enter the pod of the second inference service.
```
Expected output:
The output shows that the GPU memory limit for both pods is 6 GB. This indicates that each pod is allocated 6 GB of GPU memory and the GPU memory of the node is successfully shared by the two inference service pods.
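You can also check the GPU memory resources requested by each pod. The sketch below assumes that the shared GPU scheduling component exposes GPU memory through the aliyun.com/gpu-mem extended resource; the resource name may differ depending on the component version, and the pod names must match your own output.
```bash
# Print the container resource limits of each inference pod.
kubectl get pod qwen1-predictor-856568bdcf-5pfdq \
  -o jsonpath='{.spec.containers[0].resources.limits}{"\n"}'
kubectl get pod qwen2-predictor-6b477b587d-dpdnj \
  -o jsonpath='{.spec.containers[0].resources.limits}{"\n"}'
# Each result is expected to include an entry such as "aliyun.com/gpu-mem":"6".
```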
Access the inference service using the Nginx Ingress gateway address.
```bash
curl -H "Host: $(kubectl get inferenceservice qwen1 -o jsonpath='{.status.url}' | cut -d "/" -f 3)" \
     -H "Content-Type: application/json" \
     http://$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'):80/v1/chat/completions \
     -d '{
        "model": "qwen",
        "messages": [{"role": "user", "content": "This is a test."}],
        "max_tokens": 10,
        "temperature": 0.7,
        "top_p": 0.9,
        "seed": 10
     }'
```
Expected output:
```
{"id":"cmpl-bbca59499ab244e1aabfe2c354bf6ad5","object":"chat.completion","created":1719303373,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"OK. What do you want to test?"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":21,"total_tokens":31,"completion_tokens":10}}
```
The output shows that the model can generate a response based on the given input, which is a test message in this example.
(Optional) Step 4: Clean up the environment
If you no longer need the created resources, delete them promptly.
Run the following commands to delete the deployed model inference services:
```bash
arena serve delete qwen1
arena serve delete qwen2
```
Run the following commands to delete the created PV and PVC:
```bash
kubectl delete pvc llm-model
kubectl delete pv llm-model
```