In some scenarios, you may want multiple inference tasks to share the same GPU to improve GPU utilization. This topic uses the Qwen1.5-0.5B-Chat model and a V100 GPU as an example to describe how to use KServe to deploy inference services that share a GPU.
Prerequisites
A Container Service for Kubernetes (ACK) managed cluster or an ACK dedicated cluster with GPU-accelerated nodes is created. The cluster runs Kubernetes 1.22 or later and uses Compute Unified Device Architecture (CUDA) 12.0 or later. For more information, see Create an ACK cluster with GPU-accelerated nodes or Create an ACK dedicated cluster with GPU-accelerated nodes.
By default, GPU-accelerated nodes use CUDA 11. You can add the ack.aliyun.com/nvidia-driver-version:525.105.17 label to the GPU-accelerated node pool to specify CUDA 12 for the GPU-accelerated nodes. For more information, see Specify an NVIDIA driver version for nodes by adding a label. A sketch of the corresponding kubectl command is provided after these prerequisites.
The GPU sharing component is installed and GPU sharing is enabled.
The Arena client of version 0.9.15 or later is installed. For more information, see Configure the Arena client.
The cert-manager and ack-kserve components are installed. The ack-kserve component is deployed in Raw Deployment mode.
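The following is a minimal sketch of how the driver-version label could be applied with kubectl, assuming you manage node labels directly; in the console, the label is typically added to the GPU-accelerated node pool so that it takes effect when nodes are added to the pool, and the node name below is only a placeholder.
# Placeholder node name; on ACK the label is normally set on the node pool,
# and the specified driver version takes effect when a node is (re)added to the pool.
kubectl label node cn-beijing.172.16.XX.XX \
  ack.aliyun.com/nvidia-driver-version=525.105.17 --overwrite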
Step 1: Prepare model data
You can use an OSS bucket or a File Storage NAS (NAS) file system to prepare model data. For more information, see Mount a statically provisioned ossfs 1.0 volume or Mount a statically provisioned NAS volume. In this example, an OSS bucket is used.
Download the model. In this example, the Qwen1.5-0.5B-Chat model is used.
git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/Qwen1.5-0.5B-Chat.git
cd Qwen1.5-0.5B-Chat
git lfs pull
Upload the Qwen1.5-0.5B-Chat files to Object Storage Service (OSS).
Note: For more information about how to install and use ossutil, see Install ossutil.
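If ossutil has not been configured yet, a one-time configuration similar to the following is usually required before you run the commands below. The endpoint is only an assumed example; replace it and the AccessKey pair with your own values.
# Assumed endpoint; use the endpoint of the region where your bucket resides.
ossutil config -e oss-cn-beijing.aliyuncs.com -i <AccessKey ID> -k <AccessKey Secret>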
ossutil mkdir oss://<your-bucket-name>/models/Qwen1.5-0.5B-Chat
ossutil cp -r ./Qwen1.5-0.5B-Chat oss://<your-bucket-name>/models/Qwen1.5-0.5B-Chat
Configure a persistent volume (PV) named llm-model and a persistent volume claim (PVC) named llm-model for the cluster.
The following table describes the parameters of the PV.
Parameter | Description |
--- | --- |
PV Type | OSS |
Volume Name | llm-model |
Access Certificate | Specify the AccessKey ID and the AccessKey secret used to access the OSS bucket. |
Bucket ID | Select the OSS bucket that you created in the previous step. |
OSS Path | Select the path of the model, such as /models/Qwen1.5-0.5B-Chat. |
The following table describes the parameters of the PVC.
Parameter | Description |
--- | --- |
PVC Type | OSS |
Volume Name | llm-model |
Allocation Mode | Select Existing Volumes. |
Existing Volumes | Click the Existing Volumes hyperlink and select the PV that you created. |
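If you prefer to create the PV and PVC with kubectl instead of the console, the following is a minimal sketch of a statically provisioned OSS volume. The Secret name, storage size, and OSS endpoint are assumptions; replace the bucket name, endpoint, and AccessKey pair with your own values.
# Minimal sketch; values other than the names llm-model and the OSS path are assumptions.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret            # assumed Secret name that holds the AccessKey pair
  namespace: default
stringData:
  akId: <AccessKey ID>
  akSecret: <AccessKey Secret>
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi             # assumed size; OSS does not enforce capacity
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name>
      url: oss-cn-beijing-internal.aliyuncs.com   # assumed endpoint; use your bucket's region
      path: /models/Qwen1.5-0.5B-Chat
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
  namespace: default
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
EOF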
Step 2: Deploy inference services
Start two Qwen inference services. Each inference service requires 6 GB of GPU memory.
To start the second Qwen inference service, run the same command and change --name=qwen1 to --name=qwen2, as shown after the parameter table below.
arena serve kserve \
--name=qwen1 \
--image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
--gpumemory=6 \
--cpu=3 \
--memory=8Gi \
--data="llm-model:/mnt/models/Qwen1.5-0.5B-Chat" \
"python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen1.5-0.5B-Chat --dtype=half --max-model-len=4096"Configure the parameters described in the following table.
Parameter | Required | Description |
--- | --- | --- |
--name | Yes | The name of the inference service. The name must be globally unique. |
--image | Yes | The address of the inference service image. |
--gpumemory | No | The amount of GPU memory requested by the inference service. Example: 6, which indicates that 6 GB of GPU memory is requested. |
--cpu | No | The number of vCPUs requested by the inference service. |
--memory | No | The amount of memory requested by the inference service. |
--data | No | The model data to mount to the inference service. In this example, the volume llm-model is mounted to the /mnt/models/Qwen1.5-0.5B-Chat directory of the container. |
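For reference, the command for the second inference service differs only in the --name parameter:
arena serve kserve \
    --name=qwen2 \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
    --gpumemory=6 \
    --cpu=3 \
    --memory=8Gi \
    --data="llm-model:/mnt/models/Qwen1.5-0.5B-Chat" \
    "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen1.5-0.5B-Chat --dtype=half --max-model-len=4096"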
Step 3: Verify the inference services
Query the status of the inference services.
kubectl get pod -o wide | grep qwen
Expected output:
qwen1-predictor-856568bdcf-5pfdq   1/1   Running   0   7m10s   10.130.XX.XX   cn-beijing.172.16.XX.XX   <none>   <none>
qwen2-predictor-6b477b587d-dpdnj   1/1   Running   0   4m3s    10.130.XX.XX   cn-beijing.172.16.XX.XX   <none>   <none>
The expected output indicates that qwen1 and qwen2 are deployed on the same GPU-accelerated node (cn-beijing.172.16.XX.XX).
Run the following commands to log on to the pods where the inference services are deployed and view the amount of GPU memory allocated to the pods:
kubectl exec -it qwen1-predictor-856568bdcf-5pfdq -- nvidia-smi   # Log on to the pod where the first inference service is deployed.
kubectl exec -it qwen2-predictor-6b477b587d-dpdnj -- nvidia-smi   # Log on to the pod where the second inference service is deployed.
Expected output:
The output shows that each pod can use at most 6 GB of GPU memory and that 6 GB of GPU memory is allocated to each pod. This indicates that the node has sufficient GPU memory for the pods of the two inference services.
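If the kubectl-inspect-cgpu plugin for GPU sharing is installed in your environment (an assumption; this example does not require it), you can also view the node-level allocation of the shared GPU:
# Lists the GPU sharing allocation per node; requires the kubectl-inspect-cgpu plugin.
kubectl inspect cgpu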
Access one of the inference services by using the IP address of the NGINX Ingress.
curl -H "Host: $(kubectl get inferenceservice qwen1 -o jsonpath='{.status.url}' | cut -d "/" -f 3)" \ -H "Content-Type: application/json" \ http://$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'):80/v1/chat/completions \ -d '{ "model": "qwen", "messages": [{"role": "user", "content": "Test"}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10 }'Expected output:
{"id":"cmpl-bbca59499ab244e1aabfe2c354bf6ad5","object":"chat.com pletion","created":1719303373,"model":"qwen","options":[{"index":0,"message":{"role":"assistant","content":"OK. What do you want to test?"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":21,"total_tokens":31,"completion_tokens":10}}The output indicates that the model can generate a response based on the given prompt. In this example, the prompt is a test request.
(Optional) Step 4: Clear the environment
If you no longer need the resources, clear the environment promptly.
Run the following commands to delete the inference services:
arena serve delete qwen1
arena serve delete qwen2
Run the following commands to delete the PVC and the PV:
kubectl delete pvc llm-model
kubectl delete pv llm-model