Vectorized Large Language Model (vLLM) is a high-performance large language model (LLM) inference library that supports multiple model formats and acceleration backends. vLLM is suitable for deploying an LLM as an inference service. This topic describes how to use vLLM to deploy a model as an inference service. In this example, the Qwen-7B-Chat-Int8 model is deployed on NVIDIA V100 GPUs.
For more information about vLLM, see vllm-project.
Prerequisites
A Container Service for Kubernetes (ACK) managed cluster or an ACK dedicated cluster with GPU-accelerated nodes is created. The cluster runs Kubernetes 1.22 or later and uses Compute Unified Device Architecture (CUDA) 12.0 or later. For more information, see Create an ACK cluster with GPU-accelerated nodes or Create an ACK dedicated cluster with GPU-accelerated nodes.
By default, GPU-accelerated nodes use CUDA 11. You can add the ack.aliyun.com/nvidia-driver-version:525.105.17 label to the GPU-accelerated node pool to specify CUDA 12 for the GPU-accelerated nodes. For more information, see Specify an NVIDIA driver version for nodes by adding a label. A verification sketch is provided after this list.
The ack-kserve component is installed. For more information, see Install ack-kserve.
The cloud-native AI suite is installed.
The Arena client of version 0.9.15 or later is installed. For more information, see Configure the Arena client.
Object Storage Service (OSS) is activated. For more information, see Activate OSS.
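(Optional) To confirm that the GPU-accelerated nodes run the expected NVIDIA driver version, you can inspect the nodes with kubectl. The following commands are a minimal sketch that assumes you have kubectl access to the cluster; <node-name> is a placeholder for one of your GPU-accelerated nodes.
# List nodes together with the NVIDIA driver version label that is added to the node pool.
kubectl get nodes -L ack.aliyun.com/nvidia-driver-version
# Check the GPU resources that a specific node advertises.
kubectl describe node <node-name> | grep nvidia.com/gpu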
Step 1: Prepare model data and upload the model data to an OSS bucket
You can use an OSS bucket or a File Storage NAS (NAS) file system to prepare model data. For more information, see Mount a statically provisioned OSS volume or Mount a statically provisioned NAS volume. In this example, an OSS bucket is used.
Download a model. In this example, a Qwen-7B-Chat-Int8 model is used.
Run the following command to install Git:
sudo yum install git
Run the following command to install the Git Large File Storage (LFS) plug-in:
sudo yum install git-lfs
Run the following command to clone the Qwen-7B-Chat-Int8 repository from the ModelScope community to your local host:
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen-7B-Chat-Int8.git
Run the following command to go to the directory in which the Qwen-7B-Chat-Int8 repository is stored:
cd Qwen-7B-Chat-Int8
Run the following command to download large files managed by LFS from the directory in which the Qwen-7B-Chat-Int8 repository is stored:
git lfs pull
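(Optional) Run the following commands to confirm that the large files managed by LFS are fully downloaded before you upload them. This check is a sketch; in the output of git lfs ls-files, an asterisk (*) indicates that the file content has been downloaded, whereas a hyphen (-) indicates that only a pointer file exists.
# List the files that are tracked by Git LFS and their download status.
git lfs ls-files
# Check the total size of the local repository directory.
du -sh .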
Upload the downloaded Qwen-7B-Chat-Int8 model files to the OSS bucket.
Log on to the OSS console and view and record the name of the OSS bucket that you created.
For more information about how to create an OSS bucket, see Create a bucket.
Install and configure ossutil. For more information, see Install ossutil.
Run the following command to create a directory named Qwen-7B-Chat-Int8 in the OSS bucket:
ossutil mkdir oss://<your-bucket-name>/Qwen-7B-Chat-Int8
Run the following command to upload the model files to the OSS bucket:
ossutil cp -r ./Qwen-7B-Chat-Int8 oss://<your-bucket-name>/Qwen-7B-Chat-Int8
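(Optional) Run the following command to confirm that the model files are uploaded. The command is a sketch and uses the same bucket name placeholder as the preceding commands.
# List the objects in the Qwen-7B-Chat-Int8 directory of the OSS bucket.
ossutil ls oss://<your-bucket-name>/Qwen-7B-Chat-Int8/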
Configure a persistent volume (PV) and a persistent volume claim (PVC) that are named llm-model for the cluster. For more information, see Mount a statically provisioned OSS volume.
The following parameters are used to create the PV.
PV Type: The type of the PV. In this example, OSS is selected.
Volume Name: The name of the PV. In this example, the PV is named llm-model.
Access Certificate: The AccessKey pair that is used to access the OSS bucket. The AccessKey pair consists of an AccessKey ID and an AccessKey secret.
Bucket ID: The name of the OSS bucket. Select the OSS bucket that you created.
OSS Path: The path in which the model resides. Example: /Qwen-7B-Chat-Int8.
The following parameters are used to create the PVC.
PVC Type: The type of the PVC. In this example, OSS is selected.
Name: The name of the PVC. In this example, the PVC is named llm-model.
Allocation Mode: In this example, Existing Volumes is selected.
Existing Volumes: Click Select PV. In the Select PV dialog box, find the PV that you want to use and click Select in the Actions column.
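If you prefer to create the PV and the PVC declaratively instead of in the console, you can apply a manifest similar to the following one. The manifest is a minimal sketch for a statically provisioned OSS volume: the Secret named oss-secret (which stores the AccessKey pair), the bucket name, the endpoint, and the storage size are assumptions that you must adjust to your environment. See Mount a statically provisioned OSS volume for the authoritative parameters.
# A sketch of a statically provisioned OSS PV and PVC, both named llm-model.
# The Secret oss-secret, <your-bucket-name>, and <your-oss-endpoint> are assumptions.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi   # Nominal value. OSS does not enforce capacity.
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name>
      url: <your-oss-endpoint>
      path: /Qwen-7B-Chat-Int8
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
EOF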
Step 2: Deploy an inference service
Run the following command to query the GPU resources that are available in the cluster:
arena top node
The output shows the GPU-accelerated nodes in the cluster and the number of GPUs that are available on each node.
Run the following command to start the inference service named qwen:
arena serve kserve \
    --name=qwen \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
    --gpus=1 \
    --cpu=4 \
    --memory=12Gi \
    --data="llm-model:/mnt/models/Qwen-7B-Chat-Int8" \
    "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization 0.95 --quantization gptq --max-model-len=6144"
The following list describes the parameters in the command.
--name (required): The name of the inference service that you submit. The name must be globally unique.
--image (required): The image address of the inference service.
--gpus (optional): The number of GPUs to be used by the inference service. Default value: 0.
--cpu (optional): The number of CPU cores to be used by the inference service.
--memory (optional): The size of memory to be used by the inference service.
--data (optional): The model data to be mounted into the inference service. In this example, the llm-model volume is mounted to the /mnt/models/Qwen-7B-Chat-Int8 directory in the pod.
Expected output:
inferenceservice.serving.kserve.io/qwen created
INFO[0006] The Job qwen has been submitted successfully
INFO[0006] You can run `arena serve get qwen --type kserve -n default` to check the job status
The preceding output indicates that the inference service is deployed.
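(Optional) In addition to the Arena output, you can check the Kubernetes resources that back the service. The following commands are a sketch and assume that the service runs in the default namespace; the pod label is the one that KServe typically adds to predictor pods.
# Check the InferenceService object that Arena creates.
kubectl get inferenceservice qwen
# Check the predictor pods of the service.
kubectl get pods -l serving.kserve.io/inferenceservice=qwen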
Step 3: Verify the inference service
Run the following command to view the deployment progress of the inference service deployed by using KServe:
arena serve get qwen
Expected output:
The preceding output indicates that the inference service is deployed by using KServe and that the model can be accessed at http://qwen-default.example.com.
Run the following command to obtain the IP address of the NGINX Ingress controller and use the IP address to access the inference service:
# Obtain the IP address of the NGINX Ingress controller.
NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
# Obtain the hostname of the inference service.
SERVICE_HOSTNAME=$(kubectl get inferenceservice qwen -o jsonpath='{.status.url}' | cut -d "/" -f 3)
# Send a request to access the inference service.
curl -H "Host: $SERVICE_HOSTNAME" -H "Content-Type: application/json" http://$NGINX_INGRESS_IP:80/v1/chat/completions -d '{"model": "qwen", "messages": [{"role": "user", "content": "Perform a test."}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}'
Expected output:
The preceding output indicates that the request is correctly sent to the inference service and the service returns an expected response in the JSON format.
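(Optional) You can also query the model list endpoint that the vLLM OpenAI-compatible server exposes to confirm that the served model is registered as qwen. The following command is a sketch and reuses the variables that are set in the preceding step.
# Query the OpenAI-compatible model list endpoint of the inference service.
curl -H "Host: $SERVICE_HOSTNAME" http://$NGINX_INGRESS_IP:80/v1/models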
Step 4: (Optional) Delete the inference service
Before you delete the inference service, make sure that you no longer require the inference service and its related resources.
Run the following command to delete the inference service:
arena serve delete qwen
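If you no longer need the model volume, you can also remove the PVC, the PV, and the model files that were created in Step 1. The following commands are a sketch; run them only after you confirm that no other workload uses the volume or the OSS data.
# Delete the PVC and the PV that were created for the model.
kubectl delete pvc llm-model
kubectl delete pv llm-model
# Delete the model files from the OSS bucket. Replace the bucket name placeholder.
ossutil rm -r oss://<your-bucket-name>/Qwen-7B-Chat-Int8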
References
You can configure Managed Service for Prometheus to monitor the inference service and detect exceptions. This way, you can handle exceptions at the earliest opportunity. For more information, see Configure Managed Service for Prometheus for a service deployed by using KServe.
For more information about how to accelerate models by using KServe, see Accelerate data pulling for a model by using Fluid in KServe.