Container Service for Kubernetes: Deploy a vLLM model as an inference service

Last Updated: Nov 01, 2024

vLLM is a high-performance large language model (LLM) inference library that supports multiple model formats and accelerated inference backends, which makes it well suited to deploying an LLM as an inference service. This topic describes how to deploy a vLLM model as an inference service. In this example, a Qwen-7B-Chat-Int8 model that runs on NVIDIA V100 GPUs is used.

Note

For more information about vLLM, see vllm-project.

Prerequisites

  • A Container Service for Kubernetes (ACK) cluster that contains GPU-accelerated nodes (NVIDIA V100 GPUs in this example) is created.
  • The Arena client is installed, because the arena commands in this topic depend on it.
  • KServe is installed in the cluster.

Step 1: Prepare model data and upload the model data to an OSS bucket

You can use an OSS bucket or a File Storage NAS (NAS) file system to prepare model data. For more information, see Mount a statically provisioned OSS volume or Mount a statically provisioned NAS volume. In this example, an OSS bucket is used.

  1. Download a model. In this example, a Qwen-7B-Chat-Int8 model is used.

    1. Run the following command to install Git:

      sudo yum install git
    2. Run the following command to install the Git Large File Storage (LFS) extension:

      sudo yum install git-lfs
    3. Run the following command to clone the Qwen-7B-Chat-Int8 repository from the ModelScope community to your local host:

      GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen-7B-Chat-Int8.git
    4. Run the following command to go to the directory in which the Qwen-7B-Chat-Int8 repository is stored:

      cd Qwen-7B-Chat-Int8
    5. Run the following command in the Qwen-7B-Chat-Int8 directory to download the large files that are managed by LFS:

      git lfs pull
  2. Upload the downloaded Qwen-7B-Chat-Int8 model files to the OSS bucket.

    1. Log on to the OSS console, and then record the name of the OSS bucket that you created.

      For more information about how to create an OSS bucket, see Create a bucket.

    2. Install and configure ossutil. For more information, see Install ossutil.

    3. Run the following command to create a directory named Qwen-7B-Chat-Int8 in the OSS bucket:

      ossutil mkdir oss://<your-bucket-name>/Qwen-7B-Chat-Int8
    4. Run the following command to upload the model files to the OSS bucket:

      ossutil cp -r ./Qwen-7B-Chat-Int8 oss://<your-bucket-name>/Qwen-7B-Chat-Int8
  3. Configure a persistent volume (PV) and a persistent volume claim (PVC) that are named llm-model for the cluster. For more information, see Mount a statically provisioned OSS volume. A minimal kubectl-based sketch is also provided after this list.

    • The following table describes the basic parameters that are used to create the PV.

      Parameter           Description
      PV Type             The type of the PV. In this example, OSS is selected.
      Volume Name         The name of the PV. In this example, the PV is named llm-model.
      Access Certificate  The AccessKey pair that is used to access the OSS bucket. The AccessKey pair consists of an AccessKey ID and an AccessKey secret.
      Bucket ID           The name of the OSS bucket. Select the OSS bucket that you created.
      OSS Path            The path in which the model resides. Example: /Qwen-7B-Chat-Int8.

    • The following table describes the basic parameters that are used to create the PVC.

      Parameter         Description
      PVC Type          The type of the PVC. In this example, OSS is selected.
      Name              The name of the PVC. In this example, the PVC is named llm-model.
      Allocation Mode   The mode in which the volume is allocated. In this example, Existing Volumes is selected.
      Existing Volumes  Click Select PV. In the Select PV dialog box, find the PV that you want to use and click Select in the Actions column.
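
If you prefer to create these objects with kubectl instead of the console, the following is a minimal sketch of the statically provisioned OSS volume described above. The Secret name oss-secret, the OSS endpoint, the storage size, and the access mode are assumptions for illustration; replace them with values that match your environment.

# Store the AccessKey pair in a Secret that the OSS CSI driver reads (the Secret name oss-secret is an example).
kubectl create secret generic oss-secret \
    --from-literal=akId=<your-AccessKey-ID> \
    --from-literal=akSecret=<your-AccessKey-secret>

# Create a statically provisioned OSS PV and a PVC that are both named llm-model.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi                  # example size
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name>
      url: oss-cn-beijing-internal.aliyuncs.com   # replace with the OSS endpoint of your region
      path: /Qwen-7B-Chat-Int8
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
  namespace: default
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
EOF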

Step 2: Deploy an inference service

  1. Run the following command to query the GPU resources that are available in the cluster:

    arena top node

    The output shows the GPU-accelerated nodes in the cluster that can run the inference service and the number of GPUs that are available on each node.

  2. Run the following command to start an inference service named qwen that runs the vLLM image:

    arena serve kserve \
        --name=qwen \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
        --gpus=1 \
        --cpu=4 \
        --memory=12Gi \
        --data="llm-model:/mnt/models/Qwen-7B-Chat-Int8" \
        "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization 0.95 --quantization gptq --max-model-len=6144"

    The following table describes the parameters.

    Parameter  Required  Description
    --name     Yes       The name of the inference service that you submit. The name must be globally unique.
    --image    Yes       The image address of the inference service.
    --gpus     No        The number of GPUs to be used by the inference service. Default value: 0.
    --cpu      No        The number of CPU cores to be used by the inference service.
    --memory   No        The amount of memory to be used by the inference service.
    --data     No        The model data to be mounted into the inference service, in the format <PVC name>:<path in the pod>. In this example, the PVC named llm-model is mounted to the /mnt/models/Qwen-7B-Chat-Int8 directory in the pod.

    Expected output:

    inferenceservice.serving.kserve.io/qwen created
    INFO[0006] The Job qwen has been submitted successfully 
    INFO[0006] You can run `arena serve get qwen --type kserve -n default` to check the job status 

    The preceding output indicates that the inference service is deployed.
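
    Optionally, you can inspect the Kubernetes objects that were created for the service. The following is a minimal sketch that assumes the default namespace and the serving.kserve.io/inferenceservice label that KServe normally applies to predictor pods:

    # View the KServe InferenceService object that backs the service.
    kubectl get inferenceservice qwen -n default

    # List the predictor pods and check the vLLM server logs.
    kubectl get pods -n default -l serving.kserve.io/inferenceservice=qwen
    kubectl logs -n default -l serving.kserve.io/inferenceservice=qwen --tail=50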

Step 3: Verify the inference service

  1. Run the following command to view the deployment progress of the inference service deployed by using KServe:

    arena serve get qwen

    Expected output:

    Name:       qwen
    Namespace:  default
    Type:       KServe
    Version:    1
    Desired:    1
    Available:  1
    Age:        2m
    Address:    http://qwen-default.example.com
    Port:       :80
    GPU:        1
    
    
    Instances:
      NAME                             STATUS   AGE  READY  RESTARTS  GPU  NODE
      ----                             ------   ---  -----  --------  ---  ----
      qwen-predictor-5485d6d8d5-kvj7g  Running  2m   1/1    0         1    cn-beijing.XX.XX.XX.XX

    The preceding output indicates that the inference service is deployed by using KServe and the model can be accessed from http://qwen-default.example.com.

  2. Run the following command to obtain the IP address of the NGINX Ingress controller and use the IP address to access the inference service:

    # Obtain the IP address of the NGINX Ingress controller. 
    NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
    # Obtain the hostname of the inference service. 
    SERVICE_HOSTNAME=$(kubectl get inferenceservice qwen -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    # Send a request to access the inference service. 
    curl -H "Host: $SERVICE_HOSTNAME" -H "Content-Type: application/json" http://$NGINX_INGRESS_IP:80/v1/chat/completions -d '{"model": "qwen", "messages": [{"role": "user", "content": "Perform a text."}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}

    Expected output:

    {"id":"cmpl-b7579597aa284f118718b22b83b726f8","object":"chat.completion","created":1715589652,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"OK. What do you want to test? <|im_end|>"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":10,"total_tokens":20,"completion_tokens":10}}% 

    The preceding output indicates that the request is correctly sent to the inference service and the service returns an expected response in the JSON format.
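
    The server started by vLLM exposes an OpenAI-compatible API, so you can also request a streaming response. The following is a sketch of the same request with "stream": true; it reuses the $NGINX_INGRESS_IP and $SERVICE_HOSTNAME variables obtained in the previous step:

    # Send a streaming chat completion request; tokens are returned incrementally as server-sent events.
    curl -H "Host: $SERVICE_HOSTNAME" -H "Content-Type: application/json" \
        http://$NGINX_INGRESS_IP:80/v1/chat/completions \
        -d '{"model": "qwen", "messages": [{"role": "user", "content": "Perform a test."}], "max_tokens": 50, "stream": true}'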

Step 4: (Optional) Delete the inference service

Important

Before you delete the inference service, make sure that you no longer require the inference service and its related resources.

Run the following command to delete the inference service:

arena serve delete qwen
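
If you also no longer need the model volume, you can delete the PVC and the PV that were created in Step 1. This sketch assumes the names used in this topic and the default namespace; it does not delete the model data in the OSS bucket.

# Delete the PVC and then the PV. The OSS bucket and the model files in it are not affected.
kubectl delete pvc llm-model -n default
kubectl delete pv llm-model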

References