Container Service for Kubernetes: Deploy inference services that share a GPU

Last Updated: Jun 04, 2025

In some scenarios, you may want multiple inference tasks to share the same GPU to improve GPU utilization. This topic uses the Qwen1.5-0.5B-Chat model and a V100 GPU to describe how to use KServe to deploy inference services that share a GPU.

Prerequisites

Step 1: Prepare model data

You can use an OSS bucket or a File Storage NAS (NAS) file system to prepare model data. For more information, see Mount a statically provisioned ossfs 1.0 volume or Mount a statically provisioned NAS volume. In this example, an OSS bucket is used.

  1. Download the model. In this example, the Qwen1.5-0.5B-Chat model is used.

    git lfs install
    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/Qwen1.5-0.5B-Chat.git
    cd Qwen1.5-0.5B-Chat
    git lfs pull
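
    Optionally, list the downloaded files to confirm that the weights were pulled. The exact file names depend on the model repository; for Qwen1.5-0.5B-Chat you should see a config.json, tokenizer files, and the model weights.

    ls -lh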
  2. Upload the Qwen1.5-0.5B-Chat files to Object Storage Service (OSS).

    Note

    For more information about how to install and use ossutil, see Install ossutil.

    ossutil mkdir oss://<your-bucket-name>/models/Qwen1.5-0.5B-Chat
    ossutil cp -r ./Qwen1.5-0.5B-Chat oss://<your-bucket-name>/models/Qwen1.5-0.5B-Chat
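
    Optionally, verify the upload by listing the destination path. ossutil ls is a standard ossutil subcommand.

    ossutil ls oss://<your-bucket-name>/models/Qwen1.5-0.5B-Chat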
  3. Configure a persistent volume (PV) named llm-model and a persistent volume claim (PVC) named llm-model for the cluster. (A YAML sketch for creating the same volumes with kubectl is shown at the end of this step.)

    • The following table describes the parameters of the PV.

      Parameter            Description
      PV Type              OSS
      Volume Name          llm-model
      Access Certificate   Specify the AccessKey ID and the AccessKey secret used to access the OSS bucket.
      Bucket ID            Select the OSS bucket that you created in the previous step.
      OSS Path             Select the path of the model, such as /Qwen1.5-0.5B-Chat.

    • The following table describes the parameters of the PVC.

      Parameter            Description
      PVC Type             OSS
      Volume Name          llm-model
      Allocation Mode      Select Existing Volumes.
      Existing Volumes     Click the Existing Volumes hyperlink and select the PV that you created.
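
    • If you prefer to create the volumes with kubectl instead of the console, the following manifest is a minimal sketch of a statically provisioned OSS PV and PVC. It assumes that the OSS CSI driver (ossplugin.csi.alibabacloud.com) is installed and that a Secret named oss-secret that contains the akId and akSecret keys exists in the default namespace; the bucket name, endpoint, path, and storage size are placeholders that you must adjust to your environment.

      apiVersion: v1
      kind: PersistentVolume
      metadata:
        name: llm-model
        labels:
          alicloud-pvname: llm-model
      spec:
        capacity:
          storage: 30Gi
        accessModes:
          - ReadOnlyMany
        persistentVolumeReclaimPolicy: Retain
        csi:
          driver: ossplugin.csi.alibabacloud.com
          volumeHandle: llm-model
          nodePublishSecretRef:
            name: oss-secret           # Secret that holds akId and akSecret (assumed to exist)
            namespace: default
          volumeAttributes:
            bucket: "<your-bucket-name>"
            url: "oss-cn-beijing-internal.aliyuncs.com"   # Replace with the endpoint of your bucket
            otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
            path: "/models/Qwen1.5-0.5B-Chat"             # Path that you uploaded in the previous step
      ---
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: llm-model
      spec:
        accessModes:
          - ReadOnlyMany
        resources:
          requests:
            storage: 30Gi
        selector:
          matchLabels:
            alicloud-pvname: llm-model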

Step 2: Deploy inference services

Start two Qwen inference services that share the same GPU. Each inference service requests 6 GB of GPU memory.

Run the following command to start the first inference service. To start the second inference service, run the same command and change --name=qwen1 to --name=qwen2.

arena serve kserve \
    --name=qwen1 \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
    --gpumemory=6 \
    --cpu=3 \
    --memory=8Gi \
    --data="llm-model:/mnt/models/Qwen1.5-0.5B-Chat" \
    "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen1.5-0.5B-Chat --dtype=half --max-model-len=4096"

The following table describes the parameters in the command.

Parameter     Required   Description
--name        Yes        The name of the inference service. The name must be globally unique.
--image       Yes        The address of the inference service image.
--gpumemory   No         The requested amount of GPU memory. Example: --gpumemory=6. Make sure that the total amount of GPU memory requested by all inference services on a GPU does not exceed the memory capacity of that GPU.
--cpu         No         The number of vCPUs requested by the inference service.
--memory      No         The amount of memory requested by the inference service.
--data        No         The model data to mount, in the format <pvc-name>:<mount-path>. In this example, the llm-model volume is mounted to the /mnt/models/Qwen1.5-0.5B-Chat directory of the container.
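
After you submit both services, you can optionally confirm that they were created by using the Arena CLI. arena serve list is a standard Arena subcommand; the exact columns in its output depend on your Arena version.

arena serve list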

Step 3: Verify the inference services

  1. Query the status of the inference services.

    kubectl get pods -o wide | grep qwen

    Expected output:

    qwen1-predictor-856568bdcf-5pfdq   1/1     Running   0          7m10s   10.130.XX.XX   cn-beijing.172.16.XX.XX   <none>           <none>
    qwen2-predictor-6b477b587d-dpdnj   1/1     Running   0          4m3s    10.130.XX.XX   cn-beijing.172.16.XX.XX   <none>           <none>

    The expected output indicates that qwen1 and qwen2 are deployed on the same GPU-accelerated node (cn-beijing.172.16.XX.XX).
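
    Optionally, inspect the allocated resources on that node to see how much of the GPU has been handed out to pods. This is a generic kubectl check; the resource names that appear depend on the GPU scheduling components installed in the cluster.

    kubectl describe node cn-beijing.172.16.XX.XX | grep -A 10 "Allocated resources"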

  2. Run the following commands to execute nvidia-smi in the pods of the inference services and view the amount of GPU memory allocated to each pod:

    kubectl exec -it qwen1-predictor-856568bdcf-5pfdq -- nvidia-smi # Query the pod of the first inference service.
    kubectl exec -it qwen2-predictor-6b477b587d-dpdnj -- nvidia-smi # Query the pod of the second inference service.

    Expected output:

    • The GPU memory allocated to the first inference service

      Fri Jun 28 06:20:43 2024       
      +---------------------------------------------------------------------------------------+
      | NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
      |-----------------------------------------+----------------------+----------------------+
      | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
      |                                         |                      |               MIG M. |
      |=========================================+======================+======================|
      |   0  Tesla V100-SXM2-16GB           On  | 00000000:00:07.0 Off |                    0 |
      | N/A   39C    P0              53W / 300W |   5382MiB /  6144MiB |      0%      Default |
      |                                         |                      |                  N/A |
      +-----------------------------------------+----------------------+----------------------+
                                                                                               
      +---------------------------------------------------------------------------------------+
      | Processes:                                                                            |
      |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
      |        ID   ID                                                             Usage      |
      |=======================================================================================|
      +---------------------------------------------------------------------------------------+
    • The GPU memory allocated to the second inference service

      Fri Jun 28 06:40:17 2024       
      +---------------------------------------------------------------------------------------+
      | NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
      |-----------------------------------------+----------------------+----------------------+
      | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
      |                                         |                      |               MIG M. |
      |=========================================+======================+======================|
      |   0  Tesla V100-SXM2-16GB           On  | 00000000:00:07.0 Off |                    0 |
      | N/A   39C    P0              53W / 300W |   5382MiB /  6144MiB |      0%      Default |
      |                                         |                      |                  N/A |
      +-----------------------------------------+----------------------+----------------------+
                                                                                               
      +---------------------------------------------------------------------------------------+
      | Processes:                                                                            |
      |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
      |        ID   ID                                                             Usage      |
      |=======================================================================================|
      +---------------------------------------------------------------------------------------+

    The output shows that each pod can use at most 6,144 MiB (6 GiB) of GPU memory, which indicates that the GPU memory limit takes effect and that the GPU provides sufficient memory for the pods of both inference services.
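
    You can also check the GPU memory limit recorded in the pod spec. This sketch assumes that the cluster uses the ACK shared GPU scheduling component, which exposes GPU memory as the aliyun.com/gpu-mem extended resource; if so, the limits should include aliyun.com/gpu-mem: 6 for each pod.

    kubectl get pod qwen1-predictor-856568bdcf-5pfdq -o jsonpath='{.spec.containers[*].resources.limits}'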

  3. Access one of the inference services by using the IP address of the NGINX Ingress.

    curl -H "Host: $(kubectl get inferenceservice qwen1 -o jsonpath='{.status.url}' | cut -d "/" -f 3)" \
         -H "Content-Type: application/json" \
         http://$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'):80/v1/chat/completions \
         -d '{
                "model": "qwen", 
                "messages": [{"role": "user", "content": "Test"}], 
                "max_tokens": 10, 
                "temperature": 0.7, 
                "top_p": 0.9, 
                "seed": 10
             }'
    

    Expected output:

    {"id":"cmpl-bbca59499ab244e1aabfe2c354bf6ad5","object":"chat.com pletion","created":1719303373,"model":"qwen","options":[{"index":0,"message":{"role":"assistant","content":"OK. What do you want to test?"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":21,"total_tokens":31,"completion_tokens":10}}

    The output indicates that the model can generate a response based on the given prompt. In this example, the prompt is a test request.
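
    To send the same request to the second inference service, query the host of the qwen2 InferenceService instead. The following variant only changes the service name and omits the optional sampling parameters:

    curl -H "Host: $(kubectl get inferenceservice qwen2 -o jsonpath='{.status.url}' | cut -d "/" -f 3)" \
         -H "Content-Type: application/json" \
         http://$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'):80/v1/chat/completions \
         -d '{"model": "qwen", "messages": [{"role": "user", "content": "Test"}], "max_tokens": 10}'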

(Optional) Step 4: Clear the environment

If you no longer need the resources, clear the environment promptly.

  • Run the following commands to delete the inference services:

    arena serve delete qwen1
    arena serve delete qwen2
  • Run the following commands to delete the PV and the PVC:

    kubectl delete pvc llm-model
    kubectl delete pv llm-model