Container Service for Kubernetes:Deploy inference services that share a GPU

Last Updated: Dec 25, 2025

To improve GPU utilization, you can run multiple model inference tasks on the same GPU. This topic uses the Qwen1.5-0.5B-Chat model and a V100 GPU as an example to describe how to use KServe to deploy model inference services that share a GPU.

Prerequisites

  • A Container Service for Kubernetes cluster that contains GPU-accelerated nodes is created, and GPU sharing with GPU memory isolation is enabled on the nodes.
  • The Arena client and KServe are installed in the cluster. The inference services in this topic are submitted with the arena serve kserve command.
  • An NGINX Ingress controller is deployed in the cluster. Step 3 accesses the services through the nginx-ingress-lb Service in the kube-system namespace.
  • ossutil is installed if you use an OSS bucket to store the model, as in this example.

Step 1: Prepare model data

You can use an OSS bucket or a File Storage NAS (NAS) file system to prepare model data. For more information, see Use an ossfs 1.0 statically provisioned volume or Mount a statically provisioned NAS volume. In this example, an OSS bucket is used.

  1. Download the model. This topic uses the Qwen1.5-0.5B-Chat model as an example.

    git lfs install
    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/Qwen1.5-0.5B-Chat.git
    cd Qwen1.5-0.5B-Chat
    git lfs pull
  2. Upload the downloaded Qwen1.5-0.5B-Chat files to OSS.

    Note

    For more information about how to install and use ossutil, see Install ossutil.

    ossutil mkdir oss://<your-bucket-name>/models/Qwen1.5-0.5B-Chat
    ossutil cp -r ./Qwen1.5-0.5B-Chat oss://<your-bucket-name>/models/Qwen1.5-0.5B-Chat
  3. Configure a persistent volume (PV) and a persistent volume claim (PVC) for the target cluster, as described in the following tables. A kubectl-based sketch of equivalent manifests follows this list.

    • The following table describes the basic configuration of the example PV.

      Configuration item       Description
      Persistent volume type   OSS
      Name                     llm-model
      Access certificate       Configure the AccessKey ID and AccessKey secret that are used to access OSS.
      Bucket ID                Select the OSS bucket that you created in the previous step.
      OSS path                 Select the path where the model is stored, such as /models/Qwen1.5-0.5B-Chat.

    • The following table describes the basic configuration of the example PVC.

      Configuration item             Description
      Persistent volume claim type   OSS
      Name                           llm-model
      Allocation mode                Select Existing persistent volume.
      Existing persistent volume     Click the Select Existing persistent volume link and select the PV that you created.
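
If you prefer to create the PV and PVC with kubectl instead of the console, the following is a minimal sketch of equivalent manifests, assuming the statically provisioned ossfs (OSS CSI) volume model described in the linked topic. The Secret name oss-secret, the 30Gi capacity, and the endpoint placeholder are illustrative assumptions; verify the volumeAttributes fields against the OSS volume topic for your CSI plugin version.

kubectl apply -f - <<'EOF'
# Illustrative Secret that stores the AccessKey pair used to mount the OSS bucket.
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
  namespace: default
stringData:
  akId: <your-AccessKey-ID>
  akSecret: <your-AccessKey-Secret>
---
# Statically provisioned OSS volume that points to the model path in the bucket.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi              # illustrative size; OSS does not enforce capacity
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model    # must match the PV name
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name>
      url: oss-<your-region>-internal.aliyuncs.com
      path: /models/Qwen1.5-0.5B-Chat
---
# PVC that binds to the PV above through the alicloud-pvname label selector.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
  namespace: default
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
EOF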

Step 2: Deploy the inference services

Start two Qwen inference services that share the same GPU. Each service requests 6 GB of GPU memory.

The following command starts the first service, qwen1. To start the second service, run the same command and change --name=qwen1 to --name=qwen2, as shown in the example after the command.
arena serve kserve \
    --name=qwen1 \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
    --gpumemory=6 \
    --cpu=3 \
    --memory=8Gi \
    --data="llm-model:/mnt/models/Qwen1.5-0.5B-Chat" \
    "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen1.5-0.5B-Chat --dtype=half --max-model-len=4096"

The following table describes the parameters.

Parameter      Required   Description
--name         Yes        The name of the inference service to submit. The name must be globally unique.
--image        Yes        The image address of the inference service.
--gpumemory    No         The amount of GPU memory to request. Example: --gpumemory=6. Make sure that the total GPU memory requested by all services on a node does not exceed the total GPU memory of the node.
--cpu          No         The number of vCPUs to request for the inference service.
--memory       No         The amount of memory to request for the inference service.
--data         No         The model data to mount, in the format <pvc-name>:<container-path>. In this example, the PVC named llm-model is mounted to the /mnt/models/Qwen1.5-0.5B-Chat directory of the container.
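
To see how --gpumemory is translated into a pod-level resource request, you can inspect the pod spec after deployment. On ACK clusters that use shared GPU scheduling, the request is expected to appear as the aliyun.com/gpu-mem extended resource; treat this as an assumption to verify on your cluster.

# Replace the pod name with the qwen1 predictor pod returned by "kubectl get pods | grep qwen".
kubectl get pod <qwen1-predictor-pod-name> \
    -o jsonpath='{.spec.containers[0].resources.limits}'
# On shared-GPU nodes, the limits are expected (assumption) to include an entry such as aliyun.com/gpu-mem: 6.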

Step 3: Verify the inference services

  1. Check the deployment status of the two Qwen inference services.

    kubectl get pods -o wide | grep qwen

    Expected output:

    qwen1-predictor-856568bdcf-5pfdq   1/1     Running   0          7m10s   10.130.XX.XX   cn-beijing.172.16.XX.XX   <none>           <none>
    qwen2-predictor-6b477b587d-dpdnj   1/1     Running   0          4m3s    10.130.XX.XX   cn-beijing.172.16.XX.XX   <none>           <none>

    The output shows that both qwen1 and qwen2 are deployed on the same GPU node, cn-beijing.172.16.XX.XX.

  2. Run the following commands to query the GPU memory allocated to the pod of each inference service.

    kubectl exec -it qwen1-predictor-856568bdcf-5pfdq -- nvidia-smi # Query GPU memory in the pod of the first inference service.
    kubectl exec -it qwen2-predictor-6b477b587d-dpdnj -- nvidia-smi # Query GPU memory in the pod of the second inference service.

    Expected output:

    • GPU memory allocated to the first inference service

      Fri Jun 28 06:20:43 2024       
      +---------------------------------------------------------------------------------------+
      | NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
      |-----------------------------------------+----------------------+----------------------+
      | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
      |                                         |                      |               MIG M. |
      |=========================================+======================+======================|
      |   0  Tesla V100-SXM2-16GB           On  | 00000000:00:07.0 Off |                    0 |
      | N/A   39C    P0              53W / 300W |   5382MiB /  6144MiB |      0%      Default |
      |                                         |                      |                  N/A |
      +-----------------------------------------+----------------------+----------------------+
                                                                                               
      +---------------------------------------------------------------------------------------+
      | Processes:                                                                            |
      |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
      |        ID   ID                                                             Usage      |
      |=======================================================================================|
      +---------------------------------------------------------------------------------------+
    • GPU memory allocated to the second inference service

      Fri Jun 28 06:40:17 2024       
      +---------------------------------------------------------------------------------------+
      | NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
      |-----------------------------------------+----------------------+----------------------+
      | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
      |                                         |                      |               MIG M. |
      |=========================================+======================+======================|
      |   0  Tesla V100-SXM2-16GB           On  | 00000000:00:07.0 Off |                    0 |
      | N/A   39C    P0              53W / 300W |   5382MiB /  6144MiB |      0%      Default |
      |                                         |                      |                  N/A |
      +-----------------------------------------+----------------------+----------------------+
                                                                                               
      +---------------------------------------------------------------------------------------+
      | Processes:                                                                            |
      |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
      |        ID   ID                                                             Usage      |
      |=======================================================================================|
      +---------------------------------------------------------------------------------------+

    The output shows that the GPU memory limit of each pod is 6 GB (6,144 MiB). This indicates that each pod is allocated 6 GB of GPU memory and that the two inference service pods successfully share the GPU memory of the same node.

  3. Access the first inference service, qwen1, through the NGINX Ingress gateway address. A sketch for the second service follows this list.

    curl -H "Host: $(kubectl get inferenceservice qwen1 -o jsonpath='{.status.url}' | cut -d "/" -f 3)" \
         -H "Content-Type: application/json" \
         http://$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'):80/v1/chat/completions \
         -d '{
                "model": "qwen", 
                "messages": [{"role": "user", "content": "This is a test."}], 
                "max_tokens": 10, 
                "temperature": 0.7, 
                "top_p": 0.9, 
                "seed": 10
             }'
    

    Expected output:

    {"id":"cmpl-bbca59499ab244e1aabfe2c354bf6ad5","object":"chat.completion","created":1719303373,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"OK. What do you want to test?"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":21,"total_tokens":31,"completion_tokens":10}}

    The output shows that the model can generate a response based on the given input, which is a test message in this example.
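
The second service can be tested in the same way; the sketch below only changes the InferenceService name that is used to resolve the Host header. The NGINX_IP and QWEN2_HOST variable names are illustrative.

# Resolve the gateway IP and the Host header for qwen2, then send the same request.
NGINX_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
QWEN2_HOST=$(kubectl get inferenceservice qwen2 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -H "Host: ${QWEN2_HOST}" \
     -H "Content-Type: application/json" \
     http://${NGINX_IP}:80/v1/chat/completions \
     -d '{"model": "qwen", "messages": [{"role": "user", "content": "This is a test."}], "max_tokens": 10}'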

(Optional) Step 4: Clean up the environment

If you no longer need the created resources, delete them promptly.

  • Run the following commands to delete the deployed model inference services.

    arena serve delete qwen1
    arena serve delete qwen2
  • Run the following commands to delete the PV and PVC that you created. You can confirm the cleanup with the commands after this list.

    kubectl delete pvc llm-model
    kubectl delete pv llm-model
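
Optionally, confirm that the resources have been removed. The commands below assume that no other Arena services or llm-model volumes exist in the cluster.

# List remaining Arena services; qwen1 and qwen2 should no longer appear.
arena serve list
# Check that the PV and PVC are gone; the grep prints nothing if the deletion succeeded.
kubectl get pv,pvc | grep llm-model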