
Container Service for Kubernetes: Deploy GPU-shared inference services

Last Updated: Mar 26, 2026

GPU-accelerated nodes are expensive, and a single inference service rarely saturates an entire GPU. Shared GPU scheduling lets you run multiple inference services on one GPU by slicing its memory into fixed-size allocations. This guide shows how to deploy two Qwen1.5-0.5B-Chat inference services on a single V100 GPU using KServe and Arena, with each service receiving a 6 GB memory slice.

How it works

ACK's shared GPU scheduling component implements GPU memory slicing. Each inference service declares how much GPU memory it needs via the --gpumemory flag, and the scheduler places multiple pods on the same GPU node as long as the total requested memory stays within the node's physical GPU capacity.

Use shared GPU scheduling when maximizing GPU utilization matters more than fault isolation between services. For workloads that require strict isolation, use dedicated GPU nodes.
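The placement rule reduces to simple arithmetic. The following sketch (illustrative shell, not the actual scheduler code) shows the fit check for the two 6 GB services in this guide on a 16 GB V100 node:

```shell
# Sketch of the fit check the scheduler applies per node (illustrative):
# a new pod fits only if its request plus the memory already allocated
# stays within the node's physical GPU memory.
node_gpu_mem=16   # V100 physical GPU memory, in GB
allocated=6       # already granted to the first service (qwen1)
request=6         # requested by the second service (qwen2)

if [ $((allocated + request)) -le "$node_gpu_mem" ]; then
  verdict="schedulable"
else
  verdict="unschedulable"
fi
echo "$verdict"
```

Because 6 + 6 = 12 GB fits within 16 GB, both pods land on the same node; a third 6 GB service would push the total to 18 GB and stay pending.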

Limitations

  • The total GPU memory requested across all pods on a node must not exceed the node's physical GPU memory.

  • GPU-accelerated nodes use CUDA 11 by default. This guide requires CUDA 12.0 or later.

  • ack-kserve must be in Raw Deployment mode.
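To confirm that a node meets the CUDA requirement, the version can be read from the nvidia-smi header. A sketch, using the sample header shown later in this guide (in practice, run nvidia-smi on the node and parse its first table row):

```shell
# Extract the CUDA version from an nvidia-smi header line (sketch).
# $header is copied from the sample output later in this guide.
header='| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |'
cuda=$(echo "$header" | grep -o 'CUDA Version: [0-9.]*' | awk '{print $3}')
echo "$cuda"   # 12.2 >= 12.0, so this node qualifies
```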

Prerequisites

Before you begin, ensure that you have:

  • An ACK cluster that contains GPU-accelerated nodes (this guide uses a Tesla V100 node).

  • The shared GPU scheduling component installed in the cluster.

  • The ack-kserve component installed and configured in Raw Deployment mode.

  • The Arena client installed.

Step 1: Prepare model data

Store the model in an Object Storage Service (OSS) bucket or an Apsara File Storage NAS file system. This guide uses OSS. For more information, see Use an ossfs 1.0 statically provisioned volume or Mount a statically provisioned NAS volume.

  1. Download the Qwen1.5-0.5B-Chat model.

    git lfs install
    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/Qwen1.5-0.5B-Chat.git
    cd Qwen1.5-0.5B-Chat
    git lfs pull
  2. Upload the model files to your OSS bucket.

    For installation and usage of ossutil, see Install ossutil.
    ossutil mkdir oss://<your-bucket-name>/models/Qwen1.5-0.5B-Chat
    ossutil cp -r ./Qwen1.5-0.5B-Chat oss://<your-bucket-name>/models/Qwen1.5-0.5B-Chat
  3. Create a persistent volume (PV) for the cluster using the following configuration.

    Configuration item      Value
    Persistent volume type  OSS
    Name                    llm-model
    Access certificate      The AccessKey ID and AccessKey secret that have access to the OSS bucket
    Bucket ID               The OSS bucket that stores the model files
    OSS path                /models/Qwen1.5-0.5B-Chat
  4. Create a persistent volume claim (PVC) bound to the PV.

    Configuration item            Value
    Persistent volume claim type  OSS
    Name                          llm-model
    Allocation mode               Select Existing persistent volume
    Existing persistent volume    Select the llm-model PV created in the previous step
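If you prefer kubectl to the console, the PV and PVC above can be declared in YAML. The following is a sketch of a statically provisioned OSS volume; the secret name (oss-secret), the storage size, and the endpoint URL are illustrative assumptions, so check the linked OSS volume guide for the authoritative fields:

```yaml
# Sketch of a statically provisioned OSS PV/PVC matching the console
# settings above. Secret name, storage size, and endpoint URL are
# illustrative assumptions.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model        # must match metadata.name
    nodePublishSecretRef:          # secret holding the AccessKey pair
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name>
      url: oss-cn-beijing-internal.aliyuncs.com   # your bucket's endpoint
      path: /models/Qwen1.5-0.5B-Chat
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
```

The label selector binds the claim to this specific PV rather than letting the control plane pick any matching volume.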

Step 2: Deploy the inference services

Deploy two Qwen inference services, each requesting 6 GB of GPU memory. The commands are identical except for --name.

Run the following command to start the first service:

arena serve kserve \
    --name=qwen1 \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
    --gpumemory=6 \
    --cpu=3 \
    --memory=8Gi \
    --data="llm-model:/mnt/models/Qwen1.5-0.5B-Chat" \
    "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen1.5-0.5B-Chat --dtype=half --max-model-len=4096"

To start the second service, run the same command with --name=qwen2.

The following table describes the key parameters.

Parameter Type Required Description
--name String Yes Name of the inference service. Must be unique within the cluster.
--image String Yes Container image for the inference service.
--gpumemory Integer (GB) No GPU memory to allocate to this service, in GB. For example, --gpumemory=6 allocates 6 GB. The total GPU memory requested by all services on a node must not exceed the node's physical GPU memory.
--cpu Integer No Number of vCPUs for the inference service.
--memory String No Amount of RAM for the inference service, for example, 8Gi.
--data String No Mounts a PVC into the container, in the format <pvc-name>:<container-path>. In this example, the llm-model PVC is mounted to /mnt/models/Qwen1.5-0.5B-Chat in the container.

Step 3: Verify the inference services

  1. Check that both pods are running on the same GPU node.

    kubectl get pod -owide | grep qwen

    Expected output:

    qwen1-predictor-856568bdcf-5pfdq   1/1     Running   0          7m10s   10.130.XX.XX   cn-beijing.172.16.XX.XX   <none>           <none>
    qwen2-predictor-6b477b587d-dpdnj   1/1     Running   0          4m3s    10.130.XX.XX   cn-beijing.172.16.XX.XX   <none>           <none>

    Both pods appear on the same node (cn-beijing.172.16.XX.XX), confirming that GPU sharing is active.

  2. Check the GPU memory allocated to each pod. Run the following commands — one per service:

    kubectl exec -it qwen1-predictor-856568bdcf-5pfdq -- nvidia-smi   # First service
    kubectl exec -it qwen2-predictor-6b477b587d-dpdnj -- nvidia-smi   # Second service

    Expected output (identical for both pods, apart from the timestamp):

    Fri Jun 28 06:20:43 2024
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  Tesla V100-SXM2-16GB           On  | 00000000:00:07.0 Off |                    0 |
    | N/A   39C    P0              53W / 300W |   5382MiB /  6144MiB |      0%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+
    
    +---------------------------------------------------------------------------------------+
    | Processes:                                                                            |
    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
    |        ID   ID                                                             Usage      |
    |=======================================================================================|
    +---------------------------------------------------------------------------------------+

    Each pod's GPU memory limit is 6 GB (6,144 MiB). This confirms that both services are sharing the node's GPU memory as configured.
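To check the cap programmatically rather than by eye, the memory column of the nvidia-smi output can be parsed. A sketch using the sample status line above (in practice, feed it the output of kubectl exec <pod> -- nvidia-smi):

```shell
# Extract the per-pod GPU memory cap from an nvidia-smi status line (sketch).
# $line is copied from the sample output above.
line='| N/A   39C    P0              53W / 300W |   5382MiB /  6144MiB |      0%      Default |'
# The third "|"-delimited field holds "used / total"; keep the total.
limit=$(echo "$line" | awk -F'|' '{print $3}' | awk -F'/' '{print $2}' | tr -d ' ')
echo "$limit"   # 6144MiB, i.e. the 6 GB requested via --gpumemory=6
```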

  3. Send a test request to the inference service through the NGINX Ingress gateway.

    curl -H "Host: $(kubectl get inferenceservice qwen1 -o jsonpath='{.status.url}' | cut -d "/" -f 3)" \
         -H "Content-Type: application/json" \
         http://$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'):80/v1/chat/completions \
         -d '{
                "model": "qwen",
                "messages": [{"role": "user", "content": "This is a test."}],
                "max_tokens": 10,
                "temperature": 0.7,
                "top_p": 0.9,
                "seed": 10
             }'

    Expected output:

    {"id":"cmpl-bbca59499ab244e1aabfe2c354bf6ad5","object":"chat.completion","created":1719303373,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"OK. What do you want to test?"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":21,"total_tokens":31,"completion_tokens":10}}

    The model returns a response, confirming that the inference service is working correctly.
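Beyond eyeballing the JSON, the response can be sanity-checked in a script. A sketch using an abridged copy of the sample response above (in practice, capture the curl output into the response variable):

```shell
# Sanity-check the chat completion response (sketch). $response is an
# abridged copy of the sample output above; in practice use:
#   response=$(curl -s ...)
response='{"id":"cmpl-bbca59499ab244e1aabfe2c354bf6ad5","object":"chat.completion","model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"OK. What do you want to test?"}}],"usage":{"completion_tokens":10}}'
# Pull out the served model name and the token count with a JSON parser.
model=$(echo "$response" | python3 -c 'import json,sys; print(json.load(sys.stdin)["model"])')
tokens=$(echo "$response" | python3 -c 'import json,sys; print(json.load(sys.stdin)["usage"]["completion_tokens"])')
echo "model=$model completion_tokens=$tokens"
```

The served model name should match the --served-model-name flag (qwen), and completion_tokens should not exceed the max_tokens value sent in the request.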

(Optional) Step 4: Clean up

Delete the resources when they are no longer needed.

Delete the inference services:

arena serve delete qwen1
arena serve delete qwen2

Delete the PVC and PV:

kubectl delete pvc llm-model
kubectl delete pv llm-model