Container Service for Kubernetes: Use LMDeploy to deploy the Qwen model inference service

Last Updated: Mar 26, 2026

This tutorial shows you how to deploy the Qwen1.5-4B-Chat model as an inference service on Container Service for Kubernetes (ACK) using the LMDeploy framework and an A10 GPU. By the end, you will have a running REST API endpoint that accepts chat completion requests.

About LMDeploy and Qwen1.5-4B-Chat

LMDeploy is an open-source toolkit for compressing, deploying, and serving large language models (LLMs):

  • Model compression and optimization: Applies weight quantization and key-value (KV) cache quantization to reduce model size and memory usage. Improves inference throughput through tensor parallelism and KV cache optimization.

  • Flexible deployment: Supports single-machine, multi-machine, and multi-GPU environments, including distributed deployment for scalability and high availability.

  • Service management: Reduces redundant computation and improves response speed through caching.

Qwen1.5-4B-Chat is a 4-billion-parameter LLM developed by Alibaba Cloud, trained on web text, domain-specific books, and code. For details, see the Qwen GitHub repository.

Prerequisites

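Before you begin, make sure you have:

  • An ACK cluster with at least one GPU node (this tutorial uses an A10 GPU).

  • The Arena client and kubectl installed and configured to access the cluster.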

Step 1: Prepare model data

Download the Qwen1.5-4B-Chat model, upload it to Object Storage Service (OSS), and create a persistent volume (PV) and persistent volume claim (PVC) in your ACK cluster.

To store the model on NAS instead of OSS, see Use NAS static persistent volume.

Download the model

  1. Install Git:

    # On Debian or Ubuntu, run apt install git instead.
    yum install git

  2. Install Git Large File Storage (LFS):

    # On Debian or Ubuntu, run apt install git-lfs instead.
    yum install git-lfs

  3. Clone the Qwen1.5-4B-Chat repository from ModelScope without downloading LFS files:

    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-4B-Chat.git

  4. Enter the repository directory and pull the LFS-managed model files:

    cd Qwen1.5-4B-Chat
    git lfs pull
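
If `git lfs pull` is skipped or interrupted, the weight files stay as small text pointers instead of real model data, and the inference service cannot load them. The following is a minimal sketch of a check you could run before uploading. It relies on the fact that Git LFS pointer files begin with the line `version https://git-lfs.github.com/spec/v1`. The function names `is_lfs_pointer` and `list_lfs_pointers` are hypothetical helpers, not part of Git LFS.

```shell
# Hypothetical helper: succeeds if the file is still a Git LFS pointer,
# that is, its first 42 bytes are exactly the LFS spec header line.
is_lfs_pointer() {
  head -c 42 "$1" 2>/dev/null |
    grep -qxF 'version https://git-lfs.github.com/spec/v1'
}

# Hypothetical helper: prints every file under the given directory that is
# still an LFS pointer. The .git directory is skipped.
list_lfs_pointers() {
  find "$1" -type f ! -path '*/.git/*' | while IFS= read -r f; do
    if is_lfs_pointer "$f"; then
      printf '%s\n' "$f"
    fi
  done
}
```

Running `list_lfs_pointers ./Qwen1.5-4B-Chat` from the parent directory prints nothing when every LFS file has been downloaded. Any path it prints still needs `git lfs pull`.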

Upload the model to OSS

  1. Log in to the OSS console and note the name of your OSS bucket. To create a bucket, see Create a bucket.

  2. Install and configure ossutil. For instructions, see Install ossutil.

  3. Create a directory for the model in OSS:

    ossutil mkdir oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat

  4. Upload the model files:

    ossutil cp -r ./Qwen1.5-4B-Chat oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat

Create a PV and PVC

Create a PV and PVC in your ACK cluster to mount the model data. For step-by-step instructions, see Mount a statically provisioned OSS volume.

Use the following settings when creating the PV:

Parameter            Value
---------            -----
PV Type              OSS
Volume Name          llm-model
Access Certificate   The AccessKey ID and AccessKey secret for your OSS bucket
Bucket ID            The name of your OSS bucket
OSS Path             The path to the model in the bucket. This must match the upload location, such as /Qwen1.5-4B-Chat if you used the upload command above

Use the following settings when creating the PVC:

Parameter            Value
---------            -----
PVC Type             OSS
Volume Name          llm-model
Allocation Mode      Existing Volumes
Existing Volumes     Click the Existing Volumes link and select the PV you created
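
If you prefer manifests to the console, the settings above can be sketched as a Secret, PV, and PVC. Treat this as an illustration under stated assumptions, not the documented procedure:

  • The CSI driver name and the `akId`, `akSecret`, `bucket`, `url`, `otherOpts`, and `path` fields reflect the ACK OSS CSI plugin as commonly documented. Verify them against Mount a statically provisioned OSS volume before applying.

  • The Secret name `oss-secret`, the `30Gi` capacity, the `ReadOnlyMany` access mode, and the mount options are hypothetical choices.

  • Every `<...>` value is a placeholder you must replace.

```shell
# Write the manifest to a file so it can be reviewed before it touches the cluster.
# ASSUMPTION: field names follow the ACK OSS CSI plugin; confirm before use.
cat > llm-model-volume.yaml <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret              # hypothetical name
  namespace: default
stringData:
  akId: <Your-AccessKey-ID>
  akSecret: <Your-AccessKey-Secret>
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi               # assumed size; OSS does not enforce it
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <Your-Bucket-Name>
      url: <Your-Bucket-Endpoint>
      otherOpts: "-o umask=022 -o allow_other"
      path: /Qwen1.5-4B-Chat    # must match where the model was uploaded
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
  namespace: default
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
EOF

# After replacing the placeholders, apply the file:
#   kubectl apply -f llm-model-volume.yaml
```

The PVC binds to the PV through the `alicloud-pvname` label selector, which plays the same role as picking the PV under Existing Volumes in the console.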

Step 2: Deploy the inference service

  1. Deploy the inference service using the Arena client:

    arena serve custom \
        --name=lmdeploy-qwen \
        --version=v1 \
        --gpus=1 \
        --replicas=1 \
        --restful-port=8000 \
        --readiness-probe-action="tcpSocket" \
        --readiness-probe-action-option="port: 8000" \
        --readiness-probe-option="initialDelaySeconds: 30" \
        --readiness-probe-option="periodSeconds: 30" \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/lmdeploy:v0.4.2 \
        --data=llm-model:/model/Qwen1.5-4B-Chat \
        "lmdeploy serve api_server /model/Qwen1.5-4B-Chat --server-port 8000"

    The command uses the following parameters:

    Parameter                        Description
    ---------                        -----------
    --name                           Name of the inference service
    --version                        Version of the inference service
    --gpus                           Number of GPUs allocated to each replica
    --replicas                       Number of inference service replicas
    --restful-port                   Port exposed by the inference service
    --readiness-probe-action         Connection type for readiness probes. Valid values: HttpGet, Exec, gRPC, TCPSocket
    --readiness-probe-action-option  Options for the chosen probe action, such as the port to connect to
    --readiness-probe-option         Probe timing settings, such as initialDelaySeconds and periodSeconds. Repeat the flag to set more than one
    --data                           Mounts a PVC to the runtime environment. Format: <pvc-name>:<mount-path>. Run arena data list to list available PVCs
    --image                          Container image for the inference service

    Expected output:

    service/lmdeploy-qwen-v1 created
    deployment.apps/lmdeploy-qwen-v1-custom-serving created
    INFO[0002] The Job lmdeploy-qwen has been submitted successfully
    INFO[0002] You can run `arena serve get lmdeploy-qwen --type custom-serving -n default` to check the job status

  2. Check the service status and wait until Available shows 1:

    arena serve get lmdeploy-qwen

    Expected output:

    Name:       lmdeploy-qwen
    Namespace:  default
    Type:       Custom
    Version:    v1
    Desired:    1
    Available:  1
    Age:        1m
    Address:    192.168.XX.XX
    Port:       RESTFUL:8000
    GPU:        1
    
    Instances:
      NAME                                              STATUS   AGE  READY  RESTARTS  GPU  NODE
      ----                                              ------   ---  -----  --------  ---  ----
      lmdeploy-qwen-v1-custom-serving-8476b9dd8c-8b4d2  Running  1m   1/1    0         1    cn-beijing.172.16.XX.XX

    When Available: 1 and the pod status is Running, the service is ready to accept requests.
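
Loading the model can take a few minutes, so you may want to poll rather than rerun the command by hand. The following is a minimal sketch that assumes the text output of `arena serve get` keeps the `Desired:` and `Available:` lines shown above. The function names `field_value` and `wait_for_service`, the 10-second interval, and the 600-second default timeout are hypothetical choices.

```shell
# Hypothetical helper: reads `arena serve get` text output on stdin and
# prints the value of the named field, for example "Available".
field_value() {
  awk -v key="$1:" '$1 == key { print $2 }'
}

# Hypothetical helper: polls every 10 seconds until Available equals Desired,
# or returns 1 after the timeout (in seconds, default 600).
wait_for_service() {
  name=$1
  timeout=${2:-600}
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    status=$(arena serve get "$name" 2>/dev/null)
    desired=$(printf '%s\n' "$status" | field_value Desired)
    available=$(printf '%s\n' "$status" | field_value Available)
    if [ -n "$desired" ] && [ "$desired" = "$available" ]; then
      echo "$name is ready ($available/$desired replicas available)"
      return 0
    fi
    sleep 10
    elapsed=$((elapsed + 10))
  done
  echo "Timed out waiting for $name" >&2
  return 1
}
```

For this tutorial you would call `wait_for_service lmdeploy-qwen`. Parsing human-readable output is fragile, so treat a timeout as a prompt to inspect `arena serve get` directly rather than as proof of failure.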

Step 3: Verify the inference service

  1. Set up port forwarding from the service to your local machine:

    kubectl port-forward svc/lmdeploy-qwen-v1 8000:8000

    Expected output:

    Forwarding from 127.0.0.1:8000 -> 8000
    Forwarding from [::1]:8000 -> 8000

    Important: kubectl port-forward is intended for development and debugging only. For production networking, see Ingress overview.

  2. Send a test inference request:

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "qwen", "messages": [{"role": "user", "content": "Test it out."}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}'

    Expected output:

    {"id":"1","object":"chat.completion","created":1719833349,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"Sure, do you have any testing requirements or issues?"},"logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":21,"total_tokens":32,"completion_tokens":11}}

    A valid JSON response with a choices array confirms the model is generating outputs correctly.
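
To repeat this check from a script, the request and the response test can be wrapped in a small smoke test. This is a sketch under assumptions: the port forward from step 1 is still running, the function names `looks_like_completion` and `smoke_test` are hypothetical, and the response check is a plain text match on the `choices` array rather than real JSON parsing.

```shell
# Hypothetical helper: succeeds if the text on stdin contains a non-empty
# "choices" array. This is a pattern match, not a JSON parser.
looks_like_completion() {
  grep -q '"choices":[[:space:]]*\[[[:space:]]*{'
}

# Hypothetical helper: sends the tutorial's test request and checks the reply.
# The URL defaults to the port-forwarded endpoint used in this step.
smoke_test() {
  url=${1:-http://localhost:8000/v1/chat/completions}
  body=$(curl -sSf --max-time 60 "$url" \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen", "messages": [{"role": "user", "content": "Test it out."}], "max_tokens": 10}') || return 1
  if printf '%s' "$body" | looks_like_completion; then
    echo "OK"
  else
    echo "Unexpected response: $body" >&2
    return 1
  fi
}
```

Calling `smoke_test` with no arguments targets `localhost:8000`. Pass a different URL as the first argument if the service is exposed another way.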

(Optional) Step 4: Clean up

Delete the resources when you no longer need them.

  • Delete the inference service:

    arena serve del lmdeploy-qwen

  • Delete the PVC and PV:

    kubectl delete pvc llm-model
    kubectl delete pv llm-model