Deploy Qwen Inference with LMDeploy on ACK for High-Throughput AI

This tutorial shows you how to deploy the Qwen1.5-4B-Chat model as an inference service on Container Service for Kubernetes (ACK) using the LMDeploy framework and an A10 GPU. By the end, you will have a running REST API endpoint that accepts chat completion requests.

About LMDeploy and Qwen1.5-4B-Chat

LMDeploy is an open-source toolkit for compressing, deploying, and serving large language models (LLMs):

Model compression and optimization: Applies weight quantization and key-value (KV) cache quantization to reduce model size and memory usage. Improves inference throughput through tensor parallelism and KV cache optimization.
Flexible deployment: Supports single-machine, multi-machine, and multi-GPU environments, including distributed deployment for scalability and high availability.
Service management: Reduces redundant computation and improves response speed through caching.

Qwen1.5-4B-Chat is a 4-billion-parameter LLM developed by Alibaba Cloud, trained on web text, domain-specific books, and code. For details, see the Qwen GitHub repository.

Prerequisites

Before you begin, make sure you have:

An ACK Pro cluster with GPU-accelerated nodes, running Kubernetes 1.22 or later. Each GPU-accelerated node must have at least 16 GB of GPU memory. For setup instructions, see Create an ACK managed cluster.
GPU driver version 525 on the GPU-accelerated nodes. To pin the driver to version 525.105.17, add the `ack.aliyun.com/nvidia-driver-version:525.105.17` label to the nodes. For details, see Specify an NVIDIA driver version for nodes by adding a label.
The latest version of the Arena client installed. For setup instructions, see Configure the Arena client.
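
The version requirement above can be checked from a terminal before you continue. The following is a minimal sketch, not an official check: `normalize_version` and `version_at_least` are hypothetical helper names, the version string in the comments is only an example, and the comparison assumes a `sort` that supports `-V` (GNU coreutils or BusyBox).

```shell
# Strip a leading "v" and any "-suffix" from a version such as v1.28.3-aliyun.1.
normalize_version() {
  ver=${1#v}
  printf '%s\n' "${ver%%-*}"
}

# Return 0 if version $1 is greater than or equal to version $2.
version_at_least() {
  lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
  [ "$lowest" = "$2" ]
}

# Example use against a live cluster (requires kubectl access):
#   kubelet=$(kubectl get nodes -o jsonpath='{.items[0].status.nodeInfo.kubeletVersion}')
#   version_at_least "$(normalize_version "$kubelet")" 1.22 && echo "Kubernetes version OK"
#   kubectl get nodes -L ack.aliyun.com/nvidia-driver-version
```

The last commented command lists each node with the value of its driver-version label, which is a quick way to confirm that the label is present.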

Step 1: Prepare model data

Download the Qwen1.5-4B-Chat model, upload it to Object Storage Service (OSS), and create a persistent volume (PV) and persistent volume claim (PVC) in your ACK cluster.

To store the model on NAS instead of OSS, see Use NAS static persistent volume.

Download the model

Install Git:

```
# Run yum install git or apt install git.
yum install git
```

Install Git Large File Storage (LFS):

```
# Run yum install git-lfs or apt install git-lfs.
yum install git-lfs
```

Clone the Qwen1.5-4B-Chat repository from ModelScope without downloading LFS files:
```
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-4B-Chat.git
```
Enter the repository directory and pull the LFS-managed model files:
```
cd Qwen1.5-4B-Chat
git lfs pull
```
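
Because the clone skips LFS content, an interrupted `git lfs pull` can leave small pointer files where the model weights should be. The sketch below is one way to check for that. It relies on the fact that a Git LFS pointer file starts with the line `version https://git-lfs.github.com/spec/v1`. The function names `is_lfs_pointer` and `find_lfs_pointers` are hypothetical helpers, not part of Git LFS.

```shell
# Return 0 if the file is still a Git LFS pointer rather than real content.
is_lfs_pointer() {
  # Pointer files are tiny text files whose first line names the LFS spec.
  head -c 200 "$1" 2>/dev/null | head -n 1 |
    grep -q '^version https://git-lfs.github.com/spec/v1'
}

# Print every file under a directory (skipping .git) that is still a pointer.
find_lfs_pointers() {
  find "$1" -path '*/.git' -prune -o -type f -print | while IFS= read -r f; do
    if is_lfs_pointer "$f"; then
      printf '%s\n' "$f"
    fi
  done
}
```

Running `find_lfs_pointers ./Qwen1.5-4B-Chat` from the parent directory should print nothing once all model files are downloaded. Any path it prints is a file to fetch again with `git lfs pull`.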

Upload the model to OSS

Log in to the OSS console and note the name of your OSS bucket. To create a bucket, see Create a bucket.
Install and configure ossutil. For instructions, see Install ossutil.

Create a directory for the model in OSS:

```
ossutil mkdir oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
```

Upload the model files:

```
ossutil cp -r ./Qwen1.5-4B-Chat oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
```

Create a PV and PVC

Create a PV and PVC in your ACK cluster to mount the model data. For step-by-step instructions, see Mount a statically provisioned OSS volume.

Use the following settings when creating the PV:

Parameter	Value
PV Type	OSS
Volume Name	llm-model
Access Certificate	The AccessKey ID and AccessKey secret for your OSS bucket
Bucket ID	The name of your OSS bucket
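OSS Path	The directory in the bucket that contains the model files. It must match the upload location, for example `/Qwen1.5-4B-Chat` for `oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat`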

Use the following settings when creating the PVC:

Parameter	Value
PVC Type	OSS
Volume Name	llm-model
Allocation Mode	Existing Volumes
Existing Volumes	Click the Existing Volumes link and select the PV you created

Step 2: Deploy the inference service

Deploy the inference service using the Arena client:

```
arena serve custom \
    --name=lmdeploy-qwen \
    --version=v1 \
    --gpus=1 \
    --replicas=1 \
    --restful-port=8000 \
    --readiness-probe-action="tcpSocket" \
    --readiness-probe-action-option="port: 8000" \
    --readiness-probe-option="initialDelaySeconds: 30" \
    --readiness-probe-option="periodSeconds: 30" \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/lmdeploy:v0.4.2 \
    --data=llm-model:/model/Qwen1.5-4B-Chat \
    "lmdeploy serve api_server /model/Qwen1.5-4B-Chat --server-port 8000"
```

The command uses the following parameters:

Parameter	Description
`--name`	Name of the inference service
`--version`	Version of the inference service
`--gpus`	Number of GPUs allocated to each replica
`--replicas`	Number of inference service replicas
`--restful-port`	Port exposed by the inference service
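`--readiness-probe-action`	Probe type for the readiness probe. Valid values: `httpGet`, `exec`, `grpc`, `tcpSocket`
`--readiness-probe-action-option`	Settings for the chosen probe type. In this example, `port: 8000` is the TCP port that the probe checks
`--readiness-probe-option`	Timing settings for the readiness probe, such as `initialDelaySeconds` and `periodSeconds`. Repeat the flag to set several options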
`--data`	Mounts a PVC to the runtime environment. Format: `<pvc-name>:<mount-path>`. Run `arena data list` to list available PVCs
`--image`	Container image for the inference service

Expected output:

```
service/lmdeploy-qwen-v1 created
deployment.apps/lmdeploy-qwen-v1-custom-serving created
INFO[0002] The Job lmdeploy-qwen has been submitted successfully
INFO[0002] You can run `arena serve get lmdeploy-qwen --type custom-serving -n default` to check the job status
```

Check the service status and wait until Available shows 1:

```
arena serve get lmdeploy-qwen
```

Expected output:

```
Name:       lmdeploy-qwen
Namespace:  default
Type:       Custom
Version:    v1
Desired:    1
Available:  1
Age:        1m
Address:    192.168.XX.XX
Port:       RESTFUL:8000
GPU:        1

Instances:
  NAME                                              STATUS   AGE  READY  RESTARTS  GPU  NODE
  ----                                              ------   ---  -----  --------  ---  ----
  lmdeploy-qwen-v1-custom-serving-8476b9dd8c-8b4d2  Running  1m   1/1    0         1    cn-beijing.172.16.XX.XX
```

When Available is 1 and the pod status is Running, the service is ready to accept requests.
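
Loading the model can take a while, so rather than rerunning the status command by hand, you can poll it. The sketch below parses the `Available:` line from the `arena serve get` output shown above. The helper names `available_count` and `wait_until_available`, the 15-second interval, and the default of 40 attempts are assumptions chosen for illustration.

```shell
# Read `arena serve get` output on stdin and print the Available count.
available_count() {
  awk '$1 == "Available:" { print $2; exit }'
}

# Poll until the named service reports the wanted number of available replicas.
# Arguments: service name, wanted count (default 1), max attempts (default 40).
wait_until_available() {
  svc_name=$1
  want=${2:-1}
  attempts=${3:-40}
  while [ "$attempts" -gt 0 ]; do
    got=$(arena serve get "$svc_name" 2>/dev/null | available_count)
    if [ "$got" = "$want" ]; then
      echo "$svc_name is available ($got/$want)"
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 15
  done
  echo "timed out waiting for $svc_name" >&2
  return 1
}
```

For this tutorial, the call would be `wait_until_available lmdeploy-qwen`. A nonzero exit status means the service did not become available within the attempt limit, at which point the pod logs are the next place to look.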

Step 3: Verify the inference service

Set up port forwarding from the service to your local machine:
```
kubectl port-forward svc/lmdeploy-qwen-v1 8000:8000
```
Expected output:
```
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
```
Important: kubectl port-forward is intended for development and debugging only. For production networking, see Ingress overview.

The port-forward command keeps running in the foreground, so open a second terminal and send a test inference request:

```
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen", "messages": [{"role": "user", "content": "Test it out."}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}'
```

Expected output:

```
{"id":"1","object":"chat.completion","created":1719833349,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"Sure, do you have any testing requirements or issues?"},"logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":21,"total_tokens":32,"completion_tokens":11}}
```

A valid JSON response with a choices array confirms the model is generating outputs correctly.
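
In the sample response, `finish_reason` is `length`. In OpenAI-style chat completion responses, that value means generation stopped at the token limit (the request sets `max_tokens` to 10) rather than at a natural end, which is reported as `stop`. The sketch below shows one way to detect a truncated reply in a script. The helper names `finish_reason` and `is_truncated` are hypothetical, and the plain text matching assumes a response with a single choice. A JSON parser is the more robust option if one is available.

```shell
# Print the finish_reason of the first choice in a chat completion response
# read on stdin. Uses plain text matching, so it assumes a single choice.
finish_reason() {
  grep -Eo '"finish_reason"[[:space:]]*:[[:space:]]*"[a-z_]+"' |
    head -n 1 |
    sed -E 's/.*"([a-z_]+)"$/\1/'
}

# Return 0 if the response was cut off by the max_tokens limit.
is_truncated() {
  [ "$(finish_reason)" = "length" ]
}
```

To use it, add `-s` to the curl command and pipe its output into `is_truncated`. An exit status of 0 suggests raising `max_tokens` if you need the full answer.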

(Optional) Step 4: Clean up

Delete the resources when you no longer need them.

Delete the inference service:
```
arena serve del lmdeploy-qwen
```

Delete the PVC and PV:

```
kubectl delete pvc llm-model
kubectl delete pv llm-model
```