This tutorial shows you how to deploy the Qwen1.5-4B-Chat model as an inference service on Container Service for Kubernetes (ACK) using the LMDeploy framework and an A10 GPU. By the end, you will have a running REST API endpoint that accepts chat completion requests.
About LMDeploy and Qwen1.5-4B-Chat
LMDeploy is an open-source toolkit for compressing, deploying, and serving large language models (LLMs):
Model compression and optimization: Applies weight quantization and key-value (KV) cache quantization to reduce model size and memory usage. Improves inference throughput through tensor parallelism and KV cache optimization.
Flexible deployment: Supports single-machine, multi-machine, and multi-GPU environments, including distributed deployment for scalability and high availability.
Service management: Reduces redundant computation and improves response speed through caching.
Qwen1.5-4B-Chat is a 4-billion-parameter LLM developed by Alibaba Cloud, trained on web text, domain-specific books, and code. For details, see the Qwen GitHub repository.
Prerequisites
Before you begin, make sure you have:
An ACK Pro cluster with GPU-accelerated nodes. Kubernetes version 1.22 or later is required. Each node must have 16 GB of GPU memory or more. For setup instructions, see Create an ACK managed cluster. Install GPU driver version 525. Add the ack.aliyun.com/nvidia-driver-version:525.105.17 label to GPU-accelerated nodes to pin the driver version to 525.105.17. For details, see Specify an NVIDIA driver version for nodes by adding a label.
The latest version of the Arena client installed. For setup instructions, see Configure the Arena client.
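The Kubernetes version requirement above can be checked with a small script. The following is a minimal sketch rather than part of the documented procedure: version_ge is a hypothetical helper, and it assumes sort -V (version sort from GNU coreutils) is available on your machine.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort), assumed to be available.
version_ge() {
    local lowest
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Example: compare an illustrative server version against the 1.22 minimum.
if version_ge "1.28.3" "1.22"; then
    echo "Kubernetes version OK"
else
    echo "Kubernetes version too old"
fi
```

The server version reported by kubectl version usually carries a leading v and may carry a vendor suffix. Strip both before passing the value to the helper.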
Step 1: Prepare model data
Download the Qwen1.5-4B-Chat model, upload it to Object Storage Service (OSS), and create a persistent volume (PV) and persistent volume claim (PVC) in your ACK cluster.
To store the model on NAS instead of OSS, see Use NAS static persistent volume.
Download the model
Install Git:
    # Run yum install git or apt install git.
    yum install git
Install Git Large File Storage (LFS):
    # Run yum install git-lfs or apt install git-lfs.
    yum install git-lfs
Clone the Qwen1.5-4B-Chat repository from ModelScope without downloading LFS files:
    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-4B-Chat.git
Enter the repository directory and pull the LFS-managed model files:
    cd Qwen1.5-4B-Chat
    git lfs pull
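Because the clone skips LFS files, any weight file that git lfs pull fails to fetch remains a small text pointer instead of real model data. The following sketch checks for that case. Both function names are hypothetical, and the assumption that the weights use the .safetensors or .bin extension is not confirmed by this tutorial.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the file is still a Git LFS pointer
# (a small text stub) rather than downloaded content.
is_lfs_pointer() {
    # Pointer files begin with this fixed line defined by the Git LFS spec.
    head -c 200 "$1" 2>/dev/null | grep -q '^version https://git-lfs.github.com/spec/v1'
}

# Print every weight file in the directory that is still a pointer.
# Assumption: weights are stored as *.safetensors or *.bin files.
# Returns 1 if any pointer file was found, 0 otherwise.
check_model_dir() {
    local dir="$1" missing=0 f
    for f in "$dir"/*.safetensors "$dir"/*.bin; do
        [ -e "$f" ] || continue    # skip patterns that matched nothing
        if is_lfs_pointer "$f"; then
            echo "still a pointer: $f"
            missing=1
        fi
    done
    return "$missing"
}
```

From the parent directory of the clone, check_model_dir ./Qwen1.5-4B-Chat lists any files that still need to be fetched. If it prints anything, run git lfs pull again inside the repository.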
Upload the model to OSS
Log in to the OSS console and note the name of your OSS bucket. To create a bucket, see Create a bucket.
Install and configure ossutil. For instructions, see Install ossutil.
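Create a directory for the model in OSS:
    ossutil mkdir oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
Upload the model files. Run this command from the parent directory of the cloned repository, so that ./Qwen1.5-4B-Chat resolves to the model directory:
    ossutil cp -r ./Qwen1.5-4B-Chat oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat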
Create a PV and PVC
Create a PV and PVC in your ACK cluster to mount the model data. For step-by-step instructions, see Mount a statically provisioned OSS volume.
Use the following settings when creating the PV:
| Parameter | Value |
|---|---|
| PV Type | OSS |
| Volume Name | llm-model |
| Access Certificate | The AccessKey ID and AccessKey secret for your OSS bucket |
| Bucket ID | The name of your OSS bucket |
| OSS Path | The path to the model inside the bucket. If you uploaded the model with the ossutil commands above, the path is /Qwen1.5-4B-Chat |
Use the following settings when creating the PVC:
| Parameter | Value |
|---|---|
| PVC Type | OSS |
| Volume Name | llm-model |
| Allocation Mode | Existing Volumes |
| Existing Volumes | Click the Existing Volumes link and select the PV you created |
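The console settings above can also be expressed as a manifest for use with kubectl. The following is a sketch under assumptions, not the documented procedure. It assumes the OSS CSI plugin (driver name ossplugin.csi.alibabacloud.com) is installed in the cluster. The Secret name oss-secret, the 30Gi capacity, the ReadOnlyMany access mode, the mount options, and the alicloud-pvname label are illustrative choices. Every value in angle brackets is a placeholder that you must replace.

```shell
#!/usr/bin/env bash
# Write a Secret, PV, and PVC for the model bucket to llm-model-volume.yaml.
# Assumptions: OSS CSI plugin installed; names, sizes, and options are illustrative.
cat > llm-model-volume.yaml <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret            # hypothetical name
  namespace: default
stringData:
  akId: <Your-AccessKey-ID>
  akSecret: <Your-AccessKey-Secret>
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi             # illustrative size
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <Your-Bucket-Name>
      url: <Your-Bucket-Endpoint>
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: /Qwen1.5-4B-Chat  # matches the upload path used earlier
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
  namespace: default
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
EOF
echo "wrote llm-model-volume.yaml"
```

After you replace the placeholders, kubectl apply -f llm-model-volume.yaml creates the three objects, and kubectl get pvc llm-model should eventually report the claim as Bound.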
Step 2: Deploy the inference service
Deploy the inference service using the Arena client:
    arena serve custom \
        --name=lmdeploy-qwen \
        --version=v1 \
        --gpus=1 \
        --replicas=1 \
        --restful-port=8000 \
        --readiness-probe-action="tcpSocket" \
        --readiness-probe-action-option="port: 8000" \
        --readiness-probe-option="initialDelaySeconds: 30" \
        --readiness-probe-option="periodSeconds: 30" \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/lmdeploy:v0.4.2 \
        --data=llm-model:/model/Qwen1.5-4B-Chat \
        "lmdeploy serve api_server /model/Qwen1.5-4B-Chat --server-port 8000"
The command uses the following parameters:
| Parameter | Description |
|---|---|
| --name | Name of the inference service |
| --version | Version of the inference service |
| --gpus | Number of GPUs allocated to each replica |
| --replicas | Number of inference service replicas |
| --restful-port | Port exposed by the inference service |
| --readiness-probe-action | Connection type for readiness probes. Valid values: HttpGet, Exec, gRPC, TCPSocket |
| --readiness-probe-action-option | Connection method for readiness probes |
| --readiness-probe-option | Readiness probe configuration |
| --data | Mounts a PVC to the runtime environment. Format: <pvc-name>:<mount-path>. Run arena data list to list available PVCs |
| --image | Container image for the inference service |
Expected output:
    service/lmdeploy-qwen-v1 created
    deployment.apps/lmdeploy-qwen-v1-custom-serving created
    INFO[0002] The Job lmdeploy-qwen has been submitted successfully
    INFO[0002] You can run `arena serve get lmdeploy-qwen --type custom-serving -n default` to check the job status
Check the service status and wait until Available shows 1:
    arena serve get lmdeploy-qwen
Expected output:
    Name:       lmdeploy-qwen
    Namespace:  default
    Type:       Custom
    Version:    v1
    Desired:    1
    Available:  1
    Age:        1m
    Address:    192.168.XX.XX
    Port:       RESTFUL:8000
    GPU:        1
    Instances:
      NAME                                              STATUS   AGE  READY  RESTARTS  GPU  NODE
      ----                                              ------   ---  -----  --------  ---  ----
      lmdeploy-qwen-v1-custom-serving-8476b9dd8c-8b4d2  Running  1m   1/1    0         1    cn-beijing.172.16.XX.XX
When the output shows Available: 1 and the pod status is Running, the service is ready to accept requests.
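Loading the model can take a while, so it may be convenient to poll the status instead of rerunning the command by hand. The sketch below assumes the arena serve get output keeps the Available: line format shown in the expected output. The function names, attempt count, and delay are hypothetical choices.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `arena serve get` output on stdin and print
# the value that follows "Available:".
available_replicas() {
    awk '$1 == "Available:" { print $2; exit }'
}

# Poll until at least one replica is available or the attempts run out.
# Arguments: service name, max attempts (default 20), delay in seconds (default 15).
wait_for_service() {
    local name="$1" attempts="${2:-20}" delay="${3:-15}" i=1 n
    while [ "$i" -le "$attempts" ]; do
        n=$(arena serve get "$name" 2>/dev/null | available_replicas)
        if [ "${n:-0}" -ge 1 ] 2>/dev/null; then
            echo "service $name is available"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "timed out waiting for $name" >&2
    return 1
}
```

For example, wait_for_service lmdeploy-qwen 40 15 checks every 15 seconds for up to 10 minutes.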
Step 3: Verify the inference service
Set up port forwarding from the service to your local machine:
    kubectl port-forward svc/lmdeploy-qwen-v1 8000:8000
Expected output:
    Forwarding from 127.0.0.1:8000 -> 8000
    Forwarding from [::1]:8000 -> 8000
Important: kubectl port-forward is intended for development and debugging only. For production networking, see Ingress overview.
Send a test inference request:
    curl http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "qwen", "messages": [{"role": "user", "content": "Test it out."}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}'
Expected output:
    {"id":"1","object":"chat.completion","created":1719833349,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"Sure, do you have any testing requirements or issues?"},"logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":21,"total_tokens":32,"completion_tokens":11}}
A valid JSON response with a choices array confirms the model is generating outputs correctly.
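The finish_reason field in the response is also worth checking. In OpenAI-style chat completion APIs, which this endpoint's path and response shape resemble, a value of length means generation stopped at the token limit. That is consistent with the max_tokens value of 10 in the test request. The sketch below extracts the field with a plain text match. The function name is hypothetical, and because this is not a JSON parser, it only suits compact single-line responses like the one above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the finish_reason of the first choice in a
# chat completion response read from stdin. Plain text match, not a JSON parser.
finish_reason() {
    grep -o '"finish_reason":"[^"]*"' | head -n 1 | cut -d '"' -f 4
}

# Example with a shortened copy of the response shown above.
response='{"id":"1","choices":[{"index":0,"logprobs":null,"finish_reason":"length"}]}'
printf '%s\n' "$response" | finish_reason    # prints: length
```

If the value is length and you need a complete answer, raise max_tokens in the request. For anything beyond a quick check, a real JSON tool such as jq is more robust than this text match.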
(Optional) Step 4: Clean up
Delete the resources when you no longer need them.
Delete the inference service:
    arena serve del lmdeploy-qwen
Delete the PVC and PV:
    kubectl delete pvc llm-model
    kubectl delete pv llm-model