This guide walks you through deploying a Qwen1.5-4B-Chat inference service on Container Service for Kubernetes (ACK) using the Text Generation Inference (TGI) framework from Hugging Face, with an A10 GPU as the target hardware.
Background
Qwen1.5-4B-Chat
Qwen1.5-4B-Chat is a 4-billion-parameter large language model (LLM) developed by Alibaba Cloud, built on the Transformer architecture and trained on large-scale data covering web text, domain-specific books, and code. For more information, see the Qwen GitHub repository.
Text Generation Inference (TGI)
TGI is an open source toolkit from Hugging Face for serving LLMs as inference services. It supports inference acceleration techniques including Flash Attention, Paged Attention, continuous batching, and tensor parallelism. For more information, see the TGI documentation.
Prerequisites
Before you begin, ensure that you have:
- An ACK cluster with GPU-accelerated nodes (A10 GPU) running Kubernetes 1.22 or later. For more information, see Create an ACK cluster with GPU-accelerated nodes.

  We recommend that you install a GPU driver whose version is 525. You can add the label `ack.aliyun.com/nvidia-driver-version:525.105.17` to GPU-accelerated nodes to specify the GPU driver version as 525.105.17 (see the example after this list). For more information, see Specify an NVIDIA driver version for nodes by adding a label.
- The latest version of the Arena client installed. For more information, see Configure the Arena client.
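The label is normally set when you add GPU nodes or create the node pool, so that it is in place before ACK installs the driver. The following kubectl sketch only illustrates the label key and value; the node name is hypothetical, and the exact procedure is described in the linked guide:

```shell
# Hypothetical node name; replace with a name from `kubectl get nodes`.
# Setting the label on an already-initialized node does not by itself reinstall the driver.
kubectl label node cn-beijing.192.168.XX.XX ack.aliyun.com/nvidia-driver-version=525.105.17
```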
Step 1: Prepare model data
Download the Qwen1.5-4B-Chat model, upload it to Object Storage Service (OSS), and create a persistent volume (PV) and persistent volume claim (PVC) in the ACK cluster so the inference service can load the model at runtime.
For an alternative using File Storage NAS (NAS), see Mount a statically provisioned NAS volume.
Download model weights
1. Install Git:

   ```shell
   # RHEL/CentOS
   yum install git
   # Debian/Ubuntu
   # apt install git
   ```

2. Install Git Large File Storage (LFS):

   ```shell
   # RHEL/CentOS
   yum install git-lfs
   # Debian/Ubuntu
   # apt install git-lfs
   ```

3. Clone the Qwen1.5-4B-Chat repository from ModelScope, skipping LFS downloads:

   ```shell
   GIT_LFS_SKIP_MUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-4B-Chat.git
   ```

4. Pull the large model weight files managed by LFS:

   ```shell
   cd Qwen1.5-4B-Chat
   git lfs pull
   ```
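Optionally, confirm that the weight shards were fully downloaded before uploading them. This is a rough sanity check; the exact file names depend on the repository contents:

```shell
# Total size should be several gigabytes for Qwen1.5-4B-Chat.
du -sh .
# List the files and make sure none of them are small LFS pointer stubs.
ls -lh
```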
Upload the model to OSS
1. Log on to the OSS console and record the name of your OSS bucket. To create a bucket, see Create a bucket.

2. Install and configure ossutil. For more information, see Install ossutil.

3. Create a directory for the model in OSS:

   ```shell
   ossutil mkdir oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
   ```

4. Upload the model files to OSS:

   ```shell
   ossutil cp -r ./Qwen1.5-4B-Chat oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
   ```
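Optionally, list the uploaded objects to confirm that the transfer completed. Replace the bucket name with your own:

```shell
# Verify that the model files are now present in the bucket.
ossutil ls oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
```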
Configure a PV and PVC
Create a PV and PVC in the ACK cluster to mount the OSS-hosted model into the inference service container. For step-by-step instructions, see Mount a statically provisioned OSS volume.
Use the following settings. A sketch of equivalent manifests follows the two tables.
PV parameters
| Parameter | Value |
|---|---|
| PV type | OSS |
| Volume name | llm-model |
| Access certificate | Your AccessKey ID and AccessKey secret for the OSS bucket |
| Bucket ID | The name of your OSS bucket |
| OSS path | The path to the model directory, for example /Qwen1.5-4B-Chat if you uploaded the model to the root of the bucket as in the previous step |
PVC parameters
| Parameter | Value |
|---|---|
| PVC type | OSS |
| Volume name | llm-model |
| Allocation mode | Existing volumes |
| Existing volumes | Select the PV you created |
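If you prefer to create the volumes with kubectl instead of the console, the following is a minimal sketch of a statically provisioned OSS PV and PVC based on the settings above. The Secret name, capacity, endpoint URL, and mount options are illustrative assumptions; adapt them to your bucket region and the linked guide:

```yaml
# Hypothetical manifests; the endpoint, capacity, and otherOpts values are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
  namespace: default
stringData:
  akId: <Your-AccessKey-ID>
  akSecret: <Your-AccessKey-Secret>
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <Your-Bucket-Name>
      url: oss-cn-beijing-internal.aliyuncs.com   # use your bucket's region endpoint
      path: /Qwen1.5-4B-Chat
      otherOpts: "-o umask=022 -o allow_other"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
  namespace: default
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
```

After you apply the manifests, `kubectl get pvc llm-model` should report the claim as Bound.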
Step 2: Deploy the inference service
**Important:** TGI does not support V100 or T4 GPUs. Use an A10 or a GPU with a newer architecture.
Run the following Arena command to deploy the inference service:
```shell
arena serve custom \
  --name=tgi-qwen-4b-chat \
  --version=v1 \
  --gpus=1 \
  --replicas=1 \
  --restful-port=8000 \
  --readiness-probe-action="tcpSocket" \
  --readiness-probe-action-option="port: 8000" \
  --readiness-probe-option="initialDelaySeconds: 30" \
  --readiness-probe-option="periodSeconds: 30" \
  --image=ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/text-generation-inference:2.0.2-ubuntu22.04 \
  --data=llm-model:/model/Qwen1.5-4B-Chat \
  "text-generation-launcher --model-id /model/Qwen1.5-4B-Chat --num-shard 1 -p 8000"
```
The following table describes the parameters:
| Parameter | Description |
|---|---|
| `--name` | Name of the inference service. |
| `--version` | Version of the inference service. |
| `--gpus` | Number of GPUs per replica. |
| `--replicas` | Number of replicas. |
| `--restful-port` | Port exposed by the inference service. |
| `--readiness-probe-action` | Connection type for readiness probes. Valid values: HttpGet, Exec, gRPC, TCPSocket. |
| `--readiness-probe-action-option` | Connection method for readiness probes. |
| `--readiness-probe-option` | Readiness probe configuration. |
| `--data` | Mounts a PVC into the container. Format: `<pvc-name>:<mount-path>`. Run `arena data list` to list available PVCs. |
| `--image` | Container image for the inference service. |
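For reference, the three readiness probe flags translate into a standard Kubernetes `readinessProbe` in the generated pod spec. This is an illustrative sketch of the effect, not the exact manifest that Arena produces:

```yaml
# Approximate effect of the --readiness-probe-* flags above.
readinessProbe:
  tcpSocket:
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 30
```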
The expected output of the `arena serve custom` command is similar to the following:
```
service/tgi-qwen-4b-chat-v1 created
deployment.apps/tgi-qwen-4b-chat-v1-custom-serving created
INFO[0001] The Job tgi-qwen-4b-chat has been submitted successfully
INFO[0001] You can run `arena serve get tgi-qwen-4b-chat --type custom-serving -n default` to check the job status
```
Check that the service is running:
```shell
arena serve get tgi-qwen-4b-chat
```
The expected output is:
```
Name:       tgi-qwen-4b-chat
Namespace:  default
Type:       Custom
Version:    v1
Desired:    1
Available:  1
Age:        3m
Address:    172.16.XX.XX
Port:       RESTFUL:8000
GPU:        1

Instances:
  NAME                                                 STATUS   AGE  READY  RESTARTS  GPU  NODE
  ----                                                 ------   ---  -----  --------  ---  ----
  tgi-qwen-4b-chat-v1-custom-serving-67b58c9865-m89lq  Running  3m   1/1    0         1    cn-beijing.192.168.XX.XX
```
`Available: 1` and `STATUS: Running` confirm that the pod is ready to serve requests.
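If the pod stays unready for longer than expected, you can tail the TGI logs to watch the model load. This uses the deployment name shown in the output above:

```shell
# Follow the TGI launcher logs until the server reports that it is listening on port 8000.
kubectl logs deployment/tgi-qwen-4b-chat-v1-custom-serving -f
```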
Step 3: Verify the inference service
1. Set up port forwarding to access the service from your local machine.

   **Important:** `kubectl port-forward` is for development and debugging only. It is not reliable, secure, or scalable for production traffic. For production networking in ACK clusters, see Ingress overview.

   ```shell
   kubectl port-forward svc/tgi-qwen-4b-chat-v1 8000:8000
   ```

   Expected output:

   ```
   Forwarding from 127.0.0.1:8000 -> 8000
   Forwarding from [::1]:8000 -> 8000
   ```

2. Send a test request to the chat completions endpoint:

   ```shell
   curl http://localhost:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{"model": "/model/Qwen1.5-4B-Chat/", "messages": [{"role": "user", "content": "Test"}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}'
   ```

   Expected output:

   ```json
   {"id":"","object":"text_completion","created":1716274541,"model":"/model/Qwen1.5-4B-Chat","system_fingerprint":"2.0.2-sha-6073ece","choices":[{"index":0,"message":{"role":"assistant","content":"OK. What test do you want me to run?"},"logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":21,"completion_tokens":10,"total_tokens":31}}
   ```

   A valid JSON response with a `choices` array confirms that the model is loaded and generating text.
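As an alternative to port forwarding, you can call the service from inside the cluster through its ClusterIP Service. The following is a sketch using a temporary curl pod; the `curlimages/curl` image and the pod name are illustrative choices, not requirements:

```shell
# Run a one-off pod and query the service by its in-cluster DNS name
# (Service tgi-qwen-4b-chat-v1 in the default namespace, port 8000).
kubectl run tgi-curl-test --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -s http://tgi-qwen-4b-chat-v1.default.svc.cluster.local:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/model/Qwen1.5-4B-Chat/", "messages": [{"role": "user", "content": "Test"}], "max_tokens": 10}'
```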
(Optional) Step 4: Clean up
If you no longer need the resources, delete them to avoid ongoing charges.
Delete the inference service:
```shell
arena serve delete tgi-qwen-4b-chat
```
Delete the PVC and PV:
```shell
kubectl delete pvc llm-model
kubectl delete pv llm-model
```
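If you no longer need the model data in OSS either, you can remove it as well, since stored objects continue to incur storage fees. This is a sketch; double-check the bucket and path before deleting:

```shell
# Recursively delete the uploaded model directory from the bucket.
ossutil rm oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat -r -f
```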