This guide walks you through deploying a Qwen1.5-4B-Chat inference service on Container Service for Kubernetes (ACK) using the Text Generation Inference (TGI) framework from Hugging Face, with an A10 GPU as the target hardware.
Background
Qwen1.5-4B-Chat
Qwen1.5-4B-Chat is a 4-billion-parameter large language model (LLM) developed by Alibaba Cloud, built on the Transformer architecture and trained on large-scale data covering web text, domain-specific books, and code. For more information, see the Qwen GitHub repository.
Text Generation Inference (TGI)
TGI is an open source toolkit from Hugging Face for serving LLMs as inference services. It supports inference acceleration techniques including Flash Attention, Paged Attention, continuous batching, and tensor parallelism. For more information, see the TGI documentation.
Prerequisites
Before you begin, ensure that you have:
- An ACK cluster with GPU-accelerated nodes (A10 GPU) running Kubernetes 1.22 or later. For more information, see Create an ACK cluster with GPU-accelerated nodes.

  We recommend that you install a GPU driver whose version is 525. You can add the label `ack.aliyun.com/nvidia-driver-version:525.105.17` to GPU-accelerated nodes to specify the GPU driver version as 525.105.17 (see the example after this list). For more information, see Specify an NVIDIA driver version for nodes by adding a label.
- The latest version of the Arena client installed. For more information, see Configure the Arena client.
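The label is normally set when you add GPU nodes or create the node pool, so that it is in place before ACK installs the driver. The following kubectl sketch only illustrates the label key and value; the node name is hypothetical, and the exact procedure is described in the linked guide:

```shell
# Hypothetical node name; replace with a name from `kubectl get nodes`.
# Setting the label on an already-initialized node does not by itself reinstall the driver.
kubectl label node cn-beijing.192.168.XX.XX ack.aliyun.com/nvidia-driver-version=525.105.17
```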
Step 1: Prepare model data
Download the Qwen1.5-4B-Chat model, upload it to Object Storage Service (OSS), and create a persistent volume (PV) and persistent volume claim (PVC) in the ACK cluster so the inference service can load the model at runtime.
For an alternative using File Storage NAS (NAS), see Mount a statically provisioned NAS volume.
Download model weights
1. Install Git:

   ```shell
   # RHEL/CentOS
   yum install git
   # Debian/Ubuntu
   # apt install git
   ```

2. Install Git Large File Storage (LFS):

   ```shell
   # RHEL/CentOS
   yum install git-lfs
   # Debian/Ubuntu
   # apt install git-lfs
   ```

3. Clone the Qwen1.5-4B-Chat repository from ModelScope, skipping LFS downloads:

   ```shell
   GIT_LFS_SKIP_MUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-4B-Chat.git
   ```

4. Pull the large model weight files managed by LFS:

   ```shell
   cd Qwen1.5-4B-Chat
   git lfs pull
   ```
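Optionally, confirm that the weight shards were fully downloaded before uploading them. This is a rough sanity check; the exact file names depend on the repository contents:

```shell
# Total size should be several gigabytes for Qwen1.5-4B-Chat.
du -sh .
# List the files and make sure none of them are small LFS pointer stubs.
ls -lh
```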
Upload the model to OSS
1. Log on to the OSS console and record the name of your OSS bucket. To create a bucket, see Create a bucket.

2. Install and configure ossutil. For more information, see Install ossutil.

3. Create a directory for the model in OSS:

   ```shell
   ossutil mkdir oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
   ```

4. Upload the model files to OSS:

   ```shell
   ossutil cp -r ./Qwen1.5-4B-Chat oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
   ```
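Optionally, list the uploaded objects to confirm that the transfer completed. Replace the bucket name with your own:

```shell
# Verify that the model files are now present in the bucket.
ossutil ls oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
```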
Configure a PV and PVC
Create a PV and PVC in the ACK cluster to mount the OSS-hosted model into the inference service container. For step-by-step instructions, see Mount a statically provisioned OSS volume.
Use the following settings. A sketch of equivalent manifests follows the two tables.
PV parameters
| Parameter | Value |
|---|---|
| PV type | OSS |
| Volume name | llm-model |
| Access certificate | Your AccessKey ID and AccessKey secret for the OSS bucket |
| Bucket ID | The name of your OSS bucket |
| OSS path | The path to the model directory, for example /Qwen1.5-4B-Chat if you uploaded the model to the root of the bucket as in the previous step |
PVC parameters
| Parameter | Value |
|---|---|
| PVC type | OSS |
| Volume name | llm-model |
| Allocation mode | Existing volumes |
| Existing volumes | Select the PV you created |
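If you prefer to create the volumes with kubectl instead of the console, the following is a minimal sketch of a statically provisioned OSS PV and PVC based on the settings above. The Secret name, capacity, endpoint URL, and mount options are illustrative assumptions; adapt them to your bucket region and the linked guide:

```yaml
# Hypothetical manifests; the endpoint, capacity, and otherOpts values are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
  namespace: default
stringData:
  akId: <Your-AccessKey-ID>
  akSecret: <Your-AccessKey-Secret>
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <Your-Bucket-Name>
      url: oss-cn-beijing-internal.aliyuncs.com   # use your bucket's region endpoint
      path: /Qwen1.5-4B-Chat
      otherOpts: "-o umask=022 -o allow_other"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
  namespace: default
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
```

After you apply the manifests, `kubectl get pvc llm-model` should report the claim as Bound.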
Step 2: Deploy the inference service
**Important:** TGI does not support V100 or T4 GPUs. Use an A10 or a GPU with a newer architecture.
Run the following Arena command to deploy the inference service:
```shell
arena serve custom \
  --name=tgi-qwen-4b-chat \
  --version=v1 \
  --gpus=1 \
  --replicas=1 \
  --restful-port=8000 \
  --readiness-probe-action="tcpSocket" \
  --readiness-probe-action-option="port: 8000" \
  --readiness-probe-option="initialDelaySeconds: 30" \
  --readiness-probe-option="periodSeconds: 30" \
  --image=ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/text-generation-inference:2.0.2-ubuntu22.04 \
  --data=llm-model:/model/Qwen1.5-4B-Chat \
  "text-generation-launcher --model-id /model/Qwen1.5-4B-Chat --num-shard 1 -p 8000"
```
The following table describes the parameters:
| Parameter | Description |
|---|---|
| `--name` | Name of the inference service. |
| `--version` | Version of the inference service. |
| `--gpus` | Number of GPUs per replica. |
| `--replicas` | Number of replicas. |
| `--restful-port` | Port exposed by the inference service. |
| `--readiness-probe-action` | Connection type for readiness probes. Valid values: HttpGet, Exec, gRPC, TCPSocket. |
| `--readiness-probe-action-option` | Connection method for readiness probes. |
| `--readiness-probe-option` | Readiness probe configuration. |
| `--data` | Mounts a PVC into the container. Format: `<pvc-name>:<mount-path>`. Run `arena data list` to list available PVCs. |
| `--image` | Container image for the inference service. |
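For reference, the three readiness probe flags translate into a standard Kubernetes `readinessProbe` in the generated pod spec. This is an illustrative sketch of the effect, not the exact manifest that Arena produces:

```yaml
# Approximate effect of the --readiness-probe-* flags above.
readinessProbe:
  tcpSocket:
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 30
```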
The expected output of the `arena serve custom` command is similar to the following:
```
service/tgi-qwen-4b-chat-v1 created
deployment.apps/tgi-qwen-4b-chat-v1-custom-serving created
INFO[0001] The Job tgi-qwen-4b-chat has been submitted successfully
INFO[0001] You can run `arena serve get tgi-qwen-4b-chat --type custom-serving -n default` to check the job status
```
Check that the service is running:
```shell
arena serve get tgi-qwen-4b-chat
```
The expected output is:
```
Name:       tgi-qwen-4b-chat
Namespace:  default
Type:       Custom
Version:    v1
Desired:    1
Available:  1
Age:        3m
Address:    172.16.XX.XX
Port:       RESTFUL:8000
GPU:        1

Instances:
  NAME                                                 STATUS   AGE  READY  RESTARTS  GPU  NODE
  ----                                                 ------   ---  -----  --------  ---  ----
  tgi-qwen-4b-chat-v1-custom-serving-67b58c9865-m89lq  Running  3m   1/1    0         1    cn-beijing.192.168.XX.XX
```
`Available: 1` and `STATUS: Running` confirm that the pod is ready to serve requests.
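If the pod stays unready for longer than expected, you can tail the TGI logs to watch the model load. This uses the deployment name shown in the output above:

```shell
# Follow the TGI launcher logs until the server reports that it is listening on port 8000.
kubectl logs deployment/tgi-qwen-4b-chat-v1-custom-serving -f
```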
Step 3: Verify the inference service
1. Set up port forwarding to access the service from your local machine.

   **Important:** `kubectl port-forward` is for development and debugging only. It is not reliable, secure, or scalable for production traffic. For production networking in ACK clusters, see Ingress overview.

   ```shell
   kubectl port-forward svc/tgi-qwen-4b-chat-v1 8000:8000
   ```

   Expected output:

   ```
   Forwarding from 127.0.0.1:8000 -> 8000
   Forwarding from [::1]:8000 -> 8000
   ```

2. Send a test request to the chat completions endpoint:

   ```shell
   curl http://localhost:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{"model": "/model/Qwen1.5-4B-Chat/", "messages": [{"role": "user", "content": "Test"}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}'
   ```

   Expected output:

   ```json
   {"id":"","object":"text_completion","created":1716274541,"model":"/model/Qwen1.5-4B-Chat","system_fingerprint":"2.0.2-sha-6073ece","choices":[{"index":0,"message":{"role":"assistant","content":"OK. What test do you want me to run?"},"logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":21,"completion_tokens":10,"total_tokens":31}}
   ```

   A valid JSON response with a `choices` array confirms that the model is loaded and generating text.
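As an alternative to port forwarding, you can call the service from inside the cluster through its ClusterIP Service. The following is a sketch using a temporary curl pod; the `curlimages/curl` image and the pod name are illustrative choices, not requirements:

```shell
# Run a one-off pod and query the service by its in-cluster DNS name
# (Service tgi-qwen-4b-chat-v1 in the default namespace, port 8000).
kubectl run tgi-curl-test --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -s http://tgi-qwen-4b-chat-v1.default.svc.cluster.local:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/model/Qwen1.5-4B-Chat/", "messages": [{"role": "user", "content": "Test"}], "max_tokens": 10}'
```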
(Optional) Step 4: Clean up
If you no longer need the resources, delete them to avoid ongoing charges.
Delete the inference service:
```shell
arena serve delete tgi-qwen-4b-chat
```
Delete the PVC and PV:
```shell
kubectl delete pvc llm-model
kubectl delete pv llm-model
```
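If you no longer need the model data in OSS either, you can remove it as well, since stored objects continue to incur storage fees. This is a sketch; double-check the bucket and path before deleting:

```shell
# Recursively delete the uploaded model directory from the bucket.
ossutil rm oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat -r -f
```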