
Container Service for Kubernetes: Use TGI to deploy Qwen inference services in ACK

Last Updated: Mar 26, 2026

This guide walks you through deploying a Qwen1.5-4B-Chat inference service on Container Service for Kubernetes (ACK) using the Text Generation Inference (TGI) framework from Hugging Face, with an A10 GPU as the target hardware.

Background

Qwen1.5-4B-Chat

Qwen1.5-4B-Chat is a 4-billion-parameter large language model (LLM) developed by Alibaba Cloud, built on the Transformer architecture and trained on large-scale data covering web text, domain-specific books, and code. For more information, see the Qwen GitHub repository.

Text Generation Inference (TGI)

TGI is an open source toolkit from Hugging Face for serving LLMs as inference services. It supports inference acceleration techniques including Flash Attention, Paged Attention, continuous batching, and tensor parallelism. For more information, see the TGI documentation.

Prerequisites

Before you begin, ensure that you have:

- An ACK cluster that contains GPU-accelerated nodes with A10 GPUs or a newer GPU architecture.
- kubectl configured to connect to the cluster.
- The Arena client installed, because the deployment steps in this guide use Arena commands.

Important

TGI does not support V100 or T4 GPUs. Use an A10 or a GPU with a newer architecture.
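To confirm that the cluster has schedulable GPUs before you deploy, you can check the GPU resources that each node reports. This is a minimal check that assumes the NVIDIA device plugin on your GPU nodes exposes the standard nvidia.com/gpu resource:

kubectl describe nodes | grep -E "Name:|nvidia.com/gpu"

Nodes that list a non-zero nvidia.com/gpu value under Allocatable can host the inference pod.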

Step 1: Prepare model data

Download the Qwen1.5-4B-Chat model, upload it to Object Storage Service (OSS), and create a persistent volume (PV) and persistent volume claim (PVC) in the ACK cluster so the inference service can load the model at runtime.

For an alternative using File Storage NAS (NAS), see Mount a statically provisioned NAS volume.

Download model weights

  1. Install Git:

    # RHEL/CentOS
    yum install git
    # Debian/Ubuntu
    apt install git
  2. Install Git Large File Storage (LFS):

    # RHEL/CentOS
    yum install git-lfs
    # Debian/Ubuntu
    apt install git-lfs
  3. Clone the Qwen1.5-4B-Chat repository from ModelScope, skipping LFS downloads:

    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-4B-Chat.git
  4. Pull the large model weight files managed by LFS:

    cd Qwen1.5-4B-Chat
    git lfs pull
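
To confirm that the weight files were fully downloaded rather than left as small LFS pointer files, you can check their sizes. This sketch assumes that you are still in the cloned directory and that the weights are stored as .safetensors files (the exact file names depend on the repository):

ls -lh ./*.safetensors

Files that are only a few hundred bytes in size indicate that git lfs pull did not complete.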

Upload the model to OSS

  1. Log on to the OSS console and record the name of your OSS bucket. To create a bucket, see Create a bucket.

  2. Install and configure ossutil. For more information, see Install ossutil.

  3. Create a directory for the model in OSS:

    ossutil mkdir oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
  4. Upload the model files to OSS:

    ossutil cp -r ./Qwen1.5-4B-Chat oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
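
You can list the uploaded objects to confirm that the copy completed before you move on:

ossutil ls oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat

The output should show the model weight and configuration files that you downloaded in the previous step.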

Configure a PV and PVC

Create a PV and PVC in the ACK cluster to mount the OSS-hosted model into the inference service container. For step-by-step instructions, see Mount a statically provisioned OSS volume.

Use the following settings:

PV parameters

Parameter           Value
---------           -----
PV type             OSS
Volume name         llm-model
Access certificate  Your AccessKey ID and AccessKey secret for the OSS bucket
Bucket ID           The name of your OSS bucket
OSS path            The path to the model directory that you uploaded in the previous step, for example /Qwen1.5-4B-Chat

PVC parameters

Parameter         Value
---------         -----
PVC type          OSS
Volume name       llm-model
Allocation mode   Existing volumes
Existing volumes  Select the PV that you created
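
After you create the volumes, confirm that both are in the Bound state before you deploy the inference service:

kubectl get pv llm-model
kubectl get pvc llm-model

If the PVC stays in the Pending state, re-check the access credentials and bucket information in the PV configuration.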

Step 2: Deploy the inference service

Run the following Arena command to deploy the inference service:

arena serve custom \
    --name=tgi-qwen-4b-chat \
    --version=v1 \
    --gpus=1 \
    --replicas=1 \
    --restful-port=8000 \
    --readiness-probe-action="tcpSocket" \
    --readiness-probe-action-option="port: 8000" \
    --readiness-probe-option="initialDelaySeconds: 30" \
    --readiness-probe-option="periodSeconds: 30" \
    --image=ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/text-generation-inference:2.0.2-ubuntu22.04 \
    --data=llm-model:/model/Qwen1.5-4B-Chat \
    "text-generation-launcher --model-id /model/Qwen1.5-4B-Chat --num-shard 1 -p 8000"

The following table describes the parameters:

Parameter                        Description
---------                        -----------
--name                           Name of the inference service
--version                        Version of the inference service
--gpus                           Number of GPUs per replica
--replicas                       Number of replicas
--restful-port                   Port exposed by the inference service
--readiness-probe-action         Connection type for readiness probes. Valid values: HttpGet, Exec, gRPC, TCPSocket
--readiness-probe-action-option  Connection method for readiness probes
--readiness-probe-option         Readiness probe configuration
--data                           Mounts a PVC into the container. Format: <pvc-name>:<mount-path>. Run arena data list to list available PVCs
--image                          Container image for the inference service

The expected output is:

service/tgi-qwen-4b-chat-v1 created
deployment.apps/tgi-qwen-4b-chat-v1-custom-serving created
INFO[0001] The Job tgi-qwen-4b-chat has been submitted successfully
INFO[0001] You can run `arena serve get tgi-qwen-4b-chat --type custom-serving -n default` to check the job status

Check that the service is running:

arena serve get tgi-qwen-4b-chat

The expected output is:

Name:       tgi-qwen-4b-chat
Namespace:  default
Type:       Custom
Version:    v1
Desired:    1
Available:  1
Age:        3m
Address:    172.16.XX.XX
Port:       RESTFUL:8000
GPU:        1

Instances:
  NAME                                                 STATUS   AGE  READY  RESTARTS  GPU  NODE
  ----                                                 ------   ---  -----  --------  ---  ----
  tgi-qwen-4b-chat-v1-custom-serving-67b58c9865-m89lq  Running  3m   1/1    0         1    cn-beijing.192.168.XX.XX

When Available is 1 and the instance STATUS is Running, the pod is ready to serve requests.
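
If the pod does not reach the Running state, or Available stays at 0, the model may still be loading or the mount path may be wrong. You can inspect the TGI launcher output with kubectl, using the Deployment name shown in the output of the deployment command above:

kubectl logs deploy/tgi-qwen-4b-chat-v1-custom-serving -f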

Step 3: Verify the inference service

  1. Set up port forwarding to access the service from your local machine:

    Important

    kubectl port-forward is for development and debugging only. It is not reliable, secure, or scalable for production traffic. For production networking in ACK clusters, see Ingress overview.

    kubectl port-forward svc/tgi-qwen-4b-chat-v1 8000:8000

    Expected output:

    Forwarding from 127.0.0.1:8000 -> 8000
    Forwarding from [::1]:8000 -> 8000
  2. Send a test request to the chat completions endpoint:

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "/model/Qwen1.5-4B-Chat/", "messages": [{"role": "user", "content": "Test"}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}'

    Expected output:

    {"id":"","object":"text_completion","created":1716274541,"model":"/model/Qwen1.5-4B-Chat","system_fingerprint":"2.0.2-sha-6073ece","choices":[{"index":0,"message":{"role":"assistant","content":"OK. What test do you want me to run?"},"logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":21,"completion_tokens":10,"total_tokens":31}}

    A valid JSON response with a choices array confirms the model is loaded and generating text.
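
    If you want to watch tokens arrive as they are generated, you can also try the same endpoint in streaming mode. This is a sketch that assumes the Messages API of this TGI version honors the stream parameter and returns server-sent events:

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "/model/Qwen1.5-4B-Chat/", "messages": [{"role": "user", "content": "Test"}], "max_tokens": 10, "stream": true}'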

(Optional) Step 4: Clean up

If you no longer need the resources, delete them to avoid ongoing charges.

Delete the inference service:

arena serve delete tgi-qwen-4b-chat

Delete the PVC and PV:

kubectl delete pvc llm-model
kubectl delete pv llm-model
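
If you uploaded the model to OSS only for this walkthrough, you can also delete the objects to stop storage charges. This permanently removes the files, so double-check the bucket name and path before you run the command:

ossutil rm -rf oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat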