ACK managed Pro clusters give you a ready-to-use environment for running large language model (LLM) inference services—no local GPU hardware required, no complex dependency setup. This guide covers two deployment paths: a quick option for validating a model in about 15 minutes, and a production-grade option that pre-loads model files onto persistent storage to reduce cold-start times and bandwidth costs.
Prerequisites
Before you begin, ensure that you have:
- An ACK managed Pro cluster running Kubernetes 1.22 or later
- At least one GPU-accelerated node with 16 GB or more of GPU memory
- NVIDIA driver version 535 or later installed on the GPU node pool (this guide uses 550.144.03, set via the `ack.aliyun.com/nvidia-driver-version` label)
- The Arena client installed
Choose a deployment path
| | Option 1: Quick test | Option 2: Production |
|---|---|---|
| Setup time | ~15 minutes | Longer (model pre-upload required) |
| Model storage | Downloaded into the container at startup | Pre-loaded on Object Storage Service (OSS) |
| Cold-start | Slow — model re-downloads on every pod restart | Fast — model is already on the mounted volume |
| Best for | Validating inference capabilities | Stable, repeatable production workloads |
Option 1: Quick deployment for testing
Use Arena to deploy qwen/Qwen1.5-4B-Chat from ModelScope. The container downloads the model at startup, so the GPU node needs at least 30 GB of free disk space.
1. Run the Arena command to deploy the inference service:

   ```shell
   arena serve custom \
     --name=modelscope \
     --version=v1 \
     --gpus=1 \
     --replicas=1 \
     --restful-port=8000 \
     --readiness-probe-action="tcpSocket" \
     --readiness-probe-action-option="port: 8000" \
     --readiness-probe-option="initialDelaySeconds: 30" \
     --readiness-probe-option="periodSeconds: 30" \
     --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/quick-deploy-llm:v1 \
     "MODEL_ID=qwen/Qwen1.5-4B-Chat python3 server.py"
   ```

   To pull model files from a Hugging Face repository, see Pull models from Hugging Face.

   The following output confirms the Kubernetes resources for `modelscope-v1` were created:

   ```
   service/modelscope-v1 created
   deployment.apps/modelscope-v1-custom-serving created
   INFO[0002] The Job modelscope has been submitted successfully
   INFO[0002] You can run `arena serve get modelscope --type custom-serving -n default` to check the job status
   ```

2. Check the service status. The pod stays in `ContainerCreating` while the model downloads. Depending on network conditions, this can take 5–15 minutes:

   ```shell
   arena serve get modelscope
   ```

   Once the pod status shows `Running`, the inference service is ready.
Option 2: Production-ready deployment with persistent storage
Pre-loading model files on OSS avoids re-downloading files larger than 10 GB every time a pod restarts. This reduces cold-start times, lowers bandwidth costs, and improves service stability.
Step 1: Download the model files
1. Install Git and Git Large File Storage (LFS).

   macOS:

   ```shell
   brew install git
   brew install git-lfs
   ```

   Windows: Download and install Git from the official Git website. Git Large File Storage is bundled with Git for Windows; download the latest version.

   Linux (Red Hat-based):

   ```shell
   yum install git
   yum install git-lfs
   ```

   For other Linux distributions, see the official Git website.

2. Clone the Qwen1.5-4B-Chat model repository and pull the large files:

   ```shell
   GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-4B-Chat.git
   cd Qwen1.5-4B-Chat
   git lfs pull
   ```
Step 2: Upload model files to OSS
1. Create a bucket. To reduce model pull latency, create the bucket in the same region as your cluster:

   ```shell
   ossutil mb oss://<your-bucket-name>
   ```

2. Create a folder in the bucket for the model files:

   ```shell
   ossutil mkdir oss://<your-bucket-name>/Qwen1.5-4B-Chat
   ```

3. Upload the model files:

   ```shell
   ossutil cp -r ./Qwen1.5-4B-Chat oss://<your-bucket-name>/Qwen1.5-4B-Chat
   ```
Step 3: Configure a persistent volume (PV)
1. Log on to the ACK console and click the target cluster. In the left navigation pane, choose Volumes > Persistent Volumes.

2. Click Create. In the Create PV dialog box, set the following parameters and click Create:

   | Parameter | Value |
   |---|---|
   | PV type | OSS |
   | Volume name | `llm-model` |
   | Capacity | 20Gi |
   | Access mode | ReadOnlyMany |
   | Access certificate | Select Create Secret |
   | Optional parameters | `-o umask=022 -o max_stat_cache_size=0 -o allow_other` |
   | Bucket ID | Click Select Bucket and select your bucket |
   | OSS path | `/Qwen1.5-4B-Chat` |
   | Endpoint | Select Public Endpoint |
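If you manage resources with kubectl rather than the console, the same PV can be sketched as a statically provisioned OSS volume manifest. This is a sketch, not the console's exact output: it assumes the ACK OSS CSI driver (`ossplugin.csi.alibabacloud.com`) and a pre-created Secret named `oss-secret` holding your AccessKey pair; replace the bucket, region endpoint, and secret name with your own values.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
spec:
  capacity:
    storage: 20Gi                    # matches the Capacity console parameter
  accessModes:
    - ReadOnlyMany                   # matches the Access mode console parameter
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model          # must match metadata.name for static OSS PVs
    nodePublishSecretRef:
      name: oss-secret               # assumed Secret with the AccessKey pair
      namespace: default
    volumeAttributes:
      bucket: "<your-bucket-name>"
      path: "/Qwen1.5-4B-Chat"       # the OSS path created in Step 2
      url: "oss-<your-region>.aliyuncs.com"
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
```

Apply it with `kubectl apply -f`, then confirm the PV reaches the `Available` state with `kubectl get pv llm-model` before creating the claim.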
Step 4: Configure a persistent volume claim (PVC)
1. In the left navigation pane, choose Volumes > Persistent Volume Claims.

2. On the Persistent Volume Claims page, set the following parameters and click Create:

   | Parameter | Value |
   |---|---|
   | PVC type | OSS |
   | Name | `llm-model` |
   | Allocation mode | Select Existing Volumes |
   | Existing volumes | Select the `llm-model` PV created in the previous step |
   | Capacity | 20Gi |
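The equivalent claim can also be sketched as a manifest. The sketch below binds directly to the `llm-model` PV from Step 3 by name (`volumeName`), and sets `storageClassName: ""` so Kubernetes does not try to dynamically provision a new volume; it assumes the `default` namespace.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
  namespace: default
spec:
  accessModes:
    - ReadOnlyMany        # must be compatible with the PV's access mode
  storageClassName: ""    # disable dynamic provisioning; bind statically
  volumeName: llm-model   # bind to the PV created in Step 3
  resources:
    requests:
      storage: 20Gi
```

After applying it, `kubectl get pvc llm-model` should show the claim in the `Bound` state.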
Step 5: Deploy the inference service
Run the Arena command to deploy the service. The `--data` flag mounts the PVC containing the pre-loaded model files. Because the model is already on the mounted volume, the pod starts without downloading anything:

```shell
arena serve custom \
  --name=modelscope \
  --version=v1 \
  --gpus=1 \
  --replicas=1 \
  --restful-port=8000 \
  --readiness-probe-action="tcpSocket" \
  --readiness-probe-action-option="port: 8000" \
  --readiness-probe-option="initialDelaySeconds: 30" \
  --readiness-probe-option="periodSeconds: 30" \
  --data=llm-model:/Qwen1.5-4B-Chat \
  --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/quick-deploy-llm:v1 \
  "MODEL_ID=/Qwen1.5-4B-Chat python3 server.py"
```
The following output confirms the inference service was submitted:
```
service/modelscope-v1 created
deployment.apps/modelscope-v1-custom-serving created
INFO[0001] The Job modelscope has been submitted successfully
INFO[0001] You can run `arena serve get modelscope --type custom-serving -n default` to check the job status
```
Check the service status:

```shell
arena serve get modelscope
```

Once the pod status shows `Running`, the inference service is ready.
Validate the inference service
1. Set up port forwarding to the inference service:

   Important: `kubectl port-forward` is for development and debugging only. It is not reliable, secure, or scalable in production. For production networking, see Ingress management.

   ```shell
   kubectl port-forward svc/modelscope-v1 8000:8000
   ```

   Expected output:

   ```
   Forwarding from 127.0.0.1:8000 -> 8000
   Forwarding from [::1]:8000 -> 8000
   ```

2. In a new terminal, send a test inference request:

   ```shell
   curl -X POST http://localhost:8000/generate \
     -H "Content-Type: application/json" \
     -d '{
       "text_input": "What is artificial intelligence? Artificial intelligence is",
       "parameters": {
         "stream": false,
         "temperature": 0.9,
         "seed": 10
       }
     }'
   ```

   A successful response contains the model's generated text:

   ```
   {"model_name":"/Qwen1.5-4B-Chat","text_output":"What is artificial intelligence? Artificial intelligence is a branch of computer science that studies how to make computers have intelligent behavior."}
   ```
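If you prefer to script the check instead of using curl, the same request can be sent from Python with the standard library. This is a minimal sketch: the payload fields mirror the curl example above, and the `generate` helper is a name introduced here for illustration; it assumes `kubectl port-forward` is running so the service is reachable at `localhost:8000`.

```python
import json
import urllib.request


def build_payload(prompt: str, temperature: float = 0.9, seed: int = 10) -> bytes:
    """Build the JSON request body for the /generate endpoint,
    matching the fields the curl example sends."""
    return json.dumps({
        "text_input": prompt,
        "parameters": {"stream": False, "temperature": temperature, "seed": seed},
    }).encode("utf-8")


def generate(prompt: str, url: str = "http://localhost:8000/generate") -> str:
    """POST a prompt to the port-forwarded service and return the generated text."""
    req = urllib.request.Request(
        url,
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text_output"]


# With port forwarding active, this would print the model's completion:
# print(generate("What is artificial intelligence? Artificial intelligence is"))
```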
(Optional) Clean up
Delete the inference service and storage resources when you're done:
```shell
# Delete the inference service
arena serve del modelscope

# Delete the PVC and PV (Option 2 only)
kubectl delete pvc llm-model
kubectl delete pv llm-model
```
FAQ
How can I pull model files from Hugging Face instead of ModelScope?
Make sure the container runtime can reach the Hugging Face repository, then set `MODEL_SOURCE=Huggingface` in the Arena command. The GPU node needs at least 30 GB of free disk space to accommodate the downloaded files:
```shell
arena serve custom \
  --name=huggingface \
  --version=v1 \
  --gpus=1 \
  --replicas=1 \
  --restful-port=8000 \
  --readiness-probe-action="tcpSocket" \
  --readiness-probe-action-option="port: 8000" \
  --readiness-probe-option="initialDelaySeconds: 30" \
  --readiness-probe-option="periodSeconds: 30" \
  --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/quick-deploy-llm:v1 \
  "MODEL_ID=Qwen/Qwen1.5-4B-Chat MODEL_SOURCE=Huggingface python3 server.py"
```
The following output confirms the resources were created:
```
service/huggingface-v1 created
deployment.apps/huggingface-v1-custom-serving created
INFO[0003] The Job huggingface has been submitted successfully
INFO[0003] You can run `arena serve get huggingface --type custom-serving -n default` to check the job status
```
Appendix: command parameter reference
| Parameter | Description | Example |
|---|---|---|
| `serve custom` | Arena subcommand. Deploys a custom model service rather than a preset type such as tfserving or triton. | — |
| `--name` | Service name. A unique identifier used for subsequent operations such as checking logs and deleting the service. | `modelscope` |
| `--version` | Service version. A version label for the service, useful for version management and phased releases. | `v1` |
| `--gpus` | GPU count. The number of GPUs allocated to each pod. Required when the model needs GPUs for inference. | `1` |
| `--replicas` | Replica count. The number of pods to run. More replicas increase concurrent throughput and availability. | `1` |
| `--restful-port` | RESTful API port. The port on which the service exposes its RESTful API to receive inference requests. | `8000` |
| `--readiness-probe-action` | Readiness probe type. The check method used by the Kubernetes readiness probe to determine whether the container is ready to receive traffic. | `tcpSocket` |
| `--readiness-probe-action-option` | Probe type options. Parameters for the chosen probe type. For `tcpSocket`, specifies the port to check. | `port: 8000` |
| `--readiness-probe-option` | Additional probe settings. Extra parameters for the readiness probe. This flag can be repeated. Sets the initial delay and check interval. | `initialDelaySeconds: 30`, `periodSeconds: 30` |
| `--data` | Volume mount. Mounts a PVC at a specified path inside the container, in the format `<pvc-name>:<mount-path>`. Used to mount pre-loaded model files. | `llm-model:/Qwen1.5-4B-Chat` |
| `--image` | Container image. The full URL of the container image that defines the runtime environment for the service. | `kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/quick-deploy-llm:v1` |
| `[COMMAND]` | Startup command. The command to run after the container starts. Sets the `MODEL_ID` environment variable and launches `server.py`. | `"MODEL_ID=/Qwen1.5-4B-Chat python3 server.py"` |
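For reference, the three readiness-probe flags in the table correspond to a standard Kubernetes `readinessProbe` on the container spec of the Deployment that Arena generates. A rough sketch of the resulting probe configuration:

```yaml
readinessProbe:
  tcpSocket:
    port: 8000            # from --readiness-probe-action and its port option
  initialDelaySeconds: 30  # from --readiness-probe-option="initialDelaySeconds: 30"
  periodSeconds: 30        # from --readiness-probe-option="periodSeconds: 30"
```

Kubernetes routes traffic to the pod only after the TCP check on port 8000 succeeds, which is why the service reports `Running` only once the model server is listening.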
What's next
- To specify an NVIDIA driver version for GPU nodes, see Specify an NVIDIA driver version for nodes by adding a label.
- To use production-grade inference frameworks such as vLLM or Triton, see Deploy a Qwen model inference service using vLLM and Deploy a Qwen model inference service using Triton.