Container Service for Kubernetes (ACK) Auto Mode clusters are optimized for GPU elasticity and automatically handle the scaling and basic operations of GPU nodes. This topic uses the Qwen3.5-2B model as an example to show how to quickly deploy a large model inference service with GPU compute on an ACK Auto Mode cluster.
Prerequisites
You have created an ACK Auto Mode cluster.
You have created an eligible GPU node pool in intelligent hosting mode.
Step 1: Prepare model files and mount OSS
In this step, you use a temporary ECS instance to download the Qwen3.5-2B model files from ModelScope, upload them to an OSS bucket, and then configure a PersistentVolume (PV) and a PersistentVolumeClaim (PVC) for the cluster. Mounting the model to the inference container as a volume avoids repeated downloads when the container starts.
Make sure that the following prerequisites are met:
You have created an OSS bucket.
You have installed and configured ossutil on the temporary ECS instance.
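If ossutil is not configured yet, the interactive config command (available in both ossutil 1.x and 2.0) prompts for the endpoint and credentials. The endpoint in the comment below is only an example; use the endpoint of your bucket's region.

ossutil config
# When prompted, enter the OSS endpoint (for example, oss-cn-hangzhou.aliyuncs.com)
# and the AccessKey ID/Secret of a RAM user that can access the bucket.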
1. Download the Qwen3.5-2B model
Perform the following steps on the temporary ECS instance to download the model files from ModelScope.
Install Git.
# You can run yum install git or apt install git to install it.
sudo yum install git

Install the Git Large File Storage (LFS) extension.
# You can run yum install git-lfs or apt install git-lfs to install it.
sudo yum install git-lfs

Initialize Git LFS and clone the Qwen3.5-2B repository from ModelScope. The GIT_LFS_SKIP_SMUDGE=1 setting skips the LFS-managed large files during the clone to prevent duplicate downloads.
git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/Qwen3.5-2B.git

Change to the repository directory and pull the LFS-managed large model files.
cd Qwen3.5-2B/
git lfs pull
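To confirm that the weights were fully downloaded rather than left as LFS pointer stubs, you can inspect the directory. The exact file names vary by model release, so treat the comment below as illustrative.

ls -lh
# Expect multi-gigabyte *.safetensors weight files alongside config.json and tokenizer files.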
2. Upload the model files to OSS
Create a directory in the OSS bucket to store the model.
Replace <Your-Bucket-Name> with your actual bucket name.

ossutil mkdir oss://<Your-Bucket-Name>/models/Qwen3.5-2B

Upload the local model files to OSS.
ossutil cp -r ./Qwen3.5-2B oss://<Your-Bucket-Name>/models/Qwen3.5-2B
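Optionally, verify that the files landed in the expected OSS path. ossutil ls is a standard ossutil command; replace the bucket name as before.

ossutil ls oss://<Your-Bucket-Name>/models/Qwen3.5-2B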
3. Configure an OSS volume
Create a PV and a PVC to allow pods to mount the model directory in OSS as a read-only volume. For more information, see Use a static volume with ossfs 2.0.
Select an authentication method (RRSA or AccessKey) and prepare access credentials to ensure that the cluster can securely access OSS bucket resources.
This example uses AccessKey authentication. The configuration differs slightly between the two methods; for details, see Use a static volume with ossfs 2.0.
Store your AccessKey as a Secret for the PV.
Replace <yourAccessKeyID> and <yourAccessKeySecret> with your actual credentials. The namespace of the Secret must match the namespace of the application.

kubectl create -n default secret generic oss-secret --from-literal='akId=<yourAccessKeyID>' --from-literal='akSecret=<yourAccessKeySecret>'

Create a PV and a PVC to mount the model directory in OSS as a read-only volume. The following example uses a static volume with ossfs 2.0.
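The manifest below is a minimal sketch of such a static volume. The PV and PVC names (model-oss-pv, model-oss-pvc), the capacity, and the endpoint URL are placeholder assumptions; the CSI attribute names follow the ACK OSS CSI plugin conventions and should be confirmed against Use a static volume with ossfs 2.0.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: model-oss-pv            # placeholder name
  labels:
    alicloud-pvname: model-oss-pv
spec:
  capacity:
    storage: 30Gi               # assumed size; must cover the model files
  accessModes:
    - ReadOnlyMany              # the model is mounted read-only
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: model-oss-pv  # must match the PV name
    nodePublishSecretRef:
      name: oss-secret          # the Secret created above
      namespace: default
    volumeAttributes:
      bucket: "<Your-Bucket-Name>"
      url: "oss-cn-hangzhou-internal.aliyuncs.com"  # assumed region endpoint
      path: "/models/Qwen3.5-2B"   # model directory uploaded in step 2
      fuseType: "ossfs2"           # selects ossfs 2.0; confirm the exact value in the linked topic
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-oss-pvc           # placeholder name, referenced by the Deployment later
  namespace: default
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: model-oss-pv

Save the manifest (the file name below is a placeholder), apply it, and confirm that the PVC binds before moving on:

kubectl apply -f qwen-oss-volume.yaml
kubectl get pvc model-oss-pvc
# STATUS should show Bound.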
Step 2: Deploy and verify the inference service
1. Create a Deployment and a Service
Use the vLLM framework to deploy the Qwen3.5-2B model as a Deployment and expose it as a LoadBalancer Service.
On the ACK Clusters page, click the name of your cluster. In the left navigation pane, choose Workloads > Deployments.
Click Create from YAML and submit the following YAML content.
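The manifest below is a minimal sketch of such a Deployment and Service. It assumes the public vllm/vllm-openai image and the vllm serve command line; the mount path, GPU count, and probe settings are illustrative assumptions, while the Service name (qwen-2b), the served model name, and port 8000 match the verification commands later in this topic.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen-2b
  labels:
    app: qwen-2b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen-2b
  template:
    metadata:
      labels:
        app: qwen-2b
    spec:
      containers:
        - name: vllm
          # Assumed image; substitute the vLLM image you actually use.
          image: vllm/vllm-openai:latest
          command: ["sh", "-c"]
          args:
            - >-
              vllm serve /models/Qwen3.5-2B
              --served-model-name Qwen3.5-2B
              --port 8000
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "1"   # GPU request; drives scheduling onto a GPU node
          volumeMounts:
            - name: model
              mountPath: /models/Qwen3.5-2B   # assumed path, matching the serve command
              readOnly: true
          readinessProbe:           # optional; the vLLM OpenAI server exposes /health
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
      volumes:
        - name: model
          persistentVolumeClaim:
            claimName: model-oss-pvc   # PVC created in Step 1
---
apiVersion: v1
kind: Service
metadata:
  name: qwen-2b    # name used by the verification commands below
spec:
  type: LoadBalancer
  selector:
    app: qwen-2b
  ports:
    - port: 8000
      targetPort: 8000

Requesting nvidia.com/gpu in the resource limits is what leaves the pod Pending when no GPU node is available, which in turn triggers the Auto Mode scale-out described next.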
After you submit the YAML, if the cluster has insufficient GPU resources, the pod enters the Pending state. ACK Auto Mode automatically triggers GPU node scaling, creates new nodes, and schedules the pod onto a new node once it is initialized. No manual intervention is required. When the pod enters the Running state, the model service is deployed.

After the deployment is complete, you can view the application status on the Deployments page.
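You can also watch this process from the command line. These are standard kubectl commands; the label selector assumes the app: qwen-2b label from the manifest above.

# Watch the pod go from Pending to Running as the GPU node comes up.
kubectl get pods -l app=qwen-2b -w
# Optionally confirm that a new GPU node was added to the cluster.
kubectl get nodes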
2. Verify the inference service
Get the public IP address exposed by the Service.
export EXTERNAL_IP=$(kubectl get svc qwen-2b -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo ${EXTERNAL_IP}

Send an inference request to verify that the service is available.
Replace 8.XX.XX.89 with your public IP address.

curl http://8.XX.XX.89:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-2B",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Kubernetes"
          }
        ]
      }
    ],
    "max_tokens": 200
  }'

Expected output:
{"id":"chatcmpl-98f158cdbbb38087","object":"chat.completion","created":1775043962,"model":"Qwen3.5-2B","choices":[{"index":0,"message":{"role":"assistant","content":"**Kubernetes** is an open-source container orchestration platform that automates deployment, scaling, management, and repair of containerized applications..."},"finish_reason":"length"}],"usage":{"prompt_tokens":14,"total_tokens":214,"completion_tokens":200}}