
Container Service for Kubernetes: Deploying a Qwen3.5-2B large model inference service in ACK Auto Mode

Last Updated: Apr 14, 2026

Container Service for Kubernetes (ACK) Auto Mode clusters are optimized for GPU elasticity and automatically handle the scaling and basic operations of GPU nodes. This topic uses the Qwen3.5-2B model as an example to show how to quickly deploy a large model inference service with GPU compute on an ACK Auto Mode cluster.

Prerequisites

  • You have created an ACK Auto Mode cluster.

  • You have created a GPU node pool that runs in intelligent managed mode and meets the requirements below.

    Details

    1. On the ACK Clusters page, click the name of your cluster. In the left navigation pane, click Nodes > Node Pools.

    2. On the Node Pools page, click Create Node Pool, and then in the Create Node Pool dialog box, configure the node pool.

      Configure the following key settings. For more information, see Create a node pool.

      • Configure Managed Node Pool: Select intelligent managed mode.

      • Instance-related settings: Set Instance Configuration Mode to Specify Instance Type, and then select a GPU-accelerated ECS instance type, for example one equipped with a V100, A10, or T4 GPU.

      • Node Labels: Add the label ack.aliyun.com/nvidia-driver-version:550.144.03 to pin the NVIDIA driver version to 550.144.03. After nodes are provisioned, you can verify the label with the commands sketched after this list.

      • Container Image Acceleration: Enable this feature to pull model images faster.
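
      After the node pool has provisioned nodes, you can confirm that the GPU nodes are ready and that the driver label is applied. This is a minimal check using standard kubectl commands; replace <your-gpu-node-name> with the name of one of your GPU nodes.

      # Show the NVIDIA driver version label on each node.
      kubectl get nodes -L ack.aliyun.com/nvidia-driver-version
      # Confirm that a GPU node reports allocatable nvidia.com/gpu resources.
      kubectl describe node <your-gpu-node-name> | grep -i "nvidia.com/gpu"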

Step 1: Prepare model files and mount OSS

In this step, you use a temporary ECS instance to download the Qwen3.5-2B model files from ModelScope, upload them to an OSS bucket, and then configure a PersistentVolume (PV) and a PersistentVolumeClaim (PVC) for the cluster. Mounting the model to the inference container as a volume avoids repeated downloads when the container starts.

Before you begin, prepare a temporary ECS instance that can access the internet and OSS, and install and configure ossutil on it.

1. Download the Qwen3.5-2B model

Perform the following steps on the temporary ECS instance to download the model files from ModelScope.

  1. Install Git.

    # You can run yum install git or apt install git to install it.
    sudo yum install git
  2. Install the Git Large File Storage (LFS) extension.

    # You can run yum install git-lfs or apt install git-lfs to install it.
    sudo yum install git-lfs
  3. Initialize Git LFS and clone the Qwen3.5-2B repository from ModelScope. Setting GIT_LFS_SKIP_SMUDGE=1 skips the LFS-managed large files during the clone so that they can be pulled separately in the next step.

    git lfs install
    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/Qwen3.5-2B.git
  4. Change to the repository directory and pull the LFS-managed large model files.

    cd Qwen3.5-2B/
    git lfs pull
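  5. Optional: Verify that the download is complete. These are standard shell commands; the safetensors weight files account for most of the directory size.

    # List the downloaded files and their sizes.
    ls -lh
    # Show the total size of the model directory.
    du -sh .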

2. Upload the model files to OSS

  1. Create a directory in the OSS bucket to store the model.

    Replace <Your-Bucket-Name> with your actual bucket name.

    ossutil mkdir oss://<Your-Bucket-Name>/models/Qwen3.5-2B
  2. Upload the local model files to OSS.

    ossutil cp -r ./Qwen3.5-2B oss://<Your-Bucket-Name>/models/Qwen3.5-2B
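  3. Optional: Confirm that the upload succeeded by listing the objects in the target directory. Replace <Your-Bucket-Name> with your actual bucket name.

    ossutil ls oss://<Your-Bucket-Name>/models/Qwen3.5-2B/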

3. Configure an OSS volume

Create a PV and a PVC to allow pods to mount the model directory in OSS as a read-only volume. For more information, see Use a static volume with ossfs 2.0.

  1. Select an authentication method (RRSA or AccessKey) and prepare access credentials to ensure that the cluster can securely access OSS bucket resources.

    This example uses AccessKey authentication. The two methods differ slightly. For more information, see Use a static volume with ossfs 2.0.
  2. Store your AccessKey as a Secret for the PV.

    Replace <yourAccessKeyID> and <yourAccessKeySecret> with your actual credentials. The namespace of the Secret must match the namespace of the application.

    kubectl create -n default secret generic oss-secret --from-literal='akId=<yourAccessKeyID>' --from-literal='akSecret=<yourAccessKeySecret>'
  3. Create a PV and a PVC to mount the model directory in OSS as a read-only volume. The following example uses a static volume with ossfs 2.0.

    Sample code

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      # PV name.
      name: llm-model  
    spec:
      capacity:
        # Volume capacity. This value is used only to match the PVC.
        storage: 30Gi  
      # Access mode.
      accessModes:  
        - ReadOnlyMany
      persistentVolumeReclaimPolicy: Retain
      csi:
        driver: ossplugin.csi.alibabacloud.com
        # Must match the PV name (metadata.name).
        volumeHandle: llm-model   
        # Use the Secret created earlier.
        nodePublishSecretRef:
          # Name of the Secret storing the AccessKey.
          name: oss-secret  
          # Namespace of the Secret.
          namespace: default  
        volumeAttributes:
          fuseType: ossfs2
          # Replace with the name of your OSS bucket.
          bucket: knative-llm  
          # Subdirectory to mount. Leave empty for the root directory.
          path: /models/Qwen3.5-2B
          # Endpoint for the region where the OSS bucket is located.
          url: "http://oss-cn-hangzhou-internal.aliyuncs.com"  
          otherOpts: "-o close_to_open=false"
    ---
    kind: PersistentVolumeClaim
    apiVersion: v1
    metadata:
      # PVC name.
      name: llm-model 
      namespace: default
    spec:
      # Must be consistent with the PV.
      accessModes:
        - ReadOnlyMany
      resources:
        requests:
          storage: 30Gi
      storageClassName: ""
      # The PV to bind.
      volumeName: llm-model
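  4. Apply the manifests and confirm that the PVC is bound to the PV. The file name pv-pvc.yaml below is only an example; save the sample code above to any file name you prefer.

    # Create the PV and PVC.
    kubectl apply -f pv-pvc.yaml
    # Verify that the Secret exists and that the PVC status is Bound.
    kubectl get secret oss-secret -n default
    kubectl get pvc llm-model -n default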

Step 2: Deploy and verify the inference service

1. Create Deployment and Service

Use the vLLM framework to deploy the Qwen3.5-2B model as a Deployment and expose it as a LoadBalancer Service.

  1. On the ACK Clusters page, click the name of your cluster. In the left navigation pane, click Workloads > Deployments.

  2. Click Create from YAML and submit the following YAML content.

    After you submit the YAML, if the cluster does not have enough GPU resources, the pod enters the Pending state. ACK Auto Mode then automatically scales out GPU nodes and schedules the pod onto a new node once it is initialized; no manual intervention is required. The model service is deployed when the pod reaches the Running state. You can also track this process with the kubectl commands sketched after the sample code.

    Sample code

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: qwen-2b
      labels:
        app: qwen
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: qwen
      template:
        metadata:
          labels:
            app: qwen
        spec:
          containers:
          - command:
            - vllm
            - serve
            - /models/Qwen3.5-2B       
            - --served-model-name
            - Qwen3.5-2B
            - --port
            - "8000"                 
            - --enforce-eager
            image: ac2-mirror-registry.cn-hangzhou.cr.aliyuncs.com/evaluate/vllm-openai:nightly-d00df624f313a6a5a7a6245b71448b068b080cd7
            imagePullPolicy: IfNotPresent
            name: vllm-container
            ports:
            - containerPort: 8000
              name: http1
              protocol: TCP
            readinessProbe:
              tcpSocket:
                port: 8000
              initialDelaySeconds: 5
              periodSeconds: 5
            resources:
              limits:
                cpu: "32"
                memory: 64Gi
                # Maximum number of GPUs that the container can use.
                nvidia.com/gpu: "1"
              requests:
                cpu: "8"
                memory: 32Gi
                # Each pod requests 1 GPU, consistent with limits.
                nvidia.com/gpu: "1"
            volumeMounts:
            # Must match the model path in the command.
            - mountPath: /models/Qwen3.5-2B
              name: llm-model
          volumes:
          - name: llm-model
            persistentVolumeClaim:
              claimName: llm-model
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: qwen-2b
    spec:
      type: LoadBalancer
      ports:
        # Service port exposed by the load balancer. targetPort must match containerPort.
        - port: 8000
          protocol: TCP
          targetPort: 8000
      selector:
        app: qwen

    After the deployment is complete, you can view the application status on the Deployments page.
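
    You can also track the scale-out and deployment from the command line. These are standard kubectl commands; the label app=qwen matches the Deployment above.

    # Watch the pod go from Pending to Running as the GPU node is provisioned.
    kubectl get pods -l app=qwen -w
    # If the pod stays Pending, inspect its events for scheduling and scaling details.
    kubectl describe pod -l app=qwen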

2. Verify the inference service

  1. Get the public IP address exposed by the Service.

    export EXTERNAL_IP=$(kubectl get svc qwen-2b -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    echo ${EXTERNAL_IP}
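
    Optionally, confirm that the endpoint is reachable before sending a chat request. The /v1/models route is part of the OpenAI-compatible API that vLLM serves; the response should list Qwen3.5-2B.

    curl http://${EXTERNAL_IP}:8000/v1/models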
  2. Send an inference request to verify that the service is available.

    Replace 8.XX.XX.89 with the public IP address that you obtained in the previous step (the value of EXTERNAL_IP).

    curl http://8.XX.XX.89:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "Qwen3.5-2B",
        "messages": [
          {
            "role": "user",
            "content": [
              {
                "type": "text",
                "text": "Kubernetes"
              }
            ]
          }
        ],
        "max_tokens": 200
      }'

    Expected output:

    {"id":"chatcmpl-98f158cdbbb38087","object":"chat.completion","created":1775043962,"model":"Qwen3.5-2B","choices":[{"index":0,"message":{"role":"assistant","content":"**Kubernetes** is an open-source container orchestration platform that automates deployment, scaling, management, and repair of containerized applications..."},"finish_reason":"length"}],"usage":{"prompt_tokens":14,"total_tokens":214,"completion_tokens":200}}
