Deploy a Qwen3.5-4B large language model inference service in an ACK Auto Mode cluster by using Knative

ACK Auto Mode clusters support Auto Mode node pools. By combining these node pools with the on-demand elasticity of Knative Serving, you can deploy the Qwen3.5-4B large language model as an on-demand Serverless inference service. After deployment, you do not need to manage GPU resources manually, which suits cost-sensitive model inference scenarios that call for low operational overhead.

The workflow combines two mechanisms:

- The Auto Mode node pool manages GPU node creation and release.
- Knative Serving scales pods based on request concurrency (`concurrency`) or requests per second (`rps`).
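
As a rough illustration of the second mechanism, the following shell sketch estimates how many pods a given load would call for. It assumes a simplified rule, the observed value divided by the per-pod target, rounded up and clamped between the minimum and maximum replica counts. It deliberately leaves out details of the real Knative autoscaler such as target utilization and panic mode, and the function name `desired_replicas` is hypothetical.

```shell
# Simplified estimate of the replica count that an autoscaling target implies.
# Assumption: desired = ceil(observed / target), clamped to [min, max].
# The real Knative autoscaler also applies a target utilization percentage
# and a panic window, which this sketch leaves out.
desired_replicas() {
  local observed=$1 target=$2 min=$3 max=$4
  # Integer ceiling division.
  local want=$(( (observed + target - 1) / target ))
  if [ "$want" -lt "$min" ]; then want=$min; fi
  if [ "$want" -gt "$max" ]; then want=$max; fi
  echo "$want"
}

# Example: 25 in-flight requests, a target of 10 per pod, 1 to 2 replicas allowed.
desired_replicas 25 10 1 2   # prints 2 (ceil(25/10) = 3, capped at the maximum)
```

With these inputs the maximum replica count, not the load, decides the result, which is how the upper bound caps GPU consumption.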

Step 1: Create an ACK Auto Mode cluster and GPU node pool

1. Create a cluster

Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click Create Kubernetes Cluster. On the ACK Managed Cluster page, enable Auto Mode.

After you enable this mode, the page displays the three core capabilities of Auto Mode: fully managed operations (fully managed control plane, automatic version upgrades, and maintenance-free nodes with auto-healing), automatic node scaling (on-demand elastic scaling, automatic instance type matching, and optimized resource costs), and highly optimized node operating system (container-optimized OS for fast startup, immutable file system, and security best practices by default).
Configure the settings and click Create Kubernetes Cluster.

See Create an ACK Auto Mode cluster.

2. Create a GPU node pool

On the ACK Clusters page, click the name of your cluster. In the left navigation pane, click Nodes > Node Pools.
On the Node Pools page, click Create Node Pool and configure the node pool in the Create Node Pool dialog box.

Key parameters (see Create a node pool for all options):
- Configure Managed Node Pool: Use intelligent management mode.
- Instance-related configurations: For Instance Configuration Mode, select Specify Instance Type. Then select a GPU-accelerated instance type, for example one equipped with V100, A10, or T4 GPUs.
- Node Labels: Add a label with the key `ack.aliyun.com/nvidia-driver-version` and the value `550.144.03` to set the NVIDIA driver version to 550.144.03.
- Container Image Acceleration: Enable this option to reduce the time needed to pull the model image.
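
Once the node pool has nodes, the driver-version label and the GPU capacity can be checked from the command line. The sketch below assumes that kubectl is already configured for the cluster. The function name `list_gpu_nodes` is hypothetical, and an Auto Mode node pool may show no nodes until a GPU workload is scheduled.

```shell
# List nodes that carry the driver-version label, with their allocatable GPUs.
# Assumes kubectl already points at the ACK cluster.
list_gpu_nodes() {
  local driver=${1:-550.144.03}
  kubectl get nodes \
    -l "ack.aliyun.com/nvidia-driver-version=${driver}" \
    -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Usage: list_gpu_nodes              (defaults to driver version 550.144.03)
```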

3. Deploy Knative components

See Deploy Knative components.

Step 2: Prepare model files and upload to OSS

Download Qwen3.5-4B from ModelScope to a temporary ECS instance, upload the files to OSS with ossutil, and mount the bucket path as a persistent volume so that pods do not download the model again each time they restart.

Before you begin:

- An OSS bucket is created.
- ossutil is installed and configured on the temporary ECS instance.

1. Download Qwen3.5-4B model files

Run the following commands on the temporary ECS instance.

Install Git.

```
# You can run 'yum install git' or 'apt install git' to install it.
sudo yum install git
```

Install Git LFS (Large File Storage).

```
# You can run 'yum install git-lfs' or 'apt install git-lfs' to install it.
sudo yum install git-lfs
```

Clone the Qwen3.5-4B repository from ModelScope, skipping LFS files.

```
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen3.5-4B.git
```

Enter the directory and pull the LFS-managed files.
```
cd Qwen3.5-4B
git lfs pull
```
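
Because the clone skips LFS content, any file that `git lfs pull` did not fetch remains a small text pointer rather than real model data. The following sketch checks for such leftovers before the upload. It relies on the fact that a Git LFS pointer file begins with the line `version https://git-lfs.github.com/spec/v1`. The function names are hypothetical.

```shell
# Report files that are still Git LFS pointer stubs instead of real content.
# A pointer stub is a small text file whose first line is the LFS spec header.
is_lfs_pointer() {
  head -c 64 "$1" | grep -aq '^version https://git-lfs\.github\.com/spec/v1'
}

find_unpulled() {
  # Skip the .git directory; print every remaining file that is a pointer stub.
  find "$1" -path "$1/.git" -prune -o -type f -print | while IFS= read -r f; do
    if is_lfs_pointer "$f"; then printf '%s\n' "$f"; fi
  done
}

# Usage (inside the cloned directory): find_unpulled .
```

An empty result suggests that every LFS-managed file was downloaded in full.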

2. Upload model files to OSS

Create a model directory in your OSS bucket.

Replace <Your-Bucket-Name> with your bucket name.
```
ossutil mkdir oss://<Your-Bucket-Name>/models/Qwen3.5-4B
```

Upload the model files to OSS.

```
ossutil cp -r ./Qwen3.5-4B oss://<Your-Bucket-Name>/models/Qwen3.5-4B
```

3. Configure an OSS storage volume

Choose an authentication method (RRSA or AccessKey) and prepare the access credentials.

This topic uses AccessKey authentication. For other methods, see Use an ossfs 2.0 static persistent volume.
Store the AccessKey as a Kubernetes secret for PV access.

Replace <yourAccessKeyID> and <yourAccessKeySecret> with your credentials. The secret namespace must match the application namespace.
```
kubectl create -n default secret generic oss-secret --from-literal='akId=<yourAccessKeyID>' --from-literal='akSecret=<yourAccessKeySecret>'
```

Create a PV and PVC to mount the OSS model directory in read-only mode. This example uses an ossfs 2.0 static persistent volume.

Sample code

```
apiVersion: v1
kind: PersistentVolume
metadata:
  # The name of the PV.
  name: llm-model
spec:
  capacity:
    # The capacity of the storage volume. This value is used only to match the PVC.
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    # Must be the same as the PV name (metadata.name).
    volumeHandle: llm-model
    nodePublishSecretRef:
      # The name of the secret that stores the AccessKey information.
      name: oss-secret
      # The namespace where the secret resides.
      namespace: default
    volumeAttributes:
      fuseType: ossfs2
      # Replace with your actual bucket name.
      bucket: <Your-Bucket-Name>
      # The subdirectory to mount. Leave it empty to mount the root directory.
      path: /models/Qwen3.5-4B
      # The endpoint of the region where the OSS bucket is located.
      url: "http://oss-cn-hangzhou-internal.aliyuncs.com"
      otherOpts: "-o close_to_open=false"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  # The name of the PVC.
  name: llm-model
  namespace: default
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  storageClassName: ""
  # The name of the PV to bind.
  volumeName: llm-model
```
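
The sample manifest still has to be applied to the cluster. A minimal sketch, assuming the manifest was saved as `llm-model-pv.yaml` (a hypothetical file name) and that kubectl 1.23 or later is available for the `--for=jsonpath` form of `kubectl wait`:

```shell
# Apply the PV/PVC manifest and wait until the claim is bound.
# llm-model-pv.yaml is a hypothetical file name for the sample manifest.
apply_model_volume() {
  local manifest=${1:-llm-model-pv.yaml}
  kubectl apply -f "$manifest" &&
    kubectl wait --for=jsonpath='{.status.phase}'=Bound \
      pvc/llm-model -n default --timeout=60s
}

# Usage: apply_model_volume llm-model-pv.yaml
```

If the wait times out, `kubectl describe pvc llm-model -n default` is a reasonable first place to look for binding errors.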

Step 3: Deploy and verify the Knative Service

1. Create a Knative Service

On the ACK Clusters page, click the name of your cluster. In the left navigation pane, click Applications > Knative.

On the Service Management tab, click Create from Template. Set Sample Template to Custom and deploy the Knative Service.

Sample code

```
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  labels:
    release: qwen
  name: qwen
  namespace: default
spec:
  template:
    metadata:
      annotations:
        # The minimum number of replicas. Keep at least one replica running to avoid cold starts.
        autoscaling.knative.dev/minScale: "1"
        # The maximum number of replicas. This limits the upper boundary of GPU resource consumption.
        autoscaling.knative.dev/maxScale: "2"
      labels:
        release: qwen
    spec:
      containers:
      - command:
        - vllm
        - serve
        - /models/Qwen3.5-4B
        - --served-model-name
        - Qwen3.5-4B
        - --port
        - "8000"
        - --enforce-eager
        image: ac2-mirror-registry.cn-hangzhou.cr.aliyuncs.com/evaluate/vllm-openai:nightly-d00df624f313a6a5a7a6245b71448b068b080cd7
        imagePullPolicy: IfNotPresent
        name: vllm-container
        ports:
        - containerPort: 8000
          name: http1
          protocol: TCP
        readinessProbe:
          tcpSocket:
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5
        resources:
          limits:
            cpu: "32"
            memory: 64Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: "16"
            memory: 32Gi
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /models/Qwen3.5-4B
          name: llm-model
      volumes:
      - name: llm-model
        persistentVolumeClaim:
          claimName: llm-model
```

The following annotations control autoscaling:

| Parameter | Description |
| --- | --- |
| `autoscaling.knative.dev/metric` | The autoscaling metric. Valid values: `concurrency` (default), which scales by request concurrency, and `rps`, which scales by requests per second. |
| `autoscaling.knative.dev/target` | The target value of the selected metric that triggers autoscaling. |
| `autoscaling.knative.dev/minScale` | The minimum number of replicas. An integer greater than or equal to 0. Set to 0 to enable scale-to-zero. |
| `autoscaling.knative.dev/maxScale` | The maximum number of replicas. Limits scale-out. |
2. Verify service deployment

On the Service Management tab, verify the service is ready. Note the default domain name and access gateway address.

Note: Send requests to the access gateway (format: alb-xxx.aliyuncsslb.com) with a Host header set to the service domain (format: qwen.default.example.com).
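
Model loading can take a while after the pod starts, so it may help to poll the service before sending real traffic. The sketch below assumes that the vLLM OpenAI-compatible server answers `GET /v1/models` with HTTP 200 once it is up. `wait_for_model` and its arguments are hypothetical names, and the 10-second interval is an arbitrary choice.

```shell
# Poll the gateway until the model endpoint answers with HTTP 200.
# $1 = access gateway address, $2 = service domain, $3 = max attempts (default 30).
wait_for_model() {
  local gateway=$1 domain=$2 tries=${3:-30} code i=0
  while [ "$i" -lt "$tries" ]; do
    code=$(curl -s -o /dev/null -w '%{http_code}' \
      -H "Host: ${domain}" "http://${gateway}/v1/models")
    if [ "$code" = "200" ]; then
      echo "ready"
      return 0
    fi
    i=$((i + 1))
    sleep 10
  done
  echo "not ready after ${tries} attempts" >&2
  return 1
}

# Usage: wait_for_model xx.40.85.xx qwen.default.example.com
```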

Send a test request to the inference service.

Replace xx.40.85.xx with your access gateway address and qwen.default.example.com with your default domain name.

```
curl http://xx.40.85.xx:80/v1/chat/completions \
  -H "Host: qwen.default.example.com" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-4B",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Tell me about Hangzhou"
          }
        ]
      }
    ],
    "max_tokens": 200
  }'
```

Expected output (generation stops at the `max_tokens` limit of 200, so `finish_reason` is `length`):

```
{
  "id": "chatcmpl-20dfb4c8-d1ab-48bc-9f1a-78b84c6c8adf",
  "object": "chat.completion",
  "created": 1772602897,
  "model": "Qwen3.5-4B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hangzhou, abbreviated as 'Hang', is a sub-provincial city located in Zhejiang Province, China..."
      },
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 200,
    "total_tokens": 214
  }
}
```