Container Service for Kubernetes: Deploy Qwen3.5-4B inference with Knative in ACK Auto Mode

Last Updated: Jun 25, 2026

ACK Auto Mode clusters support Auto Mode node pools. Combined with the on-demand elasticity of Knative Serving, they let you deploy the Qwen3.5-4B large language model as a serverless inference service that scales with demand. After deployment, you do not need to manage GPU resources manually, which makes this approach suitable for cost-sensitive inference workloads that need to keep operational complexity low.

The workflow combines two mechanisms:

  • The Auto Mode node pool manages GPU node creation and release.

  • Knative Serving scales pods based on request concurrency (concurrency) or requests per second (rps).
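
To make the second mechanism concrete, the following is a minimal sketch of how a concurrency-based autoscaler arrives at a pod count: it divides the observed in-flight requests by a per-pod target and clamps the result between the minimum and maximum scale. The function name desired_replicas and the sample numbers are illustrative assumptions. The real Knative autoscaler also applies averaging windows, a target utilization percentage, and panic-mode logic, all of which this sketch omits.

```shell
# Simplified sketch of concurrency-based scaling (not the actual Knative implementation).
# Usage: desired_replicas <observed_concurrency> <target_per_pod> <min_scale> <max_scale>
desired_replicas() {
  local observed="$1" target="$2" min="$3" max="$4" want
  # Ceiling division: pods needed so that each pod handles at most <target> requests.
  want=$(( ($observed + $target - 1) / $target ))
  # Clamp the result to the configured scale bounds.
  if [ "$want" -lt "$min" ]; then want="$min"; fi
  if [ "$want" -gt "$max" ]; then want="$max"; fi
  echo "$want"
}
```

For example, with a hypothetical target of 10, a minimum of 1, and a maximum of 2, `desired_replicas 25 10 1 2` prints 2, because the unclamped result of 3 exceeds the maximum. When the pod count grows beyond what existing GPU nodes can hold, the Auto Mode node pool is the part that adds nodes.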

Step 1: Create an ACK Auto Mode cluster and GPU node pool

1. Create a cluster

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click Create Kubernetes Cluster. On the ACK Managed Cluster page, enable Auto Mode.

    After you enable this mode, the page displays the three core capabilities of Auto Mode:

    • Fully managed operations: fully managed control plane, automatic version upgrades, and maintenance-free nodes with auto-healing.

    • Automatic node scaling: on-demand elastic scaling, automatic instance type matching, and optimized resource costs.

    • Highly optimized node operating system: container-optimized OS for fast startup, immutable file system, and security best practices by default.

  3. Configure the settings and click Create Kubernetes Cluster.

    See Create an ACK Auto Mode cluster.

2. Create a GPU node pool

  1. On the ACK Clusters page, click the name of your cluster. In the left navigation pane, click Nodes > Node Pools.

  2. On the Node Pools page, click Create Node Pool and configure the node pool in the Create Node Pool dialog box.

    Key parameters (see Create a node pool for all options):

    • Configure Managed Node Pool: Use intelligent management mode.

    • Instance-related configurations: For Instance Configuration Mode, select Specify Instance Type. Then select an instance type equipped with a GPU such as V100, A10, or T4.

    • Node Labels: Add a label with the key ack.aliyun.com/nvidia-driver-version and the value 550.144.03 to set the NVIDIA driver version to 550.144.03.

    • Container Image Acceleration: Enable this option to reduce image pull time.

3. Deploy Knative components

See Deploy Knative components.

Step 2: Prepare model files and upload to OSS

Download Qwen3.5-4B from ModelScope to a temporary ECS instance, upload the files to OSS with ossutil, and then mount the bucket path as a persistent volume so that pods do not download the model again each time they restart.

Before you begin:

  • An OSS bucket is created.

  • ossutil is installed and configured on the temporary ECS instance.

1. Download Qwen3.5-4B model files

Run the following commands on the temporary ECS instance.

  1. Install Git.

    # For RPM-based distributions. On Debian or Ubuntu, run 'sudo apt install git' instead.
    sudo yum install git
  2. Install Git LFS (Large File Storage).

    # For RPM-based distributions. On Debian or Ubuntu, run 'sudo apt install git-lfs' instead.
    sudo yum install git-lfs
  3. Clone the Qwen3.5-4B repository from ModelScope, skipping LFS files.

    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen3.5-4B.git
  4. Enter the directory and pull the LFS-managed files.

    cd Qwen3.5-4B
    git lfs pull
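
Because the clone skips LFS content, a failed or partial git lfs pull leaves small text pointer files in place of the model weights. The following is a minimal sketch of a check, assuming the weights are stored as *.safetensors files (an assumption about the repository layout that is not stated in this topic). Git LFS pointer files begin with the line "version https://git-lfs.github.com/spec/v1", so any weight file that starts with that text has not been downloaded yet. The helper name find_lfs_pointers is illustrative.

```shell
# Usage: find_lfs_pointers <dir>
# Prints weight files that are still Git LFS pointers.
# Returns 0 when no pointer files remain, 1 when at least one is found.
find_lfs_pointers() {
  local dir="$1" found=0 f
  for f in "$dir"/*.safetensors; do
    [ -e "$f" ] || continue   # The glob matched nothing.
    # Only inspect the first bytes so that multi-gigabyte files are not scanned.
    if head -c 200 "$f" | grep -q '^version https://git-lfs.github.com/spec/v1'; then
      echo "$f"
      found=1
    fi
  done
  [ "$found" -eq 0 ]
}
```

From inside the Qwen3.5-4B directory, `find_lfs_pointers . || echo "Some files are still pointers; run git lfs pull again"` gives a quick signal before you start the upload.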

2. Upload model files to OSS

  1. Create a model directory in your OSS bucket.

    Replace <Your-Bucket-Name> with your bucket name.

    ossutil mkdir oss://<Your-Bucket-Name>/models/Qwen3.5-4B
  2. Upload the model files to OSS.

    ossutil cp -r ./Qwen3.5-4B oss://<Your-Bucket-Name>/models/Qwen3.5-4B

3. Configure an OSS storage volume

  1. Choose an authentication method (RRSA or AccessKey) and prepare the access credentials.

    This topic uses AccessKey authentication. For other methods, see Use an ossfs 2.0 static persistent volume.
  2. Store the AccessKey as a Kubernetes secret for PV access.

    Replace <yourAccessKeyID> and <yourAccessKeySecret> with your credentials. The secret namespace must match the application namespace.

    kubectl create -n default secret generic oss-secret --from-literal='akId=<yourAccessKeyID>' --from-literal='akSecret=<yourAccessKeySecret>'
  3. Create a PV and PVC to mount the OSS model directory in read-only mode. This example uses an ossfs 2.0 static persistent volume.

    Sample code

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      # The name of the PV.
      name: llm-model
    spec:
      capacity:
        # The capacity of the storage volume. This value is used only to match the PVC.
        storage: 30Gi
      accessModes:
        - ReadOnlyMany
      persistentVolumeReclaimPolicy: Retain
      csi:
        driver: ossplugin.csi.alibabacloud.com
        # Must be the same as the PV name (metadata.name).
        volumeHandle: llm-model
        nodePublishSecretRef:
          # The name of the secret that stores the AccessKey information.
          name: oss-secret
          # The namespace where the secret resides.
          namespace: default
        volumeAttributes:
          fuseType: ossfs2
          # Replace with your actual bucket name.
          bucket: <Your-Bucket-Name>
          # The subdirectory to mount. Leave it empty to mount the root directory.
          path: /models/Qwen3.5-4B
          # The endpoint of the region where the OSS bucket is located.
          url: "http://oss-cn-hangzhou-internal.aliyuncs.com"
          otherOpts: "-o close_to_open=false"
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      # The name of the PVC.
      name: llm-model
      namespace: default
    spec:
      accessModes:
        - ReadOnlyMany
      resources:
        requests:
          storage: 30Gi
      storageClassName: ""
      # The name of the PV to bind.
      volumeName: llm-model
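
After you apply the manifest, it can help to confirm that the PVC is bound to the PV before you deploy the workload. The following is a minimal sketch, assuming kubectl is configured for the cluster. The helper name wait_for_pvc_bound and the retry defaults are illustrative, not part of any ACK tooling.

```shell
# Usage: wait_for_pvc_bound <pvc> <namespace> [attempts] [delay_seconds]
# Polls the PVC phase until it is "Bound". Returns 1 if it never binds.
wait_for_pvc_bound() {
  local pvc="$1" ns="$2" attempts="${3:-12}" delay="${4:-5}" phase="" i=0
  while [ "$i" -lt "$attempts" ]; do
    # An empty phase means the PVC does not exist yet or kubectl failed.
    phase=$(kubectl get pvc "$pvc" -n "$ns" -o jsonpath='{.status.phase}' 2>/dev/null) || phase=""
    if [ "$phase" = "Bound" ]; then
      echo "PVC $pvc is Bound"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "PVC $pvc is not Bound (last phase: ${phase:-unknown})" >&2
  return 1
}
```

A typical sequence would be to save the sample code to a file, run kubectl apply -f on it, and then call `wait_for_pvc_bound llm-model default`.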

Step 3: Deploy and verify Knative service

1. Create a Knative Service

  1. On the ACK Clusters page, click the name of your cluster. In the left navigation pane, click Applications > Knative.

  2. On the Service Management tab, click Create from Template. Set Sample Template to Custom and deploy the Knative Service.

    Sample code

    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      labels:
        release: qwen
      name: qwen
      namespace: default
    spec:
      template:
        metadata:
          annotations:
            # The minimum number of replicas. Keep at least one replica running to avoid cold starts.
            autoscaling.knative.dev/minScale: "1"
            # The maximum number of replicas. This limits the upper boundary of GPU resource consumption.
            autoscaling.knative.dev/maxScale: "2"
          labels:
            release: qwen
        spec:
          containers:
          - command:
            - vllm
            - serve
            - /models/Qwen3.5-4B
            - --served-model-name
            - Qwen3.5-4B
            - --port
            - "8000"
            - --enforce-eager
            image: ac2-mirror-registry.cn-hangzhou.cr.aliyuncs.com/evaluate/vllm-openai:nightly-d00df624f313a6a5a7a6245b71448b068b080cd7
            imagePullPolicy: IfNotPresent
            name: vllm-container
            ports:
            - containerPort: 8000
              name: http1
              protocol: TCP
            readinessProbe:
              tcpSocket:
                port: 8000
              initialDelaySeconds: 5
              periodSeconds: 5
            resources:
              limits:
                cpu: "32"
                memory: 64Gi
                nvidia.com/gpu: "1"
              requests:
                cpu: "16"
                memory: 32Gi
                nvidia.com/gpu: "1"
            volumeMounts:
            - mountPath: /models/Qwen3.5-4B
              name: llm-model
          volumes:
          - name: llm-model
            persistentVolumeClaim:
              claimName: llm-model

    | Parameter | Description |
    | --- | --- |
    | autoscaling.knative.dev/metric | The autoscaling metric. Valid values: concurrency (default), which scales by request concurrency, and rps, which scales by requests per second. |
    | autoscaling.knative.dev/target | The target value of the metric for each replica. Knative scales the service to keep the observed value near this target. |
    | autoscaling.knative.dev/minScale | The minimum number of replicas. Must be an integer ≥ 0. Set to 0 to enable scale-to-zero. |
    | autoscaling.knative.dev/maxScale | The maximum number of replicas. Limits scale-out. |
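
The sample Service sets only minScale and maxScale. As a hedged sketch of how the four annotations fit together, the helper below prints a JSON merge patch that sets all of them on the revision template. The function name autoscaling_patch and the example values (a concurrency target of 10 with scale-to-zero enabled) are illustrative assumptions, not recommended settings.

```shell
# Usage: autoscaling_patch <metric> <target> <min_scale> <max_scale>
# Prints a JSON merge patch for spec.template.metadata.annotations.
# Annotation values must be strings, so every value is quoted.
autoscaling_patch() {
  printf '{"spec":{"template":{"metadata":{"annotations":{'
  printf '"autoscaling.knative.dev/metric":"%s",' "$1"
  printf '"autoscaling.knative.dev/target":"%s",' "$2"
  printf '"autoscaling.knative.dev/minScale":"%s",' "$3"
  printf '"autoscaling.knative.dev/maxScale":"%s"' "$4"
  printf '}}}}}\n'
}

# Example (not executed here): scale on concurrency with a target of 10 and allow scale-to-zero.
# kubectl patch ksvc qwen -n default --type merge -p "$(autoscaling_patch concurrency 10 0 2)"
```

Changing the template annotations creates a new revision of the Service. With minScale set to 0, the first request after an idle period waits for a pod, and possibly a GPU node, to start, so the trade-off is lower idle cost against cold-start latency.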

2. Verify service deployment

  1. On the Service Management tab, verify the service is ready. Note the default domain name and access gateway address.

    Note: Send requests to the access gateway (format: alb-xxx.aliyuncsslb.com) with a Host header set to the service domain (format: qwen.default.example.com).

  2. Send a test request to the inference service.

    Replace xx.40.85.xx with your access gateway address and qwen.default.example.com with your default domain name.

    curl http://xx.40.85.xx:80/v1/chat/completions \
      -H "Host: qwen.default.example.com" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "Qwen3.5-4B",
        "messages": [
          {
            "role": "user",
            "content": [
              {
                "type": "text",
                "text": "Tell me about Hangzhou"
              }
            ]
          }
        ],
        "max_tokens": 200
      }'

    Expected output:

    {
      "id": "chatcmpl-20dfb4c8-d1ab-48bc-9f1a-78b84c6c8adf",
      "object": "chat.completion",
      "created": 1772602897,
      "model": "Qwen3.5-4B",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "Hangzhou, abbreviated as 'Hang', is a sub-provincial city located in Zhejiang Province, China..."
          },
          "finish_reason": "length"
        }
      ],
      "usage": {
        "prompt_tokens": 14,
        "completion_tokens": 200,
        "total_tokens": 214
      }
    }
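
If the service has just been deployed or has scaled in, a request can arrive before a replica is ready. The following is a minimal sketch of a readiness poll. It assumes that the vLLM OpenAI-compatible server in your image exposes GET /v1/models, and it reuses the gateway address and Host header from the request above. The helper name wait_for_model and the retry defaults are illustrative.

```shell
# Usage: wait_for_model <gateway_address> <host_header> [attempts] [delay_seconds]
# Polls /v1/models through the gateway until it answers with HTTP 200.
wait_for_model() {
  local gateway="$1" host="$2" attempts="${3:-30}" delay="${4:-10}" code="" i=0
  while [ "$i" -lt "$attempts" ]; do
    # -w prints only the HTTP status code; the response body is discarded.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
      -H "Host: $host" "http://$gateway/v1/models") || code="000"
    if [ "$code" = "200" ]; then
      echo "ready"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "not ready (last HTTP status: $code)" >&2
  return 1
}
```

For example, `wait_for_model xx.40.85.xx qwen.default.example.com && echo "Send the chat request"` blocks until the model server responds, or gives up after the configured number of attempts.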