
Container Service for Kubernetes: Deploying a Qwen3.5-2B large model inference service in ACK Auto Mode

Last Updated: Apr 14, 2026

Container Service for Kubernetes (ACK) Auto Mode clusters are optimized for GPU elasticity and automatically handle the scaling and basic operations of GPU nodes. This topic uses the Qwen3.5-2B model as an example to show how to quickly deploy a large model inference service with GPU compute on an ACK Auto Mode cluster.

Prerequisites

  • You have created an ACK Auto Mode cluster.

  • You have created a GPU node pool that runs in intelligent managed mode and meets the requirements below.

    Details

    1. On the ACK Clusters page, click the name of your cluster. In the left navigation pane, click Nodes > Node Pools.

    2. On the Node Pools page, click Create Node Pool, and then in the Create Node Pool dialog box, configure the node pool.

      Configure the following key settings. For more information, see Create a node pool.

      • Configure Managed Node Pool: Select intelligent managed mode.

      • Instance-related settings: Set Instance Configuration Mode to Specify Instance Type, and then select a GPU-accelerated ECS instance type, for example one equipped with a V100, A10, or T4 GPU.

      • Node Labels: Add the label ack.aliyun.com/nvidia-driver-version:550.144.03 to pin the NVIDIA driver version to 550.144.03. After nodes are provisioned, you can verify the label with the commands sketched after this list.

      • Container Image Acceleration: Enable this feature to pull model images faster.
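
      After the node pool has provisioned nodes, you can confirm that the GPU nodes are ready and that the driver label is applied. This is a minimal check using standard kubectl commands; replace <your-gpu-node-name> with the name of one of your GPU nodes.

      # Show the NVIDIA driver version label on each node.
      kubectl get nodes -L ack.aliyun.com/nvidia-driver-version
      # Confirm that a GPU node reports allocatable nvidia.com/gpu resources.
      kubectl describe node <your-gpu-node-name> | grep -i "nvidia.com/gpu"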

Step 1: Prepare model files and mount OSS

In this step, you use a temporary ECS instance to download the Qwen3.5-2B model files from ModelScope, upload them to an OSS bucket, and then configure a PersistentVolume (PV) and a PersistentVolumeClaim (PVC) for the cluster. Mounting the model to the inference container as a volume avoids repeated downloads when the container starts.

Before you begin, prepare a temporary ECS instance that can access the internet and OSS, and install and configure ossutil on it.

1. Download the Qwen3.5-2B model

Perform the following steps on the temporary ECS instance to download the model files from ModelScope.

  1. Install Git.

    # You can run yum install git or apt install git to install it.
    sudo yum install git
  2. Install the Git Large File Storage (LFS) extension.

    # You can run yum install git-lfs or apt install git-lfs to install it.
    sudo yum install git-lfs
  3. Initialize Git LFS and clone the Qwen3.5-2B repository from ModelScope. Setting GIT_LFS_SKIP_SMUDGE=1 skips the LFS-managed large files during the clone so that they can be pulled separately in the next step.

    git lfs install
    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/Qwen3.5-2B.git
  4. Change to the repository directory and pull the LFS-managed large model files.

    cd Qwen3.5-2B/
    git lfs pull
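  5. Optional: Verify that the download is complete. These are standard shell commands; the safetensors weight files account for most of the directory size.

    # List the downloaded files and their sizes.
    ls -lh
    # Show the total size of the model directory.
    du -sh .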

2. Upload the model files to OSS

  1. Create a directory in the OSS bucket to store the model.

    Replace <Your-Bucket-Name> with your actual bucket name.

    ossutil mkdir oss://<Your-Bucket-Name>/models/Qwen3.5-2B
  2. Upload the local model files to OSS.

    ossutil cp -r ./Qwen3.5-2B oss://<Your-Bucket-Name>/models/Qwen3.5-2B
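  3. Optional: Confirm that the upload succeeded by listing the objects in the target directory. Replace <Your-Bucket-Name> with your actual bucket name.

    ossutil ls oss://<Your-Bucket-Name>/models/Qwen3.5-2B/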

3. Configure an OSS volume

Create a PV and a PVC to allow pods to mount the model directory in OSS as a read-only volume. For more information, see Use a static volume with ossfs 2.0.

  1. Select an authentication method (RRSA or AccessKey) and prepare access credentials to ensure that the cluster can securely access OSS bucket resources.

    This example uses AccessKey authentication. The two methods differ slightly. For more information, see Use a static volume with ossfs 2.0.
  2. Store your AccessKey as a Secret for the PV.

    Replace <yourAccessKeyID> and <yourAccessKeySecret> with your actual credentials. The namespace of the Secret must match the namespace of the application.

    kubectl create -n default secret generic oss-secret --from-literal='akId=<yourAccessKeyID>' --from-literal='akSecret=<yourAccessKeySecret>'
  3. Create a PV and a PVC to mount the model directory in OSS as a read-only volume. The following example uses a static volume with ossfs 2.0.

    Sample code

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      # PV name.
      name: llm-model  
    spec:
      capacity:
        # Volume capacity. This value is used only to match the PVC.
        storage: 30Gi  
      # Access mode.
      accessModes:  
        - ReadOnlyMany
      persistentVolumeReclaimPolicy: Retain
      csi:
        driver: ossplugin.csi.alibabacloud.com
        # Must match the PV name (metadata.name).
        volumeHandle: llm-model   
        # Use the Secret created earlier.
        nodePublishSecretRef:
          # Name of the Secret storing the AccessKey.
          name: oss-secret  
          # Namespace of the Secret.
          namespace: default  
        volumeAttributes:
          fuseType: ossfs2
          # Replace with the name of your OSS bucket.
          bucket: knative-llm  
          # Subdirectory to mount. Leave empty for the root directory.
          path: /models/Qwen3.5-2B
          # Endpoint for the region where the OSS bucket is located.
          url: "http://oss-cn-hangzhou-internal.aliyuncs.com"  
          otherOpts: "-o close_to_open=false"
    ---
    kind: PersistentVolumeClaim
    apiVersion: v1
    metadata:
      # PVC name.
      name: llm-model 
      namespace: default
    spec:
      # Must be consistent with the PV.
      accessModes:
        - ReadOnlyMany
      resources:
        requests:
          storage: 30Gi
      storageClassName: ""
      # The PV to bind.
      volumeName: llm-model
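  4. Apply the manifests and confirm that the PVC is bound to the PV. The file name pv-pvc.yaml below is only an example; save the sample code above to any file name you prefer.

    # Create the PV and PVC.
    kubectl apply -f pv-pvc.yaml
    # Verify that the Secret exists and that the PVC status is Bound.
    kubectl get secret oss-secret -n default
    kubectl get pvc llm-model -n default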

Step 2: Deploy and verify the inference service

1. Create Deployment and Service

Use the vLLM framework to deploy the Qwen3.5-2B model as a Deployment and expose it as a LoadBalancer Service.

  1. On the ACK Clusters page, click the name of your cluster. In the left navigation pane, click Workloads > Deployments.

  2. Click Create from YAML and submit the following YAML content.

    After you submit the YAML, if the cluster does not have enough GPU resources, the pod enters the Pending state. ACK Auto Mode then automatically scales out GPU nodes and schedules the pod onto a new node once it is initialized; no manual intervention is required. The model service is deployed when the pod reaches the Running state. You can also track this process with the kubectl commands sketched after the sample code.

    Sample code

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: qwen-2b
      labels:
        app: qwen
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: qwen
      template:
        metadata:
          labels:
            app: qwen
        spec:
          containers:
          - command:
            - vllm
            - serve
            - /models/Qwen3.5-2B       
            - --served-model-name
            - Qwen3.5-2B
            - --port
            - "8000"                 
            - --enforce-eager
            image: ac2-mirror-registry.cn-hangzhou.cr.aliyuncs.com/evaluate/vllm-openai:nightly-d00df624f313a6a5a7a6245b71448b068b080cd7
            imagePullPolicy: IfNotPresent
            name: vllm-container
            ports:
            - containerPort: 8000
              name: http1
              protocol: TCP
            readinessProbe:
              tcpSocket:
                port: 8000
              initialDelaySeconds: 5
              periodSeconds: 5
            resources:
              limits:
                cpu: "32"
                memory: 64Gi
                # Maximum number of GPUs that the container can use.
                nvidia.com/gpu: "1"
              requests:
                cpu: "8"
                memory: 32Gi
                # Each pod requests 1 GPU, consistent with limits.
                nvidia.com/gpu: "1"
            volumeMounts:
            # Must match the model path in the command.
            - mountPath: /models/Qwen3.5-2B
              name: llm-model
          volumes:
          - name: llm-model
            persistentVolumeClaim:
              claimName: llm-model
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: qwen-2b
    spec:
      type: LoadBalancer
      ports:
        # Service port exposed by the load balancer. targetPort must match containerPort.
        - port: 8000
          protocol: TCP
          targetPort: 8000
      selector:
        app: qwen

    After the deployment is complete, you can view the application status on the Deployments page.
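
    You can also track the scale-out and deployment from the command line. These are standard kubectl commands; the label app=qwen matches the Deployment above.

    # Watch the pod go from Pending to Running as the GPU node is provisioned.
    kubectl get pods -l app=qwen -w
    # If the pod stays Pending, inspect its events for scheduling and scaling details.
    kubectl describe pod -l app=qwen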

2. Verify the inference service

  1. Get the public IP address exposed by the Service.

    export EXTERNAL_IP=$(kubectl get svc qwen-2b -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    echo ${EXTERNAL_IP}
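
    Optionally, confirm that the endpoint is reachable before sending a chat request. The /v1/models route is part of the OpenAI-compatible API that vLLM serves; the response should list Qwen3.5-2B.

    curl http://${EXTERNAL_IP}:8000/v1/models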
  2. Send an inference request to verify that the service is available.

    Replace 8.XX.XX.89 with the public IP address that you obtained in the previous step (the value of EXTERNAL_IP).

    curl http://8.XX.XX.89:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "Qwen3.5-2B",
        "messages": [
          {
            "role": "user",
            "content": [
              {
                "type": "text",
                "text": "Kubernetes"
              }
            ]
          }
        ],
        "max_tokens": 200
      }'

    Expected output:

    {"id":"chatcmpl-98f158cdbbb38087","object":"chat.completion","created":1775043962,"model":"Qwen3.5-2B","choices":[{"index":0,"message":{"role":"assistant","content":"**Kubernetes** is an open-source container orchestration platform that automates deployment, scaling, management, and repair of containerized applications..."},"finish_reason":"length"}],"usage":{"prompt_tokens":14,"total_tokens":214,"completion_tokens":200}}
