
Simplified Deployment Tutorial of the Qwen3 LLM on Alibaba Cloud Container Service for Kubernetes

The article explains how to deploy the Qwen3 large language model on Alibaba Cloud ACK and ACS serverless GPU resources.

By Zibai

1. Background

Qwen3

Qwen3 is the latest generation of the Qwen series and the first hybrid reasoning model in the family. The flagship model, Qwen3-235B-A22B, demonstrates competitive performance in benchmarks covering code, mathematics, and general capabilities, rivaling top models such as DeepSeek-R1, o1, o3-mini, Grok-3, and Gemini-2.5-Pro. In addition, the small MoE model, Qwen3-30B-A3B, outperforms QwQ-32B while activating only about 10% as many parameters, and even a small model like Qwen3-4B can match the performance of Qwen2.5-72B-Instruct. Qwen3 supports multiple thinking modes, allowing users to control the model's depth of reasoning based on the task at hand. It also supports 119 languages and dialects, with enhanced support for MCP.
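The thinking modes can be controlled per request. As a minimal sketch (assuming an OpenAI-compatible endpoint such as the vLLM server deployed later in this tutorial, which passes `chat_template_kwargs` through to the Qwen3 chat template as described in the Qwen3 documentation), the `enable_thinking` flag toggles whether the model emits an internal reasoning trace:

```python
import json

# Sketch: building OpenAI-compatible chat payloads that toggle Qwen3's
# thinking mode. The model path matches the deployment below; the
# `chat_template_kwargs` passthrough is an assumption based on the
# Qwen3 and vLLM documentation.
def build_payload(prompt: str, thinking: bool) -> dict:
    return {
        "model": "/models/Qwen3-8B/",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
        # Enable or disable the model's internal reasoning trace.
        "chat_template_kwargs": {"enable_thinking": thinking},
    }

# A quick factual query rarely needs deep reasoning; a proof does.
fast = build_payload("What is the capital of France?", thinking=False)
deep = build_payload("Prove that sqrt(2) is irrational.", thinking=True)
print(json.dumps(fast, indent=2))
```

Sending these payloads to the service's `/v1/chat/completions` endpoint lets you trade latency for reasoning depth on a per-request basis.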

ACK

Container Service for Kubernetes (ACK) was among the first services worldwide to pass the Certified Kubernetes Conformance Program. ACK provides high-performance containerized application management services. It is integrated with the virtualization, storage, networking, and security capabilities of Alibaba Cloud, simplifies the creation and scaling of clusters, and allows you to focus on developing and managing containerized applications.

ACS

Container Compute Service (ACS) is a cloud computing service that provides serverless container compute resources compatible with Kubernetes container specifications.

ACS compute capacity is provisioned in ACK clusters through virtual nodes, which gives Kubernetes clusters high elasticity that is no longer limited by the computing capacity of cluster nodes. After you connect ACS to Kubernetes, ACS takes over the management of pods, including the underlying infrastructure and resource availability, so Kubernetes no longer needs to manage the lifecycle and resources of the underlying VMs.

2. Prerequisites

A Container Service for Kubernetes (ACK) cluster that contains GPU-accelerated nodes is created. For more information, see Create an ACK cluster with GPU-accelerated nodes.

The kubectl client is connected to the ACK cluster. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.

3. Model Deployment

Step 1: Prepare the Qwen3-8B model files

1.  Run the following commands to download the Qwen3-8B model from Hugging Face.

Make sure that the git-lfs plug-in is installed. If it is not, run yum install git-lfs or apt-get install git-lfs to install it. For more information, see Install git-lfs.

git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Qwen/Qwen3-8B
cd Qwen3-8B/
git lfs pull

2.  Create an OSS directory and upload the model files to the directory.

To install and use ossutil, see Install ossutil.

ossutil mkdir oss://<your-bucket-name>/models/Qwen3-8B
ossutil cp -r ./Qwen3-8B oss://<your-bucket-name>/models/Qwen3-8B

3.  Create a PV named llm-model and a matching PVC for the cluster by applying the following manifest:

apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: <your-oss-ak> # The AccessKey ID used to access the OSS bucket.
  akSecret: <your-oss-sk> # The AccessKey secret used to access the OSS bucket.
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi 
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name> # The name of the OSS bucket.
      url: <your-bucket-endpoint> # The endpoint, such as oss-cn-hangzhou-internal.aliyuncs.com.
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: <your-model-path> # The model path, such as /models/Qwen3-8B/ in this example.
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model

Step 2: Deploy an inference service

Deploy the following manifest to start the inference service named qwen3. For example, save it as qwen3.yaml and run kubectl apply -f qwen3.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: qwen3
  name: qwen3
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen3
  template:
    metadata:
      labels:
        app: qwen3
        # for ACS Cluster
        # alibabacloud.com/compute-class: gpu
        # example-model indicates the GPU model. Replace it with the actual GPU model, such as T4.
        # alibabacloud.com/gpu-model-series: "example-model"
    spec:
      volumes:
        - name: model
          persistentVolumeClaim:
            claimName: llm-model
      containers:
      - command:
        - sh
        - -c
        - vllm serve /models/Qwen3-8B/ --port 8000 --trust-remote-code --max-model-len 2048 --gpu-memory-utilization 0.98 
        image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.8.4
        imagePullPolicy: IfNotPresent
        name: vllm
        ports:
        - containerPort: 8000
          name: restful
          protocol: TCP
        readinessProbe:
          tcpSocket:
            port: 8000
          initialDelaySeconds: 30
        resources:
          limits:
            nvidia.com/gpu: "1"
            cpu: 8
            memory: 16Gi
          requests:
            nvidia.com/gpu: "1"
            cpu: 8
            memory: 16Gi
        volumeMounts:
          - mountPath: /models/Qwen3-8B/
            name: model
---
apiVersion: v1
kind: Service
metadata:
  name: qwen3
spec:
  ports:
    - name: http
      port: 8000
      protocol: TCP
      targetPort: 8000
  selector:
    app: qwen3
  type: ClusterIP

Step 3: Verify the inference service

1.  Run the following command to set up port forwarding between the inference service and the local environment.

kubectl port-forward svc/qwen3 8000:8000

Expected output:

Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000

2.  Run the following command to send a request to the inference service.

curl -H "Content-Type: application/json" http://localhost:8000/v1/chat/completions -d '{"model": "/models/Qwen3-8B/", "messages": [{"role": "user", "content": "Say this is a test!"}], "max_tokens": 512, "temperature": 0.7, "top_p": 0.9, "seed": 10}'

Expected output:

{"id":"chatcmpl-3e472d9f449648718a483279062f4987","object":"chat.completion","created":1745980464,"model":"/models/Qwen3-8B/","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"<think>\nOkay, the user said \"Say this is a test!\" and I need to respond. Let me think about how to approach this. First, I should acknowledge their message. Maybe start with a friendly greeting. Then, since they mentioned a test, perhaps they're testing my response capabilities. I should confirm that I'm here to help and offer assistance with anything they need. Keep it open-ended so they feel comfortable asking more. Also, make sure the tone is positive and encouraging. Let me put that together in a natural way.\n</think>\n\nHello! It's great to meet you. If you have any questions or need help with something, feel free to let me know. I'm here to assist! 😊","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":14,"total_tokens":161,"completion_tokens":147,"prompt_tokens_details":null},"prompt_logprobs":null}
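As the sample response shows, Qwen3 returns its reasoning inside a <think>...</think> block before the final answer. If you only want the user-facing answer, you can strip that block client-side. A minimal sketch, using a response stub shaped like the output above (the helper name is illustrative):

```python
import json
import re

# Stub mimicking the shape of the chat-completion response shown above.
raw = json.dumps({
    "choices": [{
        "message": {
            "role": "assistant",
            "content": "<think>\nThe user is testing me.\n</think>\n\nHello! This is a test.",
        },
        "finish_reason": "stop",
    }]
})

def final_answer(response_json: str) -> str:
    # Navigate the OpenAI-compatible response structure to the message text.
    content = json.loads(response_json)["choices"][0]["message"]["content"]
    # Remove the reasoning trace, keeping only the user-facing answer.
    return re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL).strip()

print(final_answer(raw))  # Hello! This is a test.
```

The same helper works on the body returned by the curl command above, since vLLM's /v1/chat/completions endpoint follows the OpenAI response schema.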

4. Use ACS Compute Power in ACK Pro Clusters

ACK Pro clusters can also run serverless pods on ACS GPU compute power. ACS compute capacity is provisioned in Kubernetes clusters through virtual nodes, which gives clusters high elasticity that is no longer limited by the computing capacity of existing cluster nodes.

Prerequisites

Activate the ACK service, assign the default roles to ACK, and activate related cloud services. For more information, see Quickly create an ACK managed cluster.

Log on to the ACS console. Follow the on-screen instructions to activate ACS.

Install ACK virtual nodes in the component center.

Model deployment

Deploying on ACS is almost identical to deploying on ACK Pro. You only need to add the ACS compute power labels, such as alibabacloud.com/compute-class: gpu, to the pod template, as follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen3
spec:
  template:
    metadata:
      labels:
        app: qwen3
        # for ACS compute power
        alibabacloud.com/compute-class: gpu
        # example-model indicates the GPU model. Replace it with the actual GPU model, such as T4.
        alibabacloud.com/gpu-model-series: "example-model"
    spec:
      containers:
      ...