By Zibai
Qwen3 is the newest generation of the Qwen series and the first hybrid reasoning model in the family. The flagship model, Qwen3-235B-A22B, delivers competitive results on code, mathematics, and general-capability benchmarks, rivaling top models such as DeepSeek-R1, o1, o3-mini, Grok-3, and Gemini-2.5-Pro. The small MoE model, Qwen3-30B-A3B, outperforms QwQ-32B while activating only 10% as many parameters, and even the compact Qwen3-4B matches the performance of Qwen2.5-72B-Instruct. Qwen3 supports multiple thinking modes, allowing users to control how deeply the model reasons for a given task. It also supports 119 languages and dialects and provides enhanced support for MCP.
Container Service for Kubernetes (ACK) is one of the first services in the world to pass the Certified Kubernetes Conformance Program. ACK provides high-performance management services for containerized applications. It integrates the virtualization, storage, networking, and security capabilities of Alibaba Cloud, simplifies cluster creation and scaling, and lets you focus on developing and managing containerized applications.
Container Compute Service (ACS) is a cloud computing service that provides container compute resources compliant with Kubernetes container specifications.
ACS compute power can be used in ACK clusters through virtual nodes. This gives Kubernetes clusters high elasticity that is no longer limited by the computing capacity of cluster nodes. After you connect ACS to Kubernetes, ACS takes over pod management, including the infrastructure and resource availability, so Kubernetes no longer needs to manage the lifecycle and resources of the underlying VMs.
A Container Service for Kubernetes (ACK) cluster that contains GPU-accelerated nodes is created. For more information, see Create an ACK cluster with GPU-accelerated nodes.
The kubectl client is connected to the ACK cluster. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.
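Before you continue, you can optionally verify the connection and confirm that GPU resources on the nodes are schedulable. The nvidia.com/gpu resource name below assumes the default NVIDIA device plugin configuration.
kubectl get nodes                                                # confirm that kubectl can reach the cluster
kubectl describe node <gpu-node-name> | grep "nvidia.com/gpu"    # replace <gpu-node-name> with an actual GPU-accelerated node name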
1. Run the following commands to download the Qwen3-8B model from Hugging Face.
Check whether the git-lfs plug-in is installed. If not, run yum install git-lfs or apt-get install git-lfs to install it. For more information, see Install git-lfs.
git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Qwen/Qwen3-8B
cd Qwen3-8B/
git lfs pull
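Optionally, confirm that the large weight files were fully downloaded instead of being left as Git LFS pointer files. The size mentioned below is an approximation for the BF16 weights of Qwen3-8B.
du -sh .              # the full Qwen3-8B repository is roughly 16 GB; a size of only a few MB indicates that git lfs pull did not complete
ls -lh *.safetensors  # the .safetensors shards should show their actual sizes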
2. Create an OSS directory and upload the model files to the directory.
To install and use ossutil, see Install ossutil.
ossutil mkdir oss://<your-bucket-name>/models/Qwen3-8B
ossutil cp -r ./Qwen3-8B oss://<your-bucket-name>/models/Qwen3-8B
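Optionally, list the objects in the destination directory to confirm that the upload succeeded:
ossutil ls oss://<your-bucket-name>/models/Qwen3-8B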
3. Create a PV and a PVC. Create a PV named llm-model and a corresponding PVC for the cluster based on the following example:
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: <your-oss-ak> # The AccessKey ID used to access the OSS bucket.
  akSecret: <your-oss-sk> # The AccessKey secret used to access the OSS bucket.
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name> # The name of the OSS bucket.
      url: <your-bucket-endpoint> # The endpoint, such as oss-cn-hangzhou-internal.aliyuncs.com.
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: <your-model-path> # The model path, such as /models/Qwen3-8B/ in this example.
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
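Save the preceding manifests to a file and apply them to the cluster. The file name llm-model.yaml is only an example. The PVC is expected to reach the Bound state once the OSS volume is available.
kubectl apply -f llm-model.yaml
kubectl get pvc llm-model   # the STATUS column should show Bound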
Deploy an inference service named qwen3 by creating the following Deployment and Service:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: qwen3
  name: qwen3
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen3
  template:
    metadata:
      labels:
        app: qwen3
        # for ACS Cluster
        # alibabacloud.com/compute-class: gpu
        # example-model indicates the GPU model. Replace it with the actual GPU model, such as T4.
        # alibabacloud.com/gpu-model-series: "example-model"
    spec:
      volumes:
        - name: model
          persistentVolumeClaim:
            claimName: llm-model
      containers:
        - command:
            - sh
            - -c
            - vllm serve /models/Qwen3-8B/ --port 8000 --trust-remote-code --max-model-len 2048 --gpu-memory-utilization 0.98
          image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.8.4
          imagePullPolicy: IfNotPresent
          name: vllm
          ports:
            - containerPort: 8000
              name: restful
              protocol: TCP
          readinessProbe:
            tcpSocket:
              port: 8000
            initialDelaySeconds: 30
          resources:
            limits:
              nvidia.com/gpu: "1"
              cpu: 8
              memory: 16Gi
            requests:
              nvidia.com/gpu: "1"
              cpu: 8
              memory: 16Gi
          volumeMounts:
            - mountPath: /models/Qwen3-8B/
              name: model
---
apiVersion: v1
kind: Service
metadata:
  name: qwen3
spec:
  ports:
    - name: http
      port: 8000
      protocol: TCP
      targetPort: 8000
  selector:
    app: qwen3
  type: ClusterIP
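Save the preceding manifests to a file and apply them, then wait for the pod to become ready. The file name qwen3.yaml is only an example; model loading can take several minutes.
kubectl apply -f qwen3.yaml
kubectl get pods -l app=qwen3       # wait until the pod is Running and READY shows 1/1
kubectl logs -l app=qwen3 -f        # follow the vLLM logs until model loading finishes and the API server starts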
1. Run the following command to set up port forwarding between the inference service and the local environment.
kubectl port-forward svc/qwen3 8000:8000
Expected output:
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
2. Run the following command to send a request to the inference service.
curl -H "Content-Type: application/json" http://localhost:8000/v1/chat/completions -d '{"model": "/models/Qwen3-8B/", "messages": [{"role": "user", "content": "Say this is a test!"}], "max_tokens": 512, "temperature": 0.7, "top_p": 0.9, "seed": 10}'
Expected output:
{"id":"chatcmpl-3e472d9f449648718a483279062f4987","object":"chat.completion","created":1745980464,"model":"/models/Qwen3-8B/","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"<think>\nOkay, the user said \"Say this is a test!\" and I need to respond. Let me think about how to approach this. First, I should acknowledge their message. Maybe start with a friendly greeting. Then, since they mentioned a test, perhaps they're testing my response capabilities. I should confirm that I'm here to help and offer assistance with anything they need. Keep it open-ended so they feel comfortable asking more. Also, make sure the tone is positive and encouraging. Let me put that together in a natural way.\n</think>\n\nHello! It's great to meet you. If you have any questions or need help with something, feel free to let me know. I'm here to assist! 😊","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":14,"total_tokens":161,"completion_tokens":147,"prompt_tokens_details":null},"prompt_logprobs":null}
ACK also supports running this workload on serverless ACS GPU compute power. ACS compute power is provided to Kubernetes clusters through virtual nodes, which gives the cluster high elasticity that is not limited by the computing capacity of its nodes.
Activate the ACK service, assign the default roles to ACK, and activate related cloud services. For more information, see Quickly create an ACK managed cluster.
Log on to the ACS console. Follow the on-screen instructions to activate ACS.
Install ACK virtual nodes in the component center.
The deployment procedure for ACS is almost the same as that for an ACK Pro cluster. You only need to add the ACS compute class labels, such as alibabacloud.com/compute-class: gpu, to the pod template, as shown in the following example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen3
spec:
  template:
    metadata:
      labels:
        app: qwen3
        # for ACS compute power
        alibabacloud.com/compute-class: gpu
        # example-model indicates the GPU model. Replace it with the actual GPU model, such as T4.
        alibabacloud.com/gpu-model-series: "example-model"
    spec:
      containers:
      ...
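After you apply the updated Deployment, you can confirm that the pod is scheduled onto an ACS virtual node. The node name format mentioned below is only illustrative.
kubectl get pods -l app=qwen3 -o wide   # the NODE column should show a virtual node, for example a node whose name starts with virtual-kubelet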