This guide walks you through deploying Qwen3-32B as a multi-node, distributed inference service on Container Service for Kubernetes (ACK) using LeaderWorkerSet (LWS). Two inference backends are covered: vLLM and SGLang. Choose the one that fits your stack.
Background
Why multi-node inference
Large language models (LLMs) often exceed the memory of a single GPU. Multi-node inference solves this by splitting the model across GPUs using one or more parallelization strategies:
| Strategy | How it works | Best for |
|---|---|---|
| Data Parallelism (DP) | Each GPU holds a full copy of the model and processes a different batch. | Scaling throughput for smaller models |
| Tensor Parallelism (TP) | Weight matrices are split across GPUs; each GPU computes on its slice. | Large models that don't fit into the memory of a single GPU |
| Pipeline Parallelism (PP) | Different model layers run on different GPUs in a pipeline. | Very deep models |
| Expert Parallelism (EP) | For Mixture-of-Experts (MoE) models; expert sub-models are stored on different GPUs, and inference requests are routed to the relevant GPU. | MoE architectures like Mixtral |
This guide uses Tensor Parallelism with a TP size of 2, meaning the Qwen3-32B model weights are split across two GPUs on two nodes.
How LWS maps to the parallelism configuration
LeaderWorkerSet organizes Pods into groups. In this setup:
The leader Pod runs the inference server and acts as the Ray head node (for vLLM) or the primary server process (for SGLang).
The worker Pod holds the second GPU shard and communicates with the leader during inference.
The size: 2 field in the LWS spec corresponds directly to the TP size: --tensor-parallel-size 2 in vLLM, and --tp 2 in SGLang. Each Pod in the group runs on a separate GPU-accelerated node.
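Concretely, the fields line up as follows (an illustrative fragment only; the full manifests are sketched in Step 2):

```yaml
# Illustrative fragment: the LWS group size equals the tensor-parallel degree.
spec:
  leaderWorkerTemplate:
    size: 2            # one leader Pod + one worker Pod, one GPU node each
# The inference server inside the Pods is started with the matching TP degree:
#   vLLM:   --tensor-parallel-size 2
#   SGLang: --tp 2
```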

Qwen3-32B
Qwen3-32B is a 32.8B-parameter dense model optimized for reasoning and conversational tasks. Key characteristics:
Context window: 32,768 tokens natively, extendable to 131,072 tokens with YaRN
Multilingual: Understands and translates over 100 languages
Capabilities: Logical reasoning, math, code generation, instruction following, multi-turn dialog, and tool use for agent workflows
For more information, see the blog, GitHub, and documentation.
vLLM
vLLM is a fast, lightweight library for LLM inference and serving. It uses PagedAttention for efficient KV cache management and supports continuous batching, speculative decoding, and CUDA/HIP graph acceleration. vLLM supports TP, PP, DP, and EP parallelism, runs on NVIDIA, AMD, and Intel GPUs, and exposes an OpenAI-compatible API. For more information, see vLLM GitHub.
Prerequisites
Before you begin, ensure that you have:
An ACK managed cluster running Kubernetes 1.28 or later, with two or more GPU-accelerated nodes — each with at least 32 GB of memory. For instructions, see Create an ACK managed cluster and Create an ACK cluster with GPU-accelerated nodes.
The ecs.gn8is.4xlarge instance type is recommended. For details, see GPU-accelerated compute-optimized instance family gn8is.
LeaderWorkerSet (LWS) V0.6.0 or later installed in your cluster. To install it via the ACK console:
Log on to the ACK console.
In the left navigation pane, click Clusters, then click your cluster name.
In the left navigation pane, choose Applications > Helm. On the Helm page, click Deploy.
In the Basic Information step, enter the Application Name (lws) and Namespace (lws-system), find lws in the Chart section, and click Next.
In the Parameters step, select the latest Chart Version and click OK.
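After the chart is deployed, you can optionally confirm that the LWS controller is running and its CRD is registered (assuming the lws-system namespace chosen above):

```bash
# The controller Pod should be Running, and the LeaderWorkerSet CRD present.
kubectl get pods -n lws-system
kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io
```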

Step 1: Prepare the Qwen3-32B model files
Download the model
Download Qwen3-32B from ModelScope using Git LFS.
If git-lfs is not installed, run yum install git-lfs or apt-get install git-lfs to install it. For other installation methods, see Installing Git Large File Storage.
```bash
git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/Qwen3-32B.git
cd Qwen3-32B/
git lfs pull
```
Upload the model to OSS
Log on to the OSS console and record your bucket name. If you don't have a bucket, see Create buckets. Then upload the model files:
For ossutil installation instructions, see Install ossutil.
```bash
ossutil mkdir oss://<your-bucket-name>/Qwen3-32B
ossutil cp -r ./Qwen3-32B oss://<your-bucket-name>/Qwen3-32B
```
Create a PV and PVC for the model
Create a persistent volume (PV) named llm-model and a persistent volume claim (PVC) so the model files are accessible to your cluster Pods. For full instructions, see Create a PV and a PVC.
Option 1: ACK console
Create a PV. In the ACK console, go to your cluster and choose Volumes > Persistent Volumes. Click Create and configure the following:

| Parameter | Value |
|---|---|
| PV Type | OSS |
| Volume Name | llm-model |
| Access Certificate | Your AccessKey ID and AccessKey secret |
| Bucket ID | The OSS bucket you created |
| OSS Path | /Qwen3-32B |

Create a PVC. Go to Volumes > Persistent Volume Claims and click Create. Configure the following:

| Parameter | Value |
|---|---|
| PVC Type | OSS |
| Name | llm-model |
| Allocation Mode | Existing Volumes |
| Existing Volumes | Click Select PV and select the PV you created |
Option 2: kubectl
Create a file named llm-model.yaml with the following content:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: <your-oss-ak> # The AccessKey ID used to access the OSS bucket.
  akSecret: <your-oss-sk> # The AccessKey secret used to access the OSS bucket.
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name> # The bucket name.
      url: <your-bucket-endpoint> # The endpoint, such as oss-cn-hangzhou-internal.aliyuncs.com.
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: <your-model-path> # In this example, the path is /Qwen3-32B/.
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
```
Apply the manifest:
```bash
kubectl create -f llm-model.yaml
```
Step 2: Deploy the distributed inference service
Both vLLM and SGLang use an LWS workload with size: 2 (one leader Pod + one worker Pod, each on a separate GPU-accelerated node) and a TP size of 2. The leader Pod runs the inference server and handles incoming requests; the worker Pod holds the second model shard and communicates with the leader during inference.
Deploy with vLLM
Create a file named vllm_multi.yaml containing the LeaderWorkerSet and Service definition for the vLLM backend. A minimal sketch of the manifest is shown below.
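The sketch below is illustrative, not a drop-in configuration: the workload name multi-nodes, the <vllm-image> placeholder, the shared-memory size, and the Ray startup commands are assumptions to adapt to your environment. The leader Pod starts a Ray head node and the OpenAI-compatible vLLM server with --tensor-parallel-size 2; the worker Pod joins the Ray cluster through the LWS_LEADER_ADDRESS variable that LWS injects. Both Pods mount the llm-model PVC at /models/Qwen3-32B, and the Service exposes the leader on port 8000 for Step 3.

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: multi-nodes                # assumed name; the Service below selects its leader Pod
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2                        # one leader + one worker, matching --tensor-parallel-size 2
    restartPolicy: RecreateGroupOnRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
          alibabacloud.com/inference_backend: vllm
      spec:
        containers:
        - name: vllm-leader
          image: <vllm-image>      # placeholder; use a vLLM image that matches your CUDA stack
          command:
          - sh
          - -c
          # Start a Ray head node, then launch the OpenAI-compatible vLLM server
          # with the model weights split across both Pods. Depending on the image,
          # you may need to wait for the worker to join the Ray cluster first.
          - |
            ray start --head --port=6379 &&
            vllm serve /models/Qwen3-32B \
              --tensor-parallel-size 2 \
              --host 0.0.0.0 --port 8000
          ports:
          - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "1"
          volumeMounts:
          - name: model
            mountPath: /models/Qwen3-32B
          - name: dshm
            mountPath: /dev/shm
        volumes:
        - name: model
          persistentVolumeClaim:
            claimName: llm-model
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
    workerTemplate:
      spec:
        containers:
        - name: vllm-worker
          image: <vllm-image>
          command:
          - sh
          - -c
          # Join the Ray cluster on the leader; LWS injects LWS_LEADER_ADDRESS.
          - ray start --address=$(LWS_LEADER_ADDRESS):6379 --block
          resources:
            limits:
              nvidia.com/gpu: "1"
          volumeMounts:
          - name: model
            mountPath: /models/Qwen3-32B
          - name: dshm
            mountPath: /dev/shm
        volumes:
        - name: model
          persistentVolumeClaim:
            claimName: llm-model
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
---
apiVersion: v1
kind: Service
metadata:
  name: multi-nodes-service        # the name used by kubectl port-forward in Step 3
spec:
  selector:
    leaderworkerset.sigs.k8s.io/name: multi-nodes
    role: leader
  ports:
  - name: http
    port: 8000
    targetPort: 8000
```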
Deploy the service:
```bash
kubectl create -f vllm_multi.yaml
```
Deploy with SGLang
Create a file named sglang_multi.yaml containing the LeaderWorkerSet and Service definition for the SGLang backend. A minimal sketch of the manifest is shown below.
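As with vLLM, the following is a minimal sketch under stated assumptions (workload name multi-nodes, the <sglang-image> placeholder, and the rendezvous port 20000). Both Pods launch sglang.launch_server with --tp 2: the leader runs as node rank 0 and serves the HTTP API, while the worker joins as node rank 1 through the LWS_LEADER_ADDRESS variable injected by LWS.

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: multi-nodes                # assumed name; the Service below selects its leader Pod
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2                        # one leader + one worker, matching --tp 2
    restartPolicy: RecreateGroupOnRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
          alibabacloud.com/inference_backend: sglang
      spec:
        containers:
        - name: sglang-leader
          image: <sglang-image>    # placeholder; use an SGLang image that matches your CUDA stack
          command:
          - sh
          - -c
          # Node rank 0 serves the OpenAI-compatible API; both ranks rendezvous at
          # the leader address injected by LWS. Port 20000 is an arbitrary choice.
          - |
            python3 -m sglang.launch_server \
              --model-path /models/Qwen3-32B \
              --tp 2 --nnodes 2 --node-rank 0 \
              --dist-init-addr $(LWS_LEADER_ADDRESS):20000 \
              --host 0.0.0.0 --port 8000
          ports:
          - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "1"
          volumeMounts:
          - name: model
            mountPath: /models/Qwen3-32B
          - name: dshm
            mountPath: /dev/shm
        volumes:
        - name: model
          persistentVolumeClaim:
            claimName: llm-model
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
    workerTemplate:
      spec:
        containers:
        - name: sglang-worker
          image: <sglang-image>
          command:
          - sh
          - -c
          # Node rank 1 holds the second model shard and joins the same rendezvous.
          - |
            python3 -m sglang.launch_server \
              --model-path /models/Qwen3-32B \
              --tp 2 --nnodes 2 --node-rank 1 \
              --dist-init-addr $(LWS_LEADER_ADDRESS):20000
          resources:
            limits:
              nvidia.com/gpu: "1"
          volumeMounts:
          - name: model
            mountPath: /models/Qwen3-32B
          - name: dshm
            mountPath: /dev/shm
        volumes:
        - name: model
          persistentVolumeClaim:
            claimName: llm-model
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
---
apiVersion: v1
kind: Service
metadata:
  name: multi-nodes-service        # the name used by kubectl port-forward in Step 3
spec:
  selector:
    leaderworkerset.sigs.k8s.io/name: multi-nodes
    role: leader
  ports:
  - name: http
    port: 8000
    targetPort: 8000
```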
Deploy the service:
```bash
kubectl create -f sglang_multi.yaml
```
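Whichever backend you deployed, wait until both Pods in the group are Running and Ready before testing. The commands below assume the workload name multi-nodes used in the sketches above:

```bash
# The leader Pod (multi-nodes-0) and its worker must both be Running.
kubectl get pods -l leaderworkerset.sigs.k8s.io/name=multi-nodes -o wide
# Follow the leader logs until the server reports it is listening on port 8000.
kubectl logs -f multi-nodes-0
```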
Step 3: Verify the inference service
Test with a sample request
Port forwarding via kubectl port-forward is for development and debugging only. It lacks the reliability, security, and scalability required for production. For production-ready network access, see Ingress management.
Forward port 8000 from the Service to your local machine:
```bash
kubectl port-forward svc/multi-nodes-service 8000:8000
```
Expected output:
```
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
```
Send a test inference request:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/models/Qwen3-32B", "messages": [{"role": "user", "content": "Test it"}], "max_tokens": 30, "temperature": 0.7, "top_p": 0.9, "seed": 10}'
```
A successful response looks like:
```json
{"id":"chatcmpl-ee6b347a8bd049f9a502669db0817938","object":"chat.completion","created":1753685847,"model":"/models/Qwen3-32B","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"<think>\nOkay, the user sent \"Test it\". I need to confirm their request first. They might be testing my functionality or want to see my reaction.","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":10,"total_tokens":40,"completion_tokens":30,"prompt_tokens_details":null},"prompt_logprobs":null,"kv_transfer_params":null}
```
The response confirms that the distributed inference service is running correctly.
What's next
Configure an Ingress to expose the service for production use.
Enable autoscaling for your LWS workload to handle variable inference traffic.
Monitor GPU utilization and throughput using the Prometheus labels (alibabacloud.com/inference_backend) already applied to the leader Pod.