
Container Service for Kubernetes: Deploy multi-node distributed inference services

Last Updated: Mar 26, 2026

This guide walks you through deploying Qwen3-32B as a multi-node, distributed inference service on Container Service for Kubernetes (ACK) using LeaderWorkerSet (LWS). Two inference backends are covered: vLLM and SGLang. Choose the one that fits your stack.

Background

Why multi-node inference

Large language models (LLMs) often exceed the memory of a single GPU. Multi-node inference solves this by splitting the model across GPUs using one or more parallelization strategies:

| Strategy | How it works | Best for |
| --- | --- | --- |
| Data Parallelism (DP) | Each GPU holds a full copy of the model and processes a different batch. | Scaling throughput for smaller models |
| Tensor Parallelism (TP) | Weight matrices are split across GPUs; each GPU computes on its slice. | Large models that don't fit on a single GPU |
| Pipeline Parallelism (PP) | Different model layers run on different GPUs in a pipeline. | Very deep models |
| Expert Parallelism (EP) | For Mixture-of-Experts (MoE) models; expert sub-models are stored on different GPUs, and inference requests are routed to the relevant GPU. | MoE architectures like Mixtral |

This guide uses Tensor Parallelism with a TP size of 2, meaning the Qwen3-32B model weights are split across two GPUs on two nodes.

How LWS maps to the parallelism configuration

LeaderWorkerSet organizes Pods into groups. In this setup:

  • The leader Pod runs the inference server and acts as the Ray head node (for vLLM) or the primary server process (for SGLang).

  • The worker Pod holds the second GPU shard and communicates with the leader during inference.

The size: 2 field in the LWS spec corresponds directly to the TP size: --tensor-parallel-size 2 in vLLM, and --tp 2 in SGLang. Each Pod in the group runs on a separate GPU-accelerated node.
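
Once the workload from Step 2 is deployed, you can observe this mapping in the cluster. The following sketch assumes the vllm-multi-nodes LWS name used later in this guide; LWS names the leader Pod <lws-name>-0 and the worker Pod <lws-name>-0-1.

# List the Pods in the group and the node each one is scheduled on.
kubectl get pods -o wide | grep vllm-multi-nodes

# Confirm the group size, which must match the TP size (2 in this guide).
kubectl get leaderworkerset vllm-multi-nodes -o jsonpath='{.spec.leaderWorkerTemplate.size}'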


Qwen3-32B

Qwen3-32B is a 32.8B-parameter dense model optimized for reasoning and conversational tasks. Key characteristics:

  • Context window: 32,768 tokens natively, extendable to 131,072 tokens with YaRN

  • Multilingual: Understands and translates over 100 languages

  • Capabilities: Logical reasoning, math, code generation, instruction following, multi-turn dialog, and tool use for agent workflows

For more information, see the blog, GitHub, and documentation.

vLLM

vLLM is a fast, lightweight library for LLM inference and serving. It uses PagedAttention for efficient KV cache management and supports continuous batching, speculative decoding, and CUDA/HIP graph acceleration. vLLM supports TP, PP, DP, and EP parallelism, runs on NVIDIA, AMD, and Intel GPUs, and exposes an OpenAI-compatible API. For more information, see vLLM GitHub.

SGLang

SGLang is an inference engine combining a high-performance backend with a flexible programming frontend, designed for both LLM and multimodal workloads. Its backend features RadixAttention (prefix caching), PagedAttention, continuous batching, speculative decoding, prefill-decode (PD) disaggregation, and multi-LoRA batching. SGLang supports TP, PP, DP, and EP parallelism, with quantization formats including FP8, INT4, AWQ, and GPTQ. For more information, see SGLang GitHub.

Prerequisites

Before you begin, ensure that you have:

  • An ACK managed cluster running Kubernetes 1.28 or later, with two or more GPU-accelerated nodes — each with at least 32 GB of memory. For instructions, see Create an ACK managed cluster and Create an ACK cluster with GPU-accelerated nodes.

    The ecs.gn8is.4xlarge instance type is recommended. For details, see GPU-accelerated compute-optimized instance family gn8is.
  • LeaderWorkerSet (LWS) V0.6.0 or later installed in your cluster. To install it via the ACK console:

    1. Log on to the ACK console.

    2. In the left navigation pane, click Clusters, then click your cluster name.

    3. In the left navigation pane, choose Applications > Helm. On the Helm page, click Deploy.

    4. In the Basic Information step, enter the Application Name (lws) and Namespace (lws-system), find lws in the Chart section, and click Next.

    5. In the Parameters step, select the latest Chart Version and click OK.

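    After the chart is deployed, you can verify that the LWS CRD and controller are available before proceeding. This quick check uses the namespace chosen in the installation steps above.

    # The LeaderWorkerSet CRD should be registered in the cluster.
    kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io

    # The controller Pods in the lws-system namespace should be Running.
    kubectl get pods -n lws-system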

Step 1: Prepare the Qwen3-32B model files

Download the model

Download Qwen3-32B from ModelScope using Git LFS.

If git-lfs is not installed, run yum install git-lfs or apt-get install git-lfs to install it. For other installation methods, see Installing Git Large File Storage.
git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/Qwen3-32B.git
cd Qwen3-32B/
git lfs pull
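
Before uploading, it is worth confirming that Git LFS actually fetched the weight files rather than leaving pointer stubs. The exact file layout and total size depend on the published checkpoint, but BF16 weights for a 32B-parameter model are on the order of 60 GB or more.

# Total size of the local repository.
du -sh .

# Each *.safetensors shard should be several GB, not a small LFS pointer file.
ls -lh *.safetensors | head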

Upload the model to OSS

Log on to the OSS console and record your bucket name. If you don't have a bucket, see Create buckets. Then upload the model files:

For ossutil installation instructions, see Install ossutil.
ossutil mkdir oss://<your-bucket-name>/Qwen3-32B
ossutil cp -r ./Qwen3-32B oss://<your-bucket-name>/Qwen3-32B
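
You can then list the uploaded objects to confirm that the model directory in the bucket is complete:

# List the model files now stored in the bucket.
ossutil ls oss://<your-bucket-name>/Qwen3-32B/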

Create a PV and PVC for the model

Create a persistent volume (PV) named llm-model and a persistent volume claim (PVC) so the model files are accessible to your cluster Pods. For full instructions, see Create a PV and a PVC.

Option 1: ACK console

  1. Create a PV. In the ACK console, go to your cluster and choose Volumes > Persistent Volumes. Click Create and configure the following:

    | Parameter | Value |
    | --- | --- |
    | PV Type | OSS |
    | Volume Name | llm-model |
    | Access Certificate | Your AccessKey ID and AccessKey secret |
    | Bucket ID | The OSS bucket you created |
    | OSS Path | /Qwen3-32B |
  2. Create a PVC. Go to Volumes > Persistent Volume Claims and click Create. Configure the following:

    | Parameter | Value |
    | --- | --- |
    | PVC Type | OSS |
    | Name | llm-model |
    | Allocation Mode | Existing Volumes |
    | Existing Volumes | Click Select PV and select the PV you created |

Option 2: kubectl

Create a file named llm-model.yaml with the following content:

apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: <your-oss-ak>      # The AccessKey ID used to access the OSS bucket.
  akSecret: <your-oss-sk>  # The AccessKey secret used to access the OSS bucket.
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name>       # The bucket name.
      url: <your-bucket-endpoint>      # The endpoint, such as oss-cn-hangzhou-internal.aliyuncs.com.
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: <your-model-path>          # In this example, the path is /Qwen3-32B/.
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model

Apply the manifest:

kubectl create -f llm-model.yaml
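
Before moving on, confirm that the PVC is bound to the PV; the inference Pods cannot start until the model volume can be mounted.

# Both resources should report a Bound status.
kubectl get pv llm-model
kubectl get pvc llm-model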

Step 2: Deploy the distributed inference service

Both vLLM and SGLang use an LWS workload with size: 2 (one leader Pod + one worker Pod, each on a separate GPU-accelerated node) and a TP size of 2. The leader Pod runs the inference server and handles incoming requests; the worker Pod holds the second model shard and communicates with the leader during inference.

Deploy with vLLM

  1. Create a file named vllm_multi.yaml:


    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    metadata:
      name: vllm-multi-nodes
    spec:
      replicas: 1
      leaderWorkerTemplate:
        size: 2
        restartPolicy: RecreateGroupOnPodRestart
        leaderTemplate:
          metadata:
            labels:
              role: leader
              # for prometheus to scrape
              alibabacloud.com/inference-workload: vllm-multi-nodes
              alibabacloud.com/inference_backend: vllm
          spec:
            volumes:
              - name: model
                persistentVolumeClaim:
                  claimName: llm-model
              - name: dshm
                emptyDir:
                  medium: Memory
                  sizeLimit: 15Gi
            containers:
              - name: vllm-leader
                image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.10.0
                command:
                  - sh
                  - -c
                  - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE); python3 -m vllm.entrypoints.openai.api_server --port 8000 --model /models/Qwen3-32B --tensor-parallel-size 2"
                resources:
                  limits:
                    nvidia.com/gpu: "1"
                    memory: "24Gi"
                    cpu: "8"
                  requests:
                    nvidia.com/gpu: "1"
                    memory: "24Gi"
                    cpu: "8"
                ports:
                  - containerPort: 8000
                    name: http
                readinessProbe:
                  initialDelaySeconds: 30
                  periodSeconds: 10
                  tcpSocket:
                    port: 8000
                volumeMounts:
                  - mountPath: /models/Qwen3-32B
                    name: model
                  - mountPath: /dev/shm
                    name: dshm
        workerTemplate:
          spec:
            volumes:
              - name: model
                persistentVolumeClaim:
                  claimName: llm-model
              - name: dshm
                emptyDir:
                  medium: Memory
                  sizeLimit: 15Gi
            containers:
              - name: vllm-worker
                image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.10.0
                command:
                  - sh
                  - -c
                  - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
                resources:
                  limits:
                    nvidia.com/gpu: "1"
                    memory: "24Gi"
                    cpu: "8"
                  requests:
                    nvidia.com/gpu: "1"
                    memory: "24Gi"
                    cpu: "8"
                ports:
                  - containerPort: 8000
                volumeMounts:
                  - mountPath: /models/Qwen3-32B
                    name: model
                  - mountPath: /dev/shm
                    name: dshm
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: multi-nodes-service
    spec:
      type: ClusterIP
      ports:
      - port: 8000
        protocol: TCP
        targetPort: 8000
      selector:
        alibabacloud.com/inference-workload: vllm-multi-nodes
        role: leader
  2. Deploy the service:

    kubectl create -f vllm_multi.yaml
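
  3. Wait for both Pods to become Ready. Startup can take several minutes while the model shards are loaded from OSS. The following sketch assumes the LWS Pod naming convention (leader vllm-multi-nodes-0, worker vllm-multi-nodes-0-1).

    # Watch the Pods and the nodes they are scheduled on.
    kubectl get pods -o wide | grep vllm-multi-nodes

    # Follow the leader log; it should show the Ray cluster coming up and the OpenAI-compatible server starting on port 8000.
    kubectl logs -f vllm-multi-nodes-0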

Deploy with SGLang

  1. Create a file named sglang_multi.yaml:


    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    metadata:
      name: sglang-multi-nodes
    spec:
      replicas: 1
      leaderWorkerTemplate:
        size: 2
        restartPolicy: RecreateGroupOnPodRestart
        leaderTemplate:
          metadata:
            labels:
              role: leader
              # for prometheus to scrape
              alibabacloud.com/inference-workload: sglang-multi-nodes
              alibabacloud.com/inference_backend: sglang
          spec:
            containers:
              - name: sglang-leader
                image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/anolis-docker-images/docker-temp:0.3.4.post2-sglang0.4.10.post2-pytorch2.7.1.8-cuda12.8.1-py312-alinux3.2104
                command:
                  - sh
                  - -c
                  - "python3 -m sglang.launch_server --model-path /models/Qwen3-32B --tp 2 --dist-init-addr $(LWS_LEADER_ADDRESS):20000 \
                  --nnodes $(LWS_GROUP_SIZE) --node-rank $(LWS_WORKER_INDEX) --trust-remote-code --host 0.0.0.0 --port 8000 --enable-metrics"
                resources:
                  limits:
                    nvidia.com/gpu: "1"
                    memory: "24Gi"
                    cpu: "8"
                  requests:
                    nvidia.com/gpu: "1"
                    memory: "24Gi"
                    cpu: "8"
                ports:
                  - containerPort: 8000
                    name: http
                readinessProbe:
                  tcpSocket:
                    port: 8000
                  initialDelaySeconds: 30
                  periodSeconds: 10
                volumeMounts:
                  - mountPath: /dev/shm
                    name: dshm
                  - mountPath: /models/Qwen3-32B
                    name: model
            volumes:
                - name: model
                  persistentVolumeClaim:
                    claimName: llm-model
                - name: dshm
                  emptyDir:
                    medium: Memory
                    sizeLimit: 15Gi
        workerTemplate:
          spec:
            containers:
              - name: sglang-worker
                image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/anolis-docker-images/docker-temp:0.3.4.post2-sglang0.4.10.post2-pytorch2.7.1.8-cuda12.8.1-py312-alinux3.2104
                command:
                  - sh
                  - -c
                  - "python3 -m sglang.launch_server --model-path /models/Qwen3-32B --tp 2 --dist-init-addr $(LWS_LEADER_ADDRESS):20000 \
                  --nnodes $(LWS_GROUP_SIZE) --node-rank $(LWS_WORKER_INDEX) --trust-remote-code"
                resources:
                  limits:
                    nvidia.com/gpu: "1"
                    memory: "24Gi"
                    cpu: "8"
                  requests:
                    nvidia.com/gpu: "1"
                    memory: "24Gi"
                    cpu: "8"
                volumeMounts:
                  - mountPath: /dev/shm
                    name: dshm
                  - mountPath: /models/Qwen3-32B
                    name: model
            volumes:
                - name: model
                  persistentVolumeClaim:
                    claimName: llm-model
                - name: dshm
                  emptyDir:
                    medium: Memory
                    sizeLimit: 15Gi
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: multi-nodes-service
    spec:
      selector:
        alibabacloud.com/inference-workload: sglang-multi-nodes
        role: leader
      ports:
        - protocol: TCP
          port: 8000
          targetPort: 8000
  2. Deploy the service:

    kubectl create -f sglang_multi.yaml
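
  3. Confirm that both ranks have started. As with vLLM, the leader is named sglang-multi-nodes-0 and the worker sglang-multi-nodes-0-1 under the LWS naming convention.

    # Check that the leader and worker Pods are Running and Ready.
    kubectl get pods | grep sglang-multi-nodes

    # The leader log should show both nodes (--nnodes 2) joining and the server listening on port 8000.
    kubectl logs -f sglang-multi-nodes-0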

Step 3: Verify the inference service

Test with a sample request

Important

Port forwarding via kubectl port-forward is for development and debugging only. It lacks the reliability, security, and scalability required for production. For production-ready network access, see Ingress management.

  1. Forward port 8000 from the Service to your local machine:

    kubectl port-forward svc/multi-nodes-service 8000:8000

    Expected output:

    Forwarding from 127.0.0.1:8000 -> 8000
    Forwarding from [::1]:8000 -> 8000
  2. Send a test inference request:

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "/models/Qwen3-32B", "messages": [{"role": "user", "content": "Test it"}], "max_tokens": 30, "temperature": 0.7, "top_p": 0.9, "seed": 10}'

    A successful response looks like:

    {"id":"chatcmpl-ee6b347a8bd049f9a502669db0817938","object":"chat.completion","created":1753685847,"model":"/models/Qwen3-32B","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"<think>\nOkay, the user sent \"Test it\". I need to confirm their request first. They might be testing my functionality or want to see my reaction.","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":10,"total_tokens":40,"completion_tokens":30,"prompt_tokens_details":null},"prompt_logprobs":null,"kv_transfer_params":null}

    The response confirms that the distributed inference service is running correctly.

What's next

  • Configure an Ingress to expose the service for production use.

  • Enable autoscaling for your LWS workload to handle variable inference traffic.

  • Monitor GPU utilization and throughput using the Prometheus labels (alibabacloud.com/inference-workload and alibabacloud.com/inference_backend) already applied to the leader Pod.
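
For the last point, both backends expose Prometheus-format metrics on the serving port: vLLM serves them at /metrics by default, and the SGLang manifest passes --enable-metrics for the same purpose. While the port-forward from Step 3 is active, you can take a quick look before wiring up a scrape configuration (a sketch, not a production monitoring setup):

# Fetch the raw metrics exposed by the inference server.
curl -s http://localhost:8000/metrics | head -n 20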