
Container Service for Kubernetes: Deploy multi-node distributed inference services

Last Updated: Sep 17, 2025

This topic uses the Qwen3-32B model as an example to demonstrate how to deploy multi-node, distributed model inference services in a Container Service for Kubernetes (ACK) cluster using the vLLM and SGLang frameworks.

Background

  • Qwen3-32B

    Qwen3-32B represents the latest evolution in the Qwen series, featuring a 32.8B-parameter dense architecture optimized for both reasoning efficiency and conversational fluency.

    Key features:

    • Dual-mode performance: Excels at complex tasks like logical reasoning, math, and code generation, while remaining highly efficient for general text generation.

    • Advanced capabilities: Demonstrates excellent performance in instruction following, multi-turn dialog, creative writing, and best-in-class tool use for AI agent tasks.

    • Large context window: Natively supports a context length of 32,768 tokens, which can be extended to 131,072 tokens using YaRN technology.

    • Multilingual support: Understands and translates over 100 languages, making it ideal for global applications.

    For more information, see the blog, GitHub, and documentation.

  • vLLM

    vLLM is a fast and lightweight library designed to optimize LLM inference and serving, significantly increasing throughput and reducing latency.

    Core optimizations:

    • PagedAttention: An innovative attention algorithm that efficiently manages the Key-Value (KV) cache to minimize memory waste and increase throughput.

    • Advanced inference: Improves speed and utilization with continuous batching, speculative decoding, and CUDA/HIP graph acceleration.

    • Wide range of parallelism: Supports Tensor Parallelism (TP), Pipeline Parallelism (PP), Data Parallelism (DP), and Expert Parallelism (EP) to scale across multiple GPUs.

    • Quantization support: Compatible with popular quantization formats like GPTQ, AWQ, INT4/8, and FP8 to reduce the model's memory footprint.

    Broad compatibility:

    • Hardware and models: Runs on NVIDIA, AMD, and Intel GPUs and supports mainstream models from Hugging Face and ModelScope (such as Qwen, Llama, DeepSeek, and E5-Mistral).

    • Standard API: Provides an OpenAI-compatible API, making it easy to integrate into existing applications.

    For more information, see vLLM GitHub.

  • SGLang

    SGLang is an inference engine that combines a high-performance backend with a flexible frontend, designed for both LLM and multimodal workloads.

    High-performance backend:

    • Advanced caching: Features RadixAttention (an efficient prefix cache) and PagedAttention to maximize throughput during complex inference tasks.

    • Efficient execution: Uses continuous batching, speculative decoding, prefill-decode (PD) disaggregation, and multi-LoRA batching to efficiently serve multiple users and fine-tuned models.

    • Full parallelism and quantization: Supports TP, PP, DP, and EP parallelism, along with various quantization methods (FP8, INT4, AWQ, GPTQ).

    Flexible frontend:

    • Powerful programming interface: Enables developers to easily build complex applications with features such as chained generation, control flow, and parallel processing.

    • Multimodal and external interaction: Natively supports multimodal inputs (such as text and images) and allows for interaction with external tools, making it ideal for advanced agent workflows.

    • Broad model support: Supports generative models (Qwen, DeepSeek, Llama), embedding models (E5-Mistral), and reward models (Skywork).

    For more information, see SGLang GitHub.

  • Distributed deployment

    As LLMs grow in size, their parameters often exceed the memory of a single GPU. To run these models, parallelization strategies split the inference task into subtasks, assign the subtasks across multiple GPUs, and aggregate the results to complete LLM inference efficiently. Common parallelization strategies are described below; a sketch of how they map to launcher flags follows this list.

    Data Parallelism (DP)

    Each GPU holds a complete copy of the model but processes a different batch of data. This is the simplest and most common strategy.


    Tensor Parallelism (TP)

    Splits the model's weight matrices (tensors) across multiple GPUs. Each GPU holds only a slice of the model's weights and computes on that portion.


    Pipeline Parallelism (PP)

    Assigns different layers of the model to different GPUs, creating a pipeline. The output of one layer on a GPU is passed as input to the next layer on another GPU.


    Expert Parallelism (EP)

    Models with a Mixture-of-Experts (MoE) architecture contain many "expert" sub-models. Only a subset of these experts is activated to process each request. Therefore, these expert sub-models can be stored on different GPUs. When an inference workload requires a specific expert, the data is routed to the relevant GPU.

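    In practice, these strategies are selected through the inference framework's launch flags. The following is a minimal sketch of how the tensor parallelism used later in this topic is expressed in vLLM and SGLang. The commands mirror those in the deployment YAML later in this topic; the pipeline-parallel flag and the <leader-address> placeholder are added here only for illustration.

      # vLLM: shard every weight matrix across 2 GPUs (TP=2); a single pipeline stage.
      # GPUs used by one replica = tensor-parallel size x pipeline-parallel size.
      python3 -m vllm.entrypoints.openai.api_server \
        --model /models/Qwen3-32B \
        --tensor-parallel-size 2 \
        --pipeline-parallel-size 1

      # SGLang: the same TP=2 layout; --nnodes and --node-rank spread the
      # tensor-parallel group across two nodes (see the LeaderWorkerSet YAML below).
      python3 -m sglang.launch_server \
        --model-path /models/Qwen3-32B \
        --tp 2 --nnodes 2 --node-rank 0 \
        --dist-init-addr <leader-address>:20000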

Prerequisites

  • You have an ACK managed cluster running Kubernetes 1.28 or later with two or more GPU-accelerated nodes. Each GPU-accelerated node must have at least 32 GB of memory. For instructions, see Create an ACK managed cluster and Create an ACK cluster with GPU-accelerated nodes.

    The ecs.gn8is.4xlarge instance type is recommended. For details, see GPU-accelerated compute-optimized instance family gn8is.
  • The LeaderWorkerSet component V0.6.0 or later is installed. You can install it via the ACK console:

    1. Log on to the ACK console.

    2. In the navigation pane on the left, click Clusters, then click the name of the cluster you created.

    3. In the navigation pane on the left, choose Applications > Helm. On the Helm page, click Deploy.

    4. In the Basic Information step, enter the Application Name and Namespace, find lws in the Chart section, and click Next. This example uses lws as the application name and lws-system as the namespace.

    5. In the Parameters step, select the latest Chart Version and click OK to install lws.
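
    After the chart is installed, you can optionally verify that the LWS controller is running and that the LeaderWorkerSet CRD is registered. The commands below assume the lws-system namespace used in this example.

      # The controller pod should be in the Running state.
      kubectl get pods -n lws-system
      # The LeaderWorkerSet CRD should appear in the output.
      kubectl get crd | grep leaderworkerset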

Model deployment

Step 1: Prepare the Qwen3-32B model files

  1. Run the following command to download the Qwen3-32B model from ModelScope.

    If the git-lfs plugin is not installed, run yum install git-lfs or apt-get install git-lfs to install it. For more installation methods, see Installing Git Large File Storage.
    git lfs install
    # Skip downloading large files during clone, then fetch them explicitly with git lfs pull.
    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/Qwen3-32B.git
    cd Qwen3-32B/
    git lfs pull
  2. Upload the model to Object Storage Service (OSS). Log on to the OSS console and record the name of your bucket. If you have not created one, see Create buckets. Then create a directory in the bucket and upload the model files to it.

    For more information about how to install and use ossutil, see Install ossutil.
    ossutil mkdir oss://<your-bucket-name>/Qwen3-32B
    ossutil cp -r ./Qwen3-32B oss://<your-bucket-name>/Qwen3-32B
  3. Create a persistent volume (PV) named llm-model and a persistent volume claim (PVC) for your cluster. For detailed instructions, see Create a PV and a PVC.

    Example using console

    1. Create a PV

      • Log on to the ACK console. In the navigation pane on the left, click Clusters.

      • On the Clusters page, find the cluster you want and click its name. In the left navigation pane, choose Volumes > Persistent Volumes.

      • In the upper-right corner of the Persistent Volumes page, click Create.

      • In the Create PV dialog box, configure the following parameters for the sample PV:

        • PV Type: In this example, select OSS.

        • Volume Name: In this example, enter llm-model.

        • Access Certificate: Configure the AccessKey ID and AccessKey secret used to access the OSS bucket.

        • Bucket ID: Select the OSS bucket that you created in the preceding step.

        • OSS Path: Enter the path where the model is located, such as /Qwen3-32B.

    2. Create a PVC

      • On the Clusters page, find the cluster you want and click its name. In the left navigation pane, choose Volumes > Persistent Volume Claims.

      • In the upper-right corner of the Persistent Volume Claims page, click Create.

      • In the Create PVC dialog box, configure the following parameters for the sample PVC:

        • PVC Type: In this example, select OSS.

        • Name: In this example, enter llm-model.

        • Allocation Mode: In this example, select Existing Volumes.

        • Existing Volumes: Click the Select PV hyperlink and select the PV that you created.

    Example using kubectl

    1. Use the following YAML template to create a file named llm-model.yaml, containing configurations for a Secret, a static PV, and a static PVC.

      apiVersion: v1
      kind: Secret
      metadata:
        name: oss-secret
      stringData:
        akId: <your-oss-ak> # The AccessKey ID used to access the OSS bucket.
        akSecret: <your-oss-sk> # The AccessKey secret used to access the OSS bucket.
      ---
      apiVersion: v1
      kind: PersistentVolume
      metadata:
        name: llm-model
        labels:
          alicloud-pvname: llm-model
      spec:
        capacity:
          storage: 30Gi 
        accessModes:
          - ReadOnlyMany
        persistentVolumeReclaimPolicy: Retain
        csi:
          driver: ossplugin.csi.alibabacloud.com
          volumeHandle: llm-model
          nodePublishSecretRef:
            name: oss-secret
            namespace: default
          volumeAttributes:
            bucket: <your-bucket-name> # The bucket name.
            url: <your-bucket-endpoint> # The endpoint, such as oss-cn-hangzhou-internal.aliyuncs.com.
            otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
            path: <your-model-path> # In this example, the path is /Qwen3-32B/.
      ---
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: llm-model
      spec:
        accessModes:
          - ReadOnlyMany
        resources:
          requests:
            storage: 30Gi
        selector:
          matchLabels:
            alicloud-pvname: llm-model
    2. Create the Secret, static PV, and static PVC.

      kubectl create -f llm-model.yaml
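
      Optionally confirm that the static PV and PVC were created and bound before you deploy the inference workloads. Both resources should report a Bound status.

      kubectl get pv llm-model
      kubectl get pvc llm-model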

Step 2: Deploy the distributed inference service

This topic uses a LeaderWorkerSet workload to deploy an inference service across two GPU-accelerated nodes with a tensor parallelism (TP) size of 2.

Deploy with vLLM

  1. Create a file named vllm_multi.yaml.


    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    metadata:
      name: vllm-multi-nodes
    spec:
      replicas: 1
      leaderWorkerTemplate:
        size: 2
        restartPolicy: RecreateGroupOnPodRestart
        leaderTemplate:
          metadata:
            labels:
              role: leader
              # for prometheus to scrape
              alibabacloud.com/inference-workload: vllm-multi-nodes
              alibabacloud.com/inference_backend: vllm
          spec:
            volumes:
              - name: model
                persistentVolumeClaim:
                  claimName: llm-model
              - name: dshm
                emptyDir:
                  medium: Memory
                  sizeLimit: 15Gi
            containers:
              - name: vllm-leader
                image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.10.0
                command:
                  - sh
                  - -c
                  - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE); python3 -m vllm.entrypoints.openai.api_server --port 8000 --model /models/Qwen3-32B --tensor-parallel-size 2"
                resources:
                  limits:
                    nvidia.com/gpu: "1"
                    memory: "24Gi"
                    cpu: "8"
                  requests:
                    nvidia.com/gpu: "1"
                    memory: "24Gi"
                    cpu: "8"
                ports:
                  - containerPort: 8000
                    name: http
                readinessProbe:
                  initialDelaySeconds: 30
                  periodSeconds: 10
                  tcpSocket:
                    port: 8000
                volumeMounts:
                  - mountPath: /models/Qwen3-32B
                    name: model
                  - mountPath: /dev/shm
                    name: dshm
        workerTemplate:
          spec:
            volumes:
              - name: model
                persistentVolumeClaim:
                  claimName: llm-model
              - name: dshm
                emptyDir:
                  medium: Memory
                  sizeLimit: 15Gi
            containers:
              - name: vllm-worker
                image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.10.0
                command:
                  - sh
                  - -c
                  - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
                resources:
                  limits:
                    nvidia.com/gpu: "1"
                    memory: "24Gi"
                    cpu: "8"
                  requests:
                    nvidia.com/gpu: "1"
                    memory: "24Gi"
                    cpu: "8"
                ports:
                  - containerPort: 8000
                volumeMounts:
                  - mountPath: /models/Qwen3-32B
                    name: model
                  - mountPath: /dev/shm
                    name: dshm
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: multi-nodes-service
    spec:
      type: ClusterIP
      ports:
      - port: 8000
        protocol: TCP
        targetPort: 8000
      selector:
        alibabacloud.com/inference-workload: vllm-multi-nodes
        role: leader
  2. Run the following command to deploy the multi-node LLM inference service using the vLLM framework:

    kubectl create -f vllm_multi.yaml
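
After the workload is created, you can optionally check its status before validation. Loading Qwen3-32B across two nodes can take several minutes, so wait until the leader pod passes its readiness probe. The leader pod name used below (vllm-multi-nodes-0) follows the LeaderWorkerSet naming convention and may differ in your cluster.

    # Check the LeaderWorkerSet and its pods.
    kubectl get leaderworkerset vllm-multi-nodes
    kubectl get pods -o wide | grep vllm-multi-nodes
    # Follow the leader log until the OpenAI-compatible server reports that it is listening on port 8000.
    kubectl logs -f vllm-multi-nodes-0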

Deploy with SGLang

  1. Create a file named sglang_multi.yaml.


    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    metadata:
      name: sglang-multi-nodes
    spec:
      replicas: 1
      leaderWorkerTemplate:
        size: 2
        restartPolicy: RecreateGroupOnPodRestart
        leaderTemplate:
          metadata:
            labels:
              role: leader
              # for prometheus to scrape
              alibabacloud.com/inference-workload: sglang-multi-nodes
              alibabacloud.com/inference_backend: sglang
          spec:
            containers:
              - name: sglang-leader
                image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/anolis-docker-images/docker-temp:0.3.4.post2-sglang0.4.10.post2-pytorch2.7.1.8-cuda12.8.1-py312-alinux3.2104
                command:
                  - sh
                  - -c
                  - "python3 -m sglang.launch_server --model-path /models/Qwen3-32B --tp 2 --dist-init-addr $(LWS_LEADER_ADDRESS):20000 \
                  --nnodes $(LWS_GROUP_SIZE) --node-rank $(LWS_WORKER_INDEX) --trust-remote-code --host 0.0.0.0 --port 8000 --enable-metrics"
                resources:
                  limits:
                    nvidia.com/gpu: "1"
                    memory: "24Gi"
                    cpu: "8"
                  requests:
                    nvidia.com/gpu: "1"
                    memory: "24Gi"
                    cpu: "8"
                ports:
                  - containerPort: 8000
                    name: http
                readinessProbe:
                  tcpSocket:
                    port: 8000
                  initialDelaySeconds: 30
                  periodSeconds: 10
                volumeMounts:
                  - mountPath: /dev/shm
                    name: dshm
                  - mountPath: /models/Qwen3-32B
                    name: model
            volumes:
                - name: model
                  persistentVolumeClaim:
                    claimName: llm-model
                - name: dshm
                  emptyDir:
                    medium: Memory
                    sizeLimit: 15Gi
        workerTemplate:
          spec:
            containers:
              - name: sglang-worker
                image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/anolis-docker-images/docker-temp:0.3.4.post2-sglang0.4.10.post2-pytorch2.7.1.8-cuda12.8.1-py312-alinux3.2104
                command:
                  - sh
                  - -c
                  - "python3 -m sglang.launch_server --model-path /models/Qwen3-32B --tp 2 --dist-init-addr $(LWS_LEADER_ADDRESS):20000 \
                  --nnodes $(LWS_GROUP_SIZE) --node-rank $(LWS_WORKER_INDEX) --trust-remote-code"
                resources:
                  limits:
                    nvidia.com/gpu: "1"
                    memory: "24Gi"
                    cpu: "8"
                  requests:
                    nvidia.com/gpu: "1"
                    memory: "24Gi"
                    cpu: "8"
                volumeMounts:
                  - mountPath: /dev/shm
                    name: dshm
                  - mountPath: /models/Qwen3-32B
                    name: model
            volumes:
                - name: model
                  persistentVolumeClaim:
                    claimName: llm-model
                - name: dshm
                  emptyDir:
                    medium: Memory
                    sizeLimit: 15Gi
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: multi-nodes-service
    spec:
      selector:
        alibabacloud.com/inference-workload: sglang-multi-nodes
        role: leader
      ports:
        - protocol: TCP
          port: 8000
          targetPort: 8000
  2. Run the following command to deploy the multi-node LLM inference service using the SGLang framework:

    kubectl create -f sglang_multi.yaml
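
As with the vLLM deployment, you can wait until all pods in the group are Ready before validating the service. The label selector below assumes the leaderworkerset.sigs.k8s.io/name label that the LWS controller adds to the pods it manages; if your LWS version uses a different label, list the pods by name instead.

    kubectl get pods -l leaderworkerset.sigs.k8s.io/name=sglang-multi-nodes
    # Wait up to 15 minutes for the leader and worker pods to become Ready.
    kubectl wait --for=condition=Ready pod -l leaderworkerset.sigs.k8s.io/name=sglang-multi-nodes --timeout=15m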

Step 3: Validate the inference service

  1. Run the following command to establish port forwarding between the inference service and your local environment.

    Important

    Port forwarding established by kubectl port-forward lacks production-grade reliability, security, and scalability. It is suitable only for development and debugging and should not be used in production environments. For production-ready network solutions in Kubernetes clusters, see Ingress management.

    kubectl port-forward svc/multi-nodes-service 8000:8000

    Expected output:

    Forwarding from 127.0.0.1:8000 -> 8000
    Forwarding from [::1]:8000 -> 8000
  2. Run the following command to send a sample inference request to the service:

    curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json"  -d '{"model": "/models/Qwen3-32B", "messages": [{"role": "user", "content": "Test it"}], "max_tokens": 30, "temperature": 0.7, "top_p": 0.9, "seed": 10}'

    Expected output:

    {"id":"chatcmpl-ee6b347a8bd049f9a502669db0817938","object":"chat.completion","created":1753685847,"model":"/models/Qwen3-32B","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"<think>\nOkay, the user sent "Test it". I need to confirm their request first. They might be testing my functionality or want to see my reaction.","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":10,"total_tokens":40,"completion_tokens":30,"prompt_tokens_details":null},"prompt_logprobs":null,"kv_transfer_params":null}

    The output confirms that the distributed model service is working properly and can generate responses.
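
    Because both frameworks expose an OpenAI-compatible API, you can run a few additional optional checks through the same port-forwarded endpoint. The requests below reuse the example above; listing models via /v1/models and streaming via "stream": true are standard OpenAI-compatible features supported by vLLM and SGLang.

    # List the model served by the endpoint.
    curl http://localhost:8000/v1/models

    # Request a streaming response (returned as server-sent events).
    curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "/models/Qwen3-32B", "messages": [{"role": "user", "content": "Test it"}], "max_tokens": 30, "stream": true}'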