Container Service for Kubernetes: Deploy a Dynamo inference service with PD disaggregation

Last Updated: Sep 18, 2025

This topic uses the Qwen3-32B model as an example to demonstrate how to deploy a model inference service in a Container Service for Kubernetes (ACK) cluster using the Dynamo framework with a prefill-decode (PD) disaggregation architecture.

Background

  • Qwen3-32B

    Qwen3-32B represents the latest evolution in the Qwen series, featuring a 32.8B-parameter dense architecture optimized for both reasoning efficiency and conversational fluency.

    Key features:

    • Dual-mode performance: Excels at complex tasks like logical reasoning, math, and code generation, while remaining highly efficient for general text generation.

    • Advanced capabilities: Demonstrates excellent performance in instruction following, multi-turn dialog, creative writing, and best-in-class tool use for AI agent tasks.

    • Large context window: Natively handles a context window of 32,768 tokens, which can be extended to 131,072 tokens using YaRN technology.

    • Multilingual support: Understands and translates over 100 languages, making it ideal for global applications.

    For more information, see the blog, GitHub, and documentation.

  • Dynamo

    Dynamo is a high-throughput, low-latency inference framework from NVIDIA, designed specifically for serving large language models (LLMs) in multi-node, distributed environments.

    Key features:

    • Engine-agnostic: Dynamo is not tied to a specific inference engine and supports various backends such as TensorRT-LLM, vLLM, and SGLang.

    • LLM-specific optimization capabilities:

      • PD disaggregation: It decouples the compute-intensive prefill stage from the memory-bound decode stage, reducing latency and boosting throughput.

      • Dynamic GPU scheduling: It optimizes performance based on real-time load changes.

      • Smart LLM routing: It routes requests based on the key-value (KV) cache of a node to avoid unnecessary KV cache recalculations.

      • Accelerated data transmission: It uses NVIDIA Inference Xfer Library (NIXL) technology to speed up the transfer of intermediate computation results and KV cache.

      • KV cache offloading: It can offload KV cache to memory, disks, or even cloud disks to increase the total system throughput.

    • High performance and extensibility: The core is built in Rust for maximum performance, while providing a Python interface for user extensibility.

    • Fully open source: Dynamo is fully open source and follows a transparent, open source-first development philosophy.

    For more information about the Dynamo framework, see the Dynamo GitHub and the Dynamo documentation.

  • Prefill/Decode separation

    The Prefill/Decode separation architecture is a mainstream optimization technique for large language model (LLM) inference. It aims to resolve the resource conflict between the two core stages of the inference process. The LLM inference process can be divided into two stages:

    • Prefill (prompt processing) stage: In this stage, the entire user-input prompt is processed at once. The attention for all input tokens is calculated in parallel to generate the initial KV cache. This process is compute-intensive, requires powerful parallel computing capabilities, and is executed only once at the beginning of each request.

    • Decode (token generation) stage: This stage is an autoregressive process where the model generates new tokens one by one based on the existing KV cache. The computation for each step is small, but it requires repeatedly and quickly loading the large model weights and the KV cache from GPU memory. Therefore, this process is memory-bound.

    The core conflict is that scheduling these two very different tasks on the same GPU is highly inefficient. When processing multiple user requests, inference engines often use continuous batching to schedule the prefill and decode stages of different requests in the same batch. The prefill stage processes the entire prompt and is computationally complex. The decode stage generates only a single token and is computationally simple. If both are scheduled in the same batch, the decode stage experiences increased latency because of differences in sequence length and resource competition. This increases the overall system latency and reduces throughput.


    The Prefill/Decode separation architecture solves this problem by decoupling these two stages and deploying them on different GPUs. This separation allows the system to be optimized for the different characteristics of the prefill and decode stages. It avoids resource competition, significantly reduces the average time per output token (TPOT), and improves system throughput.
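The compute/memory imbalance described above can be made concrete with a back-of-envelope calculation. The sketch below uses assumed, illustrative numbers (32.8B bf16 parameters, a 2,048-token prompt) and the standard ~2 × params FLOPs-per-token estimate; it is a rough characterization, not a benchmark.

```shell
# Rough cost asymmetry between prefill and decode for a Qwen3-32B-class model.
# All numbers are illustrative assumptions, not measurements.
PARAMS=32800000000     # 32.8B parameters
BYTES_PER_PARAM=2      # bf16 weights
PROMPT_TOKENS=2048

# Each decode step must stream (at least) all model weights from GPU memory
# to emit a single token: memory-bound.
echo "decode: >= $((PARAMS * BYTES_PER_PARAM / 1000000000)) GB read per token"

# Prefill processes the whole prompt in parallel at ~2*params FLOPs per token:
# compute-bound.
echo "prefill: ~$((2 * PARAMS * PROMPT_TOKENS / 1000000000000)) TFLOPs for the prompt"
```

Scheduling both workloads on the same GPU forces the memory-bound decode steps to queue behind long compute-bound prefills, which is exactly the contention that PD disaggregation removes.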

  • RoleBasedGroup

    RoleBasedGroup (RBG) is a new workload designed by the Alibaba Cloud Container Service for Kubernetes (ACK) team to address the challenges of large-scale deployment and O&M of the Prefill/Decode separation architecture in Kubernetes clusters. This project is open source. For more information, see the RBG GitHub.

    In the RBG API design, a group consists of a set of roles, and each role can be built on a StatefulSet, a Deployment, or a LeaderWorkerSet (LWS). Its core features are as follows:

    • Flexible multi-role definition: RBG lets you define any number of roles with any names. It supports defining dependencies between roles, allowing them to start in a specified order. It also supports elastic scaling at the role level.

    • Runtime: It provides automatic service discovery within the group. It supports multiple restart policies, rolling updates, and gang scheduling.


Prerequisites

  • An ACK managed cluster running Kubernetes 1.22 or later with at least 6 GPUs, where each GPU has at least 32 GB of memory. For more information, see Create an ACK managed cluster and Add a GPU node to a cluster.

    The ecs.ebmgn8is.32xlarge instance type is recommended. For more information about instance types, see ECS Bare Metal Instance families.
  • Install the ack-rbgs component as follows.

    Log on to the Container Service Management Console. In the left-side navigation pane, click Clusters, then click the name of the target cluster. On the cluster details page, install the ack-rbgs component with Helm. Leave Application Name and Namespace unset and click Next. In the Confirm dialog box that appears, click Yes to use the default application name (ack-rbgs) and namespace (rbgs-system). Then, select the latest chart version and click OK to complete the installation.

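After the installation completes, you can optionally confirm that the controller is running. The namespace below is the default installation value, and the CRD name is inferred from the RoleBasedGroup API group; adjust them if your installation differs.

```shell
# Confirm the ack-rbgs controller started in its default namespace.
kubectl get pods -n rbgs-system

# Confirm the RoleBasedGroup CRD was registered (name assumed from the API group).
kubectl get crd rolebasedgroups.workloads.x-k8s.io
```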

Model deployment

The request lifecycle in the Dynamo PD disaggregation architecture consists of the following stages:

  1. Request ingestion: The user's request is first sent to the processor component. The router within the processor selects an available decode worker and forwards the request to it.

  2. Prefill decision: The decode worker determines whether the prefill computation should be performed locally or delegated to a remote prefill worker. If remote computation is required, it sends a prefill request to the prefill queue.

  3. Prefill execution: A prefill worker retrieves the request from the queue and executes the prefill computation.

  4. KV cache transfer: Once the computation is complete, the prefill worker transfers the resulting KV cache to the designated decode worker, which then proceeds with the decode stage.


Step 1: Prepare the Qwen3-32B model files

  1. Run the following command to download the Qwen3-32B model from ModelScope.

    If the git-lfs plugin is not installed, run yum install git-lfs or apt-get install git-lfs to install it. For more installation methods, see Installing Git Large File Storage.
    git lfs install
    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/Qwen3-32B.git
    cd Qwen3-32B/
    git lfs pull
  2. Log on to the OSS console and record the name of your bucket. If you have not created one, see Create buckets. Then, create a directory in Object Storage Service (OSS) and upload the model to it.

    For more information about how to install and use ossutil, see Install ossutil.
    ossutil mkdir oss://<your-bucket-name>/Qwen3-32B
    ossutil cp -r ./Qwen3-32B oss://<your-bucket-name>/Qwen3-32B
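Optionally verify that the upload succeeded before wiring the bucket into the cluster. The bucket name below is a placeholder, as in the preceding commands.

```shell
# List the uploaded model directory; expect the *.safetensors shards plus
# config.json, tokenizer files, and so on.
ossutil ls oss://<your-bucket-name>/Qwen3-32B/
```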
  3. Create a persistent volume (PV) named llm-model and a persistent volume claim (PVC) for your cluster. For detailed instructions, see Create a PV and a PVC.

    Example using console

    1. Create a PV

      • Log on to the ACK console. In the navigation pane on the left, click Clusters.

      • On the Clusters page, find the cluster you want and click its name. In the left navigation pane, choose Volumes > Persistent Volumes.

      • In the upper-right corner of the Persistent Volumes page, click Create.

      • In the Create PV dialog box, configure the parameters that are described in the following table.

        The following table describes the basic configuration of the sample PV:

        Parameter

        Description

        PV Type

        In this example, select OSS.

        Volume Name

        In this example, enter llm-model.

        Access Certificate

        Configure the AccessKey ID and AccessKey secret used to access the OSS bucket.

        Bucket ID

        Select the OSS bucket you created in the preceding step.

        OSS Path

        Enter the path where the model is located, such as /Qwen3-32B.

    2. Create a PVC

      • On the Clusters page, find the cluster you want and click its name. In the left navigation pane, choose Volumes > Persistent Volume Claims.

      • In the upper-right corner of the Persistent Volume Claims page, click Create.

      • In the Create PVC dialog box, configure the parameters that are described in the following table.

        The following table describes the basic configuration of the sample PVC.

        Configuration Item

        Description

        PVC Type

        In this example, select OSS.

        Name

        In this example, enter llm-model.

        Allocation Mode

        In this example, select Existing Volumes.

        Existing Volumes

        Click the Select PV hyperlink and select the PV that you created.

    Example using kubectl

    1. Use the following YAML template to create a file named llm-model.yaml, containing configurations for a Secret, a static PV, and a static PVC.

      apiVersion: v1
      kind: Secret
      metadata:
        name: oss-secret
      stringData:
        akId: <your-oss-ak> # The AccessKey ID used to access the OSS bucket.
        akSecret: <your-oss-sk> # The AccessKey secret used to access the OSS bucket.
      ---
      apiVersion: v1
      kind: PersistentVolume
      metadata:
        name: llm-model
        labels:
          alicloud-pvname: llm-model
      spec:
        capacity:
          storage: 30Gi 
        accessModes:
          - ReadOnlyMany
        persistentVolumeReclaimPolicy: Retain
        csi:
          driver: ossplugin.csi.alibabacloud.com
          volumeHandle: llm-model
          nodePublishSecretRef:
            name: oss-secret
            namespace: default
          volumeAttributes:
            bucket: <your-bucket-name> # The bucket name.
            url: <your-bucket-endpoint> # The endpoint, such as oss-cn-hangzhou-internal.aliyuncs.com.
            otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
            path: <your-model-path> # In this example, the path is /Qwen3-32B/.
      ---
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: llm-model
      spec:
        accessModes:
          - ReadOnlyMany
        resources:
          requests:
            storage: 30Gi
        selector:
          matchLabels:
            alicloud-pvname: llm-model
    2. Create the Secret, static PV, and static PVC.

      kubectl create -f llm-model.yaml
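You can optionally confirm that the static PV and PVC were matched before moving on:

```shell
# Both the PV and the PVC should report STATUS Bound.
kubectl get pv llm-model
kubectl get pvc llm-model
```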

Step 2: Install etcd and NATS services

The Dynamo framework relies on two key external services: etcd for service discovery and NATS for messaging. Specifically, Dynamo uses NIXL for cross-node communication, which registers with etcd to discover other nodes. NATS is used as the message bus between the prefill and decode workers. Therefore, both etcd and NATS must be deployed before starting the inference service.

  1. Create a file named etcd.yaml.

    YAML template

    apiVersion: v1
    kind: Service
    metadata:
      name: etcd
      labels:
        app: etcd
    spec:
      ports:
        - port: 2379
          name: client
        - port: 2380
          name: peer
      clusterIP: None # Enables headless service mode
      selector:
        app: etcd
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: etcd
      labels:
        app: etcd
    spec:
      selector:
        matchLabels:
          app: etcd
      serviceName: "etcd"
      replicas: 1
      template:
        metadata:
          labels:
            app: etcd
        spec:
          containers:
            - name: etcd
              # image: bitnami/etcd:3.5.19
              image: ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/etcd:3.6.1
              volumeMounts:
                - name: data
                  mountPath: /var/lib/etcd
              env:
                - name: ETCDCTL_API
                  value: "3"
                - name: ALLOW_NONE_AUTHENTICATION
                  value: "yes"
          volumes:
            - name: data
              emptyDir: {}

    Deploy the etcd service.

    kubectl apply -f etcd.yaml
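Optionally verify that etcd is healthy. The pod name etcd-0 follows from the single-replica StatefulSet above; this assumes the image bundles the etcdctl client.

```shell
# Ask the single etcd member for its health status.
kubectl exec etcd-0 -- etcdctl endpoint health
```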
  2. Create a file named nats.yaml.

    YAML template

    apiVersion: v1
    kind: Service
    metadata:
      name: nats
      labels:
        app: nats
    spec:
      ports:
        - port: 4222
          name: client
        - port: 8222
          name: management
        - port: 6222
          name: cluster
      selector:
        app: nats
      type: ClusterIP
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nats
      labels:
        app: nats
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: nats
      template:
        metadata:
          labels:
            app: nats
        spec:
          containers:
            - name: nats
              # image: nats:latest
              image: ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/nats:2.11.5
              args:
                - -js
                - --trace
                - -m
                - "8222"
              ports:
                - containerPort: 4222
                - containerPort: 8222
                - containerPort: 6222

    Deploy the NATS service.

    kubectl apply -f nats.yaml
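Optionally verify that NATS is up by querying its monitoring port (8222), which the deployment above enables with the -m flag:

```shell
# Expose the NATS monitoring port locally, then query its health endpoint.
kubectl port-forward svc/nats 8222:8222 &
sleep 2
curl http://localhost:8222/healthz
```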

Step 3: Deploy the Dynamo PD-disaggregated inference service

This topic uses an RBG to deploy a Dynamo service with two prefill workers and one decode worker (2P1D). Both the prefill and decode roles use a tensor parallelism (TP) size of 2, so the deployment uses six GPUs in total.


  1. Create a file named dynamo-configs.yaml that defines a ConfigMap storing the Dynamo service graph and the Qwen3 model configuration.

    YAML template

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: dynamo-configs
    data:
      pd_disagg.py: |
        from components.frontend import Frontend
        from components.kv_router import Router
        from components.processor import Processor
    
        Frontend.link(Processor).link(Router)
    
      qwen3.yaml: |
        Common:
          model: /models/Qwen3-32B/
          kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'
          router: round-robin
          # KV cache block size in tokens. Larger blocks enable more efficient chunked KV cache transfers to GPUs.
          block-size: 128
          max-model-len: 2048
          max-num-batched-tokens: 2048
          disable-log-requests: true
    
        Frontend:
          served_model_name: qwen
          endpoint: dynamo.Processor.chat/completions
          port: 8000
    
        Processor:
          common-configs: [ model, router ]
    
        VllmWorker:
          common-configs: [ model, kv-transfer-config, router, block-size, max-model-len, disable-log-requests ]
          # Enable prefill at different workers.
          remote-prefill: true
          # Disable local prefill so only disaggregated prefill is used.
          conditional-disagg: false
          gpu-memory-utilization: 0.95
          tensor-parallel-size: 2
          ServiceArgs:
            workers: 1
            resources:
              gpu: 2
    
        PrefillWorker:
          common-configs: [ model, kv-transfer-config, block-size, max-model-len, max-num-batched-tokens, disable-log-requests ]
          tensor-parallel-size: 2
          gpu-memory-utilization: 0.95
          ServiceArgs:
            workers: 1
            resources:
              gpu: 2
    
    Apply the ConfigMap.

    kubectl apply -f dynamo-configs.yaml
  2. Prepare the Dynamo runtime image.

    Follow the instructions in the Dynamo community to build or pull an image with vLLM as the inference framework.

  3. Create a file named dynamo.yaml to define the RBG. Ensure you replace the placeholder with your Dynamo runtime image address.

    YAML template

    apiVersion: workloads.x-k8s.io/v1alpha1
    kind: RoleBasedGroup
    metadata:
      name: dynamo-pd
      namespace: default
    spec:
      roles:
        - name: processor
          replicas: 1
          template:
            spec:
              containers:
                - command:
                    - sh
                    - -c
                    - cd /workspace/examples/llm; dynamo serve graphs.pd_disagg:Frontend -f ./configs/qwen3.yaml
                  env:
                    - name: DYNAMO_NAME
                      value: dynamo
                    - name: DYNAMO_NAMESPACE
                      value: default
                    - name: ETCD_ENDPOINTS
                      value: http://etcd:2379
                    - name: NATS_SERVER
                      value: nats://nats:4222
                    - name: DYNAMO_RP_TIMEOUT
                      value: "60"
                  image: # The address of the Dynamo runtime image prepared in the preceding substep
                  name: processor
                  ports:
                    - containerPort: 8000
                      name: health
                      protocol: TCP
                    - containerPort: 9345
                      name: request
                      protocol: TCP
                    - containerPort: 443
                      name: api
                      protocol: TCP
                    - containerPort: 9347
                      name: metrics
                      protocol: TCP
                  readinessProbe:
                    initialDelaySeconds: 30
                    periodSeconds: 30
                    tcpSocket:
                      port: 8000
                  resources:
                    limits:
                      cpu: "8"
                      memory: 12Gi
                    requests:
                      cpu: "8"
                      memory: 12Gi
                  volumeMounts:
                    - mountPath: /models/Qwen3-32B/
                      name: model
                    - mountPath: /workspace/examples/llm/configs/qwen3.yaml
                      name: dynamo-configs
                      subPath: qwen3.yaml
                    - mountPath: /workspace/examples/llm/graphs/pd_disagg.py
                      name: dynamo-configs
                      subPath: pd_disagg.py
              volumes:
                - name: model
                  persistentVolumeClaim:
                    claimName: llm-model
                - configMap:
                    name: dynamo-configs
                  name: dynamo-configs
        - name: prefill
          replicas: 2
          template:
            spec:
              containers:
                - command:
                    - sh
                    - -c
                    - cd /workspace/examples/llm; dynamo serve components.prefill_worker:PrefillWorker -f ./configs/qwen3.yaml
                  env:
                    - name: DYNAMO_NAME
                      value: dynamo
                    - name: DYNAMO_NAMESPACE
                      value: default
                    - name: ETCD_ENDPOINTS
                      value: http://etcd:2379
                    - name: NATS_SERVER
                      value: nats://nats:4222
                    - name: DYNAMO_RP_TIMEOUT
                      value: "60"
                  image: # The address of the Dynamo runtime image prepared in the preceding substep
                  name: prefill-worker
                  resources:
                    limits:
                      cpu: "12"
                      memory: 50Gi
                      nvidia.com/gpu: "2"
                    requests:
                      cpu: "12"
                      memory: 50Gi
                      nvidia.com/gpu: "2"
                  volumeMounts:
                    - mountPath: /models/Qwen3-32B/
                      name: model
                    - mountPath: /workspace/examples/llm/configs/qwen3.yaml
                      name: dynamo-configs
                      subPath: qwen3.yaml
              volumes:
                - name: model
                  persistentVolumeClaim:
                    claimName: llm-model
                - configMap:
                    name: dynamo-configs
                  name: dynamo-configs
        - name: decoder
          replicas: 1
          template:
            spec:
              containers:
                - command:
                    - sh
                    - -c
                    - cd /workspace/examples/llm; dynamo serve components.worker:VllmWorker -f ./configs/qwen3.yaml --service-name VllmWorker
                  env:
                    - name: DYNAMO_NAME
                      value: dynamo
                    - name: DYNAMO_NAMESPACE
                      value: default
                    - name: ETCD_ENDPOINTS
                      value: http://etcd:2379
                    - name: NATS_SERVER
                      value: nats://nats:4222
                    - name: DYNAMO_RP_TIMEOUT
                      value: "60"
                  image: # The address of the Dynamo runtime image prepared in the preceding substep
                  name: vllm-worker
                  resources:
                    limits:
                      cpu: "12"
                      memory: 50Gi
                      nvidia.com/gpu: "2"
                    requests:
                      cpu: "12"
                      memory: 50Gi
                      nvidia.com/gpu: "2"
                  volumeMounts:
                    - mountPath: /models/Qwen3-32B/
                      name: model
                    - mountPath: /workspace/examples/llm/configs/qwen3.yaml
                      name: dynamo-configs
                      subPath: qwen3.yaml
              volumes:
                - name: model
                  persistentVolumeClaim:
                    claimName: llm-model
                - configMap:
                    name: dynamo-configs
                  name: dynamo-configs
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: dynamo-service
    spec:
      type: ClusterIP
      ports:
        - port: 8000
          protocol: TCP
          targetPort: 8000
      selector:
        rolebasedgroup.workloads.x-k8s.io/name: dynamo-pd
        rolebasedgroup.workloads.x-k8s.io/role: processor

    Deploy the service.

    kubectl apply -f ./dynamo.yaml
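Before validating the service, you can watch the group come up. The label below matches the selector used by the dynamo-service Service above; the resource name rolebasedgroup is assumed from the RoleBasedGroup CRD.

```shell
# The group should list the processor, prefill, and decoder roles.
kubectl get rolebasedgroup dynamo-pd

# Worker pods stay Pending until GPUs are allocated and become Running
# after the model finishes loading.
kubectl get pods -l rolebasedgroup.workloads.x-k8s.io/name=dynamo-pd
```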

Step 4: Validate the inference service

  1. Establish port forwarding between the inference service and your local environment for testing.

    Important

    Port forwarding established by kubectl port-forward lacks production-grade reliability, security, and scalability. It is suitable only for development and debugging and should not be used in production environments. For production-ready network solutions in Kubernetes clusters, see Ingress management.

    kubectl port-forward svc/dynamo-service 8000:8000

    Expected output:

    Forwarding from 127.0.0.1:8000 -> 8000
    Forwarding from [::1]:8000 -> 8000
  2. Send a sample request to the model inference service.

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "qwen", "messages": [{"role": "user", "content": "Let'\''s test it"}], "stream": false, "max_tokens": 30}'

    Expected output:

    {"id":"31ac3203-c5f9-4b06-a4cd-4435a78d3b35","choices":[{"index":0,"message":{"content":"<think>\nOkay, the user sent 'Let's test it'. I need to confirm their intent first. They might be testing my response speed or functionality, or maybe they want to","refusal":null,"tool_calls":null,"role":"assistant","function_call":null,"audio":null},"finish_reason":"length","logprobs":null}],"created":1753702438,"model":"qwen","service_tier":null,"system_fingerprint":null,"object":"chat.completion","usage":null}

    A successful JSON response indicates that your Dynamo PD inference service is running correctly.
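Because the frontend exposes an OpenAI-compatible endpoint, you can also request token-by-token streaming, which makes the decode worker's output visible as it is generated:

```shell
# Same endpoint with streaming enabled; the response arrives as
# server-sent-event chunks instead of a single JSON object.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen", "messages": [{"role": "user", "content": "Hello"}], "stream": true, "max_tokens": 30}'
```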

References

  • Configure auto scaling for LLM inference services

    LLM workloads often fluctuate, leading to either over-provisioned resources or poor performance during traffic spikes. The Kubernetes Horizontal Pod Autoscaler (HPA), integrated with ack-alibaba-cloud-metrics-adapter, solves this by:

    • Automatically scaling your pods based on real-time GPU, CPU, and memory utilization.

    • Allowing you to define custom metrics for more sophisticated scaling triggers.

    • Ensuring high availability during peak demand while reducing costs during idle periods.
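    As a sketch, an HPA that scales an inference workload on GPU utilization might look like the following. The target kind, metric name, and thresholds are illustrative assumptions; the metrics actually available depend on your ack-alibaba-cloud-metrics-adapter configuration.

    ```yaml
    # Illustrative sketch only: adapt the target, metric name, and thresholds.
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: llm-inference-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment          # or the workload backing your inference role
        name: llm-inference
      minReplicas: 1
      maxReplicas: 4
      metrics:
        - type: Pods
          pods:
            metric:
              name: DCGM_FI_DEV_GPU_UTIL   # assumed GPU-utilization metric name
            target:
              type: AverageValue
              averageValue: "70"
    ```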

  • Accelerate model loading with Fluid distributed caching

    Large model files (>10 GB) stored in services like OSS or File Storage NAS can cause slow pod startups (cold starts) due to long download times. Fluid solves this problem by creating a distributed caching layer across your cluster's nodes. This significantly accelerates model loading in two key ways:

    • Accelerated data throughput: Fluid pools the storage capacity and network bandwidth of all nodes in the cluster. This creates a high-speed, parallel data layer that overcomes the bottleneck of pulling large files from a single remote source.

    • Reduced I/O latency: By caching model files directly on the compute nodes where they are needed, Fluid provides applications with local, near-instant access to data. This optimized read mechanism eliminates the long delays associated with network I/O.
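    As a sketch, caching the OSS-hosted model with Fluid might look like the following Dataset and AlluxioRuntime pair. The names, cache sizes, and replica count are illustrative assumptions; the oss-secret keys reuse the Secret created for the static PV in this topic.

    ```yaml
    # Illustrative sketch only: adapt bucket, endpoint, sizes, and replicas.
    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: qwen3-32b
    spec:
      mounts:
        - mountPoint: oss://<your-bucket-name>/Qwen3-32B
          name: qwen3-32b
          options:
            fs.oss.endpoint: <your-bucket-endpoint>
          encryptOptions:
            - name: fs.oss.accessKeyId
              valueFrom:
                secretKeyRef:
                  name: oss-secret
                  key: akId
            - name: fs.oss.accessKeySecret
              valueFrom:
                secretKeyRef:
                  name: oss-secret
                  key: akSecret
    ---
    apiVersion: data.fluid.io/v1alpha1
    kind: AlluxioRuntime
    metadata:
      name: qwen3-32b
    spec:
      replicas: 2
      tieredstore:
        levels:
          - mediumtype: MEM
            path: /dev/shm
            quota: 20Gi
    ```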