This topic uses the Qwen3-32B model as an example to demonstrate how to deploy a model inference service in a Container Service for Kubernetes (ACK) cluster using the Dynamo framework with a prefill-decode (PD) disaggregation architecture.
Background
Qwen3-32B
Qwen3-32B represents the latest evolution in the Qwen series, featuring a 32.8B-parameter dense architecture optimized for both reasoning efficiency and conversational fluency.
Key features:
Dual-mode performance: Excels at complex tasks like logical reasoning, math, and code generation, while remaining highly efficient for general text generation.
Advanced capabilities: Demonstrates excellent performance in instruction following, multi-turn dialog, creative writing, and best-in-class tool use for AI agent tasks.
Large context window: Natively handles a context of up to 32,768 tokens, which can be extended to 131,072 tokens using YaRN technology.
Multilingual support: Understands and translates over 100 languages, making it ideal for global applications.
For more information, see the blog, GitHub, and documentation.
Dynamo
Dynamo is a high-throughput, low-latency inference framework from NVIDIA, designed specifically for serving large language models (LLMs) in multi-node, distributed environments.

Key features:
Engine-agnostic: Dynamo is not tied to a specific inference engine and supports various backends such as TensorRT-LLM, vLLM, and SGLang.
LLM-specific optimization capabilities:
PD disaggregation: It decouples the compute-intensive prefill stage from the memory-bound decode stage, reducing latency and boosting throughput.
Dynamic GPU scheduling: It optimizes performance based on real-time load changes.
Smart LLM routing: It routes requests based on the key-value (KV) cache of a node to avoid unnecessary KV cache recalculations.
Accelerated data transmission: It uses NVIDIA Inference Xfer Library (NIXL) technology to speed up the transfer of intermediate computation results and KV cache.
KV cache offloading: It can offload KV cache to memory, disks, or even cloud disks to increase the total system throughput.
High performance and extensibility: The core is built in Rust for maximum performance, while providing a Python interface for user extensibility.
Fully open source: Dynamo is fully open source and follows a transparent, open source-first development philosophy.
For more information about the Dynamo framework, see the Dynamo GitHub and the Dynamo documentation.
Prefill/Decode separation
The Prefill/Decode separation architecture is a mainstream optimization technique for large language model (LLM) inference. It aims to resolve the resource conflict between the two core stages of the inference process. The LLM inference process can be divided into two stages:
Prefill (prompt processing) stage: In this stage, the entire user-input prompt is processed at once. The attention for all input tokens is calculated in parallel to generate the initial KV cache. This process is compute-intensive, requires powerful parallel computing capabilities, and is executed only once at the beginning of each request.
Decode (token generation) stage: This stage is an autoregressive process where the model generates new tokens one by one based on the existing KV cache. The computation for each step is small, but it requires repeatedly and quickly loading large model weights and the KV cache from video memory. Therefore, this process is memory-bound.
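The decode stage's memory pressure is easy to quantify with a back-of-envelope KV cache calculation. The figures below (64 layers, 8 KV heads, head dimension 128, 2 bytes per FP16 value) are illustrative assumptions for a model of Qwen3-32B's scale, not values taken from this guide:

```shell
# Approximate KV cache footprint per token:
# 2 (K and V) x layers x kv_heads x head_dim x bytes_per_value
LAYERS=64; KV_HEADS=8; HEAD_DIM=128; BYTES=2
PER_TOKEN=$((2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES))
echo "KV cache per token: $PER_TOKEN bytes"
# At a 32,768-token context, a single request's cache alone is:
echo "KV cache per full-context request: $((PER_TOKEN * 32768 / 1024 / 1024 / 1024)) GiB"
```

Under these assumptions, one full-context request pins roughly 8 GiB of GPU memory for its KV cache alone, and every decode step must stream that cache (plus the model weights) from GPU memory. This is the contention that batching decode together with compute-heavy prefill creates.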

The core conflict is that scheduling these two very different tasks on the same GPU is highly inefficient. When processing multiple user requests, inference engines often use continuous batching to schedule the prefill and decode stages of different requests in the same batch. The prefill stage processes the entire prompt and is computationally complex. The decode stage generates only a single token and is computationally simple. If both are scheduled in the same batch, the decode stage experiences increased latency because of differences in sequence length and resource competition. This increases the overall system latency and reduces throughput.

The Prefill/Decode separation architecture solves this problem by decoupling these two stages and deploying them on different GPUs. This separation allows the system to be optimized for the different characteristics of the prefill and decode stages. It avoids resource competition, significantly reduces the average time per output token (TPOT), and improves system throughput.
RoleBasedGroup
RoleBasedGroup (RBG) is a new workload designed by the Alibaba Cloud Container Service for Kubernetes (ACK) team to address the challenges of large-scale deployment and O&M of the Prefill/Decode separation architecture in Kubernetes clusters. This project is open source. For more information, see the RBG GitHub.
The RBG API design is shown in the following figure. A group consists of a set of roles, and each role can be built on a StatefulSet, Deployment, or LeaderWorkerSet (LWS). Its core features are as follows:
Flexible multi-role definition: RBG lets you define any number of roles with any names. It supports defining dependencies between roles, allowing them to start in a specified order. It also supports elastic scaling at the role level.
Runtime: It provides automatic service discovery within the group. It supports multiple restart policies, rolling updates, and gang scheduling.

Prerequisites
An ACK managed cluster running Kubernetes 1.22 or later with at least 6 GPUs, where each GPU has at least 32 GB of memory. For more information, see Create an ACK managed cluster and Add a GPU node to a cluster.
The ecs.ebmgn8is.32xlarge instance type is recommended. For more information about instance types, see ECS Bare Metal Instance families.
Install the ack-rbgs component as follows.
Log on to the Container Service Management Console. In the left-side navigation pane, select Cluster List and click the name of the target cluster. On the cluster details page, install the ack-rbgs component using Helm. You do not need to configure the Application Name or Namespace. Click Next, and in the Confirm dialog box that appears, click Yes to use the default application name (ack-rbgs) and namespace (rbgs-system). Then select the latest Chart version and click OK to complete the installation.

Model deployment
The following sequence diagram shows the request lifecycle in the Dynamo PD disaggregation architecture:
Request ingestion: The user's request is first sent to the processor component. The router within the processor selects an available decode worker and forwards the request to it.
Prefill decision: The decode worker determines whether the prefill computation should be performed locally or delegated to a remote prefill worker. If remote computation is required, it sends a prefill request to the prefill queue.
Prefill execution: A prefill worker retrieves the request from the queue and executes the prefill computation.
KV cache transfer: Once the computation is complete, the prefill worker transfers the resulting KV cache to the designated decode worker, which then proceeds with the decode stage.

Step 1: Prepare the Qwen3-32B model files
Run the following command to download the Qwen3-32B model from ModelScope.
If the git-lfs plugin is not installed, run yum install git-lfs or apt-get install git-lfs to install it. For more installation methods, see Installing Git Large File Storage.

git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/Qwen3-32B.git
cd Qwen3-32B/
git lfs pull

Log on to the OSS console and record the name of your bucket. If you haven't created one, see Create buckets. Create a directory in Object Storage Service (OSS) and upload the model to it.
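Because the clone runs with GIT_LFS_SKIP_SMUDGE=1, the weight shards are only materialized by the final git lfs pull; if that step is skipped, the checkout contains ~130-byte LFS pointer stubs instead of real files, and those stubs would be uploaded to OSS. The following sketch demonstrates a quick pointer check against a mock directory (the /tmp/lfs-demo path and file name are illustrative):

```shell
# LFS pointer stubs are ~130 bytes; fully downloaded shards are gigabytes.
# Create a mock pointer file to show what the check catches.
mkdir -p /tmp/lfs-demo
printf 'version https://git-lfs.github.com/spec/v1\n' > /tmp/lfs-demo/model-00001.safetensors
# Any file printed here is still an undownloaded pointer:
find /tmp/lfs-demo -name '*.safetensors' -size -1024c -print
# Run the same check against your real checkout before uploading:
#   find Qwen3-32B -name '*.safetensors' -size -1024c -print
```

If the check against your real checkout prints nothing, every shard was downloaded in full.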
For more information about how to install and use ossutil, see Install ossutil.
ossutil mkdir oss://<your-bucket-name>/Qwen3-32B
ossutil cp -r ./Qwen3-32B oss://<your-bucket-name>/Qwen3-32B

Create a persistent volume (PV) named llm-model and a persistent volume claim (PVC) for your cluster. For detailed instructions, see Create a PV and a PVC.

Example using console
Create a PV
Log on to the ACK console. In the navigation pane on the left, click Clusters.
On the Clusters page, find the cluster you want and click its name. In the left navigation pane, choose .
In the upper-right corner of the Persistent Volumes page, click Create.
In the Create PV dialog box, configure the parameters that are described in the following table.
The following table describes the basic configuration of the sample PV:
Parameter
Description
PV Type
In this example, select OSS.
Volume Name
In this example, enter llm-model.
Access Certificate
Configure the AccessKey ID and AccessKey secret used to access the OSS bucket.
Bucket ID
Select the OSS bucket you created in the preceding step.
OSS Path
Enter the path where the model is located, such as /Qwen3-32B.
Create a PVC
On the Clusters page, find the cluster you want and click its name. In the left navigation pane, choose .
In the upper-right corner of the Persistent Volume Claims page, click Create.
In the Create PVC dialog box, configure the parameters that are described in the following table.
The following table describes the basic configuration of the sample PVC.
Configuration Item
Description
PVC Type
In this example, select OSS.
Name
In this example, enter llm-model.
Allocation Mode
In this example, select Existing Volumes.
Existing Volumes
Click the Select PV hyperlink and select the PV that you created.
Example using kubectl
Use the following YAML template to create a file named llm-model.yaml, containing configurations for a Secret, a static PV, and a static PVC.

apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: <your-oss-ak> # The AccessKey ID used to access the OSS bucket.
  akSecret: <your-oss-sk> # The AccessKey secret used to access the OSS bucket.
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name> # The bucket name.
      url: <your-bucket-endpoint> # The endpoint, such as oss-cn-hangzhou-internal.aliyuncs.com.
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: <your-model-path> # In this example, the path is /Qwen3-32B/.
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model

Create the Secret, static PV, and static PVC.
kubectl create -f llm-model.yaml
Step 2: Install etcd and NATS services
The Dynamo framework relies on two key external services: etcd for service discovery and NATS for messaging. Specifically, Dynamo uses NIXL for cross-node communication, which registers with etcd to discover other nodes. NATS is used as the message bus between the prefill and decode workers. Therefore, both etcd and NATS must be deployed before starting the inference service.
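In the Dynamo documentation, workers are typically pointed at these services through the ETCD_ENDPOINTS and NATS_SERVER environment variables. The endpoint values below are assumptions based on common in-cluster Service names; adjust them to match the Services defined in your etcd.yaml and nats.yaml:

```shell
# Point Dynamo workers at the in-cluster etcd and NATS endpoints.
# (Service names "etcd" and "nats" are illustrative assumptions.)
export ETCD_ENDPOINTS="http://etcd:2379"
export NATS_SERVER="nats://nats:4222"
echo "etcd endpoint: $ETCD_ENDPOINTS"
echo "NATS endpoint: $NATS_SERVER"
```

In a Kubernetes deployment these would normally be set in the worker Pod spec rather than exported interactively.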
Create a file named etcd.yaml.

Deploy the etcd service.

kubectl apply -f etcd.yaml

Create a file named nats.yaml.

Deploy the NATS service.

kubectl apply -f nats.yaml
Step 3: Deploy the Dynamo PD-disaggregated inference service
This topic uses an RBG to deploy a 2 prefill, 1 decode (2P1D) Dynamo service. Both the prefill and decode roles will use a Tensor Parallelism (TP) size of 2. Deployment architecture:

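As a quick sanity check, the GPU requirement of this layout follows directly from the worker counts and the TP size stated above:

```shell
# 2 prefill workers + 1 decode worker, each sharded across TP=2 GPUs.
PREFILL_WORKERS=2; DECODE_WORKERS=1; TP=2
echo "Total GPUs required: $(( (PREFILL_WORKERS + DECODE_WORKERS) * TP ))"
```

The result matches the six-GPU prerequisite listed at the start of this topic.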
Create a file named dynamo-configs.yaml that defines a ConfigMap storing the Dynamo and Qwen3 model configurations.

kubectl apply -f dynamo-configs.yaml

Prepare the Dynamo runtime image.
Follow the instructions in the Dynamo community to build or pull an image with vLLM as the inference framework.
Create a file named dynamo.yaml to define the RBG. Make sure to replace the placeholder with your Dynamo runtime image address.

Deploy the service.
kubectl apply -f ./dynamo.yaml
Step 4: Validate the inference service
Establish port forwarding between the inference service and your local environment for testing.
Important: Port forwarding established by kubectl port-forward lacks production-grade reliability, security, and scalability. It is suitable for development and debugging only and should not be used in production environments. For production-ready network solutions in Kubernetes clusters, see Ingress management.

kubectl port-forward svc/dynamo-service 8000:8000

Expected output:

Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000

Send a sample request to the model inference service.
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "qwen","messages": [{"role": "user","content": "Let'\''s test it"}],"stream":false,"max_tokens": 30}'

Expected output:
{"id":"31ac3203-c5f9-4b06-a4cd-4435a78d3b35","choices":[{"index":0,"message":{"content":"<think>\nOkay, the user sent 'Let's test it'. I need to confirm their intent first. They might be testing my response speed or functionality, or maybe they want to","refusal":null,"tool_calls":null,"role":"assistant","function_call":null,"audio":null},"finish_reason":"length","logprobs":null}],"created":1753702438,"model":"qwen","service_tier":null,"system_fingerprint":null,"object":"chat.completion","usage":null}

A successful JSON response indicates that your Dynamo PD inference service is running correctly.
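A note on shell quoting: an apostrophe such as the one in "Let's" terminates a single-quoted -d argument and breaks the curl command. The sketch below builds the body with a here-doc instead, validates it locally, and shows one way to extract just the generated text from a response (the sample RESPONSE string is illustrative, not real service output):

```shell
# Build the request body with a quoted here-doc so apostrophes need no escaping.
BODY=$(cat <<'EOF'
{"model": "qwen", "messages": [{"role": "user", "content": "Let's test it"}],
 "stream": false, "max_tokens": 30}
EOF
)
# Validate the body locally before sending it.
echo "$BODY" | python3 -m json.tool >/dev/null && echo "request body: valid JSON"
# Then send it through the port-forward:
#   curl http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" -d "$BODY"
# Extract only the assistant text from a response:
RESPONSE='{"choices":[{"index":0,"message":{"content":"Hello from the model"}}]}'
echo "$RESPONSE" | python3 -c 'import sys, json; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
```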
References
Configure auto scaling for LLM inference services
LLM workloads often fluctuate, leading to either over-provisioned resources or poor performance during traffic spikes. The Kubernetes Horizontal Pod Autoscaler (HPA), integrated with ack-alibaba-cloud-metrics-adapter, solves this by:
Automatically scaling your pods based on real-time GPU, CPU, and memory utilization.
Allowing you to define custom metrics for more sophisticated scaling triggers.
Ensuring high availability during peak demand while reducing costs during idle periods.
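For a concrete picture, the following is a minimal HPA sketch for scaling a decode role. The workload name (decode), the metric name (DCGM_FI_DEV_GPU_UTIL), and the threshold are hypothetical placeholders, not values from this guide; the metrics actually available depend on how ack-alibaba-cloud-metrics-adapter is configured:

```shell
# Write an illustrative HPA manifest (names and thresholds are hypothetical).
cat > decode-hpa.yaml <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: decode-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: decode               # hypothetical decode-role workload name
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: External
      external:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL   # hypothetical adapter-exposed metric
        target:
          type: AverageValue
          averageValue: "80"
EOF
echo "HPA sketch written to decode-hpa.yaml"
```

Apply such a manifest with kubectl apply only after confirming the target workload and metric names match your deployment.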
Accelerate model loading with Fluid distributed caching
Large model files (>10 GB) stored in services like OSS or File Storage NAS can cause slow pod startups (cold starts) due to long download times. Fluid solves this problem by creating a distributed caching layer across your cluster's nodes. This significantly accelerates model loading in two key ways:
Accelerated data throughput: Fluid pools the storage capacity and network bandwidth of all nodes in the cluster. This creates a high-speed, parallel data layer that overcomes the bottleneck of pulling large files from a single remote source.
Reduced I/O latency: By caching model files directly on the compute nodes where they are needed, Fluid provides applications with local, near-instant access to data. This optimized read mechanism eliminates the long delays associated with network I/O.