This topic uses the Qwen3-32B model as an example to demonstrate how to deploy a model inference service in a Container Service for Kubernetes (ACK) cluster using the Dynamo framework with a prefill-decode (PD) disaggregation architecture.
Background
Qwen3-32B
Qwen3-32B represents the latest evolution in the Qwen series, featuring a 32.8B-parameter dense architecture optimized for both reasoning efficiency and conversational fluency.
Key features:
Dual-mode performance: Excels at complex tasks like logical reasoning, math, and code generation, while remaining highly efficient for general text generation.
Advanced capabilities: Demonstrates excellent performance in instruction following, multi-turn dialog, creative writing, and best-in-class tool use for AI agent tasks.
Large context window: Natively handles a context of up to 32,768 tokens, which can be extended to 131,072 tokens using YaRN technology.
Multilingual support: Understands and translates over 100 languages, making it ideal for global applications.
For more information, see the blog, GitHub, and documentation.
Dynamo
Dynamo is a high-throughput, low-latency inference framework from NVIDIA, designed specifically for serving large language models (LLMs) in multi-node, distributed environments.

Key features:
Engine-agnostic: Dynamo is not tied to a specific inference engine and supports various backends such as TensorRT-LLM, vLLM, and SGLang.
LLM-specific optimization capabilities:
PD disaggregation: It decouples the compute-intensive prefill stage from the memory-bound decode stage, reducing latency and boosting throughput.
Dynamic GPU scheduling: It optimizes performance based on real-time load changes.
Smart LLM routing: It routes requests based on the key-value (KV) cache of a node to avoid unnecessary KV cache recalculations.
Accelerated data transmission: It uses NVIDIA Inference Xfer Library (NIXL) technology to speed up the transfer of intermediate computation results and KV cache.
KV cache offloading: It can offload KV cache to memory, disks, or even cloud disks to increase the total system throughput.
High performance and extensibility: The core is built in Rust for maximum performance, while providing a Python interface for user extensibility.
Fully open source: Dynamo is fully open source and follows a transparent, open source-first development philosophy.
For more information about the Dynamo framework, see the Dynamo GitHub and the Dynamo documentation.
Prefill/Decode separation
The Prefill/Decode separation architecture is a mainstream optimization technique for large language model (LLM) inference. It aims to resolve the resource conflict between the two core stages of the inference process. The LLM inference process can be divided into two stages:
Prefill (prompt processing) stage: In this stage, the entire user-input prompt is processed at once. The attention for all input tokens is calculated in parallel to generate the initial KV cache. This process is compute-intensive, requires powerful parallel computing capabilities, and is executed only once at the beginning of each request.
Decode (token generation) stage: This stage is an autoregressive process where the model generates new tokens one by one based on the existing KV cache. The computation for each step is small, but it requires repeatedly and quickly loading large model weights and the KV cache from video memory. Therefore, this process is memory-bound.
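The decode stage's memory pressure is easy to quantify with a back-of-envelope KV cache calculation. The figures below (64 layers, 8 KV heads, head dimension 128, 2 bytes per FP16 value) are illustrative assumptions for a model of Qwen3-32B's scale, not values taken from this guide:

```shell
# Approximate KV cache footprint per token:
# 2 (K and V) x layers x kv_heads x head_dim x bytes_per_value
LAYERS=64; KV_HEADS=8; HEAD_DIM=128; BYTES=2
PER_TOKEN=$((2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES))
echo "KV cache per token: $PER_TOKEN bytes"
# At a 32,768-token context, a single request's cache alone is:
echo "KV cache per full-context request: $((PER_TOKEN * 32768 / 1024 / 1024 / 1024)) GiB"
```

Under these assumptions, one full-context request pins roughly 8 GiB of GPU memory for its KV cache alone, and every decode step must stream that cache (plus the model weights) from GPU memory. This is the contention that batching decode together with compute-heavy prefill creates.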

The core conflict is that scheduling these two very different tasks on the same GPU is highly inefficient. When processing multiple user requests, inference engines often use continuous batching to schedule the prefill and decode stages of different requests in the same batch. The prefill stage processes the entire prompt and is computationally complex. The decode stage generates only a single token and is computationally simple. If both are scheduled in the same batch, the decode stage experiences increased latency because of differences in sequence length and resource competition. This increases the overall system latency and reduces throughput.

The Prefill/Decode separation architecture solves this problem by decoupling these two stages and deploying them on different GPUs. This separation allows the system to be optimized for the different characteristics of the prefill and decode stages. It avoids resource competition, significantly reduces the average time per output token (TPOT), and improves system throughput.
RoleBasedGroup
RoleBasedGroup (RBG) is a new workload designed by the Alibaba Cloud Container Service for Kubernetes (ACK) team to address the challenges of large-scale deployment and O&M of the Prefill/Decode separation architecture in Kubernetes clusters. This project is open source. For more information, see the RBG GitHub.
The RBG API design is shown in the following figure. A group consists of a set of roles, and each role can be built on a StatefulSet, Deployment, or LeaderWorkerSet (LWS). Its core features are as follows:
Flexible multi-role definition: RBG lets you define any number of roles with any names. It supports defining dependencies between roles, allowing them to start in a specified order. It also supports elastic scaling at the role level.
Runtime: It provides automatic service discovery within the group. It supports multiple restart policies, rolling updates, and gang scheduling.

Prerequisites
An ACK managed cluster running Kubernetes 1.22 or later with at least 6 GPUs, where each GPU has at least 32 GB of memory. For more information, see Create an ACK managed cluster and Add a GPU node to a cluster.
The ecs.ebmgn8is.32xlarge instance type is recommended. For more information about instance types, see ECS Bare Metal Instance families.
Install the ack-rbgs component as follows.
Log on to the Container Service Management Console. In the left-side navigation pane, select Cluster List and click the name of the target cluster. On the cluster details page, install the ack-rbgs component using Helm. You do not need to configure the Application Name or Namespace. Click Next, and in the Confirm dialog box that appears, click Yes to use the default application name (ack-rbgs) and namespace (rbgs-system). Then select the latest Chart version and click OK to complete the installation.

Model deployment
The following sequence diagram shows the request lifecycle in the Dynamo PD disaggregation architecture:
Request ingestion: The user's request is first sent to the processor component. The router within the processor selects an available decode worker and forwards the request to it.
Prefill decision: The decode worker determines whether the prefill computation should be performed locally or delegated to a remote prefill worker. If remote computation is required, it sends a prefill request to the prefill queue.
Prefill execution: A prefill worker retrieves the request from the queue and executes the prefill computation.
KV cache transfer: Once the computation is complete, the prefill worker transfers the resulting KV cache to the designated decode worker, which then proceeds with the decode stage.

Step 1: Prepare the Qwen3-32B model files
Run the following command to download the Qwen3-32B model from ModelScope.
If the git-lfs plugin is not installed, run yum install git-lfs or apt-get install git-lfs to install it. For more installation methods, see Installing Git Large File Storage.

git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/Qwen3-32B.git
cd Qwen3-32B/
git lfs pull

Log on to the OSS console and record the name of your bucket. If you haven't created one, see Create buckets. Create a directory in Object Storage Service (OSS) and upload the model to it.
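Because the clone runs with GIT_LFS_SKIP_SMUDGE=1, the weight shards are only materialized by the final git lfs pull; if that step is skipped, the checkout contains ~130-byte LFS pointer stubs instead of real files, and those stubs would be uploaded to OSS. The following sketch demonstrates a quick pointer check against a mock directory (the /tmp/lfs-demo path and file name are illustrative):

```shell
# LFS pointer stubs are ~130 bytes; fully downloaded shards are gigabytes.
# Create a mock pointer file to show what the check catches.
mkdir -p /tmp/lfs-demo
printf 'version https://git-lfs.github.com/spec/v1\n' > /tmp/lfs-demo/model-00001.safetensors
# Any file printed here is still an undownloaded pointer:
find /tmp/lfs-demo -name '*.safetensors' -size -1024c -print
# Run the same check against your real checkout before uploading:
#   find Qwen3-32B -name '*.safetensors' -size -1024c -print
```

If the check against your real checkout prints nothing, every shard was downloaded in full.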
For more information about how to install and use ossutil, see Install ossutil.
ossutil mkdir oss://<your-bucket-name>/Qwen3-32B
ossutil cp -r ./Qwen3-32B oss://<your-bucket-name>/Qwen3-32B

Create a persistent volume (PV) named llm-model and a persistent volume claim (PVC) for your cluster. For detailed instructions, see Create a PV and a PVC.

Example using console
Create a PV
Log on to the ACK console. In the navigation pane on the left, click Clusters.
On the Clusters page, find the cluster you want and click its name. In the left navigation pane, choose .
In the upper-right corner of the Persistent Volumes page, click Create.
In the Create PV dialog box, configure the parameters that are described in the following table.
The following table describes the basic configuration of the sample PV:
Parameter
Description
PV Type
In this example, select OSS.
Volume Name
In this example, enter llm-model.
Access Certificate
Configure the AccessKey ID and AccessKey secret used to access the OSS bucket.
Bucket ID
Select the OSS bucket you created in the preceding step.
OSS Path
Enter the path where the model is located, such as /Qwen3-32B.
Create a PVC
On the Clusters page, find the cluster you want and click its name. In the left navigation pane, choose .
In the upper-right corner of the Persistent Volume Claims page, click Create.
In the Create PVC dialog box, configure the parameters that are described in the following table.
The following table describes the basic configuration of the sample PVC.
Configuration Item
Description
PVC Type
In this example, select OSS.
Name
In this example, enter llm-model.
Allocation Mode
In this example, select Existing Volumes.
Existing Volumes
Click the Select PV hyperlink and select the PV that you created.
Example using kubectl
Use the following YAML template to create a file named llm-model.yaml, containing configurations for a Secret, a static PV, and a static PVC.

apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: <your-oss-ak> # The AccessKey ID used to access the OSS bucket.
  akSecret: <your-oss-sk> # The AccessKey secret used to access the OSS bucket.
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name> # The bucket name.
      url: <your-bucket-endpoint> # The endpoint, such as oss-cn-hangzhou-internal.aliyuncs.com.
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: <your-model-path> # In this example, the path is /Qwen3-32B/.
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model

Create the Secret, static PV, and static PVC.
kubectl create -f llm-model.yaml
Step 2: Install etcd and NATS services
The Dynamo framework relies on two key external services: etcd for service discovery and NATS for messaging. Specifically, Dynamo uses NIXL for cross-node communication, which registers with etcd to discover other nodes. NATS is used as the message bus between the prefill and decode workers. Therefore, both etcd and NATS must be deployed before starting the inference service.
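In the Dynamo documentation, workers are typically pointed at these services through the ETCD_ENDPOINTS and NATS_SERVER environment variables. The endpoint values below are assumptions based on common in-cluster Service names; adjust them to match the Services defined in your etcd.yaml and nats.yaml:

```shell
# Point Dynamo workers at the in-cluster etcd and NATS endpoints.
# (Service names "etcd" and "nats" are illustrative assumptions.)
export ETCD_ENDPOINTS="http://etcd:2379"
export NATS_SERVER="nats://nats:4222"
echo "etcd endpoint: $ETCD_ENDPOINTS"
echo "NATS endpoint: $NATS_SERVER"
```

In a Kubernetes deployment these would normally be set in the worker Pod spec rather than exported interactively.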
Create a file named etcd.yaml.

Deploy the etcd service.

kubectl apply -f etcd.yaml

Create a file named nats.yaml.

Deploy the NATS service.

kubectl apply -f nats.yaml
Step 3: Deploy the Dynamo PD-disaggregated inference service
This topic uses an RBG to deploy a 2 prefill, 1 decode (2P1D) Dynamo service. Both the prefill and decode roles will use a Tensor Parallelism (TP) size of 2. Deployment architecture:

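As a quick sanity check, the GPU requirement of this layout follows directly from the worker counts and the TP size stated above:

```shell
# 2 prefill workers + 1 decode worker, each sharded across TP=2 GPUs.
PREFILL_WORKERS=2; DECODE_WORKERS=1; TP=2
echo "Total GPUs required: $(( (PREFILL_WORKERS + DECODE_WORKERS) * TP ))"
```

The result matches the six-GPU prerequisite listed at the start of this topic.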
Create a file named dynamo-configs.yaml that defines a ConfigMap storing the Dynamo and Qwen3 model configurations.

kubectl apply -f dynamo-configs.yaml

Prepare the Dynamo runtime image.
Follow the instructions in the Dynamo community to build or pull an image with vLLM as the inference framework.
Create a file named dynamo.yaml to define the RBG. Make sure to replace the placeholder with your Dynamo runtime image address.

Deploy the service.
kubectl apply -f ./dynamo.yaml
Step 4: Validate the inference service
Establish port forwarding between the inference service and your local environment for testing.
Important: Port forwarding established by kubectl port-forward lacks production-grade reliability, security, and scalability. It is suitable for development and debugging only and should not be used in production environments. For production-ready network solutions in Kubernetes clusters, see Ingress management.

kubectl port-forward svc/dynamo-service 8000:8000

Expected output:

Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000

Send a sample request to the model inference service.
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "qwen","messages": [{"role": "user","content": "Let'\''s test it"}],"stream":false,"max_tokens": 30}'

Expected output:
{"id":"31ac3203-c5f9-4b06-a4cd-4435a78d3b35","choices":[{"index":0,"message":{"content":"<think>\nOkay, the user sent 'Let's test it'. I need to confirm their intent first. They might be testing my response speed or functionality, or maybe they want to","refusal":null,"tool_calls":null,"role":"assistant","function_call":null,"audio":null},"finish_reason":"length","logprobs":null}],"created":1753702438,"model":"qwen","service_tier":null,"system_fingerprint":null,"object":"chat.completion","usage":null}

A successful JSON response indicates that your Dynamo PD inference service is running correctly.
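A note on shell quoting: an apostrophe such as the one in "Let's" terminates a single-quoted -d argument and breaks the curl command. The sketch below builds the body with a here-doc instead, validates it locally, and shows one way to extract just the generated text from a response (the sample RESPONSE string is illustrative, not real service output):

```shell
# Build the request body with a quoted here-doc so apostrophes need no escaping.
BODY=$(cat <<'EOF'
{"model": "qwen", "messages": [{"role": "user", "content": "Let's test it"}],
 "stream": false, "max_tokens": 30}
EOF
)
# Validate the body locally before sending it.
echo "$BODY" | python3 -m json.tool >/dev/null && echo "request body: valid JSON"
# Then send it through the port-forward:
#   curl http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" -d "$BODY"
# Extract only the assistant text from a response:
RESPONSE='{"choices":[{"index":0,"message":{"content":"Hello from the model"}}]}'
echo "$RESPONSE" | python3 -c 'import sys, json; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
```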
References
Configure auto scaling for LLM inference services
LLM workloads often fluctuate, leading to either over-provisioned resources or poor performance during traffic spikes. The Kubernetes Horizontal Pod Autoscaler (HPA), integrated with ack-alibaba-cloud-metrics-adapter, solves this by:
Automatically scaling your pods based on real-time GPU, CPU, and memory utilization.
Allowing you to define custom metrics for more sophisticated scaling triggers.
Ensuring high availability during peak demand while reducing costs during idle periods.
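For a concrete picture, the following is a minimal HPA sketch for scaling a decode role. The workload name (decode), the metric name (DCGM_FI_DEV_GPU_UTIL), and the threshold are hypothetical placeholders, not values from this guide; the metrics actually available depend on how ack-alibaba-cloud-metrics-adapter is configured:

```shell
# Write an illustrative HPA manifest (names and thresholds are hypothetical).
cat > decode-hpa.yaml <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: decode-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: decode               # hypothetical decode-role workload name
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: External
      external:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL   # hypothetical adapter-exposed metric
        target:
          type: AverageValue
          averageValue: "80"
EOF
echo "HPA sketch written to decode-hpa.yaml"
```

Apply such a manifest with kubectl apply only after confirming the target workload and metric names match your deployment.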
Accelerate model loading with Fluid distributed caching
Large model files (>10 GB) stored in services like OSS or File Storage NAS can cause slow pod startups (cold starts) due to long download times. Fluid solves this problem by creating a distributed caching layer across your cluster's nodes. This significantly accelerates model loading in two key ways:
Accelerated data throughput: Fluid pools the storage capacity and network bandwidth of all nodes in the cluster. This creates a high-speed, parallel data layer that overcomes the bottleneck of pulling large files from a single remote source.
Reduced I/O latency: By caching model files directly on the compute nodes where they are needed, Fluid provides applications with local, near-instant access to data. This optimized read mechanism eliminates the long delays associated with network I/O.