By SGLang RBG team and Mooncake team

Currently, LLM inference architecture is evolving from monolithic patterns to distributed systems, with mainstream approaches including Prefill-Decode (PD) Disaggregation, Attention-FFN (AF) Disaggregation, and Externalized KVCache. The fundamental driver behind this evolution is memory pressure caused by expanding model scales: In long-context or high-concurrency scenarios, KVCache often occupies over 70% of GPU memory. Relying solely on GPU HBM and CPU DRAM has become unsustainable. Decoupling and externalizing KVCache not only breaks through storage capacity bottlenecks but also enables critical capabilities such as cross-request cache sharing, elastic scaling, and fault isolation. Especially in machine-driven token consumption scenarios like RAG, AI Agents, and long text generation—where prompt templating and reusability are the norm—externalized KVCache has become essential for ensuring low latency, high throughput, and cost-effectiveness.
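To make the memory pressure concrete: KVCache per token is roughly 2 (K and V) × layers × KV heads × head dimension × bytes per element. A back-of-envelope sketch with illustrative model dimensions (the numbers here are assumptions for a Llama-70B-class model, not figures from this article):

```python
def kvcache_bytes_per_token(num_layers: int, num_kv_heads: int,
                            head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KVCache one token occupies: a K and a V tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Illustrative config: 80 layers, 8 KV heads (GQA) of dim 128, FP16.
per_token = kvcache_bytes_per_token(80, 8, 128, 2)   # 327,680 bytes per token
ctx_128k_gib = per_token * 131072 / 2**30            # one 128K-context request
print(per_token, ctx_128k_gib)                       # 327680 40.0
```

A single 128K-context request at these dimensions needs ~40 GiB of KVCache, which is why HBM plus DRAM alone cannot keep up under high concurrency.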
Mooncake, as an industry-leading distributed KVCache storage engine, was specifically designed to address this challenge. It provides high-throughput, low-latency distributed KVCache services for inference frameworks like SGLang through dedicated cache clusters.
However, operating a distributed KVCache system like Mooncake in production, while sustaining stable high performance, still introduces new challenges.
To fundamentally address these pain points, RoleBasedGroup (RBG) was created. As a Kubernetes-native API for AI inference, RBG orchestrates multi-role collaboration by treating Mooncake caches and SGLang inference nodes as different roles of the same service, managing their deployment, upgrades, and elasticity in a unified manner. Leveraging RBG's in-place upgrade and topology-aware capabilities, it minimizes cache loss while keeping upgrade, scheduling, and scaling strategies consistent between compute and cache components, thereby maximizing performance while guaranteeing production stability and operational efficiency.
This article aims to systematically illustrate how to implement production-grade externalized KVCache capabilities by using Mooncake Store as a complementary role in SGLang PD-disaggregated inference services orchestrated by RBG.
Project URL: https://github.com/kvcache-ai/Mooncake
Mooncake serves as the high-performance distributed L3 storage backend for SGLang HiCache (Hierarchical Cache), enabling cross-machine KVCache sharing via RDMA and breaking through single-machine GPU/CPU cache capacity limitations.
Components:
● Master Service: Manages cluster storage pools, metadata, and node lifecycle
● Store Service: Provides distributed cache storage, supporting multi-replica, striped transmission, and hotspot load balancing
Features:
● RDMA acceleration + zero-copy mechanism for high-bandwidth, low-latency data access
● Intelligent prefetching and GPU direct transfer to maximize I/O efficiency
● Support for PD-disaggregated architecture to enhance token throughput in large-scale clusters
Quick Start:
# Start Master
mooncake_master --http_metadata_server_port=9080
# Start Store Service (configure RDMA devices and memory pool)
python -m mooncake.mooncake_store_service --config=config.json
# Start SGLang (enable Mooncake backend)
python -m sglang.launch_server \
--enable-hierarchical-cache \
--hicache-storage-backend mooncake \
--model-path <model_path>
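Conceptually, what the Mooncake backend gives SGLang is a key-value store keyed by token-prefix hashes, so any request sharing a prefix (for example, a common prompt template) can reuse the cached KV blocks. The following is a toy in-memory stand-in that illustrates that semantics only; `PrefixKVStore` is not the Mooncake API (see the Mooncake docs for the real client interface):

```python
import hashlib

class PrefixKVStore:
    """Toy stand-in for an external KVCache: maps prefix-hash -> KV block."""
    def __init__(self):
        self._blocks = {}

    @staticmethod
    def key(token_ids):
        # Key a block by a hash of the whole token prefix, so any request
        # sharing that prefix hits the same cache entry.
        return hashlib.sha256(repr(list(token_ids)).encode()).hexdigest()

    def put(self, token_ids, kv_block):
        self._blocks[self.key(token_ids)] = kv_block

    def get(self, token_ids):
        return self._blocks.get(self.key(token_ids))

store = PrefixKVStore()
store.put([1, 2, 3], kv_block=b"...")      # request A writes its prefix
print(store.get([1, 2, 3]) is not None)    # request B, same prefix: True (hit)
print(store.get([1, 2, 4]) is None)        # different prefix: True (miss)
```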
Project Repo: https://github.com/sgl-project/rbg
Large model inference is evolving into “the most expensive microservice”: it requires both the ultimate performance of HPC clusters and the agile elasticity of cloud-native systems, and current production environments face fundamental challenges on both fronts.
Fundamental Contradiction: Traditional microservices are designed for stateless, weak-topology scenarios, whereas large model inference is a stateful application with strong state, topology awareness, and extreme performance requirements.
RBG originates from the SGLang community and is jointly contributed by Xiaohongshu, Siflow, Alibaba Cloud, and Nanjing University. Its core objective is to build a management paradigm tailored to LLM inference characteristics by using “Role” as the atomic unit for scheduling and orchestration while balancing performance and stability.
RBG views an inference service as a topologized, stateful, and collaborative “Role Organism” rather than a collection of isolated Deployments. Based on this philosophy, RBG proposes the SCOPE core capability framework for production environments:
● S – Stable: Topology-aware deterministic operations
● C – Coordination: Cross-role coordination policy engine
● O – Orchestration: Role and service discovery with orchestration semantics
● P – Performance: Topology-aware high-performance scheduling
● E – Extensible: Declarative abstraction for the future
Stability is the cornerstone of RBG. By injecting a globally unique RoleID into each Pod and following the “Minimum Replacement Domain” principle, RBG ensures that operational actions are completed within the original hardware topology scope (GPU-NVLink domain, NUMA node, etc.), minimizing performance jitter caused by topology drift.
roles:
- name: prefill
  replicas: 3
  rolloutStrategy:
    rollingUpdate:
      type: InplaceIfPossible
      maxUnavailable: 1
RBG incorporates a built-in declarative coordination engine that precisely defines inter-role dependencies through the Coordination mechanism:
● Deployment Coordination: For example, Prefill and Decode are scheduled in pairs and become ready in groups at specific ratios;
● Upgrade Coordination: Supports “proportional protocol” upgrades to ensure version consistency across roles and avoid protocol incompatibility caused by partial upgrades;
● Fault Coordination: Predefined linkage strategies trigger automatic remediation or migration of associated roles when a role fails;
● Scaling Coordination: Adjusts instances in groups according to role relationship ratios during scaling to maintain stable throughput and latency performance.
This fine-grained coordination capability manages complex distributed inference services as a unified lifecycle entity, significantly reducing operational complexity.
# Example: Collaborative upgrade of Prefill and Decode roles in PD-disaggregated architecture
coordination:
- name: prefill-decode-co-update
  type: RollingUpdate
  roles:
  - prefill
  - decode
  strategy:
    maxUnavailable: 5%
    maxSkew: 1%   # Maximum deviation in new-version ratio between the two roles during the rollout
    partition: 20%
roles:
- name: prefill
  replicas: 200
  template: ...
- name: decode
  replicas: 100
  template: ...
RBG explicitly defines role dependencies and precise startup order to achieve orchestrated management. More critically, it provides topology-aware built-in service discovery, injecting complete topology information (role IPs, attributes, relationships, etc.) into environment variables or configuration files during Pod startup.
Inference engines (SGLang, vLLM, etc.) can directly read topology views from local configurations without relying on external service discovery systems like etcd or Consul, making services more self-contained across environment migrations and significantly reducing integration complexity.
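An engine-side consumer of this topology view only needs to parse a local file or environment variable at startup. The sketch below illustrates the pattern; the variable name `ROLE_TOPOLOGY`, file name `ROLE_TOPOLOGY_FILE`, and the JSON layout are hypothetical placeholders, not RBG's actual injected names:

```python
import json, os

def load_topology():
    """Read the role-topology view RBG-style service discovery would inject
    at Pod startup: prefer a mounted config file, fall back to an env var."""
    path = os.environ.get("ROLE_TOPOLOGY_FILE", "/etc/rbg/topology.json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return json.loads(os.environ.get("ROLE_TOPOLOGY", "{}"))

# Simulate what the orchestrator would have injected before the engine starts.
os.environ["ROLE_TOPOLOGY_FILE"] = "/nonexistent/topology.json"  # force env fallback
os.environ["ROLE_TOPOLOGY"] = json.dumps(
    {"prefill": ["10.0.0.1", "10.0.0.2"], "decode": ["10.0.0.3"]})
topo = load_topology()
print(sorted(topo))  # ['decode', 'prefill']
```

Because the view is local, the engine stays self-contained: no etcd or Consul client, no external lookup on the startup path.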
Single-request latency and throughput are highly dependent on hardware topology and resource affinity. RBG introduces topology-aware binning strategies, supporting multi-dimensional performance optimization:
● GPU topology priority (e.g., GPU-NVLink > PCIe > RDMA > VPC);
● Affinity and anti-affinity constraints between roles;
● Layout balance for instances of the same role;
● Short-circuit read optimization after deployment.
Through these constraints and strategies, RBG can, during large-scale deployments, adhere to optimal hardware topology as closely as possible without sacrificing stability, thereby ensuring critical performance metrics such as TTFT and TPOT.
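The link-preference order listed above (GPU-NVLink > PCIe > RDMA > VPC) can be encoded as a simple placement score. This scoring scheme is illustrative only, not RBG's actual scheduler implementation:

```python
# Encode the article's link-preference order as descending scores.
LINK_SCORE = {"nvlink": 3, "pcie": 2, "rdma": 1, "vpc": 0}

def best_placement(candidates):
    """Pick the candidate node whose link to its peer role is fastest."""
    return max(candidates, key=lambda c: LINK_SCORE[c["link"]])

candidates = [
    {"node": "n1", "link": "vpc"},
    {"node": "n2", "link": "rdma"},
    {"node": "n3", "link": "nvlink"},
]
print(best_placement(candidates)["node"])  # n3
```

A real scheduler would combine such a topology score with the affinity, anti-affinity, and balance constraints listed above rather than using link type alone.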
RBG decouples “role relationship definition” from “deployment/model management/elasticity policies” through declarative APIs (RBG, RBGs, EngineRuntimeProfile, etc.) and a plugin-based mechanism.
When the community evolves new architectures (such as new routing layer patterns, disaggregated architectures, etc.), no modification to RBG code is required. New role templates and relationships can be quickly implemented simply by defining them through YAML. This platform design of “declarative API + plugin mechanism” significantly shortens the time-to-production cycle for new architectures.
Through Kubernetes-native APIs, RBG provides a unified hosting layer for large model inference services that is Stable, Coordinated, Orchestrated, High-Performance, and Extensible—a novel deployment and operational abstraction for modern LLM inference workloads.

Through RoleBasedGroup, a highly available and elastic SGLang PD-disaggregated inference system can be deployed. The system consists of the following core roles:
● SGLang Model Gateway: Acts as a unified request entry point and traffic scheduler, responsible for receiving user inference requests and intelligently selecting appropriate Prefill and Decode nodes for processing based on load status, context length, and model configuration.
● Prefill Serving Backend: Dedicated to processing forward computation of prompts, generating initial KVCache; typically compute-intensive and sensitive to memory bandwidth.
● Decode Serving Backend: Focuses on token-by-token decoding during autoregressive generation, relying on generated KVCache for efficient inference; extremely sensitive to cache access latency.
● Mooncake Master/Store: As an independent external KVCache storage role, it provides high-throughput, low-latency distributed caching services, persistently storing Key-Value Cache for all inference sessions. It not only breaks through capacity limitations of single-GPU HBM and CPU DRAM but also supports cross-request cache reuse and fine-grained cache eviction policies (e.g., LRU + high-watermark eviction).
These roles do not operate in isolation but are tightly integrated through RBG’s native multi-role collaboration capabilities. Additionally, EngineRuntime, as a Sidecar injected by RBG into engine service Pods, serves as a bridge between the inference engine and the upper-level orchestration system, providing critical runtime capabilities such as service registration and metadata reporting, dynamic LoRA loading/unloading, traffic state control, and observability integration.
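The “LRU + high-watermark eviction” policy mentioned for Mooncake above can be sketched in a few lines. This is a toy illustration of the policy's shape (watermark values and class names are assumptions, not Mooncake's implementation):

```python
from collections import OrderedDict

class HighWatermarkLRU:
    """Toy sketch of LRU + high-watermark eviction: once usage exceeds the
    high watermark, evict least-recently-used entries until usage falls
    back to the low watermark (the hysteresis gap avoids eviction thrash)."""
    def __init__(self, capacity, high=0.9, low=0.7):
        self.capacity, self.high, self.low = capacity, high, low
        self._cache = OrderedDict()  # key -> entry size, oldest first
        self._used = 0

    def put(self, key, size):
        self._cache[key] = size
        self._cache.move_to_end(key)
        self._used += size
        if self._used > self.high * self.capacity:
            while self._cache and self._used > self.low * self.capacity:
                _, evicted = self._cache.popitem(last=False)  # drop LRU entry
                self._used -= evicted

    def get(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)  # refresh recency on hit
            return True
        return False

cache = HighWatermarkLRU(capacity=100)
for i in range(10):
    cache.put(f"block{i}", size=10)      # the 10th put crosses the 90% watermark
print(cache.get("block0"), cache._used)  # False 70  (oldest blocks evicted to 70%)
```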
● Install RBG:
● https://github.com/sgl-project/rbg/blob/main/doc/install.md
● How to build docker image in Appendix 8.1
● Deployment
After preparing the container images, you can deploy a SGLang PD-disaggregated inference service with KVCache offloading capabilities using the following YAML file: https://github.com/sgl-project/rbg/blob/main/examples/mooncake/pd-disaggregated-with-mooncake.yaml
For explanations of environment variables involved in the YAML, please refer to: https://github.com/kvcache-ai/Mooncake/blob/main/docs/source/design/mooncake-store.md
● Check deployment results:
kubectl get pods -l rolebasedgroup.workloads.x-k8s.io/name=sglang-pd-with-mooncake-demo
NAME READY STATUS RESTARTS AGE
sglang-pd-with-mooncake-demo-router-0 1/1 Running 0 71s
sglang-pd-with-mooncake-demo-prefill-0 1/1 Running 0 3m42s
sglang-pd-with-mooncake-demo-decode-0 1/1 Running 0 3m42s
sglang-pd-with-mooncake-demo-mooncake-master-0 1/1 Running 0 4m2s
sglang-pd-with-mooncake-demo-mooncake-store-bh9xs 1/1 Running 0 3m42s
sglang-pd-with-mooncake-demo-mooncake-store-dsrv4 1/1 Running 0 3m42s
sglang-pd-with-mooncake-demo-mooncake-store-tqjvt 1/1 Running 0 3m42s
● Check network and location information for one instance of the Mooncake Store role:
kubectl get pods sglang-pd-with-mooncake-demo-mooncake-store-dsrv4 -o jsonpath='{.spec.nodeName}'
kubectl get pods sglang-pd-with-mooncake-demo-mooncake-store-dsrv4 -o jsonpath='{.status.podIP}'
Multi-turn conversation scenario tests demonstrate that multi-level cache architecture is crucial for improving KVCache hit rates and inference performance:
● Baseline (GPU Memory Only): cache hit rate only 2.22%, average TTFT 5.91s, P90 12.16s; system throughput limited, input-token throughput only 6576.86 token/s
● L2 DRAM HiCache: hit rate increased to 40.62%, average TTFT reduced to 3.77s (↓36.2%), P90 reduced to 10.88s, input-token throughput increased to 10054.21 token/s (↑52.89%)
● L3 Mooncake Cache: hit rate further jumped to 64.67%, average TTFT reduced to 2.58s (↓56.3%), P90 significantly improved to 6.97s (↓42.7%), input-token throughput increased to 15022.80 token/s (↑49.41%)
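The relative improvements quoted above follow directly from the absolute numbers; a quick arithmetic check:

```python
def pct_drop(base, new):
    """Percentage reduction of `new` relative to `base`."""
    return round((base - new) / base * 100, 1)

def pct_gain(base, new):
    """Percentage increase of `new` relative to `base`."""
    return round((new - base) / base * 100, 1)

# TTFT and throughput deltas from the figures quoted above.
print(pct_drop(5.91, 3.77))          # 36.2  (avg TTFT, L2 DRAM HiCache vs baseline)
print(pct_drop(5.91, 2.58))          # 56.3  (avg TTFT, L3 Mooncake vs baseline)
print(pct_drop(12.16, 6.97))         # 42.7  (P90 TTFT, L3 Mooncake vs baseline)
print(pct_gain(10054.21, 15022.80))  # 49.4  (input-token throughput, L3 vs L2)
```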

Overall Service Throughput Metrics in Multi-Turn Conversation Test Scenario


KVCache Hit Rates and Corresponding TTFT Metrics in Multi-Turn Conversation Test Scenario
Test details can be found in Appendix 8.2
Since the transfer-engine built into Mooncake must maintain strict version consistency with the transfer-engine in SGLang Serving Backend (Prefill/Decode) to ensure KVCache transport protocol compatibility, Mooncake needs to be updated simultaneously when the inference engine is upgraded.
However, as a stateful caching service, Mooncake’s KVCache data typically resides only in memory. During traditional Kubernetes rolling updates, when old Pods are terminated, their in-memory cache data is immediately lost; after new Pods start, they need to go through rescheduling and recreation processes. This forces all active inference sessions that depend on that node’s cache to be interrupted and must re-execute the complete Prefill computation—a process that not only incurs enormous computational overhead but also triggers:
● P99 first-token latency spikes (skyrocketing from seconds to tens of seconds);
● System throughput cliff caused by large numbers of requests queuing for Prefill;
● Severe user experience jitter, destroying service stability in production environments.
Solution: Mooncake Cache Local Persistence + RBG In-Place Upgrade:
● Mooncake Cache Local Persistence: Since Mooncake community PR#1031, Mooncake supports persisting KVCache metadata and hot-data snapshots in node shared memory and on local disks (or high-performance NVMe), ensuring rapid cache-state recovery after process restarts and avoiding Prefill recomputation caused by cache invalidation;
● RBG In-Place Upgrade: Through RBG’s fine-grained role control capabilities, when upgrading Mooncake roles, Pod reconstruction is avoided. Instead, container images are replaced in-place while reusing the node’s local disk or shared memory, thereby preserving persisted cache data and achieving “seamless” version switching.
The combination of these two approaches allows KVCache state to persist during joint upgrades of Serving Backend and Mooncake, preventing active sessions from falling back to the Prefill stage, effectively avoiding latency spikes and throughput drops, and ensuring end-to-end stability and high availability of large model inference services during version iterations.
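The flow can be pictured as snapshot-on-shutdown plus restore-on-start against node-local storage that outlives the container. The sketch below is schematic; the file name, layout, and use of JSON are illustrative assumptions, not Mooncake's actual persistence format from PR#1031:

```python
import json, os, tempfile

# Stand-in for node-local storage (in the real system: shared memory or NVMe,
# which an in-place upgrade reuses because the Pod stays on the same node).
SNAPSHOT = os.path.join(tempfile.gettempdir(), "mooncake_meta_demo.json")

def snapshot(metadata: dict) -> None:
    """On graceful shutdown: flush the cache index to node-local storage."""
    with open(SNAPSHOT, "w") as f:
        json.dump(metadata, f)

def restore() -> dict:
    """On restart after an in-place upgrade: the snapshot is still on the
    same node, so the cache index is recovered instead of forcing active
    sessions back through Prefill."""
    if os.path.exists(SNAPSHOT):
        with open(SNAPSHOT) as f:
            return json.load(f)
    return {}  # cold start: no snapshot on this node

snapshot({"prefix_abc": {"size": 4096, "replicas": 2}})
recovered = restore()
print("prefix_abc" in recovered)  # True
```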

In other words, RBG not only solves the complexity of multi-role collaborative deployment but also transforms the industry challenge of “smooth evolution of stateful caching services” into standardized, automatable operational capabilities through in-place upgrades, truly achieving the production-grade goal of “upgrade without perception, service without jitter.”
We performed an engine version update on the just-deployed service, upgrading from v0.5.5 to v0.5.6:
kubectl patch rolebasedgroup sglang-pd-with-mooncake-demo \
--type='json' \
-p='[{"op": "replace", "path": "/spec/roles/1/template/spec/containers/0/image", "value": "lmsysorg/sglang:v0.5.6"}]'
Checking Pod status shows that after the image update, each Mooncake Store Pod recorded exactly one container restart; the Pods themselves were not recreated.
kubectl get pods -l rolebasedgroup.workloads.x-k8s.io/name=sglang-pd-with-mooncake-demo -owide
NAME READY STATUS RESTARTS AGE
sglang-pd-with-mooncake-demo-decode-0 1/1 Running 0 7m4s
sglang-pd-with-mooncake-demo-mooncake-master-0 1/1 Running 0 7m24s
sglang-pd-with-mooncake-demo-mooncake-store-bh9xs 1/1 Running 1 7m4s
sglang-pd-with-mooncake-demo-mooncake-store-dsrv4 1/1 Running 1 7m4s
sglang-pd-with-mooncake-demo-mooncake-store-tqjvt 1/1 Running 1 7m4s
sglang-pd-with-mooncake-demo-prefill-0 1/1 Running 0 7m4s
sglang-pd-with-mooncake-demo-router-0 1/1 Running 0 4m33s
The restart reason can be confirmed by checking Pod events:
kubectl describe pods sglang-pd-with-mooncake-demo-mooncake-store-dsrv4
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 27m default-scheduler Successfully assigned default/sglang-pd-with-mooncake-demo-mooncake-store-dsrv4 to cn-beijing.10.134.25.229
Normal AllocIPSucceed 27m terway-daemon Alloc IP 10.134.25.238/16 took 584.019653ms
Normal Created 27m kubelet Created container: store
Normal Pulled 27m kubelet Container image "lmsysorg/sglang:v0.5.5" already present on machine
Normal Started 27m kubelet Started container store
Normal Killing 21m kubelet Container store definition changed, will be restarted
By checking the status of the restarted Mooncake instance, we can see that the Pod’s network and topology information did not change after the in-place upgrade. Combined with Mooncake’s cache persistence capability, this ensures that the KVCache before restart was not lost and was successfully recovered as expected after the in-place upgrade.
kubectl get pods sglang-pd-with-mooncake-demo-mooncake-store-dsrv4 -o jsonpath='{.spec.nodeName}'
kubectl get pods sglang-pd-with-mooncake-demo-mooncake-store-dsrv4 -o jsonpath='{.status.podIP}'
This article systematically explains how to build a production-grade, stable, high-performance PD-disaggregated inference service through the collaborative design of RoleBasedGroup (RBG) and Mooncake. The conclusions are as follows:
● RBG Redefines the Orchestration Paradigm for LLM Inference Services: By making multi-role collaboration (PD disaggregation, Mooncake caching) and topology-aware scheduling first-class citizens, RBG not only solves the complexity of distributed deployment but also conquers the industry challenge of “smooth evolution of stateful caching services” through its in-place upgrade capability, achieving the production-grade goal of “upgrade without perception, service without jitter.”
● Mooncake Unlocks Infinite Possibilities for KVCache: As an L3 cache layer, Mooncake, through distributed memory pools and RDMA acceleration, significantly increases cache hit rates, reduces TTFT by 56.3%, improves P90 latency by 42.7%, and simultaneously raises GPU average utilization from below 30% to a sustainably elastic level, truly balancing performance and cost.
● Hierarchical Cache Architecture is the Inevitable Path for Long-Context Inference: The three-level cache system from GPU HBM → DRAM → Mooncake has proven its effectiveness in benchmarks. Especially in machine-driven scenarios like multi-turn conversations, RAG, and AI Agents, the diminishing marginal cost effect from cache reuse becomes increasingly significant.
The practice of RBG + Mooncake demonstrates that only through deep integration of high-performance system design and cloud-native operational capabilities can large model inference truly evolve from “usable” to “user-friendly,” from “laboratory” to “production-grade.” We look forward to advancing this paradigm with the community to lay the foundation for next-generation AI infrastructure.
● Xiaohongshu: Sun Weixiang, Song Yang, Xiong Feng
● SGLang: Yang Yanbo
● Qujing Technology: Yang Ke
● Mooncake: Ma Teng, Cai Shangming
● Alibaba Cloud: Yizhai, Bai Cun, Dong Yun
In the example used in this article, we can directly use the official SGLang community container image lmsysorg/sglang:v0.5.5 (mooncake-transfer-engine >= 0.3.7), which already includes Mooncake-related dependencies by default. For customization needs, you can refer to the Dockerfile provided in the link to build a specific version of the Mooncake image: https://github.com/sgl-project/rbg/blob/main/examples/mooncake/Dockerfile.mooncake
| Configuration item | Configuration |
|---|---|
| Kubernetes Nodes | 2 * (8 × nvidia-h20-141gb GPUs) |
| Model | Qwen3-235B |
| Mooncake TransferEngine version | v0.3.7 |
| SGLang engine version | v0.5.6 |
| Benchmark Tools | hicache/bench_multiturn |
Using the multi-turn conversation stress-testing tool provided by HiCache to simulate multi-turn conversation scenarios, we compared the throughput and SLO metrics of inference services with L3 Mooncake + L2 HiCache enabled against those with only L2 HiCache enabled and those without HiCache, all in KVCache-reusable scenarios.
● Test Configuration
| Category | L2 Configuration (HiCache) | L3 Configuration (Mooncake) | Command |
|---|---|---|---|
| L1 GPU Only (baseline) | ❌ | ❌ | python -m sglang.launch_server --model /models/Qwen3 --tp-size 4 --mem-fraction-static 0.9 |
| L2 DRAM Hicache | 320Gi (80Gi CPU RAM/ per Rank) | ❌ | python -m sglang.launch_server --model /models/Qwen3 --tp-size 4 --mem-fraction-static 0.9 --enable-hierarchical-cache --hicache-size=80 |
| L2 DRAM Hicache + L3 Mooncake | 320Gi (80Gi CPU RAM/ per Rank) | 500Gi CPU RAM | python -m sglang.launch_server --model /models/Qwen3 --tp-size 4 --mem-fraction-static 0.9 --enable-hierarchical-cache --hicache-size=80 --hicache-storage-backend mooncake |
● Test commands
python3 benchmark/hicache/bench_multiturn.py \
--model-path /models/Qwen3-235B/Qwen3-235B-A22B \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--disable-random-sample \
--output-length 1 \
--request-length 2048 \
--num-clients 150 \
--num-rounds 10 \
--max-parallel 4 \
--request-rate 16 \
--ready-queue-policy random \
--disable-auto-run \
--enable-round-barrier
● Group records:
| Category | TTFT (Avg/P90) | Input Token Throughput | Cache Hit Rate |
|---|---|---|---|
| L1 GPU Only (baseline) | 5.91s/12.16s | 6576.86 | 2.22% |
| L2 DRAM Hicache | 3.77s/10.88s | 10054.21 | 40.62% |
| L2 DRAM Hicache + L3 Mooncake | 2.58s/6.97s | 15022.80 | 64.67% |