
Building a Production-Grade Cloud-Native Large Model Inference Platform with SGlang RBG + Mooncake

This article shows how SGLang RBG + Mooncake enable production-grade, cloud-native LLM inference with PD-disaggregation.

By the SGLang RBG team and the Mooncake team


1. Background

LLM inference architecture is currently evolving from monolithic deployments to distributed systems, with mainstream approaches including Prefill-Decode (PD) Disaggregation, Attention-FFN (AF) Disaggregation, and externalized KVCache. The fundamental driver behind this evolution is memory pressure from growing model scale: in long-context or high-concurrency scenarios, KVCache often occupies over 70% of GPU memory, and relying solely on GPU HBM and CPU DRAM has become unsustainable. Decoupling and externalizing KVCache not only breaks through storage capacity bottlenecks but also enables critical capabilities such as cross-request cache sharing, elastic scaling, and fault isolation. Especially in machine-driven token-consumption scenarios like RAG, AI Agents, and long-text generation, where prompt templating and reuse are the norm, externalized KVCache has become essential for low latency, high throughput, and cost-effectiveness.

Mooncake, as an industry-leading distributed KVCache storage engine, was specifically designed to address this challenge. It provides high-throughput, low-latency distributed KVCache services for inference frameworks like SGLang through dedicated cache clusters.

However, operating a distributed KVCache system like Mooncake at stable, high performance in production introduces new challenges:

  1. High Deployment and Operations Complexity: An inference service is no longer a single Pod but a distributed system composed of Prefill/Decode compute nodes and Mooncake cache nodes, which must be deeply coordinated in topology affinity, lifecycle, and scaling strategy. Kubernetes-native workloads (Deployment/StatefulSet) struggle to express such strong multi-role collaboration semantics, leading to cumbersome configuration, wasted resources, or degraded performance.
  2. Stability Risks During Rolling Upgrades: Cache loss during upgrades of Prefill and Mooncake instances forces active sessions to recompute the Prefill phase, causing P99 latency spikes and throughput cliffs that severely impact service stability.

RoleBasedGroup (RBG) was created to address these pain points at the root. As a Kubernetes-native API for AI inference, RBG orchestrates multi-role collaboration by treating Mooncake caches and SGLang inference nodes as different roles of the same service, managing their deployment, upgrades, and elasticity in a unified manner. Leveraging RBG’s in-place upgrade and topology-aware capabilities, it minimizes cache loss while keeping upgrade, scheduling, and scaling strategies consistent between compute and cache components, maximizing performance while guaranteeing production stability and operational efficiency.

This article aims to systematically illustrate how to implement production-grade externalized KVCache capabilities by using Mooncake Store as a complementary role in SGLang PD-disaggregated inference services orchestrated by RBG.

2. Mooncake: A Distributed KVCache Storage Engine for Large Model Inference

Project URL: https://github.com/kvcache-ai/Mooncake

Mooncake serves as the high-performance distributed L3 storage backend for SGLang HiCache (Hierarchical Cache), enabling cross-machine KVCache sharing via RDMA and breaking through single-machine GPU/CPU cache capacity limitations.

Components:

● Master Service: manages cluster storage pools, metadata, and node lifecycle

● Store Service: provides distributed cache storage, supporting multi-replica placement, striped transmission, and hotspot load balancing

Features:

● RDMA acceleration + zero-copy mechanism for high-bandwidth, low-latency data access

● Intelligent prefetching and GPU direct transfer to maximize I/O efficiency

● Support for PD-disaggregated architecture to enhance token throughput in large-scale clusters

Quick Start:

# Start Master
mooncake_master --http_metadata_server_port=9080

# Start Store Service (configure RDMA devices and memory pool)
python -m mooncake.mooncake_store_service --config=config.json

# Start SGLang (enable Mooncake backend)
python -m sglang.launch_server \
    --enable-hierarchical-cache \
    --hicache-storage-backend mooncake \
    --model-path <model_path>

3. RoleBasedGroup (RBG): An Elastic Role Orchestration Engine for Large Model Inference

Project Repo: https://github.com/sgl-project/rbg

3.1 Core Problem: Five Major Challenges in Production Deployment of Large Model Inference

Large model inference is evolving into “the most expensive microservice”—requiring both the ultimate performance of HPC clusters and the agile elasticity of cloud-native systems. Current production environments face five fundamental challenges:

  1. Rapid Architecture Iteration: Disaggregated large model inference architectures (e.g., Prefill/Decode disaggregation, multi-level Router/Gateway) evolve extremely fast, making traditional platforms that rely on fixed abstractions unable to adapt to new architectures in a timely manner.
  2. Performance Sensitivity: Critical performance metrics like TTFT (Time To First Token) and TPOT (Time Per Output Token) have sub-millisecond sensitivity to GPU topology (NVLink/PCIe), RDMA affinity, and other factors. Careless migration or improper scheduling directly inflates first-token and inter-token latency.
  3. Strong Component Dependencies: Strong dependency relationships exist between critical roles (e.g., Prefill and Decode roles require strong binding relationships like 1:1, N:1). Version upgrades and rollbacks must maintain atomicity across multiple roles; otherwise, request failures or data inconsistencies may occur.
  4. Low Operational Efficiency: Existing platforms lack a unified perspective on multi-role ensembles during operations like restarts, scaling, and fault migration. Up to 5% of daily time is consumed by manual coordination during restart/scale/upgrade processes, leading to idle GPU resource waste.
  5. Significant Resource Tides and Underutilization: Online traffic peaks and valleys often differ by more than 10x, yet statically configured inference services maintain long-term GPU average utilization below 30%, making it difficult to balance performance and cost.

Fundamental Contradiction: Traditional microservices are designed for stateless, weakly topology-dependent workloads, whereas large model inference is stateful, topology-aware, and extremely performance-sensitive.

3.2 RBG Design Philosophy: Roles as First-Class Citizens, Role Collaboration as the Core Scenario

RBG originates from the SGLang community and is jointly contributed by Xiaohongshu, Siflow, Alibaba Cloud, and Nanjing University. Its core objective is to build a management paradigm tailored to LLM inference characteristics by using “Role” as the atomic unit for scheduling and orchestration while balancing performance and stability.

RBG views an inference service as a topologized, stateful, and collaborative “Role Organism” rather than a collection of isolated Deployments. Based on this philosophy, RBG proposes the SCOPE core capability framework for production environments:

S – Stable: Topology-aware deterministic operations

C – Coordination: Cross-role coordination policy engine

O – Orchestration: Role and service discovery with orchestration semantics

P – Performance: Topology-aware high-performance scheduling

E – Extensible: Declarative abstraction for the future

3.3 SCOPE Core Capability Analysis

3.3.1 Stable: Topology-aware deterministic operations

Stability is the cornerstone of RBG. By injecting a globally unique RoleID into each Pod and following the “Minimum Replacement Domain” principle, RBG ensures that operational actions complete within the original hardware topology scope (GPU NVLink domain, NUMA node, etc.), minimizing performance jitter caused by topology drift.

roles:
- name: prefill
  replicas: 3
  rolloutStrategy:
    rollingUpdate:
      type: InplaceIfPossible  # restart containers in place, preserving Pod identity and topology
      maxUnavailable: 1        # update at most one replica at a time

3.3.2 Coordination: Cross-role coordination policy engine

RBG incorporates a built-in declarative coordination engine that precisely defines inter-role dependencies through the Coordination mechanism:

Deployment Coordination: for example, Prefill and Decode are scheduled in pairs and become ready in groups at specified ratios;

Upgrade Coordination: supports proportional, lock-step upgrades that keep versions consistent across roles, avoiding protocol incompatibility caused by partial upgrades;

Fault Coordination: predefined linkage strategies trigger automatic remediation or migration of associated roles when a role fails;

Scaling Coordination: instances are adjusted in groups according to role-relationship ratios during scaling to keep throughput and latency stable.

This fine-grained coordination capability manages complex distributed inference services as a unified lifecycle entity, significantly reducing operational complexity.

# Example: Collaborative upgrade of Prefill and Decode roles in PD-disaggregated architecture
coordination:
- name: prefill-decode-co-update
  type: RollingUpdate
  roles:
  - prefill
  - decode
  strategy:
    maxUnavailable: 5%
    maxSkew: 1% # maximum deviation in new-version ratio between the two roles during the rolling update
    partition: 20%
roles:
- name: prefill
  replicas: 200
  template: ...
- name: decode
  replicas: 100
  template: ...

3.3.3 Orchestration: Orchestrated role and service discovery

RBG explicitly defines role dependencies and precise startup order to achieve orchestrated management. More critically, it provides topology-aware built-in service discovery, injecting complete topology information (role IPs, attributes, relationships, etc.) into environment variables or configuration files during Pod startup.

Inference engines (SGLang, vLLM, etc.) can directly read topology views from local configurations without relying on external service discovery systems like etcd or Consul, making services more self-contained across environment migrations and significantly reducing integration complexity.
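To make this concrete, the sketch below shows what such an injected topology view could look like in a rendered Pod spec. It is purely illustrative: the variable names are our assumptions, not RBG’s documented contract, so consult the RBG repo for the actual injection format.

# Hypothetical environment injected into a prefill Pod at startup (names are illustrative)
env:
- name: ROLE_NAME               # which role this Pod plays
  value: "prefill"
- name: ROLE_INDEX              # stable index of this replica within the role
  value: "0"
- name: GROUP_NAME              # owning RoleBasedGroup
  value: "sglang-pd-with-mooncake-demo"
- name: DECODE_ADDRESSES        # peer-role endpoints resolved by RBG, no etcd/Consul needed
  value: "decode-0:30000,decode-1:30000"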

3.3.4 Performance: Topology-aware high-performance scheduling

Single-request latency and throughput depend heavily on hardware topology and resource affinity. RBG introduces topology-aware bin-packing strategies that support multi-dimensional performance optimization:

● GPU topology priority (e.g., GPU-NVLink > PCIe > RDMA > VPC);

● Affinity and anti-affinity constraints between roles;

● Layout balance for instances of the same role;

● Short-circuit read optimization after deployment.

Through these constraints and strategies, RBG can, during large-scale deployments, adhere to optimal hardware topology as closely as possible without sacrificing stability, thereby ensuring critical performance metrics such as TTFT and TPOT.
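As an illustration, constraints of this kind can be expressed with standard Kubernetes affinity rules inside a role’s Pod template. The snippet below is a minimal sketch under assumed labels (the role label key is illustrative, not an RBG-defined convention):

roles:
- name: decode
  replicas: 2
  template:
    spec:
      affinity:
        podAffinity:
          # prefer co-locating decode near prefill within the same zone (e.g., an RDMA domain)
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  role: prefill                       # illustrative label
              topologyKey: topology.kubernetes.io/zone
        podAntiAffinity:
          # spread replicas of the same role across nodes for layout balance
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  role: decode                        # illustrative label
              topologyKey: kubernetes.io/hostname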

3.3.5 Extensible: Deployment abstraction for change

RBG decouples “role relationship definition” from “deployment/model management/elasticity policies” through declarative APIs (RBG, RBGs, EngineRuntimeProfile, etc.) and a plugin-based mechanism.

When the community evolves new architectures (such as new routing layer patterns, disaggregated architectures, etc.), no modification to RBG code is required. New role templates and relationships can be quickly implemented simply by defining them through YAML. This platform design of “declarative API + plugin mechanism” significantly shortens the time-to-production cycle for new architectures.
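For instance, onboarding a new routing-layer pattern would in principle amount to declaring one more role in the existing spec. The sketch below is hypothetical (the role name, image, and fields are illustrative):

roles:
- name: semantic-router          # hypothetical new role, added purely through YAML
  replicas: 2
  template:
    spec:
      containers:
      - name: router
        image: example.com/semantic-router:v0.1   # illustrative image
- name: prefill
  replicas: 3
  template: ...
- name: decode
  replicas: 3
  template: ...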

Through Kubernetes-native APIs, RBG provides a unified hosting layer for large model inference services that is Stable, Coordinated, Orchestrated, High-Performance, and Extensible—a novel deployment and operational abstraction for modern LLM inference workloads.

4. Deploying PD-Disaggregated Architecture + Mooncake Inference Service with RBG

4.1 Deployment Architecture

[Figure: Deployment architecture of the SGLang PD-disaggregated inference service with Mooncake]

Through RoleBasedGroup, a highly available and elastic SGLang PD-disaggregated inference system can be deployed. The system consists of the following core roles:

SGLang Model Gateway: Acts as a unified request entry point and traffic scheduler, responsible for receiving user inference requests and intelligently selecting appropriate Prefill and Decode nodes for processing based on load status, context length, and model configuration.

Prefill Serving Backend: Dedicated to processing forward computation of prompts, generating initial KVCache; typically compute-intensive and sensitive to memory bandwidth.

Decode Serving Backend: Focuses on token-by-token decoding during autoregressive generation, relying on generated KVCache for efficient inference; extremely sensitive to cache access latency.

Mooncake Master/Store: As an independent external KVCache storage role, it provides high-throughput, low-latency distributed caching services, persistently storing Key-Value Cache for all inference sessions. It not only breaks through capacity limitations of single-GPU HBM and CPU DRAM but also supports cross-request cache reuse and fine-grained cache eviction policies (e.g., LRU + high-watermark eviction).

These roles do not operate in isolation but are tightly integrated through RBG’s native multi-role collaboration capabilities. Additionally, EngineRuntime, as a Sidecar injected by RBG into engine service Pods, serves as a bridge between the inference engine and the upper-level orchestration system, providing critical runtime capabilities such as service registration and metadata reporting, dynamic LoRA loading/unloading, traffic state control, and observability integration.
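Putting these roles together, a trimmed sketch of such a RoleBasedGroup manifest is shown below. It is illustrative only (the apiVersion is inferred from the rolebasedgroup.workloads.x-k8s.io label prefix that appears in section 4.2, and the Pod templates are elided); the linked example YAML in section 4.2 is authoritative.

apiVersion: workloads.x-k8s.io/v1alpha1   # assumed from the label prefix; verify against the repo
kind: RoleBasedGroup
metadata:
  name: sglang-pd-with-mooncake-demo
spec:
  roles:
  - name: router             # SGLang Model Gateway: unified entry and traffic scheduling
    replicas: 1
    template: ...
  - name: prefill            # prompt forward computation, produces the initial KVCache
    replicas: 1
    template: ...
  - name: decode             # token-by-token autoregressive decoding
    replicas: 1
    template: ...
  - name: mooncake-master    # Mooncake metadata and storage-pool management
    replicas: 1
    template: ...
  - name: mooncake-store     # distributed KVCache storage pool
    replicas: 3
    template: ...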

4.2 Deploying Mooncake + SGLang PD-Disaggregated Inference Service with RBG

● Install RBG: https://github.com/sgl-project/rbg/blob/main/doc/install.md

● Build the container image: see Appendix 8.1

● Deployment

After preparing the container images, you can deploy an SGLang PD-disaggregated inference service with KVCache offloading using the following YAML file: https://github.com/sgl-project/rbg/blob/main/examples/mooncake/pd-disaggregated-with-mooncake.yaml

For explanations of environment variables involved in the YAML, please refer to: https://github.com/kvcache-ai/Mooncake/blob/main/docs/source/design/mooncake-store.md
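For orientation, the Mooncake-related environment on the serving containers typically covers the master address, transfer protocol, RDMA devices, and memory-pool sizing. The excerpt below is a sketch with placeholder values; treat the linked document as the authoritative reference for variable names and semantics.

# Illustrative excerpt only; verify names and values against the Mooncake Store documentation.
env:
- name: MOONCAKE_MASTER                 # address of the Mooncake master service
  value: "sglang-pd-with-mooncake-demo-mooncake-master-0:50051"
- name: MOONCAKE_PROTOCOL               # transfer protocol, e.g. rdma or tcp
  value: "rdma"
- name: MOONCAKE_DEVICE                 # RDMA NICs used by the transfer engine
  value: "mlx5_0,mlx5_1"
- name: MOONCAKE_GLOBAL_SEGMENT_SIZE    # this node's contribution to the distributed pool
  value: "68719476736"                  # 64 GiB
- name: MOONCAKE_LOCAL_BUFFER_SIZE      # local staging buffer size
  value: "1073741824"                   # 1 GiB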

● Check deployment results:

kubectl get pods -l rolebasedgroup.workloads.x-k8s.io/name=sglang-pd-with-mooncake-demo

NAME                                                READY   STATUS    RESTARTS   AGE
sglang-pd-with-mooncake-demo-router-0               1/1     Running   0          71s
sglang-pd-with-mooncake-demo-prefill-0              1/1     Running   0          3m42s
sglang-pd-with-mooncake-demo-decode-0               1/1     Running   0          3m42s
sglang-pd-with-mooncake-demo-mooncake-master-0      1/1     Running   0          4m2s
sglang-pd-with-mooncake-demo-mooncake-store-bh9xs   1/1     Running   0          3m42s
sglang-pd-with-mooncake-demo-mooncake-store-dsrv4   1/1     Running   0          3m42s
sglang-pd-with-mooncake-demo-mooncake-store-tqjvt   1/1     Running   0          3m42s

● Check network and location information for one instance of the Mooncake Store role:

kubectl get pods sglang-pd-with-mooncake-demo-mooncake-store-dsrv4 -o jsonpath='{.spec.nodeName}'

kubectl get pods sglang-pd-with-mooncake-demo-mooncake-store-dsrv4 -o jsonpath='{.status.podIP}'

4.3 Benchmark Results: Multi-Level Cache Acceleration is Significant

Multi-turn conversation scenario tests demonstrate that multi-level cache architecture is crucial for improving KVCache hit rates and inference performance:

Baseline (GPU memory only): low cache hit rate (2.22%), average TTFT 5.91s, P90 12.16s; system throughput is limited, with input-token throughput of only 6576.86 token/s

L2 DRAM HiCache: hit rate increased to 40.62%, average TTFT reduced to 3.77s (↓36.2%), P90 reduced to 10.88s, input-token throughput increased to 10054.21 token/s (↑52.89%)

L3 Mooncake Cache: hit rate climbed further to 64.67%, average TTFT reduced to 2.58s (↓56.3%), P90 improved significantly to 6.97s (↓42.7%), input-token throughput increased to 15022.80 token/s (↑49.41%)

[Figure: Overall Service Throughput Metrics in Multi-Turn Conversation Test Scenario]

[Figures: KVCache Hit Rates and Corresponding TTFT Metrics in Multi-Turn Conversation Test Scenario]

Test details can be found in Appendix 8.2.

5. Achieving Smooth Mooncake Version Upgrades Through In-Place Upgrade Capability

Since the transfer-engine built into Mooncake must maintain strict version consistency with the transfer-engine in SGLang Serving Backend (Prefill/Decode) to ensure KVCache transport protocol compatibility, Mooncake needs to be updated simultaneously when the inference engine is upgraded.
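In RBG terms, this lock-step requirement can be captured by extending the coordination policy from section 3.3.2 to cover the Mooncake role as well, so the engine and cache images never diverge mid-rollout. A sketch under the same assumed syntax:

# Upgrade prefill, decode, and mooncake-store as one coordinated group
coordination:
- name: engine-mooncake-co-update
  type: RollingUpdate
  roles:
  - prefill
  - decode
  - mooncake-store
  strategy:
    maxUnavailable: 5%
    maxSkew: 1%   # keep the roles' new-version ratios within 1% of each other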

However, Mooncake is a stateful caching service whose KVCache data typically resides only in memory. During a traditional Kubernetes rolling update, in-memory cache data is lost the moment old Pods are terminated, and new Pods must go through rescheduling and recreation. All active inference sessions that depend on that node’s cache are interrupted and must re-execute the complete Prefill computation, a process that not only incurs enormous computational overhead but also triggers:

● P99 first-token latency spikes (skyrocketing from seconds to tens of seconds);

● System throughput cliff caused by large numbers of requests queuing for Prefill;

● Severe user experience jitter, destroying service stability in production environments.

Solution: Mooncake Cache Local Persistence + RBG In-Place Upgrade:

Mooncake Cache Local Persistence: with Mooncake community PR#1031, Mooncake can persist KVCache metadata and hot-data snapshots to node shared memory and local disks (or high-performance NVMe), enabling rapid cache-state recovery after a process restart and avoiding Prefill recomputation caused by cache invalidation;

RBG In-Place Upgrade: through RBG’s fine-grained role control, upgrading the Mooncake role replaces container images in place instead of rebuilding Pods, reusing the node’s local disk or shared memory so that persisted cache data is retained, achieving “seamless” version switching (see the sketch below).
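Concretely, the Mooncake Store role can combine the InplaceIfPossible strategy from section 3.3.1 with a node-local volume for the persisted snapshots. The sketch below is illustrative (the mount path and volume choice are assumptions; the linked example YAML may differ):

roles:
- name: mooncake-store
  rolloutStrategy:
    rollingUpdate:
      type: InplaceIfPossible        # swap the container image without recreating the Pod
      maxUnavailable: 1
  template:
    spec:
      containers:
      - name: store
        image: lmsysorg/sglang:v0.5.6
        volumeMounts:
        - name: kvcache-persist
          mountPath: /mnt/mooncake-cache    # illustrative snapshot location
      volumes:
      - name: kvcache-persist
        hostPath:                           # node-local storage that survives container restarts
          path: /mnt/mooncake-cache
          type: DirectoryOrCreate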

The combination of these two approaches allows KVCache state to persist during joint upgrades of Serving Backend and Mooncake, preventing active sessions from falling back to the Prefill stage, effectively avoiding latency spikes and throughput drops, and ensuring end-to-end stability and high availability of large model inference services during version iterations.


In other words, RBG not only solves the complexity of multi-role collaborative deployment but also transforms the industry challenge of “smooth evolution of stateful caching services” into standardized, automatable operational capabilities through in-place upgrades, truly achieving the production-grade goal of “upgrade without perception, service without jitter.”

We then performed a version update on the just-deployed service, upgrading the container image of the Mooncake Store role (which ships the SGLang engine and its bundled transfer-engine) from v0.5.5 to v0.5.6:

kubectl patch rolebasedgroup sglang-pd-with-mooncake-demo \
  --type='json' \
  -p='[{"op": "replace", "path": "/spec/roles/1/template/spec/containers/0/image", "value": "lmsysorg/sglang:v0.5.6"}]'

Checking Pod status, we can see that each Mooncake Store Pod recorded only a single container restart after the role’s container image update; the Pods themselves were not recreated.

kubectl get pods -l rolebasedgroup.workloads.x-k8s.io/name=sglang-pd-with-mooncake-demo -owide

NAME                                                READY   STATUS             RESTARTS   AGE
sglang-pd-with-mooncake-demo-decode-0               1/1     Running            0          7m4s
sglang-pd-with-mooncake-demo-mooncake-master-0      1/1     Running            0          7m24s
sglang-pd-with-mooncake-demo-mooncake-store-bh9xs   1/1     Running            1          7m4s
sglang-pd-with-mooncake-demo-mooncake-store-dsrv4   1/1     Running            1          7m4s
sglang-pd-with-mooncake-demo-mooncake-store-tqjvt   1/1     Running            1          7m4s
sglang-pd-with-mooncake-demo-prefill-0              1/1     Running            0          7m4s
sglang-pd-with-mooncake-demo-router-0               1/1     Running            0          4m33s

The restart reason can be confirmed by checking Pod events:

kubectl describe pods sglang-pd-with-mooncake-demo-mooncake-store-dsrv4

Events:
  Type     Reason          Age                  From               Message
  ----     ------          ----                 ----               -------
  Normal   Scheduled       27m                  default-scheduler  Successfully assigned default/sglang-pd-with-mooncake-demo-mooncake-store-dsrv4 to cn-beijing.10.134.25.229
  Normal   AllocIPSucceed  27m                  terway-daemon      Alloc IP 10.134.25.238/16 took 584.019653ms
  Normal   Created         27m                  kubelet            Created container: store
  Normal   Pulled          27m                  kubelet            Container image "lmsysorg/sglang:v0.5.5" already present on machine
  Normal   Started         27m                  kubelet            Started container store
  Normal   Killing         21m                  kubelet            Container store definition changed, will be restarted

By checking the status of the restarted Mooncake instance, we can see that the Pod’s network and topology information did not change after the in-place upgrade. Combined with Mooncake’s cache persistence capability, this ensures that the KVCache before restart was not lost and was successfully recovered as expected after the in-place upgrade.

kubectl get pods sglang-pd-with-mooncake-demo-mooncake-store-dsrv4 -o jsonpath='{.spec.nodeName}'

kubectl get pods sglang-pd-with-mooncake-demo-mooncake-store-dsrv4 -o jsonpath='{.status.podIP}'

6. Summary and Outlook

This article systematically explains how to build a production-grade, stable, high-performance PD-disaggregated inference service through the collaborative design of RoleBasedGroup (RBG) and Mooncake. The conclusions are as follows:

RBG Redefines the Orchestration Paradigm for LLM Inference Services: By making multi-role collaboration (PD disaggregation, Mooncake caching) and topology-aware scheduling first-class citizens, RBG not only solves the complexity of distributed deployment but also conquers the industry challenge of “smooth evolution of stateful caching services” through its in-place upgrade capability, achieving the production-grade goal of “upgrade without perception, service without jitter.”

Mooncake Unlocks New Possibilities for KVCache: As an L3 cache layer built on distributed memory pools and RDMA acceleration, Mooncake significantly increases cache hit rates, reduces average TTFT by 56.3%, improves P90 latency by 42.7%, and raises average GPU utilization from below 30% toward a sustainably elastic level, balancing performance and cost.

Hierarchical Cache Architecture Is the Inevitable Path for Long-Context Inference: The three-level cache hierarchy of GPU HBM → DRAM → Mooncake has proven its effectiveness in benchmarks. Especially in machine-driven scenarios like multi-turn conversations, RAG, and AI Agents, the marginal cost savings from cache reuse grow increasingly significant.

The practice of RBG + Mooncake demonstrates that only through deep integration of high-performance system design and cloud-native operational capabilities can large model inference truly evolve from “usable” to “user-friendly,” from “laboratory” to “production-grade.” We look forward to advancing this paradigm with the community to lay the foundation for next-generation AI infrastructure.

7. Acknowledgments

● Xiaohongshu: Sun Weixiang, Song Yang, Xiong Feng

● SGLang: Yang Yanbo

● Qujing Technology: Yang Ke

● Mooncake: Ma Teng, Cai Shangming

● Alibaba Cloud: Yizhai, Bai Cun, Dong Yun

8. Appendix

8.1 Image Building

In the example used in this article, we can directly use the official SGLang community container image lmsysorg/sglang:v0.5.5 (mooncake-transfer-engine >= 0.3.7), which already includes the Mooncake dependencies by default. For customization, refer to the Dockerfile at the following link to build an image with a specific Mooncake version: https://github.com/sgl-project/rbg/blob/main/examples/mooncake/Dockerfile.mooncake

8.2 Benchmark Testing

8.2.1 Environment Configuration

● Kubernetes nodes: 2 × (8 × NVIDIA H20 141GB GPUs)

● Model: Qwen3-235B

● Mooncake TransferEngine version: v0.3.7

● SGLang engine version: v0.5.6

● Benchmark tool: hicache/bench_multiturn

8.2.2 Testing Methodology

Using the multi-turn conversation stress-testing tool provided by HiCache, we simulated multi-turn conversation scenarios and compared the throughput and SLO metrics of inference services in KVCache-reusable scenarios under three configurations: L3 Mooncake + L2 HiCache enabled, only L2 HiCache enabled, and HiCache disabled.

● Test Configuration

● L1 GPU Only (baseline): no HiCache, no Mooncake
  python -m sglang.launch_server --model /models/Qwen3 --tp-size 4 --mem-fraction-static 0.9

● L2 DRAM HiCache: 320Gi HiCache (80Gi CPU RAM per rank)
  python -m sglang.launch_server --model /models/Qwen3 --tp-size 4 --mem-fraction-static 0.9 --enable-hierarchical-cache --hicache-size=80

● L2 DRAM HiCache + L3 Mooncake: 320Gi HiCache (80Gi CPU RAM per rank) + 500Gi Mooncake CPU RAM
  python -m sglang.launch_server --model /models/Qwen3 --tp-size 4 --mem-fraction-static 0.9 --enable-hierarchical-cache --hicache-size=80 --hicache-storage-backend mooncake

● Test commands

python3 benchmark/hicache/bench_multiturn.py \
--model-path /models/Qwen3-235B/Qwen3-235B-A22B \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--disable-random-sample \
--output-length 1 \
--request-length 2048 \
--num-clients 150 \
--num-rounds 10 \
--max-parallel 4 \
--request-rate 16 \
--ready-queue-policy random \
--disable-auto-run \
--enable-round-barrier

● Results by group:

Category                          TTFT (Avg / P90)   Input Token Throughput   Cache Hit Rate
L1 GPU Only (baseline)            5.91s / 12.16s     6576.86 token/s          2.22%
L2 DRAM HiCache                   3.77s / 10.88s     10054.21 token/s         40.62%
L2 DRAM HiCache + L3 Mooncake     2.58s / 6.97s      15022.80 token/s         64.67%