Alibaba Cloud Tair KVCache Simulation Analysis: High-Precision Computational and Caching Simulation Design and Implementation

This article introduces Tair-KVCache-HiSim, a high-fidelity CPU-based simulator for optimizing multi-tier KV Cache configurations in LLM inference.

Today, as large model inference enters the "Age of Agents," KV Cache has evolved from a performance optimization technique into a system-level infrastructure. The traditional "in-GPU-memory caching" paradigm is no longer sustainable under demanding scenarios such as long-context processing and multi-turn interactions. Although multi-tier KV Cache architectures—embodying the "storage-over-computation" principle—have overcome capacity bottlenecks, they introduce a high-dimensional configuration space shaped by the interplay of model architecture, hardware platforms, inference engines, and caching policies. The central challenge in large-scale deployment now lies in identifying the optimal trade-off among latency, throughput, and cost while satisfying Service Level Objectives (SLOs) such as latency and throughput targets.

To address this challenge, the Alibaba Cloud Tair KVCache team, in collaboration with the Heterogeneous Computing Software-Hardware Co-Design team, introduces Tair-KVCache-HiSim, the first high-fidelity simulation and analysis tool specifically designed for distributed multi-tier KV Cache management in large language model (LLM) inference. By comprehensively modeling the full request lifecycle, multi-level KV Cache behaviors, and heterogeneous batched execution, Tair-KVCache-HiSim achieves end-to-end performance prediction with <5% error on commodity CPUs at a cost advantage of 390,000× compared to real hardware trials.

More importantly, Tair-KVCache-HiSim leverages real-world workloads to automatically explore the Pareto frontier under user-specified Service Level Objective (SLO) constraints, thereby enabling three critical decision-making capabilities:

  • Compute Platform Selection and Optimization Configuration: Evaluates the impact of different GPU models, parallelization strategies, quantization schemes, and operator implementations on Time-To-First-Token (TTFT) and Time-Per-Output-Token (TPOT), recommending the most cost-effective configuration.
  • Storage Tiering and Media Planning: Quantifies the performance ceiling of multi-tier caching architectures and supports fine-grained selection of storage media types per tier. It co-optimizes bandwidth allocation, capacity distribution, prefetching policies, and eviction algorithms to maximize cache hit rates and I/O efficiency.
  • Coordinated Global and Local Scheduling: Jointly analyzes how global routing policies and local scheduling mechanisms affect queuing delay, batch composition, and GPU utilization, enabling end-to-end optimization—from cluster-wide load balancing to single-node pipeline efficiency.

This series of technical articles systematically breaks down the evolution of KVCache technology for agent inference:

  1. Alibaba Cloud Tair Partners with SGLang to Build HiCache: Constructing a New Cache Paradigm for "Agentic Inference"
  2. Alibaba Cloud Tair KVCache Engineering Implementation Based on 3FS: Enterprise-Grade Deployment, High-Availability Operations, and Performance Optimization Practices
  3. Hybrid Model Support: SGLang's Support for Hybrid Architecture Models such as Mamba-Transformer
  4. Tair KVCache Manager: Architecture Design and Implementation of an Enterprise-Level Global KVCache Management Service
  5. This article | KVCache Simulation Analysis: High-Precision Computational and Caching Simulation Design and Implementation
  6. Hierarchical Sparse Attention: Hierarchical KV Management and On-Demand Loading under the Hierarchical Sparse Attention Framework
  7. Outlook: The Evolution of KVCache-Driven Software-Hardware Co-Design

Tair KVCache, an extension of the Tair product capabilities of Alibaba Cloud Database, essentially reflects three transitions of the caching paradigm:
🔹 From Redis's "cache data → reduce I/O"
🔹 To GPU KVCache's "cache intermediate computation state → reduce repeated computation"
🔹 Then to Tair KVCache's "large-scale, agent-oriented attention state management → restructuring the cost model of large-model inference"

This progression indicates that the cache is being upgraded from an auxiliary component to a core capability of the AI infrastructure layer: it makes "state" storable, shareable, and schedulable, and it underpins large-scale inference in the agent era.

1. Introduction

In the current context of rapid deployment and large-scale adoption of Large Language Model (LLM) inference services, the performance of inference systems directly determines user experience, service cost, and resource efficiency. Key performance metrics—such as Time to First Token (TTFT), Time per Output Token (TPOT), and system-level throughput—have become core criteria for evaluating the effectiveness of inference engines. In real-world deployment scenarios, these metrics are highly sensitive to a complex interplay of multiple factors, including model architecture (e.g., parameter count, sparsity), hardware platforms (e.g., computational capacity and memory bandwidth characteristics of GPUs such as A100 or H100), inference engine implementations (e.g., scheduling and KV cache management strategies in vLLM, SGLang, TensorRT-LLM), and runtime configurations (e.g., quantization schemes, batching policies, and parallelization modes).

To support the design and optimization of efficient, low-cost inference systems, we require a performance evaluation methodology that is high-fidelity, scalable, and reproducible. Traditional approaches relying on end-to-end stress testing over real GPU clusters are not only prohibitively expensive in hardware costs and time-consuming in experimentation, but also ill-suited for systematically exploring the vast combinatorial space of configurations.

In this context, the need for a CPU-based inference performance simulation framework has emerged. Specifically, we aim to replay realistic inference workload traces—collected from production environments or representative scenarios—on commodity CPU platforms to enable fast, low-cost, and high-accuracy prediction and comparison of key performance metrics, including TTFT, TPOT, and throughput, across diverse combinations of models, target GPUs, inference engines (e.g., vLLM, SGLang, TensorRT-LLM), and their runtime configurations.

Extending further to configurations of remote KV Cache, a set of interrelated factors critically influences inference performance:

  • the throughput and transfer latency of the selected storage media (e.g., DDR4, HBM, NVMe SSD, CXL memory pools, or remote GPU VRAM)
  • the total capacity limit
  • the cache eviction policy (e.g., LRU, LFU, Clock, or custom priority-aware strategies)
  • the Time-to-Live (TTL) expiration and reclamation mechanisms (e.g., lazy cleanup, periodic background reclamation, or proactive eviction triggered by memory pressure)

These elements become especially pivotal in agent-centric applications where inference offloading to remote storage is required. When the KV Cache cannot fully reside in local GPU VRAM and must be partially or entirely migrated to remote storage, access latency directly impacts TTFT and TPOT. Moreover, if the eviction policy mismatches the request pattern (for example, by frequently evicting high-reuse early-layer KV states during long-context dialogues), it can cause a sharp drop in cache hit rate, triggering redundant computation or excessive I/O overhead. Additionally, improper TTL settings may lead to either premature invalidation (reducing reuse efficiency) or memory exhaustion (starving subsequent requests of resources).
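To make the interaction between capacity, eviction, and TTL concrete, here is a minimal sketch of a block cache with LRU eviction and lazy TTL expiry, replayed against a tiny access trace. The class name `LruTtlCache`, the `replay` helper, and the trace itself are illustrative assumptions and are not taken from Tair-KVCache-HiSim.

```python
from collections import OrderedDict


class LruTtlCache:
    """Toy KV block cache: capacity limit, LRU eviction, lazy TTL expiry.

    Illustrative only; assumes capacity_blocks >= 1.
    """

    def __init__(self, capacity_blocks: int, ttl_s: float):
        self.capacity_blocks = capacity_blocks
        self.ttl_s = ttl_s
        self._store = OrderedDict()  # block_id -> last access time (seconds)
        self.hits = 0
        self.misses = 0

    def access(self, block_id: str, now_s: float) -> bool:
        """Look up a block at simulated time now_s and insert it on a miss."""
        last = self._store.get(block_id)
        if last is not None and now_s - last <= self.ttl_s:
            self.hits += 1
            self._store[block_id] = now_s
            self._store.move_to_end(block_id)  # mark as most recently used
            return True
        # Miss: never cached, already evicted, or expired (lazy cleanup).
        self.misses += 1
        self._store.pop(block_id, None)
        if len(self._store) >= self.capacity_blocks:
            self._store.popitem(last=False)  # evict the least recently used block
        self._store[block_id] = now_s
        return False

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def replay(trace, capacity_blocks: int, ttl_s: float) -> float:
    """Replay (time_s, block_id) accesses and return the resulting hit rate."""
    cache = LruTtlCache(capacity_blocks, ttl_s)
    for now_s, block_id in trace:
        cache.access(block_id, now_s)
    return cache.hit_rate()


# Hypothetical trace: block "a" is reused often, then once more after a long pause.
trace = [(0, "a"), (1, "b"), (2, "a"), (3, "c"), (4, "a"), (20, "a")]
for capacity, ttl in [(2, 10), (2, 100), (1, 100)]:
    print(capacity, ttl, round(replay(trace, capacity, ttl), 3))
```

On this trace, a short TTL loses the late reuse of block "a", while a capacity of one block loses almost every reuse, which mirrors the two failure modes described above.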

Consequently, the design of remote KV Cache systems fundamentally entails a fine-grained trade-off under four-dimensional constraints: capacity, latency, throughput, and cost. To navigate this complex space, we require a simulation methodology that integrates inference performance modeling with accurate emulation of cache hit rates and data transfer dynamics. This enables quantitative evaluation of how different KV Cache configuration combinations affect end-to-end inference SLOs—such as P99 latency ≤ 200 ms and throughput ≥ 50 req/s—thereby providing actionable insights for optimizing cache hierarchies in heterogeneous inference architectures.

2. Current Approaches to Inference Simulation: Methods and Their Advantages and Limitations

2.1 Overall Architecture of the Inference Engine

To build an effective performance simulator, one must first accurately model the execution logic of real-world inference engines. Typical LLM inference serving systems—such as vLLM, SGLang, and TensorRT-LLM—commonly adopt an architecture based on asynchronous request scheduling coupled with continuous or dynamic batching, dynamically aggregating Prefill and Decode requests to maximize hardware utilization. The core components and workflow are as follows:

2.1.1 Request Processing Pipeline and Lifecycle

In high-performance LLM inference engines such as SGLang, a single request is not processed sequentially from arrival to completion; instead, it is embedded into a deeply pipelined and asynchronously coordinated processing pipeline. This pipeline leverages mechanisms including CPU-GPU co-execution, multi-level cache prefetching, and dynamic batching to maximize throughput while ensuring low latency.


Taking a typical scenario as an example: a user submits a prompt of 1K tokens and expects to generate 512 output tokens, where the first 512 tokens of the prompt match a prefix from historical dialogue (i.e., achieving a 50% KV Cache hit rate). The complete lifecycle of this request is as follows:

1. Request Ingestion and Frontend Processing

The request is first routed by a load balancer to a specific inference instance. On the CPU, the service performs tokenization, converting the input text into a sequence of token IDs, which is then passed to the scheduler.

2. Prefix Cache Matching and State Identification

The engine rapidly searches for the prompt’s historical context on the CPU using a Radix Tree. If the first 512 tokens are found in the KV Cache, this segment is marked as “reusable”, and only the corresponding key/value tensors are loaded—eliminating redundant computation.

3. Asynchronous Cache Prefetching and Zero-Overhead Scheduling

  1. Stage 1 Prefetch (L3 → L2): Upon entering the waiting queue, the system immediately initiates asynchronous I/O to migrate the hit KV Cache entries from SSD (L3) to host DRAM (L2). This occurs in the CPU background without interfering with GPU inference.
  2. Stage 2 Loading (L2 → L1): When the scheduler decides to include the request in the next batch, it checks whether the L2 cache is ready. If so, it triggers data transfer from host DRAM to GPU HBM (L1).
  3. Zero-Overhead Scheduling: The CPU-based scheduling decision overlaps with the execution of the previous GPU batch, thereby avoiding pipeline stalls caused by scheduling latency and maximizing GPU utilization and system throughput.

4. Dynamic Batching Scheduling

The scheduler combines multiple ready requests—including both new Prefill requests and ongoing Decode requests—into a heterogeneous batch, based on available GPU memory, request priority, and cache readiness.

5. Phased Forward Model Execution

  1. Prefill Phase: Processes the remaining 512 non-cached tokens. This phase features long input sequences and is compute-intensive, primarily constrained by model parallelism, quantization, and GPU compute capacity.
  2. Decode Phase: Generates tokens one at a time. Each step computes only a single new token but requires reading the entire historical KV Cache, making it memory-bandwidth-bound.

6. Post-processing and Streaming Output (Detokenization)

The output logits are sampled to produce token IDs, which are then converted back to text via a detokenizer. To enhance user experience, results are returned in a streaming fashion as soon as each token is generated.
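The prefix-matching step in this lifecycle can be sketched in a few lines. The example below uses a plain token-level trie as a simplified stand-in for a compressed radix tree; the `PrefixTrie` class and the synthetic token IDs are assumptions made for illustration, not SGLang's actual data structure.

```python
class PrefixTrie:
    """Token-level trie, a simplified stand-in for a compressed radix tree."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Record a token sequence whose KV Cache is assumed to be stored."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def match_len(self, tokens):
        """Return the length of the longest stored prefix of tokens."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched


# Scenario from the text: a 1K-token prompt whose first 512 tokens were seen before.
history = list(range(512))                        # synthetic token IDs
prompt = history + list(range(10_000, 10_512))    # 512 cached + 512 new tokens

trie = PrefixTrie()
trie.insert(history)

cached_tokens = trie.match_len(prompt)
prefill_tokens = len(prompt) - cached_tokens
print(cached_tokens, prefill_tokens, cached_tokens / len(prompt))
```

Only `prefill_tokens` need to be computed in the Prefill phase; the cached prefix is loaded through the tiered prefetch path instead.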

2.1.2 Introduction to Request Scheduling Policies

Since LLM serving backends continuously receive new inference requests, determining the scheduling order of requests prior to each inference step is one of the core design considerations for the framework. With respect to scheduling policies for Prefill and Decode requests, four primary strategies can be identified:

  • Prefill-First:
    Exemplified by SGLang, this policy prioritizes the Prefill phase of newly arrived requests by temporarily pausing ongoing Decode operations. After completing the Prefill of the new request, it is combined with existing Decode requests into a larger batch for subsequent inference steps. This approach maximizes system throughput but can introduce significant fluctuations in TPOT.
  • Decode-First:
    Represented by TensorRT-LLM, this strategy—also known as _inflight batching_—avoids interrupting currently executing Decode requests. New requests are only scheduled for Prefill if sufficient resources remain after accommodating all in-flight Decode requests in the next batch; otherwise, they wait until resources become available. This mitigates TPOT jitter and is particularly suited for workloads with short input sequences.
  • Chunked Prefill:
    Splits the Prefill of a long prompt into smaller chunks, allowing these chunks to be batched and executed concurrently with Decode requests. This alleviates resource contention caused by long Prefill operations blocking Decode tasks, thereby improving TTFT, TPOT, and overall throughput. It is especially effective for scenarios such as long-document summarization, multi-turn dialogues, and other applications requiring long-sequence processing with high concurrency.
  • Prefill/Decode Separation (PD Separation):
    Decouples the Prefill and Decode stages into separate execution pipelines with independent schedulers and resource pools. By isolating their distinct computational and memory access patterns, this architecture minimizes mutual interference and enables finer-grained trade-offs between TTFT and TPOT.
Framework Prefill Priority Decode Priority ChunkPrefill Prefill/Decode Separation
SGLang ✅ default ☐︎
vLLM ✅ default
TensorRT-LLM ✅ default

The strategies described above pertain to the scheduling priority between the Prefill and Decode phases. In practice, within the set of newly arrived requests (i.e., those in the Prefill phase), additional scheduling mechanisms can be layered on top. These include the widely adopted First-Come-First-Served (FCFS) policy, as well as others such as long-output-first and cache-aware longest-common-prefix-first scheduling.

2.1.3 Request Scheduling Logic in SGLang

As illustrated in the figure, the Prefill-first scheduling logic in SGLang structures a complete event loop of LLM inference into five main phases:

  1. fetching new requests from the HTTP server
  2. processing incoming requests (enqueuing for scheduling and HiCache prefetch queuing)
  3. request scheduling
  4. LLM step inference
  5. post-processing

Among these, we focus primarily on the scheduling logic, which operates as follows:

Scheduling Resource Constraints:
Whether a request can be promoted from the waiting queue to active execution is governed by four key resource limits:

  • maximum Chunk Size
  • maximum number of tokens allowed in a Prefill batch
  • maximum number of concurrently running requests
  • total capacity of the KV Cache Pool

All these parameters can be configured directly or indirectly via launch-time arguments.

Scheduling Execution Order:
During scheduling, requests that were chunked in the previous iteration are prioritized for continuation. The remaining eligible requests are then ordered according to a specified priority policy (e.g., FCFS) and selected accordingly. Additionally, the scheduler dynamically decides whether to apply chunking to a request based on the remaining token capacity of the current batch.
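The resource limits and ordering rules above can be illustrated with a small admission function. This is a simplified sketch rather than SGLang's scheduler: the `Req` dataclass, the function name, and the default limits (2048, 4096, 8) are hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Req:
    rid: str
    remaining_prefill: int    # tokens still to prefill (after cache hits)
    arrival_s: float
    was_chunked: bool = False  # True if this request was chunked in the previous iteration


def schedule_prefill(waiting, running_count, free_kv_tokens,
                     chunk_size=2048, max_prefill_tokens=4096, max_running=8):
    """Return [(rid, tokens_this_step)] for the next Prefill batch.

    All limits are hypothetical defaults for this sketch.
    """
    # Previously chunked requests continue first; the rest follow FCFS order.
    order = sorted(waiting, key=lambda r: (not r.was_chunked, r.arrival_s))
    # The batch can never use more tokens than the KV Cache pool can hold.
    budget = min(max_prefill_tokens, free_kv_tokens)
    batch = []
    for req in order:
        if running_count + len(batch) >= max_running or budget <= 0:
            break
        # Chunk the request if it exceeds the chunk size or the remaining budget.
        take = min(req.remaining_prefill, chunk_size, budget)
        batch.append((req.rid, take))
        budget -= take
    return batch


waiting = [
    Req("a", 3000, arrival_s=0.0),
    Req("b", 500, arrival_s=1.0),
    Req("c", 1000, arrival_s=0.5, was_chunked=True),
]
print(schedule_prefill(waiting, running_count=0, free_kv_tokens=10_000))
```

In this run, request "c" is continued first, "a" is cut to one chunk, and "b" fits in the remaining token budget.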

HiCache-Aware Scheduling with Multi-Level KV Cache:
When HiCache—a multi-tier KV Cache storage system—is enabled, the decision to schedule a request also depends on the configured prefetch policy, which can be one of:

  • best_effort
  • wait_complete
  • timeout

If the prefetch operation has not met its termination condition under the chosen policy, the request remains in the prefetch queue and is not scheduled, allowing KV Cache prefetching to continue.
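One plausible reading of the three prefetch policies is sketched below. The termination conditions are an assumption inferred from the policy names (best_effort stops as soon as the scheduler could run the request, wait_complete waits for every hit block, timeout waits up to a deadline); the function name and parameters are hypothetical.

```python
def prefetch_done(policy, fetched_tokens, target_tokens, elapsed_s,
                  gpu_ready=True, timeout_s=1.0):
    """Decide whether a request may leave the prefetch queue.

    Assumed semantics, for illustration only:
      best_effort   -> stop once the scheduler is ready to run the request
      wait_complete -> stop only when all hit tokens have been fetched
      timeout       -> stop when fetching completes or the deadline passes
    """
    complete = fetched_tokens >= target_tokens
    if policy == "best_effort":
        return gpu_ready or complete
    if policy == "wait_complete":
        return complete
    if policy == "timeout":
        return complete or elapsed_s >= timeout_s
    raise ValueError(f"unknown prefetch policy: {policy}")


# A request with 512 hit tokens, of which 100 have reached host DRAM after 0.5 s.
for policy in ("best_effort", "wait_complete", "timeout"):
    print(policy, prefetch_done(policy, 100, 512, elapsed_s=0.5))
```

Under these assumptions, best_effort trades cache reuse for lower queuing delay, while wait_complete does the opposite; timeout bounds the waiting time between the two.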

Figure 2: Default Request Scheduling Logic for New Requests in SGLang

2.1.4 Discrepancies Between Inference Computation Models and Framework Implementations

Figure 3: Qwen3 Model Architecture Diagram

Current mainstream large language models (e.g., the Qwen series) predominantly adopt a Decoder-Only Transformer architecture, in which the inference process sequentially processes the input token sequence through a series of structured modules. Taking Qwen3 as an example (architectural schematic shown in the figure above), a typical forward pass comprises the following key components:

  • Embedding Layer: Maps discrete token IDs into continuous high-dimensional vectors as the network input.
  • Position Encoding Layer: Employs Rotary Position Embedding (RoPE), which injects positional information into attention computations via rotation matrices, enabling effective extrapolation to longer sequence lengths.
  • Stacked Decoder Blocks (N layers in total). Each layer consists of:

    • RMSNorm: An efficient normalization operation that replaces traditional LayerNorm
    • Attention Mechanism: Models global contextual dependencies, with implementations varying across architectures. Common variants include MHA (Multi-Head Attention), MQA (Multi-Query Attention), GQA (Grouped-Query Attention), MLA (Multi-head Latent Attention), Linear Attention, and Sparse Attention
    • Feed-Forward Network (FFN): Typically implemented as an MLP or MoE (Mixture-of-Experts) structure to capture complex nonlinear transformations
  • Final RMSNorm Layer: Normalizes the output representation prior to sampling.

Figure 4: Hardware-Specific Operator Backend Implementations

Although different LLMs are functionally highly similar, their actual inference performance is significantly influenced by low-level implementation details. Taking mainstream inference frameworks such as SGLang as an example, the same model architecture can trigger entirely different GPU kernel implementations depending on the underlying hardware and runtime configuration. More critically, even for the same operator, its launch parameters—such as block size and tiling configuration—are dynamically adjusted based on input characteristics (e.g., prompt length and KV cache length). These optimizations are typically selected automatically by the operator scheduler either at compile time or runtime to maximize hardware utilization. Consequently, during GPU-based inference, the performance impact of different operator implementations must be explicitly accounted for when modeling execution time.

2.2 Core Challenges of LLM Inference Simulation

LLM inference exhibits significant dynamic heterogeneity, strong state dependencies, and high sensitivity to millisecond-level Service Level Objectives (SLOs). These characteristics make traditional static performance modeling methods inadequate for accurately reproducing real system behavior. Specifically, current LLM inference simulation faces the following four key challenges:

1. High Complexity and State Density in the Full Lifecycle of Inference Requests

The lifecycle of an LLM inference request involves dynamic transitions across multiple stages, queues, and cache levels. As previously described, a typical request sequentially undergoes Tokenization → Scheduling Enqueue (Waiting Queue) → Prefill Execution → Multi-Round Decode Batching (RunBatch) → Detokenization. During this process, requests migrate between Waiting, Running, and Swapped queues under scheduler management, accompanied by multi-level KV Cache loading and eviction behaviors (e.g., triggering L3→L2 prefetching while in the Waiting Queue and completing L2→L1 Cache transfers before Prefill execution). This end-to-end state transition path, coupled with deep integration among caching, computation, and scheduling, means that any simplified modeling that ignores intermediate state transitions or cache interactions will result in significant deviations.

2. Error Cascading Due to Strong Coupling Among System Components

The core components of an LLM inference system—Scheduler, KV Cache Manager, and GPU Execution Engine—are tightly interconnected through feedback loops. For example:

  • Scheduling Decisions Impact KV Cache and Computation: Scheduling strategies determine when requests enter the execution queue, directly affecting their dwell time in the Waiting Queue and thus the amount of data prefetched from L3 to L2. Additionally, the batch composition formed by scheduling (e.g., mix ratio of Prefill/Decode, context length distribution) directly influences GPU kernel parallel efficiency and memory access patterns, thereby impacting actual execution latency.
  • KV Cache States React Back on Scheduling and Computation: The hit rate of the KV Cache determines the number of tokens that need recomputation during the Prefill phase, directly affecting computational load and latency. The required recompute length also constrains token budget allocation within batches, influencing the scheduler’s decisions on admitting and splitting new requests.
  • Batch Execution Latency Estimates Influence Scheduling and Cache Behavior: Batch latency determines how many new requests arrive before the next batch is formed, which in turn influences whether the scheduler inserts new Prefill requests and how many. It also determines the size of the KV Cache loading window and TTL settings.

This multidirectional dependency means that modeling errors in any single component can propagate and amplify through the system pipeline, leading to severe distortions in end-to-end latency predictions.

3. Nonlinear Coupling Effects on Single-Step Latency and Lack of Generalizable Fine-Grained Modeling Methods

The latency of an LLM inference batch is not solely determined by coarse-grained parameters such as batch size and input length but is influenced by a nonlinear coupling of multiple dimensions:

  • Model Level: Number of layers, attention heads, use of operator optimizations like FlashAttention or PagedAttention.
  • System Configuration: Tensor/Pipeline/Data/Expert Parallelism (TP/PP/DP/EP), quantization schemes (e.g., INT4, FP8).
  • Hardware Platform: GPU model, memory bandwidth, inter-node topology.
  • Dynamic Request States: Prompt length, generated token count, occupied KV Cache blocks per request.
  • Batch Processing Heterogeneity: Due to continuous batching mechanisms, the context lengths and cache states within the same batch are highly heterogeneous, causing significant fluctuations in GPU kernel compute intensity and memory access patterns.

Meanwhile, given the rapidly evolving landscape of model architectures and hardware ecosystems, exhaustively testing every "model–configuration–hardware" combination is neither economical nor scalable. Therefore, constructing a latency prediction mechanism that can precisely characterize single-step execution behavior while maintaining generalizability across models and platforms is a core challenge for high-fidelity LLM inference simulation.

4. Efficiency Bottlenecks in Optimal Solution Search Across High-Dimensional Configuration Spaces

Even with a high-fidelity simulator, its practical value in deployment tuning remains constrained by configuration search efficiency. Typical deployment configuration spaces encompass dimensions such as parallelism, batch size, cache strategies, and quantization bitwidth, leading to combinatorial explosion issues. If each simulation run takes 1 minute, exhaustive search could take days, far exceeding acceptable tuning cycles. Thus, efficiently exploring and identifying Pareto frontiers for cost-latency-throughput trade-offs under SLO constraints becomes a critical bottleneck for the practical application of simulators.
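A quick back-of-the-envelope calculation shows how fast the search space grows. The number of candidate values per dimension below is hypothetical; only the 1-minute cost per simulation run comes from the text.

```python
from math import prod

# Hypothetical number of candidate values per tuning dimension.
options = {
    "parallelism": 6,
    "batch_size": 8,
    "cache_strategy": 10,
    "quantization": 4,
    "l2_capacity": 5,
    "l3_capacity": 5,
}
minutes_per_run = 1  # per-run simulation cost assumed in the text

n_configs = prod(options.values())
total_days = n_configs * minutes_per_run / (60 * 24)
print(f"{n_configs} configurations, about {total_days:.1f} days of sequential simulation")
```

With these assumed sizes, six modest dimensions already yield 48,000 configurations, or roughly 33 days of sequential simulation.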

2.3 Key Requirements for a KV Cache-Centric LLM Inference Simulator

To address the aforementioned challenges, a production-grade LLM inference system simulator must go beyond the limitations of traditional performance models and establish a simulation framework that is hierarchical, decoupled, high-fidelity, verifiable, and efficient to optimize over. Based on the preceding analysis, we propose the following four core requirements:

1. Support for Hierarchical Abstraction of End-to-End Inference Workflows

The simulator should be capable of fully reproducing the entire lifecycle behavior of requests from ingestion to response in real inference engines, including request generation, scheduling decisions, state transitions, batch execution, and result return phases. Specifically, it must meet the following criteria:

  • Simulate User Request Loads with Realistic Distributions: Model user request loads that exhibit realistic distribution characteristics.
  • Support Multi-Node Deployment Scenarios: Accurately model request routing and cross-node collaboration behaviors in multi-node deployments.
  • Modular Abstraction of Internal Processing Stages: Abstract each processing stage within inference instances (e.g., tokenization, scheduling, KV Cache management, batch inference execution, detokenization) while maintaining their execution order and dependencies consistent with the real system.

This capability ensures that simulation results align both macroscopically and microscopically with actual system behavior.

2. High-Fidelity, Independently Verifiable Delay Modeling at the Component Level

To mitigate error cascading caused by component coupling, the simulator must decouple core functional modules and ensure the accuracy and verifiability of each module's behavior:

  • Scheduling Behavior Modeling: Accurately replicate the impact of scheduling policies on request states and the decision logic for batch composition and execution timing.
  • KV Cache Behavior Modeling: Model cache hit/miss rates, data prefetching, eviction, and cross-storage-tier migrations, including their latency and resource consumption.
  • Batch Inference Execution Modeling: Predict overall execution latency based on dynamic states (e.g., context length, generation progress) of requests within a batch.
  • Global Temporal Consistency: Maintain a unified time model to correctly reflect overlaps and dependencies between CPU scheduling, GPU computation, and memory transfers.

All modules should support independent verification to ensure local errors are controlled and end-to-end deviations are traceable.
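The need for a unified time model can be seen in a toy timeline that tracks when the CPU scheduler and the GPU are busy. This is a minimal sketch under one assumption: with overlap enabled, scheduling for a batch starts as soon as the GPU begins the previous batch. The function name and the millisecond figures are illustrative.

```python
def total_time(cpu_sched, gpu_exec, overlap):
    """Return the finish time of the last batch on a shared clock.

    cpu_sched[i] is the scheduling cost of batch i, gpu_exec[i] its GPU time.
    With overlap, scheduling batch i starts when the GPU starts batch i-1;
    without overlap, it starts only after the GPU has finished batch i-1.
    """
    gpu_start = gpu_end = 0.0
    for sched_cost, exec_cost in zip(cpu_sched, gpu_exec):
        sched_start = gpu_start if overlap else gpu_end
        # The GPU needs both the schedule and the previous batch to be finished.
        gpu_start = max(sched_start + sched_cost, gpu_end)
        gpu_end = gpu_start + exec_cost
    return gpu_end


cpu_ms = [2, 2, 2]     # hypothetical scheduling cost per batch, in ms
gpu_ms = [10, 10, 10]  # hypothetical GPU execution time per batch, in ms
print(total_time(cpu_ms, gpu_ms, overlap=False))
print(total_time(cpu_ms, gpu_ms, overlap=True))
```

Ignoring the overlap would charge the scheduling cost to every step, so a simulator that sums component latencies naively overestimates end-to-end time.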

3. Fine-Grained, Generalizable Single-Step Latency Prediction

Given that single-step inference latency is highly dependent on batch composition and request prompt & cache length, the simulator must provide fine-grained latency characterization:

  • Model Computation and Communication Separately for Different Requests in the Same Batch: Differentiate between computational and communication latencies for individual requests within a batch.
  • Predict Latency Based on Request-Level State Features: Use request-level state features rather than relying solely on coarse-grained batch statistics.
  • Provide Reliable Latency Estimates for Unseen Configurations: Offer reasonable and reliable latency estimates even for unseen model architectures, hardware platforms, or system configurations without requiring exhaustive empirical data.

This capability forms the foundation for achieving high-precision, low-cost simulations.
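To illustrate what request-level features buy over batch-level statistics, the sketch below uses a roofline-style estimate: a step takes as long as the slower of its compute and memory traffic. This is a generic textbook approximation, not the predictor used by Tair-KVCache-HiSim, and every number in it is a round hypothetical value.

```python
from dataclasses import dataclass


@dataclass
class ReqState:
    new_tokens: int     # tokens computed this step (a prompt chunk, or 1 for Decode)
    cached_tokens: int  # KV entries already resident that attention must read


def step_latency_s(batch, n_params, kv_bytes_per_token, weight_bytes,
                   peak_flops, mem_bw):
    """Roofline-style estimate: max of compute time and memory-traffic time.

    Assumes ~2 FLOPs per parameter per new token and ignores attention FLOPs,
    kernel efficiency, and communication; for illustration only.
    """
    flops = sum(2 * n_params * r.new_tokens for r in batch)
    bytes_moved = weight_bytes + sum(
        kv_bytes_per_token * (r.cached_tokens + r.new_tokens) for r in batch
    )
    return max(flops / peak_flops, bytes_moved / mem_bw)


# Hypothetical model and hardware figures.
hw = dict(n_params=1_000_000_000, kv_bytes_per_token=100_000,
          weight_bytes=2_000_000_000, peak_flops=1e14, mem_bw=1e12)

prefill_batch = [ReqState(new_tokens=4096, cached_tokens=0)]
decode_batch = [ReqState(new_tokens=1, cached_tokens=1000) for _ in range(8)]

print(step_latency_s(prefill_batch, **hw))  # dominated by compute
print(step_latency_s(decode_batch, **hw))   # dominated by memory traffic
```

Even this crude model reproduces the qualitative split described earlier: the Prefill batch is compute-bound, while the Decode batch is bound by reading weights and KV Cache.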

4. Efficient Exploration of Configuration Spaces Under SLO Constraints

To support practical deployment decisions, the simulator should enable efficient exploration of the deployment configuration space:

  • Avoid Exhaustive Evaluation of High-Dimensional Configuration Spaces: Significantly reduce tuning time by quickly identifying feasible configurations under user-specified Service Level Objectives (SLOs).
  • Support Multi-Objective Optimization: Optimize across multiple objectives (e.g., cost, latency, throughput) and output Pareto-optimal solution sets for user selection.

This capability transforms the simulator from a passive performance evaluation tool into an active decision-support system for production deployments.
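The SLO filtering and Pareto selection described here can be sketched with a brute-force dominance check. The candidate configurations and their simulated metrics are made up for the example; the SLO thresholds reuse the P99 latency ≤ 200 ms and throughput ≥ 50 req/s example from the Introduction.

```python
def pareto_front(configs, slo_p99_ms, slo_min_rps):
    """Keep SLO-feasible configs that no other feasible config dominates.

    Lower cost and latency are better; higher throughput is better.
    """
    feasible = [c for c in configs
                if c["p99_ms"] <= slo_p99_ms and c["rps"] >= slo_min_rps]

    def dominates(a, b):
        no_worse = (a["cost"] <= b["cost"] and a["p99_ms"] <= b["p99_ms"]
                    and a["rps"] >= b["rps"])
        strictly_better = (a["cost"] < b["cost"] or a["p99_ms"] < b["p99_ms"]
                           or a["rps"] > b["rps"])
        return no_worse and strictly_better

    return [c for c in feasible
            if not any(dominates(other, c) for other in feasible)]


# Hypothetical simulation results for five candidate configurations.
results = [
    {"name": "A", "cost": 10, "p99_ms": 180, "rps": 60},
    {"name": "B", "cost": 12, "p99_ms": 150, "rps": 60},
    {"name": "C", "cost": 12, "p99_ms": 190, "rps": 55},  # dominated by A
    {"name": "D", "cost": 8, "p99_ms": 250, "rps": 70},   # violates the latency SLO
    {"name": "E", "cost": 9, "p99_ms": 195, "rps": 40},   # violates the throughput SLO
]
front = pareto_front(results, slo_p99_ms=200, slo_min_rps=50)
print([c["name"] for c in front])
```

The quadratic dominance check is fine for a handful of candidates; avoiding the exhaustive evaluation that produces those candidates is the harder problem the text refers to.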

To systematically address these requirements and challenges, the subsequent sections will detail the overall architecture design and key technical implementation paths of Tair-KVCache-HiSim, providing a robust foundation for high-fidelity LLM inference simulation.

3. Architecture and Key Features of the Tair-KVCache-HiSim Simulator

3.1 Overall Architecture

To meet the demand for high-fidelity modeling of the full lifecycle of LLM inference, we design and implement Tair-KVCache-HiSim—a lightweight, high-accuracy simulation tool tailored for large language model inference services. Tair-KVCache-HiSim can efficiently predict key performance metrics—such as TTFT, TPOT, and system throughput—without requiring actual deployment of models onto GPUs, by injecting either synthetic or real-world request traces.

Compared to existing simulation approaches, Tair-KVCache-HiSim is the first to support multi-level KV Cache storage hierarchy simulation (based on the HiRadixCache architecture), providing critical decision support for users in optimizing cache resource allocation and cost-performance trade-offs.

As shown in the figure, Tair-KVCache-HiSim adopts a modular architecture, comprising the following three core components that work in concert to fully reproduce the end-to-end inference process from request ingestion to result return:

Figure 5: Tair-KVCache-HiSim Architecture Diagram

3.2 Component Overview

Tair-KVCache-HiSim consists of several critical components designed to comprehensively simulate the end-to-end inference process:

1. Workload Generator: Storage-Optimized User Load Generator for Simulating Real-World Scenarios

This module supports two flexible load injection modes to accommodate different data availability conditions:

Random Dataset Generation: Suitable for scenarios lacking original traces, it supports modeling based on open-source datasets or random tokens. Beyond typical parameters like input/output lengths, request rates, and concurrency levels, it introduces higher-order variables for scenarios with surging KV Cache demands:

  • Scenario Selection: Supports complex scenarios such as multi-turn dialogues and Agent applications.
  • Multi-Turn Dialogue Modeling: Models dialogue rounds, prompt length per round, and inter-round time intervals, providing a more accurate representation of real-world business scenarios.

Timestamp Dataset Replay: Supports importing real user loads with original timestamps. By precisely replaying historical loads, it provides customized performance evaluations and configuration optimization suggestions for specific business lines.
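As a rough illustration of the random generation mode with multi-turn modeling, the sketch below produces a synthetic trace in which each round reuses the whole conversation history as its prefix. The function name, field names, and the use of exponential inter-arrival and inter-round gaps are assumptions for this example, not the module's actual interface.

```python
import random


def multi_turn_trace(n_sessions, turns, prompt_len, output_len,
                     mean_gap_s, session_rate_per_s, seed=0):
    """Generate a synthetic multi-turn request trace sorted by timestamp.

    Sessions arrive as a Poisson process; rounds within a session are
    separated by exponentially distributed gaps (both are assumptions).
    """
    rng = random.Random(seed)
    trace, session_start = [], 0.0
    for sid in range(n_sessions):
        session_start += rng.expovariate(session_rate_per_s)
        t, history = session_start, 0
        for turn in range(turns):
            trace.append({
                "t": t,
                "session": sid,
                "turn": turn,
                "input_len": history + prompt_len,  # full context sent this round
                "reusable_prefix": history,         # tokens seen in earlier rounds
                "output_len": output_len,
            })
            history += prompt_len + output_len      # context grows every round
            t += rng.expovariate(1.0 / mean_gap_s)
    trace.sort(key=lambda r: r["t"])
    return trace


trace = multi_turn_trace(n_sessions=3, turns=4, prompt_len=100, output_len=50,
                         mean_gap_s=10.0, session_rate_per_s=1.0)
for record in trace[:3]:
    print(record)
```

The `reusable_prefix` field grows with every round, which is what makes multi-turn and Agent workloads so demanding on KV Cache capacity.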

2. Global Router Simulator: Global Request Scheduling Simulator

This module routes pending requests to the most suitable computing instance (Worker) using a configurable algorithm. Supported scheduling strategies include:

  • Random: Randomly selects one worker from all available workers.
  • Round-Robin: Allocates requests in a sequential, round-robin manner across workers.
  • Cache-Aware: Intelligent cache routing strategy that maintains a radix tree for each worker, selecting the worker with the highest cache reuse through prefix matching.
  • Power-of-Two: Shortest queue strategy where two workers are randomly selected, compared based on real-time load (active request count & queue length), and the less loaded worker is chosen.
  • Bucket: Length-based bucketing strategy that divides requests by prompt length into different ranges, directing requests within specific length ranges to designated workers. Bucket boundaries dynamically adjust based on overall cluster load.
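Two of these strategies can be sketched in a few lines. This is a simplified illustration: the `Worker` class is hypothetical, and the cache-aware variant compares the request against a flat list of cached token sequences instead of the per-worker radix tree described above.

```python
import random

class Worker:
    def __init__(self, name):
        self.name = name
        self.active = 0    # requests currently running
        self.queued = 0    # requests waiting in the queue
        self.cached = []   # token sequences whose KV Cache this worker holds

    @property
    def load(self):
        return self.active + self.queued

def common_prefix_len(a, b):
    """Number of leading tokens shared by two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route_power_of_two(workers, rng):
    """Sample two distinct workers and keep the less loaded one."""
    a, b = rng.sample(workers, 2)
    return a if a.load <= b.load else b

def route_cache_aware(workers, tokens):
    """Pick the worker whose cache shares the longest prefix with the request."""
    def best_match(worker):
        return max((common_prefix_len(tokens, c) for c in worker.cached), default=0)
    return max(workers, key=best_match)

workers = [Worker("w0"), Worker("w1"), Worker("w2")]
workers[0].queued, workers[1].queued, workers[2].active = 5, 5, 1
workers[1].cached.append([1, 2, 3, 4])
```

With this setup, `route_cache_aware(workers, [1, 2, 3, 9])` selects `w1` because it already holds a three-token prefix, even though `w2` is less loaded. That tension between reuse and load balance is exactly what the router simulation is meant to expose.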

3. Inference Engine Simulator: Instance Inference Engine Simulator

This module provides fine-grained modeling of internal behaviors within a single inference instance, faithfully replicating the core behaviors of real inference frameworks:

  • Discrete Execution Steps: Divides the inference process into discrete steps, including tokenization, scheduling enqueue, Prefill/Decode batch processing, KV Cache loading/eviction, and detokenization.
  • State Transitions: Simulates state transitions between waiting queues, running queues, and swapped queues.
  • Timing Overlap Modeling: Models the overlap of CPU scheduling and GPU execution timings to ensure micro-level temporal fidelity.
  • Performance Metrics Collection: Automatically captures the time spent by each request at various stages—from arrival in the system, entering the waiting queue, being scheduled to the execution queue, completing inference, and returning the output result—and aggregates these into end-to-end performance metrics such as TTFT, TPOT, and throughput.
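The aggregation step can be pictured as follows. The sketch assumes the commonly used definitions (TTFT as first-token time minus arrival time, TPOT as the mean gap between subsequent output tokens, throughput as output tokens divided by the wall-clock span); the `RequestTimeline` record and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RequestTimeline:
    arrival: float       # request enters the system (seconds)
    first_token: float   # Prefill finished, first token emitted
    finish: float        # last token emitted
    output_tokens: int

def ttft(r):
    """Time to first token."""
    return r.first_token - r.arrival

def tpot(r):
    """Average time per output token after the first one."""
    if r.output_tokens <= 1:
        return 0.0
    return (r.finish - r.first_token) / (r.output_tokens - 1)

def summarize(timelines):
    """Aggregate per-request timelines into end-to-end metrics."""
    n = len(timelines)
    span = max(r.finish for r in timelines) - min(r.arrival for r in timelines)
    return {
        "avg_ttft_s": sum(ttft(r) for r in timelines) / n,
        "avg_tpot_s": sum(tpot(r) for r in timelines) / n,
        "throughput_tok_s": sum(r.output_tokens for r in timelines) / span,
    }

summary = summarize([
    RequestTimeline(arrival=0.0, first_token=0.5, finish=2.5, output_tokens=11),
    RequestTimeline(arrival=1.0, first_token=2.0, finish=4.0, output_tokens=21),
])
```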

Through the coordinated simulation of these three layers, Tair-KVCache-HiSim achieves high alignment with real systems both in terms of macro load characteristics and micro-execution timing. This robust foundation enables subsequent performance analysis and configuration optimization, providing critical decision support for optimizing cache resource allocation and cost-performance trade-offs.

3.3 Inference Engine Simulator: The Core of High-Fidelity Simulation

The Inference Engine Simulator is the core module of Tair-KVCache-HiSim responsible for end-to-end simulation of LLM inference behavior. It achieves high-fidelity and independently verifiable modeling by decoupling three key subsystems—request scheduling, KV Cache management, and batched execution—and integrating them under a unified global clock mechanism to ensure temporal consistency across components. As illustrated in the figure, the simulator comprises the following three co-operating submodules:

Inference Engine Simulator Structure Diagram

3.3.1 SchedulerSimulator: High-Fidelity Replication of Scheduling Behavior

The SchedulerSimulator faithfully replicates the scheduling logic of mainstream LLM inference frameworks (e.g., SGLang and vLLM) and explicitly models the state transitions of requests throughout their lifecycle. Its overall workflow aligns with the scheduling pipeline described in Section 2. At the implementation level, the system explicitly maintains four key queues:

  • Waiting Queue: The initial holding queue for newly arrived requests.
  • Prefetch Queue: Requests currently undergoing KV Cache prefetching.
  • Running Queue: Requests actively being executed by the inference engine.
  • Swapped Queue: Requests evicted from GPU memory due to insufficient VRAM and temporarily stored in host memory.

The scheduler supports the various scheduling policies introduced in Section 2 (e.g., Prefill-first, Decode-first, Chunked Prefill, PD separation), which are not repeated here for brevity.

Moreover, the SchedulerSimulator tightly interacts with the KVCacheManagerSimulator. Before deciding whether to promote a request from the Waiting Queue to the Running Queue, it queries the request’s KV Cache prefetch status and enforces the configured prefetch policy—best_effort, wait_complete, or timeout—to determine whether scheduling should be blocked. This mechanism ensures that the simulation accurately captures the real-system impact of “prefetch progress” on scheduling latency, thereby preserving high fidelity in end-to-end performance prediction.
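The gating decision can be sketched as a small predicate. The semantics assumed here are one plausible reading of the three policy names: best_effort never blocks, wait_complete blocks until the prefetch finishes, and timeout blocks until the prefetch finishes or a deadline passes. `PrefetchState`, `can_schedule`, and the default `timeout_s` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PrefetchState:
    started_at: float   # simulated time at which the prefetch was issued
    done: bool          # True once all requested blocks have arrived

def can_schedule(policy, state, now, timeout_s=0.5):
    """Return True if the request may move from the Waiting Queue to the Running Queue."""
    if state is None or state.done:
        return True                      # nothing to wait for
    if policy == "best_effort":
        return True                      # use whatever has arrived so far
    if policy == "wait_complete":
        return False                     # block until the prefetch is complete
    if policy == "timeout":
        return now - state.started_at >= timeout_s
    raise ValueError(f"unknown prefetch policy: {policy}")
```

For example, a request whose prefetch started at t = 0.0 and is still in flight at t = 0.2 would be schedulable under best_effort but held back under wait_complete and under timeout with a 0.5 s deadline.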

3.3.2 KVCacheManagerSimulator: Modeling Multi-Level Distributed Cache Behavior

Interaction Flow Diagram Between Scheduler and KVCacheManager

The KVCacheManagerSimulator is the first open-source inference simulator to fully model a three-level KV Cache storage hierarchy (L3/L2/L1), supporting heterogeneous storage media (such as SSD, host DRAM, and GPU HBM) with differentiated configurations in terms of capacity, bandwidth, and cost.

Its core workflow is as follows:

  • Prefix Matching Before Queue Entry: Before a request enters the Waiting Queue, a prefix-matching query is performed against the multi-level cache pools to determine hit status at each level.
  • L3 → L2 Asynchronous Prefetch: If a sufficient portion of the context (e.g., exceeding a configurable length threshold) is found in L3 (e.g., SSD), an asynchronous prefetch from L3 to L2 (host DRAM) is triggered.
  • Prefetch Policy Enforcement at Scheduling Time: When the scheduler prepares to execute the Prefill phase of the request, it consults the configured prefetch policy (best_effort, wait_complete, or timeout) to decide whether to block scheduling until prefetch completion.
  • L2 → L1 Migration During GPU Overlap Window: Once the request is admitted into the Running Queue, the matched KV Cache blocks are transferred from L2 to L1 (GPU HBM) during the CPU–GPU time overlap window—i.e., while the previous batch is still executing on the GPU.
  • Compute Launch Only After L1 Load Completion: The model’s forward computation is initiated only after the required KV Cache blocks have been fully loaded into L1.

By implementing a multi-level Radix Tree-based prefix cache and modeling memory pools, asynchronous data movement, and prefetch policies for each cache tier, this module can accurately simulate real-world execution behavior—producing precise estimates of cache hit rates, I/O volumes per level, and transfer latencies—without performing actual memory allocation or data movement. These metrics serve as critical inputs for scheduling decisions and performance prediction.

Furthermore, the simulation of HiCache eviction policies (e.g., LRU, LFU) and the Radix Tree structure ensures that cache management behavior remains consistent with that of real systems.
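To make the idea of simulating cache behavior without moving data concrete, here is a minimal sketch of a tiered prefix lookup with LRU eviction. It is a deliberate simplification: blocks are keyed by their full token prefix in a flat ordered map instead of a radix tree, and the `Tier` class, capacities, and bandwidth figures are illustrative assumptions.

```python
from collections import OrderedDict

class Tier:
    """One cache level with LRU eviction; stores block keys only, no real data."""
    def __init__(self, name, capacity_blocks, bandwidth_bytes_per_s):
        self.name = name
        self.capacity = capacity_blocks
        self.bandwidth = bandwidth_bytes_per_s
        self.blocks = OrderedDict()

    def touch(self, key):
        """Return True on a hit and mark the block as most recently used."""
        if key in self.blocks:
            self.blocks.move_to_end(key)
            return True
        return False

    def insert(self, key):
        self.blocks[key] = True
        self.blocks.move_to_end(key)
        while len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)   # evict the least recently used block

def block_keys(tokens, block_size):
    """Key each full block by the whole prefix up to its end, so a hit implies a prefix hit."""
    return [tuple(tokens[:end]) for end in range(block_size, len(tokens) + 1, block_size)]

def match_prefix(tiers, tokens, block_size):
    """Count hit blocks per tier along the longest cached prefix; tiers are ordered L1, L2, L3."""
    hits = {t.name: 0 for t in tiers}
    for key in block_keys(tokens, block_size):
        tier = next((t for t in tiers if t.touch(key)), None)
        if tier is None:
            break                             # first miss ends the reusable prefix
        hits[tier.name] += 1
    return hits

def transfer_seconds(num_blocks, block_bytes, source_tier):
    """Estimated time to read blocks out of a tier, limited by its bandwidth."""
    return num_blocks * block_bytes / source_tier.bandwidth

l1 = Tier("L1", capacity_blocks=2, bandwidth_bytes_per_s=1e12)
l2 = Tier("L2", capacity_blocks=4, bandwidth_bytes_per_s=5e10)
l3 = Tier("L3", capacity_blocks=8, bandwidth_bytes_per_s=5e9)
tokens = list(range(8))
keys = block_keys(tokens, block_size=2)
l1.insert(keys[0]); l2.insert(keys[1]); l3.insert(keys[2])
hits = match_prefix([l1, l2, l3], tokens, block_size=2)
```

Here `hits` reports one block per tier, and `transfer_seconds` turns the L3 and L2 hit counts into the prefetch and load delays that the scheduler would then see.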

3.3.3 BatchRunnerEstimator: Fine-Grained and Highly Generalizable Single-Step Latency Prediction

Implementation of BatchRunnerEstimator

To meet the core requirement of high-accuracy, low-cost simulation of LLM inference latency, BatchRunnerEstimator is designed as a fine-grained, multi-paradigm, and pluggable single-step latency prediction engine. Its primary objective is to accurately capture the nonlinear performance variations in execution latency caused by intra-batch request heterogeneity—such as differing prompt lengths and degrees of KV cache reuse—under dynamic batching scenarios, while maintaining reliable generalization capabilities when confronted with new models, hardware platforms, or system configurations.

BatchRunnerEstimator departs from the conventional simulation approach that relies on coarse-grained batch-level statistics (e.g., average input length) and instead adopts request-level state descriptors as the fundamental unit for latency prediction. Each batch is represented as a list of requests, where each request is characterized by a (cache_len, input_len) tuple:

  • cache_len: the length of reusable historical KV Cache (i.e., the number of tokens already cached),
  • input_len: the number of new tokens to be computed in the current step.
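A small example shows why per-request descriptors matter. Under a simplified cost model (an assumption for illustration: attention work per layer grows with input_len × (cache_len + input_len), counting the QK^T and AV products at 2·d FLOPs per query-key pair each), two batches with the same average input length can differ substantially in attention cost.

```python
def attention_flops(batch, hidden_size):
    """Approximate per-layer attention FLOPs for a batch of (cache_len, input_len) tuples."""
    return sum(4 * hidden_size * input_len * (cache_len + input_len)
               for cache_len, input_len in batch)

# Both batches average 1000 new tokens per request and have no cached prefix.
uniform = [(0, 1000), (0, 1000)]
skewed = [(0, 100), (0, 1900)]

ratio = attention_flops(skewed, 4096) / attention_flops(uniform, 4096)
```

In this toy model the skewed batch needs about 1.8 times the attention FLOPs of the uniform one, a difference that a batch-level average input length cannot express.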

Building upon this fine-grained representation, we design a pluggable hybrid latency modeling framework that supports multiple prediction strategies to balance accuracy and generalization:

Sample-Based Interpolation/Regression Models:
Construct model-specific latency mapping functions via offline profiling, suitable for known hardware–model combinations.
Operator-Level Latency Composition:
To enhance generalization—especially for unseen scenarios such as new hardware or novel model architectures—latency is estimated by summing predicted durations of individual operators:

  • Operator Categorization: Operators are classified into two broad types:

    • Compute-bound: including GEMM, MoE (cache-agnostic), attention (cache-dependent), elementwise operations, embedding lookups, etc.;
    • Communication-bound: primarily data movement across memory hierarchies or devices.
  • Roofline Model: Provides a theoretical performance upper bound for compute operators on a given hardware platform. The core idea is that GPU performance is constrained by two key hardware limits: Peak Compute Throughput (in FLOP/s) and Memory Bandwidth (in Bytes/s). For any compute operator, given its required floating-point operations (FLOPs) and memory traffic (Bytes), the theoretical minimum execution latency is:
    $$T_{\text{roofline}} = \max\left(\frac{\text{FLOPs}}{\text{Peak Compute Throughput}},\ \frac{\text{Bytes}}{\text{Memory Bandwidth}}\right)$$
    This reflects whether the operator is compute-bound or memory-bound, with the longer of the two terms defining the lower latency bound. For communication operators, latency is dominated by data volume and link bandwidth, yielding a simplified estimate:
    $$T_{\text{comm}} \approx \frac{\text{Data Volume}}{\text{Link Bandwidth}}$$
  • Sample-Based Operator Regression: Build operator-level latency mappings through offline profiling.
  • Theory-Guided Scaled Regression: Starting from the Roofline estimate, a small set of real measurements is used to learn a scale factor, producing a more realistic prediction:
    $$T_{\text{pred}} = \alpha \cdot T_{\text{roofline}}$$
    where $\alpha$ is the learned scale factor.
  • Integration with External Batch Latency Predictors: Supports plugging in third-party tools such as AIConfigurator for batch-level latency estimation.
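The Roofline bound and the scaled regression can be sketched together. The hardware numbers below are placeholders, and fitting the scale factor by least squares through the origin is an assumption about how such a factor might be learned, not a description of the tool's internals.

```python
def roofline_latency(flops, bytes_moved, peak_flops, mem_bandwidth):
    """Lower bound on operator latency: the slower of compute time and memory time."""
    return max(flops / peak_flops, bytes_moved / mem_bandwidth)

def comm_latency(data_bytes, link_bandwidth):
    """Simplified communication estimate: data volume over link bandwidth."""
    return data_bytes / link_bandwidth

def fit_scale(roofline_estimates, measured):
    """Scale factor alpha minimizing sum((alpha * r - m)^2) over profiled samples."""
    num = sum(r * m for r, m in zip(roofline_estimates, measured))
    den = sum(r * r for r in roofline_estimates)
    return num / den

PEAK_FLOPS = 300e12      # placeholder: 300 TFLOP/s
MEM_BANDWIDTH = 2e12     # placeholder: 2 TB/s

# A large GEMM is compute-bound; a decode-style operator is memory-bound.
gemm = roofline_latency(6e12, 1e9, PEAK_FLOPS, MEM_BANDWIDTH)
decode = roofline_latency(1e9, 16e9, PEAK_FLOPS, MEM_BANDWIDTH)

# Learn alpha from a few (Roofline estimate, measured latency) pairs, then rescale.
alpha = fit_scale([1.0, 2.0], [1.5, 3.0])
predicted_gemm = alpha * gemm
```

Because real kernels rarely reach peak throughput, the fitted factor is typically above 1; a handful of measurements is enough to anchor the theoretical estimate to a given platform.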

Users can dynamically switch between prediction backends based on scenario requirements—e.g., favoring maximum accuracy for known workloads versus rapid generalization to new models or configurations. This design enables Tair-KVCache-HiSim to deliver reliable latency estimates for previously unseen models, quantization schemes, or parallelism configurations—without requiring exhaustive offline profiling.

3.3.4 Global Clock and Event-Driven Temporal Model

To accurately capture the overlaps and dependencies among asynchronous operations—such as CPU scheduling, GPU computation, and KV Cache data transfers—Tair-KVCache-HiSim introduces a unified virtual global clock as the temporal reference for all modules and employs a discrete-event simulation paradigm to drive the entire simulation workflow.
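A discrete-event loop with a virtual clock can be very small. The sketch below is a generic illustration built on a priority queue; the `EventLoop` class and the 5 ms CPU and 30 ms GPU step costs are assumptions. It models the CPU preparing the next batch while the GPU is still executing the current one.

```python
import heapq
from itertools import count

class EventLoop:
    """Virtual global clock: time jumps directly from one event to the next."""
    def __init__(self):
        self.now = 0.0
        self._queue = []
        self._seq = count()   # tie-breaker keeps same-time events in insertion order

    def schedule(self, delay, callback, *args):
        heapq.heappush(self._queue, (self.now + delay, next(self._seq), callback, args))

    def run(self):
        while self._queue:
            self.now, _, callback, args = heapq.heappop(self._queue)
            callback(*args)

CPU_S, GPU_S, NUM_BATCHES = 0.005, 0.030, 3   # assumed per-batch costs
loop = EventLoop()
state = {"gpu_busy": False, "prepared": None}
log = []

def launch(batch_id):
    state["gpu_busy"], state["prepared"] = True, None
    log.append((round(loop.now, 3), "launch", batch_id))
    loop.schedule(GPU_S, gpu_done, batch_id)
    if batch_id + 1 < NUM_BATCHES:
        loop.schedule(CPU_S, cpu_done, batch_id + 1)   # CPU overlaps with GPU execution

def cpu_done(batch_id):
    state["prepared"] = batch_id
    if not state["gpu_busy"]:
        launch(batch_id)

def gpu_done(batch_id):
    state["gpu_busy"] = False
    log.append((round(loop.now, 3), "done", batch_id))
    if state["prepared"] is not None:
        launch(state["prepared"])

loop.schedule(CPU_S, cpu_done, 0)
loop.run()
```

With overlap, the three batches finish at a simulated 0.095 s (one CPU step plus three GPU steps) instead of the 0.105 s a strictly serial CPU-then-GPU model would predict. Gaps of this kind are what the shared clock is there to capture.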

3.4 Independently Verifiable Latency Modeling

To prevent error cascading and ensure simulation fidelity, Tair-KVCache-HiSim provides isolated validation interfaces for each core module, enabling end-to-end comparison against real-system behavior independently of other components. The specific validation strategies are as follows:

BatchRunnerEstimator (Batch Inference Execution)
Accuracy is validated via micro-benchmarking:

  1. Run a fixed batch—defined by a specific list of (cache_len, input_len) tuples—on a real GPU and record the actual execution latency for Prefill and Decode phases.
  2. Inject the identical batch configuration into the simulator and invoke BatchRunnerEstimator in isolation to predict latency.
  3. Compare simulated vs. measured latencies and compute the Mean Absolute Percentage Error (MAPE).
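For reference, MAPE over a set of measured and predicted latencies can be computed as below; the sample values are made up for illustration.

```python
def mape(measured, predicted):
    """Mean Absolute Percentage Error, in percent, relative to the measured values."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - m) / m for m, p in zip(measured, predicted))
    return 100.0 * total / len(measured)

# Hypothetical latencies in milliseconds: errors of 10% and 5% average to 7.5%.
error_pct = mape(measured=[10.0, 20.0], predicted=[11.0, 19.0])
```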

SchedulerSimulator (Scheduling Behavior)
Validated through scheduling trace replay:

1.  Export a complete scheduling log from a real inference engine, including:

  • Per-request timestamps (arrival, exit from waiting queue, entry into running queue),
  • Reasons for skipped or delayed scheduling,
  • Batch snapshots at each scheduling decision (request IDs and states in all queues).

2.  In the simulator, freeze KVCacheManager and BatchRunner behaviors (e.g., force all requests to have cache_miss = 0 and fix batch latency to a constant), and enable only SchedulerSimulator.

3.  Replay the same request sequence and validate:

  • Consistency in scheduling order,
  • Deviation in queue residence times,
  • Match in the set of skipped or delayed requests.
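The first two checks might be scripted along these lines, assuming both the real log and the simulated run have been reduced to a mapping from request ID to (entered waiting queue, entered running queue) timestamps. That record format and the function name are hypothetical.

```python
def compare_schedules(real, sim):
    """Compare scheduling order and queue residence times between two runs.

    real, sim: dict mapping request ID -> (enter_waiting_s, enter_running_s).
    Returns (same_order, max_residence_deviation_s).
    """
    real_order = sorted(real, key=lambda rid: real[rid][1])
    sim_order = sorted(sim, key=lambda rid: sim[rid][1])
    deviations = [
        abs((sim[rid][1] - sim[rid][0]) - (real[rid][1] - real[rid][0]))
        for rid in real
    ]
    return real_order == sim_order, max(deviations)

real_log = {"a": (0.00, 0.10), "b": (0.05, 0.30)}
sim_log = {"a": (0.00, 0.12), "b": (0.05, 0.29)}
same_order, max_dev = compare_schedules(real_log, sim_log)
```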

KVCacheManagerSimulator (Cache Management)
Validated via cache event tracing:

1.  Inject a synthetically generated multi-turn dialogue workload into a real system and use profiling tools to capture:

  • Initial cache hit/miss counts across L1/L2/L3 upon request arrival,
  • Volume and latency of L3→L2 data transfers during waiting,
  • Volume and latency of L2→L1 transfers just before Prefill execution,
  • Final L1 cache hit rate during batch inference.

2.  In the simulator, freeze the scheduler (fix scheduling order) and BatchRunner (fix latency), and run only KVCacheManagerSimulator with the same request sequence.
3.  Compare the simulated cache event stream against ground truth, validating:

  • Error in per-level cache hit rates,
  • Deviation in prefetched data volume,
  • Consistency in eviction policy trigger conditions.

Detailed experimental results and quantitative validation metrics are presented in Section 4.

3.5 Efficient Configuration Space Exploration

To support efficient deployment decisions under SLO constraints, Tair-KVCache-HiSim implements a hierarchical, progressive configuration space exploration mechanism. The process operates in three stages:

1. Low-Dimensional SLO-Feasible Configuration Screening:

Given user-specified TTFT and TPOT SLOs, the system leverages the high-fidelity BatchRunnerEstimator to rapidly evaluate key execution-layer configurations—such as tensor parallelism degree, quantization schemes, and operator optimizations. Using an adaptive binary search over the single-step latency prediction model, it efficiently identifies the boundary of configurations that satisfy the SLOs and constructs an initial low-dimensional Pareto candidate set.
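The boundary search can be illustrated with a plain binary search over one configuration axis. The sketch assumes predicted latency is non-decreasing along that axis (here, batch size) and uses a toy linear latency model in place of the real single-step predictor; both are assumptions for the example.

```python
def max_feasible(predict_latency, slo, lo, hi):
    """Largest integer x in [lo, hi] with predict_latency(x) <= slo, or None if none exists.

    Assumes predict_latency is non-decreasing in x.
    """
    if predict_latency(lo) > slo:
        return None
    while lo < hi:
        mid = (lo + hi + 1) // 2          # bias upward so the loop always makes progress
        if predict_latency(mid) <= slo:
            lo = mid                      # mid is feasible, search above it
        else:
            hi = mid - 1                  # mid violates the SLO, search below it
    return lo

def toy_decode_latency_ms(batch_size):
    """Stand-in for a single-step latency predictor (hypothetical coefficients)."""
    return 10.0 + 0.5 * batch_size

# Largest decode batch size that keeps per-step latency within a 30 ms TPOT target.
boundary = max_feasible(toy_decode_latency_ms, slo=30.0, lo=1, hi=256)
```

Finding the boundary takes about eight predictor calls here instead of 256, which is what keeps the first screening stage cheap.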

2. Global Routing Strategy Co-Evaluation:

For this candidate set, the system jointly evaluates the impact of multiple global routing policies—including cache_aware, power_of_two, and bucket—on request queuing delay, load balancing, and cache reuse efficiency. This step ensures that inter-node scheduling behavior aligns with end-to-end performance targets.

3. Multi-Level KV Cache Structure Optimization:

Building on the feasible configurations identified above, the system further optimizes the multi-level KV Cache storage architecture, including:

  • Storage tiers (HBM / DRAM / DISK)
  • Capacity allocation across levels
  • Prefetching policies
  • Eviction algorithms (e.g., LRU, LFU)

By decomposing the high-dimensional combinatorial explosion problem into manageable subtasks, this three-stage pipeline can generate a three-dimensional Pareto frontier across latency, throughput, and cost within just a few hundred simulation runs. This transforms Tair-KVCache-HiSim from a passive performance evaluation tool into an active deployment recommendation engine, enabling practical, SLO-aware inference system optimization.
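Extracting the frontier from a set of simulated configurations is a standard dominance filter. The sketch below assumes each simulation run is summarized as a dict with `latency` and `cost` (lower is better) and `throughput` (higher is better); the sample configurations are invented.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    no_worse = (a["latency"] <= b["latency"]
                and a["throughput"] >= b["throughput"]
                and a["cost"] <= b["cost"])
    better = (a["latency"] < b["latency"]
              or a["throughput"] > b["throughput"]
              or a["cost"] < b["cost"])
    return no_worse and better

def pareto_front(configs):
    """Keep only configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]

candidates = [
    {"name": "A", "latency": 1.0, "throughput": 100, "cost": 10},
    {"name": "B", "latency": 2.0, "throughput": 90, "cost": 12},   # worse than A on all three
    {"name": "C", "latency": 0.5, "throughput": 80, "cost": 20},
    {"name": "D", "latency": 1.5, "throughput": 150, "cost": 9},
]
front = pareto_front(candidates)
```

Configuration B drops out because A beats it on every axis, while A, C, and D each remain the best choice for some weighting of latency, throughput, and cost.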

4. Simulation Performance

We conducted a comprehensive evaluation of Tair-KVCache-HiSim's simulation speed and accuracy under real-world production-level workloads.

4.1 Speed: Extreme Cost Advantage — Simulation Overhead Reduced by Over 390,000×

Taking a typical production scenario as an example:

Dataset: A real user trace comprising 40,520 requests over a 6-hour (21,602-second) period, based on the Qwen3-Coder-480B model.
Real GPU Benchmark Cost: Running a full benchmark on a GPU instance with 8 × 141 GB of VRAM (priced at ¥110,000/month) would take 6 hours, incurring a total cost of ¥1,833.5.
Simulation Execution Cost: On a general-purpose cloud instance with 2 vCPUs and 4 GiB memory (priced at ¥47.5/month), Tair-KVCache-HiSim completes the full simulation in just 257 seconds, with a single-run cost as low as ¥0.0047.

This reduces the cost to 1/390,106 of the original expense and shortens this evaluation from 6 hours to under 5 minutes, dramatically accelerating the deployment tuning and capacity planning process for LLM services.
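The simulation-side figures can be reproduced with a few lines of arithmetic. The sketch assumes a 30-day billing month and takes the ¥1,833.5 GPU benchmark cost as stated above.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600    # assumption: 30-day billing month

sim_price_per_month = 47.5            # CNY, 2 vCPU / 4 GiB instance
sim_runtime_s = 257                   # simulation wall-clock time
trace_duration_s = 21602              # length of the replayed trace
gpu_benchmark_cost = 1833.5           # CNY, taken as stated

# Cost of one simulation run, rounded to four decimals as in the text.
sim_cost = round(sim_price_per_month / SECONDS_PER_MONTH * sim_runtime_s, 4)

cost_ratio = gpu_benchmark_cost / sim_cost        # how many times cheaper
speedup = trace_duration_s / sim_runtime_s        # how many times faster than real time

print(f"simulation cost: {sim_cost} CNY")
print(f"cost ratio: {cost_ratio:,.0f}x, speedup: {speedup:.0f}x")
```

The cost ratio comes out at roughly 390,106 and the wall-clock speedup at roughly 84×, so the saving is driven mostly by the price gap between an 8-GPU instance and a 2-vCPU instance.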

4.2 Accuracy: High-Precision Prediction with Controllable End-to-End Error

We validate the simulator’s accuracy at two levels:

1. BatchRunnerEstimator: Single-Step Latency Prediction Accuracy

In dynamic batching scenarios, we used profiling tools to capture 958 heterogeneous batch configurations (with batch sizes ranging from 1 to 28) from a real-world deployed inference service and compared the measured execution latencies against the simulator’s predictions. The results show an average latency prediction error of only 4.24%.

The shaded area in the figure represents the coverage range of all sample points.

2. InferenceEngineSimulator: End-to-End System Metric Accuracy

We evaluated the Qwen3-8B model on an A100-SXM4-80GB GPU using the SGLang v0.5.6 inference engine, with a multi-turn dialogue workload constructed from the ShareGPT dataset. Four KV Cache configurations were tested:

  • IDLE: Radix Cache disabled
  • DEVICE: KV Cache stored exclusively in GPU HBM
  • HOST: Two-level storage enabled (HBM + Host DRAM)
  • DISK: Three-level storage enabled (HBM + DRAM + DISK)

The simulation results are compared against real-system measurements as follows:

| Configuration | Avg. TTFT Error (%) | Avg. Throughput Error (%) | Cache Hit Rate Error (%) |
|---|---|---|---|
| IDLE | 3.25 | 1.37 | - |
| DEVICE | 4.49 | 1.95 | 0 |
| HOST | 5.06 | 1.99 | 0 |
| DISK | 10.75 | 1.88 | 1.2 |


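Tair-KVCache-HiSim achieves end-to-end high fidelity: average throughput error stays below 2% in all four configurations, and average TTFT error is about 5% or less in the IDLE, DEVICE, and HOST configurations, rising to 10.75% in the DISK configuration. At the same time, it cuts the cost of LLM inference performance evaluation by five orders of magnitude and the evaluation time from hours to minutes.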

Future Work

The value of KV Cache simulation and analysis extends beyond optimizing existing systems—it also provides forward-looking guidance for the evolution of future AI infrastructure. By enabling end-to-end simulation across diverse workload patterns and heterogeneous hardware resources, we can rapidly respond to changing business demands, precisely identify performance bottlenecks—whether computational or storage-related—and automatically generate SLO-aware (e.g., latency, throughput) optimal configurations and tuning recommendations based on known models and hardware platforms.

Given the rapid iteration of large language models—including emerging architectures such as Mamba and hybrid attention, evolving sparsification strategies, and advanced techniques like speculative decoding—the traditional paradigm of “build hardware first, adapt software later” is no longer sustainable. Future infrastructure design must shift toward a co-design, workload-driven paradigm, where compute capabilities and KV Cache storage hierarchies are jointly evolved across multiple dimensions: server form factors, memory hierarchies, interconnect topologies, and even hyperscale node architectures—to maximize throughput and minimize cost under strict SLO constraints.

Realizing this vision requires deep integration of KV Cache as the central state carrier. The cache is no longer a mere auxiliary component but a critical nexus connecting algorithms, systems, and hardware. High-fidelity simulation serves as the core engine enabling such scientifically grounded decision-making. Looking ahead, we will continue to enhance the simulator with broader support for cutting-edge models, along with ongoing improvements in simulation speed and accuracy.
