Alibaba Cloud Tair KVCache Team joined forces with the SGLang HiCache Team, the Ant AI Infra-Inference Service Team, and the Alibaba Cloud server vibrato heterogeneous computing Team to jointly launch a hierarchical sparse framework for Sparse Attention. This article details the architecture design and implementation of the framework.
In the previous article, we described in detail how HiCache breaks through the KVCache capacity bottleneck with a tiered storage architecture (GPU → CPU → remote storage), expanding the effective cache capacity from 40 GB to the TB scale and enabling large-scale deployment of long-context, high-concurrency LLM inference services.
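However, when the context length reaches 128K or even one million tokens, two new bottlenecks begin to emerge: HBM bandwidth, because every Decode step must read the full KV Cache, and GPU memory capacity, because the full KV Cache must stay resident on the GPU.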
This article introduces hierarchical sparse attention: the full KV Cache is stored on the CPU, while the GPU keeps only a Top-k LRU Buffer. This provides a practical path through both constraints. We will cover SGLang’s hierarchical sparse attention framework in three parts:
Hierarchical sparsity marks another leap in the KVCache management paradigm: from HiCache's "tiered storage → extended capacity" to this article's "sparsity + tiering → breaking through the dual constraints of bandwidth and capacity". It opens up a new technical path for ultra-long context inference.
This series of technical articles systematically breaks down the evolution path of KVCache technology for agentic inference:
Tair KVCache, as an extension of the Tair product capabilities of Alibaba Cloud Database, essentially represents three successive shifts of the caching paradigm:
🔹From Redis's "cache data → reduce I/O"
🔹To GPU KVCache's "cache intermediate computation state → reduce repeated computation"
🔹Then to Tair KVCache's "large-scale, agent-oriented attention state management → reshaping the cost model of large-model inference". This indicates that the cache is being upgraded from an auxiliary component to a core capability of the AI infrastructure layer, making "state" storable, shareable, and schedulable, and supporting the large-scale inference foundation of the agent era.
In the previous article, "Challenges Agentic Inference Poses to KVCache and a Deep Dive into SGLang HiCache," we explained how HiCache uses a hierarchical storage architecture, GPU memory -> CPU memory -> 3FS remote storage, to break through the KVCache capacity bottleneck. With heat-aware scheduling and asynchronous prefetching, HiCache expands the effective cache capacity from the original 40 GB of GPU memory to the TB scale, enabling production scale deployment of long context, high concurrency LLM inference.
In real production environments, HiCache has already shown clear value: cache hit rate increased from 40% to 80%, average TTFT dropped by 56%, and inference QPS improved by 2x.
But once we look at more extreme long context scenarios, such as 128K or even million token context windows, new bottlenecks appear.
Attention has a distinct performance profile in long context inference. Each Decode step needs to load the full KV Cache from HBM into the compute units before running Attention. Since the KV Cache grows linearly with sequence length, the Attention cost grows linearly as well.
The core issue is that Attention has low arithmetic intensity. Compared with the amount of memory access, the actual floating point work is not enough to saturate the GPU compute units. Attention is therefore a typical memory-bound operation, limited by HBM bandwidth. As context length grows from 32K to 128K and beyond, this bandwidth bottleneck becomes one of the main limits on long context inference performance.
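The low arithmetic intensity can be checked with a back-of-the-envelope calculation. The sketch below is illustrative only: the function names are hypothetical, and it assumes one KV head, FP16 storage (2 bytes per element), and that K and V are each read from HBM exactly once per Decode step.

```python
def kv_bytes_per_step(seq_len, head_dim, bytes_per_elem=2):
    """Bytes of K and V read from HBM by one KV head in one Decode step."""
    # K and V each hold seq_len vectors of head_dim elements.
    return 2 * seq_len * head_dim * bytes_per_elem


def decode_attention_intensity(seq_len, head_dim, bytes_per_elem=2,
                               q_heads_per_kv_head=1):
    """FLOPs performed per byte of KV traffic (illustrative model)."""
    # QK^T: seq_len dot products of length head_dim -> 2 * seq_len * head_dim FLOPs.
    # PV: a weighted sum of seq_len value vectors -> another 2 * seq_len * head_dim.
    flops = 4 * seq_len * head_dim * q_heads_per_kv_head
    return flops / kv_bytes_per_step(seq_len, head_dim, bytes_per_elem)


for n in (32_768, 131_072):
    print(n, kv_bytes_per_step(n, 128), decode_attention_intensity(n, 128))
```

Under these assumptions the intensity is about 1 FLOP per byte regardless of sequence length, while the bytes read grow linearly with it. Sharing one KV head across several query heads raises the ratio only by that small factor, which is why longer contexts translate almost directly into more HBM traffic per step.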
Dynamic Sparse Attention, or DSA, starts from a basic property of Attention: during autoregressive generation, not every historical token contributes equally to the current output. Attention distributions often have a clear long tail. A small number of important tokens account for most of the Attention Score, while many tokens contribute very little. More importantly, the set of important tokens changes dynamically across queries, so it cannot be fixed in advance with static rules.
DSA turns this observation into a "Select-then-Compute" workflow with three stages:
Representative DSA algorithms include:
Although sparse attention has made breakthroughs at the computational level, its execution process has inherent sequential dependencies:

In Stage 1, the algorithm evaluates the importance of each token or page. In Stage 2, it selects Top-k entries based on the scores. Only after Stage 2 completes does Stage 3 know which KV entries it should compute over. This creates a fundamental problem: before Top-k is known, the system cannot know which KV data will be needed. As a result, the full KVCache still has to remain on the GPU. The key constraint is that sparse attention reduces compute complexity, from O(n) to O(k), but GPU memory complexity remains O(n). After DSA is introduced, the main performance bottleneck shifts from HBM bandwidth to HBM capacity. This capacity constraint causes three problems:
The key insight for breaking through the GPU memory wall is this: since Attention only needs the Top-k entries, why not keep only those entries on the GPU and, once Top-k has been computed, use the CPU-side HiCache to load just the incremental Top-k entries on demand?
The key to hierarchical sparsity is to change where the KVCache is stored and when it is loaded (DeepSeek DSA is used as an example below):
Hierarchical sparsity process: after Prefill, the complete Latent Cache (8 GB) is offloaded to Host memory, and only the lightweight Sparse Indexer metadata is retained on the GPU. During Decode, the Indexer selects the Top-2k entries on the GPU based on that metadata, the Host filters the corresponding Latent subset and incrementally transfers it to the GPU, and the Attention computation is then performed.
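The flow above can be sketched in a few lines of NumPy. This is a simplified illustration rather than SGLang's implementation: the function and argument names are hypothetical, plain K/V arrays stand in for the MLA Latent Cache, a dot product stands in for the Indexer, and an array gather stands in for the Host-to-GPU copy.

```python
import numpy as np


def sparse_decode_step(query, index_query, index_keys, host_keys, host_values, k):
    """One Decode step: score on GPU metadata, select Top-k, gather, attend."""
    # Stage 1: score every token using only the lightweight index kept on the GPU.
    scores = index_keys @ index_query                      # shape (n,)
    # Stage 2: pick the Top-k token ids. Nothing can be loaded before this point.
    k = min(k, scores.shape[0])
    topk = np.argpartition(-scores, k - 1)[:k]
    # Host side: gather only the selected rows (stand-in for the PCIe transfer).
    keys, values = host_keys[topk], host_values[topk]
    # Stage 3: ordinary softmax attention, but over k entries instead of n.
    logits = keys @ query / np.sqrt(query.shape[0])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ values, topk


rng = np.random.default_rng(0)
n, d = 4096, 64
host_k = rng.standard_normal((n, d))
host_v = rng.standard_normal((n, d))
q = rng.standard_normal(d)
out, picked = sparse_decode_step(q, q, host_k, host_k, host_v, k=256)
print(out.shape, picked.shape)
```

In this sketch the only O(n) state that the scoring stage touches is `index_keys`; the full `host_keys` and `host_values` are read only at the k selected rows, which is the property that lets the complete cache live in Host memory.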
Represents the maximum CPU memory capacity that can be allocated by a single card;
Represents the maximum Batch Size that meets the SLO latency requirement.
Core Advantages:
Hierarchical sparsity not only addresses the compute problem but also removes the rigid constraint of GPU memory capacity, jointly optimizing compute efficiency and storage efficiency and opening up a new technical path for ultra-long context inference.
SGLang's hierarchical sparsity framework adopts a modular, pluggable three-tier architecture and achieves algorithm decoupling, backend compatibility, and non-intrusive integration through standardized interfaces. The core of the framework consists of the following modules:
SparseCoordinator (coordination layer): orchestrates the collaborative work of the three functional modules through lifecycle hooks
The architecture natively supports sparsity mechanisms built into the model (such as DeepSeek-V3.2 DSA) and also allows training-free sparse algorithms (Quest / SnapKV) to be flexibly combined with general Attention backends (FlashAttention / Triton). This provides a unified, highly extensible hierarchical sparse attention scheme for long context inference.

SparseCoordinator is the central controller of the hierarchical sparse framework. It uses Lifecycle Hooks to orchestrate the Algorithm, BackendAdaptor, and SparseKVCacheManager modules at the key points of model inference. Its design follows an event-driven model that breaks the complete retrievable sparse attention workflow into standardized hook interfaces, so that algorithms integrate with the model without intrusive changes.

SparseCoordinator divides sparse inference into two core phases:
Under SparseCoordinator's orchestration, Algorithm and BackendAdaptor, the two core functional modules, answer the questions of "what to select" and "how to map" respectively, and achieve a high degree of pluggability and extensibility through clear interface definitions.

The Algorithm layer uses an abstract base class, BaseSparseAlgorithm, to define a unified interface. It decouples sparse algorithm logic into three methods:
retrieve_topk(queries, layer_id, ...): selects the Top-k entries for the current queries.
construct_representations(...): builds the semantic representations used for retrieval, such as compressed Key representations, during Prefill or the early Decode stage.
update_representations(...): incrementally updates the Representation Pool during Decode.
Take the Quest algorithm as an example:
In the construct_representations phase, the algorithm traverses all Pages, extracts their Keys, computes the per-dimension minimum and maximum of the Keys in each Page, and stores them in the page_k_min/max Representation Pool (the memory overhead is about 1% of the full Key storage).
In the retrieve_topk phase, an upper bound estimate of the Attention Score is computed as each Page's criticality, the Top-k Pages are quickly filtered, and the result is passed to the BackendAdaptor for physical address conversion.
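The three-method interface and the Quest example can be illustrated with a small NumPy sketch. The method names come from the text, but the argument lists, the `page_upper_bounds` helper, and the class `QuestLikeAlgorithm` are assumptions made for illustration; the real SGLang signatures take additional arguments such as layer_id.

```python
from abc import ABC, abstractmethod

import numpy as np


class BaseSparseAlgorithm(ABC):
    """Hypothetical, simplified shape of the three-method interface."""

    @abstractmethod
    def construct_representations(self, keys): ...

    @abstractmethod
    def update_representations(self, new_key): ...

    @abstractmethod
    def retrieve_topk(self, query, k): ...


class QuestLikeAlgorithm(BaseSparseAlgorithm):
    """Keeps per-page, per-dimension min/max of Keys (Quest-style)."""

    def __init__(self, page_size):
        self.page_size = page_size
        self.num_tokens = 0
        self.page_k_min = None
        self.page_k_max = None

    def construct_representations(self, keys):
        # keys: (n, d). One min row and one max row per page of page_size tokens.
        n, _ = keys.shape
        pages = [keys[i:i + self.page_size] for i in range(0, n, self.page_size)]
        self.page_k_min = np.stack([p.min(axis=0) for p in pages])
        self.page_k_max = np.stack([p.max(axis=0) for p in pages])
        self.num_tokens = n

    def update_representations(self, new_key):
        # Assumes construct_representations was called first.
        page = self.num_tokens // self.page_size
        if page == self.page_k_min.shape[0]:       # the token opens a new page
            self.page_k_min = np.vstack([self.page_k_min, new_key])
            self.page_k_max = np.vstack([self.page_k_max, new_key])
        else:
            self.page_k_min[page] = np.minimum(self.page_k_min[page], new_key)
            self.page_k_max[page] = np.maximum(self.page_k_max[page], new_key)
        self.num_tokens += 1

    def page_upper_bounds(self, query):
        # Per dimension, q_i * k_i is at most max(q_i * min_i, q_i * max_i).
        return np.maximum(query * self.page_k_min,
                          query * self.page_k_max).sum(axis=1)

    def retrieve_topk(self, query, k):
        bounds = self.page_upper_bounds(query)
        return np.argsort(-bounds)[:min(k, bounds.shape[0])]
```

The bound never underestimates the best query-key dot product inside a page, so a page holding a high-scoring token cannot be ranked low by mistake; the price is that some pages may be ranked higher than their true scores warrant.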
The BackendAdaptor layer solves the problem of mapping the "logical world" to the "physical world". Different Attention backends (DSA Backend, FlashAttention, and Triton FA3) have different requirements for the format and indexing of input data, and the Adaptor is responsible for hiding these differences.
Take the FlashAttention adapter as an example: flashattentionadapter converts logical Page IDs into physical Page numbers through the req_to_token mapping table, reconstructs the PageTable, and updates the sequence length metadata (cache_seqlens, cu_seqlens_k). This enables FlashAttention to perform attention calculations on the sparse pages selected by Top-k.
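The logical-to-physical conversion can be sketched as follows. This is an assumption-laden illustration, not the adapter's code: the helper names are hypothetical, and req_to_token is assumed to map (request, token position) to a physical token slot, so that the physical page is the slot index divided by the page size.

```python
import numpy as np


def build_sparse_page_table(req_to_token, req_idx, topk_logical_pages,
                            seq_len, page_size):
    """Translate selected logical pages into a compact physical PageTable."""
    # Sort so that a partially filled last page, if selected, stays last.
    logical = np.sort(np.asarray(topk_logical_pages))
    first_positions = logical * page_size
    # The physical page is derived from the slot of the page's first token.
    physical_pages = req_to_token[req_idx, first_positions] // page_size
    # Every page is full except possibly the last page of the sequence.
    tokens_in_page = np.minimum(page_size, seq_len - first_positions)
    cache_seqlen = int(tokens_in_page.sum())
    return physical_pages, cache_seqlen


def build_cu_seqlens_k(cache_seqlens):
    """Cumulative sequence lengths over the batch, starting at 0."""
    return np.concatenate(([0], np.cumsum(cache_seqlens)))


# One request, page_size 4, 10 tokens: logical pages 0, 1, 2 live in
# physical pages 5, 2, 7 (slots 20-23, 8-11, 28-29).
req_to_token = np.array([[20, 21, 22, 23, 8, 9, 10, 11, 28, 29, 0, 0]])
pages, seqlen = build_sparse_page_table(req_to_token, 0, [2, 0], 10, 4)
print(pages, seqlen, build_cu_seqlens_k([seqlen]))
```

After this step the backend sees a shorter sequence of `cache_seqlen` tokens laid out over the selected physical pages, so the attention kernel itself needs no knowledge of the sparse selection.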
Compared with DeepSeek-V3.1, the architectural change in DeepSeek-V3.2 is the introduction of DeepSeek Sparse Attention (DSA) through continued training.
The prototype design of DSA consists of two parts: the Lightning Indexer and the Fine-grained Token Selection Mechanism. A lightweight indexer first quickly filters out the candidate Tokens most relevant to the current query Token, and high-precision attention is then computed only on this sparse candidate set.

(Note: picture from DeepSeek paper)

Key designs include:
Motivation: The Space for Optimization Due to Temporal Locality
The Top-k selection results of DSA show significant temporal locality: the Top-k sets of adjacent Decode Steps overlap heavily. Experiments show that the Top-k overlap between adjacent Steps usually reaches 80% to 90%, which means that each Decode Step theoretically needs to load less than 20% of the Top-k entries as new Cache. This provides natural headroom for incremental transfer.
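The overlap figure translates directly into the amount of data that has to move per step. The helper below is a minimal sketch with hypothetical names; the token ids in the usage example are made up purely to show the arithmetic.

```python
def topk_overlap(prev_topk, curr_topk):
    """Return (overlap ratio, sorted ids that must be newly loaded)."""
    prev, curr = set(prev_topk), set(curr_topk)
    if not curr:
        return 1.0, []
    # Entries present in both steps are already on the GPU.
    ratio = len(prev & curr) / len(curr)
    # Only entries that are new in this step need to be transferred.
    return ratio, sorted(curr - prev)


ratio, new_ids = topk_overlap(range(0, 10), range(2, 12))
print(ratio, new_ids)
```

In this toy case 8 of the 10 current entries were already selected in the previous step, so only 2 entries would have to be fetched from Host memory.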

The Trade-off Between Buffer Capacity and Hit Rate
However, as the sequence length increases, the candidate range for Top-k selection grows linearly and the difference between adjacent steps gradually widens. The LRU Buffer capacity therefore directly affects the Cache hit rate.
When the Buffer capacity equals the Top-k size (2K), the hit rate drops significantly in long sequence scenarios and I/O latency becomes the bottleneck. Expanding the Buffer to 4K to 8K, however, trades a controllable GPU memory overhead for a several-fold improvement in I/O efficiency.

Design and Implementation of LRU Diff Kernel
To take full advantage of the temporal locality of DSA, we designed a Diff Kernel based on an LRU eviction strategy. The core idea is to maintain an LRU Buffer on the GPU with 2 to 4 times the Top-k capacity (typically configured as 4K to 8K Tokens) and to absorb short-term fluctuations in Top-k through the eviction strategy.
The Kernel workflow is divided into three phases:
Phase 1: Set intersection calculation
Compare prev_topk and curr_topk to identify the Tokens selected in both steps. This part of the Cache already resides on the GPU and does not need to be reloaded. The PageTable (curr_device_indicators) is updated directly to reuse the existing data.
Phase 2: LRU eviction decision
This is the core difference from a strict Top-k Buffer. Instead of simply evicting all Tokens in prev_topk that do not appear in curr_topk, the Kernel calculates evict_device_indices, which marks the coldest physical page locations that can be overwritten.
Phase 3: Incremental load mapping
Extract the newly added Tokens (the missing part) from curr_topk and generate a one-to-one load mapping:
load_host_indices: the physical addresses of these Tokens in Host Memory;
load_device_indices: their target physical page numbers on the GPU (reusing the locations evicted in Phase 2).
This heuristic strategy makes full use of the temporal continuity of DSA Top-k selection and dynamically maintains an efficient cache window for each Request, so that the system can sustain a cache hit rate of 80% or more in long sequence scenarios with a small GPU cache footprint, balancing space and efficiency.
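The three phases can be mirrored in a CPU-side Python sketch. It is not the CUDA Kernel: the class `LruTopkBuffer` and the `host_index_of` mapping are hypothetical, and Phase 1 checks residency in the whole buffer (a superset of prev_topk), which is the reuse that a buffer larger than Top-k makes possible.

```python
from collections import OrderedDict


class LruTopkBuffer:
    """GPU-side buffer model: token id -> device slot, oldest entry first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()
        # Reversed so that pop() hands out slots 0, 1, 2, ... in order.
        self.free_slots = list(range(capacity - 1, -1, -1))

    def step(self, curr_topk, host_index_of):
        curr = list(dict.fromkeys(curr_topk))      # de-duplicate, keep order
        if len(curr) > self.capacity:
            raise ValueError("buffer must hold at least one full Top-k set")
        # Phase 1: tokens already resident are reused and marked as recent.
        hits = [t for t in curr if t in self.resident]
        for t in hits:
            self.resident.move_to_end(t)
        # Phase 2: for each miss take a free slot, else evict the coldest entry.
        misses = [t for t in curr if t not in self.resident]
        evict_device_indices, load_device_indices = [], []
        for _ in misses:
            if self.free_slots:
                slot = self.free_slots.pop()
            else:
                _victim, slot = self.resident.popitem(last=False)
                evict_device_indices.append(slot)
            load_device_indices.append(slot)
        # Phase 3: one-to-one mapping from Host addresses to device slots.
        for t, slot in zip(misses, load_device_indices):
            self.resident[t] = slot
        return {
            "hits": hits,
            "evict_device_indices": evict_device_indices,
            "load_host_indices": [host_index_of[t] for t in misses],
            "load_device_indices": load_device_indices,
        }


host = {t: 100 + t for t in range(10)}
buf = LruTopkBuffer(capacity=4)
buf.step([1, 2, 3], host)
buf.step([2, 3, 4], host)           # token 1 stays resident although unselected
print(buf.step([1, 2, 5], host))    # token 1 is a hit; only token 5 is loaded
```

Because the hits are refreshed before any eviction and the capacity is at least the Top-k size, the entries evicted in this sketch are never part of the current selection.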

To migrate data efficiently between the GPU and CPU memory tiers, SGLang HiCache provides a dedicated IO Kernel transfer engine. The engine uses low-level CUDA optimizations and maximizes PCIe bandwidth utilization through fine-grained warp-level parallelism.
The IO Kernel supports multiple memory layouts (layer_first, page_first, and page_head), which provides a unified abstraction over the MHA and MLA architectures. On the CPU side, pinned memory and CUDA host registration are used to enable zero-copy transfer. Combined with dynamic scheduling at per-layer and all-layer transfer granularity, a batched all-layer offload is performed after the prefill stage and incremental single-layer transfers are performed in the decode stage, which balances transfer latency against bandwidth overhead.
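How a layout affects transfer granularity can be seen from the address arithmetic alone. The sketch below is an assumption: it treats layer_first as a [layer][page] ordering and page_first as a [page][layer] ordering of fixed-size page blocks, uses a hypothetical `host_offset` helper, and leaves page_head out.

```python
def host_offset(layout, layer, page, num_layers, num_pages, page_elems):
    """Element offset of one (layer, page) block in a flat Host buffer."""
    if layout == "layer_first":
        # All pages of a layer are adjacent: suits single-layer transfers.
        return (layer * num_pages + page) * page_elems
    if layout == "page_first":
        # All layers of a page are adjacent: suits all-layer page transfers.
        return (page * num_layers + layer) * page_elems
    raise ValueError(f"unsupported layout: {layout}")


for layout in ("layer_first", "page_first"):
    base = host_offset(layout, 0, 0, 4, 10, 64)
    next_page = host_offset(layout, 0, 1, 4, 10, 64)
    next_layer = host_offset(layout, 1, 0, 4, 10, 64)
    print(layout, next_page - base, next_layer - base)
```

Under this model, moving every layer of one page is a single contiguous copy in page_first, whereas moving many pages of one layer is contiguous in layer_first, which is one way to read the pairing of all-layer offload and per-layer incremental transfer described above.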
Measurements show that, with NUMA binding, the IO Kernel reaches close to 40 GB/s per GPU on 8 × H20, providing a low-latency, high-throughput data movement infrastructure for the hierarchical KV cache architecture.
We integrated DeepSeek-V3.2 DSA into the SGLang hierarchical sparse attention framework and evaluated system performance in long context inference scenarios. The experiments use the DeepSeek-V3.2 model with three sequence length configurations (16K, 32K, and 64K) and measure input throughput (input tokens/s) under different batch sizes on 8 × H200 GPUs with 1 TB of memory.
The experimental results show that, compared with the traditional full KV cache scheme, the hierarchical sparse attention scheme (Hierarchical Sparse), which combines hierarchical KV cache management, GPU-CPU heterogeneous storage, and a dynamic Top-k retrieval mechanism, delivers significant performance advantages in long sequence scenarios. Specifically:
The results demonstrate the effectiveness of the hierarchical sparse attention architecture in breaking through the GPU memory wall and supporting large-scale concurrent inference over ultra-long contexts.

Technology deepening direction:
Performance optimization:
Architecture evolution direction: with the growing adoption of super-node architectures, the bandwidth available to a GPU accessing a shared memory pool over the Scale-Up network significantly exceeds traditional PCIe bandwidth. Under this hardware trend, Memory Pooling for KVCache becomes a natural choice. We will help implement unified KVCache pooling and scheduling within super nodes, take full advantage of Scale-Up network bandwidth, break through the traditional PCIe bottleneck, and provide a more efficient hierarchical sparse infrastructure for ultra-long context inference.