
SGLang Hierarchical Sparse Attention

This article introduces hierarchical sparse attention: the full KV Cache is stored on the CPU, while the GPU keeps only a Top-k LRU Buffer.

The Alibaba Cloud Tair KVCache Team, together with the SGLang HiCache Team, the Ant AI Infra-Inference Service Team, and the Alibaba Cloud server vibrato heterogeneous computing Team, has jointly launched a hierarchical sparse attention framework. This article details the framework's architecture design and implementation.

In the previous article, we described in detail how HiCache breaks through the KVCache capacity bottleneck with a tiered storage architecture (GPU → CPU → remote storage), expanding the effective cache capacity from 40 GB to the TB scale and enabling large-scale deployment of long-context, high-concurrency LLM inference services.

However, when the context length reaches 128K or even one million tokens, two new bottlenecks emerge:

  • Compute bottleneck: The cost of Attention grows linearly with sequence length and is limited by HBM bandwidth. Dynamic Sparse Attention (DSA) uses a "Select-then-Compute" paradigm that selects only the Top-k tokens to participate in Attention, which breaks through this bottleneck.
  • Capacity bottleneck: After DSA is introduced, the main bottleneck shifts from HBM bandwidth to HBM capacity. To keep latency low, the full KV Cache still has to reside on the GPU, which limits concurrent inference capacity.

Hierarchical sparse attention offers a practical path through both constraints: the full KV Cache is stored in CPU memory, while the GPU keeps only a Top-k LRU Buffer. We cover SGLang's hierarchical sparse attention framework in three parts:

  • Overall architecture: the modular design of SparseCoordinator, Algorithm, BackendAdaptor, and SparseKVCacheManager
  • Core mechanisms: incremental transfer with the Sparse Diff Kernel and high-performance transfer with the I/O Kernel
  • Case study: deep integration with DeepSeek DSA cuts single-request GPU memory usage from 8 GB to 200 MB and delivers 3x higher single-machine throughput

Hierarchical sparsity marks another leap in the KVCache management paradigm. HiCache offered "hierarchical storage → expanded capacity"; this article adds "sparsity + hierarchy → breaking the dual constraints of bandwidth and capacity", opening a new technical path for ultra-long-context inference.

This series of technical articles systematically breaks down the evolution of KVCache technology for agentic inference:

  1. Alibaba Cloud Tair Partners with SGLang to Build HiCache: Constructing a New Cache Paradigm for "Agentic Inference"
  2. Alibaba Cloud Tair KVCache Engineering Implementation Based on 3FS: Enterprise-Grade Deployment, High-Availability Operations, and Performance Optimization Practices
  3. Hybrid Model Support: SGLang's support for hybrid architecture models such as Mamba-Transformer
  4. Tair KVCache Manager: architecture design and implementation of an enterprise-grade global KVCache management service
  5. KVCache simulation analysis: design and implementation of high-precision computation and cache simulation
  6. This article | Hierarchical Sparse Attention: hierarchical KV management and on-demand loading under the hierarchical sparse attention framework
  7. Outlook: the evolution of KVCache-driven software and hardware co-design

Tair KVCache extends the Tair product capabilities of Alibaba Cloud Database. In essence, it represents three successive shifts in the caching paradigm:
🔹 From Redis's "cache data → reduce I/O"
🔹 To GPU KVCache's "cache intermediate computation state → reduce repeated computation"
🔹 Then to Tair KVCache's "large-scale, agent-oriented attention state management → a restructured cost model for large-model inference"

This progression indicates that caching is being upgraded from an auxiliary component to a core capability of the AI infrastructure layer: it makes "state" storable, shareable, and schedulable, and supports the large-scale inference foundation of the agent era.

Introduction: Two Bottlenecks and a Joint Solution

HiCache: Capacity Expansion, and the Next Problem

In the previous article, "Challenges Agentic Inference Poses to KVCache and a Deep Dive into SGLang HiCache," we explained how HiCache uses a hierarchical storage architecture, GPU memory -> CPU memory -> 3FS remote storage, to break through the KVCache capacity bottleneck. With heat-aware scheduling and asynchronous prefetching, HiCache expands the effective cache capacity from the original 40 GB of GPU memory to the TB scale, enabling production scale deployment of long context, high concurrency LLM inference.

In real production environments, HiCache has already shown clear value: cache hit rate increased from 40% to 80%, average TTFT dropped by 56%, and inference QPS improved by 2x.

But once we look at more extreme long context scenarios, such as 128K or even million token context windows, new bottlenecks appear.

The HBM Bandwidth Wall in Long Context Inference

The Bandwidth Bottleneck of Attention Computing

Attention has a distinct performance profile in long context inference. Each Decode step needs to load the full KV Cache from HBM into the compute units before running Attention. Since the KV Cache grows linearly with sequence length, the Attention cost grows linearly as well.

The core issue is that Attention has low arithmetic intensity. Compared with the amount of memory access, the actual floating point work is not enough to saturate the GPU compute units. Attention is therefore a typical memory-bound operation, limited by HBM bandwidth. As context length grows from 32K to 128K and beyond, this bandwidth bottleneck becomes one of the main limits on long context inference performance.

Select-Then-Compute Paradigm for Dynamic Sparse Attention (DSA)

Dynamic Sparse Attention, or DSA, starts from a basic property of Attention: during autoregressive generation, not every historical token contributes equally to the current output. Attention distributions often have a clear long tail. A small number of important tokens account for most of the Attention Score, while many tokens contribute very little. More importantly, the set of important tokens changes dynamically across queries, so it cannot be fixed in advance with static rules.

DSA turns this observation into a "Select-then-Compute" workflow with three stages:

  • Blocking and metadata abstraction: The KV Cache is divided into fixed size blocks, usually 32 tokens per block. Each block keeps a lightweight metadata structure. The metadata can be statistical summaries of Key vectors, such as mean and variance, a bounding box with per dimension max/min values, or a compact low dimensional representation. The storage overhead is usually less than 1% of the full KV Cache, so the metadata can stay in GPU memory.
  • Fast importance estimation: For each newly generated Query token, the algorithm does not read the full Key Cache. Instead, it uses the metadata to quickly compute a Criticality Score for each block. This computation is much cheaper than full Attention, usually O(n/32) rather than O(n), and can be efficiently parallelized. A Top-k selection algorithm, such as heap based selection, then picks the k most relevant blocks. A typical value is k = 2048, corresponding to 64 blocks.
  • On-demand sparse computation: Only the selected Top-k blocks load their full Key and Value Cache and run standard Scaled Dot-Product Attention. Unselected blocks are skipped entirely, avoiding unnecessary HBM access.
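The three stages above can be illustrated with a minimal pure-Python sketch. It assumes the per-block mean Key as metadata (one of the summaries mentioned above) and a plain dot product as the Criticality Score; the function names, the tiny block size, and the toy data are illustrative only and are not taken from any DSA implementation.

```python
import math

BLOCK = 4  # tokens per block (the article cites 32; kept small for the demo)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def block_means(keys):
    """Stage 1: keep one mean-Key summary per fixed-size block."""
    means = []
    for start in range(0, len(keys), BLOCK):
        block = keys[start:start + BLOCK]
        means.append([sum(col) / len(block) for col in zip(*block)])
    return means


def select_blocks(query, means, k_blocks):
    """Stage 2: score every block from its metadata only, keep the top k_blocks."""
    scores = [dot(query, m) for m in means]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k_blocks])


def sparse_attention(query, keys, values, block_ids):
    """Stage 3: scaled dot-product attention over the selected blocks only."""
    token_ids = [t for b in block_ids
                 for t in range(b * BLOCK, min((b + 1) * BLOCK, len(keys)))]
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[t]) * scale for t in token_ids]
    peak = max(logits)  # subtract the max for a numerically stable softmax
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[t][d] for w, t in zip(weights, token_ids)) / total
            for d in range(dim)]


# Toy data: 16 tokens with 2-dim Keys; only block 2 aligns with the Query.
keys = [[0.0, 1.0]] * 8 + [[1.0, 0.0]] * 4 + [[0.0, -1.0]] * 4
values = [[float(i)] for i in range(16)]
query = [1.0, 0.0]

means = block_means(keys)
chosen = select_blocks(query, means, k_blocks=1)
output = sparse_attention(query, keys, values, chosen)
print(chosen, output)
```

Only the Keys and Values of the chosen block are read in stage 3; the other blocks are touched solely through their one-vector summaries.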

Representative DSA algorithms include:

  • Quest: A training-free heuristic algorithm that approximates the upper bound of the Attention Score using the geometry of Query-Key bounding boxes. By maintaining the maximum and minimum Key values for each dimension in each block, Quest can quickly filter out unimportant blocks without reading the full Key.
  • ClusterKV: Clusters all Key vectors during Prefill, for example with K-means, and produces C centroids. Each original key maps to its nearest centroid. During Decode, the Query is compared with the centroids to retrieve the most relevant Top-k entries.
  • DeepSeek DSA: A model-native sparse attention mechanism. It uses a specially trained Indexer module to dynamically predict token importance, and the Indexer output directly drives Top-k selection.

The Hidden Memory Wall: The Capacity Problem in Sparse Computation

Although sparse attention has made breakthroughs at the computational level, its execution process has inherent sequential dependencies:

In Stage 1, the algorithm evaluates the importance of each token or page. In Stage 2, it selects Top-k entries based on the scores. Only after Stage 2 completes does Stage 3 know which KV entries it should compute over. This creates a fundamental problem: before Top-k is known, the system cannot know which KV data will be needed. As a result, the full KVCache still has to remain on the GPU. The key constraint is that sparse attention reduces compute complexity, from O(n) to O(k), but GPU memory complexity remains O(n). After DSA is introduced, the main performance bottleneck shifts from HBM bandwidth to HBM capacity. This capacity constraint causes three problems:

  1. Poor HBM capacity utilization: 98.4% of the KV Cache is not accessed in each step, but still occupies valuable HBM space.
  2. Limited parallelism: Small batch sizes cannot fully use GPU parallel compute, so inference throughput is hard to improve. For example, in DeepSeek-V3.2, a single 128K request uses 8 GB of Latent Cache. After an H200 reserves memory for model weights, it can support at most Batch = 5, which badly limits GPU parallelism.
  3. Blocked value from hierarchical storage: Traditional KV Cache offload requires all data to be loaded into HBM before Decode, so it does not fit DSA’s dynamic selection pattern.
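The 98.4% figure follows from the numbers already quoted. The short calculation below assumes that "128K" means 131,072 tokens and that the 8 GB Latent Cache is spread evenly across tokens; both are simplifying assumptions made for illustration.

```python
context_tokens = 128 * 1024   # assumed: 128K = 131,072 tokens
topk_tokens = 2048            # Top-k budget cited above
latent_cache_gb = 8.0         # per-request Latent Cache from the example

touched_fraction = topk_tokens / context_tokens
untouched_fraction = 1.0 - touched_fraction
touched_mb = latent_cache_gb * 1024 * touched_fraction  # assumes uniform bytes per token

print(f"accessed per step: {touched_fraction:.2%}")
print(f"idle in HBM:       {untouched_fraction:.1%}")
print(f"accessed data:     {touched_mb:.0f} MB of {latent_cache_gb:.0f} GB")
```

Under these assumptions each Decode step reads on the order of a hundred megabytes of an 8 GB cache, which is the gap that hierarchical storage aims to exploit.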

Hierarchical Sparse Attention: Optimizing Storage and Compute Together

The key insight for breaking the GPU memory wall is this: since Attention only needs the Top-k subset, why not keep only that subset on the GPU and, once the Top-k is computed, load just the incremental part from the CPU-side HiCache?

The key to hierarchical sparsity is changing where the KVCache is stored and when it is loaded (DeepSeek DSA is used as the example below):

  • Traditional flow: The complete Latent Cache must reside in GPU memory. During Decode, the Indexer selects the Top-2K tokens, and Attention is then computed over the selected subset. For a single 128K request, the theoretical computation drops by more than 60x, but GPU memory usage is still O(n), about 8 GB, so an H200 supports at most Batch = 5.
  • Hierarchical sparse flow: After Prefill, the complete Latent Cache (8 GB) is offloaded to Host memory, and only the lightweight Sparse Indexer metadata stays on the GPU. During Decode, the Indexer selects the Top-2K tokens on the GPU based on that metadata, the Host gathers the corresponding Latent subset and transfers it incrementally to the GPU, and Attention is computed last.

    • Single-request GPU memory usage drops to under 200 MB.
    • The batch size a single GPU can support is then bounded by two quantities: the maximum CPU memory capacity that can be allocated to a single card, and the maximum batch size that meets the SLO latency requirement.

Core Advantages:

  • The complete KVCache is stored in Host memory, which removes the physical capacity limit of GPU memory;
  • The GPU only needs to hold the lightweight sparse metadata and the Top-k portion of the KVCache, so per-request GPU memory drops from O(n) to O(k);
  • High-performance transfer: The HiCache I/O Kernel moves the Top-k cache efficiently and keeps single-layer I/O latency at the microsecond level. Combined with compute/transfer overlap, the I/O latency is hidden behind computation.

Hierarchical sparsity not only addresses the compute problem but also removes the rigid constraint of GPU memory capacity. It optimizes compute efficiency and storage efficiency together and opens a new technical path for ultra-long-context inference.

SGLang Hierarchical Sparse Framework Design

Overall Framework Design

SGLang's hierarchical sparse framework adopts a modular, pluggable three-layer architecture and achieves algorithm decoupling, backend compatibility, and non-intrusive integration through standardized interfaces. The core of the framework consists of the following modules:

  • SparseCoordinator (coordination layer): orchestrates the three functional modules through lifecycle hooks

    • Algorithm (algorithm layer): provides pluggable Top-k selection strategies;
    • BackendAdaptor (adaptation layer): converts sparse indices into physical addresses and interfaces with the Attention backend;
    • SparseKVCacheManager (transport layer): performs efficient incremental Host-GPU data transfer based on the Diff Kernel and I/O Kernel.
  • RequestTrackers (state management): maintains the sparse state of each request

The architecture natively supports sparsity mechanisms built into the model (such as DeepSeek-V3.2 DSA) and also allows flexible combinations of training-free sparse algorithms (Quest, SnapKV) with general Attention backends (FlashAttention, Triton). It provides a unified, highly extensible hierarchical sparse attention solution for long-context inference.

SparseCoordinator: Orchestrating the Sparse Pipeline

SparseCoordinator is the central controller of the hierarchical sparse framework. It uses lifecycle hooks to orchestrate the Algorithm, BackendAdaptor, and SparseKVCacheManager modules at the key points of model inference. Its design follows an event-driven model: the complete retrievable sparse attention flow is decoupled into standardized hook interfaces, so algorithms integrate with the model without intrusive changes.

SparseCoordinator divides sparse inference into two core phases:

  • Representation Construction Phase, at the end of Prefill or the beginning of Decode: The attention_end hook calls construct_representations() and update_representations() on Algorithm. These methods compress the raw KVCache into semantic representations and store them in the Representation Pool. Full Attention is still computed in this phase to preserve representation quality.
  • Query-Guided Decoding Phase: At each Decode step, inside the attention_begin hook, the Coordinator drives Algorithm to run retrieve_topk() from the Representation Pool based on the current Query. It selects the most relevant Top-k representations. BackendAdaptor then converts logical indices into physical indices and triggers SparseKVCacheManager for incremental data transfer. The Diff Kernel computes the difference between Top-k sets and loads only the changed entries. Finally, the framework dynamically reconstructs Attention metadata, such as FlashAttention’s PageTable, for the Attention backend to run sparse computation.

Through this capture-compute-convert-inject loop, SparseCoordinator keeps the framework flexible while providing efficient hierarchical KVCache management.
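A toy sketch of the hook-driven loop may make the division of labor clearer. Only the hook names attention_end and attention_begin and the method names construct_representations and retrieve_topk come from the text; every class below, its constructor arguments, and the idea of ranking precomputed per-page scores instead of a real Query are hypothetical simplifications, not SGLang's actual code.

```python
class ToyAlgorithm:
    """Hypothetical Algorithm: its 'representation' is one score per page."""

    def __init__(self):
        self.pool = {}  # request id -> per-page relevance scores

    def construct_representations(self, req_id, page_scores):
        self.pool[req_id] = list(page_scores)

    def retrieve_topk(self, req_id, k):
        scores = self.pool[req_id]
        ranked = sorted(range(len(scores)), key=lambda p: scores[p], reverse=True)
        return sorted(ranked[:k])  # logical page IDs only


class ToyAdaptor:
    """Hypothetical BackendAdaptor: logical page ID -> physical page number."""

    def __init__(self, logical_to_physical):
        self.table = logical_to_physical

    def to_physical(self, logical_pages):
        return [self.table[p] for p in logical_pages]


class ToyCacheManager:
    """Hypothetical transport layer: loads only pages not yet resident."""

    def __init__(self):
        self.resident = set()

    def load_missing(self, physical_pages):
        missing = [p for p in physical_pages if p not in self.resident]
        self.resident.update(missing)
        return missing


class ToyCoordinator:
    def __init__(self, algorithm, adaptor, manager, k):
        self.algorithm, self.adaptor, self.manager, self.k = algorithm, adaptor, manager, k

    def attention_end(self, req_id, page_scores):
        # Representation construction: refresh what retrieval will search.
        self.algorithm.construct_representations(req_id, page_scores)

    def attention_begin(self, req_id):
        # Query-guided decoding: select, translate, transfer, then hand over metadata.
        logical = self.algorithm.retrieve_topk(req_id, self.k)
        physical = self.adaptor.to_physical(logical)
        loaded = self.manager.load_missing(physical)
        return {"page_table": physical, "loaded": loaded}


coordinator = ToyCoordinator(ToyAlgorithm(), ToyAdaptor({0: 10, 1: 11, 2: 12, 3: 13}),
                             ToyCacheManager(), k=2)
coordinator.attention_end("req-0", [0.1, 0.9, 0.8, 0.2])
step1 = coordinator.attention_begin("req-0")
coordinator.attention_end("req-0", [0.1, 0.9, 0.2, 0.8])
step2 = coordinator.attention_begin("req-0")
print(step1, step2)
```

In the second step only one page is transferred, because the other selected page is already resident from the first step.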

Pluggable Sparse Strategies

Under SparseCoordinator's orchestration, Algorithm and BackendAdaptor are the two core functional modules. They answer "what to select" and "how to map" respectively, and their clearly defined interfaces make them highly pluggable and extensible.

Algorithm: An Abstract Top-K Selection Strategy

The Algorithm layer uses an abstract base class, BaseSparseAlgorithm, to define a unified interface. It decouples sparse algorithm logic into three methods:

  • retrieve_topk(queries, layer_id, ...):

    • Retrieves the logical indices of the Top-k important tokens or pages from the Representation Pool based on the current Query.
    • The algorithm only needs to return logical indices, such as Token IDs or Page IDs. It does not need to know the physical KVCache layout or the Attention backend details, such as FlashAttention or Triton.
  • construct_representations(...): Builds semantic representations for retrieval during Prefill or the early Decode stage, such as compressed Key representations.
  • update_representations(...): Incrementally updates the Representation Pool during Decode.
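The shape of such an interface can be sketched with Python's abc module. The class name BaseSparseAlgorithm and the three method names come from the text; the exact parameter lists below are assumptions (the text elides them), and RecencyAlgorithm is a deliberately trivial, hypothetical strategy used only to show how a plug-in would look.

```python
from abc import ABC, abstractmethod


class BaseSparseAlgorithm(ABC):
    """Sketch of the interface; parameter lists are assumed, not SGLang's."""

    @abstractmethod
    def construct_representations(self, layer_id, keys):
        """Build the retrieval representations (Prefill or early Decode)."""

    @abstractmethod
    def update_representations(self, layer_id, new_key):
        """Incrementally extend the representations during Decode."""

    @abstractmethod
    def retrieve_topk(self, queries, layer_id, k):
        """Return logical indices only; no physical layout knowledge."""


class RecencyAlgorithm(BaseSparseAlgorithm):
    """Hypothetical toy strategy: always select the k most recent tokens."""

    def __init__(self):
        self.num_tokens = {}  # layer id -> number of cached tokens

    def construct_representations(self, layer_id, keys):
        self.num_tokens[layer_id] = len(keys)

    def update_representations(self, layer_id, new_key):
        self.num_tokens[layer_id] += 1

    def retrieve_topk(self, queries, layer_id, k):
        n = self.num_tokens[layer_id]
        return list(range(max(0, n - k), n))


algo = RecencyAlgorithm()
algo.construct_representations(layer_id=0, keys=[[0.0]] * 5)
algo.update_representations(layer_id=0, new_key=[0.0])
selected = algo.retrieve_topk(queries=None, layer_id=0, k=3)
print(selected)
```

Because the strategy returns logical token indices and nothing else, swapping it for a different selection rule would not require changes to the adaptor or the transport layer.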

Take the Quest algorithm as an example:

  • Quest is a training-free, page-wise sparse attention algorithm. It avoids the full Query-Key dot product by maintaining a per-dimension Key bounding box (min/max values) for each KV page;
  • In the construct_representations phase, the algorithm traverses all pages, extracts the Keys, computes the per-dimension minimum and maximum Key values, and stores them in the page_k_min/max Representation Pool (the memory overhead is about 1% of the full Key storage);
  • In the retrieve_topk phase, the algorithm computes a criticality score that upper-bounds the Attention Score of each page, quickly selects the Top-k pages, and passes them to BackendAdaptor for physical address conversion.
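The bounding-box bound can be written in a few lines. The sketch below assumes the standard per-dimension form of the bound (for each dimension, take the larger of q·min and q·max, then sum); the function names and the toy Keys are illustrative and do not reproduce the SGLang implementation.

```python
def build_page_bounds(keys, page_size):
    """Per-page, per-dimension min/max of the Keys (the bounding boxes)."""
    bounds = []
    for start in range(0, len(keys), page_size):
        page = keys[start:start + page_size]
        k_min = [min(col) for col in zip(*page)]
        k_max = [max(col) for col in zip(*page)]
        bounds.append((k_min, k_max))
    return bounds


def criticality(query, k_min, k_max):
    """Upper bound of q.k for any Key that lies inside the page's box."""
    return sum(max(q * lo, q * hi) for q, lo, hi in zip(query, k_min, k_max))


def retrieve_topk_pages(query, bounds, k):
    scores = [criticality(query, lo, hi) for lo, hi in bounds]
    ranked = sorted(range(len(scores)), key=lambda p: scores[p], reverse=True)
    return sorted(ranked[:k]), scores


keys = [[1.0, 0.0], [0.5, 0.2],    # page 0
        [-1.0, 0.3], [-0.8, 0.1],  # page 1
        [0.2, 2.0], [0.1, 1.5]]    # page 2
query = [1.0, -1.0]

bounds = build_page_bounds(keys, page_size=2)
pages, scores = retrieve_topk_pages(query, bounds, k=1)
print(pages, scores)
```

The score never underestimates the best true dot product inside a page, so a page holding a strongly matching Key cannot be ranked below its true value; the price is that the bound may overestimate pages whose Keys are spread out.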

BackendAdaptor: Bridging Logical Indices and Physical Backends

The BackendAdaptor layer solves the problem of mapping the "logical world" to the "physical world". Different Attention backends (DSA Backend, FlashAttention, and Triton FA3) have different requirements for the format and indexing of their input data, and the adaptor is responsible for hiding these differences.

Take the FlashAttention adaptor as an example: it converts logical Page IDs into physical page numbers through the req_to_token mapping table, reconstructs the PageTable, and updates the sequence length metadata (cache_seqlens, cu_seqlens_k), so that FlashAttention can compute attention over the sparse pages selected by Top-k.
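As a rough illustration of the logical-to-physical step, the sketch below models req_to_token as a per-request list of physical page numbers indexed by logical page ID. That layout, the function name, and the example numbers are assumptions for illustration; the real mapping table and metadata in SGLang are tensors with their own layout.

```python
def build_sparse_page_table(req_to_token, req_id, logical_pages, page_size, seq_len):
    """Translate selected logical pages and recompute the sparse sequence length."""
    last_page = (seq_len - 1) // page_size
    tokens_in_last_page = seq_len - last_page * page_size
    # Logical page ID -> physical page number for this request.
    physical_pages = [req_to_token[req_id][page] for page in logical_pages]
    # Only the final page of the sequence may be partially filled.
    cache_seqlen = sum(
        tokens_in_last_page if page == last_page else page_size
        for page in logical_pages
    )
    return physical_pages, cache_seqlen


# Hypothetical request: 50 tokens stored in 4 pages of 16 tokens each.
req_to_token = {"req-0": [40, 7, 19, 3]}
page_table, cache_seqlen = build_sparse_page_table(
    req_to_token, "req-0", logical_pages=[1, 3], page_size=16, seq_len=50
)
print(page_table, cache_seqlen)
```

The backend then sees a short page table and a matching length, and has no need to know that the request is actually much longer.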

DeepSeek DSA Integration

Introduction to DeepSeek SparseAttention

Compared with DeepSeek-V3.1, the architectural change in DeepSeek-V3.2 is the introduction of DeepSeek Sparse Attention (DSA) during continued training.

The prototype design of DSA consists of two parts: the Lightning Indexer and the Fine-grained Token Selection Mechanism. A lightweight indexer first quickly filters out the candidate tokens most relevant to the current Query token, and high-precision attention is then computed only over this sparse candidate set.

[Figure: DeepSeek Sparse Attention, from the DeepSeek paper]

DeepSeek DSA Integration Flow

Key designs include:

  • Dual cache mapping: The system maintains two independent physical address mapping tables in DSADecodeReqToTokenPool. req_to_token stores the Latent Cache LRU Buffer page table for each request, with LRU Size = 2 to 4 KB. req_to_dsa_index_k stores the indexer_k page table. During Prefill, the Indexer module generates index_k for each token and stores it on the GPU. At the same time, the full Latent Cache is offloaded to CPU memory. After Prefill ends, each request’s GPU memory footprint is fixed at the LRU size.
  • Incremental transfer: During Decode, each time a token is generated, the Indexer uses the current Query and the cached historical index_k to efficiently compute the logical indices of the Top-2K tokens. The Sparse Diff Kernel then compares prev_topk and curr_topk with a set difference algorithm and computes the exact delta, Δ, that needs to be loaded. SparseKVCacheManager calls load_to_device_per_layer to transfer only the Latent Cache blocks corresponding to Δ into the GPU LRU Buffer, minimizing PCIe bandwidth use.
  • Non-invasive integration: DeepSeek DSA integrates with the model through SparseCoordinator lifecycle hooks. DeepSeekDSAAlgorithm, as an Algorithm implementation, directly calls the model-native Indexer. DSABackendAdaptor converts logical Top-k indices into physical device addresses and triggers incremental transfer. Finally, DSA Backend, supporting implementations such as flashmla_sparse, flashmla_kv, and fa3, runs Attention based on the sparse page table. With this design, GPU memory use for 128K long context inference drops from about 8 GB to about 200 MB.

Sparse Diff Kernel: The Cornerstone of Incremental Cache Transfer

Motivation: Optimization Headroom from Temporal Locality

The Top-k selections of DSA show strong temporal locality: the Top-k sets of adjacent Decode steps overlap heavily. Experiments show that the overlap between adjacent steps usually reaches 80% to 90%, which means each Decode step in theory needs to load less than 20% new cache entries. This provides natural headroom for incremental transfer.

The Trade-off Between Buffer Capacity and Hit Rate

However, as the sequence length increases, the candidate range for Top-k selection expands linearly, and the difference between adjacent steps gradually grows. The LRU Buffer capacity therefore directly affects the cache hit rate.

When the Buffer capacity equals only the Top-k size (2K), the hit rate drops significantly in long-sequence scenarios and I/O latency becomes the bottleneck. Expanding the Buffer to 4K to 8K trades a controllable amount of GPU memory for a several-fold improvement in I/O efficiency.

Design and Implementation of the LRU Diff Kernel

To take full advantage of DSA's temporal locality, we designed a Diff Kernel based on an LRU eviction policy. The core idea is to maintain, on the GPU side, an LRU Buffer with 2 to 4 times the Top-k capacity (typically 4K to 8K tokens) and to absorb short-term fluctuations in Top-k through a smarter eviction policy.

The Kernel workflow is divided into three phases:

Phase 1: Set Intersection Calculation

Compare prev_topk and curr_topk to identify the tokens selected in both steps. This part of the cache already resides on the GPU and does not need to be reloaded. The PageTable (curr_device_indicators) is updated directly to reuse the existing data.

Phase 2: LRU Eviction Decision

This is the core difference from a strict Top-k Buffer. Instead of simply evicting every token in prev_topk that does not appear in curr_topk, the Kernel:

  • Triggers eviction only when Buffer space is insufficient;
  • Prefers to evict cache pages that have not been hit for multiple past steps (LRU policy);
  • Computes evict_device_indices, which marks the coldest physical page locations that can be overwritten.

Phase 3: Incremental Load Mapping

Extract the newly added tokens (the misses) from curr_topk and generate a one-to-one load mapping:

  • load_host_indices: the physical addresses of these tokens in Host memory;
  • load_device_indices: their target physical page numbers on the GPU (reusing the locations evicted in Phase 2).

This heuristic makes full use of the temporal continuity of DSA's Top-k selection and dynamically maintains an efficient cache window for each request. As a result, the system sustains a cache hit rate of 80% or higher in long-sequence scenarios with a small GPU cache footprint, striking a dynamic balance between space and efficiency.
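The three phases can be modeled on the CPU with an ordered dictionary standing in for the LRU state. This is only a sequential sketch of the bookkeeping, not the GPU kernel: the class name, the use of token IDs as host indices, and the slot allocation order are all assumptions made for illustration.

```python
from collections import OrderedDict


class LruDiffBuffer:
    """Toy per-request buffer with `capacity` device slots."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()         # token id -> device slot, coldest first
        self.free = list(range(capacity))  # device slots not yet in use

    def step(self, curr_topk):
        curr_topk = list(dict.fromkeys(curr_topk))  # drop duplicates, keep order
        if len(curr_topk) > self.capacity:
            raise ValueError("Top-k must fit inside the buffer")

        # Phase 1: resident tokens are reused and marked most recently used.
        hits = [t for t in curr_topk if t in self.slots]
        for t in hits:
            self.slots.move_to_end(t)

        # Phase 2: evict the coldest entries only if free slots cannot hold the misses.
        misses = [t for t in curr_topk if t not in self.slots]
        evict_device_indices = []
        while len(self.free) < len(misses):
            _, slot = self.slots.popitem(last=False)
            evict_device_indices.append(slot)
            self.free.append(slot)

        # Phase 3: pair every missing token with a target device slot.
        load_host_indices = misses  # assumed: host index == token id
        load_device_indices = [self.free.pop(0) for _ in misses]
        for token, slot in zip(misses, load_device_indices):
            self.slots[token] = slot
        return hits, evict_device_indices, load_host_indices, load_device_indices


buf = LruDiffBuffer(capacity=4)
for topk in ([0, 1], [1, 2], [2, 3, 4]):
    print(topk, buf.step(topk))
```

In the second step token 0 stays resident even though it left the Top-k set, because free slots remain; it is evicted only in the third step, when space runs out. That retention is what lets a buffer larger than Top-k absorb short-term fluctuations.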

I/O Transfer Kernel: A High Performance Transfer Engine

To move data efficiently between the GPU and CPU memory tiers, SGLang HiCache provides a dedicated I/O Kernel transfer engine. The engine uses low-level CUDA optimizations and warp-level fine-grained parallelism to maximize PCIe bandwidth utilization.

The I/O Kernel supports multiple memory layouts (layer_first, page_first, and page_head), which gives MHA and MLA architectures a unified abstraction. On the CPU side, pinned memory and the CUDA host register mechanism are used to enable zero-copy transfer. The engine also schedules dynamically between per-layer and all-layer transfer granularity: a batched all-layer offload runs after the Prefill stage, and incremental single-layer transfers run during the Decode stage, which balances transfer latency against bandwidth overhead.

Measurements show that, with NUMA binding, the I/O Kernel reaches close to 40 GB/s per GPU on 8 × H20, providing a low-latency, high-throughput data movement layer for the hierarchical KV cache architecture.

Performance Evaluation

We integrated DeepSeek-V3.2 DSA into the SGLang hierarchical sparse attention framework and evaluated system performance in long-context inference. Using the DeepSeek-V3.2 model with three sequence lengths (16K, 32K, and 64K), we measured input throughput (input tokens/s) at different batch sizes on 8 × H200 GPUs with 1 TB of memory.

The results show that, compared with the traditional full KV Cache scheme, the Hierarchical Sparse scheme, which combines hierarchical KV Cache management, GPU-CPU heterogeneous storage, and dynamic Top-k retrieval, delivers significant performance advantages in long-sequence scenarios. Specifically:

  1. Memory efficiency and throughput: The traditional scheme is limited by GPU memory capacity and supports a maximum batch size of only 32/64/128 at 64K/32K/16K sequence lengths respectively. The Hierarchical Sparse scheme offloads the KVCache to CPU memory and supports maximum batch sizes of 160/304/600 respectively, which is roughly a 5x increase in batch capacity and a 2x to 3x increase in throughput.
  2. Scalability: As the batch size increases, the throughput of the Hierarchical Sparse scheme grows almost linearly, which confirms that the hierarchical cache architecture and the sparse attention mechanism scale well in large-scale concurrent inference.
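The batch-capacity claim can be checked directly from the figures quoted in item 1. The snippet below only divides the reported maximum batch sizes; it introduces no new measurements.

```python
seq_lens = ["64K", "32K", "16K"]
full_kv_max_batch = [32, 64, 128]        # traditional full KV Cache scheme
hier_sparse_max_batch = [160, 304, 600]  # Hierarchical Sparse scheme

# Ratio of maximum batch sizes at each sequence length.
gains = {
    seq: sparse / full
    for seq, full, sparse in zip(seq_lens, full_kv_max_batch, hier_sparse_max_batch)
}
for seq, gain in gains.items():
    print(f"{seq}: {gain:.2f}x larger maximum batch")
```

The ratios fall between about 4.7x and 5x, so "roughly 5x" summarizes all three sequence lengths.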

The results demonstrate that the hierarchical sparse attention architecture is effective at breaking through the GPU memory wall and supporting large-scale concurrent inference over ultra-long contexts.

Outlook and Roadmap

Technical deepening:

  • Algorithm and backend extension: adapt more sparse algorithms (such as StreamingLLM and PQCache) and Attention backends (such as FlashInfer and Triton) to broaden the framework's ecosystem compatibility.
  • Performance optimization:

    • I/O hiding: use techniques such as TwoBatch Overlap and kernel fusion to further reduce I/O latency overhead and approach the theoretical performance limit.
    • Asynchronous retrieval: because the Queries of adjacent tokens are highly similar, the Top-k of the current step can be retrieved asynchronously in advance using the previous token's Query, reducing retrieval overhead.

Architecture evolution: With the rise of super-node architectures, the bandwidth available to a GPU accessing a shared memory pool over a Scale-Up network already significantly exceeds traditional PCIe bandwidth. Under this hardware trend, memory pooling for KVCache is a natural choice. We will help bring unified, pooled KVCache scheduling to super nodes, take full advantage of Scale-Up network bandwidth, break through the traditional PCIe bottleneck, and provide a more efficient hierarchical sparse infrastructure for ultra-long-context inference.

ApsaraDB
