
Alibaba Cloud Tair KVCache Manager: Architecture Design and Implementation of Enterprise-Level Global KVCache Management Service

This article introduces the architecture and implementation of Tair KVCache Manager, an open-source enterprise-grade global KVCache management service for scalable Agentic AI inference.

[Announcement] The Alibaba Cloud Tair KVCache team and the Alibaba Holding AI Engine & Tech Infra and Reliability Engineering team are open-sourcing Tair KVCache Manager, an enterprise-level global KVCache management service. This article introduces the architecture design and implementation details of this service.

With the rise of Agentic AI, the traditional single-server tiering solution centered on the inference engine can no longer meet the KVCache storage requirements of the new era. As pooled KVCache storage lands in large-scale Agent inference scenarios, an enterprise-level KVCache management system is needed, with accurate capacity assessment, dynamic elastic scaling, multi-tenant isolation, high-availability guarantees, and coordinated version management, to support cost optimization and service reliability at PB-scale storage. To solve these problems, we designed and implemented a global KVCache management service for large-model inference scenarios. It accelerates the Alibaba Group RTP-LLM inference service, has been extended to support open-source engines such as vLLM and SGLang, and is compatible with various attention mechanisms including Sparse Attention and Sliding Window Attention.

This article will start with the KVCache storage requirements in the Agent scenario, further analyze the challenges faced by KVCache storage in this scenario on the basis of the previous article, and introduce the architectural design choices and related implementations made by Tair KVCache Manager to address these challenges.

This series of technical articles will systematically disassemble the evolution path of KVCache technology for agent inference:

  1. Alibaba Cloud Tair Partners with SGLang to Build HiCache: Constructing a New Cache Paradigm for "Agentic Inference"
  2. Alibaba Cloud Tair KVCache Engineering Implementation Based on 3FS: Enterprise-Grade Deployment, High-Availability Operations, and Performance Optimization Practices
  3. Hybrid Model Support: SGLang's support for hybrid-architecture models such as Mamba-Transformer
  4. This article | Alibaba Cloud Tair KVCache Manager: Architecture Design and Implementation of Enterprise-Level Global KVCache Management Service
  5. KVCache Simulation Analysis: Industrial-Grade Practices for High-Precision Computation and Cache Modeling
  6. Hierarchical Sparse Attention: KV Hierarchical Management and On-Demand Loading under a Hierarchical Sparse Attention Framework
  7. Outlook: Software-Hardware Co-Design Evolution Driven by KVCache

Tair KVCache, as an extension of the Tair product capabilities of Alibaba Cloud Database, essentially represents three successive shifts in the caching paradigm:

🔹 From Redis's "cache data → reduce I/O"

🔹 To GPU KVCache's "cache intermediate computation state → reduce recomputation"

🔹 And now to Tair KVCache's "manage attention state at scale for Agents → reshape the cost model of large-model inference". This signals that caching is being upgraded from an auxiliary component to a core capability of the AI infrastructure layer, making "state" storable, shareable, and schedulable, and supporting the large-scale inference base of the Agent era.

The Triple Upgrade Challenge for KVCache Services

The Rise of Agents Reshapes the KVCache Storage Paradigm

The rise of Agents not only lengthens inference session contexts but also markedly changes session patterns. These changes pose severe challenges to the traditional single-machine tiering + affinity scheduling scheme:

  1. The number of rounds and the duration of a single session keep increasing: compared with the traditional short, low-frequency interaction mode (such as single-round or few-round requests that finish immediately), an Agent session often spans the entire life cycle of a task and contains many rounds of inference requests, lasting significantly longer. This "long lifetime" characteristic effectively changes the scheduling object from "request" to "session" (even if scheduling is still performed at request granularity), and the scheduler must try to maintain affinity between KVCache and computation over a window of tens of minutes or longer.
  2. The number of concurrent sessions keeps rising: with the popularity of paradigms such as Vibe Coding and the extensive integration of high-latency tools and complex MCP workflows (such as code compilation and execution, or data-warehouse queries), the share of LLM inference time within a single Agent session keeps shrinking while the time spent waiting on non-inference steps grows significantly. To keep the inference cluster highly utilized, the system needs more concurrent sessions to "fill" the compute gaps, which further increases the scheduler's difficulty.
  3. The differences between session patterns widen: different Agent tasks differ significantly in context size, inference frequency, tool-invocation pattern, and latency sensitivity (for example, a code-review Agent may interact frequently in small rounds, while a data-analysis Agent tends toward fewer rounds with large-context inference). As a result, a fixed compute-to-storage ratio struggles to meet the differentiated needs of different scenarios.

These factors make scheduling in the traditional mode increasingly difficult, and the conflict between the KVCache hit rate and SLO scheduling targets grows increasingly acute. For example, when the number of requests rises, the scheduler must perform more session migrations to protect SLOs, which in turn lowers the KVCache hit rate and causes the Prefill load to grow nonlinearly or even avalanche.

Even though the single-server tiering solution breaks the limitation of VRAM and further expands the capacity of KVCache, it is still difficult to fully meet the storage requirements of KVCache in the Agent scenario. To further address this issue, we need to confront the fundamental problem of this architecture: the overly tight coupling between computing resources and KVCache storage.

[Figure 1]

Fortunately, with the continuous evolution of high-performance networking, network bandwidth in intelligent-computing environments is growing rapidly (single-port bandwidth has jumped from 25 Gbps to 200+ Gbps), interconnection scale keeps expanding (thousands or tens of thousands of cards, even across availability zones), and interconnection models are becoming more diverse (mainstream storage machines are now covered by high-performance networks such as eRDMA). This makes it possible to decouple computing resources from KVCache storage.

Taking Alibaba Cloud as an example, the typical cross-machine transmission bandwidth for KVCache in intelligent-computing scenarios is about 20 GB/s: the EGS 8-GPU model provides 20 GB/s of general-purpose network bandwidth, and the Lingjun 8-GPU model provides 25 GB/s of storage bandwidth (plus another 200 to 400 GB/s of ScaleOut network bandwidth). Using the typical Prefill throughput of DeepSeek and Qwen3-Coder as examples, 20 GB/s of bandwidth is sufficient in both cases to meet the cross-machine transmission requirements of KVCache. As high-performance networks continue to evolve, we believe KVCache transmission bandwidth will expand further, continuously reducing the cost of cross-machine transmission.

| Model | Single-machine Prefill KVCache write bandwidth requirement | Time to read a 64K context at 20 GB/s | Single-token KVCache size | Inference hardware | Prefill throughput (token/s) | KVCache size of a 64K context |
| --- | --- | --- | --- | --- | --- | --- |
| DeepSeek-V3/R1 | 2.79 GB/s | 0.21 s | 70272 Byte | H800 * 8 | 42664 | 4.19 GB |
| Qwen3-Coder-480B | 3.55 GB/s | 0.76 s | 253952 Byte | H20 * 8 | 15000 | 15.14 GB |

(The DeepSeek Prefill throughput comes from the RTP-LLM reproduction report; the figure in the official DeepSeek disclosure is 32.2K. The Qwen3-Coder Prefill throughput is a rough estimate under the optimized deployment mode; the measured value in a single-server deployment is about 10,000 token/s.)
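The figures in the table can be reproduced with a back-of-the-envelope calculation. The sketch below assumes the per-token KVCache sizes and Prefill throughputs listed above and treats a "64K context" as 64,000 tokens; it is illustrative only.

```python
# Back-of-the-envelope check of the table above (illustrative; assumes 64K = 64,000 tokens).
GiB = 1024 ** 3

models = {
    # name: (KVCache bytes per token, Prefill throughput in token/s)
    "DeepSeek-V3/R1":   (70272, 42664),
    "Qwen3-Coder-480B": (253952, 15000),
}

link_bandwidth = 20 * GiB   # typical cross-machine bandwidth: 20 GB/s
context_tokens = 64_000     # "64K context"

for name, (bytes_per_token, prefill_tps) in models.items():
    write_bw = bytes_per_token * prefill_tps / GiB       # GB/s needed to absorb Prefill output
    ctx_size = bytes_per_token * context_tokens / GiB    # KVCache size of a 64K context
    read_time = ctx_size * GiB / link_bandwidth          # seconds to pull it at 20 GB/s
    print(f"{name}: write {write_bw:.2f} GB/s, context {ctx_size:.2f} GB, read {read_time:.2f} s")
# DeepSeek-V3/R1: write 2.79 GB/s, context 4.19 GB, read 0.21 s
# Qwen3-Coder-480B: write 3.55 GB/s, context 15.14 GB, read 0.76 s
```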

Driven by this, external KVCache storage starts to move from being bound to inference toward global pooling:

[Figure 2]

Combined with high-performance networking and inference-engine optimizations such as HiCache, global pooling greatly reduces the cost of a KVCache miss in local VRAM (pulling from the remote pool is much cheaper than re-running Prefill) and relaxes the affinity constraint. This makes it easier for the scheduler to achieve a high KVCache hit rate, which in turn lets it focus on compute scheduling to achieve higher utilization and better SLOs.

At the same time, based on the global pooled external KVCache storage, we can further separate the external KVCache storage from the inference container, which will bring the following benefits:

  1. The scaling of the inference container no longer affects the capacity and hit rate of KVCache, and allows flexible scaling to optimize inference costs.
  2. You can flexibly set a better KVCache storage capacity for different Agent scenarios. On the one hand, the storage capacity is prevented from being limited by the inference scale, and on the other hand, the waste of storage resources is reduced.
  3. Decoupling the storage and inference services from the deployment level simplifies operation and maintenance management, and avoids affecting the stability of each other.
  4. Introduce external storage systems such as 3FS to further expand storage capacity and reduce storage costs.

Of course, this separation does not mean giving up the storage resources on the inference host nodes; in fact, pooling these resources uniformly makes full use of them and also helps improve the overall stability of the storage system.

LLM scenarios place new demands on existing storage systems

After KVCache is pooled globally, many existing storage systems can serve as external KVCache storage, such as file storage, KV storage, memory pools, and object storage, each with its own interface. However, rapidly implementing globally pooled KVCache storage on top of these existing interfaces runs into several problems.

Compared with traditional KV storage, LLM KVCache has its special business characteristics and storage mode:

  1. There are prefix dependencies between data: the block C in the sequences (A, C) and (B, C) represents different data because the prefixes differ (see the sketch after this list)
  2. The query mode is usually prefix matching plus a reverse sliding window (SWA/Linear)
  3. Diverse ways of splitting the KVCache within a single block: diverse parallelism modes and mixed attention schemes
  4. Multiple inference processes read and write a single block simultaneously
  • Under TP/PP parallelism, multiple inference processes simultaneously read/write different parts of the KVCache of the same block (such as different heads)
  • The lifecycles of these parts are bound together, and all of them are indispensable when the block is used.
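To illustrate the prefix dependency in point 1, the sketch below shows one common way of deriving block keys by hash-chaining the parent block's key; this is an illustrative scheme, not necessarily what any particular engine or Tair KVCM uses.

```python
import hashlib

def block_keys(token_ids, block_size=64):
    """Derive one key per full block, chaining in the parent block's key so that
    identical token blocks under different prefixes get different keys."""
    keys, parent = [], b"root"
    n_full = len(token_ids) // block_size * block_size
    for i in range(0, n_full, block_size):
        block = token_ids[i:i + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).hexdigest()
        keys.append(digest)
        parent = digest.encode()
    return keys

# The same 64-token suffix C yields different block keys under prefixes A and B.
A, B, C = list(range(0, 64)), list(range(100, 164)), list(range(200, 264))
assert block_keys(A + C)[1] != block_keys(B + C)[1]
```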

Current storage system interfaces are designed for general scenarios and directly erase LLM semantics and characteristics, making it difficult for the underlying storage to optimize further for KVCache. A typical problem: to let multiple processes write simultaneously, a single block is split into multiple keys, but these keys are not evicted at consistent times. As a result, after some keys are evicted, the remaining keys occupy storage capacity but can no longer provide hits, wasting storage resources.

The metadata performance of some storage systems is insufficient, and it is difficult to complete a large number of metadata queries in a short time:

Taking block_size = 64 as an example, a 64K context requires querying the metadata of 1K blocks to determine existence and read the relevant information. Moreover, transferring the KVCache of 64K tokens takes less than one second, so the time spent on metadata queries becomes even more noticeable. At the same time, the KVCache of a single block is on the order of single-digit MB, so 100 TB of storage requires managing on the order of a hundred million blocks. Most storage systems designed for HPC scenarios focus mainly on optimizing read/write bandwidth, and their support for metadata queries at this scale is limited.
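For intuition, the rough arithmetic behind this metadata workload is shown below; the per-block byte size is an assumed value.

```python
# Rough scale of the metadata workload described above (per-block size is an assumption).
block_size_tokens = 64
context_tokens = 64 * 1024
print(context_tokens // block_size_tokens)         # 1024 metadata lookups for one 64K request

bytes_per_block = 1 * 1024 ** 2                    # assume ~1 MiB of KVCache per block
pool_capacity = 100 * 1024 ** 4                    # 100 TiB of pooled KVCache
print(f"{pool_capacity // bytes_per_block:,}")     # ~105 million blocks to manage
```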

Retrofitting existing storage systems or building a new one requires a great deal of work:

High-performance distributed storage systems are inherently complex. Given the demand for generic interfaces from existing business scenarios and the high stability and reliability requirements of storage systems, whether adding LLM-related capabilities to an existing storage system or developing a new storage system for LLM scenarios, a great deal of R&D resources and time is required; and if multiple storage systems are used, each of them must be developed separately.

Therefore, it is necessary to reuse existing engineering assets as much as possible to meet the urgent KVCache storage requirements quickly. Doing so also clarifies the actual requirements and workload patterns for a dedicated LLM-scenario storage system, so that it can be designed and developed with more complete information and sufficient time, and can better serve the needs of KVCache.

Combined with the above problems, building a metadata management layer that focuses on LLM semantic adaptation and storage management has become a practical and feasible path:

This layer does not replace the underlying storage; rather, it acts as a metadata manager, exposing native LLM-semantic interfaces upward to the inference engine, mapping operations downward to efficient calls on the underlying storage, and passing refined, translated storage characteristics down to it. Adding this abstraction layer balances implementation speed with long-term evolution: in the short term, existing storage systems can quickly support large-capacity pooled KVCache storage; in the medium to long term, the optimization goals and interface boundaries of a dedicated storage system are clearly defined, enabling smooth evolution.

Scale deployment requires management systems to provide new capabilities

With support from inference engines and improvements in KVCache storage systems, pooled KVCache has begun to move from small-scale experiments to large-scale deployment. When PB-scale back-end storage serves the KVCache of multiple Agent models, the need for various enterprise-level management capabilities arises.

Before enabling external KVCache storage for inference services, it is necessary to assess the KVCache capacity requirements, the benefits and ROI of external KVCache storage, and identify the most profitable segment from a large number of existing inference services.

During operation, observability is required to promptly detect changes in system performance and continuously optimize storage capacity configuration based on actual online needs. Simultaneously, high availability and high reliability solutions are needed to prevent metadata service or storage system failures from causing inference service anomalies.

For important model services, sufficient KVCache resources must be strictly guaranteed; meanwhile, long-tail model services should share storage resources as much as possible to reduce the workload of capacity configuration.

When switching model versions, ensure that the KVCache of the new and old models is isolated, and flexibly adjust the KVCache capacity ratio according to the changes in inference traffic between the new and old versions.

As the deployment scale expands, requirements like these keep emerging, placing higher demands on the enterprise-level capabilities of the KVCache management system.

[Figure 3]

Tair KVCache Manager Came into Being

Against this background, Alibaba Cloud Tair and the Alibaba Holding AI Engine & Tech Infra and Reliability Engineering team jointly built Tair KVCache Manager (hereinafter Tair KVCM), an enterprise-level management service for large-scale KVCache. The system addresses the above problems through the following design:

  1. Centralized management of KVCache metadata enables global KVCache pooling and sharing across inference instances. This significantly improves inference performance in scenarios with similarly long contexts for agents.
  2. The semantics of LLM KVCache are reasonably abstracted to decouple the inference engine and the storage system. This simplifies the integration process while preserving ample optimization space for the storage system.
  3. The design was conceived from the outset to meet the needs of large-scale deployment, providing enterprise-level management capabilities covering the entire lifecycle of KVCache, such as: ROI assessment and high-yield scenario screening before model deployment, observability and high availability during model online service, etc.

The final implementation significantly reduces the consumption of GPU computing resources and improves the quality of inference online services while maintaining low costs.

The following part of the article will introduce the basic concepts and architecture design of the system in detail.

System Basic Concepts

Control Plane Concepts

[Figure 4]

1. Storage: A Storage system

  • Each Storage has its own configuration (connection address, etc.)
  • Supports different storage types, such as NFS, 3FS, TairMemPool, and Mooncake
  • Different Instance Groups and Instances are allowed to share the same Storage

2. Instance Group

  • All Instances in a Group share one set of quotas
  • Each Group can configure the available Storage list separately
  • Common uses:

    • Corresponding to a business team, multiple models in the team share the storage quota
    • Corresponding to a model, multiple versions of the model share the storage quota.
    • Configure Instance groups separately for important models to ensure exclusive storage resources.
    • Multiple long-tail models share the same Instance Group, sharing storage resources and meeting the sudden capacity requirements of individual models.

3. Instance: a KVCache Instance

  • KVCache is reused only within a single instance (inference instances that need to reuse KVCache should be configured to use the same instance). KVCache is not reused across instances.
  • The corresponding model and KVCache configuration (such as fp8/bf16, block_size, etc.) are fixed and do not change.
  • Belongs to and only belongs to one Instance Group.
  • No separate capacity quota configuration is required; the quota of the Instance Group to which the instance belongs is used.

Through the preceding abstractions, storage, quota, and instance are decoupled, allowing flexible control over KVCache storage mode and capacity and facilitating unified management of a large number of KVCache instances. Configuring the capacity quota on the Instance Group avoids configuring storage capacity for each Instance separately, which simplifies service-side onboarding and facilitates capacity transfer during model version switching.
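The relationship between Storage, Instance Group, and Instance can be pictured with a minimal configuration sketch. All field names and values below are illustrative and not the actual KVCM API.

```python
# Illustrative control-plane objects (hypothetical field names, for explanation only).
storages = {
    "3fs-main":  {"type": "3FS",         "endpoint": "3fs://cluster-a"},
    "mempool-1": {"type": "TairMemPool", "endpoint": "mempool://pool-1"},
}

instance_groups = {
    # An important model gets a dedicated group, guaranteeing it exclusive resources.
    "coder-prod": {"storages": ["mempool-1", "3fs-main"],
                   "quota": {"total": "100T", "TairMemPool": "1T", "3FS": "99T"}},
    # Many long-tail models share one group, pooling quota and absorbing bursts.
    "long-tail":  {"storages": ["3fs-main"], "quota": {"total": "20T"}},
}

instances = {
    # KVCache is reused only within an instance; model/dtype/block_size are fixed per instance.
    # Old and new model versions live in the same group, so capacity shifts with traffic.
    "coder-v1": {"group": "coder-prod", "dtype": "fp8",  "block_size": 64},
    "coder-v2": {"group": "coder-prod", "dtype": "fp8",  "block_size": 64},
    "demo":     {"group": "long-tail",  "dtype": "bf16", "block_size": 64},
}
```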

Data Plane Concepts

[Figure 5]

1. Block: A block is a series of tokens of a fixed length.

  • The number of token IDs corresponding to a single block is specified during instance initialization. The user-provided sequence of token IDs will be divided into multiple blocks.
  • There are prefix dependencies. Two blocks with the same token sequence but different prefixes are different.
  • Each block can have multiple CacheLocation: corresponding to multiple storage locations/levels.

2. CacheLocation: A storage location for a single block.

  • All data within a single CacheLocation must be on the same storage type.
  • Status: writing->serving->deleting

    • writing: the KVCache is being written and cannot be served yet; it does not need to be written again for now.
    • serving: The data has been written and can be read normally.
    • deleting: it is being deleted and cannot be read.

3. LocationSpec: part of the data of a single CacheLocation

  • Storage locations are organized in a URI-like format, but different expressions are allowed:

    • For a memory pool it may be an address, for a file system a file path, and for KV storage a key.
    • This unifies the format and avoids the semantic misalignment caused by forcibly mapping underlying locations to address offsets or key-value pairs.
  • The size is recorded in the URI to simplify capacity accounting.
  • Supports multi-block storage within a single storage location (such as a single 3FS file) via blkid + size.
  • The Spec name is user-configurable, flexibly supporting different TP/PP layouts and mixed attention.
  • A Location may contain only a subset of Specs: in hybrid attention scenarios, many blocks do not need to store the linear-attention part.

This abstraction not only satisfies the KVCache query requirements of the inference engine (composite matching with hybrid attention, etc.) but also preserves the relevant LLM business characteristics and semantics inside Tair KVCM: prefix relationships between blocks, the association between a location and multiple specs, and so on. Because Tair KVCM can perceive and retain these LLM characteristics, it can implement further optimizations for KVCache storage, such as classifying requests more finely by prefix and adjusting eviction policies accordingly, using prefix dependencies to avoid storing suffixes that can never be hit, and retaining linear-attention data only for the blocks that most need it, optimizing storage capacity without sacrificing hit rate.
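A minimal data-model sketch of Block, CacheLocation, and LocationSpec might look like the following; the types and fields are illustrative rather than the actual KVCM schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class LocationState(Enum):
    WRITING = "writing"    # being written, not yet readable
    SERVING = "serving"    # fully written, readable
    DELETING = "deleting"  # being removed, not readable

@dataclass
class LocationSpec:
    name: str    # user-defined, e.g. "tp0"/"tp1", or "full" vs "linear" attention parts
    uri: str     # memory address, file path, or KV key, depending on the backend
    blkid: int   # index when several blocks share one storage object (e.g. one 3FS file)
    size: int    # recorded here to simplify capacity accounting

@dataclass
class CacheLocation:
    storage_type: str            # all specs of one location live on the same storage type
    state: LocationState
    specs: List[LocationSpec]    # may hold only some specs, e.g. no linear part for a block

@dataclass
class Block:
    block_key: str               # derived from the tokens and their prefix
    parent_key: str              # prefix dependency on the preceding block
    locations: List[CacheLocation] = field(default_factory=list)  # one per storage tier/copy
```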

From the perspective of the underlying storage, access complexity is reduced (no need to understand inference concepts; they are completely transparent) while refined, translated storage characteristics (such as lifecycle relationships between storage objects) remain available, leaving ample room for the further optimization and development of dedicated storage systems.

Deployment Mode and Service Interface

Tair KVCM is deployed in a centralized mode. Because it is responsible only for KVCache metadata management, and is implemented in C++, scaling up is enough to serve a single inference cluster. At the same time, the Instance Group abstraction makes it easy to split different groups that use the same storage and scale out horizontally to serve different inference clusters. The centralized mode greatly simplifies deployment and operation. It also avoids injecting KVCM logic into the inference container, thereby avoiding potential problems such as an explosion in the number of connections to the metadata storage backend (such as Valkey).

Furthermore, drawing on designs such as the 3FS Master, Tair KVCM can interact with the inference engine and backend storage over TCP only, without a heavy dependency on an RDMA environment. For scenarios where only GPU machines sit inside the RDMA domain, KVCM can be deployed on other machines outside it, mitigating the impact of the higher failure rate of GPU machines.

[Figure 6]

After the storage location of KVCache is obtained through Tair KVCM, the Connector in the inference engine will directly read and write the back-end storage, and the KVCache data stream completely bypasses Tair KVCM, reducing the read and write latency of KVCache and the bandwidth demand for Tair KVCM.

To facilitate isolation, Tair KVCM separates the metadata plane from the control plane. Here are some of the key interfaces.

Management Interface:

  • CRUD interfaces for objects such as Storage, Instance Group, and Instance
  • Account-related APIs for permission management control
  • Metrics-related APIs for system observability functions

Metadata Interface:

  • RegisterInstance: used to create an Instance from the metadata plane to simplify the inference access process
  • Major metadata-related APIs:

    • getCacheLocation: returns whether the KVCache is hit and its storage locations (see the sketch after this list).
    • startWriteCache: requests target write locations for the KVCache to be written and notifies KVCM that writing has started.
    • finishWriteCache: reports that writing has completed.
    • removeCache & trimCache: prune and delete KVCache data.
    • API parameter design:

      • Supports block_keys and token_ids modes, compatible with multiple inference engines and convenient for peripheral system queries
      • Primary interfaces carry complete prefix information to maintain prefix metadata
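To make the read path concrete, the sketch below shows what a getCacheLocation exchange over the HTTP interface could look like. The endpoint, request fields, and response shape are all illustrative assumptions, not the published API.

```python
# Hypothetical getCacheLocation exchange (endpoint and field names are assumptions).
import requests

request = {
    "instance": "coder-v1",
    # Complete prefix information is carried so that KVCM can maintain prefix metadata.
    "block_keys": ["k0", "k1", "k2", "k3"],   # token_ids mode is also supported
    "match_mode": "prefix",                    # kv / prefix / sliding_window
}
resp = requests.post("http://kvcm:8080/getCacheLocation", json=request).json()

# A prefix match returns locations only for the leading run of hits, e.g.:
# {"hit_blocks": 2,
#  "locations": [
#     {"block_key": "k0", "storage_type": "TairMemPool", "state": "serving",
#      "specs": [{"name": "tp0", "uri": "mempool://pool-1/0x7f3a0000", "size": 2097152}]},
#     {"block_key": "k1", "storage_type": "3FS", "state": "serving", "specs": ["..."]}]}
```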

System Architecture

[Figure 7]

Manager

Access Layer (Server)

  1. Provides both HTTP and gRPC interfaces to facilitate access by different inference engines.
  2. Relying on the RDMA ecosystem in C++, RDMA or even GPU Direct access can be realized if there is a requirement for extremely low latency.

Read and write module (CacheManager)

1.  Multiple matching modes. When the inference engine reads KVCache metadata, the following matching patterns are supported:

  • KV-type matching: given a number of keys, directly return the corresponding metadata;
  • Prefix matching: Given a number of keys, only metadata that matches the prefix order is returned. This matching mode is the main mode when the inference engine reads.
  • Sliding window matching: Given several keys and window lengths, only metadata within the nearest window length is returned. For the KVCache mode using the sliding window mechanism, this matching method has higher performance;

2.  Two-phase write mechanism. When the inference engine writes KVCache, the following two-phase write mechanism is used (see the sketch after this list):

  • Before writing, the StartWriteCache interface is called to obtain meta information from Tair KVCM, and the SDKs of the respective storage backends are called locally to write the data. The cache locations being written are marked as writing while inference is expected to finish writing within the specified timeout, ensuring that the same key cannot be written more than once at the same time;
  • After writing, the FinishWriteCache API is called to notify Tair KVCM that writing has completed. The server marks successfully written locations as serving and deletes the failed ones.

3.  High availability read and write support:

  • Tair KVCM supports multiple storage backends (see the "storage management" section below), and supports dynamic switching of different storage backends for reading and writing to improve availability;
  • Because KVCM can interface with multiple storage backends at the same time, the inference engine can easily store data to different backends as required, for example hot data to TairMemPool and cold data to 3FS. It also ensures that the cache service remains available when one backend goes down.
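The sketch below outlines the client-side flow of the two-phase write described above; the function names and fields are hypothetical and stand in for the real SDK calls.

```python
# Illustrative client-side flow of the two-phase write (hypothetical SDK names).
def write_blocks(kvcm, backend_sdk, instance, block_keys, payloads, timeout_s=10):
    # Phase 1: ask KVCM for write locations. They are marked "writing" so the
    # same key cannot be written concurrently by another inference process.
    grant = kvcm.start_write_cache(instance=instance,
                                   block_keys=block_keys,
                                   timeout=timeout_s)

    results = {}
    for loc in grant["locations"]:
        try:
            # The data path goes straight to the backend storage and bypasses KVCM.
            backend_sdk.write(loc["uri"], payloads[loc["block_key"]])
            results[loc["block_key"]] = "ok"
        except Exception:
            results[loc["block_key"]] = "failed"

    # Phase 2: report completion. KVCM flips successful locations to "serving" and
    # deletes failed ones so they do not occupy capacity without being able to serve hits.
    kvcm.finish_write_cache(instance=instance, results=results)
```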

Storage management (DataStorage)

1.  Compatible with multiple storage systems

  • Different storage systems have separate storage configurations.
  • Supports multiple Storage systems such as 3FS, TairMemPool, Mooncake, and NFS. The same Storage can be shared by different InstanceGroup and instances.
  • The storage location format is unified as a URI and supports different expressions: an address for a memory pool, a path for a file system, and so on.
  • All storage systems are accessed through the unified DataStorageBackend abstract API to ensure that the upper-layer logic does not need to perceive the underlying differences. The decoupling and scalability are strong. The upper layer only needs to pass Uri, and the new storage type only needs to implement the interface without modifying the core logic.

2.  DataStorageManager for unified management of multiple storage systems

  • GetAvailableStorage performs a probing check. Health checks are periodically performed in the background to improve performance and return all available storage instances to improve availability.
  • Supports dynamic lifecycle management of different back-end storage. RegisterStorage and UnRegisterStorage can register and unregister storage instances.
  • EnableStorage and DisableStorage dynamically enable or disable the available status of a storage instance.
  • Create, Delete, and Exist are used to allocate and release space on a storage instance and to check its availability.

3.  Other features

  • For NFS and 3FS storage systems with no metadata management capability or limited metadata performance, the block allocation capability is provided (a small number of large files are allocated and used in small pieces).

4.  Introduction to TairMemPool: TairMemPool is a high-performance KVCache memory-pool solution jointly built by the Alibaba Cloud Tair team and the server R&D custom computing and chip system team. Through hardware-software co-optimization, the solution achieves unified addressing and global access across multi-node memory, supports multi-media/multi-protocol access and KVCache-specific transmission optimizations, reaches over 90% network bandwidth utilization in multi-NIC environments, and offers enterprise-grade high availability, providing highly reliable storage support for the Tair KVCache system.
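One way to picture the DataStorageBackend abstraction and the block-allocation feature is the sketch below. The method names follow the description above, but the exact signatures and the 3FS allocation logic are simplified assumptions.

```python
from abc import ABC, abstractmethod

class DataStorageBackend(ABC):
    """Uniform backend interface: the upper layer passes only URIs and never needs
    to know whether the bytes live in a memory pool, a file system, or a KV store."""

    @abstractmethod
    def create(self, size: int) -> str: ...   # allocate space, return its URI
    @abstractmethod
    def delete(self, uri: str) -> None: ...   # release the space behind a URI
    @abstractmethod
    def exist(self, uri: str) -> bool: ...    # check whether a URI is still valid

class ThreeFSBackend(DataStorageBackend):
    """Sketch of a backend with limited metadata performance: a few large files are
    pre-allocated and handed out in small slices (the block-allocation feature)."""
    def __init__(self, mount: str, file_size: int = 1 << 30):
        self.mount, self.file_size = mount, file_size
        self.file_idx, self.offset, self.live = 0, 0, set()

    def create(self, size: int) -> str:
        if self.offset + size > self.file_size:      # current large file is full
            self.file_idx, self.offset = self.file_idx + 1, 0
        uri = f"3fs://{self.mount}/data_{self.file_idx}?off={self.offset}&size={size}"
        self.offset += size
        self.live.add(uri)
        return uri

    def delete(self, uri: str) -> None:
        self.live.discard(uri)                       # real space reclamation is more involved

    def exist(self, uri: str) -> bool:
        return uri in self.live
```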

Index management (MetaIndex)

1.  Use KV system to store metadata

  • KV systems are easy to obtain and have a mature ecosystem. The open-source community and cloud marketplaces provide many highly available, high-performance, and scalable KV storage options, such as Valkey and Redis, memory-based high-performance KV engines with persistence support, as well as RocksDB, which is based on an LSM-Tree structure and suits write-intensive workloads.
  • In the early stages of the project, the team has limited resources, and the ability to directly use external high availability and persistence can effectively reduce the complexity of the initial system and development costs.
  • Metadata is also applicable to KV storage, which is unified with the KV semantics of the Upper Cache, facilitating the design and implementation of external interfaces.

2.  Interface Design

  • Supports Put, Delete, Update, and Get interfaces for KVCache metadata, providing the ability to add, delete, modify, and query data.
  • Supports the ReadModifyWrite interface and provides thread-safe updates for customizable modifiers.

    • In high concurrency scenarios, if multiple inference requests update the KV Cache metadata information of the same model at the same time, simple read rewriting is prone to inconsistent status. Therefore, MetaIndex provides the native read-rewrite atomic operation interface.
    • ReadModifyWrite(keys, ModifierFunc), which can be used externally to process the three operations of adding, modifying and deleting through custom ModifierFunc functions, providing flexible and thread-safe read and rewrite capabilities.
    • In this interface, a slice lock will be added to the specified keys to query the keys. The user can specify delete, modify, and write operations based on the query results through the custom ModifierFunc function. Finally, the MetaIndex verifies and performs the corresponding operations.
  • Supports Scan and RandomSample interfaces, as well as interfaces for obtaining storage and cache capacity.

    • As the cache grows, the system must implement effective capacity management (such as LRU eviction) on Instance-granularity metadata storage; these interfaces provide the data needed by the eviction policies and identify the keys to be evicted.

3.  Other features

  • Use LRU Cache as the search cache for metadata storage to reduce query I/O and latency in high hit scenarios

    • SearchCache is located above the data access logic. When querying the Value, the local LRU Cache will be tried first. If there is a Cache, it will not participate in subsequent queries with I/O overhead.
    • A new KV query inserts its result into the SearchCache; after any write operation (such as Put, Delete, or Update) is successfully committed to the underlying storage, the corresponding cache entry is invalidated synchronously.
    • You can configure LRU Cache parameters to adjust the Search Cache capability, such as Cache size and number of Cache shards.
  • Supports the configuration of sharding locks and batch operations to provide high-performance and high-concurrency services.

    • In a multi-threaded environment, metadata operations must ensure thread safety. However, global locking can become a performance bottleneck, and complete lock-free makes it difficult to implement complex Read-Modify-Write semantics. To this end, the system uses a hybrid strategy of Sharded Locking and batch alignment.
    • Sharded locks: the key space is divided into N shards by hash (N must be configured as a power of two), each with an independent lock. Operating on a key only requires locking the shard it belongs to. Note that, to improve query performance, query operations neither take locks nor are blocked by them.
    • Batch operations aligned with lock granularity: all external MetaIndex interfaces accept batch operations. To avoid deadlock, locks are acquired in ascending order of shard index. Also, to prevent a large batch from blocking other small batches for a long time, large batches of keys are split into several smaller batches whose lock shards do not intersect; once a small batch finishes, its shard locks are released. The batch size is governed by the batch_size and shard_count configurations. This adaptive batching by lock granularity lets MetaIndex trade off high concurrency against single-batch throughput via configuration parameters to suit different business characteristics.
  • The underlying storage supports flexible scaling. Currently, local storage and remote Valkey/Redis storage are supported.
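A simplified sketch of the sharded-lock ReadModifyWrite path is shown below; the real implementation is in C++ and differs in detail, so treat this as an illustration of the locking and batching strategy only.

```python
import threading

class MetaIndexSketch:
    """Illustration of sharded locking + batched ReadModifyWrite (not the real code)."""

    def __init__(self, shard_count=16, batch_size=64):
        assert shard_count & (shard_count - 1) == 0, "shard_count must be a power of two"
        self.shard_count, self.batch_size = shard_count, batch_size
        self.locks = [threading.Lock() for _ in range(shard_count)]
        self.store = {}   # stands in for the local or Valkey/Redis KV backend

    def _shard(self, key):
        return hash(key) & (self.shard_count - 1)

    def read_modify_write(self, keys, modifier):
        """modifier(key, old_value) -> new_value, or None to delete the key."""
        # Split large batches so one big request cannot block small ones for long.
        for i in range(0, len(keys), self.batch_size):
            batch = keys[i:i + self.batch_size]
            # Lock the shards touched by this batch in ascending order to avoid deadlock.
            shards = sorted({self._shard(k) for k in batch})
            for s in shards:
                self.locks[s].acquire()
            try:
                for k in batch:
                    new_value = modifier(k, self.store.get(k))
                    if new_value is None:
                        self.store.pop(k, None)    # delete
                    else:
                        self.store[k] = new_value  # put / update
            finally:
                for s in shards:
                    self.locks[s].release()
```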

Capacity management (Reclaimer & Executor)

[Figure 8]

Capacity management in KVCM is implemented in two dimensions:

Instance Group:

  • Quota configuration:

    • The Instance Group can be configured with one total Quota and one Quota per storage type, for example: total Quota = 100T, TairMemPool Quota = 1T, 3FS Quota = 99T.
    • The Quota configuration controls the hard upper limit on an Instance Group's storage capacity: if the total usage of an Instance Group exceeds the configured total Quota, KVCM stops allocating write addresses to that Instance Group; if the usage of a particular storage type under an Instance Group exceeds the corresponding Quota, KVCM stops allocating write addresses for that Instance Group on that type of storage backend and tries other available backends instead.
    • By controlling the total usage and the upper limit of the storage type usage, you can more clearly manage the storage usage intention of KVCache data under the Instance Group.
  • Water level configuration:

    • The Instance Group configuration also contains the eviction policy. One of the important controls is the usage level configuration, which is expressed as a floating point percentage. When the usage of an Instance Group reaches the specified water level, KVCM will trigger the eviction action to try to control the usage below this water level.
    • The eviction water-level configuration is equivalent to a soft upper limit on storage capacity: if an Instance Group's usage of a certain storage type exceeds the configured water level, KVCM starts to trigger eviction for that Instance Group restricted to that storage type; if the Instance Group's overall water level is exceeded, all storage types of the Instance Group may be evicted.
    • Usage level and eviction policy control are important means of overall usage management and advance planning of the storage backend, so as to make the entire service lifecycle of the large-scale deployment of KVCache data storage pool sustainable.

Storage back-end capacity:

  • In the future, you can configure a capacity management policy for each specific storage backend Instance. When the total usage of a storage backend reaches the upper limit, write data is stopped, and data is evicted according to a certain algorithm when the eviction level is reached. This can be combined with the capacity management of the Instance Group to further ensure the security of the storage backend.

In terms of engineering implementation, KVCM executes deletion asynchronously by placing deletion tasks in a background thread pool, among other techniques, so that deletion performance can scale to larger deployments. It supports common eviction modes such as TTL, LRU, and LFU to flexibly adapt to different capacity-management requirements; supports updating water levels and Quotas at runtime; and supports dynamic tuning of eviction parameters at the Instance level in combination with the Optimizer module. In the future, more precise prefix-based eviction will be supported, for example evicting a suffix block before its parent block, and correlated eviction of the Linear and Full parts under mixed attention.
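The interaction between the hard Quota limit and the soft eviction water level can be summarized in a small decision sketch; the thresholds and field names below are illustrative.

```python
# Illustrative capacity decision for one Instance Group (not the actual KVCM logic).
def can_allocate(group, storage_type, usage):
    """Hard limit: stop handing out write locations once a Quota is exceeded."""
    if usage["total"] >= group["quota"]["total"]:
        return False                                    # whole group over its total Quota
    if usage.get(storage_type, 0) >= group["quota"].get(storage_type, float("inf")):
        return False                                    # try another storage backend instead
    return True

def eviction_targets(group, usage):
    """Soft limit: trigger background eviction once a water level is crossed."""
    targets = []
    for storage_type, used in usage.items():
        if storage_type == "total":
            continue
        quota = group["quota"].get(storage_type)
        if quota and used >= group["water_level"] * quota:
            targets.append(storage_type)                # evict only on this storage type
    if usage["total"] >= group["water_level"] * group["quota"]["total"]:
        targets = [t for t in usage if t != "total"]    # all storage types may be evicted
    return targets

group = {"quota": {"total": 100, "TairMemPool": 1, "3FS": 99}, "water_level": 0.9}
usage = {"total": 91, "TairMemPool": 0.95, "3FS": 90.05}
print(can_allocate(group, "3FS", usage))   # True: still under the hard Quota
print(eviction_targets(group, usage))      # both backends are past the 90% water level
```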

Optimizer

The Optimizer module is a cache-strategy optimization and analysis tool designed for LLM inference systems. By efficiently simulating and replaying real trace data, it analyzes KVCache hit rates and storage-capacity consumption patterns, and uses the results to guide and optimize the Manager's capacity management.

The module builds a prefix-matching index based on a Radix Tree, supports parsing and converting real access traces in multiple formats (including Tair KVCM events), and accurately simulates read/write access patterns under different eviction policies such as LRU and Random-LRU. It not only computes the impact of different capacity configurations on the cache hit rate but also analyzes key indicators such as the temporal locality and access-frequency distribution of each block. The module supports parameter sweeps and replay loops to automatically find the optimal storage parameter configuration, achieving the best trade-off between storage cost and performance. It also has built-in flexible cache-tier configuration and a unified eviction-policy interface, supporting fine-grained analysis of a single cache layer while reserving extension points for the collaborative optimization of multi-tier cache architectures.
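At its core, this kind of analysis boils down to replaying a trace of block accesses against a simulated cache of a given capacity. The sketch below illustrates the idea with a plain LRU policy; the actual Optimizer additionally uses a Radix-Tree prefix index and richer eviction policies.

```python
from collections import OrderedDict

def replay_trace(trace, capacity_blocks):
    """Replay block-key accesses against an LRU cache of a given capacity and return
    the hit rate, the basic loop behind capacity/hit-rate curves."""
    cache, hits = OrderedDict(), 0
    for block_key in trace:
        if block_key in cache:
            hits += 1
            cache.move_to_end(block_key)        # refresh recency on a hit
        else:
            cache[block_key] = True
            if len(cache) > capacity_blocks:    # evict the least recently used block
                cache.popitem(last=False)
    return hits / max(len(trace), 1)

# Sweep capacities to find the knee of the hit-rate curve for a given trace.
trace = ["s1/0", "s1/1", "s2/0", "s1/0", "s1/1", "s1/2", "s2/0", "s2/1"]
for capacity in (1, 2, 4, 8):
    print(capacity, f"{replay_trace(trace, capacity):.2f}")
```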

Through a unified trace data structure and an efficient trace-processing mechanism, the Optimizer module can accurately reproduce cache behavior in real business scenarios and provide stable, fast performance analysis. As shown in the figures below, the dynamic analysis in the first figure tracks the correlated changes of cache capacity usage and hit rate in real time, while the second figure depicts the KVCache hit-rate benefit of different eviction strategies under different cache capacity configurations, allowing the performance of different eviction strategies to be evaluated and the optimal capacity configuration to be guided.

[Figures 9 and 10]

Sample trace source

Furthermore, through its flexible architecture, the Optimizer module supports deep integration with a GPU inference performance model (computing profile). It can quantitatively model how cache hits reduce Prefill-stage computation, directly map the latency improvement (such as TTFT optimization) and throughput improvement brought by KVCache to improved GPU resource utilization, and then quantify concrete optimization suggestions such as reducing the number of GPU nodes and saving operating costs, providing a data-driven basis for resource scheduling and cost optimization in Agent inference systems. Details will be covered in subsequent articles.

Outlook for future work

Support for more comprehensive LLM caching scenarios

As multimodal model capabilities improve, the demand for multimodal, multi-turn sessions is also increasing, necessitating further improvements to the multimodal support of the LLM KVCache interface.

For encoder caches and similar VLCache, which are context-free KVCache scenarios, a traditional KV-semantics interface will be added.

In the face of more diverse deployment environments, there is also a lot of work:

1. Offline private environments: self-maintained hosts in offline environments give KVCM optimization targets beyond hit rate alone; more targeted eviction algorithms need to be designed for this type of scenario.

2. Super Pods: in such systems, the bandwidth for a GPU to access the memory pool over the ScaleUp network is greater than its bandwidth to CPU memory over PCIe, so it is natural to pool and manage the memory used for KVCache uniformly. We will help implement KVCache storage management within the super node to give full play to the performance advantages of super nodes.

Improve support for mainstream inference engines and back-end storage

Through cooperation with the inference engine community, more inference engines can obtain native support capabilities. At the same time, the transmission performance will be further optimized.

Adapt to more backend storage systems to meet KVCache storage requirements in a wider variety of scenarios.

Further enhance simulation and cache analysis capabilities and optimize storage algorithms

Enhance joint simulation with inference, and optimize eviction algorithms to improve the hit rate and reduce inference costs.

Furthermore, most existing tiering algorithms are based on traditional data characteristics such as hot and cold data, which are not entirely suitable for the data tiering requirements of LLM KVCache scenarios. It is necessary to explore data tiering algorithms that better meet the needs of this scenario, taking into account the access characteristics of LLM.

Continuously enhance enterprise-class capabilities

For the pain points exposed under large-scale deployment, continue to enhance the relevant enterprise-level capabilities to provide more comprehensive management capabilities.
