
Alibaba Cloud Boosts GPU Utilization with AI Infrastructure Breakthrough at SOSP 2025

This article introduces Aegaeon, an AI infrastructure breakthrough from Alibaba Cloud accepted at SOSP 2025, which significantly boosts GPU utilization for serving multiple AI models concurrently.


Recently, Alibaba Cloud's research on multi-model GPU pooling services was accepted to the prestigious Symposium on Operating Systems Principles (SOSP) 2025 conference. The paper introduces Aegaeon, a multi-model hybrid serving system that significantly improves GPU resource utilization. The core technologies behind Aegaeon have already been deployed in Alibaba Cloud Model Studio.

SOSP, organized by ACM SIGOPS, is one of the top-tier conferences in computer systems. Often called the "Oscars of operating systems research," it accepts only a few dozen papers each year, representing the most significant advances in operating systems and systems software; this year, just 66 papers were accepted. A prominent trend among them is the integration of systems software with large AI model technologies.

Research Background

The number of AI models worldwide is growing rapidly: Hugging Face now hosts over one million models. In real-world deployments, a small subset of popular models dominates inference requests, while over 90% of models see only infrequent use. The current de facto solution, reserving at least one inference instance per model, therefore leaves GPU resources significantly underutilized.

Research Approach

In their paper, the research team introduces Aegaeon, an innovative multi-model hybrid serving system that schedules work at the token level, achieving significantly higher GPU utilization when serving many large language models (LLMs) concurrently.

Aegaeon's serving architecture is designed around three key components: a proxy layer, a GPU pool, and a memory manager.

Image | Research framework

Proxy layer: This layer receives and dispatches inference requests, ensuring load balancing and fault tolerance. State is synchronized across proxy replicas through a shared in-memory store such as Redis. Because Aegaeon can direct requests from different models to the same instance, resource sharing and scheduling flexibility are greatly enhanced.
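The paper does not include source code, so the sketch below only illustrates how such a proxy might pick an instance using Redis-backed load state. The key name, the `dispatch` helper, and the `forward_to_instance` stub are illustrative assumptions, not Aegaeon's actual interface:

```python
import redis

# Hypothetical shared state: a Redis hash mapping each instance ID to its
# number of in-flight requests, visible to every proxy replica.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)
LOAD_KEY = "aegaeon:instance_load"  # illustrative key name

def forward_to_instance(inst: str, model_id: str, payload: dict) -> str:
    raise NotImplementedError  # placeholder for the RPC to the instance

def dispatch(model_id: str, payload: dict) -> str:
    """Route a request to the least-loaded instance. Any instance may serve
    any model, so routing considers only load and leaves model switching to
    the instance's token-level scheduler."""
    load = r.hgetall(LOAD_KEY)  # assumes instances registered at startup
    inst = min(load, key=lambda i: int(load[i]))
    r.hincrby(LOAD_KEY, inst, 1)       # record the new in-flight request
    try:
        return forward_to_instance(inst, model_id, payload)
    finally:
        r.hincrby(LOAD_KEY, inst, -1)  # release on completion or failure
```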

GPU pool: This is a centralized resource pool of virtualized GPU instances provided by cloud vendors, where each instance comprises one or more GPUs hosted on a single physical machine. Within Aegaeon, an instance performs either prefill or decoding and, guided by a token-level scheduler, serves requests from multiple models concurrently, which makes model switching a critical operation.
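A minimal sketch of this disaggregated-instance idea follows; the class and method names are hypothetical, standing in for whatever Aegaeon actually uses:

```python
from enum import Enum

class Role(Enum):
    PREFILL = "prefill"  # processes prompts and builds the KV cache
    DECODE = "decode"    # generates tokens from an existing KV cache

class GPUInstance:
    """One virtualized instance in the pool: one or more GPUs on a single
    physical machine, dedicated to either prefill or decoding."""

    def __init__(self, role: Role):
        self.role = role
        self.active_model: str | None = None  # weights resident on the GPU

    def serve_step(self, model_id: str, batch) -> None:
        # The token-level scheduler may hand this instance work from any
        # model, so each step first ensures the right weights are loaded.
        if self.active_model != model_id:
            self.switch_model(model_id)  # the critical, latency-sensitive op
        if self.role is Role.PREFILL:
            self.run_prefill(batch)
        else:
            self.run_decode(batch)

    def switch_model(self, model_id: str) -> None:
        # Made fast in Aegaeon by weight caching and KV offload (see below).
        self.active_model = model_id

    def run_prefill(self, batch) -> None: ...  # placeholder
    def run_decode(self, batch) -> None: ...   # placeholder
```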

Memory manager: This component coordinates host and GPU memory resources across nodes in the serving cluster, with two primary goals (sketched in code after this list):

1) QuickLoader caches model weights in available memory, significantly reducing the latency of loading models from remote repositories.

2) The GPU-CPU key-value (KV) management mechanism provides unified storage and management for KV caches.
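A minimal sketch of both ideas, assuming a simple LRU policy for the weight cache and PyTorch pinned-memory copies for KV offload; the `QuickLoaderCache` and `offload_kv` names are hypothetical:

```python
from collections import OrderedDict
import torch

class QuickLoaderCache:
    """Illustrative LRU cache holding model weights in pinned host memory,
    so a model switch loads from local RAM instead of a remote repository."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._cache: OrderedDict[str, dict[str, torch.Tensor]] = OrderedDict()

    def get(self, model_id: str) -> dict[str, torch.Tensor]:
        if model_id in self._cache:
            self._cache.move_to_end(model_id)  # mark as recently used
            return self._cache[model_id]
        weights = self._fetch_remote(model_id)  # slow path: remote repository
        if len(self._cache) >= self.capacity:
            self._cache.popitem(last=False)     # evict least recently used
        self._cache[model_id] = weights
        return weights

    def _fetch_remote(self, model_id: str) -> dict[str, torch.Tensor]:
        raise NotImplementedError  # download + pin weights for fast H2D copies

def offload_kv(kv_gpu: torch.Tensor) -> torch.Tensor:
    """Copy one KV-cache block from GPU into pinned host memory, the basic
    move behind unified GPU-CPU KV storage; the pinned destination lets the
    reverse copy run asynchronously when the request resumes."""
    kv_host = torch.empty(kv_gpu.shape, dtype=kv_gpu.dtype,
                          device="cpu", pin_memory=True)
    kv_host.copy_(kv_gpu)
    return kv_host
```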

Research Innovations and Conclusions

Aegaeon pioneers token-level scheduling, enabling dynamic model switching decisions after each generated token. By leveraging accurate execution time prediction and an innovative token-level scheduling algorithm, Aegaeon precisely determines model switching needs, enabling concurrent multi-model serving while meeting latency requirements.
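As a rough illustration of the decision made after every token, consider the heuristic below. The constants and the `PendingWork`/`next_model` names are assumptions; Aegaeon's real algorithm relies on accurate execution-time prediction rather than fixed costs:

```python
from dataclasses import dataclass

@dataclass
class PendingWork:
    model_id: str
    deadline_ms: float  # time remaining before this request misses its SLO

# Assumed constants standing in for Aegaeon's execution-time predictor.
DECODE_MS = 30.0   # predicted time to generate one more token
SWITCH_MS = 800.0  # predicted model-switch cost (sub-second, per the paper)

def next_model(current: str, queues: dict[str, list[PendingWork]]) -> str:
    """Called after each generated token: keep decoding the current model
    unless another model has a request so urgent that delaying the switch
    by even one more token would make it miss its deadline."""
    for model_id, queue in queues.items():
        if model_id == current or not queue:
            continue
        most_urgent = min(work.deadline_ms for work in queue)
        # Switching now costs SWITCH_MS; waiting one more token costs
        # DECODE_MS + SWITCH_MS. Switch if only the former fits the deadline.
        if SWITCH_MS <= most_urgent < DECODE_MS + SWITCH_MS:
            return model_id
    return current
```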

Furthermore, through component reuse, fine-grained GPU memory management, and optimized KV cache synchronization, Aegaeon reduces model switching overhead by up to 97%, ensuring real-time token-level scheduling with sub-second model switching latency.
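One of these optimizations can be sketched with standard CUDA-stream overlap in PyTorch: start the next model's host-to-GPU weight copy on a side stream while the current model keeps decoding. The buffer-reuse scheme and `begin_switch` function are illustrative assumptions, not the paper's exact mechanism:

```python
import torch

copy_stream = torch.cuda.Stream()  # side stream dedicated to weight copies

def begin_switch(gpu_buffers: dict[str, torch.Tensor],
                 pinned_weights: dict[str, torch.Tensor]) -> torch.cuda.Event:
    """Start copying the next model's weights from the pinned host cache
    into preallocated GPU buffers (reused across models, avoiding repeated
    allocation and initialization) while decoding continues on the default
    stream. Assumes the buffers are already sized for the incoming model."""
    done = torch.cuda.Event()
    with torch.cuda.stream(copy_stream):
        for name, host_tensor in pinned_weights.items():
            # Async H2D copy: legal because the source is pinned host memory.
            gpu_buffers[name].copy_(host_tensor, non_blocking=True)
        done.record()
    return done

# Before the new model's first token, block only as late as necessary:
# torch.cuda.current_stream().wait_event(done)
```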

Aegaeon supports concurrent serving of up to seven different models on a single GPU, achieving 1.5× to 9× higher effective throughput and 2× to 2.5× greater request handling capacity compared to existing mainstream systems.

Image | Aegaeon significantly boosts GPU utilization

Practical Applications

Aegaeon's core technologies have been deployed in Alibaba Cloud Model Studio, supporting inference for dozens of models while reducing GPU consumption by 82%. To date, Alibaba Cloud Model Studio has launched more than 200 industry-leading models, including Qwen, Wan, and DeepSeek, and has seen a fifteen-fold increase in model invocations over the past year.
