As large language models (LLMs) become more prevalent, deploying and managing them efficiently, reliably, and at scale in production is a major challenge for businesses. The Cloud-native AI Serving Stack is an end-to-end solution built on Container Service for Kubernetes and designed specifically for cloud-native AI inference. The stack addresses the entire lifecycle of LLM inference and provides integrated features such as deployment management, smart routing, automatic scaling, and deep observability. The Cloud-native AI Serving Stack helps you manage complex cloud-native AI inference scenarios, whether you are just starting or running large-scale AI operations.

Core features
The Cloud-native AI Serving Stack makes running LLM inference services on Kubernetes easier and more efficient. It uses innovative workload designs, fine-grained scaling, deep observability, and powerful extension mechanisms. The AI Serving Stack has the following core features.
| Feature | Description | References |
| --- | --- | --- |
| Single-node LLM inference | You can use a StatefulSet to deploy an LLM inference service in single-node, single-GPU or single-node, multi-GPU mode. A minimal manifest sketch follows this table. | Deploy a single-node LLM inference service |
| Multi-node distributed LLM inference | You can use a LeaderWorkerSet to deploy a multi-node, multi-GPU distributed inference service. A sketch follows this table. | |
| PD separation deployment for various inference engines | Inference engines implement prefill-decode (PD) separation with different architectures and deployment methods. The AI Serving Stack uses RoleBasedGroup as a unified workload to deploy these PD separation architectures. | |
| Elastic scaling | Balancing cost and performance is crucial for LLM services. The AI Serving Stack provides multi-dimensional, multi-layer automatic scaling. An example scaling policy follows this table. | |
| Observability | The black-box nature of the inference process is a major obstacle to performance optimization. The AI Serving Stack provides a ready-to-use, in-depth observability solution. | Configure monitoring for an LLM inference service |
| Inference gateway | The ACK Gateway with Inference Extension component is built on the Kubernetes Gateway API and its Inference Extension specification. It supports Kubernetes Layer 4 and Layer 7 routing and adds capabilities for generative AI inference scenarios, simplifying the management of inference services and optimizing load balancing across inference workloads. A routing sketch follows this table. | Configure smart routing with an inference gateway for an LLM inference service |
| Model acceleration | Slow loading of large model files leads to long cold starts and hinders elastic scaling. Fluid builds a distributed cache that stores remote model files on local nodes, enabling fast startup, zero redundancy, and high elasticity. A Fluid sketch follows this table. | |
| Performance profiling | For deeper performance analysis, you can use the AI Profiling tool to collect data from GPU container processes and analyze the performance of online training and inference services without interrupting the service or modifying code. | AI Profiling |
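The following manifest is a minimal sketch of the single-node pattern described in the table, assuming a vLLM-style OpenAI-compatible server. The image, model path, port, and PVC name are placeholders rather than values defined by the AI Serving Stack; see Deploy a single-node LLM inference service for the supported procedure.

```yaml
# Minimal single-node, single-GPU sketch. Image, model path, port, and PVC name
# are placeholders; adjust them to your engine and model storage.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: qwen-inference
spec:
  serviceName: qwen-inference
  replicas: 1
  selector:
    matchLabels:
      app: qwen-inference
  template:
    metadata:
      labels:
        app: qwen-inference
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest    # placeholder engine image
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args: ["--model", "/models/Qwen2-7B-Instruct", "--port", "8000"]
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1             # increase for single-node, multi-GPU
        volumeMounts:
        - name: model
          mountPath: /models
      volumes:
      - name: model
        persistentVolumeClaim:
          claimName: model-pvc            # placeholder PVC that holds the model weights
```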
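For the multi-node pattern, a LeaderWorkerSet schedules a leader pod and its worker pods as one group and scales the group as a unit. The sketch below assumes a Ray-based engine such as vLLM; the images, startup commands, group size, and GPU counts are illustrative assumptions, not prescribed values.

```yaml
# Illustrative two-pod group (1 leader + 1 worker) for distributed inference.
# Commands and GPU counts are examples only.
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: qwen-distributed
spec:
  replicas: 1                  # number of leader + worker groups
  leaderWorkerTemplate:
    size: 2                    # pods per group, including the leader
    leaderTemplate:
      spec:
        containers:
        - name: leader
          image: vllm/vllm-openai:latest   # placeholder
          command: ["sh", "-c"]
          args:
          - ray start --head --port=6379 &&
            python3 -m vllm.entrypoints.openai.api_server
            --model /models/Qwen2-72B-Instruct
            --tensor-parallel-size 8 --pipeline-parallel-size 2
          resources:
            limits:
              nvidia.com/gpu: 8
    workerTemplate:
      spec:
        containers:
        - name: worker
          image: vllm/vllm-openai:latest   # placeholder
          command: ["sh", "-c"]
          args:
          - ray start --address=${LWS_LEADER_ADDRESS}:6379 --block
          resources:
            limits:
              nvidia.com/gpu: 8
```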
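Elastic scaling can be expressed with a standard HorizontalPodAutoscaler targeting one of the workloads above. The sketch assumes a custom-metrics pipeline that exposes a per-pod queue-depth metric from the engine; the metric name and threshold are assumptions for illustration, not metrics defined by this document.

```yaml
# Scales the StatefulSet from the earlier sketch between 1 and 8 replicas
# based on an assumed per-pod request-queue metric.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: qwen-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: qwen-inference
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_num_requests_waiting   # assumed metric exposed via a custom-metrics adapter
      target:
        type: AverageValue
        averageValue: "5"                 # scale out when the average queue depth exceeds 5
```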
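The inference gateway is built on the Kubernetes Gateway API, so its routes are declared with standard Gateway API resources. The sketch below shows only that generic routing layer with placeholder names; the Inference Extension resources and smart-routing behavior are configured as described in Configure smart routing with an inference gateway for an LLM inference service.

```yaml
# Generic Gateway API route to an inference backend. The Gateway and Service
# names are placeholders; the Inference Extension adds its own resources on top.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: qwen-route
spec:
  parentRefs:
  - name: inference-gateway        # assumed Gateway provisioned for inference traffic
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1                 # OpenAI-style API prefix
    backendRefs:
    - name: qwen-inference         # assumed Service in front of the inference pods
      port: 8000
```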
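Model acceleration with Fluid pairs a Dataset that points at remote model storage with a cache runtime. The sketch below assumes an OSS bucket and a JindoRuntime memory cache; the bucket path, replica count, and cache sizes are placeholders, and the credential and endpoint options are omitted.

```yaml
# Caches model files from a remote bucket on local nodes. Bucket path, cache
# medium, and sizes are placeholders; access credentials are omitted.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: qwen-model
spec:
  mounts:
  - name: qwen-model
    mountPoint: oss://example-bucket/models/Qwen2-7B-Instruct/   # placeholder path
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: qwen-model       # must match the Dataset name
spec:
  replicas: 2            # cache workers
  tieredstore:
    levels:
    - mediumtype: MEM
      path: /dev/shm
      quota: 20Gi        # cache capacity per worker (placeholder)
      high: "0.95"
      low: "0.7"
```

Once the Dataset is bound, Fluid exposes it as a PersistentVolumeClaim with the same name, which the inference pods can mount in place of the placeholder PVC in the StatefulSet sketch.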
Disclaimer
The AI Serving Stack provides deployment and management capabilities for open-source inference engines and their PD separation frameworks. Alibaba Cloud provides technical support for the AI Serving Stack. However, Alibaba Cloud does not provide compensation or other commercial services for business losses caused by defects in the open-source engines or open-source PD separation frameworks.