
Container Service for Kubernetes: AI Serving Stack

Last Updated: Aug 25, 2025

As large language models (LLMs) become more prevalent, deploying and managing them efficiently, reliably, and at scale in production is a major challenge for businesses. The Cloud-native AI Serving Stack is an end-to-end solution built on Container Service for Kubernetes and designed specifically for cloud-native AI inference. The stack addresses the entire lifecycle of LLM inference and provides integrated features such as deployment management, smart routing, automatic scaling, and deep observability. The Cloud-native AI Serving Stack helps you manage complex cloud-native AI inference scenarios, whether you are just starting or running large-scale AI operations.


Core features

The Cloud-native AI Serving Stack makes running LLM inference services on Kubernetes easier and more efficient. It uses innovative workload designs, fine-grained scaling, deep observability, and powerful extension mechanisms. The AI Serving Stack has the following core features.


Supports single-node LLM inference

You can use a StatefulSet to deploy an LLM inference service. This supports single-node, single-GPU and single-node, multi-GPU deployments.

References: Deploy a single-node LLM inference service
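For orientation, the following is a minimal sketch of such a StatefulSet, assuming a vLLM container image, a model stored on a PersistentVolumeClaim, and one GPU per replica. The names, image, model path, and port are illustrative placeholders; follow the referenced topic for the exact manifest.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: llm-single-node                  # illustrative name
spec:
  serviceName: llm-single-node
  replicas: 1
  selector:
    matchLabels:
      app: llm-single-node
  template:
    metadata:
      labels:
        app: llm-single-node
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest   # placeholder image; use the image from the referenced topic
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args: ["--model", "/models/my-model", "--tensor-parallel-size", "1"]  # single GPU; raise for single-node multi-GPU
        ports:
        - name: http
          containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1            # one GPU per replica; adjust for multi-GPU
        volumeMounts:
        - name: model
          mountPath: /models
      volumes:
      - name: model
        persistentVolumeClaim:
          claimName: llm-model-pvc       # assumes a PVC that already holds the model files
```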

Supports multi-node distributed LLM inference

You can use a LeaderWorkerSet to deploy a multi-node, multi-GPU distributed inference service.
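As a rough illustration of the LeaderWorkerSet shape, the sketch below defines one group of two pods (a leader and a worker), each with one GPU. The image and launch scripts are placeholders only; multi-node engines need engine-specific launch arguments, which the dedicated deployment topics cover.

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: llm-distributed                # illustrative name
spec:
  replicas: 1                          # number of leader-worker groups
  leaderWorkerTemplate:
    size: 2                            # pods per group: 1 leader + 1 worker
    leaderTemplate:
      spec:
        containers:
        - name: leader
          image: vllm/vllm-openai:latest              # placeholder image
          command: ["sh", "-c", "./start-leader.sh"]  # placeholder launch script
          resources:
            limits:
              nvidia.com/gpu: 1
    workerTemplate:
      spec:
        containers:
        - name: worker
          image: vllm/vllm-openai:latest              # placeholder image
          command: ["sh", "-c", "./start-worker.sh"]  # placeholder launch script
          resources:
            limits:
              nvidia.com/gpu: 1
```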

Supports prefill-decode (PD) separation deployment for various inference engines

Different inference engines implement PD separation using various architectures and deployment methods. The AI Serving Stack uses RoleBasedGroup as a unified workload to deploy these PD separation architectures. A minimal sketch follows the references below.

References:

  • Deploy an SGLang PD separation inference service

  • Deploy a Dynamo PD separation inference service
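As a purely illustrative sketch of the RoleBasedGroup idea, the example below declares separate prefill and decode roles that can be sized independently. The apiVersion, field names, images, and role names are assumptions for illustration, not the actual RoleBasedGroup schema; use the referenced topics for working manifests.

```yaml
# Hypothetical RoleBasedGroup layout; apiVersion and fields are assumed for illustration.
apiVersion: workloads.x-k8s.io/v1alpha1   # assumed API group
kind: RoleBasedGroup
metadata:
  name: sglang-pd                         # illustrative name
spec:
  roles:
  - name: prefill                         # prefill role: processes prompts
    replicas: 2
    template:
      spec:
        containers:
        - name: sglang-prefill
          image: lmsysorg/sglang:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1
  - name: decode                          # decode role: generates tokens
    replicas: 4
    template:
      spec:
        containers:
        - name: sglang-decode
          image: lmsysorg/sglang:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1
```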

Elastic scaling

Balancing cost and performance is crucial for LLM services. The AI Serving Stack provides multi-dimensional, multi-layer automatic scaling capabilities. An example autoscaling configuration follows the references below.

  • General elastic support: The stack deeply integrates and optimizes standard scaling mechanisms, such as the Horizontal Pod Autoscaler (HPA), Kubernetes Event-driven Autoscaling (KEDA), and the Knative Pod Autoscaler (KPA), to meet the needs of different scenarios.

  • Smart scaling for PD separation: The stack uniquely supports independent scaling for specific roles in a RoleBasedGroup (RBG). For example, you can dynamically scale the "Prefill" role based on inference engine metrics, such as request queue length, while keeping the "Scheduler" role stable. This enables fine-grained resource allocation.

References:

  • Configure elastic scaling for single-node or multi-node inference

  • Configure an automatic scaling policy for a PD separation inference service
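As one hedged example of metric-driven scaling under the general elastic support described above, the KEDA ScaledObject below scales an inference workload on the vLLM request queue length exposed through Prometheus. The workload name, Prometheus address, and threshold are placeholders; the referenced topics describe the configurations that the stack actually supports.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-queue-scaler                 # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: llm-single-node                # placeholder inference workload to scale
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090   # placeholder Prometheus endpoint
      query: sum(vllm:num_requests_waiting)              # vLLM metric: requests waiting in the queue
      threshold: "5"                                     # target number of waiting requests per replica
```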

Observability

The black-box nature of the inference process is a major obstacle to performance optimization. The AI Serving Stack provides a ready-to-use, in-depth observability solution. A sample metrics scrape configuration follows the reference below.

  • Core engine monitoring: For mainstream inference engines, such as vLLM and SGLang, the stack provides pre-built metrics dashboards. These dashboards cover key metrics such as token throughput, request latency, GPU utilization, and KV cache hit rate.

  • Fast problem identification: The intuitive monitoring views help developers quickly locate performance bottlenecks and make informed optimization decisions.

References: Configure monitoring for an LLM inference service
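As a small illustration of how engine metrics can be scraped when a Prometheus-compatible collector that understands the monitoring.coreos.com CRDs is installed, the PodMonitor below targets the inference pods and pulls their /metrics endpoint, which vLLM (and SGLang, with metrics enabled) expose. The label selector and port name are placeholders; the referenced topic covers the monitoring setup and dashboards provided by the stack.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: vllm-metrics                 # illustrative name
spec:
  selector:
    matchLabels:
      app: llm-single-node           # placeholder label that matches the inference pods
  podMetricsEndpoints:
  - port: http                       # assumed name of the container port that serves metrics
    path: /metrics                   # Prometheus metrics endpoint of the inference engine
    interval: 15s
```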

Inference gateway

The ACK Gateway with Inference Extension component is built on the Kubernetes Gateway API and its Inference Extension specification. It provides Kubernetes Layer 4 and Layer 7 routing and adds a series of enhanced capabilities for generative AI inference scenarios. The component simplifies the management of generative AI inference services and optimizes load balancing across multiple inference workloads.

References: Configure smart routing with an inference gateway for an LLM inference service
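For a rough idea of the Gateway API Inference Extension resources involved, the sketch below defines an InferencePool over the inference pods and an HTTPRoute that sends traffic to it. The API versions, the endpoint-picker extension name, and the Gateway name are assumptions based on the upstream specification; the referenced topic documents the exact resources used by the ACK component.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2   # assumed version of the Inference Extension CRDs
kind: InferencePool
metadata:
  name: llm-pool                      # illustrative name
spec:
  selector:
    app: llm-single-node              # placeholder label that selects the inference pods
  targetPortNumber: 8000              # port that serves inference requests
  extensionRef:
    name: llm-endpoint-picker         # placeholder endpoint-picker extension service
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway           # placeholder Gateway name
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: llm-pool
```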

Model acceleration

In AI inference scenarios, slow loading of large model files leads to long application cold starts and sluggish elastic scaling. Fluid builds a distributed cache that stores remote model files on local nodes, which enables fast startup, zero redundancy, and high elasticity.

References: Best practices for Fluid data cache optimization policies
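As a minimal sketch of the Fluid approach, the example below declares a Dataset backed by an OSS path that holds the model files and a JindoRuntime that caches them on the nodes. The bucket path and cache size are placeholders, and access credentials and endpoint options are omitted; see the referenced best practices for production settings.

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: llm-model                     # illustrative name
spec:
  mounts:
  - name: model
    mountPoint: oss://example-bucket/models/   # placeholder OSS path that stores the model files
    # endpoint and credential options are omitted in this sketch
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: llm-model                     # must match the Dataset name
spec:
  replicas: 2                         # number of distributed cache workers
  tieredstore:
    levels:
    - mediumtype: MEM                 # cache in memory; SSD tiers are also possible
      path: /dev/shm
      quota: 20Gi
```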

Performance profiling

For deeper performance analysis, you can use the AI Profiling tool. It allows developers to collect data from GPU container processes to observe and analyze the performance of online training and inference services without interrupting the service or modifying the code.

  • Non-intrusive design: You can start it with one click. It is safe, reliable, and does not affect online services.

  • Code bottleneck insights: The tool locates performance hot spots at the level of specific CUDA kernels or Python functions, which provides the data needed for deep optimization.

References: AI Profiling

Disclaimer

The AI Serving Stack provides deployment and management capabilities for open-source inference engines and their PD separation frameworks. Alibaba Cloud provides technical support for the AI Serving Stack. However, Alibaba Cloud does not provide compensation or other commercial services for business losses caused by defects in the open-source engines or open-source PD separation frameworks.