
Container Service for Kubernetes: AI Serving Stack

Last Updated: Aug 25, 2025

As large language models (LLMs) become more prevalent, deploying and managing them efficiently, reliably, and at scale in production is a major challenge for businesses. The Cloud-native AI Serving Stack is an end-to-end solution built on Container Service for Kubernetes and designed specifically for cloud-native AI inference. The stack addresses the entire lifecycle of LLM inference and provides integrated features such as deployment management, smart routing, automatic scaling, and deep observability. The Cloud-native AI Serving Stack helps you manage complex cloud-native AI inference scenarios, whether you are just starting or running large-scale AI operations.


Core features

The Cloud-native AI Serving Stack makes running LLM inference services on Kubernetes easier and more efficient. It uses innovative workload designs, fine-grained scaling, deep observability, and powerful extension mechanisms. The AI Serving Stack has the following core features.


Supports single-node LLM inference

You can use a StatefulSet to deploy an LLM inference service. This supports single-node, single-GPU and single-node, multi-GPU deployments.

References: Deploy a single-node LLM inference service
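For orientation, the following is a minimal sketch of such a StatefulSet, assuming a vLLM container image, a model stored on a PersistentVolumeClaim, and one GPU per replica. The names, image, model path, and port are illustrative placeholders; follow the referenced topic for the exact manifest.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: llm-single-node                  # illustrative name
spec:
  serviceName: llm-single-node
  replicas: 1
  selector:
    matchLabels:
      app: llm-single-node
  template:
    metadata:
      labels:
        app: llm-single-node
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest   # placeholder image; use the image from the referenced topic
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args: ["--model", "/models/my-model", "--tensor-parallel-size", "1"]  # single GPU; raise for single-node multi-GPU
        ports:
        - name: http
          containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1            # one GPU per replica; adjust for multi-GPU
        volumeMounts:
        - name: model
          mountPath: /models
      volumes:
      - name: model
        persistentVolumeClaim:
          claimName: llm-model-pvc       # assumes a PVC that already holds the model files
```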

Supports multi-node distributed LLM inference

You can use a LeaderWorkerSet to deploy a multi-node, multi-GPU distributed inference service.
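As a rough illustration of the LeaderWorkerSet shape, the sketch below defines one group of two pods (a leader and a worker), each with one GPU. The image and launch scripts are placeholders only; multi-node engines need engine-specific launch arguments, which the dedicated deployment topics cover.

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: llm-distributed                # illustrative name
spec:
  replicas: 1                          # number of leader-worker groups
  leaderWorkerTemplate:
    size: 2                            # pods per group: 1 leader + 1 worker
    leaderTemplate:
      spec:
        containers:
        - name: leader
          image: vllm/vllm-openai:latest              # placeholder image
          command: ["sh", "-c", "./start-leader.sh"]  # placeholder launch script
          resources:
            limits:
              nvidia.com/gpu: 1
    workerTemplate:
      spec:
        containers:
        - name: worker
          image: vllm/vllm-openai:latest              # placeholder image
          command: ["sh", "-c", "./start-worker.sh"]  # placeholder launch script
          resources:
            limits:
              nvidia.com/gpu: 1
```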

Supports prefill-decode (PD) separation deployment for various inference engines

Different inference engines implement PD separation using various architectures and deployment methods. The AI Serving Stack uses RoleBasedGroup as a unified workload to deploy these PD separation architectures. A minimal sketch follows the references below.

References:

  • Deploy an SGLang PD separation inference service

  • Deploy a Dynamo PD separation inference service
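As a purely illustrative sketch of the RoleBasedGroup idea, the example below declares separate prefill and decode roles that can be sized independently. The apiVersion, field names, images, and role names are assumptions for illustration, not the actual RoleBasedGroup schema; use the referenced topics for working manifests.

```yaml
# Hypothetical RoleBasedGroup layout; apiVersion and fields are assumed for illustration.
apiVersion: workloads.x-k8s.io/v1alpha1   # assumed API group
kind: RoleBasedGroup
metadata:
  name: sglang-pd                         # illustrative name
spec:
  roles:
  - name: prefill                         # prefill role: processes prompts
    replicas: 2
    template:
      spec:
        containers:
        - name: sglang-prefill
          image: lmsysorg/sglang:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1
  - name: decode                          # decode role: generates tokens
    replicas: 4
    template:
      spec:
        containers:
        - name: sglang-decode
          image: lmsysorg/sglang:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1
```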

Elastic scaling

Balancing cost and performance is crucial for LLM services. The AI Serving Stack provides multi-dimensional, multi-layer automatic scaling capabilities. An example autoscaling configuration follows the references below.

  • General elastic support: The stack deeply integrates and optimizes standard scaling mechanisms, such as the Horizontal Pod Autoscaler (HPA), Kubernetes Event-driven Autoscaling (KEDA), and the Knative Pod Autoscaler (KPA), to meet the needs of different scenarios.

  • Smart scaling for PD separation: The stack uniquely supports independent scaling for specific roles in a RoleBasedGroup (RBG). For example, you can dynamically scale the "Prefill" role based on inference engine metrics, such as request queue length, while keeping the "Scheduler" role stable. This enables fine-grained resource allocation.

References:

  • Configure elastic scaling for single-node or multi-node inference

  • Configure an automatic scaling policy for a PD separation inference service
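As one hedged example of metric-driven scaling under the general elastic support described above, the KEDA ScaledObject below scales an inference workload on the vLLM request queue length exposed through Prometheus. The workload name, Prometheus address, and threshold are placeholders; the referenced topics describe the configurations that the stack actually supports.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-queue-scaler                 # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: llm-single-node                # placeholder inference workload to scale
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090   # placeholder Prometheus endpoint
      query: sum(vllm:num_requests_waiting)              # vLLM metric: requests waiting in the queue
      threshold: "5"                                     # target number of waiting requests per replica
```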

Observability

The black-box nature of the inference process is a major obstacle to performance optimization. The AI Serving Stack provides a ready-to-use, in-depth observability solution. A sample metrics scrape configuration follows the reference below.

  • Core engine monitoring: For mainstream inference engines, such as vLLM and SGLang, the stack provides pre-built metrics dashboards. These dashboards cover key metrics such as token throughput, request latency, GPU utilization, and KV cache hit rate.

  • Fast problem identification: The intuitive monitoring views help developers quickly locate performance bottlenecks and make informed optimization decisions.

References: Configure monitoring for an LLM inference service
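As a small illustration of how engine metrics can be scraped when a Prometheus-compatible collector that understands the monitoring.coreos.com CRDs is installed, the PodMonitor below targets the inference pods and pulls their /metrics endpoint, which vLLM (and SGLang, with metrics enabled) expose. The label selector and port name are placeholders; the referenced topic covers the monitoring setup and dashboards provided by the stack.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: vllm-metrics                 # illustrative name
spec:
  selector:
    matchLabels:
      app: llm-single-node           # placeholder label that matches the inference pods
  podMetricsEndpoints:
  - port: http                       # assumed name of the container port that serves metrics
    path: /metrics                   # Prometheus metrics endpoint of the inference engine
    interval: 15s
```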

Inference gateway

The ACK Gateway with Inference Extension component is built on the Kubernetes Gateway API and its Inference Extension specification. It provides Kubernetes Layer 4 and Layer 7 routing and adds a series of enhanced capabilities for generative AI inference scenarios. The component simplifies the management of generative AI inference services and optimizes load balancing across multiple inference workloads.

References: Configure smart routing with an inference gateway for an LLM inference service
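For a rough idea of the Gateway API Inference Extension resources involved, the sketch below defines an InferencePool over the inference pods and an HTTPRoute that sends traffic to it. The API versions, the endpoint-picker extension name, and the Gateway name are assumptions based on the upstream specification; the referenced topic documents the exact resources used by the ACK component.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2   # assumed version of the Inference Extension CRDs
kind: InferencePool
metadata:
  name: llm-pool                      # illustrative name
spec:
  selector:
    app: llm-single-node              # placeholder label that selects the inference pods
  targetPortNumber: 8000              # port that serves inference requests
  extensionRef:
    name: llm-endpoint-picker         # placeholder endpoint-picker extension service
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway           # placeholder Gateway name
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: llm-pool
```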

Model acceleration

In AI inference scenarios, slow loading of large model files leads to long application cold starts and sluggish elastic scaling. Fluid builds a distributed cache that stores remote model files on local nodes, which enables fast startup, zero redundancy, and high elasticity.

References: Best practices for Fluid data cache optimization policies
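As a minimal sketch of the Fluid approach, the example below declares a Dataset backed by an OSS path that holds the model files and a JindoRuntime that caches them on the nodes. The bucket path and cache size are placeholders, and access credentials and endpoint options are omitted; see the referenced best practices for production settings.

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: llm-model                     # illustrative name
spec:
  mounts:
  - name: model
    mountPoint: oss://example-bucket/models/   # placeholder OSS path that stores the model files
    # endpoint and credential options are omitted in this sketch
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: llm-model                     # must match the Dataset name
spec:
  replicas: 2                         # number of distributed cache workers
  tieredstore:
    levels:
    - mediumtype: MEM                 # cache in memory; SSD tiers are also possible
      path: /dev/shm
      quota: 20Gi
```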

Performance profiling

For deeper performance analysis, you can use the AI Profiling tool. It allows developers to collect data from GPU container processes to observe and analyze the performance of online training and inference services without interrupting the service or modifying the code.

  • Non-intrusive design: You can start it with one click. It is safe, reliable, and does not affect online services.

  • Code bottleneck insights: The tool locates performance hot spots at the level of specific CUDA kernels or Python functions, which provides the data needed for deep optimization.

References: AI Profiling

Disclaimer

The AI Serving Stack provides deployment and management capabilities for open-source inference engines and their PD separation frameworks. Alibaba Cloud provides technical support for the AI Serving Stack. However, Alibaba Cloud does not provide compensation or other commercial services for business losses caused by defects in the open-source engines or open-source PD separation frameworks.