
Container Compute Service: Gateway with Inference Extension traffic management and inference service management

Last Updated: Mar 03, 2026

This topic describes the features, implementation principles, and advantages of the Gateway with Inference Extension component.

Features

The Gateway with Inference Extension component is an enhanced component based on the Kubernetes community Gateway API and its Inference Extension specification. It supports Layer 4 and Layer 7 routing in Kubernetes and provides enhanced capabilities for generative AI inference scenarios. This component simplifies the management of generative AI inference services and optimizes load balancing performance across multiple inference service workloads.

Component features

  • Optimized load balancing for model inference services.

  • Model-aware routing: Routes inference requests based on the model name defined in the OpenAI API specification. You can run canary (grayscale) releases across different LoRA models of the same foundation model by name.

  • Model criticality configuration: Prioritizes requests for different models by specifying different criticality levels for each model.
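The two capabilities above are declared on the InferenceModel resource. The following is a minimal sketch, assuming the `inference.networking.x-k8s.io/v1alpha2` API version of the Inference Extension; the model, adapter, and pool names are hypothetical:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-chat                 # hypothetical resource name
spec:
  modelName: qwen-chat            # model name clients send in the OpenAI-style "model" field
  criticality: Critical           # prioritized over Standard and Sheddable models
  poolRef:
    name: qwen-pool               # hypothetical InferencePool serving the foundation model
  targetModels:                   # weighted split between two LoRA adapters (canary release)
  - name: qwen-chat-lora-v1
    weight: 90
  - name: qwen-chat-lora-v2
    weight: 10
```

With this definition, 10% of requests for `qwen-chat` are served by the `qwen-chat-lora-v2` adapter, which lets you validate a new LoRA version on a small share of traffic.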

Resource description

The Gateway with Inference Extension component declares and manages generative AI inference services using the InferencePool and InferenceModel custom resources, which are extensions of the Gateway API:

  • InferencePool: Represents a group of pods that share the same computing configuration, accelerator type, foundation model, and model server. It logically groups and manages AI model service resources. A single InferencePool object can include multiple pods across multiple ACK nodes, which provides scalability and high availability.

  • InferenceModel: Specifies the name of the model served by the model server pods in an InferencePool. The InferenceModel resource also defines the service properties of the model, such as its criticality level. Workloads classified as Critical are processed with higher priority.
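A minimal InferencePool sketch, again assuming the `inference.networking.x-k8s.io/v1alpha2` API version; the label, port, and extension name are hypothetical placeholders:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-pool                 # hypothetical pool name, referenced by InferenceModel poolRef
spec:
  selector:                       # selects the model server pods that belong to this pool
    app: qwen-vllm-server
  targetPortNumber: 8000          # port the model server listens on
  extensionRef:
    name: qwen-endpoint-picker    # endpoint-picker extension that selects a pod per request
```

An HTTPRoute can then reference this InferencePool as a backend, which is how the pool attaches to the Gateway API routing resources shown in the figure below.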

The following figure shows the relationship between the InferencePool and InferenceModel custom resources and the Gateway API resources.

[Figure: InferencePool and InferenceModel in relation to the Gateway API resources]

The following figure shows how the InferencePool and InferenceModel resource definitions of the Gateway with Inference Extension component process inference requests.

[Figure: inference request processing with InferencePool and InferenceModel]

Advantages of inference-extended load balancing

Traditional HTTP routing

For traditional HTTP requests, classic load balancing algorithms can distribute requests evenly among different workloads. However, for large language model (LLM) inference services, the load that each request places on the backend is difficult to predict. During inference, request processing includes two phases:

  • Prefill phase: Processes all prompt tokens in a single pass to encode the input and produce the first output token.

  • Decoding phase: Runs over multiple steps. Each step takes the previously generated tokens as input and produces one new token. A token is the basic unit of LLM text processing and roughly corresponds to a word in the model's output.

Because the number of tokens that each request will generate cannot be known in advance, distributing requests evenly across workloads leads to uneven actual load on the backends.
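To illustrate why even request counts do not imply even load, consider a toy simulation (the token counts are hypothetical) that assigns requests round-robin and sums the decode tokens each worker must produce:

```python
# Toy illustration: round-robin spreads request *counts* evenly,
# but decode length varies per request, so actual load does not.
requests_tokens = [10, 800, 15, 1200, 20, 900]  # hypothetical output-token counts

workers = [0, 0]  # accumulated decode tokens per worker
for i, tokens in enumerate(requests_tokens):
    workers[i % len(workers)] += tokens  # round-robin assignment

print(workers)  # each worker got 3 requests, but the token load is heavily skewed
```

Here both workers receive three requests each, yet one must generate many times more tokens than the other, which is exactly the imbalance a request-count-based algorithm cannot see.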

Inference service routing

Gateway with Inference Extension evaluates the internal state of each inference server using metrics from multiple dimensions, and then balances the load across inference server workloads based on those states. Key metrics include the following:

  • Request queue length (vllm:num_requests_waiting): Represents the number of requests in the queue waiting to be processed by the model server. A shorter queue means new requests are more likely to be processed promptly.

  • GPU cache utilization (vllm:gpu_cache_usage_perc): Represents the utilization percentage of the KV cache, which the model server uses to cache intermediate inference results. Lower utilization indicates that the GPU has enough space to allocate resources to new requests.
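A simplified sketch of how an endpoint picker might combine these two metrics. The metric snapshot and the weighting are hypothetical illustrations, not the component's actual algorithm:

```python
# Hypothetical snapshot of per-pod metrics scraped from vLLM:
#   vllm:num_requests_waiting -> queue length
#   vllm:gpu_cache_usage_perc -> KV cache utilization (0.0-1.0)
pods = {
    "pod-a": {"queue": 2, "kv_cache": 0.35},
    "pod-b": {"queue": 9, "kv_cache": 0.90},
    "pod-c": {"queue": 1, "kv_cache": 0.20},
}

def score(m, queue_weight=1.0, cache_weight=10.0):
    # Lower is better: a short queue and free KV cache space
    # mean the pod can accept a new request promptly.
    return queue_weight * m["queue"] + cache_weight * m["kv_cache"]

best = min(pods, key=lambda name: score(pods[name]))
print(best)  # the pod with the shortest queue and the most free cache wins
```

In this sketch, the request is routed to the pod with the lowest combined score rather than to the next pod in rotation, which is the core difference from traditional load balancing.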

Compared with traditional load balancing algorithms, this approach provides better GPU load consistency across multiple inference service workloads. It significantly reduces the time to first token (TTFT) for LLM inference requests and improves throughput.