Container Service for Kubernetes: Overview of Gateway with Inference Extension

Last Updated: Jun 10, 2025

This topic describes the main features, implementation principles, and advantages of the Gateway with Inference Extension component.

Features

The Gateway with Inference Extension component is built on the Kubernetes community Gateway API and its Inference Extension. ACK Gateway with Inference Extension provides Layer 4 and Layer 7 routing in Kubernetes and adds enhanced capabilities for generative AI inference scenarios. The component simplifies the management of generative AI inference services and optimizes load balancing across multiple inference workloads.

Component features

  • Optimized load balancing for model inference services.

  • Model-aware routing: Routes inference requests based on the model name defined in the OpenAI API specification. You can perform gray-scale (canary) releases across different LoRA models of the same base model by name, as shown in the sketch after this list.

  • Model criticality configuration: Prioritizes requests for different models by assigning each model a criticality level.
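
The following is a minimal sketch of how model-aware routing and criticality might be declared with an InferenceModel resource. The resource names, model names, weights, API version, and the referenced InferencePool are illustrative assumptions rather than values from this topic; adjust them to your own deployment.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2   # assumed upstream API version; may differ by component version
kind: InferenceModel
metadata:
  name: food-review                      # illustrative resource name
spec:
  modelName: food-review                 # model name that clients send in the OpenAI-style "model" field
  criticality: Critical                  # requests for this model are processed with priority
  poolRef:
    name: vllm-llama3-8b-instruct        # illustrative InferencePool that serves the base model
  targetModels:                          # gray-scale split across two LoRA models of the same base model
    - name: food-review-lora-v1
      weight: 90
    - name: food-review-lora-v2
      weight: 10
```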

Resource description

The Gateway with Inference Extension component declares and manages generative AI inference services through the InferencePool and InferenceModel custom resources, which extend the Gateway API:

  • InferencePool: Specifies a group of pods that share the same computing configuration, accelerator type, foundation model, and model server. It logically groups and manages AI model service resources. A single InferencePool object can include multiple pods across multiple ACK nodes, providing scalability and high availability.

  • InferenceModel: Specifies the name of the model served by the model server pods in an InferencePool. The InferenceModel resource also defines the service properties of the model, such as its criticality level. Requests for models classified as Critical are processed with priority. A sketch of an InferencePool and the HTTPRoute that references it is provided after this list.
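
The following is a minimal sketch of an InferencePool and the Gateway API HTTPRoute that forwards traffic to it. All names, the port, the label selector, and the endpoint picker referenced by extensionRef are illustrative assumptions, and the exact API version may vary with the component version.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2   # assumed upstream API version
kind: InferencePool
metadata:
  name: vllm-llama3-8b-instruct          # illustrative pool name
spec:
  targetPortNumber: 8000                 # port on which the model server pods listen
  selector:
    app: vllm-llama3-8b-instruct         # selects the model server pods that form the pool
  extensionRef:
    name: vllm-llama3-8b-instruct-epp    # endpoint picker extension that performs metric-based scheduling
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
    - name: inference-gateway            # illustrative Gateway name
  rules:
    - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool            # the InferencePool, instead of a Service, is the route backend
          name: vllm-llama3-8b-instruct
```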

The following figure shows the association between the InferencePool and InferenceModel custom resources and the Gateway API resources.

[Figure: association between the InferencePool and InferenceModel custom resources and the Gateway API resources]

The following figure illustrates how the Gateway with Inference Extension component uses the InferencePool and InferenceModel resource definitions to process inference requests.

[Figure: how inference requests are processed based on the InferencePool and InferenceModel definitions]

Advantages of load balancing for model inference services

Traditional HTTP routing

For traditional HTTP requests, classic load balancing algorithms can distribute requests evenly among different workloads. However, for large language model (LLM) inference services, the load that each request places on the backend is difficult to predict. During inference, request processing consists of the following two phases:

  • Prefill phase: Encodes the input prompt.

  • Decoding phase: Consists of multiple steps. Each step decodes the previous output and generates a new token, the basic unit of data processed by an LLM, which roughly corresponds to a word in the model output.

Because the number of tokens that each request will generate cannot be determined in advance, distributing requests evenly across workloads still produces uneven actual load on each workload, resulting in load imbalance. For example, two requests with similar prompt lengths may generate 10 and 2,000 output tokens respectively, occupying the model server for very different lengths of time.

Inference service routing

The Gateway with Inference Extension component evaluates the internal state of inference servers through multiple metrics and balances load across inference server workloads based on that state. The following metrics are used (an illustrative sample of the raw metrics follows the list):

  • Request queue length (vllm:num_requests_waiting): Specifies the number of requests waiting to be processed by the model server. The fewer requests in the queue, the more likely new requests are to be processed promptly.

  • GPU cache utilization (vllm:gpu_cache_usage_perc): Specifies the utilization of the KV cache that the model server uses to store intermediate inference results. Lower utilization indicates that the GPU has more free space to allocate to new requests.
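
For reference, a vLLM model server exposes these metrics in Prometheus text format on its metrics endpoint. The sample below is illustrative only; the label set and values depend on your deployment.

```
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="Llama-3-8B-Instruct"} 2.0
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="Llama-3-8B-Instruct"} 0.35
```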

Compared with traditional load balancing algorithms, this approach better ensures consistent GPU load across multiple inference service workloads, significantly reduces time to first token (TTFT) latency for LLM inference requests, and improves overall inference throughput.