This topic describes the features, implementation, and benefits of the Gateway with Inference Extension component.
Capabilities
The Gateway with Inference Extension component enhances the Kubernetes Gateway API with support for the Inference Extension specification. It supports Layer 4 and Layer 7 routing and delivers advanced capabilities for generative AI inference. This component simplifies management of inference services for generative AI and optimizes load balancing across multiple inference service workloads.
Component features
Model-aware routing: You can route inference requests based on the model name defined in the OpenAI API specification. This lets you perform grayscale (canary) releases between different LoRA models of the same foundation model by name.
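As a minimal sketch of model-aware gray release, the InferenceModel resource can split traffic for one client-facing model name across multiple backend LoRA adapters by weight. The sketch below assumes the v1alpha2 API of the Gateway API Inference Extension; the pool and adapter names are placeholders.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: food-review          # placeholder resource name
spec:
  modelName: food-review     # model name clients send in the OpenAI-style request body
  poolRef:
    name: vllm-llama-pool    # placeholder InferencePool name
  targetModels:              # weighted split across LoRA adapters served by the pool
  - name: food-review-lora-v1
    weight: 90
  - name: food-review-lora-v2
    weight: 10
```

With this configuration, roughly 10% of requests that specify the model name food-review are served by the new LoRA adapter, enabling a gradual rollout.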
Model criticality configuration: You can specify the criticality level of different models to prioritize requests.
Resource description
Gateway with Inference Extension declares and manages generative AI inference services using the InferencePool and InferenceModel CustomResourceDefinitions (CRDs), which extend the Gateway API.
InferencePool: Represents a group of pods that share the same computing configuration, accelerator type, foundation model, and model server. It logically groups and manages AI model service resources. A single InferencePool object can contain multiple pods across multiple ACK nodes, providing scalability and high availability.
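The following is a minimal InferencePool sketch, assuming the v1alpha2 API of the Gateway API Inference Extension; the pod labels, port, and extension service name are placeholders.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama-pool      # placeholder resource name
spec:
  selector:                  # selects the model server pods that belong to this pool
    app: vllm-llama
  targetPortNumber: 8000     # port on which the model server pods listen
  extensionRef:
    name: vllm-llama-epp     # placeholder endpoint picker extension service
```

The selector groups all pods that share the same accelerator type, foundation model, and model server, regardless of which node they run on.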
InferenceModel: Specifies the name of a model served by the model server pods in an InferencePool. The InferenceModel resource also defines the service properties of the model, such as its criticality level. Workloads classified as Critical are prioritized.
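A minimal sketch of setting a model's criticality, assuming the v1alpha2 API of the Gateway API Inference Extension; the model and pool names are placeholders.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: chat-model           # placeholder resource name
spec:
  modelName: chat-model      # model name clients send in requests
  criticality: Critical      # requests for this model are prioritized over less critical ones
  poolRef:
    name: vllm-llama-pool    # placeholder InferencePool name
```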
The following figure shows the relationship between the InferencePool and InferenceModel CRDs and Gateway API resources.
The following figure shows how the Gateway with Inference Extension component processes inference requests using the InferencePool and InferenceModel resource definitions.
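To connect these resources to standard Gateway API routing, an HTTPRoute can reference an InferencePool as its backend instead of a Service. The sketch below assumes Gateway API v1 and the Inference Extension v1alpha2 API group; the gateway, route, and pool names are placeholders.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route            # placeholder resource name
spec:
  parentRefs:
  - name: inference-gateway  # placeholder Gateway name
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool    # route to the pool instead of a regular Service
      name: vllm-llama-pool  # placeholder InferencePool name
```

Requests accepted by the Gateway are matched by the HTTPRoute and forwarded to the InferencePool, where the inference extension selects a concrete model server pod based on model name and server state.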
Benefits of inference extension load balancing
Traditional HTTP routing: For traditional HTTP requests, classic load balancing algorithms can evenly distribute requests across workloads. For Large Language Model (LLM) inference services, however, the load that each request places on the backend is difficult to predict. Processing an inference request involves two phases: a prefill phase that processes the input prompt, and a decode phase that generates output tokens one at a time. Because the number of tokens a request will produce cannot be determined in advance, distributing requests evenly across workloads results in inconsistent loads and causes load imbalance.
Inference service routing: The internal state of each inference server is evaluated using metrics from multiple dimensions, such as request queue length and KV cache utilization. Load balancing across inference server workloads is then performed based on this internal state. Compared with traditional load balancing algorithms, this method keeps GPU loads more consistent across inference service workloads, significantly reduces the time to first token (TTFT) for LLM inference requests, and increases their throughput.