All Products
Search
Document Center

Container Compute Service:Manage traffic and inference services with Gateway with Inference Extension

Last Updated:Mar 26, 2026

Gateway with Inference Extension is an ACK component that extends the Kubernetes Gateway API and its Inference Extension to address the load balancing and routing challenges of generative AI inference workloads. It evaluates real-time metrics from inference servers to route each request to the pod best able to handle it—reducing time to first token (TTFT) latency and improving GPU utilization across your cluster.

Features

Gateway with Inference Extension supports Layer 4 and Layer 7 routing in Kubernetes, with the following capabilities built for AI inference:

  • Inference-aware load balancing: Routes requests based on real-time inference server metrics (request queue depth and GPU KV cache utilization), rather than distributing load evenly. This keeps GPU utilization consistent across pods and reduces TTFT latency.

  • Model-aware routing: Routes inference requests by model name, as defined in the OpenAI API specification. For a single base model with multiple LoRA adapters, you can use traffic shifting to direct requests to specific adapters by name.

  • Model criticality: Assigns a criticality level to each model. Requests for models marked Critical are processed with priority over lower-criticality workloads.

Key concepts

Gateway with Inference Extension introduces two custom resources that extend the Kubernetes Gateway API:

Resource Purpose
InferencePool Groups pods that share the same computing configuration, accelerator type, foundation model, and model server. A single InferencePool can span multiple pods across multiple ACK nodes, providing scalability and high availability.
InferenceModel Specifies the model name served by an InferencePool and defines its service properties, including criticality level.

The following diagram shows the relationship between InferencePool, InferenceModel, and the Gateway API resources.

image

How it works

When an inference request arrives, the gateway evaluates the internal state of inference servers through multiple dimensions of metrics and routes the request accordingly. The following metrics are used to assess each pod:

  • Request queue length (vllm:num_requests_waiting): The number of requests waiting to be processed. Pods with shorter queues are more likely to start processing new requests immediately.

  • GPU KV cache utilization (vllm:gpu_cache_usage_perc): The percentage of GPU KV cache used to store intermediate inference results. Lower utilization means the pod has more capacity for new requests.

The gateway selects the pod best able to handle the request based on these metrics and routes the request to that pod.

The following diagram illustrates this request processing flow.

image

Why inference-aware load balancing

Standard HTTP load balancing distributes requests evenly across pods. This works for stateless services, but not for large language model (LLM) inference—where the compute cost of each request is unpredictable.

LLM inference has two phases:

  • Prefill phase: Encodes the input.

  • Decoding phase: Divided into several steps; each step decodes the previous input and outputs a new token (the basic unit of LLM data processing, roughly corresponding to each word output by LLM inference).

Because output length varies and cannot be determined in advance, even distribution leads to uneven GPU load—some pods become bottlenecks while others sit idle.

Inference-aware load balancing routes each request to the pod with the most available capacity, based on real-time queue depth and GPU KV cache utilization. This keeps GPU load consistent across pods, reduces TTFT latency, and improves overall throughput.