This topic describes the features, implementation, and benefits of the Gateway with Inference Extension component.
Capabilities
The Gateway with Inference Extension component enhances the Kubernetes Gateway API with support for the Inference Extension specification. It supports Layer 4 and Layer 7 routing and delivers advanced capabilities for generative AI inference. This component simplifies management of inference services for generative AI and optimizes load balancing across multiple inference service workloads.
Component features
Model-aware routing: You can route inference requests based on the model name defined in the OpenAI API specification. This lets you perform grayscale (canary) releases between different LoRA models of the same foundation model by name.
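As a minimal sketch of model-aware gray release, the InferenceModel resource can split traffic for one client-facing model name across multiple backend LoRA adapters by weight. The sketch below assumes the v1alpha2 API of the Gateway API Inference Extension; the pool and adapter names are placeholders.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: food-review          # placeholder resource name
spec:
  modelName: food-review     # model name clients send in the OpenAI-style request body
  poolRef:
    name: vllm-llama-pool    # placeholder InferencePool name
  targetModels:              # weighted split across LoRA adapters served by the pool
  - name: food-review-lora-v1
    weight: 90
  - name: food-review-lora-v2
    weight: 10
```

With this configuration, roughly 10% of requests that specify the model name food-review are served by the new LoRA adapter, enabling a gradual rollout.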
Model criticality configuration: You can specify the criticality level of different models to prioritize requests.
Resource description
Gateway with Inference Extension declares and manages generative AI inference services using the InferencePool and InferenceModel CustomResourceDefinitions (CRDs), which extend the Gateway API.
InferencePool: Represents a group of pods that share the same computing configuration, accelerator type, foundation model, and model server. It logically groups and manages AI model service resources. A single InferencePool object can contain multiple pods across multiple ACK nodes, providing scalability and high availability.
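The following is a minimal InferencePool sketch, assuming the v1alpha2 API of the Gateway API Inference Extension; the pod labels, port, and extension service name are placeholders.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama-pool      # placeholder resource name
spec:
  selector:                  # selects the model server pods that belong to this pool
    app: vllm-llama
  targetPortNumber: 8000     # port on which the model server pods listen
  extensionRef:
    name: vllm-llama-epp     # placeholder endpoint picker extension service
```

The selector groups all pods that share the same accelerator type, foundation model, and model server, regardless of which node they run on.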
InferenceModel: Specifies the name of a model served by the model server pods in an InferencePool. The InferenceModel resource also defines the service properties of the model, such as its criticality level. Workloads classified as Critical are prioritized.
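A minimal sketch of setting a model's criticality, assuming the v1alpha2 API of the Gateway API Inference Extension; the model and pool names are placeholders.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: chat-model           # placeholder resource name
spec:
  modelName: chat-model      # model name clients send in requests
  criticality: Critical      # requests for this model are prioritized over less critical ones
  poolRef:
    name: vllm-llama-pool    # placeholder InferencePool name
```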
The following figure shows the relationship between the InferencePool and InferenceModel CRDs and Gateway API resources.
The following figure shows how the Gateway with Inference Extension component processes inference requests using the InferencePool and InferenceModel resource definitions.
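To connect these resources to standard Gateway API routing, an HTTPRoute can reference an InferencePool as its backend instead of a Service. The sketch below assumes Gateway API v1 and the Inference Extension v1alpha2 API group; the gateway, route, and pool names are placeholders.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route            # placeholder resource name
spec:
  parentRefs:
  - name: inference-gateway  # placeholder Gateway name
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool    # route to the pool instead of a regular Service
      name: vllm-llama-pool  # placeholder InferencePool name
```

Requests accepted by the Gateway are matched by the HTTPRoute and forwarded to the InferencePool, where the inference extension selects a concrete model server pod based on model name and server state.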
Benefits of inference extension load balancing
Traditional HTTP routing: For traditional HTTP requests, classic load balancing algorithms can evenly distribute requests across workloads. For Large Language Model (LLM) inference services, however, the load that each request places on the backend is difficult to predict. Processing an inference request involves two phases: a prefill phase that processes the input prompt, and a decode phase that generates output tokens one at a time. Because the number of tokens a request will produce cannot be determined in advance, distributing requests evenly across workloads results in inconsistent loads and causes load imbalance.
Inference service routing: The internal state of each inference server is evaluated using metrics from multiple dimensions, such as request queue length and KV cache utilization. Load balancing across inference server workloads is then performed based on this internal state. Compared with traditional load balancing algorithms, this method keeps GPU loads more consistent across inference service workloads, significantly reduces the time to first token (TTFT) for LLM inference requests, and increases their throughput.