This topic describes the features, implementation principles, and advantages of the Gateway with Inference Extension component.
Features
The Gateway with Inference Extension component is an enhanced component based on the Kubernetes community Gateway API and its Inference Extension specification. It supports Layer 4 and Layer 7 routing in Kubernetes and provides enhanced capabilities for generative AI inference scenarios. This component simplifies the management of generative AI inference services and optimizes load balancing performance across multiple inference service workloads.
Component features
Model-aware routing: Routes inference requests based on the model names defined in the OpenAI API specification. You can run canary (grayscale) releases across different LoRA models of the same foundation model by referencing them by name.
Model criticality configuration: Prioritizes requests for different models by specifying different criticality levels for each model.
Resource description
The Gateway with Inference Extension component declares and manages generative AI inference services using the InferencePool and InferenceModel custom resources, which are extensions of the Gateway API:
InferencePool: Represents a group of pods that share the same computing configuration, accelerator type, foundation model, and model server. It logically groups and manages AI model service resources. A single InferencePool object can include multiple pods across multiple ACK nodes, which provides scalability and high availability.
InferenceModel: Specifies the name of the model served by the model server pods in an InferencePool. The InferenceModel resource also defines the service properties of the model, such as its criticality level. Workloads classified as Critical are processed with higher priority.
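The two custom resources can be sketched as follows. This is an illustrative example only: the field names follow the upstream Gateway API Inference Extension v1alpha2 schema, and all resource names, labels, and port numbers are placeholders.

```yaml
# Illustrative InferencePool: groups the model server pods that share the
# same accelerator type and foundation model.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama-pool            # placeholder name
spec:
  targetPortNumber: 8000      # port served by the model server pods
  selector:
    app: llama-server         # placeholder label selecting the pods
---
# Illustrative InferenceModel: declares a model served by the pool and
# its criticality level.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: food-review           # placeholder name
spec:
  modelName: food-review      # model name clients send in OpenAI API requests
  criticality: Critical       # requests for this model get higher priority
  poolRef:
    name: llama-pool          # binds the model to the InferencePool above
```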
The following figure shows the relationship between the InferencePool and InferenceModel custom resources and the Gateway API resources.
The following figure shows how the InferencePool and InferenceModel resource definitions of the Gateway with Inference Extension component process inference requests.
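As a rough sketch of that relationship, an HTTPRoute on the Gateway can forward matching traffic to an InferencePool as a backend, in place of a regular Service. Resource names here are placeholders, and the backendRefs group/kind follow the upstream Inference Extension conventions.

```yaml
# Illustrative HTTPRoute: attaches to a Gateway and routes inference
# traffic to an InferencePool backend instead of a Service.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route             # placeholder name
spec:
  parentRefs:
  - name: inference-gateway   # placeholder Gateway name
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: llama-pool        # placeholder InferencePool name
```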
Advantages of inference-extended load balancing
Traditional HTTP routing: For traditional HTTP requests, classic load balancing algorithms can distribute requests evenly among different workloads. However, for large language model (LLM) inference services, the load that each request places on the backend is difficult to predict. During inference, request processing includes two phases: a prefill phase that processes the input prompt, and a decode phase that generates output tokens one at a time. Because the number of tokens that each request will output cannot be determined in advance, distributing requests evenly across different workloads leads to inconsistent loads and causes a load imbalance.

Inference service routing: The internal state of the inference servers is evaluated using metrics from multiple dimensions, such as request queue length and KV cache utilization. The load is then balanced across multiple inference server workloads based on their internal states. Compared with traditional load balancing algorithms, this approach provides better GPU load consistency across multiple inference service workloads. It significantly reduces the time to first token (TTFT) for LLM inference requests and improves throughput.
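The metric-based selection described above can be illustrated with a minimal sketch. This is not the component's actual implementation; the metric names, weights, and scoring formula are assumptions chosen for illustration.

```python
# Illustrative sketch of metric-aware endpoint selection: pick the model
# server replica with the lowest combined load score. The weights below
# are arbitrary assumptions, not values used by the real component.
from dataclasses import dataclass

@dataclass
class ServerMetrics:
    name: str
    queue_len: int        # requests waiting to be scheduled on this replica
    kv_cache_util: float  # fraction of KV cache in use, 0.0 to 1.0

def pick_endpoint(servers, queue_weight=1.0, cache_weight=10.0):
    """Return the replica with the lowest weighted load score."""
    def score(s: ServerMetrics) -> float:
        return queue_weight * s.queue_len + cache_weight * s.kv_cache_util
    return min(servers, key=score)

servers = [
    ServerMetrics("pod-a", queue_len=4, kv_cache_util=0.9),  # score 13.0
    ServerMetrics("pod-b", queue_len=2, kv_cache_util=0.3),  # score 5.0
]
print(pick_endpoint(servers).name)  # pod-b: shorter queue, cooler KV cache
```

A classic round-robin balancer would alternate between the two replicas regardless of their state; scoring by live metrics instead steers new requests away from replicas whose queues or KV caches are already saturated.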