This topic describes the main features, implementation principles, and advantages of the Gateway with Inference Extension component.
Features
The Gateway with Inference Extension component is built on the Kubernetes community Gateway API and its Inference Extension. ACK Gateway with Inference Extension supports Layer 4 and Layer 7 routing services in Kubernetes and provides enhanced capabilities for generative AI inference scenarios. The component simplifies the management of generative AI inference services and optimizes load balancing across multiple inference service workloads.
Component features
Model-aware routing: Routes inference requests based on the model name defined in the OpenAI API specification. You can also perform gray-scale (canary) traffic shifting by name across different LoRA models that share the same base model, as shown in the example after this list.
Model criticality configuration: Prioritizes requests for different models by assigning each model a criticality level.
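The following InferenceModel manifest is a minimal sketch of how both features can be declared. The API version (inference.networking.x-k8s.io/v1alpha2), the resource name, the pool name, and the LoRA model names are assumptions for illustration and may differ in your component version; the InferenceModel and InferencePool resources are described in the next section.

apiVersion: inference.networking.x-k8s.io/v1alpha2   # assumed upstream API version
kind: InferenceModel
metadata:
  name: food-review                  # hypothetical example name
spec:
  modelName: food-review             # model name that clients send in OpenAI-style requests
  criticality: Critical              # requests for this model are processed with priority
  poolRef:
    name: llama3-8b-pool             # hypothetical InferencePool that serves the base model
  targetModels:                      # gray-scale split across LoRA models of the same base model
    - name: food-review-lora-v1
      weight: 90
    - name: food-review-lora-v2
      weight: 10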
Resource description
The Gateway with Inference Extension component declares and manages generative AI inference services through the InferencePool and InferenceModel custom resources, which extend the Gateway API:
InferencePool: Specifies a group of pods that share the same computing configuration, accelerator type, foundation model, and model server. It logically groups and manages AI model service resources. A single InferencePool object can include multiple pods across multiple ACK nodes, providing scalability and high availability.
InferenceModel: Specifies the name of the model served by the model server pods in an InferencePool. The InferenceModel resource also defines the service properties of the model, such as its criticality level. Workloads classified as Critical are processed with priority.
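As an illustration of these resources, the following InferencePool manifest is a minimal sketch that selects a group of model server pods. The API version, pool name, label, port, and extension name are assumptions and may vary across component versions.

apiVersion: inference.networking.x-k8s.io/v1alpha2   # assumed upstream API version
kind: InferencePool
metadata:
  name: llama3-8b-pool               # hypothetical pool name, referenced by InferenceModel poolRef
spec:
  selector:
    app: llama3-8b-vllm              # label assumed to be carried by the model server pods
  targetPortNumber: 8000             # port on which the model server pods serve inference requests
  extensionRef:
    name: llama3-8b-epp              # endpoint picker extension that performs inference-aware routing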
The following figure shows the association between InferencePool, InferenceModel custom resources, and Gateway API resources.
The following figure illustrates how the InferencePool and InferenceModel resource definitions of the Gateway with Inference Extension component process inference requests.
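To make the association concrete, the following HTTPRoute is a minimal sketch of the common Gateway API pattern of referencing an InferencePool as a route backend. The Gateway name, route name, and path match are assumptions for illustration.

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route                    # hypothetical route name
spec:
  parentRefs:
    - name: inference-gateway        # hypothetical Gateway that receives client traffic
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /                 # forward all requests on this listener to the inference backend
      backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool        # the InferencePool serves as the route backend
          name: llama3-8b-pool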
Advantages of load balancing for model inference services
Traditional HTTP routing: For traditional HTTP requests, classic load balancing algorithms can distribute requests evenly across workloads. For large language model (LLM) inference services, however, the load that each request places on the backend is difficult to predict. During inference, request processing consists of two phases: a prefill phase that processes the input prompt and a decode phase that generates output tokens one by one. Because the number of tokens that a request will output cannot be determined in advance, distributing requests evenly across workloads leads to inconsistent actual loads on each workload and, as a result, load imbalance.
Inference service routing: Evaluates the internal state of inference servers through multiple metrics, such as the request queue length and KV cache utilization, and performs load balancing across inference server workloads based on these states. Compared with traditional load balancing algorithms, this approach better ensures consistent GPU load across inference service workloads, significantly reduces the time to first token (TTFT) of LLM inference requests, and improves the throughput of LLM inference requests.