Traditional load balancing algorithms can evenly distribute standard HTTP requests across backend workloads. For LLM inference services, however, the load generated by each request is unpredictable because prompt and output lengths vary widely. Gateway with Inference Extension is an enhanced component built on the Kubernetes Gateway API and its Inference Extension specification. It uses smart routing to optimize load balancing across multiple inference service workloads, provides load balancing policies for different LLM inference scenarios, and supports request queuing and traffic management for canary releases.
Prerequisites
You have deployed:
Step 1: Configure smart routing
Gateway with Inference Extension provides two smart routing load balancing policies to meet different inference service requirements.
Default policy (least load): Routes each request to the backend with the lowest current load, measured by request queue length and GPU cache utilization.
Prefix Cache Aware Routing: Routes requests that share a prompt prefix to the same inference server Pod to improve the prefix cache hit ratio.
Configure the InferencePool and InferenceModel resources based on your backend inference service's deployment method and your chosen load balancing policy.
Default policy
When the annotations field of the InferencePool is empty, the gateway uses the default smart routing policy. This policy balances load by dynamically routing requests according to the real-time load of the backend inference servers, measured by request queue length and GPU cache utilization.
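As an optional sanity check, you can look at the load signals that the default policy relies on directly on a backend Pod. The following is a minimal sketch for a single-machine vLLM backend; it assumes the Pod carries the alibabacloud.com/inference-workload: vllm-inference label used in the configuration below, that the engine serves Prometheus metrics on port 8000, and that curl is available inside the container. Metric names may differ between vLLM versions.

# Pick one vLLM Pod from the inference workload (label taken from the InferencePool selector below)
POD=$(kubectl get pod -l alibabacloud.com/inference-workload=vllm-inference -o jsonpath='{.items[0].metadata.name}')
# Inspect the two signals the default policy weighs: pending request queue length and GPU KV cache usage
kubectl exec "$POD" -- curl -s http://localhost:8000/metrics | grep -E 'num_requests_waiting|gpu_cache_usage'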
Create an inference_networking.yaml file.
Single-machine vLLM
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Single-machine SGLang
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sgl-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Distributed vLLM
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Distributed SGLang
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sglang-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
SGLang Prefill-Decode (P/D) separation
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference_backend: sglang # Selects both prefill and decode workloads
---
# InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool
apiVersion: inferenceextension.alibabacloud.com/v1alpha1
kind: InferenceTrafficPolicy
metadata:
  name: inference-policy
spec:
  poolRef:
    name: qwen-inference-pool
  modelServerRuntime: sglang # Specifies SGLang as the runtime framework for the backend service
  profile:
    pd: # Specifies that the backend service is deployed in P/D separation mode
      pdRoleLabelName: rolebasedgroup.workloads.x-k8s.io/role # Differentiates between prefill and decode roles in the InferencePool using pod labels
      kvTransfer:
        bootstrapPort: 34000 # The bootstrap port used for KV Cache transfer in the SGLang P/D separation service. This must be consistent with the disaggregation-bootstrap-port parameter specified in the RoleBasedGroup deployment.
Apply the configuration.
kubectl create -f inference_networking.yaml
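Optionally, confirm that the resources were created before moving on. A minimal check using the full resource names of the Inference Extension CRDs (for the SGLang P/D separation variant, check the InferenceTrafficPolicy instead of the InferenceModel):

kubectl get inferencepools.inference.networking.x-k8s.io qwen-inference-pool
kubectl get inferencemodels.inference.networking.x-k8s.io qwen-inference-model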
Prefix Cache Aware Routing
The Prefix Cache Aware Routing policy increases the prefix cache hit ratio and reduces response time by routing requests with common prefixes to the same inference server Pod.
The vLLM (v0.9.2) and SGLang versions used in this document enable the prefix cache by default, so you do not need to redeploy the service to turn it on.
To enable the prefix-aware load balancing policy, add the following annotation to the InferencePool: inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
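If you already applied the InferencePool in Step 1, one option is to add the annotation in place rather than editing and reapplying the YAML. The following is a sketch and assumes the gateway controller reacts to annotation changes on an existing InferencePool; if routing behavior does not change, recreate the resource from the YAML below instead.

kubectl annotate inferencepools.inference.networking.x-k8s.io qwen-inference-pool \
  inference.networking.x-k8s.io/routing-strategy=PREFIX_CACHE --overwrite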
Create a Prefix_Cache.yaml file.
Single-machine vLLM deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Single-machine SGLang deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sgl-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Distributed vLLM deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Distributed SGLang deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sglang-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
SGLang PD disaggregation deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference_backend: sglang # Selects both prefill and decode workloads
---
# InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool
apiVersion: inferenceextension.alibabacloud.com/v1alpha1
kind: InferenceTrafficPolicy
metadata:
  name: inference-policy
spec:
  poolRef:
    name: qwen-inference-pool
  modelServerRuntime: sglang # Specifies SGLang as the runtime framework for the backend service
  profile:
    pd: # Specifies that the backend service is deployed in PD disaggregation mode
      trafficPolicy:
        prefixCache: # Declares the prefix cache load balancing policy
          mode: estimate
        prefillPolicyRef: prefixCache
        decodePolicyRef: prefixCache # Applies prefix-aware load balancing to both prefill and decode
      pdRoleLabelName: rolebasedgroup.workloads.x-k8s.io/role # Differentiates between prefill and decode roles in the InferencePool using pod labels
      kvTransfer:
        bootstrapPort: 34000 # The bootstrap port used for KV Cache transfer in the SGLang PD disaggregation service. This must be consistent with the disaggregation-bootstrap-port parameter specified in the RoleBasedGroup deployment.
Apply the configuration for prefix-aware load balancing.
kubectl create -f Prefix_Cache.yaml
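To confirm that the pool now carries the prefix-cache annotation, read the annotations back:

kubectl get inferencepools.inference.networking.x-k8s.io qwen-inference-pool -o jsonpath='{.metadata.annotations}'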
Step 2: Deploy the gateway
Create a gateway_networking.yaml file.
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: inference-gateway-class
spec:
  controllerName: inference.networking.x-k8s.io/gateway-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: inference-gateway-class
  listeners:
  - name: http-llm
    protocol: HTTP
    port: 8080
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1
    backendRefs:
    - name: qwen-inference-pool
      kind: InferencePool
      group: inference.networking.x-k8s.io
Create the GatewayClass, Gateway, and HTTPRoute resources, which expose the LLM inference service through the gateway on port 8080.
kubectl create -f gateway_networking.yaml
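Before testing, it can help to wait until the gateway has been programmed and assigned an address; otherwise the jsonpath query in Step 3 returns an empty value. A sketch using the standard Gateway API Programmed condition:

# Wait for the gateway data plane to be ready (standard Gateway API condition)
kubectl wait --for=condition=Programmed gateway/inference-gateway --timeout=300s
# Check that the Gateway has an address and that the HTTPRoute was created
kubectl get gateway inference-gateway
kubectl get httproute inference-route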
Step 3: Verify the gateway configuration
Run the following command to retrieve the gateway's external access address:
export GATEWAY_HOST=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
Use curl to test access to the service on port 8080:
curl http://${GATEWAY_HOST}:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/Qwen3-32B",
    "messages": [
      {"role": "user", "content": "Hello, this is a test"}
    ],
    "max_tokens": 50
  }'
Verify the different load balancing policies.
Default policy
The default policy uses smart routing based on request queue length and GPU cache utilization. Verify this by running a stress test on the inference service and observing the Time to First Token (TTFT) and throughput metrics.
For detailed testing methods, see Configure observability metrics and dashboards for LLM services.
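As a quick smoke test (not a replacement for the benchmarking method linked above), you can send a batch of concurrent requests through the gateway and observe per-request latency; the following minimal sketch reuses the request body from Step 3:

# Send 20 concurrent requests and print the total time for each one
for i in $(seq 1 20); do
  curl -s -o /dev/null -w "request ${i}: %{time_total}s\n" \
    http://${GATEWAY_HOST}:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/models/Qwen3-32B", "messages": [{"role": "user", "content": "Hello, this is a test"}], "max_tokens": 50}' &
done
wait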
Prefix Cache Aware Routing
Create test files to verify that the Prefix Cache Aware Routing policy is functioning correctly.
Generate round1.txt:
echo '{"max_tokens":24,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"/models/Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txt
Generate round2.txt:
round2.txt:echo '{"max_tokens":3,"messages":[{"content":"Hi, here's some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you're setting up a fun test. I'm ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"/models/Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txtRun the following commands to perform the test:
curl -X POST ${GATEWAY_HOST}:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
curl -X POST ${GATEWAY_HOST}:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
Check the logs of the Inference Extension Processor to confirm that prefix-aware load balancing is working:
kubectl logs deploy/inference-gateway-ext-proc -n envoy-gateway-system | grep "Request Handled"
If the same Pod name appears in both log entries, Prefix Cache Aware Routing is functioning correctly.
For more information about the testing method and results for Prefix Cache Aware Routing, see Evaluate inference service performance using multi-turn conversation tests.