Traditional load balancing algorithms can evenly distribute standard HTTP requests across backend workloads. For large language model (LLM) inference services, however, the load that each request places on the backend is difficult to predict. Gateway with Inference Extension is an enhanced component built on the Kubernetes Gateway API and its Inference Extension specification. It uses smart routing to improve load balancing across multiple inference service workloads, offers load balancing policies for different LLM inference scenarios, and enables features such as phased releases and inference request queuing.
Prerequisites
You have deployed the Gateway with Inference Extension component.
You have deployed a single-machine LLM inference service or a multi-machine distributed inference service.
Step 1: Configure smart routing for the inference service
Gateway with Inference Extension provides two smart routing load balancing policies to meet different inference service needs.
Load balancing based on request queue length and GPU cache utilization (default policy).
Prefix-aware load balancing policy (Prefix Cache Aware Routing).
You can enable the smart routing feature of the inference gateway by declaring InferencePool and InferenceModel resources for the inference service. Adjust the InferencePool and InferenceModel resource configurations based on the backend deployment method and the selected load balancing policy.
Load balancing based on request queue length and GPU cache utilization
If the InferencePool has no annotations, the smart routing policy based on request queue length and GPU cache utilization is used by default. This policy distributes requests according to the real-time load of each backend inference pod, measured by its request queue length and GPU cache utilization, to balance load across the pool.
Create an inference_networking.yaml file.
Single-machine vLLM deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Single-machine SGLang deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sgl-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Distributed vLLM deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Distributed SGLang deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sglang-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
SGLang PD disaggregation deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference_backend: sglang # Selects both prefill and decode workloads
---
# InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool
apiVersion: inferenceextension.alibabacloud.com/v1alpha1
kind: InferenceTrafficPolicy
metadata:
  name: inference-policy
spec:
  poolRef:
    name: qwen-inference-pool
  modelServerRuntime: sglang # Specifies SGLang as the runtime framework for the backend service
  profile:
    pd: # Specifies that the backend service is deployed in PD disaggregation mode
      pdRoleLabelName: rolebasedgroup.workloads.x-k8s.io/role # Differentiates between prefill and decode roles in the InferencePool using pod labels
      kvTransfer:
        bootstrapPort: 34000 # The bootstrap port used for KV Cache transfer in the SGLang PD disaggregation service. This must be consistent with the disaggregation-bootstrap-port parameter specified in the RoleBasedGroup deployment.
Create the load balancer based on request queue length and GPU cache utilization.
kubectl create -f inference_networking.yaml
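Optionally, confirm that the resources were created before configuring the gateway. A quick check for the non-PD examples above (assuming the Inference Extension CRDs expose the usual singular resource names):
kubectl get inferencepool qwen-inference-pool
kubectl get inferencemodel qwen-inference-model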
Prefix-aware load balancing (Prefix Cache Aware Routing)
The Prefix Cache Aware Routing policy sends requests that share the same prefix content to the same inference server pod whenever possible. If the model server has the automatic prefix cache (APC) feature enabled, this policy can improve the prefix cache hit ratio and reduce response times.
The vLLM v0.9.2 and SGLang versions used in this document enable prefix caching by default, so you do not need to redeploy the service to turn it on.
To enable the prefix-aware load balancing policy, add the following annotation to the InferencePool: inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
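If you already created the InferencePool in Step 1 and only want to switch policies, you can add the annotation in place instead of recreating the resources below. A sketch, using the resource name from the examples in this topic:
kubectl annotate inferencepool qwen-inference-pool \
  inference.networking.x-k8s.io/routing-strategy="PREFIX_CACHE" --overwrite
Whether the gateway picks up the change without recreating the pool depends on the controller's reconciliation, so the declarative Prefix_Cache.yaml approach below remains the reference method.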
Create a Prefix_Cache.yaml file.
Single-machine vLLM deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Single-machine SGLang deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sgl-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Distributed vLLM deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Distributed SGLang deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sglang-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
SGLang PD disaggregation deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference_backend: sglang # Selects both prefill and decode workloads
---
# InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool
apiVersion: inferenceextension.alibabacloud.com/v1alpha1
kind: InferenceTrafficPolicy
metadata:
  name: inference-policy
spec:
  poolRef:
    name: qwen-inference-pool
  modelServerRuntime: sglang # Specifies SGLang as the runtime framework for the backend service
  profile:
    pd: # Specifies that the backend service is deployed in PD disaggregation mode
      trafficPolicy:
        prefixCache: # Declares the prefix cache load balancing policy
          mode: estimate
        prefillPolicyRef: prefixCache
        decodePolicyRef: prefixCache # Applies prefix-aware load balancing to both prefill and decode
      pdRoleLabelName: rolebasedgroup.workloads.x-k8s.io/role # Differentiates between prefill and decode roles in the InferencePool using pod labels
      kvTransfer:
        bootstrapPort: 34000 # The bootstrap port used for KV Cache transfer in the SGLang PD disaggregation service. This must be consistent with the disaggregation-bootstrap-port parameter specified in the RoleBasedGroup deployment.
Create the prefix-aware load balancer.
kubectl create -f Prefix_Cache.yaml
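To confirm which policy is active on the pool, you can read back the routing-strategy annotation (applies to the non-PD examples; the jsonpath expression escapes the dots in the annotation key):
kubectl get inferencepool qwen-inference-pool \
  -o jsonpath='{.metadata.annotations.inference\.networking\.x-k8s\.io/routing-strategy}'
The command should print PREFIX_CACHE if the annotation was applied.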
Step 2: Deploy the gateway
Create a gateway_networking.yaml file.
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: inference-gateway-class
spec:
  controllerName: inference.networking.x-k8s.io/gateway-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: inference-gateway-class
  listeners:
  - name: http-llm
    protocol: HTTP
    port: 8080
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1
    backendRefs:
    - name: qwen-inference-pool
      kind: InferencePool
      group: inference.networking.x-k8s.io
Create the GatewayClass, Gateway, and HTTPRoute resources to configure the LLM inference service route on port 8080.
kubectl create -f gateway_networking.yaml
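Before testing, you can check that the gateway has been accepted and assigned an address, and that the route is bound to it:
kubectl get gateway inference-gateway
kubectl get httproute inference-route
The ADDRESS column of the Gateway should be populated once the underlying load balancer is provisioned.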
Step 3: Verify the inference gateway configuration
Run the following command to obtain the external endpoint of the gateway:
export GATEWAY_HOST=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
Test access to the service on port 8080 using the curl command:
curl http://${GATEWAY_HOST}:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/Qwen3-32B",
    "messages": [
      {"role": "user", "content": "Hello, this is a test"}
    ],
    "max_tokens": 50
  }'
Verify the different load balancing policies.
Verify the load balancing policy based on request queue length and GPU cache utilization
The default policy performs smart routing based on the request queue length and GPU cache utilization. You can observe its behavior by stress testing the inference service and monitoring the time to first token (TTFT) and throughput metrics.
For more information about the specific testing method, see Configure observability metrics and dashboards for LLM services.
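For a quick, informal check before setting up full observability, the following sketch fires a small batch of concurrent requests through the gateway and prints per-request latency (the request count and prompt are arbitrary placeholders):
# Send 20 concurrent chat requests and print the total time for each.
for i in $(seq 1 20); do
  curl -s -o /dev/null -w "request ${i}: %{time_total}s\n" \
    http://${GATEWAY_HOST}:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/models/Qwen3-32B", "messages": [{"role": "user", "content": "Briefly explain load-aware routing."}], "max_tokens": 64}' &
done
wait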
Verify prefix-aware load balancing
Create test files to verify that prefix-aware load balancing is working.
Generate round1.txt:
echo '{"max_tokens":24,"messages":[{"content":"Hi, here's some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"/models/Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txtGenerate round2.txt:
echo '{"max_tokens":3,"messages":[{"content":"Hi, here's some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you're setting up a fun test. I'm ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"/models/Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txtRun the following commands to perform the test:
curl -X POST ${GATEWAY_HOST}:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
curl -X POST ${GATEWAY_HOST}:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
Check the logs of the Inference Extension Processor to confirm that prefix-aware load balancing is working:
kubectl logs deploy/inference-gateway-ext-proc -n envoy-gateway-system | grep "Request Handled"
If you see the same pod name in both log entries, prefix-aware load balancing is working.
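To narrow the output to the two requests you just sent, you can, for example, keep only the most recent matching entries:
kubectl logs deploy/inference-gateway-ext-proc -n envoy-gateway-system | grep "Request Handled" | tail -n 2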
For more information about the specific testing method and results for prefix-aware load balancing, see Evaluate inference service performance using multi-turn conversation tests.