Gateway with Inference Extension uses smart routing to distribute Large Language Model (LLM) inference requests across multiple backend Pods based on real-time load signals rather than simple round-robin. This document explains how to configure routing policies, deploy the gateway, and verify that traffic is flowing correctly.
Prerequisites
Before you begin, ensure that an inference service (vLLM or SGLang) is deployed in your cluster and that the Gateway with Inference Extension components are installed.
Choose a routing policy
Gateway with Inference Extension supports two routing policies. Select the one that matches your use case before writing any YAML.
| Policy | How it works | When to use |
|---|---|---|
| Default (least-load) | Routes each request to the Pod with the shortest request queue and highest GPU cache availability. | General LLM serving — the safe default for most deployments. |
| Prefix Cache Aware Routing | Routes requests that share a common prompt prefix to the same Pod to maximize prefix cache hit rate. | Workloads with repetitive system prompts, multi-turn conversations, or batch jobs using a shared context. |
vLLM v0.9.2 and the SGLang version used in this document have prefix caching enabled by default, so switching to Prefix Cache Aware Routing does not require redeploying the inference service.
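If you want to confirm or control this behavior at the engine level, the launch flags below show how prefix caching is typically toggled; treat the exact flags as assumptions to verify against the vLLM and SGLang versions you actually run.

# vLLM: prefix caching is on by default in recent releases; the flag makes it explicit
vllm serve /models/Qwen3-32B --enable-prefix-caching

# SGLang: RadixAttention prefix caching is on by default; pass this flag only to turn it off
python3 -m sglang.launch_server --model-path /models/Qwen3-32B --disable-radix-cache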
Step 1: Configure smart routing
Configure InferencePool and InferenceModel resources to match your backend deployment type and chosen routing policy.
Default policy
When the InferencePool has no routing-strategy annotation, the gateway applies the default least-load policy. It routes each incoming request to the backend Pod with the lowest current load, measured by request queue length and GPU cache utilization.
Create inference_networking.yaml using the template for your deployment type.
Single-machine vLLM
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000 # Port exposed by the inference service
  selector:
    alibabacloud.com/inference-workload: vllm-inference # Matches vLLM workload Pods
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B # Must match the model name used in inference requests
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Single-machine SGLang
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang # Identifies SGLang as the runtime
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sgl-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Distributed vLLM
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-multi-nodes
    role: leader # Selects only the leader Pod in each distributed group
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Distributed SGLang
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sglang-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
SGLang Prefill-Decode (P/D) separation
For SGLang deployments in P/D separation mode, use InferenceTrafficPolicy instead of annotations on InferencePool to define the disaggregation behavior and KV cache transfer port.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference_backend: sglang # Selects both prefill and decode workload Pods
---
# InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool
apiVersion: inferenceextension.alibabacloud.com/v1alpha1
kind: InferenceTrafficPolicy
metadata:
  name: inference-policy
spec:
  poolRef:
    name: qwen-inference-pool
  modelServerRuntime: sglang
  profile:
    pd: # Specifies that the backend service is deployed in P/D separation mode
      pdRoleLabelName: rolebasedgroup.workloads.x-k8s.io/role # Pod label that distinguishes prefill from decode
      kvTransfer:
        bootstrapPort: 34000 # Must match disaggregation-bootstrap-port in the RoleBasedGroup deployment
Apply the configuration:
kubectl create -f inference_networking.yaml
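Optionally, confirm that the resources were created; the exact output columns depend on the CRD version installed in your cluster. If you used the P/D template, there is no InferenceModel, so check the InferenceTrafficPolicy resource instead.

kubectl get inferencepools.inference.networking.x-k8s.io qwen-inference-pool
kubectl get inferencemodels.inference.networking.x-k8s.io qwen-inference-model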
Prefix Cache Aware Routing
Add the annotation inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE" to the InferencePool. The gateway then routes requests that share a prompt prefix to the same Pod, increasing the prefix cache hit rate and reducing response latency.
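Because the policy is only an annotation, you can also switch an existing pool in place rather than re-creating it. The following is a sketch that assumes the InferencePool from Step 1 already exists in the current namespace:

kubectl annotate inferencepool qwen-inference-pool \
  inference.networking.x-k8s.io/routing-strategy=PREFIX_CACHE --overwrite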
Create Prefix_Cache.yaml using the template for your deployment type.
Single-machine vLLM deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE" # Enables prefix-aware routing
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Single-machine SGLang deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sgl-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Distributed vLLM deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Distributed SGLang deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sglang-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
SGLang P/D disaggregation with prefix cache
For P/D disaggregation, declare the prefix cache policy inside InferenceTrafficPolicy rather than as an InferencePool annotation. The example below applies prefix-aware load balancing to both the prefill and decode stages.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference_backend: sglang # Selects both prefill and decode workload Pods
---
# InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool
apiVersion: inferenceextension.alibabacloud.com/v1alpha1
kind: InferenceTrafficPolicy
metadata:
  name: inference-policy
spec:
  poolRef:
    name: qwen-inference-pool
  modelServerRuntime: sglang
  profile:
    pd: # Enables P/D disaggregation mode
      trafficPolicy:
        prefixCache: # Declares the prefix cache load balancing policy
          mode: estimate
        prefillPolicyRef: prefixCache
        decodePolicyRef: prefixCache # Applies prefix-aware routing to both prefill and decode
      pdRoleLabelName: rolebasedgroup.workloads.x-k8s.io/role
      kvTransfer:
        bootstrapPort: 34000 # Must match disaggregation-bootstrap-port in the RoleBasedGroup deployment
Apply the configuration:
kubectl create -f Prefix_Cache.yaml
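To confirm that the routing strategy is in effect, you can read the annotation back from the live object. The command below should print PREFIX_CACHE; the jsonpath dot escaping may need adjusting for your shell.

kubectl get inferencepool qwen-inference-pool \
  -o jsonpath='{.metadata.annotations.inference\.networking\.x-k8s\.io/routing-strategy}'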
Step 2: Deploy the gateway
Create gateway_networking.yaml with a GatewayClass, Gateway, and HTTPRoute that routes /v1 traffic to the InferencePool on port 8080.
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: inference-gateway-class
spec:
  controllerName: inference.networking.x-k8s.io/gateway-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: inference-gateway-class
  listeners:
  - name: http-llm
    protocol: HTTP
    port: 8080
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1
    backendRefs:
    - name: qwen-inference-pool
      kind: InferencePool
      group: inference.networking.x-k8s.io
Apply the configuration:
kubectl create -f gateway_networking.yaml
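Before moving on, you can wait for the gateway to be programmed and confirm that the route was created; the condition name follows the Gateway API specification.

kubectl wait --for=condition=Programmed gateway/inference-gateway --timeout=120s
kubectl get httproute inference-route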
Step 3: Verify the gateway configuration
Get the gateway address:
export GATEWAY_HOST=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
Send a test request:
curl http://${GATEWAY_HOST}:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/Qwen3-32B",
    "messages": [
      {"role": "user", "content": "Hello, this is a test"}
    ],
    "max_tokens": 50
  }'
A successful response confirms the gateway is routing requests to the inference service.
Verify the default policy
The default policy routes requests based on queue length and GPU cache utilization. To confirm it is working, run a load test against the inference service and observe Time to First Token (TTFT) and throughput. For detailed testing steps, see Configure observability metrics and dashboards for LLM services.
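If you only need a quick sanity check before running a full benchmark, a simple loop of concurrent requests is enough to put load on the pool; this is a minimal illustrative script, not a substitute for the load-testing guide above, and it assumes GATEWAY_HOST is set as in Step 3. Use the observability dashboards to read TTFT and throughput while it runs.

# Fire 20 concurrent chat completions and print per-request status and latency
for i in $(seq 1 20); do
  curl -s -o /dev/null \
    -w "request $i: HTTP %{http_code}, %{time_total}s\n" \
    http://${GATEWAY_HOST}:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"/models/Qwen3-32B","messages":[{"role":"user","content":"Say hello"}],"max_tokens":16}' &
done
wait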
Verify Prefix Cache Aware Routing
Use two sequential requests that share a long prompt prefix. If both requests are routed to the same Pod, prefix-aware routing is functioning correctly.
- Generate round1.txt:

  echo '{"max_tokens":24,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"/models/Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txt

- Generate round2.txt:

  echo '{"max_tokens":3,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you'\''re setting up a fun test. I'\''m ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"/models/Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txt

- Send both requests:

  curl -X POST ${GATEWAY_HOST}:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
  curl -X POST ${GATEWAY_HOST}:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt

- Check the Inference Extension Processor logs:

  kubectl logs deploy/inference-gateway-ext-proc -n envoy-gateway-system | grep "Request Handled"

  If the same Pod name appears in both log entries, Prefix Cache Aware Routing is working correctly.
For more information about testing methodology and results, see Evaluate inference service performance using multi-turn conversation tests.