Traditional load balancing algorithms can evenly distribute standard HTTP requests across backend workloads. For LLM inference services, however, the load generated by each request is unpredictable because prompt and output lengths vary widely. Gateway with Inference Extension is an enhanced component built on the Kubernetes Gateway API and its Inference Extension specification. It uses smart routing to optimize load balancing across multiple inference service workloads, provides load balancing policies for different LLM inference scenarios, and supports request queuing and traffic management for canary releases.
Prerequisites
You have deployed:
Step 1: Configure smart routing
Gateway with Inference Extension provides two smart routing load balancing policies to meet different inference service requirements.
Default policy (least load): Routes each request to the backend with the lowest current load, measured by request queue length and GPU cache utilization.
Prefix Cache Aware Routing: Routes requests that share a prompt prefix to the same inference server Pod to improve the prefix cache hit ratio.
Configure the InferencePool and InferenceModel resources based on your backend inference service's deployment method and your chosen load balancing policy.
Default policy
When the annotations field of the InferencePool is empty, the gateway uses the default smart routing policy. This policy balances load by dynamically routing requests according to the real-time load of the backend inference servers, measured by request queue length and GPU cache utilization.
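As an optional sanity check, you can look at the load signals that the default policy relies on directly on a backend Pod. The following is a minimal sketch for a single-machine vLLM backend; it assumes the Pod carries the alibabacloud.com/inference-workload: vllm-inference label used in the configuration below, that the engine serves Prometheus metrics on port 8000, and that curl is available inside the container. Metric names may differ between vLLM versions.

# Pick one vLLM Pod from the inference workload (label taken from the InferencePool selector below)
POD=$(kubectl get pod -l alibabacloud.com/inference-workload=vllm-inference -o jsonpath='{.items[0].metadata.name}')
# Inspect the two signals the default policy weighs: pending request queue length and GPU KV cache usage
kubectl exec "$POD" -- curl -s http://localhost:8000/metrics | grep -E 'num_requests_waiting|gpu_cache_usage'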
Create an inference_networking.yaml file.
Single-machine vLLM
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Single-machine SGLang
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sgl-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Distributed vLLM
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Distributed SGLang
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sglang-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
SGLang Prefill-Decode (P/D) separation
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference_backend: sglang # Selects both prefill and decode workloads
---
# InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool
apiVersion: inferenceextension.alibabacloud.com/v1alpha1
kind: InferenceTrafficPolicy
metadata:
  name: inference-policy
spec:
  poolRef:
    name: qwen-inference-pool
  modelServerRuntime: sglang # Specifies SGLang as the runtime framework for the backend service
  profile:
    pd: # Specifies that the backend service is deployed in P/D separation mode
      pdRoleLabelName: rolebasedgroup.workloads.x-k8s.io/role # Differentiates between prefill and decode roles in the InferencePool using pod labels
      kvTransfer:
        bootstrapPort: 34000 # The bootstrap port used for KV Cache transfer in the SGLang P/D separation service. This must be consistent with the disaggregation-bootstrap-port parameter specified in the RoleBasedGroup deployment.
Apply the configuration.
kubectl create -f inference_networking.yaml
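Optionally, confirm that the resources were created before moving on. A minimal check using the full resource names of the Inference Extension CRDs (for the SGLang P/D separation variant, check the InferenceTrafficPolicy instead of the InferenceModel):

kubectl get inferencepools.inference.networking.x-k8s.io qwen-inference-pool
kubectl get inferencemodels.inference.networking.x-k8s.io qwen-inference-model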
Prefix Cache Aware Routing
The Prefix Cache Aware Routing policy increases the prefix cache hit ratio and reduces response time by routing requests with common prefixes to the same inference server Pod.
The vLLM (v0.9.2) and SGLang versions used in this document enable the prefix cache by default, so you do not need to redeploy the service to turn it on.
To enable the prefix-aware load balancing policy, add the following annotation to the InferencePool: inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
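If you already applied the InferencePool in Step 1, one option is to add the annotation in place rather than editing and reapplying the YAML. The following is a sketch and assumes the gateway controller reacts to annotation changes on an existing InferencePool; if routing behavior does not change, recreate the resource from the YAML below instead.

kubectl annotate inferencepools.inference.networking.x-k8s.io qwen-inference-pool \
  inference.networking.x-k8s.io/routing-strategy=PREFIX_CACHE --overwrite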
Create a Prefix_Cache.yaml file.
Single-machine vLLM deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Single-machine SGLang deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sgl-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Distributed vLLM deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
Distributed SGLang deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sglang-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
SGLang PD disaggregation deployment
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference_backend: sglang # Selects both prefill and decode workloads
---
# InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool
apiVersion: inferenceextension.alibabacloud.com/v1alpha1
kind: InferenceTrafficPolicy
metadata:
  name: inference-policy
spec:
  poolRef:
    name: qwen-inference-pool
  modelServerRuntime: sglang # Specifies SGLang as the runtime framework for the backend service
  profile:
    pd: # Specifies that the backend service is deployed in PD disaggregation mode
      trafficPolicy:
        prefixCache: # Declares the prefix cache load balancing policy
          mode: estimate
        prefillPolicyRef: prefixCache
        decodePolicyRef: prefixCache # Applies prefix-aware load balancing to both prefill and decode
      pdRoleLabelName: rolebasedgroup.workloads.x-k8s.io/role # Differentiates between prefill and decode roles in the InferencePool using pod labels
      kvTransfer:
        bootstrapPort: 34000 # The bootstrap port used for KV Cache transfer in the SGLang PD disaggregation service. This must be consistent with the disaggregation-bootstrap-port parameter specified in the RoleBasedGroup deployment.
Apply the configuration for prefix-aware load balancing.
kubectl create -f Prefix_Cache.yaml
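To confirm that the pool now carries the prefix-cache annotation, read the annotations back:

kubectl get inferencepools.inference.networking.x-k8s.io qwen-inference-pool -o jsonpath='{.metadata.annotations}'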
Step 2: Deploy the gateway
Create a gateway_networking.yaml file.
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: inference-gateway-class
spec:
  controllerName: inference.networking.x-k8s.io/gateway-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: inference-gateway-class
  listeners:
  - name: http-llm
    protocol: HTTP
    port: 8080
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1
    backendRefs:
    - name: qwen-inference-pool
      kind: InferencePool
      group: inference.networking.x-k8s.io
Create the GatewayClass, Gateway, and HTTPRoute resources, which expose the LLM inference service through the gateway on port 8080.
kubectl create -f gateway_networking.yaml
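Before testing, it can help to wait until the gateway has been programmed and assigned an address; otherwise the jsonpath query in Step 3 returns an empty value. A sketch using the standard Gateway API Programmed condition:

# Wait for the gateway data plane to be ready (standard Gateway API condition)
kubectl wait --for=condition=Programmed gateway/inference-gateway --timeout=300s
# Check that the Gateway has an address and that the HTTPRoute was created
kubectl get gateway inference-gateway
kubectl get httproute inference-route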
Step 3: Verify the gateway configuration
Run the following command to retrieve the gateway's external access address:
export GATEWAY_HOST=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
Use curl to test access to the service on port 8080:
curl http://${GATEWAY_HOST}:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/Qwen3-32B",
    "messages": [
      {"role": "user", "content": "Hello, this is a test"}
    ],
    "max_tokens": 50
  }'
Verify the different load balancing policies.
Default policy
The default policy uses smart routing based on request queue length and GPU cache utilization. Verify this by running a stress test on the inference service and observing the Time to First Token (TTFT) and throughput metrics.
For detailed testing methods, see Configure observability metrics and dashboards for LLM services.
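As a quick smoke test (not a replacement for the benchmarking method linked above), you can send a batch of concurrent requests through the gateway and observe per-request latency; the following minimal sketch reuses the request body from Step 3:

# Send 20 concurrent requests and print the total time for each one
for i in $(seq 1 20); do
  curl -s -o /dev/null -w "request ${i}: %{time_total}s\n" \
    http://${GATEWAY_HOST}:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/models/Qwen3-32B", "messages": [{"role": "user", "content": "Hello, this is a test"}], "max_tokens": 50}' &
done
wait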
Prefix Cache Aware Routing
Create test files to verify that the Prefix Cache Aware Routing policy is functioning correctly.
Generate round1.txt:
echo '{"max_tokens":24,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"/models/Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txt
Generate round2.txt:
round2.txt:echo '{"max_tokens":3,"messages":[{"content":"Hi, here's some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you're setting up a fun test. I'm ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"/models/Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txtRun the following commands to perform the test:
curl -X POST ${GATEWAY_HOST}:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
curl -X POST ${GATEWAY_HOST}:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
Check the logs of the Inference Extension Processor to confirm that prefix-aware load balancing is working:
kubectl logs deploy/inference-gateway-ext-proc -n envoy-gateway-system | grep "Request Handled"
If the same Pod name appears in both log entries, Prefix Cache Aware Routing is functioning correctly.
For more information about the testing method and results for Prefix Cache Aware Routing, see Evaluate inference service performance using multi-turn conversation tests.