
Container Service for Kubernetes: Configure smart routing for an LLM inference service using Gateway with Inference Extension

Last Updated: Mar 26, 2026

Gateway with Inference Extension uses smart routing to distribute Large Language Model (LLM) inference requests across multiple backend Pods based on real-time load signals, rather than simple round-robin. This document explains how to configure routing policies, deploy the gateway, and verify that traffic is flowing correctly.

Prerequisites

Before you begin, ensure that you have:

- An ACK cluster with the Gateway with Inference Extension component installed.
- An LLM inference service (vLLM or SGLang) deployed in the cluster, in single-machine, distributed, or Prefill-Decode (P/D) separation mode, with Pod labels that match the selectors used in this document.
- A kubectl client configured to access the cluster.

Choose a routing policy

Gateway with Inference Extension supports two routing policies. Select the one that matches your use case before writing any YAML.

Default (least-load)
How it works: Routes each request to the Pod with the shortest request queue and the highest GPU cache availability.
When to use: General LLM serving; the safe default for most deployments.

Prefix Cache Aware Routing
How it works: Routes requests that share a common prompt prefix to the same Pod to maximize the prefix cache hit rate.
When to use: Workloads with repetitive system prompts, multi-turn conversations, or batch jobs that share a common context.
Important

vLLM v0.9.2 and the SGLang version used in this document enable prefix caching by default, so switching to Prefix Cache Aware Routing does not require redeploying the inference service.

Step 1: Configure smart routing

Configure InferencePool and InferenceModel resources to match your backend deployment type and chosen routing policy.

Default policy

When the InferencePool has no routing-strategy annotation, the gateway applies the default least-load policy. It routes each incoming request to the backend Pod with the lowest current load, measured by request queue length and GPU cache utilization.

Create inference_networking.yaml using the template for your deployment type.

Single-machine vLLM deployment

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000           # Port exposed by the inference service
  selector:
    alibabacloud.com/inference-workload: vllm-inference  # Matches vLLM workload Pods
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B    # Must match the model name used in inference requests
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100

Single-machine SGLang deployment

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang  # Identifies SGLang as the runtime
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sgl-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100

Distributed vLLM deployment

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-multi-nodes
    role: leader                    # Selects only the leader Pod in each distributed group
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100

Distributed SGLang deployment

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sglang-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100

SGLang Prefill-Decode (P/D) separation

For SGLang deployments in P/D separation mode, use an InferenceTrafficPolicy resource instead of annotations on the InferencePool to define the disaggregation behavior and the KV cache transfer port.

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference_backend: sglang  # Selects both prefill and decode workload Pods
---
# InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool
apiVersion: inferenceextension.alibabacloud.com/v1alpha1
kind: InferenceTrafficPolicy
metadata:
  name: inference-policy
spec:
  poolRef:
    name: qwen-inference-pool
  modelServerRuntime: sglang
  profile:
    pd:  # Specifies that the backend service is deployed in P/D separation mode
      pdRoleLabelName: rolebasedgroup.workloads.x-k8s.io/role  # Pod label that distinguishes prefill from decode
      kvTransfer:
        bootstrapPort: 34000  # Must match disaggregation-bootstrap-port in the RoleBasedGroup deployment
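
Before applying the manifest, it can help to cross-check the bootstrapPort value against the SGLang launch flags, since a mismatch can silently break KV cache transfer. A quick check, assuming the Pod label from the selector above and that the flag is passed in the Pod spec (it may appear as a single --disaggregation-bootstrap-port=34000 argument or be split across two list items, hence the -A 1):

kubectl get pods -l alibabacloud.com/inference_backend=sglang -o yaml | grep -A 1 disaggregation-bootstrap-port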

Apply the configuration:

kubectl create -f inference_networking.yaml
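
To confirm that the resources were created, you can list them by name (names taken from the manifests above):

kubectl get inferencepool qwen-inference-pool
kubectl get inferencemodel qwen-inference-model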

Prefix Cache Aware Routing

Add the annotation inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE" to the InferencePool. The gateway then routes requests that share a prompt prefix to the same Pod, increasing the prefix cache hit rate and reducing response latency.
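
If the default-policy InferencePool already exists, you can also switch it in place instead of re-applying a full manifest. A one-liner, assuming the resource name used throughout this document:

kubectl annotate inferencepool qwen-inference-pool \
  inference.networking.x-k8s.io/routing-strategy="PREFIX_CACHE" --overwrite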

Create Prefix_Cache.yaml using the template for your deployment type.

Single-machine vLLM deployment

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"  # Enables prefix-aware routing
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100

Single-machine SGLang deployment

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sgl-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100

Distributed vLLM deployment

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100

Distributed SGLang deployment

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sglang-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100

SGLang P/D disaggregation with prefix cache

For P/D disaggregation, declare the prefix cache policy inside InferenceTrafficPolicy rather than as an InferencePool annotation. The example below applies prefix-aware load balancing to both the prefill and decode stages.

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference_backend: sglang  # Selects both prefill and decode workload Pods
---
# InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool
apiVersion: inferenceextension.alibabacloud.com/v1alpha1
kind: InferenceTrafficPolicy
metadata:
  name: inference-policy
spec:
  poolRef:
    name: qwen-inference-pool
  modelServerRuntime: sglang
  profile:
    pd:  # Enables P/D disaggregation mode
      trafficPolicy:
        prefixCache:       # Declares the prefix cache load balancing policy
          mode: estimate
      prefillPolicyRef: prefixCache
      decodePolicyRef: prefixCache   # Applies prefix-aware routing to both prefill and decode
      pdRoleLabelName: rolebasedgroup.workloads.x-k8s.io/role
      kvTransfer:
        bootstrapPort: 34000  # Must match disaggregation-bootstrap-port in the RoleBasedGroup deployment

Apply the configuration:

kubectl create -f Prefix_Cache.yaml

Note

The manifests in this section reuse the resource names from the default-policy section. If those resources already exist, run kubectl apply -f Prefix_Cache.yaml instead, or delete the existing resources first; kubectl create fails when a resource with the same name already exists.
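
To confirm that the annotation was applied (for the annotation-based variants; the P/D variant carries the policy in InferenceTrafficPolicy instead):

kubectl get inferencepool qwen-inference-pool -o yaml | grep routing-strategy

The output should contain inference.networking.x-k8s.io/routing-strategy: PREFIX_CACHE.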

Key parameters

InferencePool

- metadata.annotations.inference.networking.x-k8s.io/model-server-runtime (string): Model server runtime. Example: sglang. Default: none.
- metadata.annotations.inference.networking.x-k8s.io/routing-strategy (string): Routing policy. Valid values: DEFAULT, PREFIX_CACHE. Default: DEFAULT.
- spec.targetPortNumber (int): Port number of the inference service. Default: none.
- spec.selector (map[string]string): Label selector that matches the inference service Pods. Default: none.
- spec.extensionRef (ObjectReference): Reference to the inference extension service. Default: none.

InferenceModel

- spec.modelName (string): Model name used for route matching. Default: none.
- spec.criticality (string): Criticality level of the model. Valid values: Critical, Standard. Default: none.
- spec.poolRef (PoolReference): The associated InferencePool resource. Default: none.

Step 2: Deploy the gateway

Create gateway_networking.yaml with a GatewayClass, Gateway, and HTTPRoute that routes /v1 traffic to the InferencePool on port 8080.

apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: inference-gateway-class
spec:
  controllerName: inference.networking.x-k8s.io/gateway-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: inference-gateway-class
  listeners:
  - name: http-llm
    protocol: HTTP
    port: 8080
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1
    backendRefs:
    - name: qwen-inference-pool
      kind: InferencePool
      group: inference.networking.x-k8s.io

Apply the configuration:

kubectl create -f gateway_networking.yaml
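
Before sending traffic, check that the gateway has been accepted by the controller and assigned an address:

kubectl get gateway inference-gateway

The PROGRAMMED column should show True, and the ADDRESS column should show the address used in the next step.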

Step 3: Verify the gateway configuration

Get the gateway address:

export GATEWAY_HOST=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
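
You can verify that the variable is set before proceeding:

echo ${GATEWAY_HOST}

If the output is empty, the gateway has not been assigned an address yet; re-run the status check from Step 2 and wait for the ADDRESS column to be populated.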

Send a test request:

curl http://${GATEWAY_HOST}:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/Qwen3-32B",
    "messages": [
      {"role": "user", "content": "Hello, this is a test"}
    ],
    "max_tokens": 50
  }'

A successful response confirms the gateway is routing requests to the inference service.
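
Both vLLM and SGLang expose an OpenAI-compatible chat completions API, so the response should resemble the following. This shape is illustrative only; IDs, content, and token counts will differ:

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "/models/Qwen3-32B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 13,
    "completion_tokens": 10,
    "total_tokens": 23
  }
}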

Verify the default policy

The default policy routes requests based on queue length and GPU cache utilization. To confirm it is working, run a load test against the inference service and observe Time to First Token (TTFT) and throughput. For detailed testing steps, see Configure observability metrics and dashboards for LLM services.
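
If you do not have a load testing tool at hand, a rough smoke test can be driven from the shell. The sketch below sends 20 concurrent requests through the gateway (endpoint and model name reused from the test request above); for meaningful TTFT and throughput measurements, use the observability setup linked above instead:

for i in $(seq 1 20); do
  # fire each request in the background so they run concurrently
  curl -s -o /dev/null http://${GATEWAY_HOST}:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/models/Qwen3-32B", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}' &
done
wait  # block until all background requests finish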

Verify Prefix Cache Aware Routing

Use two sequential requests that share a long prompt prefix. If both requests are routed to the same Pod, prefix-aware routing is functioning correctly.

  1. Generate round1.txt:

    echo '{"max_tokens":24,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"/models/Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txt
  2. Generate round2.txt:

    echo '{"max_tokens":3,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you'\''re setting up a fun test. I'\''m ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"/models/Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txt
  3. Send both requests:

    curl -X POST http://${GATEWAY_HOST}:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
    curl -X POST http://${GATEWAY_HOST}:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
  4. Check the Inference Extension Processor logs:

    kubectl logs deploy/inference-gateway-ext-proc -n envoy-gateway-system | grep "Request Handled"

    If the same Pod name appears in both log entries, Prefix Cache Aware Routing is working correctly.

For more information about testing methodology and results, see Evaluate inference service performance using multi-turn conversation tests.

What's next