Container Service for Kubernetes: Configure smart routing for an LLM inference service using Gateway with Inference Extension

Last Updated: Jan 15, 2026

Traditional load balancing algorithms can evenly distribute standard HTTP requests across workloads. For LLM inference services, however, the load generated by each request varies widely and is hard to predict. Gateway with Inference Extension is an enhanced component built on the Kubernetes Gateway API and its Inference Extension specification. It uses smart routing to balance load across multiple inference service workloads, provides load balancing policies suited to different LLM inference scenarios, and supports request queuing and traffic management for canary releases.

Prerequisites

Step 1: Configure smart routing

Gateway with Inference Extension provides two smart routing load balancing policies to meet different inference service requirements.

  • Default policy (least load): Routes each request to the backend with the lowest current load, based on request queue length and GPU cache utilization.

  • Prefix Cache Aware Routing: Routes requests that share a prompt prefix to the same backend Pod to increase the prefix cache hit ratio.

Configure the InferencePool and InferenceModel resources based on your backend inference service's deployment method and your chosen load balancing policy.

Default policy

When the routing-strategy annotation is not set on the InferencePool, the gateway uses the default smart routing policy. This policy balances load by dynamically assigning requests based on the real-time load of the backend inference service, including request queue length and GPU cache utilization.

  1. Create an inference_networking.yaml file.

    Single-machine vLLM

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: vllm-inference
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    Single-machine SGLang

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
      annotations:
        inference.networking.x-k8s.io/model-server-runtime: sglang
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: sgl-inference
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    Distributed vLLM

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: vllm-multi-nodes
        role: leader
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    Distributed SGLang

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
      annotations:
        inference.networking.x-k8s.io/model-server-runtime: sglang
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: sglang-multi-nodes
        role: leader
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    SGLang Prefill-Decode (P/D) separation

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference_backend: sglang # Selects both prefill and decode workloads
    ---
    # InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool
    apiVersion: inferenceextension.alibabacloud.com/v1alpha1
    kind: InferenceTrafficPolicy
    metadata:
      name: inference-policy
    spec:
      poolRef:
        name: qwen-inference-pool
      modelServerRuntime: sglang # Specifies SGLang as the runtime framework for the backend service
      profile:
        pd:  # Specifies that the backend service is deployed in P/D separation mode
          pdRoleLabelName: rolebasedgroup.workloads.x-k8s.io/role # Differentiates between prefill and decode roles in the InferencePool using pod labels
          kvTransfer:
            bootstrapPort: 34000 # The bootstrap port used for KV Cache transfer in the SGLang P/D separation service. This must be consistent with the disaggregation-bootstrap-port parameter specified in the RoleBasedGroup deployment.
  2. Apply the configuration.

    kubectl create -f inference_networking.yaml
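
    Optionally, confirm that the resources exist before moving on. This is a minimal sanity check, assuming the Inference Extension CRDs are installed in the cluster; adjust the resource names if you applied the P/D variant:

    # List the InferencePool and InferenceModel created above
    kubectl get inferencepools.inference.networking.x-k8s.io qwen-inference-pool
    kubectl get inferencemodels.inference.networking.x-k8s.io qwen-inference-model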

Prefix Cache Aware Routing

The Prefix Cache Aware Routing policy increases the prefix cache hit ratio and reduces response time by routing requests with common prefixes to the same inference server Pod.

Important

The vLLM v0.9.2 and SGLang framework versions used in this document have prefix caching enabled by default, so you do not need to redeploy the service to enable it.
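
If you run a framework version where prefix caching is not on by default, it can usually be enabled when the model server is launched. The commands below are a hedged sketch for upstream vLLM and SGLang launch commands, not part of this topic's deployments; check your framework version's documentation for the exact option names:

# vLLM OpenAI-compatible server: enable automatic prefix caching explicitly
vllm serve /models/Qwen3-32B --enable-prefix-caching

# SGLang: radix (prefix) caching is enabled by default; remove --disable-radix-cache if present
python -m sglang.launch_server --model-path /models/Qwen3-32B --port 8000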

To enable the prefix-aware load balancing policy, add the following annotation to the InferencePool: inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
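
If the InferencePool already exists, you can also set the annotation in place instead of editing the manifest. The following is a minimal sketch using kubectl annotate; whether the gateway picks up the change without a reapply depends on the component version, so reapplying the manifest as shown below remains the documented path:

# Add or update the routing-strategy annotation on the existing InferencePool
kubectl annotate inferencepools.inference.networking.x-k8s.io qwen-inference-pool \
  inference.networking.x-k8s.io/routing-strategy="PREFIX_CACHE" --overwrite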

  1. Create a Prefix_Cache.yaml file.

    Single-machine vLLM deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
      annotations:
        inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: vllm-inference
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    Single-machine SGLang deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
      annotations:
        inference.networking.x-k8s.io/model-server-runtime: sglang
        inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: sgl-inference
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    Distributed vLLM deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
      annotations:
        inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: vllm-multi-nodes
        role: leader
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    Distributed SGLang deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
      annotations:
        inference.networking.x-k8s.io/model-server-runtime: sglang
        inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: sglang-multi-nodes
        role: leader
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    SGLang PD disaggregation deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference_backend: sglang # Selects both prefill and decode workloads
    ---
    # InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool
    apiVersion: inferenceextension.alibabacloud.com/v1alpha1
    kind: InferenceTrafficPolicy
    metadata:
      name: inference-policy
    spec:
      poolRef:
        name: qwen-inference-pool
      modelServerRuntime: sglang # Specifies SGLang as the runtime framework for the backend service
      profile:
        pd:  # Specifies that the backend service is deployed in PD disaggregation mode
          trafficPolicy:
            prefixCache: # Declares the prefix cache load balancing policy
              mode: estimate
          prefillPolicyRef: prefixCache
          decodePolicyRef: prefixCache # Applies prefix-aware load balancing to both prefill and decode
          pdRoleLabelName: rolebasedgroup.workloads.x-k8s.io/role # Differentiates between prefill and decode roles in the InferencePool using pod labels
          kvTransfer:
            bootstrapPort: 34000 # The bootstrap port used for KV Cache transfer in the SGLang PD disaggregation service. This must be consistent with the disaggregation-bootstrap-port parameter specified in the RoleBasedGroup deployment.
  2. Apply the configuration for prefix-aware load balancing.

    kubectl create -f Prefix_Cache.yaml
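
    To confirm that the prefix-aware policy is active on the pool, you can print the InferencePool annotations and check for the routing-strategy value (an optional sanity check):

    kubectl get inferencepools.inference.networking.x-k8s.io qwen-inference-pool \
      -o jsonpath='{.metadata.annotations}'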

Parameters

The first five parameters belong to the InferencePool resource; the last three belong to the InferenceModel resource.

| Parameter | Type | Description | Default value |
| --- | --- | --- | --- |
| metadata.annotations.inference.networking.x-k8s.io/model-server-runtime | string | Specifies the model server runtime. Example: sglang. | None |
| metadata.annotations.inference.networking.x-k8s.io/routing-strategy | string | Specifies the routing policy. Valid values: DEFAULT, PREFIX_CACHE. | DEFAULT (smart routing based on request queue length and GPU cache utilization) |
| spec.targetPortNumber | int | The port number of the inference service. | None |
| spec.selector | map[string]string | The label selector that matches the inference service pods. | None |
| spec.extensionRef | ObjectReference | A reference to the inference extension service. | None |
| spec.modelName | string | The model name, used for route matching. | None |
| spec.criticality | string | The criticality level of the model. Valid values: Critical, Standard. | None |
| spec.poolRef | PoolReference | The associated InferencePool resource. | None |

Step 2: Deploy the gateway

  1. Create a gateway_networking.yaml file.

    apiVersion: gateway.networking.k8s.io/v1
    kind: GatewayClass
    metadata:
      name: inference-gateway-class
    spec:
      controllerName: inference.networking.x-k8s.io/gateway-controller
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: inference-gateway
    spec:
      gatewayClassName: inference-gateway-class
      listeners:
      - name: http-llm
        protocol: HTTP
        port: 8080
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: inference-route
    spec:
      parentRefs:
      - name: inference-gateway
      rules:
      - matches:
        - path:
            type: PathPrefix
            value: /v1
        backendRefs:
        - name: qwen-inference-pool
          kind: InferencePool
          group: inference.networking.x-k8s.io
  2. Apply the configuration to create the GatewayClass, Gateway, and HTTPRoute resources, which expose the LLM inference service through the gateway on port 8080.

    kubectl create -f gateway_networking.yaml
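
    Optionally, wait for the gateway controller to provision the gateway before testing. The exact status columns depend on the controller, but the Gateway should report an address and a programmed condition:

    # Check that the Gateway has an address and that the HTTPRoute was created
    kubectl get gateway inference-gateway
    kubectl get httproute inference-route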

Step 3: Verify the gateway configuration

  1. Run the following command to retrieve the gateway's external access address:

    export GATEWAY_HOST=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
  2. Use curl to test access to the service on port 8080:

    curl http://${GATEWAY_HOST}:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "/models/Qwen3-32B",
        "messages": [
          {"role": "user", "content": "Hello, this is a test"}
        ],
        "max_tokens": 50
      }'
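
    A successful request returns an OpenAI-compatible chat completion. To script the check, the following minimal variant prints only the HTTP status code (expect 200 when the route and the backend are healthy):

    curl -s -o /dev/null -w '%{http_code}\n' http://${GATEWAY_HOST}:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "/models/Qwen3-32B", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 8}'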
  3. Verify the different load balancing policies.

    Default policy

    The default policy uses smart routing based on request queue length and GPU cache utilization. Verify this by running a stress test on the inference service and observing the Time to First Token (TTFT) and throughput metrics.

    For detailed testing methods, see Configure observability metrics and dashboards for LLM services.
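
    For a quick ad-hoc comparison, you can also fire a batch of concurrent requests from a shell and compare latency with and without smart routing. This is a rough sketch only, not a substitute for a proper benchmark; the prompt and request count are arbitrary:

      # Send 20 concurrent requests and print the total time of each one
      for i in $(seq 1 20); do
        curl -s -o /dev/null -w "request ${i}: %{time_total}s\n" \
          http://${GATEWAY_HOST}:8080/v1/chat/completions \
          -H "Content-Type: application/json" \
          -d '{"model": "/models/Qwen3-32B", "messages": [{"role": "user", "content": "Write one sentence about Kubernetes."}], "max_tokens": 64}' &
      done
      wait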

    Prefix Cache Aware Routing

    Create test files to verify that the Prefix Cache Aware Routing policy is functioning correctly.

    1. Generate round1.txt:

      echo '{"max_tokens":24,"messages":[{"content":"Hi, here's some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"/models/Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txt
    2. Generate round2.txt:

      echo '{"max_tokens":3,"messages":[{"content":"Hi, here's some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you're setting up a fun test. I'm ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"/models/Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txt
    3. Run the following commands to perform the test:

      curl -X POST ${GATEWAY_HOST}:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
      curl -X POST ${GATEWAY_HOST}:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
    4. Check the logs of the Inference Extension Processor to confirm that prefix-aware load balancing is working:

      kubectl logs deploy/inference-gateway-ext-proc -n envoy-gateway-system | grep "Request Handled"

      If the same Pod name appears in both log entries, Prefix Cache Aware Routing is functioning correctly.
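
      To compare only the two requests that you just sent, limit the output to the most recent matching entries:

      kubectl logs deploy/inference-gateway-ext-proc -n envoy-gateway-system | grep "Request Handled" | tail -n 2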

      For more information about the testing method and results for Prefix Cache Aware Routing, see Evaluate inference service performance using multi-turn conversation tests.