
Container Service for Kubernetes: Configure smart routing for LLM inference services using an inference gateway

Last Updated: Sep 04, 2025

Traditional load balancing algorithms can evenly distribute standard HTTP requests across different workloads. However, for large language model (LLM) inference services, the load that each request places on the backend is difficult to predict, so evenly distributed requests can still leave backends unevenly loaded. Gateway with Inference Extension is an enhanced component built on the Kubernetes Gateway API and its Inference Extension specification. It uses smart routing to improve load balancing across multiple inference service workloads, provides load balancing policies for different LLM inference scenarios, and supports features such as phased releases and inference request queuing.

Prerequisites

  • The Gateway with Inference Extension component is installed in the cluster.

  • An LLM inference service, such as a vLLM or SGLang deployment of the Qwen3-32B model, is deployed in the cluster.

Step 1: Configure smart routing for the inference service

Gateway with Inference Extension provides two smart routing load balancing policies to meet different inference service needs.

  • Load balancing based on request queue length and GPU cache utilization (default policy).

  • Prefix-aware load balancing policy (Prefix Cache Aware Routing).

You can enable the smart routing feature of the inference gateway by declaring InferencePool and InferenceModel resources for the inference service. Adjust the InferencePool and InferenceModel resource configurations based on the backend deployment method and the selected load balancing policy.
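Before you apply the manifests, you can optionally check that the Inference Extension CRDs are installed in the cluster. The following check is a minimal sketch; the CRD names assume the inference.networking.x-k8s.io API group used in the manifests below.

    # Optional: confirm that the Gateway API Inference Extension CRDs exist in the cluster
    kubectl get crd inferencepools.inference.networking.x-k8s.io inferencemodels.inference.networking.x-k8s.io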

Load balancing based on request queue length and GPU cache utilization

If the InferencePool does not set the inference.networking.x-k8s.io/routing-strategy annotation, the smart routing policy based on request queue length and GPU cache utilization is used by default. This policy dynamically allocates requests according to the real-time load of the backend inference service, including the request queue length and GPU cache utilization, to achieve optimal load balancing.

  1. Create an inference_networking.yaml file.

    Single-machine vLLM deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: vllm-inference
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    Single-machine SGLang deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
      annotations:
        inference.networking.x-k8s.io/model-server-runtime: sglang
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: sgl-inference
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    Distributed vLLM deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: vllm-multi-nodes
        role: leader
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    Distributed SGLang deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
      annotations:
        inference.networking.x-k8s.io/model-server-runtime: sglang
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: sglang-multi-nodes
        role: leader
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    SGLang PD disaggregation deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference_backend: sglang # Selects both prefill and decode workloads
    ---
    # InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool
    apiVersion: inferenceextension.alibabacloud.com/v1alpha1
    kind: InferenceTrafficPolicy
    metadata:
      name: inference-policy
    spec:
      poolRef:
        name: qwen-inference-pool
      modelServerRuntime: sglang # Specifies SGLang as the runtime framework for the backend service
      profile:
        pd:  # Specifies that the backend service is deployed in PD disaggregation mode
          pdRoleLabelName: rolebasedgroup.workloads.x-k8s.io/role # Differentiates between prefill and decode roles in the InferencePool using pod labels
          kvTransfer:
            bootstrapPort: 34000 # The bootstrap port used for KV Cache transfer in the SGLang PD disaggregation service. This must be consistent with the disaggregation-bootstrap-port parameter specified in the RoleBasedGroup deployment.
  2. Apply the manifest to enable load balancing based on request queue length and GPU cache utilization.

    kubectl create -f inference_networking.yaml
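
    Optionally, verify that the resources were created. The resource names below follow the manifests above; note that the SGLang PD disaggregation variant creates an InferenceTrafficPolicy instead of an InferenceModel.

      # Optional: confirm the smart routing resources exist
      kubectl get inferencepool qwen-inference-pool
      kubectl get inferencemodel qwen-inference-model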

Prefix-aware load balancing (Prefix Cache Aware Routing)

The Prefix Cache Aware Routing policy sends requests that share the same prefix content to the same inference server pod whenever possible. If the model server has the automatic prefix cache (APC) feature enabled, this policy can improve the prefix cache hit ratio and reduce response times.

Important

vLLM v0.9.2 and the SGLang framework used in this document enable automatic prefix caching by default. You do not need to redeploy the service to enable prefix caching.

To enable the prefix-aware load balancing policy, add the following annotation to the InferencePool: inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
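
If the InferencePool already exists, you can also switch it to this policy by setting the annotation in place instead of editing the manifest. The following command is a minimal sketch that assumes the pool name used in this document.

    # Illustrative only: set the routing-strategy annotation on an existing InferencePool
    kubectl annotate inferencepool qwen-inference-pool \
      inference.networking.x-k8s.io/routing-strategy=PREFIX_CACHE --overwrite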

  1. Create a Prefix_Cache.yaml file.

    Single-machine vLLM deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
      annotations:
        inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: vllm-inference
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    Single-machine SGLang deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
      annotations:
        inference.networking.x-k8s.io/model-server-runtime: sglang
        inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: sgl-inference
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    Distributed vLLM deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
      annotations:
        inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: vllm-multi-nodes
        role: leader
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    Distributed SGLang deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
      annotations:
        inference.networking.x-k8s.io/model-server-runtime: sglang
        inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: sglang-multi-nodes
        role: leader
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    SGLang PD disaggregation deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference_backend: sglang # Selects both prefill and decode workloads
    ---
    # InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool
    apiVersion: inferenceextension.alibabacloud.com/v1alpha1
    kind: InferenceTrafficPolicy
    metadata:
      name: inference-policy
    spec:
      poolRef:
        name: qwen-inference-pool
      modelServerRuntime: sglang # Specifies SGLang as the runtime framework for the backend service
      profile:
        pd:  # Specifies that the backend service is deployed in PD disaggregation mode
          trafficPolicy:
            prefixCache: # Declares the prefix cache load balancing policy
              mode: estimate
          prefillPolicyRef: prefixCache
          decodePolicyRef: prefixCache # Applies prefix-aware load balancing to both prefill and decode
          pdRoleLabelName: rolebasedgroup.workloads.x-k8s.io/role # Differentiates between prefill and decode roles in the InferencePool using pod labels
          kvTransfer:
            bootstrapPort: 34000 # The bootstrap port used for KV Cache transfer in the SGLang PD disaggregation service. This must be consistent with the disaggregation-bootstrap-port parameter specified in the RoleBasedGroup deployment.
  2. Apply the manifest to enable prefix-aware load balancing.

    kubectl create -f Prefix_Cache.yaml
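
    Optionally, confirm that the routing-strategy annotation is set on the pool. This quick check assumes the resource name used in the manifests above; the output should include the routing-strategy annotation with the value PREFIX_CACHE.

      # Optional: check that the prefix-aware routing annotation was applied
      kubectl get inferencepool qwen-inference-pool -o yaml | grep routing-strategy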

The following describes the InferencePool and InferenceModel configuration items.

InferencePool configuration items:

  • metadata.annotations.inference.networking.x-k8s.io/model-server-runtime (string): Specifies the model server runtime, such as sglang. Default value: none.

  • metadata.annotations.inference.networking.x-k8s.io/routing-strategy (string): Specifies the routing policy. Valid values: DEFAULT and PREFIX_CACHE. Default value: the smart routing policy based on request queue length and GPU cache utilization.

  • spec.targetPortNumber (int): Specifies the port number of the inference service. Default value: none.

  • spec.selector (map[string]string): The selector used to match the pods of the inference service. Default value: none.

  • spec.extensionRef (ObjectReference): A reference to the inference extension service. Default value: none.

InferenceModel configuration items:

  • spec.modelName (string): The model name, used for route matching. Default value: none.

  • spec.criticality (string): The criticality level of the model. Valid values: Critical and Standard. Default value: none.

  • spec.poolRef (PoolReference): The InferencePool resource that the InferenceModel is associated with. Default value: none.

Step 2: Deploy the gateway

  1. Create a gateway_networking.yaml file.

    apiVersion: gateway.networking.k8s.io/v1
    kind: GatewayClass
    metadata:
      name: inference-gateway-class
    spec:
      controllerName: inference.networking.x-k8s.io/gateway-controller
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: inference-gateway
    spec:
      gatewayClassName: inference-gateway-class
      listeners:
      - name: http-llm
        protocol: HTTP
        port: 8080
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: inference-route
    spec:
      parentRefs:
      - name: inference-gateway
      rules:
      - matches:
        - path:
            type: PathPrefix
            value: /v1
        backendRefs:
        - name: qwen-inference-pool
          kind: InferencePool
          group: inference.networking.x-k8s.io
  2. Create the GatewayClass, Gateway, and HTTPRoute resources to configure the LLM inference service route on port 8080.

    kubectl create -f gateway_networking.yaml
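
    Optionally, confirm that the gateway resources were created. The resource names follow the manifest above; the exact output columns depend on your gateway controller version.

      # Optional: confirm the gateway resources exist
      kubectl get gatewayclass inference-gateway-class
      kubectl get gateway inference-gateway
      kubectl get httproute inference-route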

Step 3: Verify the inference gateway configuration

  1. Run the following command to obtain the external endpoint of the gateway:

    export GATEWAY_HOST=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
  2. Test access to the service on port 8080 using the curl command:

    curl http://${GATEWAY_HOST}:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "/models/Qwen3-32B",
        "messages": [
          {"role": "user", "content": "Hello, this is a test"}
        ],
        "max_tokens": 50
      }'
  3. Verify the different load balancing policies.

    Verify the load balancing policy based on request queue length and GPU cache utilization

    The default policy performs smart routing based on the request queue length and GPU cache utilization. You can observe its behavior by stress testing the inference service and monitoring the time to first token (TTFT) and throughput metrics.

    For more information about the specific testing method, see Configure observability metrics and dashboards for LLM services.
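
    To generate load for this observation, you can send a batch of concurrent requests, for example with a simple shell loop like the sketch below. The request body and concurrency level are illustrative, and curl's time_starttransfer only approximates the time to first token.

      # Send 20 concurrent chat completion requests and print the time to first byte for each
      for i in $(seq 1 20); do
        curl -s -o /dev/null -w "request ${i}: TTFB %{time_starttransfer}s\n" \
          http://${GATEWAY_HOST}:8080/v1/chat/completions \
          -H "Content-Type: application/json" \
          -d '{"model":"/models/Qwen3-32B","messages":[{"role":"user","content":"Hello, this is a load test"}],"max_tokens":50}' &
      done
      wait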

    Verify prefix-aware load balancing

    Create test files to verify that prefix-aware load balancing is working.

    1. Generate round1.txt:

      echo '{"max_tokens":24,"messages":[{"content":"Hi, here's some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"/models/Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txt
    2. Generate round2.txt:

      echo '{"max_tokens":3,"messages":[{"content":"Hi, here's some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you're setting up a fun test. I'm ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"/models/Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txt
    3. Run the following commands to perform the test:

      curl -X POST ${GATEWAY_HOST}:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
      curl -X POST ${GATEWAY_HOST}:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
    4. Check the logs of the Inference Extension Processor to confirm that prefix-aware load balancing is working:

      kubectl logs deploy/inference-gateway-ext-proc -n envoy-gateway-system | grep "Request Handled"

      If you see the same pod name in both log entries, prefix-aware load balancing is working.

      For more information about the specific testing method and results for prefix-aware load balancing, see Evaluate inference service performance using multi-turn conversation tests.