
Container Service for Kubernetes: Implement model name-based inference service routing by using Gateway with Inference Extension

Last Updated: Jun 16, 2025

After you deploy generative AI inference services that expose the OpenAI API format, the Gateway with Inference Extension component lets you define routing policies based on the model name in the request, including canary releases based on traffic splitting, traffic mirroring, and traffic circuit breaking. This topic describes how to implement model name-based routing for inference services by using Gateway with Inference Extension.

Important
  • Before reading this topic, make sure you understand the concepts of InferencePool and InferenceModel.

  • This topic requires Gateway with Inference Extension 1.4.0 or later.

Background information

OpenAI-compatible APIs

OpenAI-compatible APIs are generative large language model (LLM) inference service APIs that are highly compatible with the official OpenAI API (used by models such as GPT-3.5 and GPT-4) in terms of interfaces, parameters, and response format. The compatibility is reflected in the following aspects:

  • Interface structure: Uses the same HTTP request methods (such as POST), endpoint formats, and authentication methods (such as API keys).

  • Parameter support: Supports parameters similar to the OpenAI API, such as model, prompt, temperature, and max_tokens.

  • Response format: Returns the same JSON structure as OpenAI, including fields such as choices, usage, and id.

Currently, mainstream third-party LLM services and LLM inference engines, such as vLLM and SGLang, provide OpenAI-compatible APIs to simplify user migration and ensure a consistent user experience.
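
For reference, the following is a minimal sketch of an OpenAI-compatible chat completion request. The service address and API key are placeholders rather than values used elsewhere in this topic; the parameters and the response fields such as choices, usage, and id follow the OpenAI API format.

    curl -X POST http://<llm-service-address>/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -H 'Authorization: Bearer <api-key>' \
      -d '{
        "model": "qwen",
        "temperature": 0,
        "max_tokens": 256,
        "messages": [
          {"role": "user", "content": "Hello"}
        ]
      }'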

Scenarios

For generative AI inference services, the model name that a user requests is important request metadata, and defining routing policies based on it is a common requirement when you expose inference services through a gateway. However, for LLM inference services that provide OpenAI-compatible APIs, the model name is carried in the request body, and ordinary routing policies cannot route requests based on the request body.

Gateway with Inference Extension supports routing policies based on model names for OpenAI-compatible APIs. It parses the model name from the request body and attaches it to the X-Gateway-Model-Name request header, which provides out-of-the-box model name-based routing. To use this feature, you only need to match the X-Gateway-Model-Name request header in the HTTPRoute resource; no client modifications are required.

This example demonstrates how to route requests to the Qwen-2.5-7B-Instruct and DeepSeek-R1-Distill-Qwen-7B inference services behind the same gateway instance based on the model name in the request: requests for the Qwen model are routed to the Qwen inference service, and requests for the DeepSeek-R1 model are routed to the DeepSeek-R1 inference service.

Prerequisites

Note

For the image used in this topic, we recommend that you use A10 cards for ACK clusters and GN8IS (8th-gen GPU B) cards for Alibaba Cloud Container Compute Service (ACS) GPU computing power.

Because the LLM image is large, we recommend that you push it to Container Registry in advance and pull it over the internal network address, for example as shown below. If you pull the image over the Internet, the speed depends on the bandwidth of the cluster's elastic IP address (EIP), which may result in long wait times.
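
The following commands are a minimal sketch of copying the sample image to your own Container Registry instance; the target registry address and namespace are placeholders that you must replace with your own values. After the image is pushed, update the image field in the Deployment in Step 1 accordingly.

    # Pull the public image and push it to your own Container Registry instance.
    # <your-registry-vpc-address> and <namespace> are placeholders.
    docker pull registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1
    docker tag registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1 <your-registry-vpc-address>/<namespace>/qwen-2.5-7b-instruct-lora:v0.1
    docker push <your-registry-vpc-address>/<namespace>/qwen-2.5-7b-instruct-lora:v0.1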

Procedure

Step 1: Deploy a sample inference service

  1. Create vllm-service.yaml.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: qwen
      name: qwen
    spec:
      progressDeadlineSeconds: 600
      replicas: 1
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: qwen
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: "8000"
            prometheus.io/scrape: "true"
          labels:
            app: qwen
        spec:
          containers:
          - command:
            - sh
            - -c
            - vllm serve /models/Qwen-2.5-7B-Instruct --port 8000 --trust-remote-code --served-model-name qwen --max-model-len 8192 --gpu-memory-utilization 0.95 --enforce-eager --enable-lora --max-loras 2 --max-cpu-loras 4 --lora-modules travel-helper-v1=/models/Qwen-TravelHelper-Lora travel-helper-v2=/models/Qwen-TravelHelper-Lora-v2
            image: registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1
            imagePullPolicy: IfNotPresent
            name: custom-serving
            ports:
            - containerPort: 8000
              name: http
              protocol: TCP
            readinessProbe:
              failureThreshold: 3
              initialDelaySeconds: 30
              periodSeconds: 30
              successThreshold: 1
              tcpSocket:
                port: 8000
              timeoutSeconds: 1
            resources:
              limits:
                nvidia.com/gpu: "1"
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
          volumes:
          - emptyDir:
              medium: Memory
              sizeLimit: 30Gi
            name: dshm
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: qwen
      name: qwen
    spec:
      ports:
      - name: http-serving
        port: 8000
        protocol: TCP
        targetPort: 8000
      selector:
        app: qwen
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: deepseek-r1
      name: deepseek-r1
    spec:
      progressDeadlineSeconds: 600
      replicas: 1 
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: deepseek-r1
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: "8000"
            prometheus.io/scrape: "true"
          labels:
            app: deepseek-r1
        spec:
          containers:
          - command:
            - sh
            - -c
            - vllm serve /models/DeepSeek-R1-Distill-Qwen-7B --port 8000 --trust-remote-code --served-model-name deepseek-r1 --max-model-len 8192 --gpu-memory-utilization 0.95 --enforce-eager
            image: registry-cn-hangzhou.ack.aliyuncs.com/dev/ds-r1-qwen-7b-without-lora:v0.1
            imagePullPolicy: IfNotPresent
            name: custom-serving
            ports:
            - containerPort: 8000
              name: restful
              protocol: TCP
            readinessProbe:
              failureThreshold: 3
              initialDelaySeconds: 30
              periodSeconds: 30
              successThreshold: 1
              tcpSocket:
                port: 8000
              timeoutSeconds: 1
            resources:
              limits:
                nvidia.com/gpu: "1"
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
          volumes:
          - emptyDir:
              medium: Memory
              sizeLimit: 30Gi
            name: dshm
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: deepseek-r1
      name: deepseek-r1
    spec:
      ports:
      - name: http-serving
        port: 8000
        protocol: TCP
        targetPort: 8000
      selector:
        app: deepseek-r1
  2. Deploy the sample inference service.

    kubectl apply -f vllm-service.yaml
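
    Optionally, confirm that both inference services are running before you continue. Loading the models can take several minutes; the timeout value below is only an example.

    kubectl get pods -l app=qwen
    kubectl get pods -l app=deepseek-r1
    # Wait until both Deployments become available.
    kubectl wait --for=condition=Available deployment/qwen deployment/deepseek-r1 --timeout=20m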

Step 2: Deploy inference routing

In this step, you create InferencePool resources and InferenceModel resources.

  1. Create inference-pool.yaml.

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-pool
      namespace: default
    spec:
      extensionRef:
        group: ""
        kind: Service
        name: qwen-ext-proc
      selector:
        app: qwen
      targetPortNumber: 8000
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen
    spec:
      criticality: Critical
      modelName: qwen
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-pool
      targetModels:
      - name: qwen
        weight: 100
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: deepseek-pool
      namespace: default
    spec:
      extensionRef:
        group: ""
        kind: Service
        name: deepseek-ext-proc
      selector:
        app: deepseek-r1
      targetPortNumber: 8000
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: deepseek-r1
    spec:
      criticality: Critical
      modelName: deepseek-r1
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: deepseek-pool
      targetModels:
      - name: deepseek-r1
        weight: 100
  2. Deploy the inference routing.

    kubectl apply -f inference-pool.yaml
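
    Optionally, confirm that the InferencePool and InferenceModel resources have been created.

    kubectl get inferencepools.inference.networking.x-k8s.io
    kubectl get inferencemodels.inference.networking.x-k8s.io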

Step 3: Deploy gateway and gateway routing rules

  1. Create inference-gateway.yaml.

    apiVersion: gateway.networking.k8s.io/v1
    kind: GatewayClass
    metadata:
      name: inference-gateway
    spec:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: inference-gateway
    spec:
      gatewayClassName: inference-gateway
      listeners:
        - name: llm-gw
          protocol: HTTP
          port: 8080
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: ClientTrafficPolicy
    metadata:
      name: client-buffer-limit
    spec:
      connection:
        bufferLimit: 20Mi
      targetRefs:
        - group: gateway.networking.k8s.io
          kind: Gateway
          name: inference-gateway
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: BackendTrafficPolicy
    metadata:
      name: backend-timeout
    spec:
      timeout:
        http:
          requestTimeout: 24h
      targetRef:
        group: gateway.networking.k8s.io
        kind: Gateway
        name: inference-gateway
  2. Create inference-route.yaml.

    The model name in the request body is automatically parsed into the X-Gateway-Model-Name request header, which the routing rules in the HTTPRoute then match.

    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: inference-route
    spec:
      parentRefs:
      - group: gateway.networking.k8s.io
        kind: Gateway
        name: inference-gateway
        sectionName: llm-gw
      rules:
      - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: qwen-pool
          weight: 1
        matches:
        - headers:
          - type: Exact
            name: X-Gateway-Model-Name
            value: qwen
      - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: deepseek-pool
          weight: 1
        matches:
        - headers:
          - type: Exact
            name: X-Gateway-Model-Name
            value: deepseek-r1
  3. Deploy the gateway and gateway rules.

    kubectl apply -f inference-gateway.yaml
    kubectl apply -f inference-route.yaml
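
    Optionally, check that the gateway has been assigned an address and that the route has been accepted.

    kubectl get gateway inference-gateway
    kubectl describe httproute inference-route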

Step 4: Verify the effect

  1. Obtain the gateway IP address.

    export GATEWAY_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
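
    If the gateway address has not been assigned yet, the variable is empty. You can print it to confirm before sending requests.

    echo ${GATEWAY_IP}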
  2. Request the qwen model.

    curl -X POST ${GATEWAY_IP}:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
        "model": "qwen",
        "temperature": 0,
        "messages": [
          {
            "role": "user",
            "content": "who are you?" 
          }
        ]
    }'

    Expected output:

    {"id":"chatcmpl-475bc88d-b71d-453f-8f8e-0601338e11a9","object":"chat.completion","created":1748311216,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"I am Qwen, a large language model created by Alibaba Cloud. I am here to assist you with any questions or conversations you might have! How can I help you today?","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":33,"total_tokens":70,"completion_tokens":37,"prompt_tokens_details":null},"prompt_logprobs":null}
  3. Request the deepseek-r1 model.

    curl -X POST ${GATEWAY_IP}:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
        "model": "deepseek-r1",
        "temperature": 0,
        "messages": [
          {
            "role": "user",
            "content": "who are you?" 
          }
        ]
    }'

    Expected output:

    {"id":"chatcmpl-9a143fc5-8826-46bc-96aa-c677d130aef9","object":"chat.completion","created":1748312185,"model":"deepseek-r1","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"Alright, someone just asked, \"who are you?\" Hmm, I need to explain who I am in a clear and friendly way. Let's see, I'm an AI created by DeepSeek, right? I don't have a physical form, so I don't have a \"name\" like you do. My purpose is to help with answering questions and providing information. I'm here to assist with a wide range of topics, from general knowledge to more specific inquiries. I understand that I can't do things like think or feel, but I'm here to make your day easier by offering helpful responses. So, I'll keep it simple and approachable, making sure to convey that I'm here to help with whatever they need.\n</think>\n\nI'm DeepSeek-R1-Lite-Preview, an AI assistant created by the Chinese company DeepSeek. I'm here to help you with answering questions, providing information, and offering suggestions. I don't have personal experiences or emotions, but I'm designed to make your interactions with me as helpful and pleasant as possible. How can I assist you today?","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":9,"total_tokens":232,"completion_tokens":223,"prompt_tokens_details":null},"prompt_logprobs":null}

    The output shows that both inference services are exposed through the same gateway, and that requests are routed to the corresponding inference service based on the model name in the request body.
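
    To further confirm that routing is based on the model name, you can send a request with a model name that no HTTPRoute rule matches. Because no rule matches, the gateway is expected to reject the request; the exact status code (typically 404) depends on the gateway's default behavior, so treat this check as a sketch.

    curl -i -X POST ${GATEWAY_IP}:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
        "model": "undefined-model",
        "messages": [
          {
            "role": "user",
            "content": "who are you?"
          }
        ]
    }'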