Container Compute Service: Route inference services based on model names using Gateway with Inference Extension

Last Updated: Mar 26, 2026

Gateway with Inference Extension routes inference requests to different backends based on the model name in the request body — without any changes to the client. This topic shows how to configure model-name-based routing for OpenAI-compatible inference services on an ACK cluster, using Qwen-2.5-7B-Instruct and DeepSeek-R1-Distill-Qwen-7B as examples.

Important

Before you begin, familiarize yourself with the concepts of InferencePool and InferenceModel. This topic requires Gateway with Inference Extension version 1.4.0 or later.

How it works

OpenAI-compatible APIs expose LLM inference services through the same interface design, parameters, and response format as OpenAI's official API for models such as GPT-3.5 and GPT-4. This includes:

  • HTTP interface: POST requests, standard endpoint paths, and API key authentication

  • Parameters: model, prompt, temperature, max_tokens, and others

  • Response format: JSON with choices, usage, and id fields

Most major LLM inference engines — including vLLM and SGLang — support OpenAI-compatible APIs.
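For example, a request in the following shape works unchanged against any OpenAI-compatible backend; the endpoint here is a placeholder for your own service address:

curl -X POST http://<inference-endpoint>/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen",
    "messages": [{"role": "user", "content": "Hello"}]
  }'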

The routing challenge

When you expose multiple LLM inference services through a single gateway, you typically want to route each request to the backend that serves the requested model. The problem: in OpenAI-compatible APIs, the model name is in the request body, not the headers. Standard HTTP routers cannot inspect the request body for routing decisions.

Gateway with Inference Extension solves this by parsing the request body, extracting the model name, and injecting it into the X-Gateway-Model-Name request header. Your HTTPRoute rules then match on this header, enabling model-name-based routing without any client changes.
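For example, an HTTPRoute rule can then match the injected header exactly (the complete routes appear in Step 3):

matches:
- headers:
  - type: Exact
    name: X-Gateway-Model-Name
    value: qwen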

The example in this topic routes requests to two inference services on the same gateway instance:

  • Requests with "model": "qwen" → Qwen-2.5-7B-Instruct

  • Requests with "model": "deepseek-r1" → DeepSeek-R1-Distill-Qwen-7B


Prerequisites

Before you begin, ensure that you have:

  • An ACK cluster with available GPU nodes. A10 GPUs are recommended for ACK clusters; L20 (GN8IS) GPUs are recommended for ACS GPU compute power.

  • Gateway with Inference Extension version 1.4.0 or later installed in the cluster.

Note: The images used in this topic are large. Transfer them to Alibaba Cloud Container Registry (ACR) in advance and pull them over the cluster's internal network. Pulling directly from the public internet is slow and constrained by your cluster's elastic IP address (EIP) bandwidth.
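One way to transfer an image is docker pull, tag, and push; the ACR domain and namespace below are placeholders for your own registry:

docker pull registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1
docker tag registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1 <your-acr-domain>/<namespace>/qwen-2.5-7b-instruct-lora:v0.1
docker push <your-acr-domain>/<namespace>/qwen-2.5-7b-instruct-lora:v0.1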

Step 1: Deploy the sample inference services

Create a file named vllm-service.yaml.


apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: qwen
  name: qwen
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: qwen
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8000"
        prometheus.io/scrape: "true"
      labels:
        app: qwen
    spec:
      containers:
      - command:
        - sh
        - -c
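        # Serve the local Qwen model over an OpenAI-compatible API on port 8000.
        # --served-model-name qwen is the name clients pass in the "model" field;
        # --lora-modules registers two LoRA adapters under their own model names.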
        - vllm serve /models/Qwen-2.5-7B-Instruct --port 8000 --trust-remote-code --served-model-name qwen --max-model-len 8192 --gpu-memory-utilization 0.95 --enforce-eager --enable-lora --max-loras 2 --max-cpu-loras 4 --lora-modules travel-helper-v1=/models/Qwen-TravelHelper-Lora travel-helper-v2=/models/Qwen-TravelHelper-Lora-v2
        image: registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1
        imagePullPolicy: IfNotPresent
        name: custom-serving
        ports:
        - containerPort: 8000
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          initialDelaySeconds: 30
          periodSeconds: 30
          successThreshold: 1
          tcpSocket:
            port: 8000
          timeoutSeconds: 1
        resources:
          limits:
            nvidia.com/gpu: "1"
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - emptyDir:
          medium: Memory
          sizeLimit: 30Gi
        name: dshm
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: qwen
  name: qwen
spec:
  ports:
  - name: http-serving
    port: 8000
    protocol: TCP
    targetPort: 8000
  selector:
    app: qwen
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: deepseek-r1
  name: deepseek-r1
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: deepseek-r1
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8000"
        prometheus.io/scrape: "true"
      labels:
        app: deepseek-r1
    spec:
      containers:
      - command:
        - sh
        - -c
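        # Serve DeepSeek-R1-Distill-Qwen-7B under the client-facing model name "deepseek-r1".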
        - vllm serve /models/DeepSeek-R1-Distill-Qwen-7B --port 8000 --trust-remote-code --served-model-name deepseek-r1 --max-model-len 8192 --gpu-memory-utilization 0.95 --enforce-eager
        image: registry-cn-hangzhou.ack.aliyuncs.com/dev/ds-r1-qwen-7b-without-lora:v0.1
        imagePullPolicy: IfNotPresent
        name: custom-serving
        ports:
        - containerPort: 8000
          name: restful
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          initialDelaySeconds: 30
          periodSeconds: 30
          successThreshold: 1
          tcpSocket:
            port: 8000
          timeoutSeconds: 1
        resources:
          limits:
            nvidia.com/gpu: "1"
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - emptyDir:
          medium: Memory
          sizeLimit: 30Gi
        name: dshm
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: deepseek-r1
  name: deepseek-r1
spec:
  ports:
  - name: http-serving
    port: 8000
    protocol: TCP
    targetPort: 8000
  selector:
    app: deepseek-r1

Apply the manifest:

kubectl apply -f vllm-service.yaml
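Wait until both Deployments are ready before proceeding; model loading can take several minutes:

kubectl rollout status deployment/qwen
kubectl rollout status deployment/deepseek-r1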

Step 2: Deploy the inference routes

Create the InferencePool and InferenceModel resources. Each InferencePool selects the pods that serve a specific model. Each InferenceModel maps a model name to its pool.

Create a file named inference-pool.yaml:

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-pool
  namespace: default
spec:
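  # extensionRef points to the endpoint-picker (ext-proc) extension Service used
  # for this pool; selector and targetPortNumber identify the backend model pods.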
  extensionRef:
    group: ""
    kind: Service
    name: qwen-ext-proc
  selector:
    app: qwen
  targetPortNumber: 8000
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen
spec:
  criticality: Critical
  modelName: qwen
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-pool
  targetModels:
  - name: qwen
    weight: 100
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: deepseek-pool
  namespace: default
spec:
  extensionRef:
    group: ""
    kind: Service
    name: deepseek-ext-proc
  selector:
    app: deepseek-r1
  targetPortNumber: 8000
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: deepseek-r1
spec:
  criticality: Critical
  modelName: deepseek-r1
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: deepseek-pool
  targetModels:
  - name: deepseek-r1
    weight: 100

Apply the manifest:

kubectl apply -f inference-pool.yaml
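Confirm that the resources were created:

kubectl get inferencepools,inferencemodels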

Step 3: Deploy the gateway and routing rules

This step creates the GatewayClass, Gateway, HTTPRoute, ClientTrafficPolicy, and BackendTrafficPolicy resources.

Create a file named inference-gateway.yaml:

apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: inference-gateway
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: inference-gateway
  listeners:
    - name: llm-gw
      protocol: HTTP
      port: 8080
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: ClientTrafficPolicy
metadata:
  name: client-buffer-limit
spec:
  connection:
    bufferLimit: 20Mi
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: inference-gateway
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: backend-timeout
spec:
  timeout:
    http:
      requestTimeout: 24h
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway

The ClientTrafficPolicy sets the client-to-gateway buffer limit to 20 MiB. The BackendTrafficPolicy sets a 24-hour request timeout to handle long-running inference requests.

Create a file named inference-route.yaml.

The HTTPRoute rules match on the X-Gateway-Model-Name header, which Gateway with Inference Extension automatically populates by parsing the model name from the request body.

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
    sectionName: llm-gw
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: qwen-pool
      weight: 1
    matches:
    - headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: qwen
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: deepseek-pool
      weight: 1
    matches:
    - headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: deepseek-r1

Apply both manifests:

kubectl apply -f inference-gateway.yaml
kubectl apply -f inference-route.yaml
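Before testing, confirm that the gateway is programmed and has been assigned an address:

kubectl get gateway inference-gateway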

Step 4: Verify routing

Get the gateway IP address:

export GATEWAY_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
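The variable should print a non-empty IP address; if it is empty, the gateway has not been assigned an address yet:

echo ${GATEWAY_IP}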

Send a request to the qwen model:

curl -X POST ${GATEWAY_IP}:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen",
    "temperature": 0,
    "messages": [
      {
        "role": "user",
        "content": "who are you?"
      }
    ]
  }'

Expected output:

{"id":"chatcmpl-475bc88d-b71d-453f-8f8e-0601338e11a9","object":"chat.completion","created":1748311216,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"I am Qwen, a large language model created by Alibaba Cloud. I am here to assist you with any questions or conversations you might have! How can I help you today?","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":33,"total_tokens":70,"completion_tokens":37,"prompt_tokens_details":null},"prompt_logprobs":null}

Send a request to the deepseek-r1 model:

curl -X POST ${GATEWAY_IP}:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "deepseek-r1",
    "temperature": 0,
    "messages": [
      {
        "role": "user",
        "content": "who are you?"
      }
    ]
  }'

Expected output:

{"id":"chatcmpl-9a143fc5-8826-46bc-96aa-c677d130aef9","object":"chat.completion","created":1748312185,"model":"deepseek-r1","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"Alright, someone just asked, \"who are you?\" Hmm, I need to explain who I am in a clear and friendly way. Let's see, I'm an AI created by DeepSeek, right? I don't have a physical form, so I don't have a \"name\" like you do. My purpose is to help with answering questions and providing information. I'm here to assist with a wide range of topics, from general knowledge to more specific inquiries. I understand that I can't do things like think or feel, but I'm here to make your day easier by offering helpful responses. So, I'll keep it simple and approachable, making sure to convey that I'm here to help with whatever they need.\n</think>\n\nI'm DeepSeek-R1-Lite-Preview, an AI assistant created by the Chinese company DeepSeek. I'm here to help you with answering questions, providing information, and offering suggestions. I don't have personal experiences or emotions, but I'm designed to make your interactions with me as helpful and pleasant as possible. How can I assist you today?","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":9,"total_tokens":232,"completion_tokens":223,"prompt_tokens_details":null},"prompt_logprobs":null}

Both responses confirm that requests are routed to the correct inference service based on the model name in the request body.