Gateway with Inference Extension supports load-aware request queueing and priority scheduling for inference services. When a generative AI inference server is operating at full capacity, the gateway prioritizes requests in the queue based on their assigned model criticality. This ensures that requests for high-priority models are processed first. This topic introduces these capabilities of Gateway with Inference Extension.
This feature requires version 1.4.0 or later of Gateway with Inference Extension.
Background
For generative AI inference services, the request throughput of a single inference server is strictly limited by its GPU resources. When many concurrent requests are sent to the same server, resources such as the key-value (KV) cache in the inference engine become fully occupied, degrading response times and token throughput for all requests.
Gateway with Inference Extension addresses this by monitoring multiple metrics to assess the internal state of each inference server. When a server's load reaches capacity, the gateway queues incoming inference requests, preventing the server from being overloaded and maintaining overall service quality.
Prerequisites
An ACK managed cluster with GPU node pools is created. You can also install the ACK Virtual Node component in the ACK managed cluster to use ACS GPU computing power.
Gateway with Inference Extension 1.4.0 or later is installed with Enable Gateway API Inference Extension selected. For the installation procedure, see Step 2: Install the Gateway with Inference Extension component.
For the image used in this topic, we recommend A10 GPUs for ACK clusters and GN8IS GPUs for Alibaba Cloud Container Compute Service (ACS) GPU computing power.
Because the LLM image is large, we recommend that you push it to Container Registry in advance and pull it over the internal network. Pulling over the public network is limited by the bandwidth of the cluster's elastic IP address (EIP) and may take significantly longer.
Procedure
Step 1: Deploy a sample inference service
Create a file named vllm-service.yaml.
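The full sample manifest is not reproduced here. The following is only a rough, hypothetical sketch of what such a manifest could look like; the image, model paths, LoRA adapter path, and resource sizes are placeholders rather than values from this topic. What matters for the later steps is that the Pods carry the app: qwen label and expose port 8000 so that the InferencePool in Step 2 can select them, and that vLLM serves the base model as qwen and a LoRA adapter named travel-helper-v1.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen
  template:
    metadata:
      labels:
        app: qwen                      # Must match the InferencePool selector in Step 2.
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest # Placeholder image; use your own registry address.
        command: ["sh", "-c"]
        args:
        - |
          vllm serve /models/DeepSeek-R1-Distill-Qwen-7B \
            --served-model-name qwen \
            --enable-lora \
            --lora-modules travel-helper-v1=/models/travel-helper-lora \
            --port 8000
        ports:
        - containerPort: 8000          # Must match targetPortNumber in the InferencePool.
        resources:
          limits:
            nvidia.com/gpu: 1          # Placeholder; size according to your GPU type.
        # Volume mounts for the model files (for example, from a PVC) are omitted for brevity.
---
apiVersion: v1
kind: Service
metadata:
  name: qwen
  namespace: default
spec:
  selector:
    app: qwen
  ports:
  - name: http
    port: 8000
    targetPort: 8000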
Deploy the sample inference service.
kubectl apply -f vllm-service.yaml
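Optionally, wait for the inference server Pods to become ready before you continue; downloading and loading the model can take several minutes. The commands below assume that the Pods carry the app: qwen label that the InferencePool in Step 2 selects on.

kubectl get pods -l app=qwen
kubectl wait pods -l app=qwen --for=condition=Ready --timeout=15m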
Step 2: Configure inference routing
In this step, you will create InferencePool and InferenceModel resources. By adding the inference-epp-env.networking.x-k8s.io/experimental-use-queueing: "true" and inference-epp-env.networking.x-k8s.io/experimental-use-scheduler-v2: "true" annotations to InferencePool, you enable the request queuing feature for the selected inference services.
Create a file named inference-pool.yaml.

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  annotations:
    inference-epp-env.networking.x-k8s.io/experimental-use-queueing: "true"
    inference-epp-env.networking.x-k8s.io/experimental-use-scheduler-v2: "true"
  name: qwen-pool
  namespace: default
spec:
  extensionRef:
    group: ""
    kind: Service
    name: qwen-ext-proc
  selector:
    app: qwen
  targetPortNumber: 8000
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-model
spec:
  criticality: Critical
  modelName: qwen
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-pool
  targetModels:
  - name: qwen
    weight: 100
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: travel-helper-model
spec:
  criticality: Standard
  modelName: travel-helper
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-pool
  targetModels:
  - name: travel-helper-v1
    weight: 100

This configuration defines two InferenceModel resources that represent the two models served by the sample inference service:
qwen-model: Represents the base model qwen and is assigned the Critical criticality level.
travel-helper-model: Represents the LoRA model travel-helper and is assigned the Standard criticality level.
The available criticality levels, in order of priority, are Critical > Standard > Sheddable. When queuing is enabled and the backend servers are at full capacity, requests for higher-priority models are processed before lower-priority ones.
Deploy the inference routing configuration.
kubectl apply -f inference-pool.yaml
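Optionally, confirm that the resources were created. The following commands use the full resource names from the inference.networking.x-k8s.io API group shown above, in case short names are not registered in your cluster:

kubectl get inferencepools.inference.networking.x-k8s.io
kubectl get inferencemodels.inference.networking.x-k8s.io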
Step 3: Deploy the gateway and routing rules
In this step, you will configure a gateway and an HTTPRoute to route requests for the qwen and travel-helper models to the qwen-pool backend InferencePool.
Create a file named inference-gateway.yaml.

apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: inference-gateway
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: inference-gateway
  listeners:
  - name: llm-gw
    protocol: HTTP
    port: 8081
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
  namespace: default
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
    sectionName: llm-gw
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: qwen-pool
    matches:
    - headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: qwen
    - headers:
      - type: RegularExpression
        name: X-Gateway-Model-Name
        value: travel-helper.*
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: backend-timeout
spec:
  timeout:
    http:
      requestTimeout: 24h
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway

Deploy the gateway.
kubectl apply -f inference-gateway.yaml
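Before running the full benchmark, you can optionally send a single request through the gateway to confirm that routing works. This is only a sketch: it sets the X-Gateway-Model-Name header explicitly to match the HTTPRoute rules above, uses the OpenAI-compatible /v1/completions endpoint that the benchmark in Step 4 also calls, and must be run from an environment with access to the cluster's internal network because the address obtained here is a cluster-internal IP.

# Gateway cluster-internal IP (same command as in Step 4).
export GW_IP=$(kubectl get svc -n envoy-gateway-system -l gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=inference-gateway -o jsonpath='{.items[0].spec.clusterIP}')

# Request the high-priority qwen model; run this from a Pod or node inside the cluster.
curl -sS "http://${GW_IP}:8081/v1/completions" \
  -H "Content-Type: application/json" \
  -H "X-Gateway-Model-Name: qwen" \
  -d '{"model": "qwen", "prompt": "Hello", "max_tokens": 16}'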
Step 4: Validate queueing and priority scheduling
In this step, you will use the vLLM benchmark tool to simultaneously load test both the qwen and travel-helper models, pushing the inference servers to full capacity.
Deploy the benchmark workload.
kubectl apply -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: vllm-benchmark
  name: vllm-benchmark
  namespace: default
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: vllm-benchmark
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: vllm-benchmark
    spec:
      containers:
      - command:
        - sh
        - -c
        - sleep inf
        image: registry-cn-hangzhou.ack.aliyuncs.com/dev/llm-benchmark:random-and-qa
        imagePullPolicy: IfNotPresent
        name: vllm-benchmark
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
EOF

Get the internal IP of the gateway.

export GW_IP=$(kubectl get svc -n envoy-gateway-system -l gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=inference-gateway -o jsonpath='{.items[0].spec.clusterIP}')

Open two separate terminal windows and run the load tests simultaneously.
Important: The following data was generated in a test environment and is for reference only. Your results may vary depending on your environment.
Terminal 1: Load test the qwen (Critical) model.

kubectl exec -it deploy/vllm-benchmark -- env GW_IP=${GW_IP} python3 /root/vllm/benchmarks/benchmark_serving.py \
  --backend vllm \
  --model /models/DeepSeek-R1-Distill-Qwen-7B \
  --served-model-name qwen \
  --trust-remote-code \
  --dataset-name random \
  --random-prefix-len 1000 \
  --random-input-len 3000 \
  --random-output-len 3000 \
  --random-range-ratio 0.2 \
  --num-prompts 300 \
  --max-concurrency 60 \
  --host $GW_IP \
  --port 8081 \
  --endpoint /v1/completions \
  --save-result \
  2>&1 | tee benchmark_serving.txt

Expected output:
============ Serving Benchmark Result ============
Successful requests:                     293
Benchmark duration (s):                  1005.55
Total input tokens:                      1163919
Total generated tokens:                  837560
Request throughput (req/s):              0.29
Output token throughput (tok/s):         832.94
Total Token throughput (tok/s):          1990.43
---------------Time to First Token----------------
Mean TTFT (ms):                          21329.91
Median TTFT (ms):                        15754.01
P99 TTFT (ms):                           140782.55
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          58.58
Median TPOT (ms):                        58.36
P99 TPOT (ms):                           91.09
---------------Inter-token Latency----------------
Mean ITL (ms):                           58.32
Median ITL (ms):                         50.56
P99 ITL (ms):                            64.12
==================================================

Terminal 2: Load test the travel-helper (Standard) model.

kubectl exec -it deploy/vllm-benchmark -- env GW_IP=${GW_IP} python3 /root/vllm/benchmarks/benchmark_serving.py \
  --backend vllm \
  --model /models/DeepSeek-R1-Distill-Qwen-7B \
  --served-model-name travel-helper \
  --trust-remote-code \
  --dataset-name random \
  --random-prefix-len 1000 \
  --random-input-len 3000 \
  --random-output-len 3000 \
  --random-range-ratio 0.2 \
  --num-prompts 300 \
  --max-concurrency 60 \
  --host $GW_IP \
  --port 8081 \
  --endpoint /v1/completions \
  --save-result \
  2>&1 | tee benchmark_serving.txt

Expected output:
============ Serving Benchmark Result ============
Successful requests:                     165
Benchmark duration (s):                  889.41
Total input tokens:                      660560
Total generated tokens:                  492207
Request throughput (req/s):              0.19
Output token throughput (tok/s):         553.41
Total Token throughput (tok/s):          1296.10
---------------Time to First Token----------------
Mean TTFT (ms):                          44201.12
Median TTFT (ms):                        28757.03
P99 TTFT (ms):                           214710.13
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          67.38
Median TPOT (ms):                        60.51
P99 TPOT (ms):                           118.36
---------------Inter-token Latency----------------
Mean ITL (ms):                           66.98
Median ITL (ms):                         51.25
P99 ITL (ms):                            64.87
==================================================

The benchmark results confirm the effectiveness of priority scheduling under full load:
Faster response times: The mean Time to First Token (TTFT) of the high-priority qwen model was approximately 50% lower than that of the standard-priority travel-helper model (about 21.3 seconds versus 44.2 seconds).
Higher reliability: The qwen model also returned far fewer request errors: 7 failed requests out of 300, versus 135 for travel-helper, roughly 95% fewer.