Gateway with Inference Extension supports load-aware request queueing and priority scheduling for inference services. When GPU resources are saturated, the gateway queues incoming requests and processes them based on model priority, ensuring high-priority models receive faster responses.
This feature requires Gateway with Inference Extension version 1.4.0 or later.
How it works
Generative AI inference servers have a hard limit on GPU throughput. When too many concurrent requests arrive, resources such as KV (key-value) cache fill up entirely, degrading response times and token throughput for all requests.
Gateway with Inference Extension monitors each inference server's internal metrics to detect saturation. When a server is at capacity, the gateway queues requests centrally rather than letting them pile up inside the server. It then dispatches queued requests in priority order: requests for high-priority models are always served before requests for lower-priority ones.
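The saturation check can be pictured with a minimal Python sketch. The metric fields (`kv_cache_utilization`, `waiting_requests`) and both thresholds are hypothetical values chosen for illustration; they are not the gateway's actual metrics or limits.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
KV_CACHE_LIMIT = 0.95  # fraction of the KV cache in use
WAITING_LIMIT = 5      # requests already waiting inside the server


@dataclass
class ServerMetrics:
    """Snapshot of one inference server's internal metrics (assumed fields)."""
    name: str
    kv_cache_utilization: float
    waiting_requests: int


def is_saturated(m: ServerMetrics) -> bool:
    """A server counts as saturated when either assumed limit is reached."""
    return (m.kv_cache_utilization >= KV_CACHE_LIMIT
            or m.waiting_requests >= WAITING_LIMIT)


def pick_server(servers):
    """Return the least-loaded unsaturated server, or None to signal
    that the request should stay in the gateway queue."""
    candidates = [s for s in servers if not is_saturated(s)]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.kv_cache_utilization)
```

When `pick_server` returns `None`, every backend is at capacity, and in this sketch the request would remain queued at the gateway instead of being forwarded.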
Priority levels
Assign a criticality level to each InferenceModel to control queue priority under saturation:
| Criticality level | Priority order |
|---|---|
| Critical | Highest |
| Standard | Medium |
| Schedulable | Lowest |
When queueing is enabled and backend servers are saturated, Critical requests are dispatched before Standard requests, which are dispatched before Schedulable requests.
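The dispatch order can be illustrated with a small Python priority queue. This is a sketch, not the gateway's implementation: the class name is hypothetical, and first-in-first-out ordering within a single criticality level is an assumption.

```python
import heapq
import itertools

# Lower number = dispatched earlier, following the criticality levels above.
PRIORITY = {"Critical": 0, "Standard": 1, "Schedulable": 2}


class PriorityRequestQueue:
    """Toy gateway queue: strict priority across levels, FIFO within a level."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker that preserves arrival order

    def enqueue(self, request_id: str, criticality: str) -> None:
        heapq.heappush(
            self._heap, (PRIORITY[criticality], next(self._seq), request_id)
        )

    def dispatch(self) -> str:
        """Pop the next request to send once a backend has capacity."""
        _, _, request_id = heapq.heappop(self._heap)
        return request_id

    def __len__(self) -> int:
        return len(self._heap)


queue = PriorityRequestQueue()
queue.enqueue("r1", "Schedulable")
queue.enqueue("r2", "Standard")
queue.enqueue("r3", "Critical")
queue.enqueue("r4", "Standard")

order = [queue.dispatch() for _ in range(len(queue))]
print(order)  # ['r3', 'r2', 'r4', 'r1']
```

The Critical request leaves the queue first even though it arrived third, and the two Standard requests keep their arrival order.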
Prerequisites
Before you begin, make sure you have:
- An ACK managed cluster with GPU node pools. Alternatively, install the ACK Virtual Node component to use ACS (Alibaba Cloud Container Compute Service) GPU computing power.
- Gateway with Inference Extension 1.4.0 or later installed with Enable Gateway API Inference Extension selected. For the installation entry point, see Step 2: Install the Gateway with Inference Extension component.
For the image described in this topic, use A10 cards for ACK clusters and GN8IS cards for ACS GPU computing power. Due to the large size of the LLM image, transfer it to Container Registry in advance and pull it using the internal network address. Pulling from the public network depends on the bandwidth of the cluster elastic IP address (EIP) and may result in longer wait times.
Enable request queueing and priority scheduling
Step 2: Configure inference routing
Create InferencePool and InferenceModel resources. Adding the inference-epp-env.networking.x-k8s.io/experimental-use-queueing: "true" and inference-epp-env.networking.x-k8s.io/experimental-use-scheduler-v2: "true" annotations to InferencePool enables request queueing for the selected inference services.
- Create a file named inference-pool.yaml.

      apiVersion: inference.networking.x-k8s.io/v1alpha2
      kind: InferencePool
      metadata:
        annotations:
          inference-epp-env.networking.x-k8s.io/experimental-use-queueing: "true"
          inference-epp-env.networking.x-k8s.io/experimental-use-scheduler-v2: "true"
        name: qwen-pool
        namespace: default
      spec:
        extensionRef:
          group: ""
          kind: Service
          name: qwen-ext-proc
        selector:
          app: qwen
        targetPortNumber: 8000
      ---
      apiVersion: inference.networking.x-k8s.io/v1alpha2
      kind: InferenceModel
      metadata:
        name: qwen-model
      spec:
        criticality: Critical
        modelName: qwen
        poolRef:
          group: inference.networking.x-k8s.io
          kind: InferencePool
          name: qwen-pool
        targetModels:
        - name: qwen
          weight: 100
      ---
      apiVersion: inference.networking.x-k8s.io/v1alpha2
      kind: InferenceModel
      metadata:
        name: travel-helper-model
      spec:
        criticality: Standard
        modelName: travel-helper
        poolRef:
          group: inference.networking.x-k8s.io
          kind: InferencePool
          name: qwen-pool
        targetModels:
        - name: travel-helper-v1
          weight: 100

  This configuration defines two InferenceModel resources for the models served by the sample inference service:

  | Model | Criticality |
  |---|---|
  | qwen-model (base model qwen) | Critical |
  | travel-helper-model (LoRA model travel-helper) | Standard |

- Deploy the inference routing configuration.

      kubectl apply -f inference-pool.yaml
Step 3: Deploy the gateway and routing rules
Configure a Gateway and an HTTPRoute to route requests for the qwen and travel-helper models to the qwen-pool backend InferencePool.
- Create a file named inference-gateway.yaml.

      apiVersion: gateway.networking.k8s.io/v1
      kind: GatewayClass
      metadata:
        name: inference-gateway
      spec:
        controllerName: gateway.envoyproxy.io/gatewayclass-controller
      ---
      apiVersion: gateway.networking.k8s.io/v1
      kind: Gateway
      metadata:
        name: inference-gateway
      spec:
        gatewayClassName: inference-gateway
        listeners:
        - name: llm-gw
          protocol: HTTP
          port: 8081
      ---
      apiVersion: gateway.networking.k8s.io/v1
      kind: HTTPRoute
      metadata:
        name: llm-route
        namespace: default
      spec:
        parentRefs:
        - group: gateway.networking.k8s.io
          kind: Gateway
          name: inference-gateway
          sectionName: llm-gw
        rules:
        - backendRefs:
          - group: inference.networking.x-k8s.io
            kind: InferencePool
            name: qwen-pool
          matches:
          - headers:
            - type: Exact
              name: X-Gateway-Model-Name
              value: qwen
          - headers:
            - type: RegularExpression
              name: X-Gateway-Model-Name
              value: travel-helper.*
      ---
      apiVersion: gateway.envoyproxy.io/v1alpha1
      kind: BackendTrafficPolicy
      metadata:
        name: backend-timeout
      spec:
        timeout:
          http:
            requestTimeout: 24h
        targetRef:
          group: gateway.networking.k8s.io
          kind: Gateway
          name: inference-gateway

- Deploy the gateway.

      kubectl apply -f inference-gateway.yaml
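To see which model names the two header matches in llm-route would accept, the following Python sketch mimics the Exact and RegularExpression rules on X-Gateway-Model-Name. It is only an illustration: regular-expression semantics depend on the gateway implementation, and the sketch assumes the pattern must match the whole header value. The function name `routes_to_pool` is hypothetical.

```python
import re

# (match type, value) pairs copied from the X-Gateway-Model-Name matches in llm-route.
HEADER_MATCHES = [
    ("Exact", "qwen"),
    ("RegularExpression", "travel-helper.*"),
]


def routes_to_pool(model_name: str) -> bool:
    """Return True if the header value would match one of the route's rules.

    Assumption: RegularExpression is evaluated as a full-string match.
    """
    for match_type, value in HEADER_MATCHES:
        if match_type == "Exact" and model_name == value:
            return True
        if match_type == "RegularExpression" and re.fullmatch(value, model_name):
            return True
    return False


for name in ["qwen", "travel-helper", "travel-helper-v1", "qwen-2", "other"]:
    print(name, routes_to_pool(name))
```

Under these assumptions, `qwen`, `travel-helper`, and `travel-helper-v1` are routed to qwen-pool, while `qwen-2` and `other` match neither rule.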
Step 4: Validate queueing and priority scheduling
Use the vLLM benchmark tool to simultaneously load test both the qwen and travel-helper models, pushing the inference servers to full capacity.
- Deploy the benchmark workload.

      kubectl apply -f- <<EOF
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        labels:
          app: vllm-benchmark
        name: vllm-benchmark
        namespace: default
      spec:
        progressDeadlineSeconds: 600
        replicas: 1
        revisionHistoryLimit: 10
        selector:
          matchLabels:
            app: vllm-benchmark
        strategy:
          rollingUpdate:
            maxSurge: 25%
            maxUnavailable: 25%
          type: RollingUpdate
        template:
          metadata:
            creationTimestamp: null
            labels:
              app: vllm-benchmark
          spec:
            containers:
            - command:
              - sh
              - -c
              - sleep inf
              image: registry-cn-hangzhou.ack.aliyuncs.com/dev/llm-benchmark:random-and-qa
              imagePullPolicy: IfNotPresent
              name: vllm-benchmark
              resources: {}
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
            dnsPolicy: ClusterFirst
            restartPolicy: Always
            schedulerName: default-scheduler
            securityContext: {}
            terminationGracePeriodSeconds: 30
      EOF

- Get the internal IP of the gateway.

      export GW_IP=$(kubectl get svc -n envoy-gateway-system -l gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=inference-gateway -o jsonpath='{.items[0].spec.clusterIP}')

- Open two separate terminal windows and run the load tests simultaneously.

Important: The following data was generated in a test environment and is for reference only. Your results may vary depending on your environment.
Terminal 1: Load test the `qwen` (Critical) model

    kubectl exec -it deploy/vllm-benchmark -- env GW_IP=${GW_IP} python3 /root/vllm/benchmarks/benchmark_serving.py \
      --backend vllm \
      --model /models/DeepSeek-R1-Distill-Qwen-7B \
      --served-model-name qwen \
      --trust-remote-code \
      --dataset-name random \
      --random-prefix-len 1000 \
      --random-input-len 3000 \
      --random-output-len 3000 \
      --random-range-ratio 0.2 \
      --num-prompts 300 \
      --max-concurrency 60 \
      --host $GW_IP \
      --port 8081 \
      --endpoint /v1/completions \
      --save-result \
      2>&1 | tee benchmark_serving.txt

Expected output:

    ============ Serving Benchmark Result ============
    Successful requests: 293
    Benchmark duration (s): 1005.55
    Total input tokens: 1163919
    Total generated tokens: 837560
    Request throughput (req/s): 0.29
    Output token throughput (tok/s): 832.94
    Total Token throughput (tok/s): 1990.43
    ---------------Time to First Token----------------
    Mean TTFT (ms): 21329.91
    Median TTFT (ms): 15754.01
    P99 TTFT (ms): 140782.55
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms): 58.58
    Median TPOT (ms): 58.36
    P99 TPOT (ms): 91.09
    ---------------Inter-token Latency----------------
    Mean ITL (ms): 58.32
    Median ITL (ms): 50.56
    P99 ITL (ms): 64.12
    ==================================================

Terminal 2: Load test the `travel-helper` (Standard) model

    kubectl exec -it deploy/vllm-benchmark -- env GW_IP=${GW_IP} python3 /root/vllm/benchmarks/benchmark_serving.py \
      --backend vllm \
      --model /models/DeepSeek-R1-Distill-Qwen-7B \
      --served-model-name travel-helper \
      --trust-remote-code \
      --dataset-name random \
      --random-prefix-len 1000 \
      --random-input-len 3000 \
      --random-output-len 3000 \
      --random-range-ratio 0.2 \
      --num-prompts 300 \
      --max-concurrency 60 \
      --host $GW_IP \
      --port 8081 \
      --endpoint /v1/completions \
      --save-result \
      2>&1 | tee benchmark_serving.txt

Expected output:

    ============ Serving Benchmark Result ============
    Successful requests: 165
    Benchmark duration (s): 889.41
    Total input tokens: 660560
    Total generated tokens: 492207
    Request throughput (req/s): 0.19
    Output token throughput (tok/s): 553.41
    Total Token throughput (tok/s): 1296.10
    ---------------Time to First Token----------------
    Mean TTFT (ms): 44201.12
    Median TTFT (ms): 28757.03
    P99 TTFT (ms): 214710.13
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms): 67.38
    Median TPOT (ms): 60.51
    P99 TPOT (ms): 118.36
    ---------------Inter-token Latency----------------
    Mean ITL (ms): 66.98
    Median ITL (ms): 51.25
    P99 ITL (ms): 64.87
    ==================================================

The results confirm priority scheduling under full load:

| Metric | qwen (Critical) | travel-helper (Standard) |
|---|---|---|
| Mean TTFT (ms) | 21,329 | 44,201 |
| Successful requests | 293 / 300 (97.7%) | 165 / 300 (55%) |
| Mean TPOT (ms) | 58.58 | 67.38 |

Under saturation, Standard-priority requests wait longer in the gateway queue before dispatching, which is why their Time to First Token (TTFT) is approximately 2x higher than for Critical requests. This is the expected trade-off: Critical requests are served first, while Standard requests wait.
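The comparison can be reproduced from the two reports with a short Python sketch. The report excerpts below are copied from the sample outputs in this topic; the parsing helper `parse_report` is a hypothetical convenience and is not part of the vLLM benchmark tool.

```python
import re

# Excerpts of the two sample benchmark reports from this topic.
QWEN_REPORT = """
Successful requests: 293
Mean TTFT (ms): 21329.91
Mean TPOT (ms): 58.58
"""
TRAVEL_HELPER_REPORT = """
Successful requests: 165
Mean TTFT (ms): 44201.12
Mean TPOT (ms): 67.38
"""
NUM_PROMPTS = 300  # value passed as --num-prompts in both runs


def parse_report(text: str) -> dict:
    """Turn 'Label: number' lines into a {label: float} mapping."""
    metrics = {}
    for line in text.splitlines():
        m = re.match(r"^(.+?):\s+([0-9.]+)$", line.strip())
        if m:
            metrics[m.group(1)] = float(m.group(2))
    return metrics


critical = parse_report(QWEN_REPORT)
standard = parse_report(TRAVEL_HELPER_REPORT)

ttft_ratio = standard["Mean TTFT (ms)"] / critical["Mean TTFT (ms)"]
tpot_ratio = standard["Mean TPOT (ms)"] / critical["Mean TPOT (ms)"]
print(f"TTFT ratio (Standard / Critical): {ttft_ratio:.2f}")
print(f"TPOT ratio (Standard / Critical): {tpot_ratio:.2f}")
for label, report in (("Critical", critical), ("Standard", standard)):
    rate = report["Successful requests"] / NUM_PROMPTS
    print(f"{label} success rate: {rate:.1%}")
```

For the sample data, the mean TTFT ratio comes out at about 2.07 while the mean TPOT ratio is only about 1.15, which is consistent with the extra latency being spent waiting in the queue rather than during token generation.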