Gateway with Inference Extension supports load-aware request queueing and priority scheduling for inference services. When a generative AI inference server is operating at full capacity, the gateway prioritizes requests in the queue based on their assigned model criticality. This ensures that requests for high-priority models are processed first. This topic introduces these capabilities of Gateway with Inference Extension.
This feature requires version 1.4.0 or later of Gateway with Inference Extension.
Background
For generative AI inference services, the request throughput of a single inference server is strictly limited by its GPU resources. When many concurrent requests are sent to the same server, resources such as the key-value (KV) cache in the inference engine become fully occupied, degrading response times and token throughput for all requests.
Gateway with Inference Extension addresses this by monitoring multiple metrics to assess the internal state of each inference server. When a server's load reaches capacity, the gateway queues incoming inference requests, preventing the server from being overloaded and maintaining overall service quality.
Prerequisites
An ACK managed cluster with GPU node pools is created. You can also install the ACK Virtual Node component in the ACK managed cluster to use ACS GPU computing power.
Gateway with Inference Extension 1.4.0 or later is installed with Enable Gateway API Inference Extension selected. For the installation procedure, see Step 2: Install the Gateway with Inference Extension component.
For the image used in this topic, we recommend A10 GPUs for ACK clusters and GN8IS GPUs for Alibaba Cloud Container Compute Service (ACS) GPU computing power.
Because the LLM image is large, we recommend that you push it to Container Registry in advance and pull it over the internal network. Pulling over the public network is limited by the bandwidth of the cluster's elastic IP address (EIP) and may take significantly longer.
Procedure
Step 1: Deploy a sample inference service
Create a file named vllm-service.yaml.
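The full sample manifest is not reproduced here. The following is only a rough, hypothetical sketch of what such a manifest could look like; the image, model paths, LoRA adapter path, and resource sizes are placeholders rather than values from this topic. What matters for the later steps is that the Pods carry the app: qwen label and expose port 8000 so that the InferencePool in Step 2 can select them, and that vLLM serves the base model as qwen and a LoRA adapter named travel-helper-v1.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen
  template:
    metadata:
      labels:
        app: qwen                      # Must match the InferencePool selector in Step 2.
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest # Placeholder image; use your own registry address.
        command: ["sh", "-c"]
        args:
        - |
          vllm serve /models/DeepSeek-R1-Distill-Qwen-7B \
            --served-model-name qwen \
            --enable-lora \
            --lora-modules travel-helper-v1=/models/travel-helper-lora \
            --port 8000
        ports:
        - containerPort: 8000          # Must match targetPortNumber in the InferencePool.
        resources:
          limits:
            nvidia.com/gpu: 1          # Placeholder; size according to your GPU type.
        # Volume mounts for the model files (for example, from a PVC) are omitted for brevity.
---
apiVersion: v1
kind: Service
metadata:
  name: qwen
  namespace: default
spec:
  selector:
    app: qwen
  ports:
  - name: http
    port: 8000
    targetPort: 8000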
Deploy the sample inference service.
kubectl apply -f vllm-service.yaml
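Optionally, wait for the inference server Pods to become ready before you continue; downloading and loading the model can take several minutes. The commands below assume that the Pods carry the app: qwen label that the InferencePool in Step 2 selects on.

kubectl get pods -l app=qwen
kubectl wait pods -l app=qwen --for=condition=Ready --timeout=15m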
Step 2: Configure inference routing
In this step, you will create InferencePool and InferenceModel resources. By adding the inference-epp-env.networking.x-k8s.io/experimental-use-queueing: "true" and inference-epp-env.networking.x-k8s.io/experimental-use-scheduler-v2: "true" annotations to InferencePool, you enable the request queuing feature for the selected inference services.
Create a file named inference-pool.yaml.

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  annotations:
    inference-epp-env.networking.x-k8s.io/experimental-use-queueing: "true"
    inference-epp-env.networking.x-k8s.io/experimental-use-scheduler-v2: "true"
  name: qwen-pool
  namespace: default
spec:
  extensionRef:
    group: ""
    kind: Service
    name: qwen-ext-proc
  selector:
    app: qwen
  targetPortNumber: 8000
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-model
spec:
  criticality: Critical
  modelName: qwen
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-pool
  targetModels:
  - name: qwen
    weight: 100
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: travel-helper-model
spec:
  criticality: Standard
  modelName: travel-helper
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-pool
  targetModels:
  - name: travel-helper-v1
    weight: 100

This configuration defines two InferenceModel resources that represent the two models served by the sample inference service:
qwen-model: Represents the base model qwen and is assigned the Critical criticality level.
travel-helper-model: Represents the LoRA model travel-helper and is assigned the Standard criticality level.
The available criticality levels, in order of priority, are Critical > Standard > Sheddable. When queuing is enabled and the backend servers are at full capacity, requests for higher-priority models are processed before lower-priority ones.
Deploy the inference routing configuration.
kubectl apply -f inference-pool.yaml
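Optionally, confirm that the resources were created. The following commands use the full resource names from the inference.networking.x-k8s.io API group shown above, in case short names are not registered in your cluster:

kubectl get inferencepools.inference.networking.x-k8s.io
kubectl get inferencemodels.inference.networking.x-k8s.io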
Step 3: Deploy the gateway and routing rules
In this step, you will configure a gateway and an HTTPRoute to route requests for the qwen and travel-helper models to the qwen-pool backend InferencePool.
Create a file named inference-gateway.yaml.

apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: inference-gateway
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: inference-gateway
  listeners:
  - name: llm-gw
    protocol: HTTP
    port: 8081
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
  namespace: default
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
    sectionName: llm-gw
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: qwen-pool
    matches:
    - headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: qwen
    - headers:
      - type: RegularExpression
        name: X-Gateway-Model-Name
        value: travel-helper.*
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: backend-timeout
spec:
  timeout:
    http:
      requestTimeout: 24h
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway

Deploy the gateway.
kubectl apply -f inference-gateway.yaml
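Before running the full benchmark, you can optionally send a single request through the gateway to confirm that routing works. This is only a sketch: it sets the X-Gateway-Model-Name header explicitly to match the HTTPRoute rules above, uses the OpenAI-compatible /v1/completions endpoint that the benchmark in Step 4 also calls, and must be run from an environment with access to the cluster's internal network because the address obtained here is a cluster-internal IP.

# Gateway cluster-internal IP (same command as in Step 4).
export GW_IP=$(kubectl get svc -n envoy-gateway-system -l gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=inference-gateway -o jsonpath='{.items[0].spec.clusterIP}')

# Request the high-priority qwen model; run this from a Pod or node inside the cluster.
curl -sS "http://${GW_IP}:8081/v1/completions" \
  -H "Content-Type: application/json" \
  -H "X-Gateway-Model-Name: qwen" \
  -d '{"model": "qwen", "prompt": "Hello", "max_tokens": 16}'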
Step 4: Validate queueing and priority scheduling
In this step, you will use the vLLM benchmark tool to simultaneously load test both the qwen and travel-helper models, pushing the inference servers to full capacity.
Deploy the benchmark workload.
kubectl apply -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: vllm-benchmark
  name: vllm-benchmark
  namespace: default
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: vllm-benchmark
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: vllm-benchmark
    spec:
      containers:
      - command:
        - sh
        - -c
        - sleep inf
        image: registry-cn-hangzhou.ack.aliyuncs.com/dev/llm-benchmark:random-and-qa
        imagePullPolicy: IfNotPresent
        name: vllm-benchmark
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
EOF

Get the internal IP of the gateway.

export GW_IP=$(kubectl get svc -n envoy-gateway-system -l gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=inference-gateway -o jsonpath='{.items[0].spec.clusterIP}')

Open two separate terminal windows and run the load tests simultaneously.
Important: The following data was generated in a test environment and is for reference only. Your results may vary depending on your environment.
Terminal 1: Load test the qwen (Critical) model.

kubectl exec -it deploy/vllm-benchmark -- env GW_IP=${GW_IP} python3 /root/vllm/benchmarks/benchmark_serving.py \
  --backend vllm \
  --model /models/DeepSeek-R1-Distill-Qwen-7B \
  --served-model-name qwen \
  --trust-remote-code \
  --dataset-name random \
  --random-prefix-len 1000 \
  --random-input-len 3000 \
  --random-output-len 3000 \
  --random-range-ratio 0.2 \
  --num-prompts 300 \
  --max-concurrency 60 \
  --host $GW_IP \
  --port 8081 \
  --endpoint /v1/completions \
  --save-result \
  2>&1 | tee benchmark_serving.txt

Expected output:
============ Serving Benchmark Result ============
Successful requests:                     293
Benchmark duration (s):                  1005.55
Total input tokens:                      1163919
Total generated tokens:                  837560
Request throughput (req/s):              0.29
Output token throughput (tok/s):         832.94
Total Token throughput (tok/s):          1990.43
---------------Time to First Token----------------
Mean TTFT (ms):                          21329.91
Median TTFT (ms):                        15754.01
P99 TTFT (ms):                           140782.55
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          58.58
Median TPOT (ms):                        58.36
P99 TPOT (ms):                           91.09
---------------Inter-token Latency----------------
Mean ITL (ms):                           58.32
Median ITL (ms):                         50.56
P99 ITL (ms):                            64.12
==================================================

Terminal 2: Load test the travel-helper (Standard) model.

kubectl exec -it deploy/vllm-benchmark -- env GW_IP=${GW_IP} python3 /root/vllm/benchmarks/benchmark_serving.py \
  --backend vllm \
  --model /models/DeepSeek-R1-Distill-Qwen-7B \
  --served-model-name travel-helper \
  --trust-remote-code \
  --dataset-name random \
  --random-prefix-len 1000 \
  --random-input-len 3000 \
  --random-output-len 3000 \
  --random-range-ratio 0.2 \
  --num-prompts 300 \
  --max-concurrency 60 \
  --host $GW_IP \
  --port 8081 \
  --endpoint /v1/completions \
  --save-result \
  2>&1 | tee benchmark_serving.txt

Expected output:
============ Serving Benchmark Result ============
Successful requests:                     165
Benchmark duration (s):                  889.41
Total input tokens:                      660560
Total generated tokens:                  492207
Request throughput (req/s):              0.19
Output token throughput (tok/s):         553.41
Total Token throughput (tok/s):          1296.10
---------------Time to First Token----------------
Mean TTFT (ms):                          44201.12
Median TTFT (ms):                        28757.03
P99 TTFT (ms):                           214710.13
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          67.38
Median TPOT (ms):                        60.51
P99 TPOT (ms):                           118.36
---------------Inter-token Latency----------------
Mean ITL (ms):                           66.98
Median ITL (ms):                         51.25
P99 ITL (ms):                            64.87
==================================================

The benchmark results confirm the effectiveness of priority scheduling under full load:
Faster response times: The mean Time to First Token (TTFT) of the high-priority qwen model was approximately 50% lower than that of the standard-priority travel-helper model (about 21.3 seconds versus 44.2 seconds).
Higher reliability: The qwen model also returned far fewer request errors: 7 failed requests out of 300, versus 135 for travel-helper, roughly 95% fewer.