The Gateway with Inference Extension component lets you specify different load balancing strategies for inference service routing based on the usage scenarios of your generative AI inference services. This topic describes how to implement a prefix-aware load balancing strategy by using the Gateway with Inference Extension component.
Before reading this topic, make sure you understand the concepts of InferencePool and InferenceModel.
This topic requires Gateway with Inference Extension 1.4.0 or later.
Background information
APC of vLLM
vLLM supports automatic prefix caching (APC). APC caches the key-value (KV) cache that vLLM has already computed for previous requests. If a new request shares a prefix with a historical request, it can directly reuse the existing KV cache and skip the KV cache computation for the shared prefix, which accelerates the processing of LLM inference requests.
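For reference, APC is a server-side feature that is enabled when the vLLM server is started. The following is a minimal sketch, assuming a vLLM version that provides the vllm serve command and the --enable-prefix-caching flag (in newer vLLM versions, prefix caching may already be enabled by default), and using the /model/qwen model path and port 8000 that appear later in this topic:
# Start an OpenAI-compatible vLLM server with automatic prefix caching enabled.
vllm serve /model/qwen --port 8000 --enable-prefix-caching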
Prefix-aware load balancing strategy
A prefix-aware load balancing strategy sends requests that share the same prefix content to the same inference server pod whenever possible.
When the APC feature is enabled on the model server, a prefix-aware load balancing strategy can maximize the cache hit ratio and reduce request response time. This strategy is suitable for scenarios in which many requests share prefixes. Evaluate whether it fits your actual business scenario.
Typical usage scenarios:
Long document queries: Users repeatedly query the same long document (such as software manuals or annual reports) with different queries.
Multi-round conversations: Users may interact with the application multiple times in the same session.
Prerequisites
An ACK managed cluster with GPU node pools is created. You can also install the ACK Virtual Node component in the ACK managed cluster to use ACS GPU computing power.
Gateway with Inference Extension 1.4.0 or later is installed, and Enable Gateway API Inference Extension is selected during installation. For more information about the operation entry, see Step 2: Install the Gateway with Inference Extension component.
For the image used in this topic, we recommend that you use A10 cards for ACK clusters and GN8IS cards for Alibaba Cloud Container Compute Service (ACS) GPU computing power.
Due to the large size of the LLM image, we recommend that you push it to Container Registry in advance and pull it over the internal network address. The speed of pulling the image over the public network depends on the bandwidth of the cluster's elastic IP address (EIP), which may result in long wait times.
Procedure
Step 1: Deploy a sample inference service
Create vllm-service.yaml.
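The complete manifest depends on your environment. The following is a minimal sketch of what vllm-service.yaml might look like; the container image, replica count, model volume, and GPU resources are placeholders that you must replace with your own values. The app: qwen label, the qwen Service name, port 8000, and the /model/qwen model path must match the InferencePool, InferenceModel, and HTTPRoute resources created in the following steps.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen
  labels:
    app: qwen
spec:
  replicas: 3
  selector:
    matchLabels:
      app: qwen
  template:
    metadata:
      labels:
        app: qwen
    spec:
      containers:
      - name: vllm
        # Placeholder image: replace with the vLLM image that you pushed to Container Registry.
        image: <your-registry>/vllm-openai:latest
        command:
        - sh
        - -c
        # --enable-prefix-caching turns on APC; newer vLLM versions may enable it by default.
        - vllm serve /model/qwen --port 8000 --enable-prefix-caching
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1   # Placeholder: adjust to your GPU type and count.
---
apiVersion: v1
kind: Service
metadata:
  name: qwen
spec:
  selector:
    app: qwen
  ports:
  - name: http
    port: 8000
    targetPort: 8000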
Deploy the sample inference service.
kubectl apply -f vllm-service.yaml
Step 2: Deploy inference routing
In this step, you create InferencePool and InferenceModel resources.
Create inference-pool.yaml.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  annotations:
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
  name: vllm-qwen-pool
spec:
  targetPortNumber: 8000
  selector:
    app: qwen
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: inferencemodel-qwen
spec:
  modelName: /model/qwen
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: vllm-qwen-pool
  targetModels:
  - name: /model/qwen
    weight: 100
In the InferencePool resource, the inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE" annotation enables the prefix-aware load balancing strategy for the pods in the InferencePool.
Deploy the inference routing.
kubectl apply -f inference-pool.yaml
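Optionally, confirm that the resources have been created. This check uses the fully qualified resource names from the inference.networking.x-k8s.io API group; the exact output columns depend on the installed CRD versions.
kubectl get inferencepools.inference.networking.x-k8s.io vllm-qwen-pool
kubectl get inferencemodels.inference.networking.x-k8s.io inferencemodel-qwen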
Step 3: Deploy the gateway and gateway routing rules
In this step, you create a gateway with ports 8080 and 8081. On port 8081 of the gateway, an HTTPRoute resource specifies the InferencePool resource provided by the inference extension as the routing backend, so inference requests are routed to the pod set specified by the InferencePool resource. On port 8080 of the gateway, an HTTPRoute resource specifies the Service as the routing backend, so inference requests are routed to the same pod set by using the standard HTTP least-request load balancing strategy.
Create inference-gateway.yaml.
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: qwen-inference-gateway-class
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: qwen-inference-gateway
spec:
  gatewayClassName: qwen-inference-gateway-class
  listeners:
  - name: http
    protocol: HTTP
    port: 8080
  - name: llm-gw
    protocol: HTTP
    port: 8081
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: qwen-backend
spec:
  parentRefs:
  - name: qwen-inference-gateway
    sectionName: llm-gw
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-qwen-pool
    matches:
    - path:
        type: PathPrefix
        value: /
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: qwen-backend-no-inference
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: qwen-inference-gateway
    sectionName: http
  rules:
  - backendRefs:
    - group: ""
      kind: Service
      name: qwen
      port: 8000
      weight: 1
    matches:
    - path:
        type: PathPrefix
        value: /
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: backend-timeout
spec:
  timeout:
    http:
      requestTimeout: 1h
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: qwen-inference-gateway
Deploy the gateway.
kubectl apply -f inference-gateway.yaml
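Optionally, confirm that the gateway and routes have been accepted before you continue. These are standard Gateway API status checks; the gateway is ready when it reports an address and its listeners are programmed.
kubectl get gateway qwen-inference-gateway
kubectl get httproute qwen-backend qwen-backend-no-inference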
Step 4: Verify the routing rules
Create round1.txt and round2.txt. Both files contain the same content section, which serves as the shared request prefix. You can check whether the prefix-aware feature of intelligent routing is triggered by sending round1.txt and round2.txt as LLM request bodies in sequence, and then checking the logs of the extension component referenced by the extensionRef field of the inference routing.
round1.txt:
echo '{"max_tokens":24,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"/model/qwen","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txtround2.txt:
echo '{"max_tokens":3,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you're setting up a fun test. I'm ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"/model/qwen","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txtGet the public IP address of the gateway.
export GATEWAY_IP=$(kubectl get gateway/qwen-inference-gateway -o jsonpath='{.status.addresses[0].value}')
Initiate two session requests to simulate a multi-round conversation scenario.
curl -X POST ${GATEWAY_IP}:8081/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
curl -X POST ${GATEWAY_IP}:8081/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
View the logs to confirm whether prefix load balancing is effective.
kubectl logs deploy/epp-default-inference-gateway-ext-proc -n envoy-gateway-system | grep "Do prefix"
Expected output:
2025-05-23T03:33:09Z INFO scheduling/prefixcache_filter.go:311 Do prefix-aware routing! {"request": "v68m4zx472", "matching ratio": " 0.54 > 0.50"}
The output contains the Do prefix-aware routing! message, which indicates that prefix load balancing is effective.
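Depending on your vLLM version, the model server may also print prefix cache statistics in its logs, which you can use to cross-check that the shared prefix is actually being reused. A hedged check, assuming the inference pods carry the app: qwen label from the sample deployment and that your vLLM version logs prefix cache metrics:
kubectl logs -l app=qwen --tail=200 | grep -i "prefix cache"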
(Optional) Step 5: Evaluate inference service performance through multi-round conversation testing
This step demonstrates how to use a stress testing tool for multi-round conversation testing to compare the effects of regular HTTP routing and inference routing with prefix load balancing. In this example, an ACK cluster is used.
Deploy the llm-qa-benchmark stress testing tool.
kubectl apply -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: llm-qa-benchmark
  name: llm-qa-benchmark
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-qa-benchmark
  template:
    metadata:
      labels:
        app: llm-qa-benchmark
    spec:
      containers:
      - command:
        - sh
        - -c
        - sleep inf
        image: registry-cn-hangzhou.ack.aliyuncs.com/dev/llm-qa-benchmark:v0.1
        imagePullPolicy: IfNotPresent
        name: llm-qa-benchmark
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      restartPolicy: Always
EOF
Get the internal IP address of the gateway.
export GW_IP=$(kubectl get svc -n envoy-gateway-system -l gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=qwen-inference-gateway -o jsonpath='{.items[0].spec.clusterIP}')
Perform the stress test.
Important: The following test results are generated in a test environment. Actual results may vary depending on your specific environment.
Regular HTTP routing
kubectl exec -it deploy/llm-qa-benchmark -- env GW_IP=${GW_IP} python3 multi-round-qa.py \
  --num-users 8 \
  --num-rounds 15 \
  --qps 0.1 \
  --shared-system-prompt 100 \
  --sharegpt \
  --user-history-prompt 2000 \
  --answer-len 100 \
  --model /model/qwen \
  --time 600 \
  --base-url http://${GW_IP}:8080/v1
Expected output:
==================== Performance summary ======================
QPS: 0.1000 reqs/s
Processing speed: 0.1080 reqs/s
Requests on-the-fly: 0
Input tokens per second: 259.0703 tokens/s
Output tokens per second: 4.8576 tokens/s
Average generation throughput (per request): 26.6710 tokens/req/s
Average TTFT: 0.3669s
Time range: 1748231183.2753935 - 1748231766.4799275 (583.20s)
===============================================================
Inference service routing
kubectl exec -it deploy/llm-qa-benchmark -- env GW_IP=${GW_IP} python3 multi-round-qa.py \
  --num-users 8 \
  --num-rounds 15 \
  --qps 0.1 \
  --shared-system-prompt 100 \
  --sharegpt \
  --user-history-prompt 2000 \
  --answer-len 100 \
  --model /model/qwen \
  --time 600 \
  --base-url http://${GW_IP}:8081/v1
Expected output:
==================== Performance summary ======================
QPS: 0.1000 reqs/s
Processing speed: 0.1081 reqs/s
Requests on-the-fly: 0
Input tokens per second: 259.3009 tokens/s
Output tokens per second: 4.8548 tokens/s
Average generation throughput (per request): 26.9300 tokens/req/s
Average TTFT: 0.2761s
Time range: 1748231885.874972 - 1748232468.5918882 (582.72s)
===============================================================
You can see that the Average TTFT value of inference service routing (0.2761s) shows a significant improvement compared to regular HTTP routing (0.3669s).