The Gateway with Inference Extension component lets you specify different load balancing policies for inference service routing in various generative AI scenarios. This topic describes how to use the Gateway with Inference Extension component to implement prefix-cache-aware routing in estimation mode.
Before you begin, make sure that you understand the concepts of InferencePool and InferenceModel.
This topic requires the Gateway with Inference Extension component version 1.4.0 or later.
Background information
Automatic prefix caching in vLLM
vLLM supports automatic prefix caching (APC). APC caches the key-value (KV) state that vLLM has computed for previous requests. If a new request shares a prefix with a previous request, it can reuse the existing KV cache. This allows the new request to skip the KV cache calculation for the shared prefix. This process accelerates the processing of large language model (LLM) inference requests.
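The idea behind APC can be illustrated with a minimal sketch: KV state is cached per block of tokens, and each block's cache key covers the entire prefix up to that block, so a key match implies the whole preceding prefix matches. The block size, hashing scheme, and class names below are illustrative assumptions, not vLLM's actual implementation.

```python
# Conceptual sketch of automatic prefix caching (APC).
# BLOCK_SIZE and the hashing scheme are illustrative only.
from hashlib import sha256

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative)

class PrefixKVCache:
    def __init__(self):
        self.blocks = {}  # block hash -> cached KV state (stubbed here)

    def block_hashes(self, tokens):
        # Each block's hash covers the entire prefix up to that block,
        # so a hash match implies the whole preceding prefix matches.
        hashes = []
        for i in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
            prefix = tuple(tokens[: i + BLOCK_SIZE])
            hashes.append(sha256(repr(prefix).encode()).hexdigest())
        return hashes

    def cached_prefix_len(self, tokens):
        # Number of leading tokens whose KV state can be reused.
        n = 0
        for h in self.block_hashes(tokens):
            if h not in self.blocks:
                break
            n += BLOCK_SIZE
        return n

    def insert(self, tokens):
        for h in self.block_hashes(tokens):
            self.blocks[h] = "kv-state"  # placeholder for real KV tensors

cache = PrefixKVCache()
cache.insert(list(range(40)))                        # first request: caches 2 full blocks
print(cache.cached_prefix_len(list(range(40))))      # 32: both full blocks reused
print(cache.cached_prefix_len(list(range(20, 60))))  # 0: no shared prefix
```

A request that shares the first 32 tokens with an earlier request skips recomputing the KV state for those tokens; a request with no shared prefix gets no benefit.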
Prefix-cache-aware routing in estimation mode
Prefix-cache-aware routing in estimation mode is a load balancing policy that sends requests that share the same prefix to the same inference server pod whenever possible. The Gateway with Inference Extension component tracks requests routed to each inference server. It then estimates the prefix cache status of each server to improve the cache hit ratio of the inference engine.
When the APC feature is enabled on the model server, the prefix-cache-aware routing policy in estimation mode can maximize the cache hit ratio and reduce the request response time. This policy is ideal for scenarios with many requests that share prefixes. You should evaluate its suitability based on your business needs.
Typical scenarios include the following:
Long document queries: Users repeatedly query the same long document, such as a software manual or an annual report, with different queries.
Multi-turn conversations: Users may interact with the application multiple times in the same chat session.
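A minimal sketch of how estimation-mode routing could behave: the gateway remembers which prompt prefixes it routed to each pod and estimates a matching ratio for each new request, routing to the best-matching pod when the ratio exceeds a threshold (0.50 here, mirroring the "matching ratio > 0.50" check that appears in the component's logs). The chunk size, data structures, and function names are assumptions for illustration, not the component's actual implementation.

```python
# Illustrative sketch of prefix-cache-aware routing in estimation mode.
# Data structures and parameters are assumptions, not the real implementation.
CHUNK = 8         # characters per tracked prefix chunk (illustrative)
THRESHOLD = 0.50  # route to the best pod only above this matching ratio

routed_prefixes = {}  # pod name -> set of prefix-chunk keys seen on that pod

def chunk_keys(prompt):
    return [prompt[: i + CHUNK]
            for i in range(0, len(prompt) - len(prompt) % CHUNK, CHUNK)]

def matching_ratio(prompt, pod):
    keys = chunk_keys(prompt)
    if not keys:
        return 0.0
    seen = routed_prefixes.get(pod, set())
    matched = 0
    for k in keys:
        if k not in seen:
            break  # only the contiguous leading prefix counts
        matched += 1
    return matched / len(keys)

def pick_pod(prompt, pods, fallback):
    best = max(pods, key=lambda p: matching_ratio(prompt, p))
    ratio = matching_ratio(prompt, best)
    chosen = best if ratio > THRESHOLD else fallback(pods)
    routed_prefixes.setdefault(chosen, set()).update(chunk_keys(prompt))
    return chosen

pods = ["pod-a", "pod-b"]
first = pick_pod("shared system prompt ... user 1", pods,
                 fallback=lambda ps: ps[0])
second = pick_pod("shared system prompt ... user 1, turn 2", pods,
                  fallback=lambda ps: ps[1])
print(first, second)  # the second request follows the first to the same pod
```

The second request extends the first request's prompt, so its estimated matching ratio against the first pod exceeds the threshold and it is routed there, where the KV cache for the shared prefix is likely warm.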
Prerequisites
An ACK managed cluster with a GPU node pool is available. You can also install the ACK Virtual Node component in the ACK managed cluster to use ACS GPU computing power.
The Gateway with Inference Extension component of version 1.4.0 or later is installed and you have selected Enable Gateway API Inference Extension. For more information, see Install components.
For the image used in this topic, we recommend A10 GPUs for ACK clusters and GN8IS GPUs for Alibaba Cloud Container Compute Service (ACS) GPU computing power.
Because the LLM image is large, we recommend that you push it to Container Registry in advance and pull it over the internal network. The speed of pulling from the public network depends on the bandwidth of the cluster's elastic IP address (EIP), which may result in long wait times.
Procedure
Step 1: Deploy a sample inference service
Create a file named vllm-service.yaml.
Deploy the sample inference service.
kubectl apply -f vllm-service.yaml
Step 2: Deploy an inference route
In this step, you create an InferencePool resource and an InferenceModel resource.
Create a file named inference-pool.yaml.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  annotations:
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
  name: vllm-qwen-pool
spec:
  targetPortNumber: 8000
  selector:
    app: qwen
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: inferencemodel-qwen
spec:
  modelName: /model/qwen
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: vllm-qwen-pool
  targetModels:
  - name: /model/qwen
    weight: 100
In the InferencePool resource, add the inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE" annotation. This enables prefix-cache-aware routing in estimation mode for the pods in the InferencePool.
Deploy the inference route.
kubectl apply -f inference-pool.yaml
Step 3: Deploy the gateway and gateway routing rules
In this step, you create a gateway that listens on ports 8080 and 8081. On port 8081, an HTTPRoute resource specifies the InferencePool provided by the inference extension as the backend, and inference requests are routed to the pods defined in the InferencePool. On port 8080, another HTTPRoute resource specifies the Service as the backend, and inference requests are routed to the same pods using the standard HTTP least-request load balancing policy. This setup lets you compare the two policies against the same set of pods.
Create a file named inference-gateway.yaml.
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: qwen-inference-gateway-class
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: qwen-inference-gateway
spec:
  gatewayClassName: qwen-inference-gateway-class
  listeners:
  - name: http
    protocol: HTTP
    port: 8080
  - name: llm-gw
    protocol: HTTP
    port: 8081
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: qwen-backend
spec:
  parentRefs:
  - name: qwen-inference-gateway
    sectionName: llm-gw
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-qwen-pool
    matches:
    - path:
        type: PathPrefix
        value: /
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: qwen-backend-no-inference
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: qwen-inference-gateway
    sectionName: http
  rules:
  - backendRefs:
    - group: ""
      kind: Service
      name: qwen
      port: 8000
      weight: 1
    matches:
    - path:
        type: PathPrefix
        value: /
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: backend-timeout
spec:
  timeout:
    http:
      requestTimeout: 1h
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: qwen-inference-gateway
Deploy the gateway.
kubectl apply -f inference-gateway.yaml
Step 4: Verify the routing rules
Create files named round1.txt and round2.txt. Both files contain an identical content section. Use the content of round1.txt and then round2.txt as the request body for the LLM. Then, check the extensionRef component logs to verify that prefix-cache-aware routing in estimation mode is triggered.
round1.txt:
echo '{"max_tokens":24,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"/model/qwen","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txt
round2.txt:
echo '{"max_tokens":3,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you're setting up a fun test. I'm ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"/model/qwen","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txt
Retrieve the public IP address of the gateway.
export GATEWAY_IP=$(kubectl get gateway/qwen-inference-gateway -o jsonpath='{.status.addresses[0].value}')
Send two session requests to simulate a multi-turn conversation scenario.
curl -X POST ${GATEWAY_IP}:8081/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
curl -X POST ${GATEWAY_IP}:8081/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
Check the logs to confirm that prefix-cache-aware routing in estimation mode is working.
kubectl logs deploy/epp-default-inference-gateway-ext-proc -n envoy-gateway-system | grep "Do prefix"
Expected output:
2025-05-23T03:33:09Z INFO scheduling/prefixcache_filter.go:311 Do prefix-aware routing! {"request": "v68m4zx472", "matching ratio": " 0.54 > 0.50"}
The log contains the Do prefix-aware routing! message, which indicates that prefix-cache-aware routing in estimation mode is working.
(Optional) Step 5: Evaluate inference service performance with multi-turn conversation tests
This step uses an ACK cluster as an example and shows how to use a stress testing tool to run multi-turn conversation tests. The goal is to compare the performance of standard HTTP routing with that of prefix-cache-aware inference routing.
Deploy the llm-qa-benchmark stress testing tool.
kubectl apply -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: llm-qa-benchmark
  name: llm-qa-benchmark
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-qa-benchmark
  template:
    metadata:
      labels:
        app: llm-qa-benchmark
    spec:
      containers:
      - command:
        - sh
        - -c
        - sleep inf
        image: registry-cn-hangzhou.ack.aliyuncs.com/dev/llm-qa-benchmark:v0.1
        imagePullPolicy: IfNotPresent
        name: llm-qa-benchmark
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      restartPolicy: Always
EOF
Retrieve the internal IP address of the gateway.
export GW_IP=$(kubectl get svc -n envoy-gateway-system -l gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=qwen-inference-gateway -o jsonpath='{.items[0].spec.clusterIP}')
Run the stress test.
Important: The following results are from a test environment. Your actual results may vary.
Standard HTTP routing
kubectl exec -it deploy/llm-qa-benchmark -- env GW_IP=${GW_IP} python3 multi-round-qa.py \
  --num-users 8 \
  --num-rounds 15 \
  --qps 0.1 \
  --shared-system-prompt 100 \
  --sharegpt \
  --user-history-prompt 2000 \
  --answer-len 100 \
  --model /model/qwen \
  --time 600 \
  --base-url http://${GW_IP}:8080/v1
Expected output:
==================== Performance summary ======================
QPS: 0.1000 reqs/s
Processing speed: 0.1080 reqs/s
Requests on-the-fly: 0
Input tokens per second: 259.0703 tokens/s
Output tokens per second: 4.8576 tokens/s
Average generation throughput (per request): 26.6710 tokens/req/s
Average TTFT: 0.3669s
Time range: 1748231183.2753935 - 1748231766.4799275 (583.20s)
===============================================================
Inference service routing
kubectl exec -it deploy/llm-qa-benchmark -- env GW_IP=${GW_IP} python3 multi-round-qa.py \
  --num-users 8 \
  --num-rounds 15 \
  --qps 0.1 \
  --shared-system-prompt 100 \
  --sharegpt \
  --user-history-prompt 2000 \
  --answer-len 100 \
  --model /model/qwen \
  --time 600 \
  --base-url http://${GW_IP}:8081/v1
Expected output:
==================== Performance summary ======================
QPS: 0.1000 reqs/s
Processing speed: 0.1081 reqs/s
Requests on-the-fly: 0
Input tokens per second: 259.3009 tokens/s
Output tokens per second: 4.8548 tokens/s
Average generation throughput (per request): 26.9300 tokens/req/s
Average TTFT: 0.2761s
Time range: 1748231885.874972 - 1748232468.5918882 (582.72s)
===============================================================
The results show that the Average TTFT for inference service routing (0.2761s) is significantly lower than that for standard HTTP routing (0.3669s).
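The TTFT figures from the two runs can be compared directly. A quick calculation using the sample values above (your measurements will differ):

```python
# Relative TTFT improvement computed from the sample test outputs above.
ttft_http = 0.3669       # standard HTTP routing, seconds
ttft_inference = 0.2761  # prefix-cache-aware inference routing, seconds

improvement = (ttft_http - ttft_inference) / ttft_http
print(f"TTFT reduced by {improvement:.1%}")  # ≈ 24.7%
```

In this sample run, prefix-cache-aware routing cut average TTFT by roughly a quarter at the same request rate.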