Traditional load balancing distributes requests evenly across backends, but LLM inference workloads are inherently uneven: a short prompt might finish in milliseconds while a long completion ties up a GPU for seconds. Alibaba Cloud Service Mesh (ASM) solves this by routing requests based on each vLLM backend's real-time state: request queue depth and KV cache utilization. The result is lower time-to-first-token (TTFT), higher throughput, and balanced GPU utilization across your inference fleet.
This topic walks you through deploying a vLLM-based Llama 2 inference service, configuring LLM-aware routing through ASM, and setting up observability dashboards to monitor inference traffic.
Currently, only LLM inference services based on vLLM are supported.
## Background

### Large language models (LLMs)
Large language models (LLMs) are neural network-based language models with billions of parameters, exemplified by GPT, Qwen, and Llama. These models are trained on diverse and extensive datasets -- including web text, professional literature, and code -- and are primarily used for text generation tasks such as completion and dialogue.
To leverage LLMs for building applications, you can:
- Use external LLM API services from platforms such as OpenAI, Alibaba Cloud Model Studio, or Moonshot.
- Build your own LLM inference services using open-source or proprietary models and frameworks such as vLLM, and deploy them in a Kubernetes cluster. This approach suits scenarios that require control over the inference service or deep customization of LLM inference capabilities.
### vLLM
vLLM is a framework designed for efficient and user-friendly construction of LLM inference services. It supports various large language models, including Qwen, and optimizes inference efficiency through techniques like PagedAttention, dynamic batch inference (Continuous Batching), and model quantization.
## How LLM-aware load balancing works

### Why traditional load balancing falls short for LLM inference
Classic algorithms like round-robin and least-connections assume each request imposes a similar load. LLM inference breaks this assumption:
- **Variable processing time.** Each request goes through two phases: prefill (encoding the prompt) and decode (generating tokens one by one). The length of the decode phase is unpredictable because the number of output tokens varies per request.
- **GPU memory contention.** vLLM pre-allocates GPU memory for the KV cache. As the cache fills up, the server queues new requests or swaps them to CPU memory, which sharply increases latency.
Without accounting for these factors, requests pile up on some backends while others sit idle -- increasing tail latency and wasting GPU resources.
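A toy simulation makes this concrete. With highly variable service times, assigning each request to the currently least-loaded backend finishes the busiest backend much sooner than blind rotation. This is an illustrative sketch of the scheduling effect, not ASM's actual algorithm; the job distribution and backend count are made up.

```python
import random

random.seed(42)
# Service times vary wildly, like LLM decode lengths (seconds, illustrative)
jobs = [random.expovariate(1.0) * 5 for _ in range(200)]
NUM_BACKENDS = 4

def round_robin_makespan(jobs, n):
    loads = [0.0] * n
    for i, job in enumerate(jobs):
        loads[i % n] += job          # blindly rotate through backends
    return max(loads)                # finish time of the busiest backend

def least_loaded_makespan(jobs, n):
    loads = [0.0] * n
    for job in jobs:
        k = loads.index(min(loads))  # pick the backend with the least queued work
        loads[k] += job
    return max(loads)

rr = round_robin_makespan(jobs, NUM_BACKENDS)
ll = least_loaded_makespan(jobs, NUM_BACKENDS)
print(f"round-robin: {rr:.1f}s, least-loaded: {ll:.1f}s")
```

The gap between the two makespans is exactly the tail latency and idle-GPU waste described above.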
### How ASM routes LLM traffic
ASM evaluates multi-dimensional metrics from each vLLM backend to make routing decisions:
| Metric | Source | Routing signal |
|---|---|---|
| Request queue depth | `vllm:num_requests_waiting` | Fewer queued requests indicate faster processing |
| KV cache utilization | `vllm:gpu_cache_usage_perc` | Lower utilization means more GPU memory available for new requests |
When a new request arrives, ASM selects the backend with the best combination of these signals. This keeps GPU load balanced across inference replicas, reduces TTFT, and improves overall throughput compared to traditional algorithms.
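The selection logic can be sketched as picking the backend that minimizes a combined load score. The equal weighting and queue normalization below are illustrative assumptions, not ASM's actual formula:

```python
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    queue_depth: int       # vllm:num_requests_waiting
    kv_cache_util: float   # vllm:gpu_cache_usage_perc, in [0.0, 1.0]

def pick_backend(backends, max_queue=10):
    # Normalize queue depth to [0, 1] and average it with KV cache utilization.
    # The 50/50 weighting is an assumption for illustration only.
    def score(b):
        return 0.5 * min(b.queue_depth / max_queue, 1.0) + 0.5 * b.kv_cache_util
    return min(backends, key=score)

backends = [
    Backend("pod-a", queue_depth=8, kv_cache_util=0.90),
    Backend("pod-b", queue_depth=2, kv_cache_util=0.40),
    Backend("pod-c", queue_depth=1, kv_cache_util=0.95),
]
print(pick_backend(backends).name)  # pod-b: short queue and plenty of free KV cache
```

Note that neither signal alone is enough: pod-c has the shortest queue but an almost-full KV cache, so routing there would risk swapping and a latency spike.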
### LLM traffic observability
Standard proxies parse HTTP headers and URL paths but ignore the request body. Since LLM inference APIs (OpenAI-compatible format) carry the model name and token parameters in the request body, traditional observability misses critical dimensions.
ASM extends observability for LLM inference traffic:
- Access logs include the model name and input/output token counts per request.
- Monitoring metrics add a `model` dimension for per-model analysis.
- Token metrics (`asm_llm_proxy_prompt_tokens`, `asm_llm_proxy_completion_tokens`) track token consumption across workloads.
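The underlying point is easy to demonstrate: in an OpenAI-compatible request, the URL path is just `/v1/completions`, so the model name and token parameters are only recoverable by parsing the JSON body. A minimal illustration (the request body is a made-up example):

```python
import json

# An OpenAI-compatible completion request. Nothing in the path or headers
# identifies the model; the routing-relevant fields live in the body.
raw_body = b'{"model": "tweet-summary", "prompt": "Hello", "max_tokens": 100}'

body = json.loads(raw_body)
model = body["model"]             # the dimension ASM adds to metrics and logs
max_tokens = body.get("max_tokens")
print(model, max_tokens)          # tweet-summary 100
```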
## Prerequisites
Before you begin, make sure that you have:

- A Container Service for Kubernetes (ACK) managed cluster with a GPU node pool, or a Container Compute Service (ACS) cluster in a zone recommended for GPU computing power.
  - For ACK clusters, see Create an ACK managed cluster.
  - For ACS clusters, see Create an ACS cluster.
  - (Optional) To use ACS GPU computing power in an ACK cluster, install the ACK Virtual Node component. See ACS GPU computing power in ACK.
- An ASM instance (v1.24 or later) with your cluster added. See Add a cluster to an ASM instance.
- An ingress gateway with HTTP service enabled on port 8080. See Create an ingress gateway.
- (Optional) Sidecar injection enabled in the `default` namespace, required only for observability. See Enable automatic sidecar proxy injection.
## Step 1: Deploy a sample vLLM inference service
Deploy a Llama 2 model served by vLLM with multiple LoRA adapters. The deployment includes a Kubernetes Service, a ConfigMap for the chat template, and a Deployment with 3 GPU-backed replicas.
The container image requires a GPU with more than 16 GiB of video memory. Use the A10 GPU type for ACK clusters or the 8th-generation GPU B for ACS clusters. The T4 (16 GiB) does not provide sufficient memory. For model details, submit a ticket.
The LLM image is large. Pre-store it in Alibaba Cloud Container Registry (ACR) and pull over the internal network to avoid slow downloads over the public endpoint.
Create a file named `vllm-service.yaml` with the following content.

**ACK cluster**
```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama2-7b-pool
spec:
  selector:
    app: vllm-llama2-7b-pool
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
  type: ClusterIP
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: chat-template
data:
  llama-2-chat.jinja: |
    {% if messages[0]['role'] == 'system' %}
    {% set system_message = '<<SYS>>\n' + messages[0]['content'] | trim + '\n<</SYS>>\n\n' %}
    {% set messages = messages[1:] %}
    {% else %}
    {% set system_message = '' %}
    {% endif %}
    {% for message in messages %}
    {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
    {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
    {% endif %}
    {% if loop.index0 == 0 %}
    {% set content = system_message + message['content'] %}
    {% else %}
    {% set content = message['content'] %}
    {% endif %}
    {% if message['role'] == 'user' %}
    {{ bos_token + '[INST] ' + content | trim + ' [/INST]' }}
    {% elif message['role'] == 'assistant' %}
    {{ ' ' + content | trim + ' ' + eos_token }}
    {% endif %}
    {% endfor %}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2-7b-pool
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-llama2-7b-pool
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: '8000'
        prometheus.io/scrape: 'true'
      labels:
        app: vllm-llama2-7b-pool
    spec:
      containers:
        - name: lora
          image: "registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/llama2-with-lora:v0.2"
          imagePullPolicy: IfNotPresent
          command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
          args:
            - "--model"
            - "/model/llama2"
            - "--tensor-parallel-size"
            - "1"
            - "--port"
            - "8000"
            - '--gpu_memory_utilization'
            - '0.8'
            - "--enable-lora"
            - "--max-loras"
            - "4"
            - "--max-cpu-loras"
            - "12"
            - "--lora-modules"
            - 'sql-lora=/adapters/yard1/llama-2-7b-sql-lora-test_0'
            - 'sql-lora-1=/adapters/yard1/llama-2-7b-sql-lora-test_1'
            - 'sql-lora-2=/adapters/yard1/llama-2-7b-sql-lora-test_2'
            - 'sql-lora-3=/adapters/yard1/llama-2-7b-sql-lora-test_3'
            - 'sql-lora-4=/adapters/yard1/llama-2-7b-sql-lora-test_4'
            - 'tweet-summary=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0'
            - 'tweet-summary-1=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1'
            - 'tweet-summary-2=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2'
            - 'tweet-summary-3=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3'
            - 'tweet-summary-4=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4'
            - '--chat-template'
            - '/etc/vllm/llama-2-chat.jinja'
          env:
            - name: PORT
              value: "8000"
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          livenessProbe:
            failureThreshold: 2400
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          readinessProbe:
            failureThreshold: 6000
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
          volumeMounts:
            - mountPath: /data
              name: data
            - mountPath: /dev/shm
              name: shm
            - mountPath: /etc/vllm
              name: chat-template
      restartPolicy: Always
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      volumes:
        - name: data
          emptyDir: {}
        - name: shm
          emptyDir:
            medium: Memory
        - name: chat-template
          configMap:
            name: chat-template
```

**ACS cluster**
```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama2-7b-pool
spec:
  selector:
    app: vllm-llama2-7b-pool
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
  type: ClusterIP
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: chat-template
data:
  llama-2-chat.jinja: |
    {% if messages[0]['role'] == 'system' %}
    {% set system_message = '<<SYS>>\n' + messages[0]['content'] | trim + '\n<</SYS>>\n\n' %}
    {% set messages = messages[1:] %}
    {% else %}
    {% set system_message = '' %}
    {% endif %}
    {% for message in messages %}
    {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
    {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
    {% endif %}
    {% if loop.index0 == 0 %}
    {% set content = system_message + message['content'] %}
    {% else %}
    {% set content = message['content'] %}
    {% endif %}
    {% if message['role'] == 'user' %}
    {{ bos_token + '[INST] ' + content | trim + ' [/INST]' }}
    {% elif message['role'] == 'assistant' %}
    {{ ' ' + content | trim + ' ' + eos_token }}
    {% endif %}
    {% endfor %}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2-7b-pool
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-llama2-7b-pool
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: '8000'
        prometheus.io/scrape: 'true'
      labels:
        app: vllm-llama2-7b-pool
        alibabacloud.com/compute-class: gpu # Specify GPU computing power
        alibabacloud.com/compute-qos: default
        alibabacloud.com/gpu-model-series: "example-model" # Replace with your GPU model series
    spec:
      containers:
        - name: lora
          image: "registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/llama2-with-lora:v0.2"
          imagePullPolicy: IfNotPresent
          command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
          args:
            - "--model"
            - "/model/llama2"
            - "--tensor-parallel-size"
            - "1"
            - "--port"
            - "8000"
            - '--gpu_memory_utilization'
            - '0.8'
            - "--enable-lora"
            - "--max-loras"
            - "4"
            - "--max-cpu-loras"
            - "12"
            - "--lora-modules"
            - 'sql-lora=/adapters/yard1/llama-2-7b-sql-lora-test_0'
            - 'sql-lora-1=/adapters/yard1/llama-2-7b-sql-lora-test_1'
            - 'sql-lora-2=/adapters/yard1/llama-2-7b-sql-lora-test_2'
            - 'sql-lora-3=/adapters/yard1/llama-2-7b-sql-lora-test_3'
            - 'sql-lora-4=/adapters/yard1/llama-2-7b-sql-lora-test_4'
            - 'tweet-summary=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0'
            - 'tweet-summary-1=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1'
            - 'tweet-summary-2=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2'
            - 'tweet-summary-3=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3'
            - 'tweet-summary-4=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4'
            - '--chat-template'
            - '/etc/vllm/llama-2-chat.jinja'
          env:
            - name: PORT
              value: "8000"
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          livenessProbe:
            failureThreshold: 2400
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          readinessProbe:
            failureThreshold: 6000
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              cpu: 16
              memory: 64Gi
              nvidia.com/gpu: 1
            requests:
              cpu: 8
              memory: 30Gi
              nvidia.com/gpu: 1
          volumeMounts:
            - mountPath: /data
              name: data
            - mountPath: /dev/shm
              name: shm
            - mountPath: /etc/vllm
              name: chat-template
      restartPolicy: Always
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      volumes:
        - name: data
          emptyDir: {}
        - name: shm
          emptyDir:
            medium: Memory
        - name: chat-template
          configMap:
            name: chat-template
```

Deploy the inference service.
```shell
kubectl apply -f vllm-service.yaml
```

Wait for all 3 replicas to become ready. The LLM image is large, so initial pulls may take several minutes. All 3 pods should reach `Running` status with `1/1` ready containers before you proceed.

```shell
kubectl get pods -l app=vllm-llama2-7b-pool -w
```
## Step 2: Configure ASM gateway rules
Create a Gateway resource to enable HTTP traffic on port 8080 of the ASM ingress gateway.
Create a file named `gateway.yaml` with the following content.

```yaml
apiVersion: networking.istio.io/v1
kind: Gateway
metadata:
  name: llm-inference-gateway
  namespace: default
spec:
  selector:
    istio: ingressgateway
  servers:
    - hosts:
        - '*'
      port:
        name: http-service
        number: 8080
        protocol: HTTP
```

Apply the Gateway resource.

```shell
kubectl apply -f gateway.yaml
```
## Step 3: Configure routing and load balancing for the LLM inference service
To compare LLM-aware load balancing against traditional load balancing, complete the steps in (Optional) Compare with traditional load balancing before proceeding.
This step creates three resources that connect the ASM gateway to your vLLM backends with LLM-aware routing:
| Resource | Purpose |
|---|---|
| InferencePool | Groups vLLM Pods by label selector and specifies the inference port |
| InferenceModel | Maps model names from the request body to backend Pods and defines traffic distribution |
| LLMRoute | Connects the ASM gateway to the InferencePool for LLM-aware routing |
### Enable LLM inference routing
Run the following command using the kubeconfig of your ASM instance:
```shell
kubectl patch asmmeshconfig default --type=merge \
  --patch='{"spec":{"gatewayAPIInferenceExtension":{"enabled":true}}}'
```

### Create the InferencePool
The InferencePool groups vLLM Pods by label selector and specifies the inference port.
Create a file named `inferencepool.yaml` with the following content.

| Field | Description |
|---|---|
| `.spec.targetPortNumber` | The port on each Pod that serves inference requests |
| `.spec.selector` | Label selector matching inference Pods. The key must be `app` and the value must match the corresponding Service name |

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferencePool
metadata:
  name: vllm-llama2-7b-pool
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-llama2-7b-pool
```

Apply the InferencePool using the kubeconfig of the data plane cluster.

```shell
kubectl apply -f inferencepool.yaml
```

Verify that the InferencePool is active. The status should include `Accepted=True` and `ResolvedRefs=True` before you proceed.

```shell
kubectl get inferencepool vllm-llama2-7b-pool -o yaml
```
### Create the InferenceModel
The InferenceModel maps the `model` parameter in incoming requests to specific backend Pods and defines traffic distribution weights.
Create a file named `inferencemodel.yaml` with the following content.

| Field | Description |
|---|---|
| `.spec.modelName` | Matches the `model` parameter in the request body |
| `.spec.targetModels` | Defines traffic routing rules. In this example, all requests with `model: tweet-summary` route to Pods running that model |

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: inferencemodel-sample
spec:
  modelName: tweet-summary
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: vllm-llama2-7b-pool
  targetModels:
    - name: tweet-summary
      weight: 100
```

Apply the InferenceModel.

```shell
kubectl apply -f inferencemodel.yaml
```

Verify that the InferenceModel is active. The status should include `Accepted=True` and `ResolvedRefs=True`.

```shell
kubectl get inferencemodel inferencemodel-sample -o yaml
```
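The `targetModels` weights generalize to traffic splitting: with several entries, requests for the matched `modelName` are distributed in proportion to the weights, and with a single entry at weight 100 every matching request goes to that model. A weighted selection of this kind can be sketched as follows (an illustration of the semantics, not the controller's actual code):

```python
import random

# (name, weight) pairs, as in .spec.targetModels above
target_models = [("tweet-summary", 100)]

def select_target(models):
    # Weighted random choice in proportion to the configured weights
    names = [name for name, _ in models]
    weights = [weight for _, weight in models]
    return random.choices(names, weights=weights, k=1)[0]

print(select_target(target_models))  # always tweet-summary with a single entry
```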
### Create the LLMRoute
The LLMRoute connects the ASM gateway to the InferencePool, forwarding all requests on port 8080 to the inference service with LLM-aware load balancing.
Create a file named `llmroute.yaml` with the following content.

```yaml
apiVersion: istio.alibabacloud.com/v1
kind: LLMRoute
metadata:
  name: test-llm-route
spec:
  gateways:
    - llm-inference-gateway
  host: test.com
  rules:
    - backendRefs:
        - backendRef:
            group: inference.networking.x-k8s.io
            kind: InferencePool
            name: vllm-llama2-7b-pool
```

Apply the LLMRoute.

```shell
kubectl apply -f llmroute.yaml
```
## Step 4: Verify the configuration
Send a test request through the ASM gateway to confirm that LLM-aware routing is working.
```shell
curl -v \
  -H "host: test.com" \
  -H "Content-Type: application/json" \
  http://${ASM_GATEWAY_IP}:8080/v1/completions \
  -d '{
    "model": "tweet-summary",
    "prompt": "Write as if you were a critic: San Francisco",
    "max_tokens": 100,
    "temperature": 0
  }'
```

Replace `${ASM_GATEWAY_IP}` with your ASM ingress gateway's IP address.
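If you prefer scripting the check, the same request can be built with Python's standard library. The gateway IP below is a placeholder, and the actual send is commented out because it requires a live gateway:

```python
import json
import urllib.request

ASM_GATEWAY_IP = "203.0.113.10"  # placeholder: replace with your gateway's IP address

payload = {
    "model": "tweet-summary",
    "prompt": "Write as if you were a critic: San Francisco",
    "max_tokens": 100,
    "temperature": 0,
}
req = urllib.request.Request(
    f"http://{ASM_GATEWAY_IP}:8080/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Host": "test.com", "Content-Type": "application/json"},
    method="POST",
)
# Uncomment to send the request against a live gateway:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["usage"])
```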
A successful response looks like this:

```json
{
  "id": "cmpl-2fc9a351-d866-422b-b561-874a30843a6b",
  "object": "text_completion",
  "created": 1736933141,
  "model": "tweet-summary",
  "choices": [
    {
      "index": 0,
      "text": "...",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 2,
    "total_tokens": 102,
    "completion_tokens": 100,
    "prompt_tokens_details": null
  }
}
```

Run the command multiple times to observe that requests are distributed across different backend Pods based on their real-time load.
## (Optional) Step 5: Set up observability for LLM inference services
After configuring InferencePool, InferenceModel, and LLMRoute, set up monitoring to track inference request rates, token throughput, and vLLM backend health.
### Enable LLM traffic observability in ASM
Enable LLM-specific log fields, metrics, and metric dimensions in the ASM console. See Traffic observation: Efficiently manage LLM traffic using ASM for configuration details. After configuration, ASM monitoring metrics include a `model` dimension.

To collect these metrics, add scrape rules for ASM's LLM-specific token metrics (`asm_llm_proxy_prompt_tokens` and `asm_llm_proxy_completion_tokens`) to your Prometheus configuration. See Other Prometheus service discovery configurations for setup details.

```yaml
scrape_configs:
  - job_name: asm-envoy-stats-llm
    scrape_interval: 30s
    scrape_timeout: 30s
    metrics_path: /stats/prometheus
    scheme: http
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels:
          - __meta_kubernetes_pod_container_port_name
        action: keep
        regex: .*-envoy-prom
      - source_labels:
          - __address__
          - __meta_kubernetes_pod_annotation_prometheus_io_port
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:15090
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels:
          - __meta_kubernetes_namespace
        action: replace
        target_label: namespace
      - source_labels:
          - __meta_kubernetes_pod_name
        action: replace
        target_label: pod_name
    metric_relabel_configs:
      - action: keep
        source_labels:
          - __name__
        regex: asm_llm_.*
```
### Collect vLLM backend metrics
The vLLM service exposes Prometheus metrics at `/metrics` on port 8000. The sample Deployment already includes the required annotations:

```yaml
annotations:
  prometheus.io/path: /metrics
  prometheus.io/port: "8000"
  prometheus.io/scrape: "true"
```

Prometheus discovers and scrapes these endpoints automatically through its default pod service discovery mechanism. See Default pod service discovery for details.
Key vLLM metrics to monitor:
| Metric | Description |
|---|---|
| `vllm:gpu_cache_usage_perc` | KV cache utilization. Lower values indicate more GPU memory available for new requests |
| `vllm:request_queue_time_seconds_sum` | Time requests spend waiting in the queue before the vLLM scheduler runs prefill and decode |
| `vllm:num_requests_running`, `vllm:num_requests_waiting`, `vllm:num_requests_swapped` | Number of requests running inference, waiting in the queue, or swapped to CPU memory. Use these to assess backend pressure |
| `vllm:avg_generation_throughput_toks_per_s`, `vllm:avg_prompt_throughput_toks_per_s` | Per-second token throughput for the decode and prefill stages |
| `vllm:time_to_first_token_seconds_bucket` | TTFT distribution. Measures how quickly clients receive the first token after submitting a request, a key metric for user-perceived latency |
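Histogram metrics such as `vllm:time_to_first_token_seconds_bucket` report cumulative counts per upper bound (the `le` label). In Prometheus you would normally compute percentiles with `histogram_quantile`; the sketch below shows the same interpolation on a scraped snapshot, with made-up bucket values for illustration:

```python
# Cumulative bucket counts: (upper bound in seconds, requests with TTFT <= bound)
buckets = [(0.05, 40), (0.1, 120), (0.25, 420), (0.5, 480), (1.0, 500)]

def approx_quantile(buckets, q):
    total = buckets[-1][1]        # all observed requests fall in the last bucket
    threshold = q * total
    lower, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= threshold:
            # Linear interpolation inside the bucket, as histogram_quantile does
            frac = (threshold - prev_count) / (count - prev_count)
            return lower + (upper - lower) * frac
        lower, prev_count = upper, count
    return buckets[-1][0]

p90 = approx_quantile(buckets, 0.9)
print(f"approximate P90 TTFT: {p90:.3f}s")  # 0.375s for these sample buckets
```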
### Create a Grafana dashboard
Set up a Grafana dashboard that combines ASM metrics (request rate, token throughput) with vLLM metrics (GPU cache, queue depth, TTFT) for a unified view of your inference fleet.
Make sure that your Prometheus data source in Grafana is collecting both ASM and vLLM metrics.
Download the complete dashboard JSON from the ASM documentation page (expand Dashboard JSON on that page to copy the full content).
In Grafana, go to Dashboards > Import, paste the JSON, and select your Prometheus data source.

## (Optional) Compare with traditional load balancing
Use the observability dashboard to measure the difference between LLM-aware and traditional load balancing. Run this comparison before configuring LLM-aware routing in Step 3.
If you have already completed Step 3, clean up the LLM routing resources first:
```shell
kubectl delete inferencemodel --all
kubectl delete inferencepool --all
kubectl delete llmroute --all
```

Create a VirtualService that routes traffic using traditional round-robin load balancing.

```shell
kubectl apply -f- <<EOF
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: llm-vs
  namespace: default
spec:
  gateways:
    - default/llm-inference-gateway
  hosts:
    - '*'
  http:
    - name: any-host
      route:
        - destination:
            host: vllm-llama2-7b-pool.default.svc.cluster.local
            port:
              number: 8000
EOF
```

Run a stress test against the inference service using a tool such as llmperf.
Delete the VirtualService, then complete Step 3 to configure LLM-aware routing. Make sure that no VirtualService resources remain before proceeding. Run the same stress test again.
Compare the results in the Grafana dashboard. LLM-aware load balancing provides:
- Lower TTFT (time to first token)
- Higher token throughput
- More even KV cache utilization across replicas

## Clean up
To remove all resources created in this tutorial:
```shell
# Delete LLM routing resources
kubectl delete inferencemodel --all
kubectl delete inferencepool --all
kubectl delete llmroute --all

# Delete the gateway
kubectl delete -f gateway.yaml

# Delete the inference service
kubectl delete -f vllm-service.yaml
```