
Alibaba Cloud Service Mesh:Smart routing with queues, KV Cache, and LoRA

Last Updated:Mar 11, 2026

Traditional load balancing distributes requests evenly across backends, but LLM inference workloads are inherently uneven. A short prompt might finish in milliseconds while a long completion ties up a GPU for seconds. Service Mesh (ASM) solves this by routing requests based on each vLLM backend's real-time state: request queue depth and KV cache utilization. The result is lower time-to-first-token (TTFT), higher throughput, and balanced GPU utilization across your inference fleet.

This topic walks you through deploying a vLLM-based Llama 2 inference service, configuring LLM-aware routing through ASM, and setting up observability dashboards to monitor inference traffic.

Important

Currently, only LLM inference services based on vLLM are supported.

Background

Large language models (LLMs)

Large language models (LLMs) are neural network-based language models with billions of parameters, exemplified by GPT, Qwen, and Llama. These models are trained on diverse and extensive datasets -- including web text, professional literature, and code -- and are primarily used for text generation tasks such as completion and dialogue.

To leverage LLMs for building applications, you can:

  • Use external LLM API services from platforms like OpenAI, Alibaba Cloud Model Studio, or Moonshot.

  • Build your own LLM inference services using open-source or proprietary models and frameworks such as vLLM, and deploy them in a Kubernetes cluster. This approach suits scenarios that require control over the inference service or high customization of LLM inference capabilities.

vLLM

vLLM is a framework designed for efficient and user-friendly construction of LLM inference services. It supports various large language models, including Qwen, and optimizes inference efficiency through techniques like PagedAttention, dynamic batch inference (Continuous Batching), and model quantization.
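Continuous batching is the scheduling idea that makes the throughput difference: instead of waiting for a whole batch to finish, the scheduler admits queued sequences the moment running ones complete, so the GPU never idles on finished slots. The following Python sketch is a simplified illustration of that idea, not vLLM's actual scheduler; all numbers are made up.

```python
from collections import deque

def continuous_batching(jobs, max_batch=4):
    """Simulate continuous batching: each step decodes one token for every
    in-flight sequence; finished sequences leave and queued ones join at
    once. `jobs` lists how many tokens each request must generate.
    Returns the number of decode steps needed to finish all jobs."""
    queue = deque(jobs)
    running = []  # remaining tokens for each in-flight sequence
    steps = 0
    while queue or running:
        # Admit waiting sequences into free batch slots immediately.
        while queue and len(running) < max_batch:
            running.append(queue.popleft())
        steps += 1  # one decode iteration for the whole batch
        running = [r - 1 for r in running if r > 1]
    return steps

# Skewed workload: one long completion alongside several short ones.
print(continuous_batching([100, 3, 3, 3, 3, 3], max_batch=4))  # 100
```

With static batching the same workload would take 103 steps (the first batch waits 100 steps for the long request, then the two leftover jobs take 3 more); continuous batching lets the short requests slot in and out around the long one.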

How LLM-aware load balancing works

Why traditional load balancing falls short for LLM inference

Classic algorithms like round-robin and least-connections assume each request imposes a similar load. LLM inference breaks this assumption:

  • Variable processing time. Each request goes through two phases -- prefill (encoding the prompt) and decode (generating tokens one by one). The decode phase length is unpredictable because the number of output tokens varies per request.

  • GPU memory contention. vLLM pre-allocates GPU memory for KV cache. As cache fills up, the server queues new requests or swaps them to CPU memory, which sharply increases latency.

Without accounting for these factors, requests pile up on some backends while others sit idle -- increasing tail latency and wasting GPU resources.
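The effect is easy to reproduce in a toy model. The Python sketch below (backend counts and durations are illustrative) assigns requests with highly variable durations to three backends, once by round-robin and once by picking the least-loaded backend, then compares the busiest backend's total work:

```python
def assign(durations, n_backends, strategy):
    """Return per-backend total busy time after assigning each request.
    strategy: 'round_robin' or 'least_loaded'."""
    load = [0.0] * n_backends
    for i, d in enumerate(durations):
        if strategy == "round_robin":
            target = i % n_backends
        else:  # least_loaded: mimic routing on real-time backend state
            target = load.index(min(load))
        load[target] += d
    return load

# Mixed workload: mostly short prompts, a few long completions.
durations = [0.1, 0.1, 9.0, 0.1, 0.1, 9.0, 0.1, 0.1, 9.0]
rr = assign(durations, 3, "round_robin")
ll = assign(durations, 3, "least_loaded")
# Round-robin happens to send every long request to the same backend,
# so its busiest backend does ~27s of work; least-loaded spreads them
# out and the busiest backend stays under 10s.
print(max(rr), max(ll))
```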

How ASM routes LLM traffic

ASM evaluates multi-dimensional metrics from each vLLM backend to make routing decisions:

  • Request queue depth (vllm:num_requests_waiting): fewer queued requests indicate faster processing.

  • KV cache utilization (vllm:gpu_cache_usage_perc): lower utilization means more GPU memory available for new requests.

When a new request arrives, ASM selects the backend with the best combination of these signals. This keeps GPU load balanced across inference replicas, reduces TTFT, and improves overall throughput compared to traditional algorithms.
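ASM's exact selection algorithm is internal; the Python sketch below only illustrates the idea of combining the two signals into a score and picking the best backend. The metric names match the vLLM metrics above, but the equal weighting and the pod names are made up.

```python
def pick_backend(backends):
    """Pick the backend with the best combination of queue depth and
    KV cache utilization. Lower score = better candidate."""
    def score(b):
        # Illustrative weighting: a deep queue or a full cache both
        # penalize a backend; the real algorithm may weight differently.
        return b["vllm:num_requests_waiting"] + b["vllm:gpu_cache_usage_perc"]
    return min(backends, key=score)

backends = [
    {"name": "pod-a", "vllm:num_requests_waiting": 4, "vllm:gpu_cache_usage_perc": 0.9},
    {"name": "pod-b", "vllm:num_requests_waiting": 1, "vllm:gpu_cache_usage_perc": 0.5},
    {"name": "pod-c", "vllm:num_requests_waiting": 1, "vllm:gpu_cache_usage_perc": 0.2},
]
print(pick_backend(backends)["name"])  # pod-c: shortest queue, most free cache
```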

LLM traffic observability

Standard proxies parse HTTP headers and URL paths but ignore the request body. Since LLM inference APIs (OpenAI-compatible format) carry the model name and token parameters in the request body, traditional observability misses critical dimensions.

ASM extends observability for LLM inference traffic:

  • Access logs include the model name and input/output token counts per request.

  • Monitoring metrics add a model dimension for per-model analysis.

  • Token metrics (asm_llm_proxy_prompt_tokens, asm_llm_proxy_completion_tokens) track token consumption across workloads.
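The token metrics are exposed in standard Prometheus text format. As a sketch of how they can be consumed, the following Python snippet sums token consumption per model from scraped metric lines; the label layout in `sample` is illustrative, so check it against your actual scrape output.

```python
import re

def tokens_per_model(metrics_text):
    """Sum asm_llm_proxy_*_tokens counters by their model label."""
    totals = {}
    pattern = re.compile(
        r'^(asm_llm_proxy_(?:prompt|completion)_tokens)'
        r'\{[^}]*model="([^"]+)"[^}]*\}\s+([0-9.eE+]+)'
    )
    for line in metrics_text.splitlines():
        m = pattern.match(line)
        if m:
            metric, model, value = m.groups()
            per = totals.setdefault(model, {})
            per[metric] = per.get(metric, 0.0) + float(value)
    return totals

# Hypothetical scrape output; real lines may carry additional labels.
sample = '''\
asm_llm_proxy_prompt_tokens{model="tweet-summary"} 1200
asm_llm_proxy_completion_tokens{model="tweet-summary"} 5400
asm_llm_proxy_prompt_tokens{model="sql-lora"} 300
'''
print(tokens_per_model(sample))
```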

Prerequisites

Before you begin, make sure that you have:

Step 1: Deploy a sample vLLM inference service

Deploy a Llama 2 model served by vLLM with multiple LoRA adapters. The deployment includes a Kubernetes Service, a ConfigMap for the chat template, and a Deployment with 3 GPU-backed replicas.

Note

The container image requires a GPU with more than 16 GiB of video memory. Use the A10 GPU type for ACK clusters or the 8th-generation GPU B for ACS clusters. The T4 (16 GiB) does not provide sufficient memory. For model details, submit a ticket.

The LLM image is large. Pre-store it in Alibaba Cloud Container Registry (ACR) and pull over the internal network to avoid slow downloads over the public endpoint.

  1. Create a file named vllm-service.yaml with the following content.

    ACK cluster

       apiVersion: v1
       kind: Service
       metadata:
         name: vllm-llama2-7b-pool
       spec:
         selector:
           app: vllm-llama2-7b-pool
         ports:
         - protocol: TCP
           port: 8000
           targetPort: 8000
         type: ClusterIP
       ---
       apiVersion: v1
       kind: ConfigMap
       metadata:
         name: chat-template
       data:
         llama-2-chat.jinja: |
           {% if messages[0]['role'] == 'system' %}
             {% set system_message = '<<SYS>>\n' + messages[0]['content'] | trim + '\n<</SYS>>\n\n' %}
             {% set messages = messages[1:] %}
           {% else %}
               {% set system_message = '' %}
           {% endif %}
    
           {% for message in messages %}
               {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
                   {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
               {% endif %}
    
               {% if loop.index0 == 0 %}
                   {% set content = system_message + message['content'] %}
               {% else %}
                   {% set content = message['content'] %}
               {% endif %}
               {% if message['role'] == 'user' %}
                   {{ bos_token + '[INST] ' + content | trim + ' [/INST]' }}
               {% elif message['role'] == 'assistant' %}
                   {{ ' ' + content | trim + ' ' + eos_token }}
               {% endif %}
           {% endfor %}
       ---
       apiVersion: apps/v1
       kind: Deployment
       metadata:
         name: vllm-llama2-7b-pool
         namespace: default
       spec:
         replicas: 3
         selector:
           matchLabels:
             app: vllm-llama2-7b-pool
         template:
           metadata:
             annotations:
               prometheus.io/path: /metrics
               prometheus.io/port: '8000'
               prometheus.io/scrape: 'true'
             labels:
               app: vllm-llama2-7b-pool
           spec:
             containers:
               - name: lora
                 image: "registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/llama2-with-lora:v0.2"
                 imagePullPolicy: IfNotPresent
                 command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
                 args:
                 - "--model"
                 - "/model/llama2"
                 - "--tensor-parallel-size"
                 - "1"
                 - "--port"
                 - "8000"
                 - '--gpu_memory_utilization'
                 - '0.8'
                 - "--enable-lora"
                 - "--max-loras"
                 - "4"
                 - "--max-cpu-loras"
                 - "12"
                 - "--lora-modules"
                 - 'sql-lora=/adapters/yard1/llama-2-7b-sql-lora-test_0'
                 - 'sql-lora-1=/adapters/yard1/llama-2-7b-sql-lora-test_1'
                 - 'sql-lora-2=/adapters/yard1/llama-2-7b-sql-lora-test_2'
                 - 'sql-lora-3=/adapters/yard1/llama-2-7b-sql-lora-test_3'
                 - 'sql-lora-4=/adapters/yard1/llama-2-7b-sql-lora-test_4'
                 - 'tweet-summary=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0'
                 - 'tweet-summary-1=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1'
                 - 'tweet-summary-2=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2'
                 - 'tweet-summary-3=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3'
                 - 'tweet-summary-4=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4'
                 - '--chat-template'
                 - '/etc/vllm/llama-2-chat.jinja'
                 env:
                   - name: PORT
                     value: "8000"
                 ports:
                   - containerPort: 8000
                     name: http
                     protocol: TCP
                 livenessProbe:
                   failureThreshold: 2400
                   httpGet:
                     path: /health
                     port: http
                     scheme: HTTP
                   initialDelaySeconds: 5
                   periodSeconds: 5
                   successThreshold: 1
                   timeoutSeconds: 1
                 readinessProbe:
                   failureThreshold: 6000
                   httpGet:
                     path: /health
                     port: http
                     scheme: HTTP
                   initialDelaySeconds: 5
                   periodSeconds: 5
                   successThreshold: 1
                   timeoutSeconds: 1
                 resources:
                   limits:
                     nvidia.com/gpu: 1
                   requests:
                     nvidia.com/gpu: 1
                 volumeMounts:
                   - mountPath: /data
                     name: data
                   - mountPath: /dev/shm
                     name: shm
                   - mountPath: /etc/vllm
                     name: chat-template
             restartPolicy: Always
             schedulerName: default-scheduler
             terminationGracePeriodSeconds: 30
             volumes:
               - name: data
                 emptyDir: {}
               - name: shm
                 emptyDir:
                   medium: Memory
               - name: chat-template
                 configMap:
                   name: chat-template

    ACS cluster

       apiVersion: v1
       kind: Service
       metadata:
         name: vllm-llama2-7b-pool
       spec:
         selector:
           app: vllm-llama2-7b-pool
         ports:
         - protocol: TCP
           port: 8000
           targetPort: 8000
         type: ClusterIP
       ---
       apiVersion: v1
       kind: ConfigMap
       metadata:
         name: chat-template
       data:
         llama-2-chat.jinja: |
           {% if messages[0]['role'] == 'system' %}
             {% set system_message = '<<SYS>>\n' + messages[0]['content'] | trim + '\n<</SYS>>\n\n' %}
             {% set messages = messages[1:] %}
           {% else %}
               {% set system_message = '' %}
           {% endif %}
    
           {% for message in messages %}
               {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
                   {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
               {% endif %}
    
               {% if loop.index0 == 0 %}
                   {% set content = system_message + message['content'] %}
               {% else %}
                   {% set content = message['content'] %}
               {% endif %}
               {% if message['role'] == 'user' %}
                   {{ bos_token + '[INST] ' + content | trim + ' [/INST]' }}
               {% elif message['role'] == 'assistant' %}
                   {{ ' ' + content | trim + ' ' + eos_token }}
               {% endif %}
           {% endfor %}
       ---
       apiVersion: apps/v1
       kind: Deployment
       metadata:
         name: vllm-llama2-7b-pool
         namespace: default
       spec:
         replicas: 3
         selector:
           matchLabels:
             app: vllm-llama2-7b-pool
         template:
           metadata:
             annotations:
               prometheus.io/path: /metrics
               prometheus.io/port: '8000'
               prometheus.io/scrape: 'true'
             labels:
               app: vllm-llama2-7b-pool
               alibabacloud.com/compute-class: gpu  # Specify GPU computing power
               alibabacloud.com/compute-qos: default
               alibabacloud.com/gpu-model-series: "example-model" # Replace with your GPU model series
           spec:
             containers:
               - name: lora
                 image: "registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/llama2-with-lora:v0.2"
                 imagePullPolicy: IfNotPresent
                 command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
                 args:
                 - "--model"
                 - "/model/llama2"
                 - "--tensor-parallel-size"
                 - "1"
                 - "--port"
                 - "8000"
                 - '--gpu_memory_utilization'
                 - '0.8'
                 - "--enable-lora"
                 - "--max-loras"
                 - "4"
                 - "--max-cpu-loras"
                 - "12"
                 - "--lora-modules"
                 - 'sql-lora=/adapters/yard1/llama-2-7b-sql-lora-test_0'
                 - 'sql-lora-1=/adapters/yard1/llama-2-7b-sql-lora-test_1'
                 - 'sql-lora-2=/adapters/yard1/llama-2-7b-sql-lora-test_2'
                 - 'sql-lora-3=/adapters/yard1/llama-2-7b-sql-lora-test_3'
                 - 'sql-lora-4=/adapters/yard1/llama-2-7b-sql-lora-test_4'
                 - 'tweet-summary=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0'
                 - 'tweet-summary-1=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1'
                 - 'tweet-summary-2=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2'
                 - 'tweet-summary-3=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3'
                 - 'tweet-summary-4=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4'
                 - '--chat-template'
                 - '/etc/vllm/llama-2-chat.jinja'
                 env:
                   - name: PORT
                     value: "8000"
                 ports:
                   - containerPort: 8000
                     name: http
                     protocol: TCP
                 livenessProbe:
                   failureThreshold: 2400
                   httpGet:
                     path: /health
                     port: http
                     scheme: HTTP
                   initialDelaySeconds: 5
                   periodSeconds: 5
                   successThreshold: 1
                   timeoutSeconds: 1
                 readinessProbe:
                   failureThreshold: 6000
                   httpGet:
                     path: /health
                     port: http
                     scheme: HTTP
                   initialDelaySeconds: 5
                   periodSeconds: 5
                   successThreshold: 1
                   timeoutSeconds: 1
                 resources:
                   limits:
                     cpu: 16
                     memory: 64Gi
                     nvidia.com/gpu: 1
                   requests:
                     cpu: 8
                     memory: 30Gi
                     nvidia.com/gpu: 1
                 volumeMounts:
                   - mountPath: /data
                     name: data
                   - mountPath: /dev/shm
                     name: shm
                   - mountPath: /etc/vllm
                     name: chat-template
             restartPolicy: Always
             schedulerName: default-scheduler
             terminationGracePeriodSeconds: 30
             volumes:
               - name: data
                 emptyDir: {}
               - name: shm
                 emptyDir:
                   medium: Memory
               - name: chat-template
                 configMap:
                   name: chat-template
  2. Deploy the inference service.

       kubectl apply -f vllm-service.yaml
  3. Wait for all 3 replicas to become ready. The LLM image is large, so initial pulls may take several minutes. All 3 pods should reach Running status with 1/1 ready containers before you proceed.

       kubectl get pods -l app=vllm-llama2-7b-pool -w

Step 2: Configure ASM gateway rules

Create a Gateway resource to enable HTTP traffic on port 8080 of the ASM ingress gateway.

  1. Create a file named gateway.yaml with the following content.

       apiVersion: networking.istio.io/v1
       kind: Gateway
       metadata:
         name: llm-inference-gateway
         namespace: default
       spec:
         selector:
           istio: ingressgateway
         servers:
           - hosts:
               - '*'
             port:
               name: http-service
               number: 8080
               protocol: HTTP
  2. Apply the Gateway resource.

       kubectl apply -f gateway.yaml

Step 3: Configure routing and load balancing for the LLM inference service

Note

To compare LLM-aware load balancing against traditional load balancing, complete the steps in (Optional) Compare with traditional load balancing before proceeding.

This step creates three resources that connect the ASM gateway to your vLLM backends with LLM-aware routing:

  • InferencePool: groups vLLM Pods by label selector and specifies the inference port.

  • InferenceModel: maps model names from the request body to backend Pods and defines traffic distribution.

  • LLMRoute: connects the ASM gateway to the InferencePool for LLM-aware routing.

Enable LLM inference routing

Run the following command using the kubeconfig of your ASM instance:

kubectl patch asmmeshconfig default --type=merge \
  --patch='{"spec":{"gatewayAPIInferenceExtension":{"enabled":true}}}'

Create the InferencePool

The InferencePool groups vLLM Pods by label selector and specifies the inference port.

  1. Create a file named inferencepool.yaml with the following content.

     Key fields:
       • .spec.targetPortNumber: The port on each Pod that serves inference requests.
       • .spec.selector: Label selector matching inference Pods. The key must be app and the value must match the corresponding Service name.
       apiVersion: inference.networking.x-k8s.io/v1alpha1
       kind: InferencePool
       metadata:
         name: vllm-llama2-7b-pool
       spec:
         targetPortNumber: 8000
         selector:
           app: vllm-llama2-7b-pool
  2. Apply the InferencePool using the kubeconfig of the data plane cluster.

       kubectl apply -f inferencepool.yaml
  3. Verify that the InferencePool is active. The status should include Accepted=True and ResolvedRefs=True before you proceed.

       kubectl get inferencepool vllm-llama2-7b-pool -o yaml

Create the InferenceModel

The InferenceModel maps the model parameter in incoming requests to specific backend Pods and defines traffic distribution weights.

  1. Create a file named inferencemodel.yaml with the following content.

     Key fields:
       • .spec.modelName: Matches the model parameter in the request body.
       • .spec.targetModels: Defines traffic routing rules. In this example, all requests with model: tweet-summary route to Pods running that model.
       apiVersion: inference.networking.x-k8s.io/v1alpha1
       kind: InferenceModel
       metadata:
         name: inferencemodel-sample
       spec:
         modelName: tweet-summary
         poolRef:
           group: inference.networking.x-k8s.io
           kind: InferencePool
           name: vllm-llama2-7b-pool
         targetModels:
         - name: tweet-summary
           weight: 100
  2. Apply the InferenceModel.

       kubectl apply -f inferencemodel.yaml
  3. Verify that the InferenceModel is active. The status should include Accepted=True and ResolvedRefs=True.

       kubectl get inferencemodel inferencemodel-sample -o yaml

Create the LLMRoute

The LLMRoute connects the ASM gateway to the InferencePool, forwarding all requests on port 8080 to the inference service with LLM-aware load balancing.

  1. Create a file named llmroute.yaml with the following content.

       apiVersion: istio.alibabacloud.com/v1
       kind: LLMRoute
       metadata:
         name: test-llm-route
       spec:
         gateways:
         - llm-inference-gateway
         host: test.com
         rules:
         - backendRefs:
           - backendRef:
               group: inference.networking.x-k8s.io
               kind: InferencePool
               name: vllm-llama2-7b-pool
  2. Apply the LLMRoute.

       kubectl apply -f llmroute.yaml

Step 4: Verify the configuration

Send a test request through the ASM gateway to confirm that LLM-aware routing is working.

curl -v \
  -H "host: test.com" \
  -H "Content-Type: application/json" \
  http://${ASM_GATEWAY_IP}:8080/v1/completions \
  -d '{
    "model": "tweet-summary",
    "prompt": "Write as if you were a critic: San Francisco",
    "max_tokens": 100,
    "temperature": 0
  }'

Replace ${ASM_GATEWAY_IP} with your ASM ingress gateway's IP address.

A successful response looks like this:

{
  "id": "cmpl-2fc9a351-d866-422b-b561-874a30843a6b",
  "object": "text_completion",
  "created": 1736933141,
  "model": "tweet-summary",
  "choices": [
    {
      "index": 0,
      "text": "...",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 2,
    "total_tokens": 102,
    "completion_tokens": 100,
    "prompt_tokens_details": null
  }
}

Run the command multiple times to observe that requests are distributed across different backend Pods based on their real-time load.
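The same request can also be issued from a short Python script, which is convenient when sending many requests in a row. This is an illustrative sketch of building the OpenAI-compatible request; 192.0.2.10 is a placeholder address, and the Host header must match the host defined in the LLMRoute.

```python
import json
import urllib.request

def build_completion_request(gateway_ip, model, prompt, max_tokens=100):
    """Build an OpenAI-style /v1/completions request for the ASM gateway.
    The Host header must match the `host` field of the LLMRoute."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
    }).encode()
    return urllib.request.Request(
        f"http://{gateway_ip}:8080/v1/completions",
        data=body,
        headers={"Host": "test.com", "Content-Type": "application/json"},
    )

# 192.0.2.10 is a placeholder; substitute your ASM gateway IP, then send
# the request with urllib.request.urlopen(req) and read the JSON response.
req = build_completion_request("192.0.2.10", "tweet-summary",
                               "Write as if you were a critic: San Francisco")
print(req.full_url)
```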

(Optional) Step 5: Set up observability for LLM inference services

After configuring InferencePool, InferenceModel, and LLMRoute, set up monitoring to track inference request rates, token throughput, and vLLM backend health.

Enable LLM traffic observability in ASM

  1. Enable LLM-specific log fields, metrics, and metric dimensions in the ASM console. See Traffic observation: Efficiently manage LLM traffic using ASM for configuration details. After configuration, ASM monitoring metrics include a model dimension, which you can collect with your Prometheus setup.

  2. Add scrape rules for ASM's LLM-specific token metrics (asm_llm_proxy_prompt_tokens and asm_llm_proxy_completion_tokens) to your Prometheus configuration. See Other Prometheus service discovery configurations for setup details.

       scrape_configs:
       - job_name: asm-envoy-stats-llm
         scrape_interval: 30s
         scrape_timeout: 30s
         metrics_path: /stats/prometheus
         scheme: http
         kubernetes_sd_configs:
         - role: pod
         relabel_configs:
         - source_labels:
           - __meta_kubernetes_pod_container_port_name
           action: keep
           regex: .*-envoy-prom
         - source_labels:
           - __address__
           - __meta_kubernetes_pod_annotation_prometheus_io_port
           action: replace
           regex: ([^:]+)(?::\d+)?;(\d+)
           replacement: $1:15090
           target_label: __address__
         - action: labelmap
           regex: __meta_kubernetes_pod_label_(.+)
         - source_labels:
           - __meta_kubernetes_namespace
           action: replace
           target_label: namespace
         - source_labels:
           - __meta_kubernetes_pod_name
           action: replace
           target_label: pod_name
         metric_relabel_configs:
         - action: keep
           source_labels:
           - __name__
           regex: asm_llm_.*

Collect vLLM backend metrics

The vLLM service exposes Prometheus metrics at /metrics on port 8000. The sample Deployment already includes the required annotations:

annotations:
  prometheus.io/path: /metrics
  prometheus.io/port: "8000"
  prometheus.io/scrape: "true"

Prometheus discovers and scrapes these endpoints automatically through its default service discovery mechanism. See Default pod service discovery for details.

Key vLLM metrics to monitor:

  • vllm:gpu_cache_usage_perc: KV cache utilization. Lower values indicate more GPU memory available for new requests.

  • vllm:request_queue_time_seconds_sum: Cumulative time requests spend waiting in the queue before the vLLM scheduler runs prefill and decode.

  • vllm:num_requests_running, vllm:num_requests_waiting, vllm:num_requests_swapped: Number of requests currently running inference, waiting in the queue, or swapped to CPU memory. Use these to assess backend pressure.

  • vllm:avg_generation_throughput_toks_per_s, vllm:avg_prompt_throughput_toks_per_s: Token throughput per second for the decode and prefill stages.

  • vllm:time_to_first_token_seconds_bucket: TTFT distribution. Measures how quickly clients receive the first token after submitting a request, a key metric for user-perceived latency.
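Because vllm:time_to_first_token_seconds_bucket is a Prometheus histogram, percentiles such as p95 TTFT must be estimated from cumulative bucket counts; in PromQL you would use histogram_quantile(). The Python sketch below mirrors that linear-interpolation estimate, with made-up bucket boundaries and counts:

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets,
    mirroring PromQL's histogram_quantile linear interpolation.
    `buckets` is a sorted list of (upper_bound, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # The target rank falls in this bucket; interpolate within it.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Illustrative TTFT buckets: (upper bound in seconds, cumulative requests).
ttft = [(0.1, 40), (0.25, 70), (0.5, 90), (1.0, 98), (2.5, 100)]
print(histogram_quantile(0.95, ttft))  # 0.8125 (estimated p95 TTFT)
```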

Create a Grafana dashboard

Set up a Grafana dashboard that combines ASM metrics (request rate, token throughput) with vLLM metrics (GPU cache, queue depth, TTFT) for a unified view of your inference fleet.

  1. Make sure that your Prometheus data source in Grafana is collecting both ASM and vLLM metrics.

  2. Import the dashboard JSON into Grafana. Download the complete dashboard JSON from the ASM documentation page (expand Dashboard JSON on that page to copy the full content).

  3. In Grafana, go to Dashboards > Import, paste the JSON, and select your Prometheus data source.

Grafana dashboard for LLM inference monitoring

(Optional) Compare with traditional load balancing

Use the observability dashboard to measure the difference between LLM-aware and traditional load balancing. Run this comparison before configuring LLM-aware routing in Step 3.

Note

If you have already completed Step 3, clean up the LLM routing resources first:

kubectl delete inferencemodel --all
kubectl delete inferencepool --all
kubectl delete llmroute --all
  1. Create a VirtualService that routes traffic using traditional round-robin load balancing.

       kubectl apply -f- <<EOF
       apiVersion: networking.istio.io/v1
       kind: VirtualService
       metadata:
         name: llm-vs
         namespace: default
       spec:
         gateways:
           - default/llm-inference-gateway
         hosts:
           - '*'
         http:
           - name: any-host
             route:
               - destination:
                   host: vllm-llama2-7b-pool.default.svc.cluster.local
                   port:
                     number: 8000
       EOF
  2. Run a stress test against the inference service using a tool such as llmperf.

  3. Delete the VirtualService, then complete Step 3 to configure LLM-aware routing. Make sure that no VirtualService resources remain before proceeding. Run the same stress test again.

  4. Compare the results in the Grafana dashboard. LLM-aware load balancing provides:

    • Lower TTFT (time to first token)

    • Higher token throughput

    • More even KV cache utilization across replicas

Performance comparison: traditional vs. LLM-aware load balancing

Clean up

To remove all resources created in this tutorial:

# Delete LLM routing resources
kubectl delete inferencemodel --all
kubectl delete inferencepool --all
kubectl delete llmroute --all

# Delete the gateway
kubectl delete -f gateway.yaml

# Delete the inference service
kubectl delete -f vllm-service.yaml
