
Alibaba Cloud Service Mesh:Configure canary release policies for Multi-LoRA models

Last Updated: Mar 11, 2026

When you run multiple LoRA (Low-Rank Adaptation) adapters on a shared base model, you need a way to control how inference traffic is distributed across them. Service Mesh (ASM) lets you assign weights to each LoRA adapter through an InferenceModel resource, so you can run canary releases and A/B tests: gradually shift traffic from one adapter version to another and validate results before a full rollout.

How it works

LoRA and Multi-LoRA

LoRA is a widely adopted technique for fine-tuning large language models (LLMs) cost-effectively. Instead of retraining the entire model, LoRA adds lightweight adapter layers that load alongside the base model at inference time.

Multi-LoRA extends this approach: multiple LoRA adapters share a single base model and GPU, each serving a different fine-tuned variant. The vLLM platform supports loading and serving multiple LoRA adapters simultaneously, routing requests by the model field in each API call.
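The name-to-adapter routing can be pictured as a lookup table keyed by the model field. The following Python sketch is an illustrative simplification, not vLLM's actual implementation; the adapter names and paths mirror the --lora-modules flags used later in this topic.

```python
# Hypothetical sketch of Multi-LoRA request routing: the "model" field of a
# request selects a LoRA adapter, falling back to the shared base model.
# This is an illustration only, not vLLM's real dispatch code.

def build_adapter_table(lora_modules):
    """Parse '--lora-modules' style 'name=path' entries into a lookup table."""
    return dict(entry.split("=", 1) for entry in lora_modules)

def resolve_adapter(request, table, base_model="/model/llama2"):
    """Return the weight path that would serve this request's 'model' field."""
    name = request.get("model")
    if name in table:
        return table[name]   # serve with the named LoRA adapter
    return base_model        # unknown name: fall back to the base model

table = build_adapter_table([
    "sql-lora=/adapters/yard1/llama-2-7b-sql-lora-test_0",
    "tweet-summary=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0",
])
print(resolve_adapter({"model": "sql-lora"}, table))
# → /adapters/yard1/llama-2-7b-sql-lora-test_0
```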

Traffic distribution through ASM

In a Multi-LoRA deployment, ASM routes inference requests to different LoRA adapters based on the model name in each request. You define an InferenceModel resource that maps a virtual model name to a set of target adapters with traffic weights. ASM then distributes incoming requests according to those weights.

A typical canary release workflow looks like this:

  1. Route 100% of traffic to the current adapter version.

  2. Deploy the new adapter version and shift 10% of traffic to it.

  3. Monitor metrics. If the new version performs well, gradually increase its traffic share.

  4. After validation, route 100% of traffic to the new version.
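The workflow above boils down to weighted random selection. The following Python sketch (an illustration, not ASM's actual algorithm) simulates step 2 of the workflow and shows how a 90/10 weight split converges to the configured traffic shares:

```python
# Simulate weighted traffic distribution across two adapter versions.
# random.choices stands in for the mesh's load balancer here; the observed
# split converges to the configured weights as request volume grows.
import random
from collections import Counter

def split_traffic(weights, n_requests, seed=0):
    """Route n_requests across targets by weight; return observed shares."""
    rng = random.Random(seed)
    targets = list(weights)
    picks = rng.choices(targets, weights=[weights[t] for t in targets],
                        k=n_requests)
    counts = Counter(picks)
    return {t: counts[t] / n_requests for t in targets}

# Step 2 of the workflow: 90% to the current adapter, 10% to the canary.
shares = split_traffic({"sql-lora": 90, "tweet-summary": 10},
                       n_requests=100_000)
print(shares)  # roughly {'sql-lora': 0.9, 'tweet-summary': 0.1}
```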

Prerequisites

Before you begin, make sure that you have:

  • An ASM instance with a data plane cluster (ACK or ACS) added to it.

  • kubectl access to both the ASM instance and the data plane cluster, because the following steps apply resources with both kubeconfig files.

  • GPU resources in the data plane cluster that meet the requirements noted in the next section.

Deploy the vLLM inference service

This step deploys a Llama-2-7b base model on vLLM with 10 LoRA adapters: sql-lora through sql-lora-4 (5 SQL-focused adapters) and tweet-summary through tweet-summary-4 (5 summarization adapters).

Note

  • The container image requires a GPU with more than 16 GiB of video memory. T4 GPUs (16 GiB) are insufficient. Use an A10 GPU for ACK clusters or an 8th-generation GPU for ACS clusters. For GPU model details, submit a ticket.

  • The LLM image is large. Store it in Container Registry (ACR) and pull it over the internal network to avoid slow downloads over the public network.
  1. Create a file named vllm-service.yaml with the following content.

    The YAML is provided in two variants: the first block below is for ACK clusters, and the second is for ACS clusters. The two are identical except for the ACS-specific compute-class labels on the pod template.

    The YAML defines three resources:

    Resource | Purpose
    Service (vllm-llama2-7b-pool) | Exposes the vLLM server on port 8000 as a ClusterIP service
    ConfigMap (chat-template) | Provides the Llama-2 chat prompt template
    Deployment (vllm-llama2-7b-pool) | Runs 3 replicas of the vLLM server with the base model and all 10 LoRA adapters loaded

    Key vLLM parameters:

    Parameter | Value | Description
    --enable-lora | - | Enables LoRA adapter support
    --max-loras | 10 | Maximum number of LoRA adapters loaded in GPU memory simultaneously
    --max-cpu-loras | 12 | Maximum number of LoRA adapters cached in CPU memory
    --gpu_memory_utilization | 0.8 | Fraction of GPU memory that vLLM may use for model weights, activations, and the KV cache
    --lora-modules | <name>=<path> | Maps adapter names to their weight file paths
    ACK cluster

       apiVersion: v1
       kind: Service
       metadata:
         name: vllm-llama2-7b-pool
       spec:
         selector:
           app: vllm-llama2-7b-pool
         ports:
         - protocol: TCP
           port: 8000
           targetPort: 8000
         type: ClusterIP
       ---
       apiVersion: v1
       kind: ConfigMap
       metadata:
         name: chat-template
       data:
         llama-2-chat.jinja: |
           {% if messages[0]['role'] == 'system' %}
             {% set system_message = '<<SYS>>\n' + messages[0]['content'] | trim + '\n<</SYS>>\n\n' %}
             {% set messages = messages[1:] %}
           {% else %}
               {% set system_message = '' %}
           {% endif %}
    
           {% for message in messages %}
               {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
                   {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
               {% endif %}
    
               {% if loop.index0 == 0 %}
                   {% set content = system_message + message['content'] %}
               {% else %}
                   {% set content = message['content'] %}
               {% endif %}
               {% if message['role'] == 'user' %}
                   {{ bos_token + '[INST] ' + content | trim + ' [/INST]' }}
               {% elif message['role'] == 'assistant' %}
                   {{ ' ' + content | trim + ' ' + eos_token }}
               {% endif %}
           {% endfor %}
       ---
       apiVersion: apps/v1
       kind: Deployment
       metadata:
         name: vllm-llama2-7b-pool
         namespace: default
       spec:
         replicas: 3
         selector:
           matchLabels:
             app: vllm-llama2-7b-pool
         template:
           metadata:
             annotations:
               prometheus.io/path: /metrics
               prometheus.io/port: '8000'
               prometheus.io/scrape: 'true'
             labels:
               app: vllm-llama2-7b-pool
           spec:
             containers:
               - name: lora
                 image: "registry-cn-hangzhou-vpc.ack.aliyuncs.com/dev/llama2-with-lora:v0.2"
                 imagePullPolicy: IfNotPresent
                 command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
                 args:
                 - "--model"
                 - "/model/llama2"
                 - "--tensor-parallel-size"
                 - "1"
                 - "--port"
                 - "8000"
                 - '--gpu_memory_utilization'
                 - '0.8'
                 - "--enable-lora"
                 - "--max-loras"
                 - "10"
                 - "--max-cpu-loras"
                 - "12"
                 - "--lora-modules"
                 - 'sql-lora=/adapters/yard1/llama-2-7b-sql-lora-test_0'
                 - 'sql-lora-1=/adapters/yard1/llama-2-7b-sql-lora-test_1'
                 - 'sql-lora-2=/adapters/yard1/llama-2-7b-sql-lora-test_2'
                 - 'sql-lora-3=/adapters/yard1/llama-2-7b-sql-lora-test_3'
                 - 'sql-lora-4=/adapters/yard1/llama-2-7b-sql-lora-test_4'
                 - 'tweet-summary=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0'
                 - 'tweet-summary-1=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1'
                 - 'tweet-summary-2=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2'
                 - 'tweet-summary-3=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3'
                 - 'tweet-summary-4=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4'
                 - '--chat-template'
                 - '/etc/vllm/llama-2-chat.jinja'
                 env:
                   - name: PORT
                     value: "8000"
                 ports:
                   - containerPort: 8000
                     name: http
                     protocol: TCP
                 livenessProbe:
                   failureThreshold: 2400
                   httpGet:
                     path: /health
                     port: http
                     scheme: HTTP
                   initialDelaySeconds: 5
                   periodSeconds: 5
                   successThreshold: 1
                   timeoutSeconds: 1
                 readinessProbe:
                   failureThreshold: 6000
                   httpGet:
                     path: /health
                     port: http
                     scheme: HTTP
                   initialDelaySeconds: 5
                   periodSeconds: 5
                   successThreshold: 1
                   timeoutSeconds: 1
                 resources:
                   limits:
                     cpu: 16
                     memory: 64Gi
                     nvidia.com/gpu: 1
                   requests:
                     cpu: 8
                     memory: 30Gi
                     nvidia.com/gpu: 1
                 volumeMounts:
                   - mountPath: /data
                     name: data
                   - mountPath: /dev/shm
                     name: shm
                   - mountPath: /etc/vllm
                     name: chat-template
             restartPolicy: Always
             schedulerName: default-scheduler
             terminationGracePeriodSeconds: 30
             volumes:
               - name: data
                 emptyDir: {}
               - name: shm
                 emptyDir:
                   medium: Memory
               - name: chat-template
                 configMap:
                   name: chat-template

    ACS cluster

       apiVersion: v1
       kind: Service
       metadata:
         name: vllm-llama2-7b-pool
       spec:
         selector:
           app: vllm-llama2-7b-pool
         ports:
         - protocol: TCP
           port: 8000
           targetPort: 8000
         type: ClusterIP
       ---
       apiVersion: v1
       kind: ConfigMap
       metadata:
         name: chat-template
       data:
         llama-2-chat.jinja: |
           {% if messages[0]['role'] == 'system' %}
             {% set system_message = '<<SYS>>\n' + messages[0]['content'] | trim + '\n<</SYS>>\n\n' %}
             {% set messages = messages[1:] %}
           {% else %}
               {% set system_message = '' %}
           {% endif %}
    
           {% for message in messages %}
               {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
                   {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
               {% endif %}
    
               {% if loop.index0 == 0 %}
                   {% set content = system_message + message['content'] %}
               {% else %}
                   {% set content = message['content'] %}
               {% endif %}
               {% if message['role'] == 'user' %}
                   {{ bos_token + '[INST] ' + content | trim + ' [/INST]' }}
               {% elif message['role'] == 'assistant' %}
                   {{ ' ' + content | trim + ' ' + eos_token }}
               {% endif %}
           {% endfor %}
       ---
       apiVersion: apps/v1
       kind: Deployment
       metadata:
         name: vllm-llama2-7b-pool
         namespace: default
       spec:
         replicas: 3
         selector:
           matchLabels:
             app: vllm-llama2-7b-pool
         template:
           metadata:
             annotations:
               prometheus.io/path: /metrics
               prometheus.io/port: '8000'
               prometheus.io/scrape: 'true'
             labels:
               app: vllm-llama2-7b-pool
               alibabacloud.com/compute-class: gpu
               alibabacloud.com/compute-qos: default
               alibabacloud.com/gpu-model-series: "example-model" # Replace with your actual GPU model series
           spec:
             containers:
               - name: lora
                 image: "registry-cn-hangzhou-vpc.ack.aliyuncs.com/dev/llama2-with-lora:v0.2"
                 imagePullPolicy: IfNotPresent
                 command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
                 args:
                 - "--model"
                 - "/model/llama2"
                 - "--tensor-parallel-size"
                 - "1"
                 - "--port"
                 - "8000"
                 - '--gpu_memory_utilization'
                 - '0.8'
                 - "--enable-lora"
                 - "--max-loras"
                 - "10"
                 - "--max-cpu-loras"
                 - "12"
                 - "--lora-modules"
                 - 'sql-lora=/adapters/yard1/llama-2-7b-sql-lora-test_0'
                 - 'sql-lora-1=/adapters/yard1/llama-2-7b-sql-lora-test_1'
                 - 'sql-lora-2=/adapters/yard1/llama-2-7b-sql-lora-test_2'
                 - 'sql-lora-3=/adapters/yard1/llama-2-7b-sql-lora-test_3'
                 - 'sql-lora-4=/adapters/yard1/llama-2-7b-sql-lora-test_4'
                 - 'tweet-summary=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0'
                 - 'tweet-summary-1=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1'
                 - 'tweet-summary-2=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2'
                 - 'tweet-summary-3=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3'
                 - 'tweet-summary-4=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4'
                 - '--chat-template'
                 - '/etc/vllm/llama-2-chat.jinja'
                 env:
                   - name: PORT
                     value: "8000"
                 ports:
                   - containerPort: 8000
                     name: http
                     protocol: TCP
                 livenessProbe:
                   failureThreshold: 2400
                   httpGet:
                     path: /health
                     port: http
                     scheme: HTTP
                   initialDelaySeconds: 5
                   periodSeconds: 5
                   successThreshold: 1
                   timeoutSeconds: 1
                 readinessProbe:
                   failureThreshold: 6000
                   httpGet:
                     path: /health
                     port: http
                     scheme: HTTP
                   initialDelaySeconds: 5
                   periodSeconds: 5
                   successThreshold: 1
                   timeoutSeconds: 1
                 resources:
                   limits:
                     cpu: 16
                     memory: 64Gi
                     nvidia.com/gpu: 1
                   requests:
                     cpu: 8
                     memory: 30Gi
                     nvidia.com/gpu: 1
                 volumeMounts:
                   - mountPath: /data
                     name: data
                   - mountPath: /dev/shm
                     name: shm
                   - mountPath: /etc/vllm
                     name: chat-template
             restartPolicy: Always
             schedulerName: default-scheduler
             terminationGracePeriodSeconds: 30
             volumes:
               - name: data
                 emptyDir: {}
               - name: shm
                 emptyDir:
                   medium: Memory
               - name: chat-template
                 configMap:
                   name: chat-template
  2. Deploy the service using the data plane cluster kubeconfig:

       kubectl apply -f vllm-service.yaml
  3. Verify that all pods are running. Wait until all 3 replicas show Running status and READY is 1/1 (or 2/2 if sidecar injection is enabled).

       kubectl get pods -l app=vllm-llama2-7b-pool

Configure ASM gateway rules

Set up the ASM ingress gateway to listen on port 8080 for HTTP traffic.

  1. Create a file named gateway.yaml:

       apiVersion: networking.istio.io/v1
       kind: Gateway
       metadata:
         name: llm-inference-gateway
         namespace: default
       spec:
         selector:
           istio: ingressgateway
         servers:
           - hosts:
               - '*'
             port:
               name: http-service
               number: 8080
               protocol: HTTP
  2. Apply the gateway rule using the ASM kubeconfig:

       kubectl apply -f gateway.yaml

Configure routing and traffic distribution

This step creates three resources that work together to route and distribute inference traffic:

ResourceRole
InferencePoolSelects the vLLM pods that serve inference requests
InferenceModelDefines which LoRA adapters receive traffic and at what weights
LLMRouteConnects the ingress gateway to the InferencePool

Enable the Gateway API inference extension

Run the following command using the ASM kubeconfig:

kubectl patch asmmeshconfig default --type=merge \
  --patch='{"spec":{"gatewayAPIInferenceExtension":{"enabled":true}}}'

Create the InferencePool

The InferencePool selects vLLM pods by label and enables ASM to perform inference-aware load balancing across them.

  1. Create a file named inferencepool.yaml:

       apiVersion: inference.networking.x-k8s.io/v1alpha1
       kind: InferencePool
       metadata:
         name: vllm-llama2-7b-pool
       spec:
         targetPortNumber: 8000
         selector:
           app: vllm-llama2-7b-pool
  2. Apply the resource using the data plane cluster kubeconfig:

       kubectl apply -f inferencepool.yaml

Create the InferenceModel

The InferenceModel maps a virtual model name to a set of LoRA adapters with traffic weights. When a request specifies model: "lora-request", ASM distributes it to one of the target adapters based on the configured weights.
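Weights in targetModels are relative values rather than percentages: each adapter receives weight / sum(weights) of the traffic. The following Python sketch (illustrative only; ASM performs the actual selection) verifies the group shares used in this topic:

```python
# Compute per-adapter and per-group traffic shares from InferenceModel
# weights. Weights are normalized by their sum, so 10 adapters with weight
# 10 each receive 10% of traffic apiece, or 50% per 5-adapter group.

def traffic_shares(target_models):
    """Map each adapter name to its normalized traffic share."""
    total = sum(target_models.values())
    return {name: w / total for name, w in target_models.items()}

adapters = ["tweet-summary", "tweet-summary-1", "tweet-summary-2",
            "tweet-summary-3", "tweet-summary-4",
            "sql-lora", "sql-lora-1", "sql-lora-2",
            "sql-lora-3", "sql-lora-4"]
equal = traffic_shares({name: 10 for name in adapters})
sql_share = sum(v for k, v in equal.items() if k.startswith("sql-lora"))
print(round(sql_share, 2))  # → 0.5  (each adapter gets 10%, the group 50%)
```

The same arithmetic gives the canary split shown later in this step: five sql-lora adapters with weight 18 yield 5 × 18 / 100 = 90% for that group.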

  1. Create a file named inferencemodel.yaml with the following content. In this example, all 10 adapters have an equal weight of 10, so traffic splits evenly: 50% to the tweet-summary group and 50% to the sql-lora group.

       apiVersion: inference.networking.x-k8s.io/v1alpha1
       kind: InferenceModel
       metadata:
         name: inferencemodel-sample
       spec:
         modelName: lora-request
         poolRef:
           group: inference.networking.x-k8s.io
           kind: InferencePool
           name: vllm-llama2-7b-pool
         targetModels:
         - name: tweet-summary
           weight: 10
         - name: tweet-summary-1
           weight: 10
         - name: tweet-summary-2
           weight: 10
         - name: tweet-summary-3
           weight: 10
         - name: tweet-summary-4
           weight: 10
         - name: sql-lora
           weight: 10
         - name: sql-lora-1
           weight: 10
         - name: sql-lora-2
           weight: 10
         - name: sql-lora-3
           weight: 10
         - name: sql-lora-4
           weight: 10

     Adjust weights for a canary release

     To gradually shift traffic between adapter groups, change the weights. For example, to route 90% of traffic to the sql-lora adapters and 10% to the tweet-summary adapters, set targetModels as follows. After validating the new adapter group, increase its weight further until it reaches 100%.

       targetModels:
       - name: sql-lora
         weight: 18    # 90 / 5 adapters = 18 per adapter
       - name: sql-lora-1
         weight: 18
       # ... remaining sql-lora adapters with weight: 18
       - name: tweet-summary
         weight: 2     # 10 / 5 adapters = 2 per adapter
       - name: tweet-summary-1
         weight: 2
       # ... remaining tweet-summary adapters with weight: 2
  2. Apply the resource:

       kubectl apply -f inferencemodel.yaml

Create the LLMRoute

The LLMRoute binds the gateway to the InferencePool, routing all requests on port 8080 to the inference service.

  1. Create a file named llmroute.yaml:

       apiVersion: istio.alibabacloud.com/v1
       kind: LLMRoute
       metadata:
         name: test-llm-route
       spec:
         gateways:
         - llm-inference-gateway
         host: test.com
         rules:
         - backendRefs:
           - backendRef:
               group: inference.networking.x-k8s.io
               kind: InferencePool
               name: vllm-llama2-7b-pool
  2. Apply the resource using the ASM kubeconfig:

       kubectl apply -f llmroute.yaml

Verify the traffic distribution

Before sending test traffic, confirm that all resources are ready.

  1. Check that the InferencePool and InferenceModel resources exist:

       kubectl get inferencepool vllm-llama2-7b-pool
       kubectl get inferencemodel inferencemodel-sample
  2. Send multiple requests through the gateway. Replace ${ASM_GATEWAY_IP} with the IP address of your ASM ingress gateway.

       curl -H "host: test.com" ${ASM_GATEWAY_IP}:8080/v1/completions \
         -H 'Content-Type: application/json' \
         -d '{
           "model": "lora-request",
           "prompt": "Write as if you were a critic: San Francisco",
           "max_tokens": 100,
           "temperature": 0
         }' -v
  3. Check the response. The model field indicates which LoRA adapter served the request. After sending multiple requests, the ratio between tweet-summary and sql-lora responses should be approximately 1:1.

       {
         "id": "cmpl-2fc9a351-d866-422b-b561-874a30843a6b",
         "object": "text_completion",
         "created": 1736933141,
         "model": "tweet-summary-1",
         "choices": [
           {
             "index": 0,
             "text": "...",
             "finish_reason": "length"
           }
         ],
         "usage": {
           "prompt_tokens": 2,
           "total_tokens": 102,
           "completion_tokens": 100
         }
       }
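To quantify the split, you can tally the model field across many responses. The sketch below uses synthetic response objects in place of real gateway output, and the grouping heuristic (stripping a trailing -<n> suffix) is an assumption for illustration:

```python
# Tally which adapter group served each response to estimate the observed
# traffic split. The "responses" here are synthetic stand-ins for the JSON
# bodies returned by the gateway.
from collections import Counter

def observed_split(responses):
    """Group adapters by base name (strip a trailing -<n> suffix), return shares."""
    counts = Counter()
    for resp in responses:
        name = resp["model"]
        base = name.rsplit("-", 1)[0] if name[-1].isdigit() else name
        counts[base] += 1
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

fake = [{"model": m} for m in
        ["tweet-summary-1", "sql-lora", "sql-lora-3", "tweet-summary"]]
print(observed_split(fake))  # → {'tweet-summary': 0.5, 'sql-lora': 0.5}
```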

(Optional) Set up observability

After you configure the InferencePool, InferenceModel, and LLMRoute resources, monitor the LLM inference service through ASM metrics and vLLM metrics.

Collect ASM metrics

  1. Enable LLM traffic observability in the ASM console. This adds model-level dimensions to ASM monitoring metrics, including log fields and metric labels. For detailed instructions, see Efficiently manage LLM traffic using ASM.

  2. Collect the metrics using Prometheus within the observability framework or a self-hosted Prometheus instance.

  3. ASM exposes two LLM-specific metrics:

     Metric | Description
     asm_llm_proxy_prompt_tokens | Number of input tokens per request
     asm_llm_proxy_completion_tokens | Number of output tokens per request

     Add the following scrape configuration to Prometheus. For details, see Other Prometheus service discovery configurations.
       scrape_configs:
       - job_name: asm-envoy-stats-llm
         scrape_interval: 30s
         scrape_timeout: 30s
         metrics_path: /stats/prometheus
         scheme: http
         kubernetes_sd_configs:
         - role: pod
         relabel_configs:
         - source_labels:
           - __meta_kubernetes_pod_container_port_name
           action: keep
           regex: .*-envoy-prom
         - source_labels:
           - __address__
           - __meta_kubernetes_pod_annotation_prometheus_io_port
           action: replace
           regex: ([^:]+)(?::\d+)?;(\d+)
           replacement: $1:15090
           target_label: __address__
         - action: labelmap
           regex: __meta_kubernetes_pod_label_(.+)
         - source_labels:
           - __meta_kubernetes_namespace
           action: replace
           target_label: namespace
         - source_labels:
           - __meta_kubernetes_pod_name
           action: replace
           target_label: pod_name
         metric_relabel_configs:
         - action: keep
           source_labels:
           - __name__
           regex: asm_llm_.*
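Both token metrics are monotonically increasing counters, so a tokens-per-second figure comes from differencing two samples over the scrape interval, which is what PromQL's rate() does. The sample values in this Python sketch are invented for illustration:

```python
# Approximate PromQL rate() for a counter metric such as
# asm_llm_proxy_completion_tokens: delta between two scrapes divided by the
# scrape interval. A negative delta indicates a counter reset (for example,
# a pod restart), so that window is skipped.

def tokens_per_second(sample_t0, sample_t1, interval_s):
    """Per-second rate over one scrape interval, or None on counter reset."""
    delta = sample_t1 - sample_t0
    if delta < 0:
        return None
    return delta / interval_s

# Two samples taken 30s apart (invented values):
print(tokens_per_second(12_000, 13_500, 30))  # → 50.0
```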

Collect vLLM metrics

  1. The deployment YAML already includes Prometheus annotations on the vLLM pods, so metrics can be scraped automatically through Prometheus default service discovery. For details, see Default service discovery.

       annotations:
         prometheus.io/path: /metrics
         prometheus.io/port: "8000"
         prometheus.io/scrape: "true"
  2. Key vLLM metrics:

    Metric | Description
    vllm:gpu_cache_usage_perc | GPU KV cache utilization. Lower values mean more capacity for new requests.
    vllm:request_queue_time_seconds_sum | Time requests spend waiting in the queue before inference begins.
    vllm:num_requests_running | Number of requests currently running inference.
    vllm:num_requests_waiting | Number of requests waiting in the queue.
    vllm:num_requests_swapped | Number of requests swapped to CPU memory.
    vllm:avg_generation_throughput_toks_per_s | Decode-stage token throughput (tokens/second).
    vllm:avg_prompt_throughput_toks_per_s | Prefill-stage token throughput (tokens/second).
    vllm:time_to_first_token_seconds_bucket | Time from request arrival to first token output (TTFT). A key metric for user-perceived latency.
    vllm:e2e_request_latency_seconds_bucket | End-to-end request latency.
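The *_bucket metrics are Prometheus histograms, so latency percentiles such as TTFT P95 must be estimated from cumulative bucket counts. The following Python sketch mirrors the linear interpolation used by PromQL's histogram_quantile; the bucket bounds and counts are invented for illustration:

```python
# Estimate a quantile from a Prometheus histogram's cumulative buckets,
# interpolating linearly within the bucket that contains the target rank
# (the same approach as PromQL's histogram_quantile function).

def histogram_quantile(q, buckets):
    """buckets: sorted (upper_bound, cumulative_count) pairs; last bound is +inf."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into the +inf bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + fraction * (bound - prev_bound)
        prev_bound, prev_count = bound, count

# Invented TTFT buckets: 40 requests under 0.1s, 80 under 0.5s, 95 under 1s.
ttft = [(0.1, 40), (0.5, 80), (1.0, 95), (float("inf"), 100)]
print(histogram_quantile(0.95, ttft))  # → 1.0 (P95 TTFT in seconds)
```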

Build a Grafana dashboard

Build a Grafana dashboard to visualize both ASM and vLLM metrics:

  • ASM metrics: Track request rate and token throughput per model.

  • vLLM metrics: Monitor GPU cache utilization, queue depth, and per-request latency.

To set up the dashboard:

  1. Add your Prometheus instance as a data source in the Grafana console.

  2. Import the dashboard JSON provided below.

Grafana dashboard JSON:

    {
      "annotations": {
        "list": [
          {
            "builtIn": 1,
            "datasource": {
              "type": "grafana",
              "uid": "-- Grafana --"
            },
            "enable": true,
            "hide": true,
            "iconColor": "rgba(0, 211, 255, 1)",
            "name": "Annotations & Alerts",
            "target": {
              "limit": 100,
              "matchAny": false,
              "tags": [],
              "type": "dashboard"
            },
            "type": "dashboard"
          }
        ]
      },
      "description": "Monitoring vLLM Inference Server",
      "editable": true,
      "fiscalYearStartMonth": 0,
      "graphTooltip": 0,
      "id": 49,
      "links": [],
      "liveNow": false,
      "panels": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "description": "",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green"
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 0
          },
          "id": 23,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "editorMode": "builder",
              "exemplar": false,
              "expr": "sum by(model) (rate(istio_requests_total{model!=\"unknown\"}[$__rate_interval]))",
              "instant": false,
              "interval": "",
              "legendFormat": "__auto",
              "range": true,
              "refId": "A"
            }
          ],
          "title": "Request Rate",
          "type": "timeseries"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
          },
          "description": "",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green"
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 12,
            "y": 0
          },
          "id": 20,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
              },
              "editorMode": "code",
              "expr": "sum by(llmproxy_model) (rate(asm_llm_proxy_completion_tokens{}[$__rate_interval]))",
              "instant": false,
              "legendFormat": "generate tokens (from proxy)",
              "range": true,
              "refId": "A"
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
              },
              "editorMode": "code",
              "expr": "sum by(llmproxy_model) (rate(asm_llm_proxy_prompt_tokens{}[$__rate_interval]))",
              "hide": false,
              "instant": false,
              "legendFormat": "prompt tokens (from proxy)",
              "range": true,
              "refId": "B"
            }
          ],
          "title": "Tokens Rate",
          "type": "timeseries"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "thresholds"
              },
              "mappings": [],
              "min": -1,
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              },
              "unit": "percentunit"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 8
          },
          "id": 17,
          "options": {
            "colorMode": "value",
            "graphMode": "area",
            "justifyMode": "auto",
            "orientation": "auto",
            "reduceOptions": {
              "calcs": [
                "mean"
              ],
              "fields": "",
              "values": false
            },
            "textMode": "auto"
          },
          "pluginVersion": "10.0.9",
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "editorMode": "builder",
              "expr": "avg(vllm:gpu_cache_usage_perc)",
              "hide": false,
              "instant": false,
              "range": true,
              "refId": "A"
            }
          ],
          "title": "Average gpu cache usage",
          "type": "stat"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "thresholds"
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              },
              "unit": "s"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 12,
            "y": 8
          },
          "id": 18,
          "options": {
            "colorMode": "value",
            "graphMode": "area",
            "justifyMode": "auto",
            "orientation": "auto",
            "reduceOptions": {
              "calcs": [
                "mean"
              ],
              "fields": "",
              "values": false
            },
            "textMode": "auto"
          },
          "pluginVersion": "10.0.9",
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "editorMode": "code",
              "expr": "avg(rate(vllm:request_queue_time_seconds_sum{model_name=\"$model_name\"}[$__rate_interval]))",
              "hide": false,
              "instant": false,
              "range": true,
              "refId": "C"
            }
          ],
          "title": "Average Queue Time",
          "type": "stat"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "description": "Percentage of used cache blocks by vLLM.",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              },
              "unit": "percentunit"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 16
          },
          "id": 4,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "editorMode": "builder",
              "expr": "sum by(kubernetes_pod_name) (vllm:gpu_cache_usage_perc{model_name=\"$model_name\"})",
              "instant": false,
              "legendFormat": "GPU Cache Usage ({{kubernetes_pod_name}})",
              "range": true,
              "refId": "A"
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "editorMode": "code",
              "expr": "vllm:cpu_cache_usage_perc{model_name=\"$model_name\"}",
              "hide": false,
              "instant": false,
              "legendFormat": "CPU Cache Usage",
              "range": true,
              "refId": "B"
            }
          ],
          "title": "Cache Utilization",
          "type": "timeseries"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "seconds",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 12,
            "y": 16
          },
          "id": 14,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "edx8memhpd9tsa"
              },
              "disableTextWrap": false,
              "editorMode": "code",
              "expr": "sum by(kubernetes_pod_name) (rate(vllm:request_queue_time_seconds_sum{model_name=\"$model_name\"}[$__rate_interval]))",
              "fullMetaSearch": false,
              "includeNullMetadata": true,
              "instant": false,
              "legendFormat": "__auto",
              "range": true,
              "refId": "A",
              "useBackend": false
            }
          ],
          "title": "Queue Time",
          "type": "timeseries"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "description": "P50, P90, P95, and P99 TTFT latency in seconds.",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green"
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              },
              "unit": "s"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 24
          },
          "id": 5,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.99, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P99",
              "range": true,
              "refId": "A",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.95, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P95",
              "range": true,
              "refId": "B",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.9, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P90",
              "range": true,
              "refId": "C",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.5, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P50",
              "range": true,
              "refId": "D",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "editorMode": "builder",
              "expr": "sum by(kubernetes_pod_name) (rate(vllm:time_to_first_token_seconds_sum{model_name=\"$model_name\"}[$__rate_interval])) / sum by(kubernetes_pod_name) (rate(vllm:time_to_first_token_seconds_count{model_name=\"$model_name\"}[$__rate_interval]))",
              "hide": false,
              "instant": false,
              "legendFormat": "Average ({{kubernetes_pod_name}})",
              "range": true,
              "refId": "E"
            }
          ],
          "title": "Time To First Token Latency",
          "type": "timeseries"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "description": "Number of tokens processed per second",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green"
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 12,
            "y": 24
          },
          "id": 8,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "rate(vllm:prompt_tokens_total{model_name=\"$model_name\"}[$__rate_interval])",
              "fullMetaSearch": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "Prompt Tokens/Sec",
              "range": true,
              "refId": "A",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "code",
              "expr": "sum by(kubernetes_pod_name) (rate(vllm:generation_tokens_total{model_name=\"$model_name\"}[$__rate_interval]))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "Generation Tokens/Sec ({{kubernetes_pod_name}})",
              "range": true,
              "refId": "B",
              "useBackend": false
            }
          ],
          "title": "Token Throughput",
          "type": "timeseries"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
          },
          "description": "End to end request latency measured in seconds.",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green"
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              },
              "unit": "s"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 32
          },
          "id": 9,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.99, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P99",
              "range": true,
              "refId": "A",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.95, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P95",
              "range": true,
              "refId": "B",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.9, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P90",
              "range": true,
              "refId": "C",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.5, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P50",
              "range": true,
              "refId": "D",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
              },
              "editorMode": "code",
              "expr": "rate(vllm:e2e_request_latency_seconds_sum{model_name=\"$model_name\"}[$__rate_interval])\n/\nrate(vllm:e2e_request_latency_seconds_count{model_name=\"$model_name\"}[$__rate_interval])",
              "hide": false,
              "instant": false,
              "legendFormat": "Average",
              "range": true,
              "refId": "E"
            }
          ],
          "title": "E2E Request Latency",
          "type": "timeseries"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "description": "Number of requests in RUNNING, WAITING, and SWAPPED state",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green"
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              },
              "unit": "none"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 12,
            "y": 32
          },
          "id": 3,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "vllm:num_requests_running{model_name=\"$model_name\"}",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": true,
              "instant": false,
              "legendFormat": "Num Running",
              "range": true,
              "refId": "A",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "vllm:num_requests_swapped{model_name=\"$model_name\"}",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": true,
              "instant": false,
              "legendFormat": "Num Swapped",
              "range": true,
              "refId": "B",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "exemplar": false,
              "expr": "sum by(kubernetes_pod_name) (vllm:num_requests_waiting{model_name=\"$model_name\"})",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": true,
              "instant": false,
              "legendFormat": "Num Waiting for {{kubernetes_pod_name}}",
              "range": true,
              "refId": "C",
              "useBackend": false
            }
          ],
          "title": "Scheduler State",
          "type": "timeseries"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "description": "Inter token latency in seconds.",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green"
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              },
              "unit": "s"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 40
          },
          "id": 10,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.99, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P99",
              "range": true,
              "refId": "A",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.95, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P95",
              "range": true,
              "refId": "B",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.9, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P90",
              "range": true,
              "refId": "C",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.5, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P50",
              "range": true,
              "refId": "D",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "editorMode": "code",
              "expr": "rate(vllm:time_per_output_token_seconds_sum{model_name=\"$model_name\"}[$__rate_interval])\n/\nrate(vllm:time_per_output_token_seconds_count{model_name=\"$model_name\"}[$__rate_interval])",
              "hide": false,
              "instant": false,
              "legendFormat": "Mean",
              "range": true,
              "refId": "E"
            }
          ],
          "title": "Time Per Output Token Latency",
          "type": "timeseries"
        },
        {
          "datasource": {
            "default": false,
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green"
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": [
              {
                "__systemRef": "hideSeriesFrom",
                "matcher": {
                  "id": "byNames",
                  "options": {
                    "mode": "exclude",
                    "names": [
                      "Decode"
                    ],
                    "prefix": "All except:",
                    "readOnly": true
                  }
                },
                "properties": [
                  {
                    "id": "custom.hideFrom",
                    "value": {
                      "legend": false,
                      "tooltip": false,
                      "viz": true
                    }
                  }
                ]
              }
            ]
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 12,
            "y": 40
          },
          "id": 15,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "edx8memhpd9tsa"
              },
              "disableTextWrap": false,
              "editorMode": "code",
              "expr": "rate(vllm:request_prefill_time_seconds_sum{model_name=\"$model_name\"}[$__rate_interval])",
              "fullMetaSearch": false,
              "includeNullMetadata": true,
              "instant": false,
              "legendFormat": "Prefill",
              "range": true,
              "refId": "A",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "editorMode": "code",
              "expr": "rate(vllm:request_decode_time_seconds_sum{model_name=\"$model_name\"}[$__rate_interval])",
              "hide": false,
              "instant": false,
              "legendFormat": "Decode",
              "range": true,
              "refId": "B"
            }
          ],
          "title": "Requests Prefill and Decode Time",
          "type": "timeseries"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "description": "Heatmap of request prompt length",
          "fieldConfig": {
            "defaults": {
              "custom": {
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "scaleDistribution": {
                  "type": "linear"
                }
              }
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 48
          },
          "id": 12,
          "options": {
            "calculate": false,
            "cellGap": 1,
            "cellValues": {
              "unit": "none"
            },
            "color": {
              "exponent": 0.5,
              "fill": "dark-orange",
              "min": 0,
              "mode": "scheme",
              "reverse": false,
              "scale": "exponential",
              "scheme": "Spectral",
              "steps": 64
            },
            "exemplars": {
              "color": "rgba(255,0,255,0.7)"
            },
            "filterValues": {
              "le": 1e-9
            },
            "legend": {
              "show": true
            },
            "rowsFrame": {
              "layout": "auto",
              "value": "Request count"
            },
            "tooltip": {
              "mode": "single",
              "show": true,
              "showColorScale": false,
              "yHistogram": true
            },
            "yAxis": {
              "axisLabel": "Prompt Length",
              "axisPlacement": "left",
              "reverse": false,
              "unit": "none"
            }
          },
          "pluginVersion": "10.0.9",
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "sum by(le) (increase(vllm:request_prompt_tokens_bucket{model_name=\"$model_name\"}[$__rate_interval]))",
              "format": "heatmap",
              "fullMetaSearch": false,
              "includeNullMetadata": true,
              "instant": false,
              "legendFormat": "{{le}}",
              "range": true,
              "refId": "A",
              "useBackend": false
            }
          ],
          "title": "Request Prompt Length",
          "type": "heatmap"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "description": "Heatmap of request generation length",
          "fieldConfig": {
            "defaults": {
              "custom": {
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "scaleDistribution": {
                  "type": "linear"
                }
              }
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 12,
            "y": 48
          },
          "id": 13,
          "options": {
            "calculate": false,
            "cellGap": 1,
            "cellValues": {
              "unit": "none"
            },
            "color": {
              "exponent": 0.5,
              "fill": "dark-orange",
              "min": 0,
              "mode": "scheme",
              "reverse": false,
              "scale": "exponential",
              "scheme": "Spectral",
              "steps": 64
            },
            "exemplars": {
              "color": "rgba(255,0,255,0.7)"
            },
            "filterValues": {
              "le": 1e-9
            },
            "legend": {
              "show": true
            },
            "rowsFrame": {
              "layout": "auto",
              "value": "Request count"
            },
            "tooltip": {
              "mode": "single",
              "show": true,
              "showColorScale": false,
              "yHistogram": true
            },
            "yAxis": {
              "axisLabel": "Generation Length",
              "axisPlacement": "left",
              "reverse": false,
              "unit": "none"
            }
          },
          "pluginVersion": "10.0.9",
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "sum by(le) (increase(vllm:request_generation_tokens_bucket{model_name=\"$model_name\"}[$__rate_interval]))",
              "format": "heatmap",
              "fullMetaSearch": false,
              "includeNullMetadata": true,
              "instant": false,
              "legendFormat": "{{le}}",
              "range": true,
              "refId": "A",
              "useBackend": false
            }
          ],
          "title": "Request Generation Length",
          "type": "heatmap"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "description": "Number of finished requests by their finish reason: either an EOS token was generated or the max sequence length was reached.",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green"
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 56
          },
          "id": 11,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "sum by(finished_reason) (increase(vllm:request_success_total{model_name=\"$model_name\"}[$__rate_interval]))",
              "fullMetaSearch": false,
              "includeNullMetadata": true,
              "instant": false,
              "interval": "",
              "legendFormat": "__auto",
              "range": true,
              "refId": "A",
              "useBackend": false
            }
          ],
          "title": "Finish Reason",
          "type": "timeseries"
        },
        {
          "datasource": {
            "default": false,
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green"
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": [
              {
                "__systemRef": "hideSeriesFrom",
                "matcher": {
                  "id": "byNames",
                  "options": {
                    "mode": "exclude",
                    "names": [
                      "Tokens"
                    ],
                    "prefix": "All except:",
                    "readOnly": true
                  }
                },
                "properties": [
                  {
                    "id": "custom.hideFrom",
                    "value": {
                      "legend": false,
                      "tooltip": false,
                      "viz": true
                    }
                  }
                ]
              }
            ]
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 12,
            "y": 56
          },
          "id": 16,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "edx8memhpd9tsa"
              },
              "disableTextWrap": false,
              "editorMode": "code",
              "expr": "rate(vllm:request_max_num_generation_tokens_sum{model_name=\"$model_name\"}[$__rate_interval])",
              "fullMetaSearch": false,
              "includeNullMetadata": true,
              "instant": false,
              "legendFormat": "Tokens",
              "range": true,
              "refId": "A",
              "useBackend": false
            }
          ],
          "title": "Max Generation Token in Sequence Group",
          "type": "timeseries"
        }
      ],
      "refresh": false,
      "schemaVersion": 38,
      "style": "dark",
      "tags": [],
      "templating": {
        "list": [
          {
            "current": {
              "selected": true,
              "text": "prom-cec64713b1aab44d0b49236b6f54cd671",
              "value": "prom-cec64713b1aab44d0b49236b6f54cd671"
            },
            "hide": 0,
            "includeAll": false,
            "label": "datasource",
            "multi": false,
            "name": "DS_PROMETHEUS",
            "options": [],
            "query": "prometheus",
            "queryValue": "",
            "refresh": 1,
            "regex": "",
            "skipUrlSync": false,
            "type": "datasource"
          },
          {
            "current": {
              "selected": false,
              "text": "/model/llama2",
              "value": "/model/llama2"
            },
            "datasource": {
              "type": "prometheus",
              "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
            },
            "definition": "label_values(model_name)",
            "hide": 0,
            "includeAll": false,
            "label": "model_name",
            "multi": false,
            "name": "model_name",
            "options": [],
            "query": {
              "query": "label_values(model_name)",
              "refId": "StandardVariableQuery"
            },
            "refresh": 1,
            "regex": "",
            "skipUrlSync": false,
            "sort": 0,
            "type": "query"
          }
        ]
      },
      "time": {
        "from": "2025-01-10T04:00:36.511Z",
        "to": "2025-01-10T04:18:26.639Z"
      },
      "timepicker": {},
      "timezone": "",
      "title": "vLLM",
      "uid": "b281712d-8bff-41ef-9f3f-71ad43c05e9c",
      "version": 10,
      "weekStart": ""
    }

The following figure shows an example of the dashboard:

image
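The latency panels in this dashboard (such as Time Per Output Token Latency) compute P50/P90/P95/P99 with Prometheus's `histogram_quantile()` over the `vllm:*_bucket` series. As a rough illustration of the estimation behind those queries, the following is a minimal Python sketch of quantile estimation from cumulative histogram buckets with linear interpolation inside the target bucket. The bucket boundaries and counts below are hypothetical examples, not actual vLLM metric values, and the sketch simplifies some Prometheus edge cases (such as `+Inf` buckets).

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    mirroring Prometheus 'le' buckets. Interpolates linearly within the
    bucket that contains the target rank, similar to histogram_quantile().
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if count == prev_count:
                return upper  # empty bucket: fall back to its upper bound
            # Linear interpolation between the bucket's lower and upper bounds.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = upper, count
    return buckets[-1][0]

# Hypothetical time-per-output-token buckets: (seconds, cumulative requests).
buckets = [(0.01, 10), (0.05, 60), (0.1, 90), (0.5, 100)]
print(histogram_quantile(0.5, buckets))   # P50 estimate: 0.042
print(histogram_quantile(0.99, buckets))  # P99 estimate: 0.46
```

Because the estimate is interpolated within a bucket, its accuracy depends on how finely the histogram buckets are defined, which is why tail quantiles (P99) on coarse buckets can look step-like in the dashboard.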

See also