When you run multiple LoRA (Low-Rank Adaptation) adapters on a shared base model, you need a way to control how inference traffic is distributed across them. Service Mesh (ASM) lets you assign weights to each LoRA adapter through an InferenceModel resource, so you can run canary releases and A/B tests: gradually shift traffic from one adapter version to another and validate results before a full rollout.
How it works
LoRA and Multi-LoRA
LoRA is a widely adopted technique for fine-tuning large language models (LLMs) cost-effectively. Instead of retraining the entire model, LoRA adds lightweight adapter layers that load alongside the base model at inference time.
Multi-LoRA extends this approach: multiple LoRA adapters share a single base model and GPU, each serving a different fine-tuned variant. The vLLM platform supports loading and serving multiple LoRA adapters simultaneously, routing requests by the model field in each API call.
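For example, a client targets a specific adapter simply by setting the model field of an OpenAI-compatible request. A minimal sketch, using the `sql-lora` adapter name from the deployment later in this guide (the helper function itself is illustrative, not part of any SDK):

```python
import json

def build_completion_request(model: str, prompt: str, max_tokens: int = 100) -> dict:
    """Build an OpenAI-compatible /v1/completions payload.

    vLLM routes the request to the LoRA adapter whose registered
    name matches the `model` field; an unregistered name falls
    back to an error rather than the base model.
    """
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
    }

# Target the sql-lora adapter rather than the base Llama-2 model.
payload = build_completion_request("sql-lora", "List the top 5 customers by revenue")
print(json.dumps(payload, indent=2))
```

POSTing this body to the vLLM server's `/v1/completions` endpoint (port 8000 in the deployment below) selects the adapter per request, with no server restart.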
Traffic distribution through ASM
In a Multi-LoRA deployment, ASM routes inference requests to different LoRA adapters based on the model name in each request. You define an InferenceModel resource that maps a virtual model name to a set of target adapters with traffic weights. ASM then distributes incoming requests according to those weights.
A typical canary release workflow looks like this:
1. Route 100% of traffic to the current adapter version.
2. Deploy the new adapter version and shift 10% of traffic to it.
3. Monitor metrics. If the new version performs well, gradually increase its traffic share.
4. After validation, route 100% of traffic to the new version.
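The weighted split behind steps 2 and 3 can be sanity-checked offline. The following is a hypothetical simulation of weight-proportional routing (ASM's actual load-balancing algorithm may differ; this only illustrates the traffic share you should expect, and the adapter names are placeholders):

```python
import random
from collections import Counter

def pick_adapter(weights: dict[str, int], rng: random.Random) -> str:
    """Pick one adapter, with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

# Step 2 of the canary: 90% to the current version, 10% to the new one.
weights = {"tweet-summary": 90, "tweet-summary-canary": 10}
rng = random.Random(0)  # seeded for reproducibility
counts = Counter(pick_adapter(weights, rng) for _ in range(10_000))

share = counts["tweet-summary-canary"] / 10_000
print(f"canary share: {share:.1%}")  # close to 10%
```

Because the split is probabilistic, short test runs will deviate from the configured ratio; judge the canary on aggregate metrics over many requests.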
Prerequisites
Before you begin, make sure that you have:
An ACK managed cluster with a GPU node pool, or an ACS cluster in a zone that supports GPU computing power. For more information, see Create an ACK managed cluster and Create an ACS cluster. To use ACS GPU resources from an ACK cluster, install the ACK Virtual Node component. For more information, see Use ACS computing power in an ACK cluster.
An ASM instance (v1.24 or later) with the cluster added. For more information, see Add a cluster to an ASM instance.
An ingress gateway with an HTTP service on port 8080. For more information, see Create an ingress gateway.
(Optional) Sidecar proxy injection enabled in the default namespace, required only for observability. For more information, see Enable automatic sidecar proxy injection.
Deploy the vLLM inference service
This step deploys a Llama-2-7b base model on vLLM with 10 LoRA adapters: sql-lora through sql-lora-4 (5 SQL-focused adapters) and tweet-summary through tweet-summary-4 (5 summarization adapters).
Create a file named `vllm-service.yaml` with the following content.

The YAML defines three resources:

| Resource | Purpose |
|---|---|
| Service (`vllm-llama2-7b-pool`) | Exposes the vLLM server on port 8000 as a ClusterIP service |
| ConfigMap (`chat-template`) | Provides the Llama-2 chat prompt template |
| Deployment (`vllm-llama2-7b-pool`) | Runs 3 replicas of the vLLM server with the base model and all 10 LoRA adapters loaded |

Key vLLM parameters:

| Parameter | Value | Description |
|---|---|---|
| `--enable-lora` | - | Enables LoRA adapter support |
| `--max-loras` | 10 | Maximum number of LoRA adapters loaded in GPU memory simultaneously |
| `--max-cpu-loras` | 12 | Maximum number of LoRA adapters stored in CPU memory |
| `--gpu_memory_utilization` | 0.8 | Fraction of GPU memory allocated for the KV cache |
| `--lora-modules` | `<name>=<path>` | Maps adapter names to their weight file paths |

ACK cluster:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama2-7b-pool
spec:
  selector:
    app: vllm-llama2-7b-pool
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
  type: ClusterIP
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: chat-template
data:
  llama-2-chat.jinja: |
    {% if messages[0]['role'] == 'system' %}
    {% set system_message = '<<SYS>>\n' + messages[0]['content'] | trim + '\n<</SYS>>\n\n' %}
    {% set messages = messages[1:] %}
    {% else %}
    {% set system_message = '' %}
    {% endif %}
    {% for message in messages %}
    {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
    {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
    {% endif %}
    {% if loop.index0 == 0 %}
    {% set content = system_message + message['content'] %}
    {% else %}
    {% set content = message['content'] %}
    {% endif %}
    {% if message['role'] == 'user' %}
    {{ bos_token + '[INST] ' + content | trim + ' [/INST]' }}
    {% elif message['role'] == 'assistant' %}
    {{ ' ' + content | trim + ' ' + eos_token }}
    {% endif %}
    {% endfor %}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2-7b-pool
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-llama2-7b-pool
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: '8000'
        prometheus.io/scrape: 'true'
      labels:
        app: vllm-llama2-7b-pool
    spec:
      containers:
        - name: lora
          image: "registry-cn-hangzhou-vpc.ack.aliyuncs.com/dev/llama2-with-lora:v0.2"
          imagePullPolicy: IfNotPresent
          command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
          args:
            - "--model"
            - "/model/llama2"
            - "--tensor-parallel-size"
            - "1"
            - "--port"
            - "8000"
            - '--gpu_memory_utilization'
            - '0.8'
            - "--enable-lora"
            - "--max-loras"
            - "10"
            - "--max-cpu-loras"
            - "12"
            - "--lora-modules"
            - 'sql-lora=/adapters/yard1/llama-2-7b-sql-lora-test_0'
            - 'sql-lora-1=/adapters/yard1/llama-2-7b-sql-lora-test_1'
            - 'sql-lora-2=/adapters/yard1/llama-2-7b-sql-lora-test_2'
            - 'sql-lora-3=/adapters/yard1/llama-2-7b-sql-lora-test_3'
            - 'sql-lora-4=/adapters/yard1/llama-2-7b-sql-lora-test_4'
            - 'tweet-summary=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0'
            - 'tweet-summary-1=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1'
            - 'tweet-summary-2=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2'
            - 'tweet-summary-3=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3'
            - 'tweet-summary-4=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4'
            - '--chat-template'
            - '/etc/vllm/llama-2-chat.jinja'
          env:
            - name: PORT
              value: "8000"
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          livenessProbe:
            failureThreshold: 2400
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          readinessProbe:
            failureThreshold: 6000
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              cpu: 16
              memory: 64Gi
              nvidia.com/gpu: 1
            requests:
              cpu: 8
              memory: 30Gi
              nvidia.com/gpu: 1
          volumeMounts:
            - mountPath: /data
              name: data
            - mountPath: /dev/shm
              name: shm
            - mountPath: /etc/vllm
              name: chat-template
      restartPolicy: Always
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      volumes:
        - name: data
          emptyDir: {}
        - name: shm
          emptyDir:
            medium: Memory
        - name: chat-template
          configMap:
            name: chat-template
```

ACS cluster: the YAML is identical except that the pod template adds the following labels to request ACS GPU computing power:

```yaml
      labels:
        app: vllm-llama2-7b-pool
        alibabacloud.com/compute-class: gpu
        alibabacloud.com/compute-qos: default
        alibabacloud.com/gpu-model-series: "example-model" # Replace with your actual GPU model series
```

Deploy the service using the data plane cluster kubeconfig:

```shell
kubectl apply -f vllm-service.yaml
```

Verify that all pods are running:

```shell
kubectl get pods -l app=vllm-llama2-7b-pool
```

Wait until all 3 replicas show `Running` status and `READY` is `1/1` (or `2/2` if sidecar injection is enabled).
Configure ASM gateway rules
Set up the ASM ingress gateway to listen on port 8080 for HTTP traffic.
Create a file named `gateway.yaml`:

```yaml
apiVersion: networking.istio.io/v1
kind: Gateway
metadata:
  name: llm-inference-gateway
  namespace: default
spec:
  selector:
    istio: ingressgateway
  servers:
    - hosts:
        - '*'
      port:
        name: http-service
        number: 8080
        protocol: HTTP
```

Apply the gateway rule using the ASM kubeconfig:

```shell
kubectl apply -f gateway.yaml
```
Configure routing and traffic distribution
This step creates three resources that work together to route and distribute inference traffic:
| Resource | Role |
|---|---|
| InferencePool | Selects the vLLM pods that serve inference requests |
| InferenceModel | Defines which LoRA adapters receive traffic and at what weights |
| LLMRoute | Connects the ingress gateway to the InferencePool |
Enable the Gateway API inference extension
Run the following command using the ASM kubeconfig:
```shell
kubectl patch asmmeshconfig default --type=merge \
  --patch='{"spec":{"gatewayAPIInferenceExtension":{"enabled":true}}}'
```

Create the InferencePool
The InferencePool selects vLLM pods by label and enables ASM to perform inference-aware load balancing across them.
Create a file named `inferencepool.yaml`:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferencePool
metadata:
  name: vllm-llama2-7b-pool
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-llama2-7b-pool
```

Apply the resource using the data plane cluster kubeconfig:

```shell
kubectl apply -f inferencepool.yaml
```
Create the InferenceModel
The InferenceModel maps a virtual model name to a set of LoRA adapters with traffic weights. When a request specifies model: "lora-request", ASM distributes it to one of the target adapters based on the configured weights.
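Because each targetModels entry carries its own weight, a per-group traffic share must be divided across the adapters in that group. The following illustrative helper (its names and structure are this guide's assumptions, not an ASM API) expands group percentages into per-adapter weights:

```python
def expand_group_weights(groups: dict[str, tuple[list[str], int]]) -> dict[str, int]:
    """Split each group's traffic percentage evenly across its adapters.

    `groups` maps a group label to (adapter names, traffic percent).
    Weights are relative, so reusing the raw percentage works as long
    as every group is expanded the same way.
    """
    weights = {}
    for adapters, percent in groups.values():
        per_adapter = percent // len(adapters)
        for name in adapters:
            weights[name] = per_adapter
    return weights

# The adapter names used by the deployment in this guide.
sql = [f"sql-lora-{i}" if i else "sql-lora" for i in range(5)]
tweet = [f"tweet-summary-{i}" if i else "tweet-summary" for i in range(5)]

# A 90/10 canary split between the two groups.
w = expand_group_weights({"sql": (sql, 90), "tweet": (tweet, 10)})
print(w["sql-lora"], w["tweet-summary"])  # 18 2
```

These are exactly the per-adapter weights (18 and 2) used in the canary example below; generating them programmatically avoids arithmetic slips when the adapter count changes.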
Create a file named `inferencemodel.yaml`. In this example, all 10 adapters have equal weight (10), so traffic splits evenly: 50% to the `tweet-summary` group and 50% to the `sql-lora` group.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: inferencemodel-sample
spec:
  modelName: lora-request
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: vllm-llama2-7b-pool
  targetModels:
    - name: tweet-summary
      weight: 10
    - name: tweet-summary-1
      weight: 10
    - name: tweet-summary-2
      weight: 10
    - name: tweet-summary-3
      weight: 10
    - name: tweet-summary-4
      weight: 10
    - name: sql-lora
      weight: 10
    - name: sql-lora-1
      weight: 10
    - name: sql-lora-2
      weight: 10
    - name: sql-lora-3
      weight: 10
    - name: sql-lora-4
      weight: 10
```

Adjust weights for a canary release

To gradually shift traffic between adapter groups, change the weights. For example, to route 90% of traffic to `sql-lora` adapters and 10% to `tweet-summary` adapters:

```yaml
  targetModels:
    - name: sql-lora
      weight: 18 # 90 / 5 adapters = 18 per adapter
    - name: sql-lora-1
      weight: 18
    # ... remaining sql-lora adapters with weight: 18
    - name: tweet-summary
      weight: 2 # 10 / 5 adapters = 2 per adapter
    - name: tweet-summary-1
      weight: 2
    # ... remaining tweet-summary adapters with weight: 2
```

After validating the new adapter group, increase its weight further until you reach 100%.

Apply the resource:

```shell
kubectl apply -f inferencemodel.yaml
```
Create the LLMRoute
The LLMRoute binds the gateway to the InferencePool, routing all requests on port 8080 to the inference service.
Create a file named `llmroute.yaml`:

```yaml
apiVersion: istio.alibabacloud.com/v1
kind: LLMRoute
metadata:
  name: test-llm-route
spec:
  gateways:
    - llm-inference-gateway
  host: test.com
  rules:
    - backendRefs:
        - backendRef:
            group: inference.networking.x-k8s.io
            kind: InferencePool
            name: vllm-llama2-7b-pool
```

Apply the resource using the ASM kubeconfig:

```shell
kubectl apply -f llmroute.yaml
```
Verify the traffic distribution
Before sending test traffic, confirm that all resources are ready.
Check that the InferencePool and InferenceModel resources exist:

```shell
kubectl get inferencepool vllm-llama2-7b-pool
kubectl get inferencemodel inferencemodel-sample
```

Send multiple requests through the gateway. Replace `${ASM_GATEWAY_IP}` with the IP address of your ASM ingress gateway.

```shell
curl -H "host: test.com" ${ASM_GATEWAY_IP}:8080/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "lora-request",
    "prompt": "Write as if you were a critic: San Francisco",
    "max_tokens": 100,
    "temperature": 0
  }' -v
```

Check the response. The `model` field indicates which LoRA adapter served the request:

```json
{
  "id": "cmpl-2fc9a351-d866-422b-b561-874a30843a6b",
  "object": "text_completion",
  "created": 1736933141,
  "model": "tweet-summary-1",
  "choices": [
    {
      "index": 0,
      "text": "...",
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 2,
    "total_tokens": 102,
    "completion_tokens": 100
  }
}
```

After sending multiple requests, the ratio between `tweet-summary` and `sql-lora` responses should be approximately 1:1.
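To quantify the split, tally the model field over many responses. A sketch that computes per-group shares from parsed response bodies (the sample data here is synthetic; in practice, feed it the JSON bodies returned by repeated curl calls):

```python
from collections import Counter

def group_ratio(responses: list[dict]) -> dict[str, float]:
    """Return the share of responses served by each adapter group.

    The group is the adapter name with its trailing `-<n>` suffix
    stripped, e.g. `tweet-summary-1` -> `tweet-summary`.
    """
    groups = Counter()
    for resp in responses:
        name = resp["model"]
        group = name.rsplit("-", 1)[0] if name[-1].isdigit() else name
        groups[group] += 1
    total = sum(groups.values())
    return {g: n / total for g, n in groups.items()}

# Synthetic sample standing in for real response bodies.
sample = [{"model": m} for m in
          ["tweet-summary-1", "sql-lora", "sql-lora-3", "tweet-summary"]]
print(group_ratio(sample))  # {'tweet-summary': 0.5, 'sql-lora': 0.5}
```

With equal weights, expect the two groups to converge toward 0.5 each as the request count grows.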
(Optional) Set up observability
After you configure the InferencePool, InferenceModel, and LLMRoute resources, monitor the LLM inference service through ASM metrics and vLLM metrics.
Collect ASM metrics
Enable LLM traffic observability in the ASM console. This adds model-level dimensions to ASM monitoring metrics, including log fields and metric labels. For detailed instructions, see Efficiently manage LLM traffic using ASM.
Collect the metrics using Prometheus within the observability framework or a self-hosted Prometheus instance.
ASM exposes two LLM-specific metrics:

| Metric | Description |
|---|---|
| `asm_llm_proxy_prompt_tokens` | Number of input tokens per request |
| `asm_llm_proxy_completion_tokens` | Number of output tokens per request |

Add the following scrape configuration to Prometheus. For details, see Other Prometheus service discovery configurations.

```yaml
scrape_configs:
  - job_name: asm-envoy-stats-llm
    scrape_interval: 30s
    scrape_timeout: 30s
    metrics_path: /stats/prometheus
    scheme: http
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels:
          - __meta_kubernetes_pod_container_port_name
        action: keep
        regex: .*-envoy-prom
      - source_labels:
          - __address__
          - __meta_kubernetes_pod_annotation_prometheus_io_port
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:15090
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels:
          - __meta_kubernetes_namespace
        action: replace
        target_label: namespace
      - source_labels:
          - __meta_kubernetes_pod_name
        action: replace
        target_label: pod_name
    metric_relabel_configs:
      - action: keep
        source_labels:
          - __name__
        regex: asm_llm_.*
```
Collect vLLM metrics
The deployment YAML already includes Prometheus annotations on the vLLM pods, so metrics are scraped automatically through Prometheus default service discovery. For details, see Default service discovery.

```yaml
annotations:
  prometheus.io/path: /metrics
  prometheus.io/port: "8000"
  prometheus.io/scrape: "true"
```

Key vLLM metrics:

| Metric | Description |
|---|---|
| `vllm:gpu_cache_usage_perc` | GPU KV cache utilization. Lower values mean more capacity for new requests. |
| `vllm:request_queue_time_seconds_sum` | Time requests spend waiting in the queue before inference begins |
| `vllm:num_requests_running` | Number of requests currently running inference |
| `vllm:num_requests_waiting` | Number of requests waiting in the queue |
| `vllm:num_requests_swapped` | Number of requests swapped to CPU memory |
| `vllm:avg_generation_throughput_toks_per_s` | Decode-stage token throughput (tokens/second) |
| `vllm:avg_prompt_throughput_toks_per_s` | Prefill-stage token throughput (tokens/second) |
| `vllm:time_to_first_token_seconds_bucket` | Time from request arrival to first token output (TTFT). A key metric for user-perceived latency. |
| `vllm:e2e_request_latency_seconds_bucket` | End-to-end request latency |
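The `_bucket` metrics are Prometheus histograms, so latency percentiles are estimated from cumulative bucket counts; in PromQL you would use histogram_quantile. A simplified Python sketch of that interpolation, with made-up TTFT bucket values:

```python
def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Mirrors the linear interpolation PromQL's histogram_quantile
    performs inside the bucket containing the target rank.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:  # empty bucket: no interpolation possible
                return bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Made-up cumulative TTFT buckets: 80 requests <= 0.1s, 95 <= 0.5s, 100 <= 1s.
ttft = [(0.1, 80), (0.5, 95), (1.0, 100)]
print(histogram_quantile(0.95, ttft))  # -> 0.5
```

The equivalent PromQL for a p95 TTFT panel would be `histogram_quantile(0.95, rate(vllm:time_to_first_token_seconds_bucket[5m]))`; the accuracy of either estimate depends on how finely the buckets are spaced.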
Build a Grafana dashboard
Build a Grafana dashboard to visualize both ASM and vLLM metrics:
ASM metrics: Track request rate and token throughput per model.
vLLM metrics: Monitor GPU cache utilization, queue depth, and per-request latency.
To set up the dashboard:
Add your Prometheus instance as a data source in the Grafana console.
Import the dashboard JSON provided below.
The following figure shows an example of the dashboard:
