全部產品
Search
文件中心

Alibaba Cloud Service Mesh:Multi-LoRA情境下的LLM推理服務灰階策略

更新時間:Mar 12, 2025

在Kubernetes叢集中部署大型語言模型(LLM)推理服務時,基於低秩適應LoRA(Low-Rank Adaptation)技術對大模型進行微調並提供定製化推理能力,已成為高效且靈活的最佳實務。本文介紹在Service Mesh (ASM)中,如何基於Multi-LoRA的微調LLM推理服務,指定多LoRA模型的流量分發策略,從而實現LoRA模型灰階。

閱讀前提示

閱讀本文前,您需要瞭解

通過閱讀本文,您可以瞭解到

  • LoRA與Multi-LoRA技術的背景資訊。

  • LoRA微調模型灰階情境實現原理。

  • 基於Multi-LoRA技術,實現LoRA模型灰階發布情境的實踐操作。

背景資訊

LoRA與Multi-LoRA

LoRA是一種流行的大語言模型(LLM)微調技術,可以以較小的代價對LLM進行微調,以滿足LLM在垂直領域(例如醫學、金融、教育)的定製化需求。在構建推理服務時,可以基於同一個基礎大模型載入多個不同的LoRA模型權重進行推理。通過這種方式,可以實現多個LoRA模型共用GPU資源的效果,這被稱作Multi-LoRA技術。由於LoRA技術的高效性,因此被廣泛應用於部署垂直領域定製化大模型的情境中。當前,vLLM已經支援Multi-LoRA的載入與推理。

LoRA微調模型灰階情境

在Multi-LoRA情境下,多個LoRA模型可以被載入到同一LLM推理服務中,對不同LoRA模型的請求通過請求中的模型名稱進行區分。通過這種方式,可以在同一基礎大模型上訓練不同的LoRA模型,並在不同的LoRA模型之間進行灰階測試,以評估大模型的微調效果。

前提條件

實踐步驟

本實踐以在叢集中部署基於vLLM的Llama2大模型為基本模型,並註冊了基於該基本模型的10個LoRA模型,分別是sql-lorasql-lora-4,以及tweet-summarytweet-summary-4。您可以根據實際情況選擇在ACK GPU叢集或ACS叢集中進行驗證。

步驟一:部署樣本LLM推理服務

  1. 使用以下內容,建立vllm-service.yaml。

    說明

    本文使用的鏡像需要GPU顯存大於16GiB,T4卡型(16GiB顯存)的實際可用顯存不足以啟動此應用。因此ACK叢集卡型推薦使用A10,ACS叢集卡型推薦使用8代GPU B。具體對應型號請提交工單諮詢。

    同時,由於LLM鏡像體積較大,建議您提前轉存到ACR,使用內網地址進行拉取。直接從公網拉取的速度取決於叢集EIP的頻寬配置,會有較長的等待時間。

    ACK叢集

    展開查看YAML內容

    apiVersion: v1
    kind: Service
    metadata:
      name: vllm-llama2-7b-pool
    spec:
      selector:
        app: vllm-llama2-7b-pool
      ports:
      - protocol: TCP
        port: 8000
        targetPort: 8000
      type: ClusterIP
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: chat-template
    data:
      llama-2-chat.jinja: |
        {% if messages[0]['role'] == 'system' %}
          {% set system_message = '<<SYS>>\n' + messages[0]['content'] | trim + '\n<</SYS>>\n\n' %}
          {% set messages = messages[1:] %}
        {% else %}
            {% set system_message = '' %}
        {% endif %}
    
        {% for message in messages %}
            {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
                {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
            {% endif %}
    
            {% if loop.index0 == 0 %}
                {% set content = system_message + message['content'] %}
            {% else %}
                {% set content = message['content'] %}
            {% endif %}
            {% if message['role'] == 'user' %}
                {{ bos_token + '[INST] ' + content | trim + ' [/INST]' }}
            {% elif message['role'] == 'assistant' %}
                {{ ' ' + content | trim + ' ' + eos_token }}
            {% endif %}
        {% endfor %}
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vllm-llama2-7b-pool
      namespace: default
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: vllm-llama2-7b-pool
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: '8000'
            prometheus.io/scrape: 'true'
          labels:
            app: vllm-llama2-7b-pool
        spec:
          containers:
            - name: lora
              image: "registry-cn-hangzhou-vpc.ack.aliyuncs.com/dev/llama2-with-lora:v0.2"
              imagePullPolicy: IfNotPresent
              command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
              args:
              - "--model"
              - "/model/llama2"
              - "--tensor-parallel-size"
              - "1"
              - "--port"
              - "8000"
              - '--gpu_memory_utilization'
              - '0.8'
              - "--enable-lora"
              - "--max-loras"
              - "10"
              - "--max-cpu-loras"
              - "12"
              - "--lora-modules"
              - 'sql-lora=/adapters/yard1/llama-2-7b-sql-lora-test_0'
              - 'sql-lora-1=/adapters/yard1/llama-2-7b-sql-lora-test_1'
              - 'sql-lora-2=/adapters/yard1/llama-2-7b-sql-lora-test_2'
              - 'sql-lora-3=/adapters/yard1/llama-2-7b-sql-lora-test_3'
              - 'sql-lora-4=/adapters/yard1/llama-2-7b-sql-lora-test_4'
              - 'tweet-summary=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0'
              - 'tweet-summary-1=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1'
              - 'tweet-summary-2=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2'
              - 'tweet-summary-3=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3'
              - 'tweet-summary-4=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4'
              - '--chat-template'
              - '/etc/vllm/llama-2-chat.jinja'
              env:
                - name: PORT
                  value: "8000"
              ports:
                - containerPort: 8000
                  name: http
                  protocol: TCP
              livenessProbe:
                failureThreshold: 2400
                httpGet:
                  path: /health
                  port: http
                  scheme: HTTP
                initialDelaySeconds: 5
                periodSeconds: 5
                successThreshold: 1
                timeoutSeconds: 1
              readinessProbe:
                failureThreshold: 6000
                httpGet:
                  path: /health
                  port: http
                  scheme: HTTP
                initialDelaySeconds: 5
                periodSeconds: 5
                successThreshold: 1
                timeoutSeconds: 1
              resources:
                limits:
                  nvidia.com/gpu: 1
                requests:
                  nvidia.com/gpu: 1
              volumeMounts:
                - mountPath: /data
                  name: data
                - mountPath: /dev/shm
                  name: shm
                - mountPath: /etc/vllm
                  name: chat-template
          restartPolicy: Always
          schedulerName: default-scheduler
          terminationGracePeriodSeconds: 30
          volumes:
            - name: data
              emptyDir: {}
            - name: shm
              emptyDir:
                medium: Memory
            - name: chat-template
              configMap:
                name: chat-template

    ACS叢集

    展開查看YAML內容

    apiVersion: v1
    kind: Service
    metadata:
      name: vllm-llama2-7b-pool
    spec:
      selector:
        app: vllm-llama2-7b-pool
      ports:
      - protocol: TCP
        port: 8000
        targetPort: 8000
      type: ClusterIP
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: chat-template
    data:
      llama-2-chat.jinja: |
        {% if messages[0]['role'] == 'system' %}
          {% set system_message = '<<SYS>>\n' + messages[0]['content'] | trim + '\n<</SYS>>\n\n' %}
          {% set messages = messages[1:] %}
        {% else %}
            {% set system_message = '' %}
        {% endif %}
    
        {% for message in messages %}
            {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
                {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
            {% endif %}
    
            {% if loop.index0 == 0 %}
                {% set content = system_message + message['content'] %}
            {% else %}
                {% set content = message['content'] %}
            {% endif %}
            {% if message['role'] == 'user' %}
                {{ bos_token + '[INST] ' + content | trim + ' [/INST]' }}
            {% elif message['role'] == 'assistant' %}
                {{ ' ' + content | trim + ' ' + eos_token }}
            {% endif %}
        {% endfor %}
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vllm-llama2-7b-pool
      namespace: default
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: vllm-llama2-7b-pool
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: '8000'
            prometheus.io/scrape: 'true'
          labels:
            app: vllm-llama2-7b-pool
            alibabacloud.com/compute-class: gpu  # 指定使用GPU算力
            alibabacloud.com/compute-qos: default
            alibabacloud.com/gpu-model-series: "example-model" # 指定GPU型號為example-model,請按實際情況填寫
        spec:
          containers:
            - name: lora
              image: "registry-cn-hangzhou-vpc.ack.aliyuncs.com/dev/llama2-with-lora:v0.2"
              imagePullPolicy: IfNotPresent
              command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
              args:
              - "--model"
              - "/model/llama2"
              - "--tensor-parallel-size"
              - "1"
              - "--port"
              - "8000"
              - '--gpu_memory_utilization'
              - '0.8'
              - "--enable-lora"
              - "--max-loras"
              - "10"
              - "--max-cpu-loras"
              - "12"
              - "--lora-modules"
              - 'sql-lora=/adapters/yard1/llama-2-7b-sql-lora-test_0'
              - 'sql-lora-1=/adapters/yard1/llama-2-7b-sql-lora-test_1'
              - 'sql-lora-2=/adapters/yard1/llama-2-7b-sql-lora-test_2'
              - 'sql-lora-3=/adapters/yard1/llama-2-7b-sql-lora-test_3'
              - 'sql-lora-4=/adapters/yard1/llama-2-7b-sql-lora-test_4'
              - 'tweet-summary=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0'
              - 'tweet-summary-1=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1'
              - 'tweet-summary-2=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2'
              - 'tweet-summary-3=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3'
              - 'tweet-summary-4=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4'
              - '--chat-template'
              - '/etc/vllm/llama-2-chat.jinja'
              env:
                - name: PORT
                  value: "8000"
              ports:
                - containerPort: 8000
                  name: http
                  protocol: TCP
              livenessProbe:
                failureThreshold: 2400
                httpGet:
                  path: /health
                  port: http
                  scheme: HTTP
                initialDelaySeconds: 5
                periodSeconds: 5
                successThreshold: 1
                timeoutSeconds: 1
              readinessProbe:
                failureThreshold: 6000
                httpGet:
                  path: /health
                  port: http
                  scheme: HTTP
                initialDelaySeconds: 5
                periodSeconds: 5
                successThreshold: 1
                timeoutSeconds: 1
              resources:
                limits:
                  cpu: 16
                  memory: 64Gi
                  nvidia.com/gpu: 1
                requests:
                  cpu: 8
                  memory: 30Gi
                  nvidia.com/gpu: 1
              volumeMounts:
                - mountPath: /data
                  name: data
                - mountPath: /dev/shm
                  name: shm
                - mountPath: /etc/vllm
                  name: chat-template
          restartPolicy: Always
          schedulerName: default-scheduler
          terminationGracePeriodSeconds: 30
          volumes:
            - name: data
              emptyDir: {}
            - name: shm
              emptyDir:
                medium: Memory
            - name: chat-template
              configMap:
                name: chat-template
  2. 使用資料面叢集的kubeconfig,部署LLM推理服務。

    kubectl apply -f vllm-service.yaml

步驟二:配置ASM網關規則

部署生效於ASM網關的網關規則,該規則表示開啟網關8080連接埠的監聽。

  1. 使用以下內容,建立gateway.yaml。

    apiVersion: networking.istio.io/v1
    kind: Gateway
    metadata:
      name: llm-inference-gateway
      namespace: default
    spec:
      selector:
        istio: ingressgateway
      servers:
        - hosts:
            - '*'
          port:
            name: http-service
            number: 8080
            protocol: HTTP
  2. 建立網關規則。

    kubectl apply -f gateway.yaml

步驟三:配置LLM推理服務路由和負載平衡

  1. 使用ASM的kubeconfig,啟用LLM推理服務路由能力。

    kubectl patch asmmeshconfig default --type=merge --patch='{"spec":{"gatewayAPIInferenceExtension":{"enabled":true}}}'
  2. 部署InferencePool資源。

    InferencePool資源通過標籤選取器聲明一組叢集中啟動並執行LLM推理服務工作負載,ASM會根據您建立的InferencePool來對LLM推理服務開啟vLLM負載平衡。

    1. 使用以下內容,建立inferencepool.yaml。

      apiVersion: inference.networking.x-k8s.io/v1alpha1
      kind: InferencePool
      metadata:
        name: vllm-llama2-7b-pool
      spec:
        targetPortNumber: 8000
        selector:
          app: vllm-llama2-7b-pool
    2. 使用資料面叢集的kubeconfig,建立InferencePool資源。

      kubectl apply -f inferencepool.yaml
  3. 部署InferenceModel資源。

    InferenceModel指定了InferencePool中具體模型的流量分發策略。

    1. 使用以下內容,建立inferencemodel.yaml。

      apiVersion: inference.networking.x-k8s.io/v1alpha1
      kind: InferenceModel
      metadata:
        name: inferencemodel-sample
      spec:
        modelName: lora-request
        poolRef:
          group: inference.networking.x-k8s.io
          kind: InferencePool
          name: vllm-llama2-7b-pool
        targetModels:
        - name: tweet-summary
          weight: 10
        - name: tweet-summary-1
          weight: 10
        - name: tweet-summary-2
          weight: 10
        - name: tweet-summary-3
          weight: 10
        - name: tweet-summary-4
          weight: 10
        - name: sql-lora
          weight: 10
        - name: sql-lora-1
          weight: 10
        - name: sql-lora-2
          weight: 10
        - name: sql-lora-3
          weight: 10
        - name: sql-lora-4
          weight: 10

      上述內容配置了當請求的模型名稱是lora-request時,總和為50%的請求由tweet-summary LoRA模型進行推理,另外50%的請求發往sql-lora LoRA模型進行推理。

    2. 建立InferenceModel資源。

      kubectl apply -f inferencemodel.yaml
  4. 建立LLMRoute資源。

    為網關配置路由規則。此路由規則通過引用InferencePool資源的方式,指定ASM網關將8080連接埠上接收到的請求全部轉寄給樣本LLM推理服務。

    1. 使用以下內容,建立llmroute.yaml。

      apiVersion: istio.alibabacloud.com/v1
      kind: LLMRoute
      metadata:  
        name: test-llm-route
      spec:
        gateways: 
        - llm-inference-gateway
        host: test.com
        rules:
        - backendRefs:
          - backendRef:
              group: inference.networking.x-k8s.io
              kind: InferencePool
              name: vllm-llama2-7b-pool
    2. 建立LLMRoute資源。

      kubectl apply -f llmroute.yaml

步驟四:驗證執行結果

多次執行以下命令發起測試:

curl -H "host: test.com" ${ASM網關IP}:8080/v1/completions -H 'Content-Type: application/json' -d '{
"model": "lora-request",
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}' -v

可以看到和下面內容相似的輸出:

{"id":"cmpl-2fc9a351-d866-422b-b561-874a30843a6b","object":"text_completion","created":1736933141,"model":"tweet-summary-1","choices":[{"index":0,"text":", I'm a newbie to this forum. Write a summary of the article.\nWrite a summary of the article.\nWrite a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":2,"total_tokens":102,"completion_tokens":100,"prompt_tokens_details":null}}

其中model欄位表示真正提供服務的模型。多次訪問之後,可以看到tweet-summary和sql-lora模型的請求總量比例大致為1:1。

(可選)步驟五:配置LLM服務可觀測指標與可觀測大盤

在使用InferencePool和InferenceMode資源聲明叢集中的LLM推理服務,並配置路由策略後,可以通過在日誌和監控指標兩方面查看LLM推理的可觀測能力。

  1. 開啟ASM的LLM流量可觀測能力,採集服務網格監控指標。

    1. 通過操作新增日誌欄位、新增指標以及新增指標維度增強LLM推理請求的可觀測性資訊。

    2. 完成配置後,ASM監控指標中會新增一個model維度。您可以通過將監控指標採集到可觀測監控Prometheus版,或者整合自建Prometheus實現網格監控進行指標採集。

    3. ASM新增提供兩個指標來表示所有請求的輸入token數asm_llm_proxy_prompt_tokens和輸出token數asm_llm_proxy_completion_tokens。您可以為Prometheus增加以下規則來新增這些指標

      scrape_configs:
      - job_name: asm-envoy-stats-llm
        scrape_interval: 30s
        scrape_timeout: 30s
        metrics_path: /stats/prometheus
        scheme: http
        kubernetes_sd_configs:
        - role: pod
        relabel_configs:
        - source_labels:
          - __meta_kubernetes_pod_container_port_name
          action: keep
          regex: .*-envoy-prom
        - source_labels:
          - __address__
          - __meta_kubernetes_pod_annotation_prometheus_io_port
          action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:15090
          target_label: __address__
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - source_labels:
          - __meta_kubernetes_namespace
          action: replace
          target_label: namespace
        - source_labels:
          - __meta_kubernetes_pod_name
          action: replace
          target_label: pod_name
        metric_relabel_configs:
        - action: keep
          source_labels:
          - __name__
          regex: asm_llm_.*
  2. 採集vLLM服務監控指標。

    1. ASM提供的LLM推理請求監控指標主要監控了外部LLM推理請求的吞吐情況。您可以為vLLM服務Pod增加Prometheus採集相關的註解,以採集vLLM服務暴露的監控指標,監控vLLM服務的內部狀態。

      ...
      annotations:
        prometheus.io/path: /metrics # 指標暴露的HTTP Path。
        prometheus.io/port: "8000" # 指標暴露連接埠,即為vLLM Server的監聽連接埠。
        prometheus.io/scrape: "true" # 是否抓取當前Pod的指標。
      ...
    2. 通過Prometheus執行個體預設的服務發現機制採集vLLM服務相關指標。具體操作,請參見。

      在vLLM服務提供的監控指標中,可以通過以下重點指標來直觀瞭解vLLM工作負載的內部狀態。

      指標名稱

      說明

      vllm:gpu_cache_usage_perc

      vllm的GPU緩衝使用百分比。vLLM啟動時,會儘可能多地預先佔有一塊GPU顯存,用於進行KV緩衝。對於vLLM伺服器,緩衝利用率越低,代表GPU還有充足的空間將資源分派給新來的請求。

      vllm:request_queue_time_seconds_sum

      請求在等待狀態排隊花費的時間。LLM推理請求在到達vLLM伺服器後、可能不會被立刻處理,而是需要等待被vLLM調度器調度運行預填充和解碼。

      vllm:num_requests_running

      vllm:num_requests_waiting

      vllm:num_requests_swapped

      正在運行推理、正在等待和被交換到記憶體的請求數量。可以用來評估vLLM服務當前的請求壓力。

      vllm:avg_generation_throughput_toks_per_s

      vllm:avg_prompt_throughput_toks_per_s

      每秒被預填充階段消耗的token以及解碼階段產生的token數量。

      vllm:time_to_first_token_seconds_bucket

      從請求發送到vLLM服務,到響應第一個token為止的時延水平。該指標通常代表了用戶端在輸出請求內容後得到首個響應所需的時間、是影響LLM使用者體驗的重要指標。

  3. 配置Grafana大盤檢測LLM推理服務。

    您可以通過Grafana大盤來觀測基於vLLM部署的LLM推理服務:

    • 通過基於ASM監控指標的面板觀測服務的請求速率和整體token吞吐;

    • 通過基於vLLM監控指標的面板觀測推理工作負載的內部狀態。

    請確保Grafana使用的資料來源Prometheus執行個體已經採集服務網格和vLLM的監控指標。

    將以下內容匯入到Grafana,建立LLM推理服務的可觀測大盤。

    展開查看JSON內容

    {
      "annotations": {
        "list": [
          {
            "builtIn": 1,
            "datasource": {
              "type": "grafana",
              "uid": "-- Grafana --"
            },
            "enable": true,
            "hide": true,
            "iconColor": "rgba(0, 211, 255, 1)",
            "name": "Annotations & Alerts",
            "target": {
              "limit": 100,
              "matchAny": false,
              "tags": [],
              "type": "dashboard"
            },
            "type": "dashboard"
          }
        ]
      },
      "description": "Monitoring vLLM Inference Server",
      "editable": true,
      "fiscalYearStartMonth": 0,
      "graphTooltip": 0,
      "id": 49,
      "links": [],
      "liveNow": false,
      "panels": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "description": "",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green"
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 0
          },
          "id": 23,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "editorMode": "builder",
              "exemplar": false,
              "expr": "sum by(model) (rate(istio_requests_total{model!=\"unknown\"}[$__rate_interval]))",
              "instant": false,
              "interval": "",
              "legendFormat": "__auto",
              "range": true,
              "refId": "A"
            }
          ],
          "title": "Request Rate",
          "type": "timeseries"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
          },
          "description": "",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green"
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 12,
            "y": 0
          },
          "id": 20,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
              },
              "editorMode": "code",
              "expr": "sum by(llmproxy_model) (rate(asm_llm_proxy_completion_tokens{}[$__rate_interval]))",
              "instant": false,
              "legendFormat": "generate tokens (from proxy)",
              "range": true,
              "refId": "A"
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
              },
              "editorMode": "code",
              "expr": "sum by(llmproxy_model) (rate(asm_llm_proxy_prompt_tokens{}[$__rate_interval]))",
              "hide": false,
              "instant": false,
              "legendFormat": "prompt tokens (from proxy)",
              "range": true,
              "refId": "B"
            }
          ],
          "title": "Tokens Rate",
          "type": "timeseries"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "thresholds"
              },
              "mappings": [],
              "min": -1,
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              },
              "unit": "percentunit"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 8
          },
          "id": 17,
          "options": {
            "colorMode": "value",
            "graphMode": "area",
            "justifyMode": "auto",
            "orientation": "auto",
            "reduceOptions": {
              "calcs": [
                "mean"
              ],
              "fields": "",
              "values": false
            },
            "textMode": "auto"
          },
          "pluginVersion": "10.0.9",
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "editorMode": "builder",
              "expr": "avg(vllm:gpu_cache_usage_perc)",
              "hide": false,
              "instant": false,
              "range": true,
              "refId": "A"
            }
          ],
          "title": "Average gpu cache usage",
          "type": "stat"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "thresholds"
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              },
              "unit": "s"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 12,
            "y": 8
          },
          "id": 18,
          "options": {
            "colorMode": "value",
            "graphMode": "area",
            "justifyMode": "auto",
            "orientation": "auto",
            "reduceOptions": {
              "calcs": [
                "mean"
              ],
              "fields": "",
              "values": false
            },
            "textMode": "auto"
          },
          "pluginVersion": "10.0.9",
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "editorMode": "code",
              "expr": "avg(rate(vllm:request_queue_time_seconds_sum{model_name=\"$model_name\"}[$__rate_interval]))",
              "hide": false,
              "instant": false,
              "range": true,
              "refId": "C"
            }
          ],
          "title": "Average Queue Time",
          "type": "stat"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "description": "Percentage of used cache blocks by vLLM.",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              },
              "unit": "percentunit"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 16
          },
          "id": 4,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "editorMode": "builder",
              "expr": "sum by(kubernetes_pod_name) (vllm:gpu_cache_usage_perc{model_name=\"$model_name\"})",
              "instant": false,
              "legendFormat": "GPU Cache Usage ({{kubernetes_pod_name}})",
              "range": true,
              "refId": "A"
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "editorMode": "code",
              "expr": "vllm:cpu_cache_usage_perc{model_name=\"$model_name\"}",
              "hide": false,
              "instant": false,
              "legendFormat": "CPU Cache Usage",
              "range": true,
              "refId": "B"
            }
          ],
          "title": "Cache Utilization",
          "type": "timeseries"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "seconds",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 12,
            "y": 16
          },
          "id": 14,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "edx8memhpd9tsa"
              },
              "disableTextWrap": false,
              "editorMode": "code",
              "expr": "sum by(kubernetes_pod_name) (rate(vllm:request_queue_time_seconds_sum{model_name=\"$model_name\"}[$__rate_interval]))",
              "fullMetaSearch": false,
              "includeNullMetadata": true,
              "instant": false,
              "legendFormat": "__auto",
              "range": true,
              "refId": "A",
              "useBackend": false
            }
          ],
          "title": "Queue Time",
          "type": "timeseries"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "description": "P50, P90, P95, and P99 TTFT latency in seconds.",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              },
              "unit": "s"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 24
          },
          "id": 5,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.99, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P99",
              "range": true,
              "refId": "A",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.95, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P95",
              "range": true,
              "refId": "B",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.9, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P90",
              "range": true,
              "refId": "C",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.5, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P50",
              "range": true,
              "refId": "D",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "editorMode": "builder",
              "expr": "sum by(kubernetes_pod_name) (rate(vllm:time_to_first_token_seconds_sum{model_name=\"$model_name\"}[$__rate_interval])) / sum by(kubernetes_pod_name) (rate(vllm:time_to_first_token_seconds_count{model_name=\"$model_name\"}[$__rate_interval]))",
              "hide": false,
              "instant": false,
              "legendFormat": "Average ({{kubernetes_pod_name}})",
              "range": true,
              "refId": "E"
            }
          ],
          "title": "Time To First Token Latency",
          "type": "timeseries"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "description": "Number of tokens processed per second",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 12,
            "y": 24
          },
          "id": 8,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "rate(vllm:prompt_tokens_total{model_name=\"$model_name\"}[$__rate_interval])",
              "fullMetaSearch": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "Prompt Tokens/Sec",
              "range": true,
              "refId": "A",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "code",
              "expr": "sum by(kubernetes_pod_name) (rate(vllm:generation_tokens_total{model_name=\"$model_name\"}[$__rate_interval]))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "Generation Tokens/Sec ({{kubernetes_pod_name}})",
              "range": true,
              "refId": "B",
              "useBackend": false
            }
          ],
          "title": "Token Throughput",
          "type": "timeseries"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
          },
          "description": "End to end request latency measured in seconds.",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              },
              "unit": "s"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 32
          },
          "id": 9,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.99, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P99",
              "range": true,
              "refId": "A",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.95, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P95",
              "range": true,
              "refId": "B",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.9, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P90",
              "range": true,
              "refId": "C",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.5, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P50",
              "range": true,
              "refId": "D",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
              },
              "editorMode": "code",
              "expr": "rate(vllm:e2e_request_latency_seconds_sum{model_name=\"$model_name\"}[$__rate_interval])\n/\nrate(vllm:e2e_request_latency_seconds_count{model_name=\"$model_name\"}[$__rate_interval])",
              "hide": false,
              "instant": false,
              "legendFormat": "Average",
              "range": true,
              "refId": "E"
            }
          ],
          "title": "E2E Request Latency",
          "type": "timeseries"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "description": "Number of requests in RUNNING, WAITING, and SWAPPED state",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              },
              "unit": "none"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 12,
            "y": 32
          },
          "id": 3,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "vllm:num_requests_running{model_name=\"$model_name\"}",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": true,
              "instant": false,
              "legendFormat": "Num Running",
              "range": true,
              "refId": "A",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "vllm:num_requests_swapped{model_name=\"$model_name\"}",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": true,
              "instant": false,
              "legendFormat": "Num Swapped",
              "range": true,
              "refId": "B",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "exemplar": false,
              "expr": "sum by(kubernetes_pod_name) (vllm:num_requests_waiting{model_name=\"$model_name\"})",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": true,
              "instant": false,
              "legendFormat": "Num Waiting for {{kubernetes_pod_name}}",
              "range": true,
              "refId": "C",
              "useBackend": false
            }
          ],
          "title": "Scheduler State",
          "type": "timeseries"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "description": "Inter token latency in seconds.",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              },
              "unit": "s"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 40
          },
          "id": 10,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.99, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P99",
              "range": true,
              "refId": "A",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.95, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P95",
              "range": true,
              "refId": "B",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.9, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P90",
              "range": true,
              "refId": "C",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.5, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P50",
              "range": true,
              "refId": "D",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "editorMode": "code",
              "expr": "rate(vllm:time_per_output_token_seconds_sum{model_name=\"$model_name\"}[$__rate_interval])\n/\nrate(vllm:time_per_output_token_seconds_count{model_name=\"$model_name\"}[$__rate_interval])",
              "hide": false,
              "instant": false,
              "legendFormat": "Mean",
              "range": true,
              "refId": "E"
            }
          ],
          "title": "Time Per Output Token Latency",
          "type": "timeseries"
        },
        {
          "datasource": {
            "default": false,
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": [
              {
                "__systemRef": "hideSeriesFrom",
                "matcher": {
                  "id": "byNames",
                  "options": {
                    "mode": "exclude",
                    "names": [
                      "Decode"
                    ],
                    "prefix": "All except:",
                    "readOnly": true
                  }
                },
                "properties": [
                  {
                    "id": "custom.hideFrom",
                    "value": {
                      "legend": false,
                      "tooltip": false,
                      "viz": true
                    }
                  }
                ]
              }
            ]
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 12,
            "y": 40
          },
          "id": 15,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "edx8memhpd9tsa"
              },
              "disableTextWrap": false,
              "editorMode": "code",
              "expr": "rate(vllm:request_prefill_time_seconds_sum{model_name=\"$model_name\"}[$__rate_interval])",
              "fullMetaSearch": false,
              "includeNullMetadata": true,
              "instant": false,
              "legendFormat": "Prefill",
              "range": true,
              "refId": "A",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "editorMode": "code",
              "expr": "rate(vllm:request_decode_time_seconds_sum{model_name=\"$model_name\"}[$__rate_interval])",
              "hide": false,
              "instant": false,
              "legendFormat": "Decode",
              "range": true,
              "refId": "B"
            }
          ],
          "title": "Requests Prefill and Decode Time",
          "type": "timeseries"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "description": "Heatmap of request prompt length",
          "fieldConfig": {
            "defaults": {
              "custom": {
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "scaleDistribution": {
                  "type": "linear"
                }
              }
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 48
          },
          "id": 12,
          "options": {
            "calculate": false,
            "cellGap": 1,
            "cellValues": {
              "unit": "none"
            },
            "color": {
              "exponent": 0.5,
              "fill": "dark-orange",
              "min": 0,
              "mode": "scheme",
              "reverse": false,
              "scale": "exponential",
              "scheme": "Spectral",
              "steps": 64
            },
            "exemplars": {
              "color": "rgba(255,0,255,0.7)"
            },
            "filterValues": {
              "le": 1e-9
            },
            "legend": {
              "show": true
            },
            "rowsFrame": {
              "layout": "auto",
              "value": "Request count"
            },
            "tooltip": {
              "mode": "single",
              "show": true,
              "showColorScale": false,
              "yHistogram": true
            },
            "yAxis": {
              "axisLabel": "Prompt Length",
              "axisPlacement": "left",
              "reverse": false,
              "unit": "none"
            }
          },
          "pluginVersion": "10.0.9",
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "sum by(le) (increase(vllm:request_prompt_tokens_bucket{model_name=\"$model_name\"}[$__rate_interval]))",
              "format": "heatmap",
              "fullMetaSearch": false,
              "includeNullMetadata": true,
              "instant": false,
              "legendFormat": "{{le}}",
              "range": true,
              "refId": "A",
              "useBackend": false
            }
          ],
          "title": "Request Prompt Length",
          "type": "heatmap"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "description": "Heatmap of request generation length",
          "fieldConfig": {
            "defaults": {
              "custom": {
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "scaleDistribution": {
                  "type": "linear"
                }
              }
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 12,
            "y": 48
          },
          "id": 13,
          "options": {
            "calculate": false,
            "cellGap": 1,
            "cellValues": {
              "unit": "none"
            },
            "color": {
              "exponent": 0.5,
              "fill": "dark-orange",
              "min": 0,
              "mode": "scheme",
              "reverse": false,
              "scale": "exponential",
              "scheme": "Spectral",
              "steps": 64
            },
            "exemplars": {
              "color": "rgba(255,0,255,0.7)"
            },
            "filterValues": {
              "le": 1e-9
            },
            "legend": {
              "show": true
            },
            "rowsFrame": {
              "layout": "auto",
              "value": "Request count"
            },
            "tooltip": {
              "mode": "single",
              "show": true,
              "showColorScale": false,
              "yHistogram": true
            },
            "yAxis": {
              "axisLabel": "Generation Length",
              "axisPlacement": "left",
              "reverse": false,
              "unit": "none"
            }
          },
          "pluginVersion": "10.0.9",
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "sum by(le) (increase(vllm:request_generation_tokens_bucket{model_name=\"$model_name\"}[$__rate_interval]))",
              "format": "heatmap",
              "fullMetaSearch": false,
              "includeNullMetadata": true,
              "instant": false,
              "legendFormat": "{{le}}",
              "range": true,
              "refId": "A",
              "useBackend": false
            }
          ],
          "title": "Request Generation Length",
          "type": "heatmap"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "description": "Number of finished requests by their finish reason: either an EOS token was generated or the max sequence length was reached.",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 56
          },
          "id": 11,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "sum by(finished_reason) (increase(vllm:request_success_total{model_name=\"$model_name\"}[$__rate_interval]))",
              "fullMetaSearch": false,
              "includeNullMetadata": true,
              "instant": false,
              "interval": "",
              "legendFormat": "__auto",
              "range": true,
              "refId": "A",
              "useBackend": false
            }
          ],
          "title": "Finish Reason",
          "type": "timeseries"
        },
        {
          "datasource": {
            "default": false,
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": [
              {
                "__systemRef": "hideSeriesFrom",
                "matcher": {
                  "id": "byNames",
                  "options": {
                    "mode": "exclude",
                    "names": [
                      "Tokens"
                    ],
                    "prefix": "All except:",
                    "readOnly": true
                  }
                },
                "properties": [
                  {
                    "id": "custom.hideFrom",
                    "value": {
                      "legend": false,
                      "tooltip": false,
                      "viz": true
                    }
                  }
                ]
              }
            ]
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 12,
            "y": 56
          },
          "id": 16,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "edx8memhpd9tsa"
              },
              "disableTextWrap": false,
              "editorMode": "code",
              "expr": "rate(vllm:request_max_num_generation_tokens_sum{model_name=\"$model_name\"}[$__rate_interval])",
              "fullMetaSearch": false,
              "includeNullMetadata": true,
              "instant": false,
              "legendFormat": "Tokens",
              "range": true,
              "refId": "A",
              "useBackend": false
            }
          ],
          "title": "Max Generation Token in Sequence Group",
          "type": "timeseries"
        }
      ],
      "refresh": false,
      "schemaVersion": 38,
      "style": "dark",
      "tags": [],
      "templating": {
        "list": [
          {
            "current": {
              "selected": true,
              "text": "prom-cec64713b1aab44d0b49236b6f54cd671",
              "value": "prom-cec64713b1aab44d0b49236b6f54cd671"
            },
            "hide": 0,
            "includeAll": false,
            "label": "datasource",
            "multi": false,
            "name": "DS_PROMETHEUS",
            "options": [],
            "query": "prometheus",
            "queryValue": "",
            "refresh": 1,
            "regex": "",
            "skipUrlSync": false,
            "type": "datasource"
          },
          {
            "current": {
              "selected": false,
              "text": "/model/llama2",
              "value": "/model/llama2"
            },
            "datasource": {
              "type": "prometheus",
              "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
            },
            "definition": "label_values(model_name)",
            "hide": 0,
            "includeAll": false,
            "label": "model_name",
            "multi": false,
            "name": "model_name",
            "options": [],
            "query": {
              "query": "label_values(model_name)",
              "refId": "StandardVariableQuery"
            },
            "refresh": 1,
            "regex": "",
            "skipUrlSync": false,
            "sort": 0,
            "type": "query"
          }
        ]
      },
      "time": {
        "from": "2025-01-10T04:00:36.511Z",
        "to": "2025-01-10T04:18:26.639Z"
      },
      "timepicker": {},
      "timezone": "",
      "title": "vLLM",
      "uid": "b281712d-8bff-41ef-9f3f-71ad43c05e9c",
      "version": 10,
      "weekStart": ""
    }

    大盤效果如下:

    image